/*
This documents the alignment of Zea mays Clusters and ESTs to the
rice genome.
Lenny Teytelman
Mon Mar 25 13:52:19 2002
*/

The BACs/PACs are from the GenBank Entrez Nucleotide query:

"Oryza [ORGN] AND (30000 [SLEN]:250000 [SLEN]) AND ((htg [KYWD] OR BAC
[ALL] OR chromosome [TITL] OR PAC [ALL]) NOT (marker [TITL] OR cDNA
[TITL] OR mRNA [TITL] OR RAPD [TITL] OR GSS [KYWD] OR telomere [TITL]
OR protein[TITL]))" for BACs, and

The Dupont Unigene set is from
http://www.agron.missouri.edu/files_dl/MMP/Cornsensus.fasta

The average length of the Clusters and ESTs is 1,000 bp.



1,847 BAC/PAC sequences were compared to 10,678 Clusters and ESTs using
BLAT with -minIdentity=50.  The 13,242 BLAT hits were filtered using
the pslReps utility with -minAli=0.85 -nearTop=0.01.  This resulted in
7,638 alignments.
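
A minimal sketch of the two commands described above, written as a
Python helper that only builds the argument lists and runs nothing.
The file names are hypothetical, and the choice of the rice clones as
the BLAT database and the maize set as the query is an assumption; the
notes do not say which way round the comparison was run.

```python
import shlex


def build_commands(database, query, raw_psl, filtered_psl, psr):
    """Return the BLAT and pslReps command lines as argument lists.

    Nothing is executed here; the lists can be passed to subprocess.run.
    """
    # blat <database> <query> [options] <output.psl>
    blat = ["blat", database, query, "-minIdentity=50", raw_psl]
    # pslReps [options] <in.psl> <out.psl> <out.psr>
    psl_reps = ["pslReps", "-minAli=0.85", "-nearTop=0.01",
                raw_psl, filtered_psl, psr]
    return blat, psl_reps


def as_shell(commands):
    """Render the argument lists as shell-quoted strings for logging."""
    return [shlex.join(cmd) for cmd in commands]
```

Passing hypothetical names such as "rice_clones.fa" and
"Cornsensus.fasta" to build_commands, then as_shell, gives the two
lines to paste into a shell.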

The lengths of the matches are distributed as follows:

Hit length (bp)	Count
---------------	-----
0-100		1106
100-150		784
150-200		749
200-250		703
250-300		641
300-350		566
350-400		516
400-450		386
450-500		312
500-550		268
550-600		203
600-650		200
650-700		176
700-750		142
750-800		119
>800		767



Removing matches shorter than 150 bp leaves 5,734 hits.
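
The length cutoff can be sketched as a filter over PSL lines.  This is
an illustration only: the notes do not define "match length", so the
code assumes it is the sum of the PSL matches, misMatches and
repMatches columns, and the function names are hypothetical.

```python
def match_length(psl_line):
    """Aligned bases in one PSL record (assumed definition of match length)."""
    fields = psl_line.rstrip("\n").split("\t")
    # PSL columns 1-3: matches, misMatches, repMatches
    matches, mismatches, rep_matches = (int(fields[i]) for i in (0, 1, 2))
    return matches + mismatches + rep_matches


def filter_by_length(psl_lines, min_length=150):
    """Keep records whose match length is at least min_length bp."""
    return [line for line in psl_lines if match_length(line) >= min_length]
```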

Many of the Clusters and ESTs hit more than once.  The distribution of
the hit frequencies is:

# Of
Hits per
Feature       Count
-------       -----
01		2704
02		1050
03		218
04		44
05		12
06		4
07		1
09		1
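As a cross-check on the table above, weighting each row by its hit
count reproduces the 5,734 hits, and keeping only features with at most
three hits gives the 5,458 hits and 3,972 Clusters and ESTs quoted
elsewhere in these notes.  The sketch below does that arithmetic and
shows one way such a per-feature cutoff could be applied; the function
names and the (name, record) input shape are hypothetical.

```python
from collections import Counter

# Rows of the table above: hits per feature -> number of features.
HITS_PER_FEATURE = {1: 2704, 2: 1050, 3: 218, 4: 44, 5: 12, 6: 4, 7: 1, 9: 1}


def total_hits(table, max_hits=None):
    """Sum of hits over features, optionally only those with <= max_hits."""
    return sum(n * count for n, count in table.items()
               if max_hits is None or n <= max_hits)


def total_features(table, max_hits=None):
    """Number of features, optionally only those with <= max_hits hits."""
    return sum(count for n, count in table.items()
               if max_hits is None or n <= max_hits)


def drop_frequent(hits, max_hits=3):
    """Drop every hit of a feature that has more than max_hits hits.

    hits is a list of (feature_name, record) pairs.
    """
    per_feature = Counter(name for name, _ in hits)
    return [(name, rec) for name, rec in hits if per_feature[name] <= max_hits]
```
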



Clusters and ESTs with more than three hits are removed, leaving 5,458
hits.  The percent identity of these hits is distributed as follows:

% Identity	 Count
----------	 ----------
82		57
83		225
84		382
85		504
86		607
87		704
88		703
89		714
90		605
91		407
92		263
93		143
94		71
95		41
96		18
97		7
98		6
99		1
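
One way a per-hit percent identity could be derived from a PSL record
is sketched below.  The formula is an assumption, not necessarily the
one used for the table above: identical bases (matches plus repMatches)
divided by all aligned bases.  Truncating to an integer for the
histogram is likewise an assumption.

```python
from collections import Counter


def percent_identity(psl_line):
    """Percent of aligned bases that are identical (assumed definition)."""
    fields = psl_line.rstrip("\n").split("\t")
    matches, mismatches, rep_matches = (int(fields[i]) for i in (0, 1, 2))
    aligned = matches + mismatches + rep_matches
    if aligned == 0:
        return 0.0
    return 100.0 * (matches + rep_matches) / aligned


def identity_histogram(psl_lines):
    """Count hits per integer percent identity (fraction truncated)."""
    return Counter(int(percent_identity(line)) for line in psl_lines)
```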


The distribution of gap lengths in the sequenced clones is:

BAC gap
length (bp)	Count
-----------	-----
01000		3403
02000		1153
03000		512
04000		179
05000		80
06000		25
07000		35
08000		10
09000		9
10000		6
20000		25
30000		6
40000		5
50000		4
60000		2
70000		1
80000		1
>90000		2


The hits represent 3,972 unique Clusters and ESTs and 1,488 sequenced
clones.  Those having at least one gap of 50 bp or longer are
considered multi-exon hits.  3,599 are multi-exon and 471 are
single-exon hits.
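
The multi-exon rule can be sketched from the PSL block columns.  This
assumes nucleotide (untranslated) PSL records, where blockSizes is
column 19 and tStarts is column 21, and that a "gap" is the distance on
the sequenced clone between consecutive aligned blocks.  The function
names are hypothetical.

```python
def target_gaps(psl_line):
    """Gap lengths on the target between consecutive alignment blocks."""
    fields = psl_line.rstrip("\n").split("\t")
    # Both columns are comma-separated lists with a trailing comma.
    sizes = [int(x) for x in fields[18].split(",") if x]
    t_starts = [int(x) for x in fields[20].split(",") if x]
    return [t_starts[i + 1] - (t_starts[i] + sizes[i])
            for i in range(len(sizes) - 1)]


def is_multi_exon(psl_line, min_gap=50):
    """True if at least one target gap is min_gap bp or longer."""
    return any(gap >= min_gap for gap in target_gaps(psl_line))
```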

