/*
This documents the alignment of Zea mays ESTs to the rice genome.
Lenny Teytelman
Mon Mar 25 11:48:03 2002
*/

The BACs/PACs are from the GenBank Entrez Nucleotide query:

"Oryza [ORGN] AND (30000 [SLEN]:250000 [SLEN]) AND ((htg [KYWD] OR BAC
[ALL] OR chromosome [TITL] OR PAC [ALL]) NOT (marker [TITL] OR cDNA
[TITL] OR mRNA [TITL] OR RAPD [TITL] OR GSS [KYWD] OR telomere [TITL]
OR protein[TITL]))"

The ESTs are from the GenBank nucleotide query "txid4577[orgn] AND
gbdiv_est[PROP]".  The average EST length is 462 bp.



The 1,847 BAC/PAC sequences were compared to 147,657 ESTs using BLAT
with -minIdentity=50.  The 194,774 BLAT hits were filtered using the
pslReps utility with -minAli=0.85 -nearTop=0.01.  This resulted in
113,774 alignments.
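
The two steps above can be sketched as command lines assembled in
Python.  This is only an illustration: the file names are hypothetical
(the notes do not record them), and treating the clones as the BLAT
database and the ESTs as the query is an assumption.  Only the option
values (-minIdentity=50, -minAli=0.85, -nearTop=0.01) come from the
text.

```python
import shlex

# Hypothetical file names; the original notes do not record them.
CLONES_FA = "rice_clones.fa"
ESTS_FA = "maize_ests.fa"
RAW_PSL = "raw.psl"
FILTERED_PSL = "filtered.psl"
FILTERED_PSR = "filtered.psr"


def blat_command(database, query, out_psl, min_identity=50):
    """Build the argument list for: blat database query [options] out.psl"""
    return ["blat", database, query, f"-minIdentity={min_identity}", out_psl]


def psl_reps_command(in_psl, out_psl, out_psr, min_ali=0.85, near_top=0.01):
    """Build the argument list for: pslReps [options] in.psl out.psl out.psr"""
    return [
        "pslReps",
        f"-minAli={min_ali}",
        f"-nearTop={near_top}",
        in_psl,
        out_psl,
        out_psr,
    ]


if __name__ == "__main__":
    # Assumption: clones are the database (target), ESTs are the query.
    commands = [
        blat_command(CLONES_FA, ESTS_FA, RAW_PSL),
        psl_reps_command(RAW_PSL, FILTERED_PSL, FILTERED_PSR),
    ]
    for cmd in commands:
        # Print rather than execute; pass cmd to
        # subprocess.run(cmd, check=True) to run it for real.
        print(shlex.join(cmd))
```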

The lengths of the matches are distributed as follows:

Length of
hits	     Count
--------     ------
0-100		22936
100-150		16744
150-200		15256
200-250		13538
250-300		14785
300-350		11878
350-400		9652
400-450		4258
450-500		2373
500-550		1501
550-600		579
600-650		182
650-700		56
700-750		11
750-800		12
>800		13



Removing matches shorter than 150 bp leaves 73,771 hits.
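
A length filter of this kind might look like the sketch below, which
reads PSL rows and keeps those at or above the 150 bp threshold.  The
notes do not say how match length was measured; taking it as the
aligned query span (qEnd - qStart) is an assumption, and the function
names are illustrative only.

```python
from typing import Iterable, Iterator, List

MIN_MATCH_LENGTH = 150  # threshold stated in the notes


def match_length(fields: List[str]) -> int:
    """Assumed definition: aligned query span, qEnd - qStart.

    In a PSL row (0-based columns) qStart is column 11 and qEnd is
    column 12.
    """
    return int(fields[12]) - int(fields[11])


def filter_by_length(lines: Iterable[str],
                     min_len: int = MIN_MATCH_LENGTH) -> Iterator[str]:
    """Yield PSL rows whose match length is at least min_len."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip PSL header lines and malformed rows: data rows have 21
        # columns and start with the integer 'matches' field.
        if len(fields) < 21 or not fields[0].isdigit():
            continue
        if match_length(fields) >= min_len:
            yield line
```

Usage would be along the lines of writing filter_by_length(open(path))
to a new file, where path is whatever PSL file pslReps produced.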

Many of the ESTs hit more than once.  The distribution of the hit
frequencies is:

# Of
Hits per
Feature       Count
-------       -----
01		23982
02		11165
03		3571
04		1135
05		459
06		418
07		274
08		139
09		90
10		58
20		204
40		15



ESTs that hit more than three times are removed, leaving 57,025 hits.
The percent identity of these hits is distributed as follows:

% Identity	 Count
----------	 ----------
82		445
83		1602
84		2512
85		3606
86		4654
87		5710
88		6861
89		7459
90		7263
91		5769
92		4022
93		2744
94		1975
95		1193
96		508
97		245
98		345
99		100
100		12
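
The 57,025 figure can be cross-checked against the two tables above.
The sketch below assumes that rows 01, 02 and 03 of the hits-per-EST
table are exact hit counts per EST; all numbers are copied from the
tables in this document.

```python
# Hits-per-EST table, rows 01-03: number of hits -> number of ESTs.
ests_by_hit_count = {1: 23982, 2: 11165, 3: 3571}

# ESTs kept after dropping those with more than three hits.
kept_ests = sum(ests_by_hit_count.values())

# Each kept EST contributes as many hits as its row label.
kept_hits = sum(hits * ests for hits, ests in ests_by_hit_count.items())

# Percent-identity table: percent identity -> number of hits.
identity_counts = {
    82: 445, 83: 1602, 84: 2512, 85: 3606, 86: 4654, 87: 5710,
    88: 6861, 89: 7459, 90: 7263, 91: 5769, 92: 4022, 93: 2744,
    94: 1975, 95: 1193, 96: 508, 97: 245, 98: 345, 99: 100, 100: 12,
}
identity_total = sum(identity_counts.values())

print(kept_ests)       # 38718
print(kept_hits)       # 57025
print(identity_total)  # 57025
```

Both totals agree with the 57,025 hits quoted in the text, and the EST
count agrees with the 38,718 unique ESTs reported in these notes.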


The gap lengths in the sequenced clones are distributed as follows:

BAC gap
Length	      Count
------	      ----------
01000		48983
02000		5755
03000		1064
04000		250
05000		168
06000		108
07000		45
08000		76
09000		43
10000		43
20000		201
30000		111
40000		116
50000		20
60000		24
70000		15
>90000		3


The hits represent 38,718 unique ESTs and 1,625 sequenced clones.
Hits with at least one gap of 50 bp or longer are considered multi-exon
hits.  30,186 are multi-exon and 10,181 are single-exon hits.
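
The multi-exon rule can be sketched on PSL rows as follows.  This
assumes the clone is the PSL target and that a gap is the unaligned
target distance between consecutive alignment blocks; neither detail is
stated in the notes, and the function names are illustrative only.

```python
from typing import List

MIN_GAP = 50  # gap threshold stated in the notes


def parse_int_list(field: str) -> List[int]:
    """Parse a PSL comma-terminated list such as '100,80,' into ints."""
    return [int(x) for x in field.split(",") if x]


def target_gaps(block_sizes: List[int], t_starts: List[int]) -> List[int]:
    """Gap on the target between each block and the next one."""
    return [
        t_starts[i + 1] - (t_starts[i] + block_sizes[i])
        for i in range(len(block_sizes) - 1)
    ]


def is_multi_exon(psl_line: str, min_gap: int = MIN_GAP) -> bool:
    """True if any target gap between blocks is at least min_gap."""
    fields = psl_line.rstrip("\n").split("\t")
    sizes = parse_int_list(fields[18])   # blockSizes (0-based column 18)
    starts = parse_int_list(fields[20])  # tStarts    (0-based column 20)
    return any(gap >= min_gap for gap in target_gaps(sizes, starts))
```

A hit with a single block has no gaps, so it is classified as
single-exon under this sketch.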

