/*
This documents the alignment of Oryza sativa ESTs to the rice genome.
Lenny Teytelman
Fri Apr 19 13:26:09 2002
*/
The BACs/PACs are from the GenBank Entrez Nucleotide query:

"Oryza [ORGN] AND (30000 [SLEN]:250000 [SLEN]) AND ((htg [KYWD] OR BAC
[ALL] OR chromosome [TITL] OR PAC [ALL]) NOT (marker [TITL] OR cDNA
[TITL] OR mRNA [TITL] OR RAPD [TITL] OR GSS [KYWD] OR telomere [TITL]
OR protein[TITL]))"

The ESTs are from BGI, http://btn.genomics.org.cn/rice/.  For each EST,
15 bp have been removed from the front and 50 bp from the back.  The average
EST length is 417 bp.
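
The trimming step can be sketched in Python as follows.  This is only an
illustration: the function name and the choice to return an empty string
for reads too short to trim are assumptions, not a record of the script
that was actually used.

```python
def trim_est(seq, front=15, back=50):
    """Drop `front` bases from the start and `back` bases from the end of an EST."""
    if len(seq) <= front + back:
        # Assumption: reads too short to survive trimming become empty.
        return ""
    return seq[front:len(seq) - back]
```

For example, a 482-base read would be reduced to 417 bases.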


The 1,858 BAC/PAC sequences were compared to 86,623 ESTs using BLAT with
minScore=120.  The 120,872 BLAT hits were filtered using the pslReps
utility, which left 80,156 alignments.
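
BLAT and pslReps both write PSL, a tab-separated format with 21 columns
per alignment.  The statistics in the rest of this document can all be
derived from those columns.  Below is a minimal reader, assuming the
standard PSL column order; the PslHit type and the function names are
made up for this sketch.

```python
from typing import List, NamedTuple, Optional


class PslHit(NamedTuple):
    matches: int
    mismatches: int
    rep_matches: int
    strand: str
    q_name: str          # EST name
    q_size: int          # EST length
    q_start: int
    q_end: int
    t_name: str          # BAC/PAC name
    t_size: int          # BAC/PAC length
    t_start: int
    t_end: int
    block_sizes: List[int]
    q_starts: List[int]
    t_starts: List[int]


def _int_list(field: str) -> List[int]:
    # PSL list fields look like "150,250," (note the trailing comma).
    return [int(x) for x in field.split(",") if x.strip()]


def parse_psl_line(line: str) -> Optional[PslHit]:
    """Return a PslHit, or None for header lines and malformed rows."""
    f = line.rstrip("\r\n").split("\t")
    if len(f) < 21 or not f[0].isdigit():
        return None
    return PslHit(
        int(f[0]), int(f[1]), int(f[2]), f[8],
        f[9], int(f[10]), int(f[11]), int(f[12]),
        f[13], int(f[14]), int(f[15]), int(f[16]),
        _int_list(f[18]), _int_list(f[19]), _int_list(f[20]),
    )
```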



True hits should extend over the entire EST, unless the hit is at the
very beginning or end of a BAC/PAC.  The following is the distribution
of the percentage of each EST covered by its alignment:

% of EST
matched		Count
--------	------
10-19		1
20-29		406
30-39		1158
40-49		1258
50-59		1244
60-69		1347
70-79		2464
80-84		2491
85-89		4272
90		1325
91		1567
92		2086
93		2790
94		3330
95		4298
96		5143
97		6570
98		11383
99		18952
=100		8071
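
How the coverage percentage was computed is not recorded here.  One
plausible definition, assumed in the sketch below, is the span of the
alignment on the EST (PSL qEnd minus qStart) divided by the EST length
(qSize).  The bin labels mirror the table above; the function names are
illustrative only.

```python
def percent_coverage(q_size, q_start, q_end):
    """Percentage of the EST spanned by the alignment (assumed definition)."""
    return 100.0 * (q_end - q_start) / q_size


def coverage_bin(pct):
    """Map a coverage percentage to the bin labels used in the table."""
    if pct >= 100.0:
        return "=100"
    p = int(pct)                 # floor, since pct is non-negative
    if p >= 90:
        return str(p)            # one bin per percent from 90 to 99
    if p >= 85:
        return "85-89"
    if p >= 80:
        return "80-84"
    lo = (p // 10) * 10          # ten-percent bins below 80
    return "%d-%d" % (lo, lo + 9)
```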



Discarding the hits that both cover less than 96 percent of the EST and
start or stop more than 20 bp away from a BAC/PAC edge leaves 55,483 hits.
These have the following BAC gaps:

BAC Gap
Length		Count
------		------
01000		44913
02000		7839
03000		1731
04000		561
05000		160
06000		81
07000		66
08000		41
09000		29
10000		13
20000		43
30000		5
40000		1
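
The coverage and edge rule that produced these 55,483 hits can be written
as a single predicate.  The sketch below assumes that coverage is the
aligned span on the EST divided by the EST length, and that "edge" means
within 20 bp of either end of the BAC/PAC; the function and parameter
names are hypothetical.

```python
def keep_hit(q_size, q_start, q_end, t_size, t_start, t_end,
             min_cov=96.0, edge=20):
    """Keep a hit if it covers enough of the EST or touches a BAC/PAC edge."""
    coverage = 100.0 * (q_end - q_start) / q_size
    # A hit is "near an edge" if it starts or stops within `edge` bp
    # of either end of the BAC/PAC.
    near_edge = t_start <= edge or (t_size - t_end) <= edge
    # A hit is discarded only when it is both short and far from an edge.
    return coverage >= min_cov or near_edge
```

A partial hit is therefore kept only when the rest of the EST could
plausibly lie beyond the end of the clone.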



For the sake of clean displays, hits with gaps greater than 3,000 bp
are removed, leaving 54,483 entries.  The percent identity for the
remaining matches is:

Percent
Identity	Count
---------- ----------
93		4
94		18
95		130
96		363
97		1273
98		6139
99		29087
100		17469
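
The exact identity formula used for this table is not stated.  A common
choice, assumed in the sketch below, is matching bases divided by all
aligned bases, ignoring gaps.  The argument names follow the PSL columns
matches, misMatches and repMatches.

```python
def percent_identity(matches, mismatches, rep_matches=0):
    """Percent identity over aligned bases, gaps ignored (assumed definition)."""
    aligned = matches + mismatches + rep_matches
    if aligned == 0:
        raise ValueError("alignment has no aligned bases")
    return 100.0 * (matches + rep_matches) / aligned
```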


Some of the ESTs hit more than once.  The distribution of the number of
hits per EST is:

Hits per
EST		# of ESTs
--------	---------
01		29092
02		9410
03		1065
04		116
05		29
06		64
07		203
08		15
09		3
10		4
20		44
30		2
60		1
>90		1
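
A distribution like this one can be tallied in two counting passes: first
the number of hits per EST name, then the number of ESTs per hit count.
A minimal sketch, with an illustrative function name:

```python
from collections import Counter


def hits_per_est_distribution(est_names):
    """Map 'number of hits' to 'number of ESTs having that many hits'.

    `est_names` holds one entry per surviving hit (the PSL qName column).
    """
    hits_per_est = Counter(est_names)        # EST name -> number of hits
    return Counter(hits_per_est.values())    # number of hits -> number of ESTs
```

The sum of the resulting counts is the number of unique ESTs.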



The hits represent 40,049 unique ESTs and 1,810 sequenced clones.  Hits
having at least one gap of length 50 bp or more are considered multi-exon
EST hits.  23,692 are multi-exon and 16,383 are single-exon hits.
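
The gap-based rules above (removal above 3,000 bp, multi-exon at 50 bp or
more) can be sketched from the PSL blockSizes and tStarts columns.  The
sketch assumes that a "gap" is the distance on the BAC/PAC between two
consecutive aligned blocks and that the largest single gap is the one
tested against the 3,000 bp limit; neither detail is confirmed by these
notes, and the function names are hypothetical.

```python
def target_gaps(block_sizes, t_starts):
    """Gap lengths on the BAC/PAC between consecutive aligned blocks."""
    gaps = []
    for i in range(1, len(t_starts)):
        prev_end = t_starts[i - 1] + block_sizes[i - 1]
        gaps.append(t_starts[i] - prev_end)
    return gaps


def classify_hit(block_sizes, t_starts, min_intron=50, max_gap=3000):
    """Return 'removed', 'multi-exon' or 'single-exon' for one hit."""
    gaps = target_gaps(block_sizes, t_starts)
    if gaps and max(gaps) > max_gap:
        return "removed"          # dropped for the sake of clean displays
    if any(g >= min_intron for g in gaps):
        return "multi-exon"       # at least one intron-sized gap
    return "single-exon"
```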

