/*
This documents the alignment of Oryza sativa clusters to the rice genome.
Lenny Teytelman
Thu Apr 25 17:59:35 2002
*/
The BACs/PACs are from the GenBank Entrez Nucleotide query:

"Oryza [ORGN] AND (30000 [SLEN]:250000 [SLEN]) AND ((htg [KYWD] OR BAC
[ALL] OR chromosome [TITL] OR PAC [ALL]) NOT (marker [TITL] OR cDNA
[TITL] OR mRNA [TITL] OR RAPD [TITL] OR GSS [KYWD] OR telomere [TITL]
OR protein[TITL]))" for BACs, and

The EST clusters are from BGI, http://btn.genomics.org.cn/rice/. For each
cluster, 15 bp were removed from the front and 50 bp from the back.
The average cluster length is 573 bp.
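
The trimming step can be illustrated with a minimal sketch. The function
name and the handling of sequences too short to trim are assumptions made
for illustration; only the 15 bp and 50 bp values come from the text above.

```python
def trim_cluster(seq: str, front: int = 15, back: int = 50) -> str:
    """Drop `front` bases from the start and `back` bases from the end.

    Hypothetical helper: returning an empty string for sequences that are
    too short to trim is an assumption, not something the notes specify.
    """
    if len(seq) <= front + back:
        return ""
    return seq[front:len(seq) - back]


if __name__ == "__main__":
    example = "A" * 15 + "CGT" * 10 + "T" * 50
    print(trim_cluster(example))
```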


1,858 BAC/PAC sequences were compared to 24,179 clusters using BLAT with
minScore=120. The 40,954 BLAT hits were filtered using the pslReps utility.
This resulted in 20,531 alignments.
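
As a rough sketch, the two command lines for this step could be assembled
as below. The file names are hypothetical, and treating the BAC/PAC
sequences as the BLAT database and the clusters as the query is an
assumption. The notes do not say which pslReps options were used, so none
are passed here.

```python
def build_alignment_commands(bacs_fa, clusters_fa, raw_psl, best_psl, repeats_psr):
    """Return argv lists for the BLAT run and the pslReps filtering step."""
    # blat <database> <query> [options] <output.psl>; minScore=120 is from the notes.
    blat_cmd = ["blat", bacs_fa, clusters_fa, "-minScore=120", raw_psl]
    # pslReps <in.psl> <out.psl> <out.psr>; options used originally are unknown.
    reps_cmd = ["pslReps", raw_psl, best_psl, repeats_psr]
    return blat_cmd, reps_cmd


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    for cmd in build_alignment_commands(
        "bacs.fa", "clusters.fa", "raw.psl", "best.psl", "repeats.psr"
    ):
        print(" ".join(cmd))
```

Each list can be handed to subprocess.run(cmd, check=True) to execute the
step, provided both tools are installed.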



A true hit should extend over the entire cluster, unless the hit is at
the very beginning or end of a BAC/PAC. The following is the distribution
of the percent coverage of the clusters:

% of Clusters
matched		    Count
--------	    ------
0-9		8
10-19		125
20-29		186
30-39		367
40-49		304
50-59		361
60-69		332
70-79		478
80-84		290
85-89		436
90		147
91		159
92		204
93		262
94		442
95		540
96		872
97		1505
98		2901
99		7437
=100		3175
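
One way the coverage table above could be produced is sketched here. It
assumes coverage is the aligned query span divided by the cluster length
(the PSL fields qStart, qEnd and qSize) and that percentages are truncated
to integers. Neither detail is stated in these notes.

```python
def coverage_percent(q_start: int, q_end: int, q_size: int) -> int:
    """Integer percent of the cluster spanned by the hit (truncated)."""
    return 100 * (q_end - q_start) // q_size


def coverage_bin(pct: int) -> str:
    """Map a percentage to the row labels used in the table above."""
    if pct >= 100:
        return "=100"
    if pct >= 90:
        return str(pct)          # 90..99 are listed individually
    if pct >= 85:
        return "85-89"
    if pct >= 80:
        return "80-84"
    low = pct // 10 * 10         # 0-9, 10-19, ..., 70-79
    return f"{low}-{low + 9}"


if __name__ == "__main__":
    pct = coverage_percent(10, 560, 573)
    print(pct, coverage_bin(pct))
```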



Discarding the hits that cover less than 96 percent of the cluster and
start or stop more than 20 bp away from the BAC/PAC edge leaves 16,597 hits.
These have the following BAC gaps:

BAC Gap
Length	   Count
------ ----------
01000		12219
02000		2655
03000		1041
04000		356
05000		129
06000		76
07000		45
08000		18
09000		19
10000		12
20000		19
30000		4
40000		2
50000		1
>90000		1
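
The filter applied before this table, and the gap measurement itself, can
be sketched as follows. This assumes a hit is kept when it either covers at
least 96 percent of the cluster or lies within 20 bp of a clone edge, and
that the "BAC gap" is the largest gap between consecutive alignment blocks
on the BAC/PAC (PSL fields blockSizes and tStarts). Both readings are
interpretations of the text, and the function names are hypothetical.

```python
def keep_hit(cov_pct, t_start, t_end, t_size, min_cov=96, edge=20):
    """Keep well-covered hits, or partial hits that run off a clone edge."""
    near_edge = t_start <= edge or (t_size - t_end) <= edge
    return cov_pct >= min_cov or near_edge


def largest_target_gap(block_sizes, t_starts):
    """Largest gap on the BAC/PAC between consecutive blocks (0 if one block)."""
    gaps = [
        nxt - (start + size)
        for start, size, nxt in zip(t_starts, block_sizes, t_starts[1:])
    ]
    return max(gaps, default=0)


if __name__ == "__main__":
    print(keep_hit(60, 10, 400, 100000))                      # partial hit at clone start
    print(largest_target_gap([100, 200, 50], [1000, 1500, 4000]))
```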



For the sake of clean displays, the hits with gaps greater than 3,000 bp
are removed. This results in 15,915 entries. The percent identity for
the remaining matches is:

Percent
Identity	Count
---------- ----------
94		8
95		68
96		117
97		346
98		1548
99		8850
100		4978
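
The notes do not say how percent identity was computed. A minimal sketch,
assuming it is the share of matching bases among aligned bases (PSL fields
matches, repMatches and misMatches), truncated to an integer and ignoring
insertions:

```python
def percent_identity(matches: int, mismatches: int, rep_matches: int = 0) -> int:
    """Truncated percent identity over aligned bases (assumed definition)."""
    aligned = matches + rep_matches + mismatches
    if aligned == 0:
        return 0
    return 100 * (matches + rep_matches) // aligned


if __name__ == "__main__":
    print(percent_identity(567, 6))
```

With truncation, a hit lands in the 100 row only when it has no mismatches
at all.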


Some clusters hit more than once. The distribution of hits per cluster is:

# of
Hits per
Cluster	  # of Clusters
---	-----
01		9035
02		2747
03		291
04		40
05		8
06		4
07		1
08		1
20		3
30		3
60		1
>90		1
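
A tally like the one above can be built from the list of cluster names in
the filtered hits, one name per hit. This is a small sketch; the function
name and the example cluster IDs are hypothetical.

```python
from collections import Counter


def hits_per_cluster_distribution(cluster_ids):
    """Map 'number of hits' -> 'number of clusters with that many hits'."""
    per_cluster = Counter(cluster_ids)       # cluster name -> hit count
    return Counter(per_cluster.values())     # hit count -> number of clusters


if __name__ == "__main__":
    # Hypothetical cluster names, one entry per hit.
    hits = ["c1", "c2", "c2", "c3", "c3", "c3", "c4"]
    for n_hits, n_clusters in sorted(hits_per_cluster_distribution(hits).items()):
        print(n_hits, n_clusters)
```

The counts in such a distribution sum to the number of unique clusters.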



The hits represent 12,135 unique clusters and 1,801 sequenced clones.
Hits having at least one gap of 50 bp or longer are considered
multi-exon cluster hits. 7,158 are multi-exon and 4,984 are single-exon
hits.
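
The multi-exon rule can be sketched as below, assuming the gaps in question
are gaps on the BAC/PAC between consecutive alignment blocks (PSL fields
blockSizes and tStarts). The function name is hypothetical.

```python
def is_multi_exon(block_sizes, t_starts, min_gap=50):
    """True if any gap between consecutive blocks is at least `min_gap` bp."""
    gaps = (
        nxt - (start + size)
        for start, size, nxt in zip(t_starts, block_sizes, t_starts[1:])
    )
    return any(gap >= min_gap for gap in gaps)


if __name__ == "__main__":
    print(is_multi_exon([100, 200], [1000, 1500]))   # 400 bp gap
    print(is_multi_exon([100, 200], [1000, 1110]))   # 10 bp gap
```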

