/*
This documents the alignment of Oryza sativa TCs to the rice genome.
Lenny Teytelman
Fri Apr 26 02:48:25 2002
*/
The BACs/PACs are from the GenBank Entrez Nucleotide query:

"Oryza [ORGN] AND (30000 [SLEN]:250000 [SLEN]) AND ((htg [KYWD] OR BAC
[ALL] OR chromosome [TITL] OR PAC [ALL]) NOT (marker [TITL] OR cDNA
[TITL] OR mRNA [TITL] OR RAPD [TITL] OR GSS [KYWD] OR telomere [TITL]
OR protein[TITL]))" for BACs, and

The TIGR Maize Gene Index (ZmGI) is from http://www.tigr.org/tdb/ogi/
and has an average TC length of 850 bp.


The 1,858 BAC/PAC sequences were compared to the 12,354 TCs using BLAT
with minScore=120.  The 56,146 BLAT hits were filtered using the
pslReps utility, which left 12,971 alignments.
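
The two steps above could be scripted roughly as follows. This is only
a sketch: the file names are hypothetical, the BACs/PACs are assumed to
be the BLAT database and the TCs the query, and pslReps is run with its
default settings because the options actually used are not recorded
here.

```python
import subprocess


def build_commands(bacs_fa, tcs_fa, prefix):
    """Return the BLAT and pslReps command lines as argument lists.

    Assumptions: BACs/PACs are the database, TCs are the query, and all
    file names are hypothetical placeholders.
    """
    raw_psl = f"{prefix}.raw.psl"
    # blat <database> <query> [options] <output.psl>
    blat_cmd = ["blat", bacs_fa, tcs_fa, "-minScore=120", raw_psl]
    # pslReps <in.psl> <out.psl> <out.psr>; it expects input sorted by
    # query, so a pslSort step may be needed in between.
    reps_cmd = ["pslReps", raw_psl, f"{prefix}.filtered.psl", f"{prefix}.psr"]
    return blat_cmd, reps_cmd


def run_pipeline(bacs_fa, tcs_fa, prefix):
    """Run both commands in order, stopping at the first failure."""
    for cmd in build_commands(bacs_fa, tcs_fa, prefix):
        subprocess.run(cmd, check=True)
```

Building the argument lists separately from running them makes it easy
to inspect or log the exact commands before anything is executed.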



A true hit should extend over the whole TC, unless the hit is at the
very beginning or end of a BAC/PAC. The following is the distribution
of the percentage of each TC covered by its hit:

% of TC
matched         Count
--------        ------
0-9             25
10-19           101
20-29           106
30-39           130
40-49           117
50-59           176
60-69           257
70-79           331
80-84           217
85-89           408
90              130
91              140
92              163
93              250
94              332
95              470
96              870
97              1077
98              1752
99              3664
=100            2255
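
One way the coverage percentages above might be computed and binned is
sketched below. It assumes the TC is the query in the PSL output, that
coverage is the aligned span of the TC (qEnd - qStart) divided by qSize,
and that percentages are truncated rather than rounded; none of these
choices is recorded in these notes.

```python
def tc_coverage_percent(q_size, q_start, q_end):
    """Percent of the TC (query) spanned by the alignment.

    Assumes PSL coordinates: q_start is 0-based, q_end is exclusive.
    """
    return 100.0 * (q_end - q_start) / q_size


def coverage_bin(pct):
    """Map a coverage percentage to a row label of the table above."""
    p = int(pct)  # truncation is an assumption
    if p >= 100:
        return "=100"
    if p >= 90:
        return str(p)  # single-percent rows 90..99
    if p >= 85:
        return "85-89"
    if p >= 80:
        return "80-84"
    low = (p // 10) * 10  # ten-point rows below 80
    return f"{low}-{low + 9}"
```

For example, a 1,000 bp TC aligned over positions 0 to 965 has 96.5
percent coverage and lands in the "96" row.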



Discarding the hits that cover less than 96 percent of the TC and that
start/stop more than 20 bp away from the BAC/PAC edge leaves 10,194
hits.  These have the following BAC gaps:

BAC Gap
Length          Count
------          -----
01000           6769
02000           1701
03000           922
04000           358
05000           182
06000           86
07000           55
08000           34
09000           24
10000           12
20000           44
30000           3
40000           1
50000           2
>90000          1
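
The filter and the gap measurement could look roughly like this. The
sketch assumes that a hit counts as "at the edge" when either end lies
within 20 bp of a BAC/PAC end, and that the BAC gap of a hit is the
largest gap between consecutive alignment blocks on the BAC (PSL
blockSizes and tStarts). Both readings are assumptions, not something
these notes state.

```python
def near_bac_edge(t_start, t_end, t_size, edge=20):
    """True if the hit starts or stops within `edge` bp of a BAC/PAC end.

    Assumes PSL coordinates: t_start is 0-based, t_end is exclusive.
    """
    return t_start <= edge or (t_size - t_end) <= edge


def keep_hit(coverage_pct, t_start, t_end, t_size, min_cov=96.0, edge=20):
    """Keep well-covered hits, plus partial hits that run off a BAC end."""
    return coverage_pct >= min_cov or near_bac_edge(t_start, t_end, t_size, edge)


def largest_target_gap(block_sizes, t_starts):
    """Largest gap between consecutive blocks on the BAC; 0 if one block."""
    gaps = [
        t_starts[i + 1] - (t_starts[i] + block_sizes[i])
        for i in range(len(block_sizes) - 1)
    ]
    return max(gaps, default=0)
```

With this reading, a hit covering 60 percent of its TC is kept only if
it touches a BAC/PAC end, which matches the reasoning that partial hits
are expected only where the clone sequence runs out.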



For the sake of a clean display, hits with gaps greater than 3,000 bp
are removed, leaving 9,392 hits.  The percent identity for the
remaining matches is:

Percent
Identity        Count
---------- ----------
93              1
94              1
95              9
96              47
97              283
98              691
99              4285
100             4075
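
The formula used for percent identity is not recorded here. A common
choice, shown below purely as an assumption, is matching bases over
aligned bases, using the PSL matches, repMatches and misMatches columns
and ignoring gaps.

```python
def percent_identity(matches, mismatches, rep_matches=0):
    """Percent of aligned (non-gap) bases that match.

    Assumption: identity = (matches + repMatches) /
    (matches + repMatches + misMatches); gaps are not penalized.
    """
    aligned = matches + rep_matches + mismatches
    if aligned == 0:
        raise ValueError("alignment has no aligned bases")
    return 100.0 * (matches + rep_matches) / aligned
```

Under this definition a hit with 990 matching and 10 mismatching bases
has 99 percent identity.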


Some TCs hit more than once.  The distribution of hits per TC is:

# of
hits per
TC              # of TCs
---             -----
01              4221
02              1778
03              204
04              28
05              16
06              4
07              10
08              5
09              2
10              3
20              6
30              5
40              1
50              2
60              3
70              1
90              1
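
A distribution like the one above can be tallied with two counting
passes, sketched here. The input is assumed to be the list of TC names
(PSL qName), one entry per retained hit; the example names are made up.

```python
from collections import Counter


def hits_per_tc_distribution(tc_names):
    """Return {number of hits: number of TCs with that many hits}."""
    hits_per_tc = Counter(tc_names)       # TC name -> number of hits
    return Counter(hits_per_tc.values())  # number of hits -> number of TCs
```

The values of the result sum to the number of unique TCs, which is a
quick consistency check against the totals reported in these notes.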



The hits represent 6,290 unique TCs and 1,730 sequenced clones.  Hits
with at least one gap of length 50 or above are considered multi-exon
TC hits.  4,195 are multi-exon and 2,101 are single-exon hits.
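
The multi-exon rule can be expressed as a small classifier. This sketch
takes the list of gap lengths for one hit as input; how the gaps were
measured (for instance, between consecutive blocks on the BAC) is an
assumption and not stated in these notes.

```python
def classify_hit(gap_lengths, min_gap=50):
    """Label a hit 'multi-exon' if any gap is at least `min_gap` long."""
    if any(gap >= min_gap for gap in gap_lengths):
        return "multi-exon"
    return "single-exon"
```

A hit with no gaps, or with only short gaps such as small indels, is
labeled single-exon under this rule.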

