/*
This documents the alignment and filtering of Zea mays TCs and ESTs against the rice genome.
Lenny Teytelman
Mon Mar 25 14:23:35 2002
*/

The BACs/PACs are from the GenBank Entrez Nucleotide query:

"Oryza [ORGN] AND (30000 [SLEN]:250000 [SLEN]) AND ((htg [KYWD] OR BAC
[ALL] OR chromosome [TITL] OR PAC [ALL]) NOT (marker [TITL] OR cDNA
[TITL] OR mRNA [TITL] OR RAPD [TITL] OR GSS [KYWD] OR telomere [TITL]
OR protein[TITL]))" for BACs, and

The TIGR Maize Gene Index (ZmGI) is from http://www.tigr.org/tdb/zmgi/.
The average length of the TCs and ESTs is 613 bp.



The 1,847 BAC/PAC sequences were compared to the 27,642 TCs and ESTs
using BLAT with -minIdentity=50.  The 30,064 BLAT hits were filtered
using the pslReps utility with -minAli=0.85 -nearTop=0.01, which
resulted in 16,202 alignments.
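
The two steps above could be scripted roughly as follows.  This is only
a sketch: the file names are hypothetical (the notes do not record them),
and it assumes the rice clones served as the BLAT database and the maize
TCs/ESTs as the query.  The functions only build the command lines; they
do not run anything.

```python
def blat_command(database, query, out_psl, min_identity=50):
    """Build a BLAT argument list: blat database query [options] output.psl."""
    return ["blat", database, query, f"-minIdentity={min_identity}", out_psl]


def psl_reps_command(in_psl, out_psl, out_psr, min_ali=0.85, near_top=0.01):
    """Build a pslReps argument list: pslReps [options] in.psl out.psl out.psr."""
    return ["pslReps", f"-minAli={min_ali}", f"-nearTop={near_top}",
            in_psl, out_psl, out_psr]


if __name__ == "__main__":
    # Hypothetical file names, chosen for illustration only.
    commands = (
        blat_command("rice_clones.fa", "zmgi.fa", "raw.psl"),
        psl_reps_command("raw.psl", "filtered.psl", "filtered.psr"),
    )
    for cmd in commands:
        print(" ".join(cmd))
```

Each list can be passed to subprocess.run(cmd, check=True) once the
paths point at real files.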

The lengths of the matches are distributed as follows:

Hit length (bp)	Count
---------------	-----
0-100		3274
100-150		2137
150-200		1968
200-250		1639
250-300		1538
300-350		1266
350-400		1164
400-450		752
450-500		536
500-550		371
550-600		249
600-650		203
650-700		175
700-750		129
750-800		96
>800		705



Removing matches shorter than 150 bp leaves 10,756 hits.
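
A minimal sketch of how match lengths could be binned as in the table
above, and how a 150 bp cutoff could be applied.  The notes do not say
whether the bin boundaries are inclusive, so half-open bins are assumed
here (for example, 100-150 means 100 <= length < 150); the function names
are illustrative only.

```python
def length_bin(length):
    """Return the table label for a match length (half-open bins assumed)."""
    if length < 100:
        return "0-100"
    if length >= 800:
        # Assumption: a length of exactly 800 is placed in the last bin.
        return ">800"
    low = (length // 50) * 50
    return f"{low}-{low + 50}"


def histogram(lengths):
    """Count match lengths per bin label."""
    counts = {}
    for n in lengths:
        label = length_bin(n)
        counts[label] = counts.get(label, 0) + 1
    return counts


def drop_short(lengths, min_length=150):
    """Keep only matches of at least min_length bases."""
    return [n for n in lengths if n >= min_length]


# Counts copied from the table above; they add up to the 16,202 alignments.
TABLE_COUNTS = [3274, 2137, 1968, 1639, 1538, 1266, 1164, 752,
                536, 371, 249, 203, 175, 129, 96, 705]
assert sum(TABLE_COUNTS) == 16202
```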

Many of the TCs and ESTs have more than one hit.  The distribution of
the number of hits per TC or EST (feature) is:

Hits/feature	Count
------------	-----
01		4729
02		1778
03		434
04		106
05		43
06		22
07		19
08		7
09		5
10		2
20		7
30		1
40		1
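
A tally like the one above, and a cutoff on the number of hits per
feature, could be computed along these lines.  This is a sketch with
illustrative names; hits are represented as (query_name, payload) pairs,
which is an assumption about the data layout and not the format actually
used.  The arithmetic at the end uses only the first three rows of the
table and agrees with the 9,587 hits and 6,941 unique TCs and ESTs
reported later in these notes.

```python
from collections import Counter


def hits_per_feature(hits):
    """Count hits per query name; hits is a list of (query_name, payload)."""
    return Counter(name for name, _ in hits)


def keep_low_copy(hits, max_hits=3):
    """Keep hits whose query has at most max_hits hits in total."""
    counts = hits_per_feature(hits)
    return [hit for hit in hits if counts[hit[0]] <= max_hits]


# Rows 01-03 of the table above: number of hits -> number of features.
FEATURES_BY_HITS = {1: 4729, 2: 1778, 3: 434}

features_kept = sum(FEATURES_BY_HITS.values())
hits_kept = sum(n * count for n, count in FEATURES_BY_HITS.items())

assert features_kept == 6941
assert hits_kept == 9587
```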



TCs and ESTs with more than three hits were removed, leaving 9,587
hits.  The percent identity of these hits is distributed as follows:

% Identity	 Count
----------	 ----------
82		100
83		336
84		570
85		779
86		904
87		1118
88		1206
89		1220
90		1078
91		855
92		580
93		332
94		250
95		115
96		69
97		45
98		18
99		9
100		3


The distribution of gap lengths on the sequenced clone (BAC/PAC) side
of these alignments is:

BAC gap (bp)	Count
------------	-----
01000		7201
02000		1470
03000		513
04000		166
05000		55
06000		37
07000		27
08000		21
09000		7
10000		17
20000		35
30000		16
40000		10
50000		7
60000		3
70000		1
>90000		1


The hits represent 6,941 unique TCs and ESTs and 1,576 sequenced clones.
Those with at least one gap of 50 bp or longer are considered
multi-exon hits.  5,884 are multi-exon and 1,254 are single-exon hits.
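
The multi-exon rule can be sketched as follows, assuming the gaps are
measured on the sequenced clone (target) side of a PSL alignment using
its blockSizes and tStarts fields.  The function names are illustrative,
and whether the original processing measured gaps in exactly this way is
an assumption.

```python
def parse_psl_list(field):
    """Parse a PSL list field such as '120,80,' into [120, 80]."""
    return [int(x) for x in field.split(",") if x]


def target_gaps(block_sizes, t_starts):
    """Gap lengths between consecutive alignment blocks on the target."""
    return [t_starts[i + 1] - (t_starts[i] + block_sizes[i])
            for i in range(len(t_starts) - 1)]


def is_multi_exon(block_sizes, t_starts, min_gap=50):
    """True if any target gap is at least min_gap bases long."""
    return any(gap >= min_gap for gap in target_gaps(block_sizes, t_starts))


if __name__ == "__main__":
    # Made-up example: three blocks, with target gaps of 0 and 220 bases.
    sizes = parse_psl_list("100,80,60,")
    starts = parse_psl_list("1000,1100,1400,")
    print(target_gaps(sizes, starts), is_multi_exon(sizes, starts))
```

A hit with a single block has no gaps and is classed as single-exon
under this rule.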

