/*
This documents the alignment of Triticum aestivum (wheat) ESTs to the
rice genome.
Lenny Teytelman
Mon Mar 25 10:54:39 2002
*/

The BACs/PACs are from the GenBank Entrez Nucleotide query:

"Oryza [ORGN] AND (30000 [SLEN]:250000 [SLEN]) AND ((htg [KYWD] OR BAC
[ALL] OR chromosome [TITL] OR PAC [ALL]) NOT (marker [TITL] OR cDNA
[TITL] OR mRNA [TITL] OR RAPD [TITL] OR GSS [KYWD] OR telomere [TITL]
OR protein[TITL]))"

The ESTs are from the GenBank Nucleotide query "txid4565[orgn] AND
gbdiv_est[PROP]".  The average EST length is 475 bp.
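
An average length like the one quoted above can be recomputed from a FASTA
download of the ESTs.  The following is only a minimal sketch: the function
names and the tiny example records are made up for illustration and are not
part of the original processing.

```python
from typing import Dict, Iterable


def fasta_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Map each FASTA record name to the length of its sequence."""
    lengths: Dict[str, int] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record name is taken as the first word after '>'.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def mean_length(lengths: Dict[str, int]) -> float:
    """Average sequence length, or 0.0 when there are no records."""
    return sum(lengths.values()) / len(lengths) if lengths else 0.0


# Hypothetical two-record example (not real EST data).
example = [">est1 demo", "ACGTACGT", "ACGT", ">est2", "ACGTAC"]
lengths = fasta_lengths(example)
print(len(lengths), "records, mean length", mean_length(lengths))
```

For a real file, pass an open file handle instead of the example list.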



The 1,847 BAC/PAC sequences were compared to 73,395 ESTs using BLAT with
-minIdentity=50.  The 129,990 BLAT hits were filtered using the pslReps
utility with -minAli=0.85 -nearTop=0.01, which resulted in 72,132
alignments.
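
The two steps above could be driven from a small script along these lines.
This is a sketch under assumptions: the file names are hypothetical, and the
choice of the rice clones as the BLAT database and the ESTs as the query is
a guess, since the original command lines are not recorded here.  Only the
option values (-minIdentity, -minAli, -nearTop) come from the text.

```python
import subprocess
from typing import List


def blat_command(database: str, query: str, out_psl: str,
                 min_identity: int = 50) -> List[str]:
    """Build a BLAT command line: blat database query [options] output.psl."""
    return ["blat", database, query, f"-minIdentity={min_identity}", out_psl]


def psl_reps_command(in_psl: str, out_psl: str, out_psr: str,
                     min_ali: float = 0.85,
                     near_top: float = 0.01) -> List[str]:
    """Build a pslReps command line: pslReps [options] in.psl out.psl out.psr."""
    return ["pslReps", f"-minAli={min_ali}", f"-nearTop={near_top}",
            in_psl, out_psl, out_psr]


def run_pipeline(database: str, query: str) -> None:
    """Run BLAT, then pslReps.  Requires both tools on the PATH."""
    # Intermediate file names are placeholders chosen for this sketch.
    for cmd in (blat_command(database, query, "raw.psl"),
                psl_reps_command("raw.psl", "best.psl", "best.psr")):
        subprocess.run(cmd, check=True)


# Show the commands without running them (file names are hypothetical).
print(" ".join(blat_command("rice_clones.fa", "wheat_ests.fa", "raw.psl")))
print(" ".join(psl_reps_command("raw.psl", "best.psl", "best.psr")))
```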

The lengths of the matches are distributed as follows:

Hit length (bp)	Count
---------------	-----
0-100		15737
100-150		9448
150-200		9050
200-250		7614
250-300		8630
300-350		7024
350-400		6043
400-450		3999
450-500		2314
500-550		1162
550-600		567
600-650		214
650-700		165
700-750		122
750-800		14
>800		29



Removing matches shorter than 150 bp leaves 46,799 hits.
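
A length filter of this kind can be sketched over PSL rows as follows.  The
definition of match length used here (matches + misMatches + repMatches,
i.e. aligned bases excluding gaps) is an assumption; the text does not say
which definition was used.  The example rows are invented.

```python
from typing import Iterable, Iterator, List


def psl_rows(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield the columns of PSL alignment rows, skipping header lines."""
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        # Alignment rows have 21 columns and begin with an integer count.
        if len(cols) >= 21 and cols[0].isdigit():
            yield cols


def aligned_bases(cols: List[str]) -> int:
    """Assumed match length: matches + misMatches + repMatches."""
    return int(cols[0]) + int(cols[1]) + int(cols[2])


def filter_by_length(lines: Iterable[str],
                     min_length: int = 150) -> List[List[str]]:
    """Keep rows whose assumed match length is at least min_length."""
    return [c for c in psl_rows(lines) if aligned_bases(c) >= min_length]


# Two invented PSL rows: 160 aligned bases and 100 aligned bases.
row_long = "\t".join(["140", "20", "0", "0", "0", "0", "1", "300", "+",
                      "est1", "475", "0", "160", "bac1", "100000",
                      "1000", "1460", "2", "80,80,", "0,80,", "1000,1380,"])
row_short = "\t".join(["90", "10", "0", "0", "0", "0", "0", "0", "+",
                       "est2", "400", "5", "105", "bac1", "100000",
                       "5000", "5100", "1", "100,", "5,", "5000,"])
kept = filter_by_length(["psLayout version 3", row_long, row_short])
print([cols[9] for cols in kept])
```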

Many of the ESTs hit more than once.  The distribution of the hit
frequencies is:

Hits per EST	Count
------------	-----
01		15224
02		6984
03		2393
04		672
05		173
06		190
07		198
08		85
09		54
10		56
20		110
30		31
40		2
50		1
60		7
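
A tally like the one above can be produced by counting how often each EST
name occurs among the hits (the qName column of a PSL file) and then
counting the counts.  The sketch below uses invented names; the same tally
can also be used to keep only ESTs at or below a chosen hit threshold.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def hits_per_est(est_names: Iterable[str]) -> Counter:
    """Number of hits for each EST, given one EST name per hit."""
    return Counter(est_names)


def hit_frequency_table(per_est: Dict[str, int]) -> Counter:
    """Number of ESTs having each hit count."""
    return Counter(per_est.values())


def ests_within_limit(per_est: Dict[str, int], max_hits: int) -> Set[str]:
    """ESTs with no more than max_hits hits."""
    return {name for name, n in per_est.items() if n <= max_hits}


# Invented example: est_a has 2 hits, est_b has 1, est_c has 4.
names = ["est_a", "est_a", "est_b", "est_c", "est_c", "est_c", "est_c"]
per_est = hits_per_est(names)
table = hit_frequency_table(per_est)
for n_hits in sorted(table):
    print(f"{n_hits:02d}\t\t{table[n_hits]}")
print(sorted(ests_within_limit(per_est, max_hits=3)))
```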



ESTs that hit more than three times are removed, leaving 36,371 hits
(15,224 + 2 x 6,984 + 3 x 2,393 = 36,371).  These hits have the
following distribution of percent identity:

% Identity	 Count
----------	 ----------
82		294
83		1004
84		1597
85		2208
86		2859
87		3344
88		3895
89		4341
90		4544
91		3907
92		3308
93		2175
94		1520
95		745
96		331
97		111
98		64
99		77
100		47
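
One way to bin hits by percent identity is sketched below.  The formula
(matching bases over matching plus mismatching bases, ignoring gaps) and the
rounding down to a whole percent are assumptions; the text does not state
how identity was computed for this table.

```python
from collections import Counter
from typing import Iterable, Tuple


def percent_identity(matches: int, mismatches: int,
                     rep_matches: int = 0) -> float:
    """Assumed definition: identical aligned bases / all aligned bases."""
    aligned = matches + rep_matches + mismatches
    if aligned == 0:
        return 0.0
    return 100.0 * (matches + rep_matches) / aligned


def identity_histogram(hits: Iterable[Tuple[int, int, int]]) -> Counter:
    """Count hits per whole-percent bin (rounded down).

    Each hit is (matches, misMatches, repMatches), i.e. the first three
    PSL columns.
    """
    return Counter(int(percent_identity(m, mm, rep)) for m, mm, rep in hits)


# Invented hits: 90.0%, 87.5% and 90.5% identity.
example_hits = [(90, 10, 0), (140, 20, 0), (181, 19, 0)]
histogram = identity_histogram(example_hits)
for pct in sorted(histogram):
    print(f"{pct}\t\t{histogram[pct]}")
```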


The distribution of alignment gap lengths on the sequenced clone
(BAC/PAC) side is:

BAC gap length	Count
--------------	-----
01000		29972
02000		4602
03000		1010
04000		266
05000		109
06000		81
07000		47
08000		22
09000		11
10000		6
20000		101
30000		50
40000		31
50000		30
60000		10
70000		15
80000		1
90000		2
>90000		5
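
Gap lengths on the clone side can be derived from the blockSizes and tStarts
columns of a PSL row, as sketched below.  The counts in the table above sum
to 36,371, the number of hits, so one value per hit appears to have been
tabulated; taking the largest gap of each hit, as done here, is an
assumption.

```python
from typing import List


def parse_int_list(field: str) -> List[int]:
    """Parse a PSL comma-separated list such as '80,80,'."""
    return [int(x) for x in field.split(",") if x]


def target_gaps(block_sizes: List[int], t_starts: List[int]) -> List[int]:
    """Unaligned target bases between each pair of consecutive blocks."""
    gaps = []
    for i in range(len(block_sizes) - 1):
        end_of_block = t_starts[i] + block_sizes[i]
        gaps.append(t_starts[i + 1] - end_of_block)
    return gaps


def largest_gap(block_sizes: List[int], t_starts: List[int]) -> int:
    """Largest target gap of a hit, or 0 for a single-block hit."""
    return max(target_gaps(block_sizes, t_starts), default=0)


# Invented hit with three blocks on the clone.
sizes = parse_int_list("80,60,100,")
starts = parse_int_list("1000,1380,1450,")
print(target_gaps(sizes, starts), largest_gap(sizes, starts))
```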


The hits represent 24,601 unique ESTs and 1,635 sequenced clones.
Hits having at least one gap of 50 bp or longer are considered
multi-exon hits.  19,406 are multi-exon and 6,163 are single-exon hits.
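
The multi-exon rule above can be written as a short predicate.  This is a
minimal sketch: the hit identifiers and gap lists are invented, and only
the 50-base threshold comes from the text.

```python
from typing import Dict, List


def is_multi_exon(gaps: List[int], min_gap: int = 50) -> bool:
    """True if at least one gap is min_gap bases or longer."""
    return any(gap >= min_gap for gap in gaps)


def classify_hits(gaps_by_hit: Dict[str, List[int]],
                  min_gap: int = 50) -> Dict[str, int]:
    """Count multi-exon and single-exon hits."""
    counts = {"multi-exon": 0, "single-exon": 0}
    for gaps in gaps_by_hit.values():
        key = "multi-exon" if is_multi_exon(gaps, min_gap) else "single-exon"
        counts[key] += 1
    return counts


# Invented gap lists, one per hit.
example = {"hit1": [300], "hit2": [], "hit3": [12, 49], "hit4": [50]}
counts = classify_hits(example)
print(counts)
```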

