/*
This documents the alignment of Hordeum vulgare (barley) ESTs to the rice genome.
Lenny Teytelman
Mon Mar 25 10:15:45 2002
*/

The rice BAC/PAC sequences come from the GenBank Entrez Nucleotide query:

"Oryza [ORGN] AND (30000 [SLEN]:250000 [SLEN]) AND ((htg [KYWD] OR BAC
[ALL] OR chromosome [TITL] OR PAC [ALL]) NOT (marker [TITL] OR cDNA
[TITL] OR mRNA [TITL] OR RAPD [TITL] OR GSS [KYWD] OR telomere [TITL]
OR protein[TITL]))"

The ESTs come from the GenBank Nucleotide query "txid4558[orgn] AND
gbdiv_est[PROP]".  The average EST length is 586 bp.



The 1,847 BAC/PAC sequences were compared to the 148,651 ESTs using BLAT
with minIdentity=50.  The 302,684 BLAT hits were filtered using the
pslReps utility with -minAli=0.85 -nearTop=0.01.  This left 156,816
alignments.
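
The two steps above can be sketched as command lines assembled in
Python.  This is only an illustration: the file names are hypothetical,
and the choice of the clones as the BLAT database and the ESTs as the
query is an assumption, since the text does not say which was which.
The flags are the ones quoted above.

```python
def blat_command(database_fa, query_fa, out_psl, min_identity=50):
    """Build the argv list for a BLAT run: database, query, options, output."""
    return ["blat", database_fa, query_fa,
            f"-minIdentity={min_identity}", out_psl]


def psl_reps_command(in_psl, out_psl, out_psr, min_ali=0.85, near_top=0.01):
    """Build the argv list for pslReps: options, input PSL, output PSL, output PSR."""
    return ["pslReps", f"-minAli={min_ali}", f"-nearTop={near_top}",
            in_psl, out_psl, out_psr]


if __name__ == "__main__":
    # Hypothetical file names, chosen only for this example.
    print(" ".join(blat_command("rice_clones.fa", "barley_ests.fa", "raw.psl")))
    print(" ".join(psl_reps_command("raw.psl", "filtered.psl", "filtered.psr")))
```

Each list can be passed to subprocess.run(..., check=True) on a machine
where blat and pslReps are installed.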

The lengths of the matches are distributed as follows:

Length of
hits (bp)    Count
---------    ------
0-100		26422
100-150		14299
150-200		15200
200-250		14398
250-300		14936
300-350		14400
350-400		12691
400-450		10209
450-500		8313
500-550		7644
550-600		7357
600-650		4951
650-700		3022
700-750		1605
750-800		973
>800		396



Removing matches shorter than 150 bp leaves 115,830 hits.
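
A minimal sketch of this length filter, assuming the hits are
header-free PSL lines and that "match length" means the aligned span on
the EST (qEnd - qStart).  The notes do not say which PSL column was
actually used, so treat that definition as an assumption.

```python
MIN_MATCH_LENGTH = 150  # threshold quoted in the text


def match_length(psl_line):
    """Aligned span on the query (EST): qEnd - qStart, PSL fields 13 and 12."""
    fields = psl_line.rstrip("\n").split("\t")
    q_start, q_end = int(fields[11]), int(fields[12])
    return q_end - q_start


def keep_long_matches(psl_lines, min_length=MIN_MATCH_LENGTH):
    """Drop hits whose match length is below min_length."""
    return [line for line in psl_lines if match_length(line) >= min_length]
```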

Many ESTs hit more than once.  The distribution of the number of hits
per EST is:

# Of
Hits per
Feature       Count
-------       -----
01		30843
02		14684
03		5052
04		1425
05		726
06		1922
07		2171
08		127
09		65
10		48
20		130
30		15
40		6
50		1
60		2
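A table like the one above can be tallied from the EST name of each
hit.  The sketch below assumes the names have already been extracted
(for example from the qName field of each PSL line); the function names
are illustrative only.

```python
from collections import Counter


def hits_per_est(est_names):
    """Map each EST name to the number of hits it has."""
    return Counter(est_names)


def hit_frequency_distribution(est_names):
    """Map 'hits per EST' to 'number of ESTs with that many hits'."""
    return Counter(hits_per_est(est_names).values())


if __name__ == "__main__":
    # Toy input: one entry per hit, named by the EST that produced it.
    names = ["est_a", "est_a", "est_b", "est_c", "est_c", "est_c"]
    for n_hits, n_ests in sorted(hit_frequency_distribution(names).items()):
        print(f"{n_hits:02d}\t\t{n_ests}")
```
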



ESTs that hit more than three times are removed, leaving 75,367 hits.
The percent identity of these hits is distributed as follows:

% Identity	 Count
----------	 ----------
82		605
83		2212
84		3711
85		5016
86		6550
87		7570
88		8746
89		9476
90		9201
91		7465
92		5414
93		3701
94		2452
95		1598
96		918
97		329
98		162
99		165
100		76
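
The multi-hit filter and the identity tally can be sketched as follows.
Two assumptions are made here because the notes do not spell them out:
percent identity is taken as matches / (matches + mismatches), and it
is truncated to a whole number for binning.  Each hit is represented as
a simple (est_name, matches, mismatches) tuple.

```python
from collections import Counter


def drop_frequent_ests(hits, max_hits=3):
    """Keep only hits whose EST has at most max_hits hits in total."""
    counts = Counter(est for est, _, _ in hits)
    return [hit for hit in hits if counts[hit[0]] <= max_hits]


def percent_identity(matches, mismatches):
    """Assumed definition: matching bases over all aligned bases, in percent."""
    return 100.0 * matches / (matches + mismatches)


def identity_histogram(hits):
    """Count hits per whole-number percent identity (truncated, an assumption)."""
    return Counter(int(percent_identity(m, mm)) for _, m, mm in hits)
```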


The distribution of gap lengths in the sequenced clones is:

BAC Gap
Length (bp)   Count
-----------   ----------
01000		62260
02000		9418
03000		1979
04000		480
05000		246
06000		182
07000		64
08000		59
09000		21
10000		36
20000		302
30000		150
40000		47
50000		47
60000		43
70000		13
80000		7
90000		4
>90000		9
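
As a sanity check, the counts in the tables above can be summed to
confirm that they describe the same set of 75,367 hits.  The numbers
below are copied from the tables; nothing else is assumed.

```python
# ESTs with one, two, or three hits (the classes kept by the filter).
hits_per_est = {1: 30843, 2: 14684, 3: 5052}

identity_counts = [605, 2212, 3711, 5016, 6550, 7570, 8746, 9476, 9201,
                   7465, 5414, 3701, 2452, 1598, 918, 329, 162, 165, 76]

gap_counts = [62260, 9418, 1979, 480, 246, 182, 64, 59, 21, 36,
              302, 150, 47, 47, 43, 13, 7, 4, 9]

kept_ests = sum(hits_per_est.values())                   # 50579
kept_hits = sum(n * c for n, c in hits_per_est.items())  # 75367

print(kept_ests, kept_hits)
print(sum(identity_counts) == kept_hits)  # True
print(sum(gap_counts) == kept_hits)       # True
```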


The hits represent 50,579 unique ESTs and 1,705 sequenced clones.
Hits with at least one gap of 50 bp or longer are considered multi-exon
hits.  40,400 are multi-exon and 12,482 are single-exon hits.
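
The multi-exon rule can be sketched from the block coordinates of a PSL
line.  This assumes that a "gap" is the unaligned stretch on the clone
(target) between consecutive alignment blocks, computed from the
blockSizes and tStarts fields; the notes do not state how gaps were
actually measured.

```python
MIN_GAP = 50  # gap threshold quoted in the text


def parse_blocks(psl_line):
    """Return (block_sizes, t_starts) from PSL fields 19 and 21."""
    fields = psl_line.rstrip("\n").split("\t")
    block_sizes = [int(x) for x in fields[18].rstrip(",").split(",")]
    t_starts = [int(x) for x in fields[20].rstrip(",").split(",")]
    return block_sizes, t_starts


def target_gaps(block_sizes, t_starts):
    """Gap lengths on the target between consecutive alignment blocks."""
    return [t_starts[i + 1] - (t_starts[i] + block_sizes[i])
            for i in range(len(t_starts) - 1)]


def is_multi_exon(block_sizes, t_starts, min_gap=MIN_GAP):
    """True if any gap between blocks is at least min_gap long."""
    return any(gap >= min_gap for gap in target_gaps(block_sizes, t_starts))
```

A hit with a single block has no gaps and is therefore classed as
single-exon under this rule.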

