/*
This documents the alignment of Sorghum bicolor ESTs to the rice genome.
Lenny Teytelman
Mon Mar 25 12:40:01 2002
*/

The BACs/PACs are from the GenBank Entrez Nucleotide query:

"Oryza [ORGN] AND (30000 [SLEN]:250000 [SLEN]) AND ((htg [KYWD] OR BAC
[ALL] OR chromosome [TITL] OR PAC [ALL]) NOT (marker [TITL] OR cDNA
[TITL] OR mRNA [TITL] OR RAPD [TITL] OR GSS [KYWD] OR telomere [TITL]
OR protein[TITL]))"

The ESTs are from the GenBank Nucleotide query "txid4558[orgn] AND
gbdiv_est[PROP]".  The average EST length is 475 bp.



1,847 BAC/PAC sequences were compared to 84,711 ESTs using BLAT with
-minIdentity=50.  The 130,295 BLAT hits were filtered using the pslReps
utility with -minAli=0.85 -nearTop=0.01.  This resulted in 68,126
alignments.
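
The two steps above can be sketched as the command lines below, built in
Python so they can be inspected before anything is run.  The file names
(rice_bacs.fa, sorghum_ests.fa, raw.psl, and so on) are hypothetical
placeholders, and the sketch assumes the usual "blat database query
output.psl" and "pslReps in.psl out.psl out.psr" argument order; only the
option values come from the text.

```python
import shlex

def build_commands(database="rice_bacs.fa", query="sorghum_ests.fa",
                   raw_psl="raw.psl", best_psl="filtered.psl",
                   psr="filtered.psr"):
    """Return the BLAT and pslReps command lines as argument lists.

    All file names are hypothetical; the option values are the ones
    quoted in the text.
    """
    blat = ["blat", database, query, "-minIdentity=50", raw_psl]
    psl_reps = ["pslReps", "-minAli=0.85", "-nearTop=0.01",
                raw_psl, best_psl, psr]
    return blat, psl_reps

blat_cmd, reps_cmd = build_commands()
# Print the commands instead of executing them.
print(shlex.join(blat_cmd))
print(shlex.join(reps_cmd))
```

Passing either list to subprocess.run would execute the step, provided
the tools are installed and the input files exist.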

The lengths of the matches are distributed as follows:

Length of
hits	     Count
--------     ------
0-100		13040
100-150		8259
150-200		8372
200-250		7866
250-300		7945
300-350		7464
350-400		6186
400-450		4337
450-500		2527
500-550		1400
550-600		500
600-650		177
650-700		42
700-750		2
750-800		4
>800		5



Removing matches shorter than 150 bp leaves 46,660 hits.
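
A minimal sketch of this length filter on PSL rows follows.  The text
does not say how match length was measured, so the sketch assumes it is
the sum of the PSL matches, misMatches and repMatches columns; the rows
themselves are synthetic and only illustrate the format.

```python
def match_length(psl_line):
    """Aligned length of one PSL row.

    Assumption: length = matches + misMatches + repMatches
    (PSL columns 1-3); the original notes do not define it.
    """
    f = psl_line.rstrip("\n").split("\t")
    return int(f[0]) + int(f[1]) + int(f[2])

def filter_by_length(psl_lines, min_length=150):
    """Keep PSL rows whose aligned length is at least min_length."""
    return [ln for ln in psl_lines if match_length(ln) >= min_length]

def make_row(matches, mismatches, q_name):
    """Build a synthetic single-block, 21-column PSL row (illustration only)."""
    span = matches + mismatches
    fields = [matches, mismatches, 0, 0, 0, 0, 0, 0, "+", q_name, 475, 0,
              span, "bac1", 100000, 1000, 1000 + span, 1,
              f"{span},", "0,", "1000,"]
    return "\t".join(str(x) for x in fields)

rows = [make_row(120, 10, "est1"),   # 130 bp, dropped
        make_row(300, 40, "est2"),   # 340 bp, kept
        make_row(140, 10, "est3")]   # 150 bp, kept
kept = filter_by_length(rows)
print(len(kept))  # 2
```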

Many of the ESTs hit more than once.  The distribution of the hit
frequencies is:

# Of
Hits per
Feature       Count
-------       -----
01		17961
02		7662
03		2421
04		442
05		221
06		239
07		204
08		33
09		3
10		6
20		2
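
A tally like the one above can be sketched from the list of EST (query)
names, one entry per hit.  The names below are made up for illustration.

```python
from collections import Counter

def hit_frequency(q_names):
    """Map 'hits per EST' -> 'number of ESTs with that many hits'."""
    hits_per_est = Counter(q_names)          # EST name -> number of hits
    return Counter(hits_per_est.values())    # hit count -> number of ESTs

# One entry per hit; est2 hits twice, est3 three times.
names = ["est1", "est2", "est2", "est3", "est3", "est3", "est4"]
freq = hit_frequency(names)
for n_hits in sorted(freq):
    print(f"{n_hits:02d}\t\t{freq[n_hits]}")
```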



ESTs with more than three hits are removed, leaving 40,548 hits.  The
percent identity of these hits is distributed as follows:

% Identity	 Count
----------	 ----------
82		507
83		1296
84		1824
85		2652
86		3175
87		4008
88		4402
89		4842
90		4934
91		4484
92		3142
93		1889
94		1225
95		1139
96		479
97		154
98		107
99		230
100		59
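
The removal step behind this table can be sketched as below, together
with a consistency check of the counts quoted in the text.  The
(EST name, record) pair layout is an assumption made for the example;
the frequency values are copied from the hits-per-feature table.

```python
from collections import Counter

def drop_frequent_hitters(hits, max_hits=3):
    """Keep only hits from ESTs that have at most max_hits hits.

    hits: list of (est_name, record) pairs (hypothetical layout).
    """
    counts = Counter(name for name, _ in hits)
    return [(name, rec) for name, rec in hits if counts[name] <= max_hits]

toy_hits = [("a", 1), ("b", 1), ("b", 2),
            ("c", 1), ("c", 2), ("c", 3), ("c", 4)]
kept_hits = drop_frequent_hitters(toy_hits)
print(len(kept_hits))  # 3: EST "c" has four hits and is dropped

# Consistency check with the hits-per-feature table (rows 01 to 03).
freq_table = {1: 17961, 2: 7662, 3: 2421}
remaining_hits = sum(n * count for n, count in freq_table.items())
remaining_ests = sum(freq_table.values())
print(remaining_hits, remaining_ests)  # 40548 28044
```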


The distribution of gap lengths in the sequenced clones is:

BAC Gap
Length	      Count
------	      ----------
01000		35394
02000		3717
03000		683
04000		241
05000		110
06000		73
07000		36
08000		10
09000		19
10000		14
20000		126
30000		57
40000		42
50000		9
60000		10
70000		3
80000		3
>90000		1


The hits represent 28,044 unique ESTs and 1,634 sequenced clones.
Hits with at least one gap of 50 bp or longer are considered
multi-exon hits.  21,150 are multi-exon and 8,020 are single-exon hits.
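
The multi-exon rule can be sketched from the PSL blockSizes and tStarts
columns.  The sketch assumes a gap is the unaligned stretch of the
sequenced clone between two consecutive alignment blocks; the example
values are made up.

```python
def target_gaps(block_sizes, t_starts):
    """Gap lengths on the target (clone) between consecutive blocks.

    block_sizes, t_starts: comma-separated PSL fields, e.g. "120,200,".
    """
    sizes = [int(x) for x in block_sizes.strip(",").split(",")]
    starts = [int(x) for x in t_starts.strip(",").split(",")]
    return [starts[i + 1] - (starts[i] + sizes[i])
            for i in range(len(sizes) - 1)]

def is_multi_exon(block_sizes, t_starts, min_gap=50):
    """True if any target gap is at least min_gap long."""
    return any(g >= min_gap for g in target_gaps(block_sizes, t_starts))

print(target_gaps("120,200,", "1000,2000,"))    # [880]
print(is_multi_exon("120,200,", "1000,2000,"))  # True
print(is_multi_exon("150,100,", "1000,1160,"))  # False (gap of 10)
print(is_multi_exon("300,", "500,"))            # False (single block)
```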

