/*
This documents the processing of Oryza sativa BAC-EST alignments
Lenny Teytelman
Tue Apr  2 08:53:38 2002
*/
The BACs/PACs are from the GenBank Entrez Nucleotide query:

"Oryza [ORGN] AND (30000 [SLEN]:250000 [SLEN]) AND ((htg [KYWD] OR BAC
[ALL] OR chromosome [TITL] OR PAC [ALL]) NOT (marker [TITL] OR cDNA
[TITL] OR mRNA [TITL] OR RAPD [TITL] OR GSS [KYWD] OR telomere [TITL]
OR protein[TITL]))"

The ESTs are from the GenBank Entrez Nucleotide query "txid4530[orgn]
AND gbdiv_est[PROP]".  The average EST length is 367 bp.



1,847 BAC/PAC sequences were compared to 104,549 ESTs using BLAT with
minScore=120.  The 173,370 BLAT hits were filtered using the pslReps
utility.  This resulted in 103,766 alignments.
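
BLAT reports its hits in PSL format, and the later filtering steps only
need a handful of the PSL columns.  The following is a minimal sketch of
reading one PSL line in Python.  It assumes the standard 21-column,
tab-separated PSL layout with the ESTs as query and the BACs/PACs as
target; the names PslHit and parse_psl_line and the sample line are made
up for illustration and are not part of the original pipeline.

```python
from typing import List, NamedTuple


class PslHit(NamedTuple):
    """Subset of PSL columns used by the filters (illustrative only)."""
    matches: int
    mismatches: int
    rep_matches: int
    q_name: str          # EST accession (query)
    q_size: int
    q_start: int
    q_end: int
    t_name: str          # BAC/PAC accession (target)
    t_size: int
    t_start: int
    t_end: int
    block_sizes: List[int]
    t_starts: List[int]


def _int_list(field: str) -> List[int]:
    # PSL list columns are comma-separated with a trailing comma.
    return [int(x) for x in field.split(",") if x]


def parse_psl_line(line: str) -> PslHit:
    f = line.rstrip("\n").split("\t")
    if len(f) < 21:
        raise ValueError("expected 21 tab-separated PSL columns")
    return PslHit(
        matches=int(f[0]), mismatches=int(f[1]), rep_matches=int(f[2]),
        q_name=f[9], q_size=int(f[10]), q_start=int(f[11]), q_end=int(f[12]),
        t_name=f[13], t_size=int(f[14]), t_start=int(f[15]), t_end=int(f[16]),
        block_sizes=_int_list(f[18]), t_starts=_int_list(f[20]),
    )


# Hypothetical hit: a 360 bp EST aligned in two blocks to a 120 kb BAC.
SAMPLE = "\t".join([
    "350", "2", "0", "0", "0", "0", "1", "500", "+",
    "EST1", "360", "4", "356",
    "BAC1", "120000", "1000", "1852",
    "2", "200,152,", "4,204,", "1000,1700,",
])
hit = parse_psl_line(SAMPLE)
```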



True hits should extend over the whole EST, unless the hit is at the
very end or beginning of a BAC/PAC.  The following is the distribution
of the percentage of each EST covered by its hit:

% of EST
matched    Count
--------   ------
10-19		37
20-29		1093
30-39		3387
40-49		6760
50-59		3070
60-69		2683
70-79		3030
80-84		2443
85-89		3728
90		1123
91		1374
92		1688
93		2205
94		3336
95		4430
96		6327
97		9403
98		15082
99		20489
=100		12078
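
The table above bins hits by the share of the EST that is aligned.  The
notes do not say exactly how coverage was computed; the sketch below
assumes it is the aligned query span (qEnd - qStart) divided by the EST
length, rounded down to a whole percent.  The function names are
hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def est_coverage_percent(q_start: int, q_end: int, q_size: int) -> int:
    """Whole-percent share of the EST spanned by the hit (assumed definition)."""
    return (q_end - q_start) * 100 // q_size


def coverage_histogram(hits: Iterable[Tuple[int, int, int]]) -> Counter:
    """Count hits per coverage percent; each hit is (q_start, q_end, q_size)."""
    return Counter(est_coverage_percent(*h) for h in hits)


# Three made-up hits: one full-length, two partial.
example = coverage_histogram([(0, 360, 360), (4, 356, 360), (0, 180, 360)])
```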



Discarding the hits that both cover less than 96 percent of the EST and
start/stop more than 20 bp away from the BAC/PAC edge leaves 68,938 hits.
These have the following BAC gaps:

BAC Gap
Length	   Count
------ ----------
01000		61356
02000		5799
03000		1207
04000		272
05000		137
06000		39
07000		50
08000		13
09000		10
10000		7
20000		27
30000		9
40000		4
50000		4
60000		2
>90000		2
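
The coverage/edge filter and the BAC gap tabulated above can be sketched
as follows.  This is an illustration under stated assumptions, not the
original script: a hit is kept if it covers at least 96 percent of the
EST or touches a BAC/PAC edge within 20 bp, and the "BAC gap" is taken
to be the largest distance between consecutive alignment blocks on the
BAC (the notes do not say whether the largest or the total gap was
used).  Function and parameter names are hypothetical.

```python
from typing import List


def keep_hit(coverage_pct: int, t_start: int, t_end: int, t_size: int,
             min_coverage: int = 96, edge: int = 20) -> bool:
    """Keep well-covered hits, plus partial hits that run off a BAC/PAC end."""
    near_edge = t_start <= edge or (t_size - t_end) <= edge
    return coverage_pct >= min_coverage or near_edge


def target_gaps(block_sizes: List[int], t_starts: List[int]) -> List[int]:
    """Distances on the BAC between the end of one block and the next start."""
    return [
        t_starts[i + 1] - (t_starts[i] + block_sizes[i])
        for i in range(len(t_starts) - 1)
    ]


def largest_target_gap(block_sizes: List[int], t_starts: List[int]) -> int:
    """Largest BAC gap in a hit; 0 for a single-block hit."""
    return max(target_gaps(block_sizes, t_starts), default=0)
```

With these helpers, the later removal of hits with gaps above 3,000
would be a check of the form largest_target_gap(...) <= 3000.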



For the sake of clean displays, the hits with gaps greater than 3,000 bp
are removed.  This results in 68,362 entries.  The percent identity for
the remaining matches is:

Percent
Identity	Count
---------- ----------
92		1
94		20
95		136
96		571
97		1899
98		6785
99		24088
100		34862
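
The notes do not state how percent identity was derived from the BLAT
output.  One common convention, assumed here purely for illustration,
is matching bases over all aligned bases, using the PSL matches,
repMatches and misMatches columns and rounding down.

```python
def percent_identity(matches: int, mismatches: int, rep_matches: int = 0) -> int:
    """Whole-percent identity over aligned bases (assumed definition)."""
    aligned = matches + rep_matches + mismatches
    if aligned == 0:
        raise ValueError("hit has no aligned bases")
    return (matches + rep_matches) * 100 // aligned
```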


Some of the ESTs hit more than once.  The distribution of the hits is:

# Of
Hits per
EST	  # of EST
---	-----
01		35605
02		12433
03		1174
04		197
05		55
06		118
07		123
08		14
09		5
10		2
20		21
30		8
40		5
50		2
60		6
70		1
>90		4
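
A distribution like the one above can be obtained by counting hits per
EST name and then counting how many ESTs share each hit count.  The
sketch below shows the idea on made-up EST names; hits_per_est_histogram
is a hypothetical helper, not part of the original processing.

```python
from collections import Counter
from typing import Iterable


def hits_per_est_histogram(est_names: Iterable[str]) -> Counter:
    """Map 'number of hits' -> 'number of ESTs with that many hits'."""
    hits_per_est = Counter(est_names)      # EST name -> number of hits
    return Counter(hits_per_est.values())  # number of hits -> number of ESTs


# One EST name per surviving hit (made-up data).
histogram = hits_per_est_histogram(["a", "a", "b", "c", "c", "c"])
```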



The hits represent 49,773 unique ESTs and 1,821 sequenced clones.  Those
having at least one gap of 50 bp or longer are considered multi-exon
EST hits.  21,707 are multi-exon and 28,106 are single-exon hits.
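
The 50 bp rule can be sketched as a simple classification over the
per-hit gap lengths.  The sketch assumes the gaps are the BAC-side
distances between consecutive alignment blocks of a hit and that they
have already been computed; the function names and the example gap
lists are hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def classify_hit(gaps: List[int], min_gap: int = 50) -> str:
    """Label a hit multi-exon if any gap is at least min_gap bases long."""
    return "multi-exon" if any(g >= min_gap for g in gaps) else "single-exon"


def tally_tracks(all_gaps: Iterable[List[int]]) -> Counter:
    """Count how many hits fall into each of the two classes."""
    return Counter(classify_hit(g) for g in all_gaps)


# Made-up gap lists: no gaps, one small gap, one intron-sized gap.
tally = tally_tracks([[], [12], [8, 500]])
```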

The above hits went into the Sequence Viewer as two separate tracks.
