/*
This documents the alignment of Oryza sativa BACends to the rice genome.
Lenny Teytelman
Sun Mar 24 21:03:14 2002
*/

The BACs/PACs are from the GenBank Entrez Nucleotide query:

"Oryza [ORGN] AND (30000 [SLEN]:250000 [SLEN]) AND ((htg [KYWD] OR BAC
[ALL] OR chromosome [TITL] OR PAC [ALL]) NOT (marker [TITL] OR cDNA
[TITL] OR mRNA [TITL] OR RAPD [TITL] OR GSS [KYWD] OR telomere [TITL]
OR protein[TITL]))"

The CUGI BACends are from the GenBank nucleotide query "(CUGI Rice BAC
end) AND (oryza [ORGN])".  The average BACend length is 620 bp.

The BACends were masked with RepeatMasker using the Arabidopsis and
grasses repeat libraries.

The 1,847 BAC/PAC sequences were compared to the 88,053 BACends using
BLAT with minScore=160.  The results contained a total of 2,065,620 hits.

Real matches should not contain large gaps.  The distribution of gap
lengths on the sequenced clone (BAC/PAC) side of the matches is:

BAC Gap
Length (bp)   Count
-----------   ----------
<010		1027211
<020		234484
<030		133041
<040		89206
<050		62271
<060		49414
<070		42328
<080		35388
<090		29602
<100		23989
<200		126542
<300		51897
<400		31779
<500		17379
<600		6109
<700		4286
<800		3755
<900		2731
>900		94208

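A tally like the one above can be produced by binning the gap length of
each match.  The following is a minimal Python sketch, not the original
script.  It assumes the labels are upper bounds of half-open bins (so
"<020" holds gaps of 10 to 19 bp) and that ">900" collects everything
from 900 bp up; the function names are illustrative only.

```python
from collections import Counter

# Upper bounds of the bins in the table: 10..100 by 10, then 200..900 by 100.
BOUNDS = list(range(10, 101, 10)) + list(range(200, 901, 100))

def gap_bin(gap):
    """Return the table label for a gap length in bp (assumed half-open bins)."""
    for bound in BOUNDS:
        if gap < bound:
            return "<%03d" % bound
    return ">900"  # assumption: 900 bp and above land in the last bin

def gap_histogram(gaps):
    """Count gap lengths per bin label."""
    return Counter(gap_bin(g) for g in gaps)

# Toy gap lengths, not taken from the real BLAT output.
hist = gap_histogram([0, 9, 10, 19, 150, 899, 900, 5000])
```

With these assumptions the sketch yields 19 labels, matching the number
of rows in the table.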

Selecting only matches with a gap length of up to 20 base pairs leaves
1,277,263 entries.  True hits should also extend over the whole
BACend, unless the hit lies at the very beginning or end of a BAC/PAC.
The following is the distribution of the percentage of each BACend
covered by its match:

% of BACends
matched    Count
--------   ------
10-19		899
20-29		81238
30-39		135453
40-49		134496
50-59		117458
60-69		116940
70-79		137645
80-84		117684
85-89		143694
90		35733
91		36600
92		38391
93		38663
94		35556
95		29782
96		22894
97		18363
98		16277
99		15883
=100		3614
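
The coverage figure and the edge exception described above can be
sketched as follows.  This is an illustrative Python sketch rather than
the original code; the coordinate convention (0-based, end-exclusive)
and all names are assumptions.  The default margin of 20 bp is the edge
cutoff used in the filtering step of this document.

```python
def percent_coverage(end_start, end_stop, end_length):
    """Percent of the BAC end spanned by the match (0-based, end-exclusive)."""
    return 100.0 * (end_stop - end_start) / end_length

def near_clone_edge(clone_start, clone_stop, clone_length, margin=20):
    """True if the match starts or stops within `margin` bp of a BAC/PAC edge."""
    return clone_start <= margin or clone_length - clone_stop <= margin

# A 620 bp BAC end matched over bases 10..610 covers about 96.8% of its length.
cov = percent_coverage(10, 610, 620)

# A match that runs up to the end of a 150,000 bp clone may be partial.
partial_ok = near_clone_edge(149_700, 150_000, 150_000)
```

A match with low coverage would then be kept only when `near_clone_edge`
is true for it.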


Discarding entries whose BACend coverage is below 96% and whose
start/stop lies more than 20 bp from the BAC/PAC edge leaves 138,880
entries.  Among these, the percent identity of the matching blocks is
distributed as follows:

% Identity	 Count
----------	 ----------
88		169
89		1667
90		3761
91		5643
92		6985
93		7944
94		9020
95		17224
96		19312
97		18551
98		19234
99		20152
100		9218


Filtering out the matches with less than 97% identity leaves 67,155
entries (the sum of the 97% to 100% rows above).
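
The identity cutoff can be sketched in Python as below.  This is only a
sketch: the notes do not say how percent identity was computed, so the
formula (matching bases over all aligned bases, gaps excluded) is an
assumption, as are the dictionary keys and function names.

```python
def percent_identity(matches, mismatches):
    """Identity over aligned blocks: matching bases / all aligned bases."""
    aligned = matches + mismatches
    return 100.0 * matches / aligned if aligned else 0.0

def keep_by_identity(hits, min_identity=97.0):
    """Keep hits whose block identity is at least `min_identity` percent."""
    return [h for h in hits
            if percent_identity(h["matches"], h["mismatches"]) >= min_identity]

# Toy hits with made-up names and counts.
hits = [
    {"name": "endA", "matches": 610, "mismatches": 10},  # about 98.4%
    {"name": "endB", "matches": 580, "mismatches": 40},  # about 93.5%
]
kept = keep_by_identity(hits)
```

Only `endA` passes the 97% cutoff in this toy input.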

Many of the BACends hit more than once.  The distribution of the hit
frequencies is:

# Of
Hits per
Feature       Count
-------       -----
01		11686
02		3287
03		1129
04		530
05		243
06		192
07		210
08		317
09		161
10		159
20		493
30		261
40		147
50		49
60		29
70		50
80		31
90		28
>90		44


BACends that hit three or more times are removed, leaving 18,260 hits
(11,686 + 2 x 3,287 from the table above).  These went into the Gramene
Sequence Viewer.
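
The removal of repeatedly hitting BACends can be sketched in Python as
below.  It is a minimal sketch, not the original script: hits are
assumed to be (BACend, clone) pairs, the names are hypothetical, and the
cutoff is left as a parameter to be set to match the rule above.

```python
from collections import Counter

def drop_multihit_ends(hits, max_hits):
    """Keep only hits whose BAC end occurs at most `max_hits` times overall."""
    per_end = Counter(end for end, _clone in hits)
    return [(end, clone) for end, clone in hits if per_end[end] <= max_hits]

# Toy (BAC end, clone) pairs; the identifiers are made up.
hits = [("e1", "c1"), ("e2", "c1"), ("e2", "c2"),
        ("e3", "c1"), ("e3", "c2"), ("e3", "c3"), ("e3", "c4")]
kept = drop_multihit_ends(hits, max_hits=2)
```

Here `e3` hits four times and is dropped entirely, while `e1` and `e2`
keep all of their hits.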

