/*
This documents the mapping of Cornell and JRGP Oryza sativa genetic
markers to the rice genome.
Lenny Teytelman
Sun Mar 24 20:22:26 2002
*/


The BACs are from the GenBank Entrez Nucleotide query:

"Oryza [ORGN] AND (30000 [SLEN]:250000 [SLEN]) AND ((htg [KYWD] OR BAC
[ALL] OR chromosome [TITL] OR PAC [ALL]) NOT (marker [TITL] OR cDNA
[TITL] OR mRNA [TITL] OR RAPD [TITL] OR GSS [KYWD] OR telomere [TITL]
OR protein[TITL]))"

The genetic markers are from the Cornell and JRGP genetic maps.  Using
accessions of the markers, sequences were retrieved for each marker
from GenBank.
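
As a rough illustration, the BAC query quoted above could also be issued
programmatically.  The sketch below only builds a request URL for the NCBI
E-utilities ESearch service and sends nothing.  The use of E-utilities, the
database name "nuccore", and the retmax value are assumptions made for this
example; these notes do not say how the query was actually run.

```python
from urllib.parse import urlencode

# Query text copied from the notes above.
BAC_QUERY = (
    "Oryza [ORGN] AND (30000 [SLEN]:250000 [SLEN]) AND "
    "((htg [KYWD] OR BAC [ALL] OR chromosome [TITL] OR PAC [ALL]) "
    "NOT (marker [TITL] OR cDNA [TITL] OR mRNA [TITL] OR RAPD [TITL] "
    "OR GSS [KYWD] OR telomere [TITL] OR protein[TITL]))"
)

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term, db="nuccore", retmax=5000):
    """Build (but do not send) an ESearch request URL for the given term.

    db and retmax are illustrative defaults, not values from these notes.
    """
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


url = esearch_url(BAC_QUERY)
```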

1,847 BAC sequences were compared to 3,965 marker sequences from a total of
2,682 genetic markers, using BLAT with -minScore=120.  The BLAT results
contained 4,753 alignments.
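
A minimal sketch of how such a BLAT run might be assembled is shown below.
Only the -minScore=120 option comes from these notes.  The file names are
hypothetical, and treating the BACs as the database and the markers as the
query is an assumption.  The function only builds the argument list; it does
not execute BLAT.

```python
import shlex


def blat_command(database_fa, query_fa, out_psl, min_score=120):
    """Return the argv list for a BLAT run; nothing is executed here."""
    # Order follows the BLAT usage line: database, query, options, output.
    return ["blat", database_fa, query_fa, f"-minScore={min_score}", out_psl]


# Hypothetical file names, chosen only for illustration.
cmd = blat_command("bacs.fa", "markers.fa", "markers_vs_bacs.psl")
print(shlex.join(cmd))
```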

For the genomic markers, the distribution of BAC gap lengths is:

BAC Gap
Length (bp)     Count
-----------     -----
<010		339
<020		33
<030		15
<040		8
<050		3
<060		6
<070		5
<090		1
<200		3
<300		8
<400		2
<500		4
<800		1
>900		8


For the cDNA markers, the distribution of BAC gap lengths is:

BAC Gap
Length (bp)     Count
-----------     -----
<010		2059
<020		121
<030		50
<040		37
<050		17
<060		14
<070		18
<080		52
<090		79
<100		95
<200		296
<300		226
<400		163
<500		121
<600		94
<700		73
<800		80
<900		64
>900		511
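
The bin labels in the two gap tables above can be reproduced with a small
helper.  This is a sketch under assumptions: the tables appear to use 10 bp
bins below 100 bp and 100 bp bins from there up to 900 bp, with ">900" taken
here to hold everything from 900 bp upward.  How the gap length itself was
measured for each alignment is not stated in these notes.

```python
from collections import Counter


def gap_bin(gap_bp):
    """Map a gap length in bp to a label in the style of the tables above."""
    if gap_bp >= 900:
        return ">900"  # assumption: this bin starts at 900 bp
    # Tens below 100 bp, hundreds from 100 bp up; the label is the
    # exclusive upper edge of the bin, zero-padded to three digits.
    width = 10 if gap_bp < 100 else 100
    upper = (gap_bp // width + 1) * width
    return f"<{upper:03d}"


def gap_histogram(gaps):
    """Count gap lengths per bin label."""
    return Counter(gap_bin(g) for g in gaps)
```

Because the labels are zero-padded and "<" sorts before ">", sorting the
histogram keys as plain strings gives the same row order as the tables.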


For genomic markers, true hits should not contain large gaps, just as for
BAC-end to BAC alignments.  As with the BAC ends, genomic marker hits with a
gap greater than 20 bp were discarded.  cDNA markers can span introns, so
larger gaps are expected; for these, hits with gaps greater than 3000 bp
were filtered out.  This left 4,677 hits.
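
The gap filter described above amounts to a per-type threshold.  The sketch
below assumes each hit is a (marker name, marker type, gap length) tuple;
that layout and the sample hits are made up for illustration, while the
20 bp and 3000 bp limits come from the text.

```python
# Maximum allowed gap per marker type, in bp (values from the text above).
MAX_GAP_BP = {"genomic": 20, "cDNA": 3000}


def passes_gap_filter(marker_type, gap_bp):
    """True if the gap does not exceed the limit for this marker type."""
    return gap_bp <= MAX_GAP_BP[marker_type]


# Invented sample hits: (marker name, marker type, gap length in bp).
hits = [
    ("marker_A", "genomic", 5),
    ("marker_B", "genomic", 21),
    ("marker_C", "cDNA", 2500),
    ("marker_D", "cDNA", 3001),
]
kept = [h for h in hits if passes_gap_filter(h[1], h[2])]
```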

A true match should extend over most of the marker length.  The following
is the distribution of match lengths as a percentage of total marker
length:

% of Marker
Matched         Count
-----------     -----
=100		262
99		383
98		293
97		277
96		379
95		344
94		266
93		173
92		167
91		122
90		108
85-89		263
80-84		187
70-79		264
60-69		194
50-59		178
40-49		135
30-39		121
20-29		340
10-19		218
0-9		2
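
One way to compute the percentages tabulated above is sketched below.  It
assumes the marker is the BLAT query, that the matched length is the aligned
span on the marker (end minus start), and that percentages are floored to
whole numbers.  None of these choices is confirmed by the notes.

```python
def coverage_percent(q_start, q_end, q_size):
    """Percent of the marker covered by the aligned span, floored."""
    return 100 * (q_end - q_start) // q_size


def coverage_bin(pct):
    """Map a whole-number percentage to a row label of the table above."""
    if pct >= 100:
        return "=100"
    if pct >= 90:
        return str(pct)           # single-percent rows 90..99
    if pct >= 80:
        lo = pct // 5 * 5         # 5-point rows: 80-84, 85-89
        return f"{lo}-{lo + 4}"
    lo = pct // 10 * 10           # 10-point rows: 0-9 .. 70-79
    return f"{lo}-{lo + 9}"
```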


Discarding the matches that cover less than 90% of the marker leaves
2,775 hits.  The percent identity of the remaining matches is:

Percent
Identity	Count
---------- ----------
95		2
96		14
97		38
98		183
99		1081
100		1457
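
The 90% coverage cut and the identity figures above could be computed along
the following lines.  This is a sketch: identity is taken here as matching
bases over matching plus mismatching bases, which is one common definition
and may not be the one used for the table.

```python
def keep_hit(covered_bp, marker_len, min_percent=90):
    """True if the match covers at least min_percent of the marker.

    Integer arithmetic avoids floating-point edge cases at the cutoff.
    """
    return 100 * covered_bp >= min_percent * marker_len


def percent_identity(matches, mismatches):
    """Percent of aligned bases that match (assumed definition)."""
    aligned = matches + mismatches
    return 100.0 * matches / aligned if aligned else 0.0
```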


Of these matches, 2,531 are from markers that align to a BAC on the same
chromosome as the marker map indicates.  These hits represent 1,320 markers
on 1,062 sequenced clones, and they were loaded into the Sequence Viewer.
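
The chromosome check and the final tallies can be sketched as follows.  The
data layout (hits as (marker, clone) pairs, plus two lookup tables of
chromosome assignments) and every name in the toy data are invented for
illustration.

```python
def concordant_hits(hits, marker_chrom, clone_chrom):
    """Keep hits whose marker and clone are assigned to the same chromosome."""
    kept = []
    for marker, clone in hits:
        m_chr = marker_chrom.get(marker)
        # Hits with an unknown chromosome on either side are dropped.
        if m_chr is not None and m_chr == clone_chrom.get(clone):
            kept.append((marker, clone))
    return kept


def summarize(hits):
    """Return (hit count, distinct markers, distinct clones)."""
    return len(hits), len({m for m, _ in hits}), len({c for _, c in hits})


# Toy data; names and chromosome numbers are made up.
marker_chrom = {"marker_A": 1, "marker_B": 5, "marker_C": 3}
clone_chrom = {"clone_1": 1, "clone_2": 5, "clone_3": 7}
hits = [
    ("marker_A", "clone_1"),
    ("marker_B", "clone_2"),
    ("marker_B", "clone_1"),
    ("marker_C", "clone_3"),
]
kept = concordant_hits(hits, marker_chrom, clone_chrom)
```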

