/*
This documents the alignment of Oryza sativa CDSs to the rice genome.
Lenny Teytelman
Fri Apr 26 00:18:05 2002
*/
The BACs/PACs are from the GenBank Entrez Nucleotide query:

"Oryza [ORGN] AND (30000 [SLEN]:250000 [SLEN]) AND ((htg [KYWD] OR BAC
[ALL] OR chromosome [TITL] OR PAC [ALL]) NOT (marker [TITL] OR cDNA
[TITL] OR mRNA [TITL] OR RAPD [TITL] OR GSS [KYWD] OR telomere [TITL]
OR protein[TITL]))".

The Oryza sativa coding sequences are from the GenBank Entrez
Nucleotide query: "(txid4530[ORGN] AND complete[TITL] AND cds[TITL])
NOT (Mitochondrion[ALL] OR Chloroplast[ALL] OR Mitochondrial[ALL])".
The average CDS length is 2,091 bp.


The 1,858 BAC/PAC sequences were compared to the 1,358 CDSs using BLAT
with minScore=120.  The 28,064 BLAT hits were filtered using the pslReps
utility, which left 1,790 alignments.
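
As a rough sketch of the step above, the two commands can be assembled
as argument lists in Python.  The file names are hypothetical, and using
the BACs/PACs as the BLAT database and the CDSs as the query is an
assumption.  These notes record only minScore=120 and no pslReps
options, so none are passed here.

```python
from typing import List


def blat_command(database_fa: str, query_fa: str, out_psl: str,
                 min_score: int = 120) -> List[str]:
    """Build a BLAT command line: blat database query [options] output.psl."""
    return ["blat", database_fa, query_fa, f"-minScore={min_score}", out_psl]


def psl_reps_command(in_psl: str, out_psl: str, out_psr: str) -> List[str]:
    """Build a pslReps command line: pslReps in.psl out.psl out.psr."""
    return ["pslReps", in_psl, out_psl, out_psr]


# Hypothetical file names, for illustration only.
commands = [
    blat_command("rice_bacs.fa", "rice_cds.fa", "raw.psl"),
    psl_reps_command("raw.psl", "filtered.psl", "filtered.psr"),
]
```

Each list can be handed to subprocess.run(cmd, check=True) on a machine
where the two utilities are installed.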



A true hit should extend over the entire CDS, unless the hit is at the
very beginning or end of a BAC/PAC.  The following is the distribution
of the percentage coverage for the CDSs:

% of CDS
matched		Count
--------	-----
0-9		237
10-19		36
20-29		71
30-39		16
40-49		12
50-59		11
60-69		27
70-79		31
80-84		29
85-89		22
90		8
91		9
92		13
93		6
94		45
95		82
96		25
97		40
98		72
99		349
=100		195
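
A minimal sketch of how such a coverage table could be tallied from PSL
records.  The coverage definition (aligned query span divided by query
size, using the PSL qSize, qStart and qEnd fields) and the truncation to
a whole percent are assumptions; the notes do not say how the percentage
was computed.

```python
from collections import Counter


def percent_coverage(q_size: int, q_start: int, q_end: int) -> float:
    """Percent of the CDS (query) spanned by the alignment."""
    return 100.0 * (q_end - q_start) / q_size


def coverage_bin(pct: float) -> str:
    """Map a coverage percentage to the bin labels used in the table."""
    p = int(pct)  # truncate, so 99.9 falls in the "99" bin (assumption)
    if p >= 100:
        return "=100"
    if p >= 90:
        return str(p)
    if p >= 80:
        return "80-84" if p < 85 else "85-89"
    low = (p // 10) * 10
    return f"{low}-{low + 9}"


def coverage_histogram(records) -> Counter:
    """records: iterable of (qSize, qStart, qEnd) tuples."""
    return Counter(coverage_bin(percent_coverage(*r)) for r in records)
```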



Hits that cover less than 96 percent of the CDS and also start/stop more
than 20 bp away from a BAC/PAC edge are discarded, which leaves 803 hits.
These have the following BAC gaps:

BAC gap
length		Count
-------		-----
01000		517
02000		129
03000		73
04000		30
05000		21
06000		11
07000		3
08000		2
09000		7
10000		3
20000		5
30000		2
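
The filter and the gap measurement above might look roughly like this.
It is a sketch under assumptions: "BAC gap" is taken to be the largest
gap between consecutive alignment blocks on the BAC/PAC (target) side,
computed from the PSL blockSizes and tStarts fields of an untranslated
alignment, and a hit counts as near an edge when either end lies within
20 bp of the clone boundary.

```python
def near_edge(t_start: int, t_end: int, t_size: int, edge: int = 20) -> bool:
    """True if the hit starts or stops within `edge` bp of a clone end."""
    return t_start <= edge or (t_size - t_end) <= edge


def keep_hit(pct_coverage: float, t_start: int, t_end: int, t_size: int,
             min_cov: float = 96.0, edge: int = 20) -> bool:
    """Keep well-covered hits, plus partial hits that run off a clone edge."""
    return pct_coverage >= min_cov or near_edge(t_start, t_end, t_size, edge)


def largest_target_gap(block_sizes, t_starts) -> int:
    """Largest gap (bp) between consecutive blocks on the target; 0 if one block."""
    gaps = [
        t_starts[i + 1] - (t_starts[i] + block_sizes[i])
        for i in range(len(t_starts) - 1)
    ]
    return max(gaps, default=0)
```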



For the sake of clean displays, hits with gaps greater than 3,000 bp
are removed, which leaves 719 entries.  The percent identity of the
remaining matches is:

Percent
Identity	Count
---------- ----------
92		1
94		1
95		3
96		64
97		21
98		48
99		344
100		237
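
A possible way to apply the 3,000 gap cutoff and tally percent identity.
The identity formula (matching bases over all aligned bases, from the
PSL matches, misMatches and repMatches columns), the truncation to a
whole percent, and the dictionary keys are all assumptions made for
illustration.

```python
from collections import Counter

MAX_GAP = 3000  # hits with a larger gap are dropped


def percent_identity(matches: int, mismatches: int, rep_matches: int = 0) -> float:
    """Percent of aligned bases that match."""
    aligned = matches + mismatches + rep_matches
    if aligned == 0:
        return 0.0
    return 100.0 * (matches + rep_matches) / aligned


def identity_histogram(hits) -> Counter:
    """hits: dicts with hypothetical keys 'max_gap', 'matches', 'mismatches'."""
    kept = [h for h in hits if h["max_gap"] <= MAX_GAP]
    return Counter(
        int(percent_identity(h["matches"], h["mismatches"])) for h in kept
    )
```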


Some CDSs hit more than once.  The distribution of hits per CDS is:

# of
hits per
CDS	# of CDSs
---	---------
01		374
02		108
03		20
04		1
05		1
70		1
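
The hits-per-CDS distribution is a count of counts.  A short sketch,
assuming the input is the list of query (CDS) names, one per surviving
hit:

```python
from collections import Counter


def hits_per_cds_distribution(q_names) -> Counter:
    """Map 'number of hits' -> 'number of CDSs with that many hits'."""
    per_cds = Counter(q_names)        # CDS name -> number of hits
    return Counter(per_cds.values())  # number of hits -> number of CDSs


def unique_cds_count(q_names) -> int:
    """Number of distinct CDSs represented among the hits."""
    return len(set(q_names))
```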



The hits represent 505 unique CDSs and 413 sequenced clones.  Hits having
at least one gap of length 50 or more are considered multi-exon CDS
hits.  287 are multi-exon and 220 are single-exon hits.
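
The multi-exon rule can be sketched directly from the PSL block columns.
This assumes the gaps in question are measured on the BAC/PAC (target)
side, between consecutive blocks, and that blockSizes and tStarts arrive
as the comma-terminated strings found in a PSL file.

```python
def parse_psl_list(field: str):
    """Parse a PSL comma-separated list such as '100,200,50,'."""
    return [int(x) for x in field.split(",") if x]


def target_gaps(block_sizes, t_starts):
    """Gap lengths (bp) between consecutive alignment blocks on the target."""
    return [
        t_starts[i + 1] - (t_starts[i] + block_sizes[i])
        for i in range(len(t_starts) - 1)
    ]


def is_multi_exon(block_sizes_field: str, t_starts_field: str,
                  min_gap: int = 50) -> bool:
    """True if at least one target gap is min_gap bp or longer."""
    sizes = parse_psl_list(block_sizes_field)
    starts = parse_psl_list(t_starts_field)
    return any(g >= min_gap for g in target_gaps(sizes, starts))
```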

