#### README ####

-----------------------
GFF FLATFILE DUMPS
-----------------------
Gene annotation is provided in GFF3 format. The detailed specification of
the format is maintained by the Sequence Ontology:
https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md

GFF3 files are validated using GenomeTools: http://genometools.org

For chromosomal assemblies, in addition to a file containing all genes,
there are per-chromosome files. If a predicted gene set is available
(generated by Genscan and other ab initio tools), these genes are in a
separate 'abinitio' file.

The 'type' of gene features is:
 * "gene" for protein-coding genes
 * "ncRNA_gene" for RNA genes
 * "pseudogene" for pseudogenes
The 'type' of transcript features is:
 * "mRNA" for protein-coding transcripts
 * a specific type of RNA transcript, such as "snoRNA" or "lnc_RNA"
 * "pseudogenic_transcript" for pseudogenes
All transcripts are linked to "exon" features.
Protein-coding transcripts are linked to "CDS", "five_prime_UTR", and
"three_prime_UTR" features.

Attributes for feature types:
(square brackets indicate data which is not available for all features)
 * region types:
    * ID: Unique identifier, format "<region_type>:<region_name>"
    * [Alias]: A comma-separated list of aliases, usually including the
      INSDC accession
    * [Is_circular]: Flag to indicate circular regions
 * gene types:
    * ID: Unique identifier, format "gene:<gene_stable_id>"
    * biotype: Ensembl biotype, e.g. "protein_coding", "pseudogene"
    * gene_id: Ensembl gene stable ID
    * version: Ensembl gene version
    * [Name]: Gene name
    * [description]: Gene description
 * transcript types:
    * ID: Unique identifier, format "transcript:<transcript_stable_id>"
    * Parent: Gene identifier, format "gene:<gene_stable_id>"
    * biotype: Ensembl biotype, e.g. "protein_coding", "pseudogene"
    * transcript_id: Ensembl transcript stable ID
    * version: Ensembl transcript version
    * [Note]: If the transcript sequence has been edited (i.e. differs
      from the genomic sequence), the edits are described in a note.
 * exon
    * Parent: Transcript identifier, format "transcript:<transcript_stable_id>"
    * exon_id: Ensembl exon stable ID
    * version: Ensembl exon version
    * constitutive: Flag to indicate if exon is present in all transcripts
    * rank: Integer that shows the 5'->3' ordering of exons
 * CDS
    * ID: Unique identifier, format "CDS:<protein_stable_id>"
    * Parent: Transcript identifier, format "transcript:<transcript_stable_id>"
    * protein_id: Ensembl protein stable ID
    * version: Ensembl protein version

Metadata:
 * genome-build - Build identifier of the assembly, e.g. GRCh37.p11
 * genome-version - Version of this assembly, e.g. GRCh37
 * genome-date - The date of the release of this assembly, e.g. 2009-02
 * genome-build-accession - Genome accession, e.g. GCA_000001405.14
 * genebuild-last-updated - Date of the last genebuild update, e.g. 2013-09

-----------
FILE NAMES
------------
The files are consistently named following this pattern:
   <species>.<assembly>.<version>.gff3.gz

<species>:   The systematic name of the species.
<assembly>:  The assembly build name.
<version>:   The version of Ensembl from which the data was exported.
gff3:        All files in these directories are in GFF3 format.
gz:          All files are compressed with GNU Zip for storage efficiency.

e.g.
Homo_sapiens.GRCh38.81.gff3.gz

For the predicted gene set, an additional abinitio flag is added to the
file name:
   <species>.<assembly>.<version>.abinitio.gff3.gz

e.g.
Homo_sapiens.GRCh38.81.abinitio.gff3.gz

------------------
Example GFF3 output
------------------
##gff-version 3
#!genome-build Pmarinus_7.0
#!genome-version Pmarinus_7.0
#!genome-date 2011-01
#!genebuild-last-updated 2013-04

GL476399	Pmarinus_7.0	supercontig	1	4695893	.	.	.	ID=supercontig:GL476399;Alias=scaffold_71
GL476399	ensembl	gene	2596494	2601138	.	+	.	ID=gene:ENSPMAG00000009070;Name=TRYPA3;biotype=protein_coding;description=Trypsinogen A1%3B Trypsinogen a3%3B Uncharacterized protein [Source:UniProtKB/TrEMBL%3BAcc:O42608];logic_name=ensembl;version=1
GL476399	ensembl	transcript	2596494	2601138	.	+	.	ID=transcript:ENSPMAT00000010026;Name=TRYPA3-201;Parent=gene:ENSPMAG00000009070;biotype=protein_coding;version=1
GL476399	ensembl	exon	2596494	2596538	.	+	.	Name=ENSPMAE00000087923;Parent=transcript:ENSPMAT00000010026;constitutive=1;ensembl_end_phase=1;ensembl_phase=-1;rank=1;version=1
GL476399	ensembl	exon	2598202	2598361	.	+	.	Name=ENSPMAE00000087929;Parent=transcript:ENSPMAT00000010026;constitutive=1;ensembl_end_phase=2;ensembl_phase=1;rank=2;version=1
GL476399	ensembl	exon	2599023	2599282	.	+	.	Name=ENSPMAE00000087937;Parent=transcript:ENSPMAT00000010026;constitutive=1;ensembl_end_phase=1;ensembl_phase=2;rank=3;version=1
GL476399	ensembl	exon	2599814	2599947	.	+	.	Name=ENSPMAE00000087952;Parent=transcript:ENSPMAT00000010026;constitutive=1;ensembl_end_phase=0;ensembl_phase=1;rank=4;version=1
GL476399	ensembl	exon	2600895	2601138	.	+	.	Name=ENSPMAE00000087966;Parent=transcript:ENSPMAT00000010026;constitutive=1;ensembl_end_phase=-1;ensembl_phase=0;rank=5;version=1
GL476399	ensembl	CDS	2596499	2596538	.	+	0	ID=CDS:ENSPMAP00000009982;Parent=transcript:ENSPMAT00000010026
GL476399	ensembl	CDS	2598202	2598361	.	+	2	ID=CDS:ENSPMAP00000009982;Parent=transcript:ENSPMAT00000010026
GL476399	ensembl	CDS	2599023	2599282	.	+	1	ID=CDS:ENSPMAP00000009982;Parent=transcript:ENSPMAT00000010026
GL476399	ensembl	CDS	2599814	2599947	.	+	2	ID=CDS:ENSPMAP00000009982;Parent=transcript:ENSPMAT00000010026
GL476399	ensembl	CDS	2600895	2601044	.	+	0	ID=CDS:ENSPMAP00000009982;Parent=transcript:ENSPMAT00000010026
GL476399	ensembl	five_prime_UTR	2596494	2596498	.	+	.	Parent=transcript:ENSPMAT00000010026
GL476399	ensembl	three_prime_UTR	2601045	2601138	.	+	.	Parent=transcript:ENSPMAT00000010026
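
Feature records like those in the example output can be read with a few
lines of Python. The following is a minimal sketch, not an Ensembl tool: the
function name parse_gff3_line and the GFF3_COLUMNS tuple are illustrative
names chosen here, and the code assumes the nine tab-separated columns and
percent-escaped attribute values described in the GFF3 specification.

```python
from urllib.parse import unquote

# Column names follow the GFF3 specification. The names used in this
# sketch are illustrative and are not part of any Ensembl software.
GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")


def parse_gff3_line(line):
    """Parse one tab-separated GFF3 feature line into a dict.

    Returns None for blank lines, directives ("##...") and comments ("#...").
    """
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, found {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    attributes = {}
    if record["attributes"] != ".":
        for pair in record["attributes"].split(";"):
            if not pair:
                continue
            key, _, value = pair.partition("=")
            # Percent-escapes such as %3B (";") are decoded only after
            # splitting, so escaped separators stay inside the value.
            attributes[unquote(key)] = unquote(value)
    record["attributes"] = attributes
    return record


example = ("GL476399\tensembl\tCDS\t2596499\t2596538\t.\t+\t0\t"
           "ID=CDS:ENSPMAP00000009982;Parent=transcript:ENSPMAT00000010026")
cds = parse_gff3_line(example)
print(cds["type"], cds["start"], cds["end"], cds["attributes"]["Parent"])
# prints: CDS 2596499 2596538 transcript:ENSPMAT00000010026
```

Multi-valued attributes (for example a comma-separated Alias list) are left
as a single string in this sketch. Splitting them on "," before decoding
would be a natural extension.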