Difference between revisions of "GFF"

Revision as of 20:10, 22 December 2008

This page is under construction.


GFF is a standard file format for storing genomic features in a text file. GFF stands for Generic Feature Format. GFF files are plain text, 9 column, tab-delimited files. They are frequently used in GMOD as a data exchange format.

Versions

GFF has several versions, the most recent of which is GFF3. GFF3 addresses several shortcomings in its predecessor, GFF2. GFF3 is the preferred format in GMOD, but data is not always available in GFF3 format, so you may have to use GFF2. The two versions are similar but not compatible, and scripts usually work with only one format or the other.

Unfortunately, people, documentation, and even this web site are not always clear about what version of GFF is being discussed. This web page will always specify which version it is referring to.

Finally, GTF is another file format that is very similar to GFF and is sometimes referred to as GFF2.5.

GFF3

The formal specification of GFF3 is on the Sequence Ontology web site. It completely describes the format, including column definitions, metadata and directives. It also contains lengthy sections explaining how to represent different situations in GFF3, including:

  • canonical genes
  • non-coding transcripts
  • parent (part-of) relationships
  • alignments
  • ontology association and database cross references
  • single exon genes
  • polycistronic transcripts
  • genes containing inteins
  • trans-spliced transcripts
  • programmed frameshifts
  • operons

This section covers just the basics. If you want the full and definitive explanation of GFF3 then see the standard.

GFF3 Annotation Section

GFF3 format is a flat tab-delimited file. The first line of the file is a comment that identifies the file format and version. This is followed by a series of data lines, each of which corresponds to an annotation. Here is a miniature GFF3 file:

##gff-version 3
ctg123  .  exon  1300  1500  .  +  .  ID=exon00001
ctg123  .  exon  1050  1500  .  +  .  ID=exon00002
ctg123  .  exon  3000  3902  .  +  .  ID=exon00003
ctg123  .  exon  5000  5500  .  +  .  ID=exon00004
ctg123  .  exon  7000  9000  .  +  .  ID=exon00005

The ##gff-version 3 line is required and must be the first line of the file. It introduces the annotation section of the file.
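As a rough illustration of how a data line breaks into its nine columns, here is a minimal Python sketch. The names `Gff3Record` and `parse_gff3_line` are made up for this example and are not part of any GMOD tool. The sketch assumes real tab characters between columns and leaves column 9 unparsed.

```python
from typing import NamedTuple, Optional


class Gff3Record(NamedTuple):
    """One data line of the annotation section (hypothetical helper type)."""
    seqid: str
    source: str
    type: str
    start: int
    end: int
    score: Optional[float]
    strand: str
    phase: Optional[int]
    attributes: str  # column 9, left as raw text here


def parse_gff3_line(line: str) -> Optional[Gff3Record]:
    """Return a record for a data line, or None for blank, comment, and ## lines."""
    line = line.rstrip("\r\n")
    if not line or line.startswith("#"):
        return None
    cols = line.split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols
    return Gff3Record(
        seqid, source, ftype, int(start), int(end),
        None if score == "." else float(score),  # "." means no score
        strand,
        None if phase == "." else int(phase),    # "." means no phase
        attrs,
    )


record = parse_gff3_line("ctg123\t.\texon\t1300\t1500\t.\t+\t.\tID=exon00001")
```

Here `record.start` is the integer 1300, and the "." placeholders in the score and phase columns come back as `None`.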

The 9 columns of the annotation section are as follows:

Column 1: "seqid"

The ID of the landmark used to establish the coordinate system for the current feature. IDs may contain any characters, but must escape any characters not in the set [a-zA-Z0-9.:^*$@!+_?-|]. In particular, IDs may not contain unescaped whitespace and must not begin with an unescaped ">".
To escape a character in this, or any of the other GFF3 fields, replace it with the percent sign followed by its hexadecimal representation. For example, ">" becomes "%3E". See URL Encoding (or: 'What are those "%20" codes in URLs?') for details.
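The escaping rule can be sketched in Python as follows. The function names are invented for this example. The sketch assumes that every symbol in the set quoted above is meant as a literal character, and that non-ASCII characters are escaped byte by byte from their UTF-8 encoding.

```python
import re
from urllib.parse import unquote

# Anything outside the allowed seqid characters gets escaped.
# Assumption: "?", "-" and "|" in the quoted set are literal characters.
_UNSAFE_IN_SEQID = re.compile(r"[^a-zA-Z0-9.:^*$@!+_?|-]")


def escape_seqid(raw: str) -> str:
    """Percent-encode every character that is not allowed unescaped in column 1."""
    def encode(match: re.Match) -> str:
        return "".join(f"%{byte:02X}" for byte in match.group().encode("utf-8"))
    return _UNSAFE_IN_SEQID.sub(encode, raw)


def unescape_field(escaped: str) -> str:
    """Reverse the percent-encoding of any GFF3 field."""
    return unquote(escaped)


escaped = escape_seqid(">chr 1")
```

With this sketch, `">chr 1"` is written as `%3Echr%201`, and `unescape_field` restores the original text.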

Column 2: "source"

The source is a free text qualifier intended to describe the algorithm or operating procedure that generated this feature. Typically this is the name of a piece of software, such as "Genescan", or a database name, such as "Genbank". In effect, the source is used to extend the feature ontology by adding a qualifier to the type, creating a new composite type that is a subclass of the type in the type column. It is not necessary to specify a source. If there is no source, put a "." (a period) in this field.

Column 3: "type"

The type of the feature (previously called the "method"). This is constrained to be either: (a) a term from the "lite" sequence ontology, SOFA; or (b) a SOFA accession number. The latter alternative is distinguished using the syntax SO:000000. This field is required.

Columns 4 & 5: "start" and "end"

The start and end of the feature, in 1-based integer coordinates, relative to the landmark given in column 1. Start is always less than or equal to end.
For zero-length features, such as insertion sites, start equals end and the implied site is to the right of the indicated base in the direction of the landmark. These fields are required.
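Because the coordinates are 1-based and inclusive, off-by-one errors are easy to make when moving to a language with 0-based indexing. The following Python sketch (helper names invented for this example) shows the arithmetic. It does not handle zero-length features, which cannot be recognized from the coordinates alone.

```python
def feature_length(start: int, end: int) -> int:
    """Number of bases covered by a feature with 1-based, inclusive coordinates."""
    if start > end:
        raise ValueError("start must be less than or equal to end")
    return end - start + 1


def to_python_slice(start: int, end: int) -> slice:
    """Convert 1-based inclusive coordinates to a 0-based, half-open Python slice."""
    return slice(start - 1, end)


landmark = "ACGTACGT"
bases = landmark[to_python_slice(3, 5)]  # bases 3 through 5 of the landmark
```

For the first exon in the example file, `feature_length(1300, 1500)` is 201, not 200.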

Column 6: "score"

The score of the feature, a floating point number. As in earlier versions of the format, the semantics of the score are ill-defined. It is strongly recommended that E-values be used for sequence similarity features, and that P-values be used for ab initio gene prediction features. If there is no score, put a "." (a period) in this field.

Column 7: "strand"

The strand of the feature. + for positive strand (relative to the landmark), - for minus strand, and . for features that are not stranded. In addition, ? can be used for features whose strandedness is relevant, but unknown.

Column 8: "phase"

For features of type "CDS", the phase indicates where the feature begins with reference to the reading frame. The phase is one of the integers 0, 1, or 2, indicating the number of bases that should be removed from the beginning of this feature to reach the first base of the next codon. In other words, a phase of "0" indicates that the next codon begins at the first base of the region described by the current line, a phase of "1" indicates that the next codon begins at the second base of this region, and a phase of "2" indicates that the codon begins at the third base of this region. This is NOT to be confused with the frame, which is simply start modulo 3. If there is no phase, put a "." (a period) in this field.
For forward strand features, phase is counted from the start field. For reverse strand features, phase is counted from the end field.
The phase is required for all CDS features.
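One way to see how phase works is to compute it for the CDS segments of a single transcript. The Python sketch below uses an invented function name and rests on two assumptions: the segments are given in 5' to 3' order (for reverse strand features, that means ordered by descending end), and the coding sequence is complete, so the first segment has phase 0.

```python
from typing import List, Tuple


def cds_phases(segments: List[Tuple[int, int]]) -> List[int]:
    """Phase of each CDS segment, given (start, end) pairs in 5'-to-3' order."""
    phases = []
    phase = 0  # assumption: the first segment begins on a codon boundary
    for start, end in segments:
        phases.append(phase)
        length = end - start + 1
        # Bases left over after the last complete codon in this segment...
        leftover = (length - phase) % 3
        # ...determine how many bases the next segment must skip.
        phase = (3 - leftover) % 3
    return phases


phases = cds_phases([(1, 4), (10, 14), (20, 22)])
```

In this made-up example the first segment is 4 bases long, so one base spills into an incomplete codon and the second segment must skip 2 bases to reach the next codon start; the phases come out as 0, 2, 0.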

Column 9: "attributes"

A list of feature attributes in the format tag=value. Multiple tag=value pairs are separated by semicolons. URL escaping rules are used for tags or values containing the following characters: ",=;". Spaces are allowed in this field, but tabs must be replaced with the %09 URL escape. This field is not required.

Column 9 Tags

Column 9 tags have predefined meanings:

ID
Indicates the unique identifier of the feature. IDs must be unique within the scope of the GFF file.
Name
Display name for the feature. This is the name to be displayed to the user. Unlike IDs, there is no requirement that the Name be unique within the file.
Alias
A secondary name for the feature. It is suggested that this tag be used whenever a secondary identifier for the feature is needed, such as locus names and accession numbers. Unlike ID, there is no requirement that Alias be unique within the file.
Parent
Indicates the parent of the feature. A parent ID can be used to group exons into transcripts, transcripts into genes, and so forth. A feature may have multiple parents. Parent can *only* be used to indicate a part-of relationship.
Target
Indicates the target of a nucleotide-to-nucleotide or protein-to-nucleotide alignment. The format of the value is "target_id start end [strand]", where strand is optional and may be "+" or "-". If the target_id contains spaces, they must be escaped as hex escape %20.
Gap
The alignment of the feature to the target if the two are not collinear (e.g. contain gaps). The alignment format is taken from the CIGAR format described in the Exonerate documentation (http://cvsweb.sanger.ac.uk/cgi-bin/cvsweb.cgi/exonerate?cvsroot=Ensembl). See the GFF3 specification for more information.
Derives_from
Used to disambiguate the relationship between one feature and another when the relationship is a temporal one rather than a purely structural "part of" one. This is needed for polycistronic genes. See the GFF3 specification for more information.
Note
A free text note.
Dbxref
A database cross reference. See the GFF3 specification for more information.
Ontology_term
A cross reference to an ontology term. See the GFF3 specification for more information.

Multiple attributes of the same type are indicated by separating the values with the comma "," character, as in:

Parent=AF2312,AB2812,abc-3

Note that attribute names are case sensitive. "Parent" is not the same as "parent".

All attributes that begin with an uppercase letter are reserved for later use. Attributes that begin with a lowercase letter can be used freely by applications. You can stash any semi-structured data into the database by using one or more unreserved (lowercase) tags.
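The column 9 rules above can be pulled together in a short Python sketch. The function names are invented for this example. The sketch treats every attribute as a list of values, and it tolerates stray whitespace and empty entries around semicolons, which is more lenient than the format requires.

```python
from typing import Dict, List
from urllib.parse import unquote


def parse_attributes(column9: str) -> Dict[str, List[str]]:
    """Split column 9 into {tag: [values]}, undoing URL escapes."""
    attributes: Dict[str, List[str]] = {}
    if column9 in ("", "."):
        return attributes
    for pair in column9.split(";"):
        pair = pair.strip()
        if not pair:
            continue  # tolerate a trailing semicolon
        tag, separator, value = pair.partition("=")
        if not separator:
            raise ValueError(f"attribute without '=': {pair!r}")
        # Split on commas before unescaping, so an escaped comma (%2C)
        # inside a value is not mistaken for a value separator.
        values = [unquote(v) for v in value.split(",")]
        attributes.setdefault(unquote(tag), []).extend(values)
    return attributes


def is_reserved(tag: str) -> bool:
    """Tags that start with an uppercase letter are reserved."""
    return tag[:1].isupper()


attrs = parse_attributes("ID=exon00001;Parent=AF2312,AB2812,abc-3;note=first%2C second")
```

Because tag names are case sensitive, `attrs` here has a reserved "Parent" key with three values and an application-specific "note" key whose single value contains a literal comma.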

GFF3 Sequence Section

GFF3 files can also include sequence in FASTA format at the end of the file. The FASTA sequences are preceded by a ##FASTA line. This sequence section is optional. If present, the sequence section can define sequence for any landmark used in column 1 (the frame of reference). For example:

##gff-version 3
ctg123 . exon            1300  1500  .  +  .  ID=exon00001
ctg123 . exon            1050  1500  .  +  .  ID=exon00002
ctg123 . exon            3000  3902  .  +  .  ID=exon00003
ctg123 . exon            5000  5500  .  +  .  ID=exon00004
ctg123 . exon            7000  9000  .  +  .  ID=exon00005
##FASTA
>ctg123
cttctgggcgtacccgattctcggagaacttgccgcaccattccgccttg
tgttcattgctgcctgcatgttcattgtctacctcggctacgtgtggcta
tctttcctcggtgccctcgtgcacggagtcgagaaaccaaagaacaaaaa
aagaaattaaaatatttattttgctgtggtttttgatgtgtgttttttat
aatgatttttgatgtgaccaattgtacttttcctttaaatgaaatgtaat
cttaaatgtatttccgacgaattcgaggcctgaaaagtgtgacgccattc
...

When the GFF3 file is processed the IDs on the header line of FASTA entries are matched with IDs used in column 1 in the annotation section of the file.

You don't have to store the FASTA in the GFF file. You can also store your sequences in a separate file containing only FASTA entries.
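The following Python sketch shows one way to separate the two sections and check that every landmark in column 1 has a sequence. The function name `split_gff3` is invented for this example. The sketch assumes the whole file fits in memory and takes the first whitespace-delimited word of each FASTA header line as the sequence ID.

```python
from typing import Dict, List, Tuple


def split_gff3(text: str) -> Tuple[List[str], Dict[str, str]]:
    """Return (annotation lines, {sequence ID: sequence}) for a GFF3 file."""
    annotation_lines: List[str] = []
    chunks: Dict[str, List[str]] = {}
    current_id = None
    in_fasta = False
    for line in text.splitlines():
        if not in_fasta:
            if line.strip() == "##FASTA":
                in_fasta = True
            else:
                annotation_lines.append(line)
        elif line.startswith(">"):
            words = line[1:].split()
            current_id = words[0] if words else ""
            chunks[current_id] = []
        elif current_id is not None and line.strip():
            chunks[current_id].append(line.strip())
    sequences = {seq_id: "".join(parts) for seq_id, parts in chunks.items()}
    return annotation_lines, sequences


example = (
    "##gff-version 3\n"
    "ctg123\t.\texon\t1300\t1500\t.\t+\t.\tID=exon00001\n"
    "##FASTA\n"
    ">ctg123\n"
    "cttctgggcg\n"
    "tacccgattc\n"
)
annotations, sequences = split_gff3(example)
# Landmarks used in column 1 that have no FASTA entry in this file.
landmarks = {l.split("\t")[0] for l in annotations if l and not l.startswith("#")}
missing = landmarks - set(sequences)
```

An empty `missing` set means every landmark used in the annotation section has a matching FASTA entry. A non-empty set is not necessarily an error, since sequences may live in a separate file.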

GFF2

GFF2 is a supported format in GMOD, but it is now deprecated, and if you have a choice you should use GFF3. Unfortunately, data is sometimes only available in GFF2 format. GFF2 has a number of shortcomings compared to GFF3. GFF2 can only represent two-level feature hierarchies, while GFF3 can support hierarchies of arbitrary depth. GFF2 also does not require that column 3, the feature type, be a Sequence Ontology term. It can be any string. This often led to quality control and data exchange problems.

GFF2 Syntax

The GFF2 format is very similar to GFF3. Therefore this section only covers the parts that are different. GFF2 has the same 9 columns as the GFF3 Annotation Section. Significant differences are:

  • Column 3: Feature Type
    The feature type does not have to be a valid Sequence Ontology term in GFF2.
  • Column 9: Attributes
    The attributes column is still a list of key value pairs. In GFF2, attribute keys and values are separated by spaces instead of "=". Values containing spaces must be double quoted. Multiple attributes are separated by semicolons.

See the GFF2 Specification for full details on GFF2.

Converting GFF2 to GFF3

Converting a file from GFF2 to GFF3 format is problematic for reasons given below. There are several GFF2 to GFF3 converters available on the web, but they make specific assumptions about the GFF2 data that are not likely to apply to your data. GMOD does not endorse (or disparage) any particular converter. If you have GFF2 data from an external source, and they don't also provide it in GFF3 format, then you may be stuck with GFF2.

Some areas that need to be addressed by any GFF2 to GFF3 converter:

Column 3: Feature Type

If the GFF2 file does not use Sequence Ontology terms in column 3 then some sort of translation will need to be done on the types in the GFF2 to convert them to be SO terms.
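A translation of this kind is usually a lookup table that has to be built by hand for each data source. The Python sketch below is only an illustration: the GFF2 type names on the left of `GFF2_TO_SO` are invented examples, and `SO_TERMS` is a tiny hand-picked subset of Sequence Ontology term names, not the full ontology.

```python
# Hypothetical mapping from source-specific GFF2 types to SO term names.
GFF2_TO_SO = {
    "5'UTR": "five_prime_UTR",
    "3'UTR": "three_prime_UTR",
    "similarity": "match",
}

# A small subset of SO term names, enough for this example only.
SO_TERMS = {
    "gene", "mRNA", "exon", "CDS", "match",
    "five_prime_UTR", "three_prime_UTR",
}


def translate_type(gff2_type: str) -> str:
    """Map a GFF2 column 3 value to an SO term, failing loudly if unmapped."""
    so_term = GFF2_TO_SO.get(gff2_type, gff2_type)
    if so_term not in SO_TERMS:
        raise KeyError(f"no SO mapping for GFF2 type {gff2_type!r}")
    return so_term


translated = [translate_type(t) for t in ["exon", "5'UTR", "similarity"]]
```

Raising an error for unmapped types, rather than passing them through, makes it obvious which types in your data still need a decision.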

Column 9: Attributes

Column 9 has a slightly different format and is much more tightly defined in GFF3 than GFF2. Both require attention. GFF2 does not have any reserved attribute names, uses C style encoding/escaping of special characters, and has many other small differences.
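To give a feel for the syntactic part of this work, here is a minimal Python sketch that rewrites GFF2-style attributes as GFF3 tag=value pairs. The function name is invented for this example, and the sketch makes several simplifying assumptions: each tag has exactly one value, tags consist of word characters only, and the only C-style escape handled is a backslash-escaped double quote. Real GFF2 data will often violate these.

```python
import re
from urllib.parse import quote

# A tag, whitespace, then either a double-quoted string or a bare token.
_GFF2_ATTRIBUTE = re.compile(r'(\w+)\s+("(?:\\.|[^"\\])*"|[^\s;]+)')


def gff2_to_gff3_attributes(column9: str) -> str:
    """Rewrite 'Tag "value" ; Tag2 value2' as 'Tag=value;Tag2=value2'."""
    pairs = []
    for tag, value in _GFF2_ATTRIBUTE.findall(column9):
        if value.startswith('"'):
            # Strip the surrounding quotes and undo escaped double quotes.
            value = value[1:-1].replace('\\"', '"')
        # Percent-encode everything except unreserved characters and spaces,
        # which covers the ",", "=" and ";" that GFF3 requires to be escaped.
        pairs.append(f"{tag}={quote(value, safe=' ')}")
    return ";".join(pairs)


converted = gff2_to_gff3_attributes('Transcript "B0273.1" ; Note "Zn-Finger; C2H2"')
```

The result here is `Transcript=B0273.1;Note=Zn-Finger%3B C2H2`. Note that the sketch does not rename tags, so a GFF2 tag such as "Transcript" lands in the uppercase namespace that GFF3 reserves. Deciding which tags become ID, Name, or Parent is the part that depends on your data.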

Nesting

Another big problem is that GFF2 supports only one level of feature nesting. While you can certainly reproduce this minimal nesting in GFF3, it would be better to also convert your feature representations to be multi-level at the time you migrate the data to GFF3. This is non-trivial.

GTF

GTF is another file format that is very similar to GFF2 and is sometimes referred to as GFF2.5. GTF is not a supported format in GMOD, so if you have a GTF file you'll need to convert it to GFF3. The gtf2gff3 script does this conversion, with some caveats. See also this BioPerl-l posting.

GFF in GMOD

See Also