Difference between revisions of "GFF Tutorial"

From GMOD
m (tagging)
m (Redirecting to most recent)
 
Line 1: Line 1:
This tutorial on [[GFF]] was given by [[User:Scott|Scott Cain]] at the [[2012 GMOD Summer School]].
+
#REDIRECT [[GFF Tutorial 2012]]
 
+
'''GFF''' is a standard file format for storing genomic features in a text file.  GFF stands for ''Generic Feature Format''.  GFF files are plain-text, nine-column, tab-delimited files.  GFF ''[[Databases and GMOD|databases]]'' also exist; they use a {{GlossaryLink|Schema|schema}} custom built to represent GFF data.  GFF is [[#GFF in GMOD|frequently used in GMOD]] for data exchange and representation of genomic data.
+
 
+
== Versions ==
+
 
+
GFF has several versions, the most recent of which is [[#GFF3|GFF3]]. GFF3 addresses several shortcomings in its predecessor, [[#GFF2|GFF2]].  '''GFF3 is the preferred format in GMOD''', but data is not always available in GFF3 format, so you may have to use GFF2.  The two versions are similar but not compatible, and scripts usually work with only one format or the other.  This page discusses GFF3 in detail.  GFF2 details are covered on a [[GFF2|separate page]].
+
 
+
Unfortunately, people, documentation, and even this web site are not always clear about what version of GFF is being discussed.  This web page will always specify which version it is referring to.
+
 
+
Finally, [[#GTF|GTF]] is another file format that is very similar to GFF and is sometimes referred to as GFF2.5.
+
 
+
== GFF3 ==
+
 
+
The [http://www.sequenceontology.org/gff3.shtml formal specification of GFF3] is on the [http://www.sequenceontology.org/ Sequence Ontology] web site.  It completely describes the format, including column definitions, metadata and directives.  It also contains lengthy sections explaining how to represent different situations in GFF3, including:
+
* canonical genes
+
* non-coding transcripts
+
* parent (part-of) relationships
+
* alignments
+
* ontology association and database cross references
+
* single exon genes
+
* polycistronic transcripts
+
* genes containing inteins
+
* trans-spliced transcripts
+
* programmed frameshifts
+
* operons
+
 
+
Some of these cases are covered on this page as well.  If you want the full and definitive explanation of GFF3, see [http://www.sequenceontology.org/gff3.shtml the standard].
+
 
+
== GFF3 Annotation Section ==
+
 
+
This section first describes the format of the annotation section, and then explains how to represent several different types of data.
+
 
+
=== GFF3 Format ===
+
{{GFF3Columns}}
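
As a rough illustration of the nine-column layout, the following Python sketch splits one feature line into named fields. The column names (seqid, source, type, start, end, score, strand, phase, attributes) follow the GFF3 specification; the <tt>Gff3Record</tt> class and <tt>parse_gff3_line</tt> function are hypothetical helpers written for this page, not part of any GMOD tool, and multi-valued (comma-separated) attributes are left as plain strings.

```python
from typing import NamedTuple
from urllib.parse import unquote


class Gff3Record(NamedTuple):
    """One GFF3 feature line (hypothetical helper type for this sketch)."""
    seqid: str
    source: str
    type: str
    start: int
    end: int
    score: str
    strand: str
    phase: str
    attributes: dict


def parse_gff3_line(line):
    """Split one tab-delimited GFF3 feature line into its nine columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, found {len(cols)}")
    attributes = {}
    for pair in cols[8].split(";"):
        if pair:
            tag, _, value = pair.partition("=")
            # GFF3 uses URL-style percent escapes for reserved characters.
            attributes[tag] = unquote(value)
    return Gff3Record(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                      cols[5], cols[6], cols[7], attributes)


line = "ctg123\t.\tmRNA\t1300\t9000\t.\t+\t.\tID=mrna0001;Name=sonichedgehog"
record = parse_gff3_line(line)
print(record.type, record.start, record.end, record.attributes["Name"])
```

Score and phase are kept as strings here because either column may hold a "." placeholder.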
+
 
+
=== Nesting Features ===
+
 
+
Many genomic features are discontinuous and have multiple subparts. GFF3 represents such features by linking the parts together with the Parent tag. For example, to represent an mRNA transcript that has five exons, we could write this:
+
 
+
##gff-version 3
+
ctg123 . mRNA            1300  9000  .  +  .  ID=mrna0001;Name=sonichedgehog
+
ctg123 . exon            1300  1500  .  +  .  ID=exon00001;Parent=mrna0001
+
ctg123 . exon            1050  1500  .  +  .  ID=exon00002;Parent=mrna0001
+
ctg123 . exon            3000  3902  .  +  .  ID=exon00003;Parent=mrna0001
+
ctg123 . exon            5000  5500  .  +  .  ID=exon00004;Parent=mrna0001
+
ctg123 . exon            7000  9000  .  +  .  ID=exon00005;Parent=mrna0001
+
 
+
The first feature is an mRNA that extends from position 1300 to 9000 in genomic coordinates. It has an ID of "mrna0001" and a human-readable name of "sonichedgehog" (note that the ID and the Name are '''not''' the same thing). This is followed by five exon features, each of which is linked to the mRNA using a Parent tag. When [[GBrowse]] displays this transcript, it will display each of the exons linked together by a solid line. The entire set can be found by searching for the name "sonichedgehog."
+
 
+
The ID is really only important for linking features together. If a feature does not have any subparts, then it does not formally need an ID. Thus, we could simplify this by removing all the exon IDs:
+
 
+
##gff-version 3
+
ctg123 . mRNA            1300  9000  .  +  .  ID=mrna0001;Name=sonichedgehog
+
ctg123 . exon            1300  1500  .  +  .  Parent=mrna0001
+
ctg123 . exon            1050  1500  .  +  .  Parent=mrna0001
+
ctg123 . exon            3000  3902  .  +  .  Parent=mrna0001
+
ctg123 . exon            5000  5500  .  +  .  Parent=mrna0001
+
ctg123 . exon            7000  9000  .  +  .  Parent=mrna0001
+
 
+
Multiple levels of nesting are allowed. If this transcript is part of an operon, then we can add another level of nesting:
+
 
+
##gff-version 3
+
ctg123 . operon          1300 15000  .  +  .  ID=operon001;Name=superOperon
+
ctg123 . mRNA            1300  9000  .  +  .  ID=mrna0001;Parent=operon001;Name=sonichedgehog
+
ctg123 . exon            1300  1500  .  +  .  Parent=mrna0001
+
ctg123 . exon            1050  1500  .  +  .  Parent=mrna0001
+
ctg123 . exon            3000  3902  .  +  .  Parent=mrna0001
+
ctg123 . exon            5000  5500  .  +  .  Parent=mrna0001
+
ctg123 . exon            7000  9000  .  +  .  Parent=mrna0001
+
ctg123 . mRNA          10000 15000  .  +  .  ID=mrna0002;Parent=operon001;Name=subsonicsquirrel
+
ctg123 . exon          10000 12000  .  +  .  Parent=mrna0002
+
ctg123 . exon          14000 15000  .  +  .  Parent=mrna0002
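
To make the Parent linkage concrete, here is a minimal Python sketch that reads feature lines and builds a map from each parent ID to its children. It assumes real tab-delimited input (the examples on this page are shown with spaces), and the function names <tt>parse_attributes</tt> and <tt>children_by_parent</tt> are made up for illustration. Children without an ID are labelled by type and coordinates.

```python
from collections import defaultdict


def parse_attributes(column9):
    """Turn 'ID=x;Parent=y' into a dict of tag -> value."""
    attrs = {}
    for pair in column9.split(";"):
        if pair:
            tag, _, value = pair.partition("=")
            attrs[tag] = value
    return attrs


def children_by_parent(lines):
    """Map each parent ID to a list of labels for its child features."""
    children = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.rstrip("\n").split("\t")
        attrs = parse_attributes(cols[8])
        # Features without an ID get a label built from type and coordinates.
        label = attrs.get("ID", f"{cols[2]}:{cols[3]}-{cols[4]}")
        # A feature may list several parents, separated by commas.
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                children[parent].append(label)
    return dict(children)


rows = [
    ("operon", 1300, 15000, "ID=operon001;Name=superOperon"),
    ("mRNA", 1300, 9000, "ID=mrna0001;Parent=operon001;Name=sonichedgehog"),
    ("exon", 1300, 1500, "Parent=mrna0001"),
    ("mRNA", 10000, 15000, "ID=mrna0002;Parent=operon001;Name=subsonicsquirrel"),
    ("exon", 10000, 12000, "Parent=mrna0002"),
]
lines = ["##gff-version 3"] + [
    "\t".join(["ctg123", ".", ftype, str(start), str(end), ".", "+", ".", attrs])
    for ftype, start, end, attrs in rows
]
tree = children_by_parent(lines)
print(tree["operon001"])
print(tree["mrna0001"])
```

The sketch only records one level of linkage per line; walking from the operon down to the exons is a matter of looking up each child ID in the same map.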
+
 
+
===Discontinuous Features===
+
 
+
In addition to nested features, another common type of genomic annotation is the ''discontinuous feature'' in which a single feature spans multiple discontinuous portions of the genome. The primary example is an alignment, such as a cDNA sequence that has been aligned to genomic sequence. GFF3 deals with these features by representing each continuous segment as a distinct row, and then giving each segment the same ID to tie them together. For example:
+
 
+
ctg123 example match 26122 26126 . + . ID=match001
+
ctg123 example match 26497 26869 . + . ID=match001
+
ctg123 example match 27201 27325 . + . ID=match001
+
ctg123 example match 27372 27433 . + . ID=match001
+
ctg123 example match 27565 27565 . + . ID=match001
+
 
+
Note that this is distinct from the nested features we looked at in the previous section. In the former case, there is a single parent feature and multiple child features that are linked to the parent via a Parent tag. The IDs of the children are distinct from each other (or absent altogether). In the latter case, each segment of the discontinuous feature has the same ID. There is no parent.
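
The shared-ID convention can be sketched in Python as follows: rows are grouped by their ID attribute rather than by a Parent tag. This is an illustrative sketch that assumes tab-delimited input; <tt>segments_by_id</tt> is a hypothetical name, not a function from BioPerl or GBrowse.

```python
from collections import defaultdict


def segments_by_id(lines):
    """Group (start, end) spans of rows that share the same ID attribute."""
    segments = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        attrs = {}
        for pair in cols[8].split(";"):
            if pair:
                tag, _, value = pair.partition("=")
                attrs[tag] = value
        if "ID" in attrs:
            segments[attrs["ID"]].append((int(cols[3]), int(cols[4])))
    # Sort each feature's segments by start coordinate.
    return {fid: sorted(spans) for fid, spans in segments.items()}


coords = [(26122, 26126), (26497, 26869), (27201, 27325),
          (27372, 27433), (27565, 27565)]
lines = [
    "\t".join(["ctg123", "example", "match", str(s), str(e),
               ".", "+", ".", "ID=match001"])
    for s, e in coords
]
spans = segments_by_id(lines)["match001"]
print(len(spans), spans[0][0], spans[-1][1])
```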
+
 
+
''Note that this method of grouping discontinuous features is not currently supported by the GMOD Chado bulk GFF3 loader.  Parent-child grouping is required.''
+
 
+
===Protein-Coding Genes===
+
 
+
We'll now look at how to represent several common cases, starting with protein-coding genes.
+
 
+
The most general way of representing a protein-coding gene is the so-called "three-level gene." The top level is a feature of type "gene" which bundles up the gene's transcripts and regulatory elements. Beneath this level are one or more transcripts of type "mRNA". This level can also accommodate promoters and other cis-regulatory elements. At the third level are the components of the mRNA transcripts, most commonly CDS coding segments and UTRs. This example shows how to represent a gene named "EDEN" which has three alternatively-spliced mRNA transcripts:
+
 
+
ctg123 example gene            1050 9000 . + . ID=EDEN;Name=EDEN;Note=protein kinase
+
 
+
ctg123 example mRNA            1050 9000 . + . ID=EDEN.1;Parent=EDEN;Name=EDEN.1;Index=1
+
ctg123 example five_prime_UTR  1050 1200 . + . Parent=EDEN.1
+
ctg123 example CDS            1201 1500 . + 0 Parent=EDEN.1
+
ctg123 example CDS            3000 3902 . + 0 Parent=EDEN.1
+
ctg123 example CDS            5000 5500 . + 0 Parent=EDEN.1
+
ctg123 example CDS            7000 7608 . + 0 Parent=EDEN.1
+
ctg123 example three_prime_UTR 7609 9000 . + . Parent=EDEN.1
+
 
+
ctg123 example mRNA            1050 9000 . + . ID=EDEN.2;Parent=EDEN;Name=EDEN.2;Index=1
+
ctg123 example five_prime_UTR  1050 1200 . + . Parent=EDEN.2
+
ctg123 example CDS            1201 1500 . + 0 Parent=EDEN.2
+
ctg123 example CDS            5000 5500 . + 0 Parent=EDEN.2
+
ctg123 example CDS            7000 7608 . + 0 Parent=EDEN.2
+
ctg123 example three_prime_UTR 7609 9000 . + . Parent=EDEN.2
+
 
+
ctg123 example mRNA            1300 9000 . + . ID=EDEN.3;Parent=EDEN;Name=EDEN.3;Index=1
+
ctg123 example five_prime_UTR  1300 1500 . + . Parent=EDEN.3
+
ctg123 example five_prime_UTR  3000 3300 . + . Parent=EDEN.3
+
ctg123 example CDS            3301 3902 . + 0 Parent=EDEN.3
+
ctg123 example CDS            5000 5500 . + 1 Parent=EDEN.3
+
ctg123 example CDS            7000 7600 . + 1 Parent=EDEN.3
+
ctg123 example three_prime_UTR 7601 9000 . + . Parent=EDEN.3
+
 
+
We start with a feature of type "gene" with the ID "EDEN". This has three alternative splice forms named EDEN.1, EDEN.2 and EDEN.3. To tell [[GBrowse]] that each of these splice forms is part of the same gene, we give each one a Parent attribute of "EDEN" corresponding to the ID of the parent gene. Now consider mRNA EDEN.1. It has a five_prime_UTR feature, a three_prime_UTR feature, and four CDS features. To indicate that the CDS and UTR features belong to the mRNA, we give the mRNA a unique ID of "EDEN.1" and give each of the subfeatures a corresponding Parent attribute. This pattern repeats for each of the other two splice forms. Note how the five_prime_UTR of EDEN.3 is split into two parts.
+
 
+
We use "Name" to give the gene and its alternative splice forms a human-readable name, and use Note to provide a description for the gene as a whole (you can add notes to the individual mRNAs but they won't display by default). The Index=1 attribute is a hint to some indexed database to make the mRNAs searchable by name. This lets users find the gene by searching for the mRNA names ("EDEN.1") as well as by the gene name ("EDEN"). However, it is usually unnecessary to do this. Also notice that we are using the Phase column for the CDS features to describe how the CDS is translated into protein. See the description of phase at the beginning of this section.
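
The phase values in the example can be checked with a little arithmetic. In the GFF3 specification, the phase of a CDS segment is the number of bases to skip at the start of the segment to reach the first base of the next codon. The Python sketch below derives each segment's phase from the lengths of the segments before it; <tt>cds_phases</tt> is a hypothetical helper, and it assumes the segments are supplied in translation order (ascending coordinates on the + strand, descending on the - strand).

```python
def cds_phases(segments, first_phase=0):
    """Return the phase of each CDS segment, given (start, end) pairs
    listed in the order they are translated."""
    phases = []
    phase = first_phase
    for start, end in segments:
        phases.append(phase)
        length = end - start + 1  # GFF3 coordinates are 1-based, inclusive
        # Bases left after skipping `phase` bases and reading whole codons
        # determine how many bases the next segment must skip.
        phase = (3 - (length - phase) % 3) % 3
    return phases


eden1 = [(1201, 1500), (3000, 3902), (5000, 5500), (7000, 7608)]
eden3 = [(3301, 3902), (5000, 5500), (7000, 7600)]
print(cds_phases(eden1))
print(cds_phases(eden3))
```

For EDEN.3 the first segment is 602 bases long, which leaves two bases of an unfinished codon, so the next segment starts in phase 1, matching the example above.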
+
 
+
There are other ways of representing genes. Please see [http://www.sequenceontology.org/gff3.shtml the GFF3 Specification] and [http://gmod.svn.sourceforge.net/viewvc/gmod/Generic-Genome-Browser/branches/stable/docs/tutorial/tutorial.html?content-type=text%2Fhtml The GBrowse Administration Tutorial] for more information.
+
 
+
===Alignments===
+
 
+
Nucleotide-to-genome and protein-to-genome alignments are a little tricky because they involve two coordinate systems: the coordinates of the alignment on the genome (known as the "source" coordinates), and the coordinates of the cDNA, EST or protein (known as the "target" coordinates). In GFF3, the target coordinates are specified using the '''Target''' tag.
+
 
+
ctg123 est EST_match 1050 1500 . + . ID=Match1;Name=agt830.5;Target=agt830.5 1 451
+
ctg123 est EST_match 3000 3202 . + . ID=Match1;Name=agt830.5;Target=agt830.5 452 654
+
 
+
ctg123 est EST_match 5410 5500 . - . ID=Match2;Name=agt830.3;Target=agt830.3 505 595
+
ctg123 est EST_match 7000 7503 . - . ID=Match2;Name=agt830.3;Target=agt830.3 1 504
+
 
+
ctg123 est EST_match 1050 1500 . + . ID=Match3;Name=agt221.5;Target=agt221.5 1 451
+
ctg123 est EST_match 5000 5500 . + . ID=Match3;Name=agt221.5;Target=agt221.5 452 952
+
ctg123 est EST_match 7000 7300 . + . ID=Match3;Name=agt221.5;Target=agt221.5 953 1253
+
 
+
This example shows three different alignment features of type "EST_match". Each alignment has a distinct ID, and all the discontinuous parts of the alignment have the same ID, as described earlier. In addition to the ID and Name tags, each segment also has a Target tag whose value has the format "<nowiki><target seqid> <target start> <target end></nowiki>." For example, the very first line indicates that the EST named agt830.5 aligns to genomic contig ctg123 such that positions 1 through 451 of agt830.5 align to bases 1050-1500 of ctg123.
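
As a sketch of how the Target value might be read, the Python function below splits it into its parts and compares the length of the source span with the length of the target span. The name <tt>parse_target</tt> is hypothetical. The optional fourth field (a target strand) comes from the GFF3 specification and does not appear in the examples on this page.

```python
def parse_target(value):
    """Split a Target value into (target_id, start, end, strand_or_None)."""
    parts = value.split(" ")
    if len(parts) not in (3, 4):
        raise ValueError(f"unexpected Target value: {value!r}")
    strand = parts[3] if len(parts) == 4 else None
    return parts[0], int(parts[1]), int(parts[2]), strand


# First EST_match line from the example above (source span 1050-1500).
source_start, source_end = 1050, 1500
target_id, target_start, target_end, target_strand = parse_target("agt830.5 1 451")

source_length = source_end - source_start + 1
target_length = target_end - target_start + 1
print(target_id, source_length, target_length)
```

For an ungapped nucleotide segment like this one the two lengths are equal (451 bases each). Gapped alignments and protein targets need not satisfy that equality.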
+
 
+
Using the ##FASTA section of the GFF3 file, you can specify the sequence of the ESTs as well as of the contig, and [[GBrowse]] will display the DNA and/or protein sequences in the appropriate contexts.
+
 
+
See the [http://www.sequenceontology.org/gff3.shtml GFF3 specification] for instructions on how to represent gapped alignments.
+
 
+
===Quantitative Data===
+
 
+
GBrowse can plot quantitative data such as alignment scores, confidence scores from gene prediction programs, and microarray intensity data. There is a simple format that can be placed directly inside of a GFF3 file but does not scale to very large data sets, and a "WIG" format designed for very high-density quantitative data such as tiling arrays.
+
 
+
We first look at the simple format:
+
 
+
ctg123 affy microarray_oligo  1 100 281 . . Name=Expt1
+
ctg123 affy microarray_oligo 101 200 183 . . Name=Expt1
+
ctg123 affy microarray_oligo 201 300 213 . . Name=Expt1
+
ctg123 affy microarray_oligo 301 400 191 . . Name=Expt1
+
ctg123 affy microarray_oligo 401 500 288 . . Name=Expt1
+
ctg123 affy microarray_oligo 501 600 184 . . Name=Expt1
+
 
+
In this format, which can be embedded directly in the GFF3 file, each data point is a distinct feature with a start and end point. The features are grouped together by giving them a common experimental name so that they can be retrieved together. We use the '''score''' field (column 6) to represent the quantitative information (e.g. hybridization intensity).
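
The following Python sketch shows one way to pull the score column back out of such lines, grouped by the shared Name. It assumes tab-delimited input and treats a score of "." as missing; <tt>scores_by_name</tt> is a hypothetical helper and says nothing about how GBrowse itself reads the data.

```python
from collections import defaultdict


def scores_by_name(lines):
    """Collect (start, end, score) tuples for each Name attribute value."""
    series = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if cols[5] == ".":  # "." means no score was recorded
            continue
        attrs = {}
        for pair in cols[8].split(";"):
            tag, _, value = pair.partition("=")
            attrs[tag] = value
        series[attrs.get("Name", "")].append(
            (int(cols[3]), int(cols[4]), float(cols[5])))
    return dict(series)


points = [(1, 100, 281), (101, 200, 183), (201, 300, 213),
          (301, 400, 191), (401, 500, 288), (501, 600, 184)]
lines = [
    "\t".join(["ctg123", "affy", "microarray_oligo", str(s), str(e),
               str(v), ".", ".", "Name=Expt1"])
    for s, e, v in points
]
expt1 = scores_by_name(lines)["Expt1"]
best = max(expt1, key=lambda point: point[2])
print(len(expt1), best)
```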
+
 
+
In contrast, when using WIG format, the quantitative data is stored outside the main database in a special-purpose binary file on the file system. In this case the GFF3 file contains a single line per experiment like this one:
+
 
+
ctg123 . microarray_oligo 1 50000 . . . Name=example;wigfile=/usr/data/ctg123.Expt1.wig
+
 
+
The .wig file is created and managed using a script called <tt>wiggle2gff3.pl</tt> that comes with [[GBrowse]]. Instructions on how to use this script are given in the [http://gmod.cvs.sourceforge.net/*checkout*/gmod/Generic-Genome-Browser/docs/tutorial/tutorial.html?pathrev=stable GBrowse Administration Tutorial].
+
 
+
== GFF3 Sequence Section ==
+
 
+
{{GFF3FASTA}}
+
 
+
== GFF3 Validation ==
+
 
+
You can validate reasonably large GFF3 files at the following sites:
+
* [http://modencode.oicr.on.ca/cgi-bin/validate_gff3_online modENCODE validator]
+
* [http://public.ecolihub.net/cgi-bin/validate_gff3_online/validate_gff3_online EcoliHub validator]
+
The validator code can be found in the [http://gmod.svn.sourceforge.net/viewvc/gmod/gff-validator/ GMOD Sourceforge SVN repository].
+
 
+
== GFF2 ==
+
 
+
[[GFF2]] is a supported format in GMOD, '''but it is now deprecated and if you have a choice you should use GFF3'''.  Unfortunately, data is sometimes only available in GFF2 format.  GFF2 has a number of shortcomings compared to GFF3.
+
 
+
See [[GFF2]] for more on this format.
+
 
+
== GTF ==
+
 
+
[http://mblab.wustl.edu/GTF2.html ''GTF''] is another file format that is very similar to [[GFF2|GFF2]] and is sometimes referred to as GFF2.5.  GTF is not a supported format in GMOD, so if you have a GTF file you'll need to convert it to [[#GFF3|GFF3]].  The <tt>[http://song.cvs.sourceforge.net/song/software/scripts/gtf2gff3/ gtf2gff3]</tt> script does this conversion, with some caveats.  See also [http://www.nabble.com/Hi-td17810093.html this BioPerl-l posting].
+
 
+
== GFF in GMOD ==
+
 
+
A number of [[GMOD Components]] support GFF files.  This section provides a brief description of that support.
+
 
+
=== Apollo ===
+
 
+
The [[Apollo]] genome annotation editor can read and write annotations in GFF3 format.  You can also load GFF3 data into [[Chado]] and have Apollo connect with the database.
+
 
+
=== Chado ===
+
 
+
GFF3 data can be loaded into and dumped from [[Chado]] databases.  See:
+
* [[Load GFF Into Chado]]
+
* [[Load BLAST Into Chado]] - by converting it to GFF3
+
* [[Load GenBank into Chado]] - by converting it to GFF3
+
* [[Load RefSeq Into Chado]] - by converting it to GFF3
+
* [[Chado Update via GFF]]
+
* [[GMODTools]] - Generate [[GFF3]] from a [[Chado]] database.
+
 
+
=== CMap ===
+
 
+
The [[CMap]] comparative mapping viewer can read data in GFF3 format.
+
 
+
=== GBrowse ===
+
 
+
The [[GBrowse]] genome viewer supports data in [[GBrowse Adaptors|many formats]], but in many ways GFF3 is its native data format.  GBrowse also supports [[GFF2|GFF2]] data.  See the [[GBrowse]] and [[GBrowse Adaptors|GBrowse Adaptors]] pages for details.
+
 
+
=== JBrowse ===
+
 
+
The [[JBrowse]] genome browser also supports data in many formats, and also tends to prefer GFF3.
+
 
+
== See Also ==
+
 
+
* [http://www.sequenceontology.org/gff3.shtml GFF3 Specification @ the Sequence Ontology]
+
* [http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml GFF2 Specification]
+
* [[GFF2]] - GFF2 in GMOD
+
* [http://www.broad.mit.edu/annotation/argo/help/gff.html Broad Institute's Argo File Formats GFF page]
+
* [http://www.bioperl.org/wiki/GFF BioPerl's GFF page]
+
  
 
[[Category:GFF3]]
 
 
[[Category:GFF]]
 
 
[[Category:Tutorials]]
 

Latest revision as of 21:04, 11 September 2012