Difference between revisions of "Load GFF Into Chado"

Revision as of 18:41, 16 March 2007

Abstract

This HOWTO describes a method for loading sequence annotation data in GFF format into the Chado database.

Authors


Copyright

This document is copyright Scott Cain, 2007. For reproduction other than personal use, please contact <cain@cshl.edu>.


Revision History

Revision 1.0 2007-03-16 BIO First version


Download the GFF Files

An easy way to load data into the database is to use a GFF3 file and the script load/bin/gmod_bulk_load_gff3.pl. A good set of sample data is the GFF3 file prepared by the nice folks at the Saccharomyces Genome Database:

   ftp://ftp.yeastgenome.org/pub/yeast/data_download/chromosomal_feature/saccharomyces_cerevisiae.gff

This file contains Gene Ontology (GO) annotations, so if you didn't load GO when you executed make ontologies, you will get many warning messages about being unable to find entries in the dbxref table. If you want to load GO, you should be able to execute make ontologies and select Gene Ontology for installation.
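Before loading, it can help to see whether a GFF3 file references GO terms at all. The following is a minimal Python sketch, not part of the GMOD tools. It assumes the GO identifiers appear in the Ontology_term or Dbxref attributes of column 9, as the GFF3 format allows. The function name go_terms_in_gff3 and the sample lines are made up for illustration.

```python
from urllib.parse import unquote


def go_terms_in_gff3(lines):
    """Collect GO identifiers found in column 9 of GFF3 feature lines."""
    terms = set()
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # sequence data follows; no more feature lines
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        columns = line.split("\t")
        if len(columns) != 9:
            continue  # not a well-formed feature line
        for pair in columns[8].split(";"):
            key, _, value = pair.partition("=")
            if key.strip() not in ("Ontology_term", "Dbxref"):
                continue
            for item in value.split(","):
                item = unquote(item.strip())  # attribute values may be URL-escaped
                if item.startswith("GO:"):
                    terms.add(item)
    return terms


# Illustrative lines only; these are not taken from the SGD file.
SAMPLE = [
    "##gff-version 3\n",
    "chrI\tSGD\tgene\t100\t900\t.\t+\t.\tID=gene0001;Ontology_term=GO:0003674,GO:0005575\n",
    "chrI\tSGD\tgene\t1000\t2000\t.\t-\t.\tID=gene0002;Note=no%20terms\n",
]
found = go_terms_in_gff3(SAMPLE)
print(len(found), "distinct GO terms referenced")
```

To check a real file, pass an open file handle for saccharomyces_cerevisiae.gff in place of SAMPLE. If the result is not empty and GO was not loaded, expect the dbxref warnings described above.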

Add an Entry for Your Organism

You will need to have an entry for your species in the Chado organism table. If you are unsure whether this entry exists, log into your database and execute this SQL command:

 select common_name from organism;

If you do not see your organism listed, execute a command equivalent to this:

 insert into organism (abbreviation, genus, species, common_name, organism_id)
               values ('S.cerevisiae', 'Saccharomyces', 'cerevisiae', 'yeast', 4932);

Substitute in the appropriate values for your own organism.
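The check-then-insert step can also be scripted. The sketch below is one possible way to do it in Python and rests on several assumptions. It uses an in-memory SQLite table as a stand-in so that it runs anywhere, whereas a real Chado database is PostgreSQL and would need a PostgreSQL driver with that driver's placeholder style. The stand-in table has only the columns named above, the helper name ensure_organism is hypothetical, and unlike the SQL above the sketch lets the database assign organism_id.

```python
import sqlite3


def ensure_organism(conn, abbreviation, genus, species, common_name):
    """Insert an organism row only if no row has this common_name.

    Returns (organism_id, created).
    """
    cur = conn.cursor()
    cur.execute(
        "SELECT organism_id FROM organism WHERE common_name = ?",
        (common_name,),
    )
    row = cur.fetchone()
    if row is not None:
        return row[0], False  # already present, nothing to do
    cur.execute(
        "INSERT INTO organism (abbreviation, genus, species, common_name) "
        "VALUES (?, ?, ?, ?)",
        (abbreviation, genus, species, common_name),
    )
    conn.commit()
    return cur.lastrowid, True


# Stand-in schema (assumption): a simplified copy of the organism table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE organism ("
    " organism_id INTEGER PRIMARY KEY,"
    " abbreviation TEXT,"
    " genus TEXT NOT NULL,"
    " species TEXT NOT NULL,"
    " common_name TEXT)"
)

first = ensure_organism(conn, "S.cerevisiae", "Saccharomyces", "cerevisiae", "yeast")
second = ensure_organism(conn, "S.cerevisiae", "Saccharomyces", "cerevisiae", "yeast")
```

Calling the helper twice shows the intent: the first call creates the row, and the second finds it and leaves the table unchanged.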


Load the GFF

Then execute gmod_bulk_load_gff3.pl:

>gmod_bulk_load_gff3.pl --organism yeast  --gfffile saccharomyces_cerevisiae.gff

This loads the GFF3 file. The loading script requires GFF3 because that format has tighter control of the syntax and requires the use of a controlled vocabulary (from Sequence Ontology Feature Annotation (SOFA)), which allows features to be mapped to the relational schema. The --gfffile flag supplies the location of the file, and the --organism flag takes the common name (common_name field) from the Chado organism table. Run perldoc gmod_bulk_load_gff3.pl for more information on adding other organisms and databases, as well as on other available command line flags.
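If the load is driven from a script, a small pre-flight check can catch a file that is not GFF3 before the loader starts. The following Python sketch is only an illustration. It assumes the file begins with the ##gff-version 3 directive, as GFF3 files are expected to, and it uses only the two flags shown above. The helper names looks_like_gff3 and bulk_load_command are hypothetical.

```python
import shlex


def looks_like_gff3(first_line):
    """Return True if the first line is a '##gff-version 3' directive."""
    parts = first_line.split()
    return (
        len(parts) >= 2
        and parts[0] == "##gff-version"
        and parts[1].split(".")[0] == "3"  # accepts 3 and 3.x.y
    )


def bulk_load_command(organism, gff_path, script="gmod_bulk_load_gff3.pl"):
    """Build the argument list for the loader using the flags shown above."""
    return [script, "--organism", organism, "--gfffile", gff_path]


cmd = bulk_load_command("yeast", "saccharomyces_cerevisiae.gff")
print(shlex.join(cmd))  # the command as it would be typed at a shell prompt
```

On a machine where the loader is installed, the list can be passed to subprocess.run(cmd, check=True) once looks_like_gff3 returns True for the first line of the file.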

Note that gmod_load_gff3.pl is also available, but it has received limited support and is currently less flexible. It is a good example of how to write code using Class::DBI classes that are created at the time of install. For more information on using these classes, see http://sourceforge.net/projects/gmod-ware for a Class::DBI-based middleware/API.


Creating GFF3

GFF3 can also be generated via a script provided with Bioperl, scripts/Bio-DB-GFF/genbank2gff3.pl:

>bp_genbank2gff3.pl --stdout --file <genbank file> > <gff file>

Note the redirection of standard out. This method for generating GFF3 files is not completely satisfactory and development is ongoing to provide better translation.
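The redirection of standard out can be reproduced when the conversion is run from a script. This Python sketch shows one way to do it. The helper name run_to_file and the file names input.gb and output.gff are placeholders, and the call that would run the converter is left commented out because it needs BioPerl installed.

```python
import subprocess


def run_to_file(argv, out_path):
    """Run argv and write its standard output to out_path, like `cmd > out_path`."""
    with open(out_path, "w") as out:
        # check=True raises CalledProcessError if the command exits non-zero
        subprocess.run(argv, stdout=out, check=True)


# Same command line as above; input.gb is a hypothetical GenBank file name.
GENBANK2GFF3 = ["bp_genbank2gff3.pl", "--stdout", "--file", "input.gb"]

# Uncomment on a machine where BioPerl is installed:
# run_to_file(GENBANK2GFF3, "output.gff")
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.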

More Information

See the related HOWTO Load RefSeq Into Chado.

Please send questions to the GMOD developers list:

gmod-devel@lists.sourceforge.net

Or contact the GMOD Help Desk.