Load RefSeq Into Chado


Revision as of 03:28, 8 February 2007

Abstract

This HOWTO describes a method for loading the sequence data in GenBank files into the Chado database.


Authors


Copyright

This document is copyright Scott Cain, 2007. For reproduction other than personal use, please contact <cain@cshl.org>.


Revision History

Revision 1.0 2007-02-07 BIO First version


Download the Sequence Files

Download the GenBank genome record of interest. A good source for GenBank files is NCBI's FTP site (ftp://ftp.ncbi.nih.gov/genomes/). Look for the *.gbk files; they will probably be compressed (*.gbk.gz).
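The compressed files need to be unpacked before conversion. The following is a minimal sketch of one way to do that, assuming the downloads sit together in one directory; the function name decompress_gbk is hypothetical and not part of BioPerl or GMOD.

```shell
# Decompress every *.gbk.gz file in a directory, keeping the compressed originals.
# Usage: decompress_gbk [directory]   (defaults to the current directory)
decompress_gbk() {
    dir=${1:-.}
    for f in "$dir"/*.gbk.gz; do
        [ -e "$f" ] || continue          # no matches: the glob stays literal, so skip it
        gunzip -c "$f" > "${f%.gz}"      # foo.gbk.gz -> foo.gbk
    done
}
```

Writing to a new file with gunzip -c, rather than decompressing in place, leaves the original download available if the conversion has to be repeated.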


Convert GenBank to GFF

Use the BioPerl script genbank2gff3.pl, found in scripts/Bio-DB-GFF/genbank2gff3.PLS within the BioPerl distribution.

 >load/bin/genbank2gff3.pl <filename>

This converts the GenBank file to GFF version 3. It may emit several warnings about unrecognized feature types. If a feature type is not part of SOFA (http://www.sequenceontology.org/), you will have to hand-edit the resulting GFF file to change it. Any skipped features are printed at the end; if you want them in the GFF file, you will have to add them manually as well, fixing any non-SOFA feature types.
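Before hand-editing, it can help to see which feature types actually ended up in the converted file. The following is a minimal sketch that tallies column 3 of the GFF3 output; it does not check the types against SOFA, it only lists them so they can be compared by eye. The function name list_gff3_types is hypothetical and not part of BioPerl or GMOD.

```shell
# Print each feature type (GFF3 column 3) with the number of times it occurs.
# Usage: list_gff3_types file.gff
list_gff3_types() {
    awk -F'\t' '
        /^##FASTA/ { exit }        # sequence section follows; no more feature lines
        /^#/ || NF < 9 { next }    # skip comments, directives, and malformed lines
        { count[$3]++ }
        END { for (t in count) printf "%s\t%d\n", t, count[t] }
    ' "$1" | sort
}
```

Each output line is a feature type and its count, separated by a tab. Any type in the list that is not a SOFA term is a candidate for editing.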


Add an Entry for Your Organism

You will need an entry for your species in the Chado organism table. If you are unsure whether this entry exists, log into your database and execute this SQL command:

 select * from organism;

If you do not see your organism listed, execute a command equivalent to this:

 insert into organism (abbreviation, genus, species, common_name)
               values ('H.sapiens', 'Homo', 'sapiens', 'Human');

Substitute the appropriate values for your organism.


Load the GFF

Run the load_gff3.pl script from the GMOD distribution:

  >load/bin/load_gff3.pl --organism <your org>  --srcdb DB:genbank --gfffile <your gfffile>

This loads your data into the Chado database. If any non-SOFA feature types remain in the GFF file, the load will fail when it encounters them. If that happens, edit the file to fix the incorrect term and load again. Only the previously unloaded data will be loaded, so you won't get duplicate rows.
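When the same incorrect term appears on many lines, editing the file by hand is tedious. The following is a minimal sketch of a bulk replacement that rewrites only the type column; the function name rename_gff3_type and the term names in the usage line are hypothetical, and choosing the correct SOFA term is still up to you.

```shell
# Replace one feature type (GFF3 column 3) with another, leaving all other
# columns, comment lines, directives, and any FASTA section untouched.
# Usage: rename_gff3_type OLD_TYPE NEW_TYPE input.gff > output.gff
rename_gff3_type() {
    awk -F'\t' -v OFS='\t' -v old="$1" -v new="$2" '
        /^##FASTA/ { in_fasta = 1 }                                  # features end here
        !in_fasta && !/^#/ && NF >= 9 && $3 == old { $3 = new }      # exact match on column 3 only
        { print }
    ' "$3"
}
```

For example, rename_gff3_type misc_thing region in.gff > fixed.gff would change every feature of the made-up type misc_thing to region. Matching on the third column alone avoids accidentally changing the same string where it appears in the attributes column.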


More Information

Please send questions to the GMOD developers list:

gmod-devel@lists.sourceforge.net