Difference between revisions of "Load RefSeq Into Chado"

From GMOD
m (Convert Genbank to GFF)
m (corrected a bad link)
 
(20 intermediate revisions by 5 users not shown)
Line 1: Line 1:
__TOC__
+
This [[:Category:HOWTO|HOWTO]] describes a method for loading the sequence data in Genbank RefSeq files into the [[Chado_-_Getting_Started|Chado database]].
 
+
==Abstract==
+
 
+
This HOWTO describes a method for loading the sequence data in Genbank files into the Chado database.
+
 
+
 
+
==Authors==
+
+
* [[Scott Cain]]
+
* [[bp:Brian_Osborne|Brian Osborne]]
+
 
+
 
+
==Copyright==
+
 
+
This document is copyright Scott Cain, 2007. For reproduction other than personal use please contact <cain@cshl.edu>
+
 
+
 
+
==Revision History==
+
 
+
{| border="1" cellspacing="0" cellpadding="4"
+
|-
+
| Revision 1.0 2007-02-07 BIO
+
| First version
+
|-
+
|}
+
 
+
  
 
==Download the Sequence Files==
 
==Download the Sequence Files==
  
Download the Genbank genome record of interest. A good source for Genbank files is NCBI's FTP site (ftp://ftp.ncbi.nih.gov/genomes/), look for the *.gbk files, they will probably be compressed (*.gbk.gz).
+
These steps have been used to load data from genomic RefSeq files. You can recognize these files by their <code>NC_</code> and <code>NT_</code> prefixes. First download the Genbank genome files of interest. A good source for RefSeq files is [ftp://ftp.ncbi.nih.gov/genomes NCBI's FTP site]. This website provides some files in GFF3 format (suffix <code>.gff</code>).
 +
Files in the Genbank format have the suffix <code>.gbk</code>.
  
 +
==Convert RefSeq to GFF3==
  
==Convert Genbank to GFF==
+
Use the [[BioPerl]] script <code>genbank2gff3.pl</code>, found in <code>scripts/Bio-DB-GFF/</code> within the BioPerl distribution. If you've actually installed BioPerl then the installed script will have been renamed <code>bp_genbank2gff3.pl</code>. Note that there's also an older <code>genbank2gff.pl</code> script, don't use it.
  
Use the [http://www.bioperl.org BioPerl] script <code>genbank2gff3.pl</code>, found in scripts/Bio-DB-GFF/ within the [http://bioperl.org BioPerl] distribution. If you've actually installed BioPerl then the installed script will have been renamed <code>bp_genbank2gff3.pl</code>. Note that there's also an older <code>genbank2gff.pl</code> script, don't use it.
 
 
 
   >bp_genbank2gff3.pl <filename>
 
   >bp_genbank2gff3.pl <filename>
  
This will create a [[bp:GFF|GFF version 3]] file.  It may give several  warnings about ''unrecognized feature types''.  If the feature types are not part of [http://www.sequenceontology.org/ SOFA], you will have to hand edit the resulting [[bp:GFF|GFF]] file to change the feature type.  Any skipped features will be printed at  the end.  If you want those to be part of the GFF file, you will have  to add those manually as well, fixing any non-SOFA feature types.
+
This will create a [[GFF3]] file.  It may give several  warnings about ''unrecognized feature types''.  If the feature types are not part of [http://www.sequenceontology.org/ SOFA], you will have to hand edit the resulting [[GFF3]] file to change the feature type.  Any skipped features will be printed at  the end.  If you want those to be part of the GFF3 file, you will have  to add those manually as well, fixing any non-SOFA feature types.
  
 
==Add an Entry for Your Organism==
 
==Add an Entry for Your Organism==
  
You will need to have an entry for your species in the [[Chado_Tables#Table:_organism|Chado organism table]]. If you are unsure if this entry exists log into your database and execute this SQL command:
+
You will need to have an entry for your species in the [[Chado_Tables#Table:_organism|Chado organism table]]. If you are unsure if this entry exists log into your database and execute this [[Glossary#SQL|SQL]] command:
 
+
<syntaxhighlight lang="sql">
  select * from organism;
+
select genus,species,common_name from organism;
 
+
</syntaxhighlight>
 
If you do not see your organism listed, execute a command equivalent to this:
 
If you do not see your organism listed, execute a command equivalent to this:
<sql>
+
<syntaxhighlight lang="sql">
   insert into organism (abbreviation, genus, species, common_name, organism_id)
+
   insert into organism (abbreviation, genus, species, common_name)
                 values ('H.sapiens', 'Homo', 'sapiens', 'Human', 9606);
+
                 values ('H.sapiens', 'Homo', 'sapiens', 'Human');
</sql>
+
</syntaxhighlight>
 
Substitute in the appropriate values for your own organism.
 
Substitute in the appropriate values for your own organism.
  
==Load the GFF==
+
==Load the GFF3==
  
Run the <code>load/bin/gmod_load_gff3.pl</code> script from the GMOD distribution:
+
Run the <code>load/bin/gmod_bulk_load_gff3.pl</code> script from the GMOD distribution:
  
   >gmod_load_gff3.pl --organism <your org common name>  --srcdb DB:genbank --gfffile <your gfffile>
+
   >gmod_bulk_load_gff3.pl --gfffile <your gfffile>
  
This will load your data into the [[Chado_-_Getting_Started|Chado database]].  Note that if there are non-[http://sequenceontology SOFA] feature types remaining in the GFF file the load will fail when they  are encountered.  If that happens, edit the file to fix the incorrect  term and load again.  Only the previously unloaded data will load (i.e. you won't have duplicate rows).
+
If you didn't specify this organism when installing Chado, (in response to the question "What is the default organism (common name, or "none")"), then you'll need to add at least the <code>--organism <common_name></code> flag to the command.  See <code>perldoc gmod_bulk_load_gff3.pl</code> for an explanation of the other options this script supports.
  
 
+
This will load your data into the [[Chado_-_Getting_Started|Chado database]].  Note that if there are non-[http://www.sequenceontology.org/ SOFA] feature types remaining in the GFF3 file the load will fail when they  are encountered.  If that happens, edit the file to fix the incorrect  term and load again.  If that happens, the load will be stopped before the database is touched, so you won't have to worry about duplicate data.
''sc-note: I need to check that this works--I haven't tried GenBank to Chado in a very long time.''
+
  
 
==More Information==
 
==More Information==
Line 70: Line 43:
 
[mailto:gmod-devel@lists.sourceforge.net gmod-devel@lists.sourceforge.net]
 
[mailto:gmod-devel@lists.sourceforge.net gmod-devel@lists.sourceforge.net]
  
 +
 +
==Authors==
 +
 +
* [[User:Scott|Scott Cain]]
 +
* [[bp:Brian_Osborne|Brian Osborne]]
  
 
[[Category:HOWTO]]
 
[[Category:HOWTO]]
[[Category:To Do]]
+
[[Category:Chado]]

Latest revision as of 21:09, 15 July 2015

This HOWTO describes a method for loading the sequence data in GenBank RefSeq files into the Chado database.

Download the Sequence Files

These steps have been used to load data from genomic RefSeq files. You can recognize these files by their NC_ and NT_ prefixes. First download the GenBank genome files of interest. A good source for RefSeq files is NCBI's FTP site (ftp://ftp.ncbi.nih.gov/genomes). The site provides some files in GFF3 format (suffix .gff); files in the GenBank format have the suffix .gbk.
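
The naming conventions above can be checked mechanically before starting a conversion. The following is a minimal sketch, not part of the GMOD or BioPerl tooling: the function name classify_refseq_file and the example file names are hypothetical, and the handling of a trailing .gz is an assumption about compressed downloads. It inspects the file name only and never reads the file.

```python
from pathlib import PurePosixPath

# Prefixes that mark genomic RefSeq records, as described in the text.
GENOMIC_PREFIXES = ("NC_", "NT_")


def classify_refseq_file(filename: str) -> dict:
    """Describe a downloaded file from its name alone (hypothetical helper)."""
    name = PurePosixPath(filename).name
    # Assumption: compressed downloads carry a trailing ".gz".
    compressed = name.endswith(".gz")
    if compressed:
        name = name[: -len(".gz")]
    if name.endswith(".gbk"):
        file_format = "genbank"
    elif name.endswith(".gff"):
        file_format = "gff3"
    else:
        file_format = "unknown"
    return {
        "genomic_refseq": name.startswith(GENOMIC_PREFIXES),
        "format": file_format,
        "compressed": compressed,
        # Only GenBank-format files need the conversion step that follows.
        "needs_conversion": file_format == "genbank",
    }


if __name__ == "__main__":
    # Example file names are illustrative only.
    for example in ("NC_000913.gbk", "NT_000001.gff", "notes.txt"):
        print(example, classify_refseq_file(example))
```

A file reported with format "gff3" can skip the conversion step and go straight to loading.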

Convert RefSeq to GFF3

Use the BioPerl script genbank2gff3.pl, found in scripts/Bio-DB-GFF/ within the BioPerl distribution. If you have installed BioPerl, the installed script will have been renamed bp_genbank2gff3.pl. Note that there is also an older genbank2gff.pl script; don't use it.

 >bp_genbank2gff3.pl <filename>

This will create a GFF3 file. It may give several warnings about unrecognized feature types. If the feature types are not part of SOFA (the Sequence Ontology Feature Annotation subset), you will have to hand-edit the resulting GFF3 file to change the feature type. Any skipped features will be printed at the end. If you want those to be part of the GFF3 file, you will have to add them manually as well, fixing any non-SOFA feature types.
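
Before hand-editing, it can help to list which feature types in the converted file fall outside the set you intend to load. The sketch below is an illustration under stated assumptions, not part of the converter: the function name unrecognized_types is hypothetical, and the allowed set passed in the example is a small placeholder that you would replace with the real SOFA terms. It relies only on the GFF3 layout (nine tab-separated columns with the feature type in column 3, and an optional ##FASTA directive before any embedded sequence).

```python
from collections import Counter
from typing import Iterable, Set


def unrecognized_types(lines: Iterable[str], allowed: Set[str]) -> Counter:
    """Count feature types (GFF3 column 3) that are not in `allowed`."""
    counts: Counter = Counter()
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # embedded sequence follows; no more feature lines
        if not line or line.startswith("#"):
            continue  # blank lines, comments, and other directives
        columns = line.split("\t")
        if len(columns) != 9:
            continue  # skip anything that is not a well-formed feature line
        feature_type = columns[2]
        if feature_type not in allowed:
            counts[feature_type] += 1
    return counts


if __name__ == "__main__":
    # Placeholder subset for illustration; substitute the actual SOFA terms.
    allowed_types = {"gene", "mRNA", "exon", "CDS"}
    sample = [
        "##gff-version 3",
        "NC_000913\tGenBank\tgene\t190\t255\t.\t+\t.\tID=gene1",
        "NC_000913\tGenBank\tmisc_feature\t300\t400\t.\t+\t.\tID=mf1",
    ]
    for feature_type, count in unrecognized_types(sample, allowed_types).items():
        print(f"{feature_type}\t{count}")
```

To scan a real file, pass an open file handle as the first argument; each type it reports needs to be mapped to a SOFA term before loading.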

Add an Entry for Your Organism

You will need to have an entry for your species in the Chado organism table. If you are unsure whether this entry exists, log into your database and execute this SQL command:

SELECT genus,species,common_name FROM organism;

If you do not see your organism listed, execute a command equivalent to this:

  INSERT INTO organism (abbreviation, genus, species, common_name)
                VALUES ('H.sapiens', 'Homo', 'sapiens', 'Human');

Substitute in the appropriate values for your own organism.
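
The check-then-insert logic above can also be scripted. The sketch below is only an illustration of that logic: it uses Python's built-in sqlite3 module with an in-memory stand-in table, whereas a real Chado database runs on PostgreSQL and has a richer organism table. The helper names ensure_organism and demo_connection, and the simplified table definition, are assumptions made for this example.

```python
import sqlite3


def demo_connection() -> sqlite3.Connection:
    """Create an in-memory stand-in for the organism table (simplified)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE organism ("
        " organism_id INTEGER PRIMARY KEY,"  # assigned automatically
        " abbreviation TEXT,"
        " genus TEXT NOT NULL,"
        " species TEXT NOT NULL,"
        " common_name TEXT)"
    )
    return conn


def ensure_organism(conn, abbreviation, genus, species, common_name) -> bool:
    """Insert the organism only if it is missing; return True if a row was added."""
    row = conn.execute(
        "SELECT 1 FROM organism WHERE genus = ? AND species = ?",
        (genus, species),
    ).fetchone()
    if row is not None:
        return False  # already present, nothing to do
    conn.execute(
        "INSERT INTO organism (abbreviation, genus, species, common_name)"
        " VALUES (?, ?, ?, ?)",
        (abbreviation, genus, species, common_name),
    )
    conn.commit()
    return True


if __name__ == "__main__":
    connection = demo_connection()
    print(ensure_organism(connection, "H.sapiens", "Homo", "sapiens", "Human"))
    print(ensure_organism(connection, "H.sapiens", "Homo", "sapiens", "Human"))
```

As in the INSERT statement above, organism_id is left out so the database can assign it. Against a live Chado instance you would use a PostgreSQL driver instead, and its placeholder syntax will likely differ from the ? markers used here.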

Load the GFF3

Run the load/bin/gmod_bulk_load_gff3.pl script from the GMOD distribution:

  >gmod_bulk_load_gff3.pl --gfffile <your gfffile>

If you didn't specify this organism when installing Chado (in response to the question "What is the default organism (common name, or "none")"), then you'll need to add at least the --organism <common_name> flag to the command. See perldoc gmod_bulk_load_gff3.pl for an explanation of the other options this script supports.

This will load your data into the Chado database. Note that if there are non-SOFA feature types remaining in the GFF3 file, the load will fail when they are encountered. In that case the load is stopped before the database is touched, so you won't have to worry about duplicate data; edit the file to fix the incorrect term and load again.
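
If you script the load, the optional --organism flag can be added conditionally. This is a minimal sketch that only assembles the argument list; the function name build_load_command and the example file name are hypothetical, and the only flags used are the two named in this HOWTO (--gfffile and --organism). It assumes Python 3.8 or later for shlex.join.

```python
import shlex
from typing import List, Optional


def build_load_command(gff_file: str, organism: Optional[str] = None) -> List[str]:
    """Assemble the loader invocation as an argument list (nothing is executed)."""
    command = ["gmod_bulk_load_gff3.pl", "--gfffile", gff_file]
    if organism is not None:
        # Needed when no default organism was chosen while installing Chado.
        command += ["--organism", organism]
    return command


if __name__ == "__main__":
    # Example file name is illustrative only.
    args = build_load_command("NC_000913.gff", organism="Human")
    # shlex.join quotes arguments so the line can be pasted into a shell.
    print(shlex.join(args))
```

Keeping the command as a list rather than a single string avoids quoting problems with file names that contain spaces; the list could then be handed to subprocess.run to perform the load.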

More Information

Please send questions to the GMOD developers list:

gmod-devel@lists.sourceforge.net


Authors

Scott Cain
Brian Osborne