Difference between revisions of "Load GenBank into Chado"

From GMOD

Revision as of 22:04, 15 April 2007

Abstract

This HOWTO describes how to load GenBank format files into Chado. For a thorough discussion of this topic, including all the files needed to set up a complete test environment, see:

http://eugenes.org/gmod/genbank2chado/


Authors


Copyright

This document is copyright Don Gilbert, 2007. For reproduction other than personal use please contact <gilbertd@cricket.bio.indiana.edu>

Revision History

Revision 1.0 2007-04-16 BIO First version

Summary

In summary, to load Saccharomyces chromosome X into the Chado database 'mychado' from a Unix command line, run:

 curl ftp://ftp.ncbi.nih.gov/genomes/Saccharomyces_cerevisiae/CHR_X/NC_001142.gbk \
 | perl bp_genbank2gff3.pl -noCDS -in stdin -out stdout \
 | perl gmod_bulk_load_gff3.pl -dbname mychado -organism fromdata


Fetch GenBank Genome Files

GenBank genome data is available from the NCBI genomes section, ftp://ftp.ncbi.nih.gov/genomes, or from a current mirror at ftp://bio-mirror.net/biomirror/ncbigenomes/

 mkdir data; cd data
 

Fetch from NCBI, or from this Indiana mirror:

 curl ftp://bio-mirror.net/biomirror/ncbigenomes/
 curl -RO ftp://bio-mirror.net/biomirror/ncbigenomes/Saccharomyces_cerevisiae/CHR_X/NC_001142.gbk.gz

Other sample genomes of interest:

  • Drosophila_melanogaster/CHR_4/NC_004353.gbk.gz
  • Caenorhabditis_elegans/CHR_III/NC_003281.gbk.gz
  • Arabidopsis_thaliana/CHR_IV/NC_003075.gbk.gz
  • M_musculus/CHR_19/mm_ref_chr19.gbk.gz
  • H_sapiens/CHR_19/hs_ref_chr19.gbk.gz
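
The sample paths above can be turned into download commands with a short loop. This is a minimal sketch, assuming the mirror layout quoted in this HOWTO is still in place; genome_url is a hypothetical helper name, and the loop only prints the curl commands rather than running them.

```shell
#!/bin/sh
# Sketch: print a curl command for each sample genome listed above.
# MIRROR is the mirror URL quoted in this HOWTO.
# genome_url is a hypothetical helper, not part of any GMOD tool.
MIRROR="ftp://bio-mirror.net/biomirror/ncbigenomes"

genome_url() {
    # $1 is a path relative to the mirror root
    printf '%s/%s\n' "$MIRROR" "$1"
}

for path in \
    Drosophila_melanogaster/CHR_4/NC_004353.gbk.gz \
    Caenorhabditis_elegans/CHR_III/NC_003281.gbk.gz \
    Arabidopsis_thaliana/CHR_IV/NC_003075.gbk.gz \
    M_musculus/CHR_19/mm_ref_chr19.gbk.gz \
    H_sapiens/CHR_19/hs_ref_chr19.gbk.gz
do
    # Dry run: remove "echo" to actually download each file
    echo curl -RO "$(genome_url "$path")"
done
```

Removing the echo turns the dry run into real downloads into the current directory.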


Create GFF from the GenBank Files

The BioPerl script bp_genbank2gff3.pl (scripts/Bio-DB-GFF/genbank2gff3.PLS) converts GenBank files to GFF version 3 suitable for loading into Chado. Important: use a version of the script created April 2007 or later.

The new -noCDS flag is required for this. Use the -s flag to summarize the features found.

 bp_genbank2gff3.pl -noCDS -s -o . data/NC_001142.gbk.gz
 

Load GFF into Chado

Use the GMOD script gmod_bulk_load_gff3.pl for this. Note that gmod_bulk_load_gff3.pl handles only ONE organism at a time. Choose the best --dbxref for each organism (WormBase, SGD, MGI, FLYBASE), depending on the contents of the GenBank annotations. The 'GeneID' dbxref is standard for most GenBank genomes.

 gmod_bulk_load_gff3.pl  --dbname dev_chado_01c --dbxref GeneID --organism fromdata --gff data/NC_004353.gbk.gz.gff
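
Since the best --dbxref depends on what the GenBank annotations contain, it can help to tally the database names in the GFF before loading. The following is a minimal sketch, assuming the converter writes cross-references as a GFF3 attribute of the form Dbxref=GeneID:123,SGD:S000001; dbxref_tally is a hypothetical helper, and the two sample lines are made up for illustration.

```shell
#!/bin/sh
# Sketch: count the database names used in Dbxref attributes of a GFF3 file.
# Assumes attributes look like "Dbxref=GeneID:123,SGD:S000001".
# dbxref_tally is a hypothetical helper; it reads GFF3 on stdin.
dbxref_tally() {
    awk -F '\t' '
        /^#/ { next }                          # skip comments and directives
        NF >= 9 {
            n = split($9, attrs, ";")          # column 9 holds the attributes
            for (i = 1; i <= n; i++) {
                if (attrs[i] ~ /^Dbxref=/) {
                    sub(/^Dbxref=/, "", attrs[i])
                    m = split(attrs[i], refs, ",")
                    for (j = 1; j <= m; j++) {
                        split(refs[j], parts, ":")
                        count[parts[1]]++      # tally the database name
                    }
                }
            }
        }
        END { for (db in count) print count[db], db }
    ' | sort -k1,1nr -k2,2
}

# Made-up sample input, for illustration only
printf 'chrX\tGenBank\tgene\t1\t100\t.\t+\t.\tID=g1;Dbxref=GeneID:1,SGD:S1\nchrX\tGenBank\tgene\t200\t300\t.\t-\t.\tID=g2;Dbxref=GeneID:2\n' \
    | dbxref_tally
```

On a real file this would be run as dbxref_tally < data/NC_004353.gbk.gz.gff; the most frequent database name is a reasonable first candidate for --dbxref.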
 

Check data:

 psql -d dev_chado_01c -c "select count(f.*), \
  (select common_name from organism where organism_id = f.organism_id) as species \
  from feature f group by f.organism_id;"
 psql -d dev_chado_01c -c "select count(f.*), \
  (select common_name from organism where organism_id = f.organism_id) as species \
  from feature f where f.seqlen>0 group by f.organism_id;"

Set up GBrowse View

The install steps included making a symlink from your Apache www/cgi-bin folder to TEST_HOME/cgi-bin, which holds the gbrowse software. This gbrowse instance needs the correct path to TEST_HOME, and you may need adjustments when using mod_perl with the Apache server.

At this point your web server should find this test gbrowse at http://YOUR_SERVER/cgi-bin/gmod01/gbrowse/ with the Chado genome database at cgi-bin/gmod01/gbrowse/dev_chado_ggb/

If this fails, try the default gbrowse yeast data set at cgi-bin/gmod01/gbrowse/yeast_chr1/. If that also fails, the problem lies outside what this test example covers. If it works but dev_chado_ggb/ fails, check the settings in your gbrowse.conf/dev_chado_ggb.conf. As needed, edit this setting to match your Chado database name:

 database = dbi:Pg:dbname=dev_chado_01c;host=localhost
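
A quick way to see which database name a conf file points at is to pull it out of the database line. This is a minimal sketch, assuming the setting has the "database = dbi:Pg:dbname=..." form quoted in this HOWTO; conf_dbname is a hypothetical helper, and the temporary file stands in for your real dev_chado_ggb.conf.

```shell
#!/bin/sh
# Sketch: print the dbname from the "database =" line of a GBrowse conf file.
# conf_dbname is a hypothetical helper, not part of GBrowse.
conf_dbname() {
    # $1 is the path to the conf file
    sed -n 's/^[[:space:]]*database[[:space:]]*=.*dbname=\([^;[:space:]]*\).*/\1/p' "$1"
}

# Illustration: a temporary file holding the setting quoted in this HOWTO
tmp_conf=$(mktemp)
echo 'database = dbi:Pg:dbname=dev_chado_01c;host=localhost' > "$tmp_conf"
conf_dbname "$tmp_conf"
rm -f "$tmp_conf"
```

The printed name should match the --dbname value given to gmod_bulk_load_gff3.pl.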

Check your web server error logs for messages from this software.


Possible Errors

It's possible that you'll run into some errors coming from the input data itself. Some of the errors, and their fixes, are described below.


couldn't open /var/lib/gmod/conf directory for reading:No such file or directory

Make sure the environment variable GMOD_ROOT is set to where gmod was installed, for example:

 setenv GMOD_ROOT /usr/local/gmod/ # tcsh

or

 export GMOD_ROOT=/usr/local/gmod/ # bash
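
To confirm the variable is visible before running the loader, a small check can help. This is a minimal sketch for sh or bash, assuming the loader looks for a conf directory directly under GMOD_ROOT (inferred from the error message above); check_gmod_root is a hypothetical helper.

```shell
#!/bin/sh
# Sketch: verify that GMOD_ROOT is set and contains a conf directory.
# check_gmod_root is a hypothetical helper; the conf/ location is an
# assumption inferred from the error message above.
check_gmod_root() {
    if [ -z "${GMOD_ROOT:-}" ]; then
        echo "GMOD_ROOT is not set" >&2
        return 1
    fi
    if [ ! -d "$GMOD_ROOT/conf" ]; then
        echo "no conf directory under $GMOD_ROOT" >&2
        return 1
    fi
    echo "GMOD_ROOT looks usable: $GMOD_ROOT"
}

check_gmod_root || echo "fix GMOD_ROOT before running the loader"
```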


Your GFF3 file uses a tag called <term>, but this term is not already in the cvterm and dbxref tables so that its value can be inserted into the featureprop table

Solution: This error message will be followed by SQL statements that insert the term in the correct way - execute them. By the way, one explanation for this error is that the source sequence was curated but not with terms from the Sequence Ontology.


DBD::Pg::db pg_endcopy failed: ERROR: duplicate key violates unique constraint "featureprop_c1"
CONTEXT: COPY featureprop, line ...

Solution: The CONTEXT line identifies the offending data. This error probably means that two features in the GFF file share the same name or ID and feature type. Correct these errors by hand and reload.
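
Candidate duplicates can be listed before reloading. This is a minimal sketch, assuming features carry a GFF3 ID= attribute in column 9; find_duplicate_ids is a hypothetical helper, and the sample lines are made up for illustration. Note that GFF3 allows a discontinuous feature to repeat its ID across lines, so a hit is only a candidate for hand correction.

```shell
#!/bin/sh
# Sketch: list feature type + ID pairs that occur more than once in a GFF3 file.
# find_duplicate_ids is a hypothetical helper; it reads GFF3 on stdin and
# prints "count<TAB>type<TAB>ID" for each repeated pair.
find_duplicate_ids() {
    awk -F '\t' '
        /^#/ { next }                          # skip comments and directives
        NF >= 9 {
            n = split($9, attrs, ";")
            for (i = 1; i <= n; i++) {
                if (attrs[i] ~ /^ID=/) {
                    key = $3 "\t" substr(attrs[i], 4)   # type + ID value
                    seen[key]++
                }
            }
        }
        END { for (k in seen) if (seen[k] > 1) print seen[k] "\t" k }
    ' | sort
}

# Made-up sample input, for illustration only: g1 appears twice as a gene
printf 'chrX\tGenBank\tgene\t1\t100\t.\t+\t.\tID=g1\nchrX\tGenBank\tgene\t500\t600\t.\t+\t.\tID=g1\nchrX\tGenBank\tgene\t700\t800\t.\t+\t.\tID=g2\n' \
    | find_duplicate_ids
```

On a real file this would be run as find_duplicate_ids < data/NC_004353.gbk.gz.gff.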