Load RefSeq Into Chado
Download the Sequence Files
These steps have been used to load data from genomic RefSeq files. You can recognize these files by their NT_ prefixes. First download the GenBank genome files of interest. A good source for RefSeq files is NCBI's FTP site. This website provides some files in GFF3 format (suffix
Files in the GenBank format have the suffix
Convert RefSeq to GFF3
Use the BioPerl script genbank2gff3.pl, found in scripts/Bio-DB-GFF/ within the BioPerl distribution. If you've installed BioPerl, the installed script will have been renamed bp_genbank2gff3.pl. Note that there's also an older genbank2gff.pl script; don't use it.
Running the script creates a GFF3 file. It may give several warnings about unrecognized feature types. If a feature type is not part of SOFA, you will have to hand-edit the resulting GFF3 file to change it. Any skipped features are printed at the end. If you want those to be part of the GFF3 file, you will have to add them manually as well, fixing any non-SOFA feature types.
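Before hand-editing, it can help to list which feature types in the converted file would be rejected. The following is a minimal Python sketch, not part of BioPerl or GMOD: the function name, the `allowed` set, and the sample lines (including the accession NT_000001) are made up for illustration, and the set of valid terms must come from the SOFA release you actually use.

```python
from collections import Counter


def unrecognized_feature_types(gff3_lines, allowed_types):
    """Count feature types (GFF3 column 3) that are not in allowed_types."""
    counts = Counter()
    for line in gff3_lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # everything after this directive is sequence, not features
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments, and directives
        columns = line.split("\t")
        if len(columns) != 9:
            continue  # not a well-formed nine-column feature line
        feature_type = columns[2]
        if feature_type not in allowed_types:
            counts[feature_type] += 1
    return counts


# Hypothetical allowed set; in practice, fill it from the SOFA terms you use.
allowed = {"gene", "mRNA", "exon", "CDS", "region"}

# Made-up sample lines standing in for the converted GFF3 file.
sample = [
    "##gff-version 3",
    "NT_000001\tGenBank\tgene\t100\t900\t.\t+\t.\tID=gene1",
    "NT_000001\tGenBank\tmisc_feature\t200\t300\t.\t+\t.\tID=misc1",
    "NT_000001\tGenBank\tmisc_feature\t400\t500\t.\t+\t.\tID=misc2",
]

report = unrecognized_feature_types(sample, allowed)
print(dict(report))
```

To check a real file, pass an open file handle instead of the sample list, since the function only needs an iterable of lines.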
Add an Entry for Your Organism
First check whether your organism is already in the organism table:
SELECT genus,species,common_name FROM organism;
If you do not see your organism listed, execute a command equivalent to this:
INSERT INTO organism (abbreviation, genus, species, common_name) VALUES ('H.sapiens', 'Homo', 'sapiens', 'Human');
Substitute in the appropriate values for your own organism.
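The check-then-insert logic above can also be scripted. The sketch below is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module with a four-column stand-in table so that it runs anywhere, whereas a real Chado database runs on PostgreSQL and its organism table has more columns. The function name `ensure_organism` is hypothetical. With a PostgreSQL driver the same two statements would apply, but the placeholder syntax may differ (for example `%s` instead of `?`).

```python
import sqlite3


def ensure_organism(conn, abbreviation, genus, species, common_name):
    """Insert the organism only if no row with this genus and species exists.

    Returns True when a row was inserted, False when one was already present.
    """
    cur = conn.execute(
        "SELECT 1 FROM organism WHERE genus = ? AND species = ?",
        (genus, species),
    )
    if cur.fetchone() is not None:
        return False
    conn.execute(
        "INSERT INTO organism (abbreviation, genus, species, common_name) "
        "VALUES (?, ?, ?, ?)",
        (abbreviation, genus, species, common_name),
    )
    conn.commit()
    return True


# Stand-in table with only the four columns used above (NOT the real Chado schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE organism "
    "(abbreviation TEXT, genus TEXT, species TEXT, common_name TEXT)"
)

first = ensure_organism(conn, "H.sapiens", "Homo", "sapiens", "Human")
second = ensure_organism(conn, "H.sapiens", "Homo", "sapiens", "Human")
rows = conn.execute("SELECT genus, species, common_name FROM organism").fetchall()
print(first, second, rows)
```

Calling the function twice with the same values inserts the row only once, which avoids duplicate organism entries.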
Load the GFF3
Use the load/bin/gmod_bulk_load_gff3.pl script from the GMOD distribution:
>gmod_bulk_load_gff3.pl --gfffile <your gfffile>
If you didn't specify this organism when installing Chado (in response to the question "What is the default organism (common name, or "none")"), then you'll need to add at least the --organism <common_name> flag to the command. See perldoc gmod_bulk_load_gff3.pl for an explanation of the other options this script supports.
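If the load is driven from a script, the command line can be assembled programmatically. This is a minimal Python sketch that uses only the two flags mentioned above (--gfffile and --organism). The helper name `build_load_command` and the file name `my_genome.gff3` are hypothetical, and the script is assumed to be on your PATH.

```python
import shlex


def build_load_command(gff_file, organism=None):
    """Return the argument list for gmod_bulk_load_gff3.pl.

    The --organism flag is added only when a common name is given, for
    the case where no default organism was set when Chado was installed.
    """
    argv = ["gmod_bulk_load_gff3.pl", "--gfffile", gff_file]
    if organism is not None:
        argv += ["--organism", organism]
    return argv


# Hypothetical file name and organism, for illustration only.
cmd = build_load_command("my_genome.gff3", organism="Human")
print(shlex.join(cmd))
```

To actually run the loader, the list can be passed to `subprocess.run(cmd, check=True)`, which raises an error if the script exits with a non-zero status.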
This will load your data into the Chado database. Note that if there are non-SOFA feature types remaining in the GFF3 file, the load will fail when they are encountered. The load is stopped before the database is touched, so you won't have to worry about duplicate data. If that happens, edit the file to fix the incorrect term and load again.
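When the same incorrect term occurs on many lines, the edit can be scripted instead of done by hand. The following Python sketch rewrites only the type column (column 3) of feature lines and leaves comments, directives, and any FASTA section untouched. The function name, the sample lines, and the mapping from misc_feature to region are purely illustrative assumptions; which SOFA term is the right replacement is a curation decision for your data.

```python
def rename_feature_types(gff3_lines, replacements):
    """Return the lines with column 3 rewritten according to replacements."""
    fixed = []
    in_fasta = False
    for line in gff3_lines:
        if line.startswith("##FASTA"):
            in_fasta = True  # sequence data follows; stop touching lines
        columns = line.split("\t")
        is_feature = (
            not in_fasta and not line.startswith("#") and len(columns) == 9
        )
        if is_feature:
            # Keep the original type when no replacement is defined for it.
            columns[2] = replacements.get(columns[2], columns[2])
            line = "\t".join(columns)
        fixed.append(line)
    return fixed


# Hypothetical mapping and made-up sample lines, for illustration only.
mapping = {"misc_feature": "region"}
before = [
    "##gff-version 3\n",
    "NT_000001\tGenBank\tgene\t100\t900\t.\t+\t.\tID=gene1\n",
    "NT_000001\tGenBank\tmisc_feature\t200\t300\t.\t+\t.\tID=misc1\n",
]
after = rename_feature_types(before, mapping)
print("".join(after), end="")
```

Writing the result to a new file rather than overwriting the original keeps the converter's output available for comparison before loading again.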
Please send questions to the GMOD developers list: