Load BLAST Into Chado
From GMOD
Contents |
[edit] Abstract
This HOWTO describes steps to add a BLAST analysis to a Chado database.
[edit] Have an existing Chado genome database
See Load Refseq into Chado for advice on how to load a GenBank Genome into a database. In the following examples, bp_scriptname is from http://Bioperl.org/, and gmod_scriptname is from http://GMOD.org//. As of this date (2007 April) you will need current modules from the BioPerl and GMOD CVS repositories to have this example work.
[edit] Convert BLAST analysis to GFF
For example, match yeast proteins to your genome with tBLASTn, and reformat to GFF v3.
$ncbi/blastall -p tblastn -i MOD_Scer.fa -d dmel4 -o dmel4-modsc.tblastn
First reformat to GFF with the Bioperl bp_search2gff.pl script. The Chado policy here is to put your program and blast query data names into the GFF --source field. The GFF --method field should always be SO term 'match_part'. You also want the --type hit and --target options.
scripts/bp_search2gff.pl --in dmel4-modsc.tblastn \
--out dmel4-modsc.tblastn.gff -version 3 \
--format blast \
--method match_part --source tBLASTn.MOD_Scer \
--type hit --target
Finally clean up the GFF a bit:
perl -pi -e 's/Target=Sequence:/Target=/' dmel4-modsc.tblastn.gff
[edit] BLAST GFF sample for Chado
Result should be formatted like this:
##gff-version 3
# sample tBLASTn yeast protein x fly chromosome 4 (GenBank NC_004353) matches
# GFF formatted for loading to Chado database
NC_004353 tBLASTn.MOD_Scer match_part 141495 141815 48.9 - 0 Target=S000003211 43 156
NC_004353 tBLASTn.MOD_Scer match_part 161699 162793 217 + 0 Target=S000005817 984 1204
NC_004353 tBLASTn.MOD_Scer match_part 160517 161407 185 + 0 Target=S000005817 455 980
# this is a protein match with 2 HSP parts, note the identical Target=
[edit] Load Query Protein sequence to Chado DB
You want to have your query sequences used for BLAST, such as proteins, for reference in your Chado db. The GMOD Bulk_load_gff3 will handle this after converting sequence Fasta to GFF format.
gmod_fasta2gff3.pl --type protein --source SGD --fasta MOD_Scer.fa --gff MOD_Scer.fa.gff gmod_bulk_load_gff3.pl --dbname my_chado_01 --organism yeast --dbxref --gff MOD_Scer.fa.gff
If your query sequence comes from UniProt or GenBank formats, you can instead use the bp_genbank2gff.pl script that will retain more useful annotations for your Chado database. Then BLAST matches can be linked to many known gene/protein attributes.
[edit] Load BLAST result GFF to Chado DB
Use the GMOD Bulk_load_gff3 script for this, indicating the input is --analysis, and the Target names are unique IDs matching above protein features.
gmod_bulk_load_gff3.pl --dbname my_chado_01 --organism fruitfly \ --analysis --unique_target --score raw \ --gff dmel4-modsc.tblastn.gff
Note: If you have pre-loaded the database with all the protein sequence features such as the SGD:S000002445 protein, you should use 'gmod_bulk_load_gff3 --analysis --unique_target' to ensure that Target feature is linked with your Blast result.
[edit] Chado Tables Updated
Then you should see these database updates:
Analysis table entry of the Chado_Companalysis_Module: analysis_id|name|description|program|programversion|algorithm|sourcename|. 10|tBLASTn.MOD_Scer||tBLASTn|null||MOD_Scer| Analysisfeature table entries (1/hsp) analysisfeature_id|feature_id|analysis_id|rawscore|normscore|significance|identity 21|88148|10|117||| << 1st HSP 22|88150|10|91.7||| ... Feature table entry for Hit feature_id 88148 : dev_chado_01b=# select * from feature where feature_id = 88148; feature_id|dbxref_id|organism_id|name|uniquename|residues|seqlen|md5checksum|type_id|is_analysis|is_obsolete| 88148|76797|10|protein_match-auto88148|auto88148||||423|t|f| ... Featureloc entries for Hit feature_id 88148: featureloc_id|feature_id|srcfeature_id|fmin|is_fmin_partial|fmax|is_fmax_partial|strand|phase|residue_info|locgroup|rank 88149|88148|88149|69|f|858|f||||0|1 << Target protein 88148|88148|31118|24052|f|24448|f|1|||0|0 << Source genome Featureloc entries for Target feature_id 88149: 88149|88148|88149|69|f|858|f||||0|1 -- first HSP 88151|88150|88149|69|f|858|f||||0|1 -- second HSP
[edit] See also
[edit] More Information
Please send questions to the GMOD developers list:
gmod-devel@lists.sourceforge.net
[edit] Authors
- Dongilbert 23:24, 3 April 2007 (EDT)
Categories: BLAST | Chado | HOWTO



