Load BLAST Into Chado
Contents
Abstract
This HOWTO describes steps to add a BLAST analysis to a Chado database.
Have an existing Chado genome database
See Load_RefSeq_Into_Chado for example to load a GenBank Genome into a database.
Convert BLAST analysis to GFF
For example, match yeast proteins to your genome with tBLASTn, and reformat to GFF v3.
$ncbi/blastall -p tblastn -i MOD_Scer.fa -d dmel4 -o dmel4-modsc.tblastn
First, patch this problematic policy in Bio::Tools::GFF.pm first, to produce Target values in GFF output:
sub _gff3_string: for my $tag ( @all_tags ) { ### next if $tag eq 'Target'; # ^^^ add comment to block, or change to next if ($origfeat->isa('Bio::SeqFeature::FeaturePair') and $tag eq 'Target');
Then reformat to GFF with the Bioperl bp_search2gff.pl script. The Chado policy here is to put your program and blast query data names into the GFF --source field. The GFF --method field should always be SO term 'match_part'. You also want the --type hit and --target options.
scripts/bp_search2gff.pl --in dmel4-modsc.tblastn \ --out dmel4-modsc.tblastn.gff -version 3 --format blast \ --method match_part --source tBLASTn.MOD_Scer \ --type hit --target
Finally clean up the GFF a bit:
perl -pi -e's/Target=Sequence:/Target=/;' dmel4-modsc.tblastn.gff
BLAST GFF sample for Chado
Result should be formatted like this:
##gff-version 3 # sample tBLASTn yeast protein x fly chromosome match data # GFF formatted for loading to Chado database NC_004353 tBLASTn.MOD_Scer match_part 141495 141815 48.9 - 0 Target=S000003211,43,156 NC_004353 tBLASTn.MOD_Scer match_part 161699 162793 217 + 0 Target=S000005817,984,1204 NC_004353 tBLASTn.MOD_Scer match_part 160517 161407 185 + 0 Target=S000005817,455,980 # this is a protein match in 2 parts: note the identical Target= and Dbxref=
Load Query Protein sequence to Chado DB
You want to have your query sequences used for BLAST, such as proteins, for reference in your Chado db. The GMOD Bulk_load_gff3 will handle this after converting sequence Fasta to GFF format.
gmod_fasta2gff3.pl --type protein --source SGD --fasta MOD_Scer.fa --gff MOD_Scer.fa.gff
gmod_bulk_load_gff3.pl --dbname dev_chado_01 --organism yeast --dbxref --gff MOD_Scer.fa.gff
Load BLAST result GFF to Chado DB
Use the GMOD Bulk_load_gff3 script for this, indicating the input is --analysis, and the Target names are unique IDs matching above protein features.
gmod_bulk_load_gff3.pl --dbname my_chado_01 --organism fruitfly \ --analysis --unique_target --score raw \ --gff dmel4-modsc.tblastn.gff
Note: If you have pre-loaded the database with all the protein sequence features such as the SGD:S000002445 protein, you should use 'gmod_bulk_load_gff3 --analysis --unique_target' to ensure that Target feature is linked with your Blast result.
Chado Tables Updated
Then you should see these database updates:
Analysis table entry: analysis_id|name|description|program|programversion|algorithm|sourcename|. 10|tBLASTn.MOD_Scer||tBLASTn|null||MOD_Scer| Analysisfeature table entries (1/hsp) analysisfeature_id|feature_id|analysis_id|rawscore|normscore|significance|identity 21|88148|10|117||| << 1st HSP 22|88150|10|91.7||| ... Feature table entry for Hit feature_id 88148 : dev_chado_01b=# select * from feature where feature_id = 88148; feature_id|dbxref_id|organism_id|name|uniquename|residues|seqlen|md5checksum|type_id|is_analysis|is_obsolete| 88148|76797|10|protein_match-auto88148|auto88148||||423|t|f| ... Featureloc entries for Hit feature_id 88148: featureloc_id|feature_id|srcfeature_id|fmin|is_fmin_partial|fmax|is_fmax_partial|strand|phase|residue_info|locgroup|rank 88149|88148|88149|69|f|858|f||||0|1 << Target protein 88148|88148|31118|24052|f|24448|f|1|||0|0 << Source genome Featureloc entries for Target feature_id 88149: 88149|88148|88149|69|f|858|f||||0|1 -- first HSP 88151|88150|88149|69|f|858|f||||0|1 -- second HSP
More Information
Please send questions to the GMOD developers list:
gmod-devel@lists.sourceforge.net
Authors
- Dongilbert 23:24, 3 April 2007 (EDT)