Load BLAST Into Chado

From GMOD

Jump to: navigation, search

Contents

[edit] Abstract

This HOWTO describes steps to add a BLAST analysis to a Chado database.

[edit] Have an existing Chado genome database

See Load Refseq into Chado for advice on how to load a GenBank Genome into a database. In the following examples, bp_scriptname is from http://Bioperl.org/, and gmod_scriptname is from http://GMOD.org//. As of this date (2007 April) you will need current modules from the BioPerl and GMOD CVS repositories to have this example work.

[edit] Convert BLAST analysis to GFF

For example, match yeast proteins to your genome with tBLASTn, and reformat to GFF v3.

 $ncbi/blastall -p tblastn -i MOD_Scer.fa -d dmel4 -o dmel4-modsc.tblastn
 

First reformat to GFF with the Bioperl bp_search2gff.pl script. The Chado policy here is to put your program and blast query data names into the GFF --source field. The GFF --method field should always be SO term 'match_part'. You also want the --type hit and --target options.

 scripts/bp_search2gff.pl --in dmel4-modsc.tblastn \
     --out dmel4-modsc.tblastn.gff -version 3 \
     --format blast \ 
     --method match_part --source tBLASTn.MOD_Scer \
     --type hit --target 

Finally clean up the GFF a bit:

  perl -pi -e 's/Target=Sequence:/Target=/' dmel4-modsc.tblastn.gff

[edit] BLAST GFF sample for Chado

Result should be formatted like this:

 ##gff-version 3
 # sample tBLASTn yeast protein x fly chromosome 4 (GenBank NC_004353) matches
 # GFF formatted for loading to Chado database 
 
 NC_004353 tBLASTn.MOD_Scer  match_part  141495  141815  48.9  - 0 Target=S000003211 43 156
 
 NC_004353 tBLASTn.MOD_Scer  match_part  161699  162793  217 + 0 Target=S000005817 984 1204
 NC_004353 tBLASTn.MOD_Scer  match_part  160517  161407  185 + 0 Target=S000005817 455 980
    # this is a protein match with 2 HSP parts, note the identical Target=

[edit] Load Query Protein sequence to Chado DB

You want to have your query sequences used for BLAST, such as proteins, for reference in your Chado db. The GMOD Bulk_load_gff3 will handle this after converting sequence Fasta to GFF format.

 gmod_fasta2gff3.pl  --type protein --source SGD --fasta MOD_Scer.fa --gff MOD_Scer.fa.gff 
 gmod_bulk_load_gff3.pl --dbname my_chado_01  --organism yeast --dbxref --gff MOD_Scer.fa.gff

If your query sequence comes from UniProt or GenBank formats, you can instead use the bp_genbank2gff.pl script that will retain more useful annotations for your Chado database. Then BLAST matches can be linked to many known gene/protein attributes.

[edit] Load BLAST result GFF to Chado DB

Use the GMOD Bulk_load_gff3 script for this, indicating the input is --analysis, and the Target names are unique IDs matching above protein features.

 gmod_bulk_load_gff3.pl --dbname my_chado_01  --organism fruitfly \
   --analysis --unique_target --score raw  \
   --gff dmel4-modsc.tblastn.gff 

Note: If you have pre-loaded the database with all the protein sequence features such as the SGD:S000002445 protein, you should use 'gmod_bulk_load_gff3 --analysis --unique_target' to ensure that Target feature is linked with your Blast result.

[edit] Chado Tables Updated

Then you should see these database updates:

   Analysis table entry of the Chado_Companalysis_Module:
 analysis_id|name|description|program|programversion|algorithm|sourcename|.
 10|tBLASTn.MOD_Scer||tBLASTn|null||MOD_Scer|
 
   Analysisfeature table entries (1/hsp)
 analysisfeature_id|feature_id|analysis_id|rawscore|normscore|significance|identity
 21|88148|10|117|||    << 1st HSP
 22|88150|10|91.7|||
 ...
 
   Feature table entry for Hit feature_id 88148 :
 dev_chado_01b=# select * from feature where feature_id = 88148;
 feature_id|dbxref_id|organism_id|name|uniquename|residues|seqlen|md5checksum|type_id|is_analysis|is_obsolete|
 88148|76797|10|protein_match-auto88148|auto88148||||423|t|f|
 ...
 
   Featureloc entries for Hit feature_id 88148:
 featureloc_id|feature_id|srcfeature_id|fmin|is_fmin_partial|fmax|is_fmax_partial|strand|phase|residue_info|locgroup|rank
 88149|88148|88149|69|f|858|f||||0|1        << Target protein 
 88148|88148|31118|24052|f|24448|f|1|||0|0  << Source genome
   
   Featureloc entries for Target feature_id 88149:
 88149|88148|88149|69|f|858|f||||0|1  -- first HSP
 88151|88150|88149|69|f|858|f||||0|1  -- second HSP

[edit] See also

[edit] More Information

Please send questions to the GMOD developers list:

gmod-devel@lists.sourceforge.net

[edit] Authors

Personal tools
developers