Load ncbi taxonomy

From GMOD
Revision as of 23:29, 29 December 2010 by Clements (Talk | contribs)

load_ncbi_taxonomy.pl is a Perl script for loading NCBI taxonomy trees into the phylotree table. This script was contributed by Naama Menda at the Sol Genomics Network (SGN), led by Lukas Mueller.

Where to find it

GMOD 1.1

In the 1.1 release, load_ncbi_taxonomy.pl is installed along with the other scripts in the distribution and will typically go in /usr/bin or /usr/local/bin.

Command line options

  • -H hostname for database [required if -g isn't used]
  • -D database name [required if -g isn't used]
  • -g GMOD database profile name (can provide host and DB name) Default: default
  • -p phylotree name (optional - defaults to NCBI taxonomy tree. You want to set this if you plan to load more than one tree)
  • -i input file - list of taxonomy ids to be stored (optional - without this, the entire NCBI taxonomy will be loaded)
  • -v verbose output
  • -t trial mode. Don't perform any store operations at all. (trial mode cannot test inserting associated data for new terms)

To store phylonodes, the script first stores a new phylotree with the name 'NCBI taxonomy tree'. Each organism gets a phylonode id and is initially stored in a temporary table, because each phylonode (except for the root) has a parent_phylonode_id, which is an internal foreign key. Next, each phylonode gets left and right indexes, which are calculated by walking down the entire tree structure (see the article by Aaron Mackey: http://www.oreillynet.com/pub/a/network/2002/11/27/bioconf.html?page=2). Only after the indexes have been calculated for every phylonode is the phylonode table populated from the temporary table.
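
The left/right indexing step can be illustrated with a small sketch. This is not the code from load_ncbi_taxonomy.pl (which is written in Perl); it is a minimal Python illustration of the general nested-set idea, assuming the tree is available as a mapping from each node id to its parent id. The function name assign_nested_set_indexes, the parent_of mapping, and the sample ids are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def assign_nested_set_indexes(parent_of):
    """Return {node_id: (left, right)} for a tree given as node -> parent.

    parent_of maps each node id to its parent id; the root maps to None.
    This is an illustrative sketch, not the loader's actual implementation.
    """
    children = defaultdict(list)
    root = None
    for node, parent in parent_of.items():
        if parent is None:
            root = node
        else:
            children[parent].append(node)
    if root is None:
        raise ValueError("tree has no root (no node with parent None)")

    left = {}
    indexes = {}
    counter = 1
    # Iterative depth-first walk, so very deep trees do not hit
    # Python's recursion limit.
    stack = [(root, False)]
    while stack:
        node, children_done = stack.pop()
        if children_done:
            # All descendants are numbered: assign the right index.
            indexes[node] = (left[node], counter)
            counter += 1
        else:
            # First visit: assign the left index, then queue the children.
            left[node] = counter
            counter += 1
            stack.append((node, True))
            for child in sorted(children[node], reverse=True):
                stack.append((child, False))
    return indexes


# Hypothetical four-node tree: 1 is the root, 2 and 3 are its children,
# and 4 is a child of 2.
example = assign_nested_set_indexes({1: None, 2: 1, 3: 1, 4: 2})
print(example)
```

With indexes assigned this way, a node is a descendant of another exactly when its left and right values fall strictly between the ancestor's left and right values, which is why the indexes must be complete for every node before the final table is populated.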