Arthropod Genomics 2011/Genome Project 101 Workshop

From GMOD
Revision as of 18:24, 8 June 2011 by Scott (Talk | contribs)

This page will be used for the Genome 101 Workshop at Arthropod Genomics 2011.

VMware Image

A VMware image will be made available to workshop participants. We will use this image during the workshop.

System Configuration

This section tracks what we did to create the VMware image.

Operating System Ubuntu 11.04, 64 bit client. This is a popular Linux distribution.
Memory 2 GB. If you run this on a system that has 2 gigabytes or less of memory, please decrease this number.
Disk 80 GB. This is allocated 2 GB at a time, as needed, by VMware.
Networking NAT
Username gmod
Password

Installed Prerequisite Software

GMOD components have a variety of prerequisite software that needs to be installed. Here is a list of what was installed so we could install and run GMOD software.

Software | How | Comments
Mercurial | sudo apt-get install mercurial | Revision control system used by Galaxy
Microsoft TrueType core fonts | sudo apt-get install ttf-mscorefonts-installer | Used by Galaxy
python-dev | sudo apt-get install python-dev | Used by Galaxy
python-setuptools | sudo apt-get install python-setuptools | Used by Galaxy
python-pip | sudo apt-get install python-pip | Used by Galaxy
bx-python scripts | sudo pip install bx-python | Scripts used by Galaxy
Python 2.6 | sudo apt-get install python2.6 | Ubuntu 11.04 comes with Python 2.7, which Galaxy does not like. This installs 2.6 in parallel.
Graphics libraries | sudo apt-get install libgd2-xpm-dev libgd-gd2-perl libgd-tools libgd-svg-perl | Used by GBrowse
System utilities and web server | sudo apt-get install autoconf apache2 | Used by GBrowse and Chado
Database server | sudo apt-get install postgresql postgresql-client | Used by Chado and GBrowse
Variety of perl modules | sudo apt-get install libcgi-session-perl libdbd-pg-perl libdigest-md5-file-perl libclass-base-perl libmodule-build-perl libstatistics-descriptive-perl libtemplate-perl libxml-simple-perl liblog-log4perl-perl libparse-recdescent-perl libsql-translator-perl perl-doc | Used by Chado and GBrowse
 | sudo apt-get install libjson-xs-perl libdevel-size-perl |
Perl graphics library | cpan> install GD | Used by JBrowse and GBrowse
BioPerl libraries | cpan> install Bio::Perl Bio::Graphics JSON | Used by JBrowse, GBrowse and Chado
GBrowse Chado adaptor | cpan> install Bio::DB::Das::Chado | Used by JBrowse, GBrowse and Chado
More perl libraries | cpan> install GO::Parser Module::Load DBIx::DBSchema XML::Parser::PerlSAX | Used by Chado
Even more perl libraries | cpan> install Heap::Simple Heap::Simple::Perl Heap::Simple::XS PerlIO::gzip | Used by JBrowse
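
As a quick sanity check after the installs above, a small shell function can report which expected commands are still missing from the PATH. This is only a sketch: missing_commands is a hypothetical helper name, and the command names in the usage comment are assumed to be the ones provided by the packages listed in the table.

```shell
# Print the name of every command given as an argument that cannot be
# found on the current PATH. Prints nothing if all of them are present.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# Example (assumed command names for the packages above):
#   missing_commands hg psql pip autoconf python2.6
```

An empty result suggests the command-line prerequisites are in place; it says nothing about the Perl modules, which would need a separate check.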

PostgreSQL Configuration

The postgresql server will be set up with fairly unrestricted access to make life easier during the tutorial. If used "in real life", the configuration should be tightened down quite a bit.

Edit config file

 sudo su -
 vi /etc/postgresql/8.4/main/pg_hba.conf

Change the bottom lines to look like this:

 # "local" is for Unix domain socket connections only
 local   all         all                               trust
 # IPv4 local connections:
 host    all         all         127.0.0.1/32          trust
 # IPv6 local connections:
 host    all         all         ::1/128               trust

by replacing the text in the last column with "trust", as shown here (that's the insecure part!). Then restart the postgresql server:

 /etc/init.d/postgresql restart
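
The same edit can be sketched non-interactively. The function below is a hypothetical helper, not something shipped with PostgreSQL. It assumes that every active "local" or "host" rule ends with the authentication method (no trailing options), and it prints the result instead of editing in place so the output can be reviewed before it is copied over the real file as root.

```shell
# Print FILE with the last field of every active "local" or "host" rule
# replaced by "trust". Comment lines and blank lines pass through unchanged.
# Note: awk rebuilds each modified line with single spaces between fields,
# so the column alignment of rewritten rules is lost.
trust_all_rules() {
    awk '$1 == "local" || $1 == "host" { $NF = "trust" } { print }' "$1"
}

# Example (path taken from the step above; output file name is arbitrary):
#   trust_all_rules /etc/postgresql/8.4/main/pg_hba.conf > /tmp/pg_hba.conf.new
```

As with the manual edit, this leaves the server open to any local user and is only appropriate for a tutorial machine.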

Then, switch users to the "postgres" user and create a new user called "gmod":

 su - postgres
 createuser gmod
   Shall the new role be a superuser? (y/n) y
 exit  # to leave postgres user shell
 exit  # to leave root shell

Install DBIx::DBStag

This is a perl module that can only be installed after PostgreSQL is configured, so it is installed now. First, create a database called "test":

 createdb test

Then install via the cpan shell:

 cpan
 cpan> install DBIx::DBStag

Note that installing this module via the cpan shell is difficult if you typically run cpan as root (for example, "sudo cpan"). If you instead run cpan as a regular user, with it configured to do "sudo make install" and "sudo ./Build install", the install is easy and works correctly.

GMOD Components

MAKER Web Annotation Service

While we could install MAKER locally on this machine, it is convenient to use the web service provided by Mark Yandell's group at the University of Utah. To use it, go to

 http://derringer.genetics.utah.edu/cgi-bin/MWAS/maker.cgi

and create a free account (I created one for this tutorial with a user name of gmodags). Once the account is created, we can upload some sample data. I put the sample data that I used in ~/Downloads/MAKER_input, where there are three files:

  • pyu-contig.fasta - a FASTA file containing a 1.7 MB contig
  • pyu-est.fasta - a set of assembled 454 read ESTs from P. ultimum and a related organism
  • pyu-protein.fasta - a set of protein sequences from a related organism

After clicking on the "New Job" tab, I uploaded all three files in the appropriate spot, ignoring the others:

[Screenshots: MAKER contig.png, MAkER est.png, MAKER protien.png]

After uploading these, I pressed "Add to Job Queue" to get it started running. The job waited under an hour before starting, and then finished in under three hours.

Upon finishing, I was presented with multiple ways of looking at the data:

[Screenshot: MAKER download.png]

After taking a quick look at both GBrowse and JBrowse, I downloaded the data to the machine (in ~/Downloads/3263.maker.output). The GFF file in this directory will be loaded into Chado.

Galaxy

The default Python on Ubuntu 11.04 is 2.7, but we need 2.6 to run Galaxy. Following the instructions on the GetGalaxy wiki page, Python 2.6 was installed and placed at the front of the path:

 mkdir ~/galaxy-python
 ln -s /path/to/python2.6 ~/galaxy-python/python

~/.bashrc was edited and these lines were added to the end.

 # Use Python 2.6 for Galaxy
 export PATH=~/galaxy-python:$PATH
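
The symlink-and-PATH steps above can be wrapped in one function, which also makes it easy to confirm that "python" now resolves to the intended interpreter. This is a sketch only: use_python_shim is a hypothetical name, and /usr/bin/python2.6 in the usage comment is an assumption about where the python2.6 package puts its binary.

```shell
# Create SHIM_DIR/python as a symlink to INTERPRETER and put SHIM_DIR at the
# front of PATH for the current shell, so that "python" resolves to it first.
use_python_shim() {
    shim_dir=$1
    interpreter=$2
    mkdir -p "$shim_dir"
    ln -sf "$interpreter" "$shim_dir/python"
    PATH="$shim_dir:$PATH"
    export PATH
}

# Example (interpreter path is an assumption):
#   use_python_shim ~/galaxy-python /usr/bin/python2.6
#   command -v python    # should print the path inside ~/galaxy-python
```

The function only affects the shell it runs in; the export line in ~/.bashrc is still what makes the change persistent.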

Galaxy was then downloaded:

 cd ~/Documents
 mkdir work
 cd work
 hg clone http://bitbucket.org/galaxy/galaxy-dist

And we then customized the landing image for this conference. (Details are not important.)

And now we can start it:

 cd galaxy-dist
 sh run.sh

Galaxy is now installed and running. Go to http://localhost:8080.

Chado

Get Chado from SourceForge; point a browser at

 http://sourceforge.net/projects/gmod/files/gmod/chado-1.11/chado-1.11.tar.gz/download

and extract the files:

 cd ~/Downloads
 tar zxvf chado-1.11.tar.gz
 cd chado-1.11

Set up some environment variables:

 vi ~/.bashrc

and add these lines to the bottom:

 export GMOD_ROOT=/usr/local/gmod
 export CHADO_DB_NAME=chado
 export CHADO_DB_USERNAME=gmod

save .bashrc and source it so that the values are available in the shell:

 source ~/.bashrc
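
Because the Chado install steps depend on these variables, it may help to confirm they are visible in the current shell before running perl Makefile.PL. A minimal sketch; missing_chado_vars is a hypothetical helper name, and it checks only the three variables set above.

```shell
# Print the name of each Chado-related environment variable that is
# unset or empty in the current shell. Prints nothing if all are set.
missing_chado_vars() {
    for name in GMOD_ROOT CHADO_DB_NAME CHADO_DB_USERNAME; do
        eval "value=\${$name:-}"
        [ -n "$value" ] || printf '%s\n' "$name"
    done
}

# Example:
#   missing_chado_vars    # no output means all three variables are set
```

If any names are printed, re-check the lines added to ~/.bashrc and source it again.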

Now to install Chado:

 perl Makefile.PL

Accept all of the defaults except for the "default organism" question. Put "pythium" here.

 make
 sudo make install
 make load_schema   #ignore the error about a chado database not existing
 make prepdb
 make ontologies
   answer with 1,2,4

Add our organism to the database:

 psql chado
 psql> INSERT INTO organism ( abbreviation, genus, species, common_name)
       VALUES ('P.ultimum','Pythium','ultimum','pythium');

Make a database dump that saves progress to this point:

 pg_dump chado | bzip2 -c > ontologies_only_dbdump.bz2

Loading Data

Note that if MAKER is installed locally, it provides a tool to automatically load data from it straight into Chado. Since we don't have that, we'll load "by hand".

Go to where the data is:

 cd ~/Downloads/3263.maker.output/

There are several files here, but the one we are interested in at the moment is 3263.all.gff, which has all of the "raw material" annotations made by MAKER as well as the final gene calls. We'll load these into Chado using a tool that came with Chado:

 gmod_bulk_load_gff3.pl -a --noexon -g 3263.all.gff

where the -a tells the loader that these are computational results (as opposed to human-curated annotations), and the --noexon tells the loader not to create exon features that correspond to the CDS features, because the exon features are already present in the GFF. This load takes a few minutes.
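
Before loading, it can be worth confirming that the GFF really does contain exon features, since that is what justifies --noexon. The following is a minimal sketch, assuming a standard nine-column, tab-separated GFF3 file with an optional ##FASTA section at the end; gff_type_counts is a hypothetical helper name and is not part of Chado or MAKER.

```shell
# Print a sorted "type<TAB>count" line for each feature type (column 3)
# found in a GFF3 file. Comment lines and malformed lines are skipped, and
# reading stops at an embedded ##FASTA section.
gff_type_counts() {
    awk -F '\t' '
        /^##FASTA/ { exit }
        /^#/ || NF < 9 { next }
        { count[$3]++ }
        END { for (t in count) print t "\t" count[t] }
    ' "$1" | sort
}

# Example:
#   gff_type_counts 3263.all.gff
```

If an "exon" line shows up in the output, the file already carries exon features and the loader does not need to derive them from the CDS features.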

GBrowse

GBrowse can be installed directly from the cpan shell, just as several of its prerequisites were:

 cpan
 cpan> install Bio::Graphics::Browser2

Accept all of the defaults when asked questions.

JBrowse

Get JBrowse and unpack:

 wget http://jbrowse.org/releases/jbrowse-1.2.1.zip
 unzip jbrowse-1.2.1.zip
 sudo cp -r jbrowse-1.2.1/ /var/www/jbrowse/

Get a conf file:

 cd ~/Downloads/
 wget https://gist.github.com/raw/1014946/65ab0c150984d7bed47150d82da5026d960406f3/pythium.conf

Setting up data:

 cd /var/www/
 sudo chown -R gmod:gmod jbrowse
 cd jbrowse
 bin/prepare-refseqs.pl --conf ~/Downloads/pythium.conf --refs scf1117875582023

This gets the "reference sequence", that is, the contig we're working on. At this point, visiting the JBrowse URL (http://gmodags.bx.psu.edu/jbrowse) would show an empty JBrowse user interface, though if you zoom all the way in, you'd see DNA residues.

Now, get more data:

 bin/biodb-to-json.pl --conf ~/Downloads/pythium.conf

This talks to the Chado database and extracts all of the evidence and predictions that were used and created in the MAKER analysis.