GBrowse syn PAG tutorial



This tutorial walks you through how to install and configure the GBrowse_syn comparative genomics viewer. This tutorial was originally taught by Sheldon McKay at the 2009 GMOD Schools - Europe & Americas. The notes and VMware image used on this page are from the Europe course.



VMware

This tutorial was taught using a VMware system image as a starting point. If you want to start with that same system, download and install the Starting image.

See VMware for what software you need to use a VMware system image, and for directions on how to get the image setup and running on your machine.

Download Links will be added here

Caveats

Important Note

This tutorial describes the world as it existed on the day the tutorial was given. Please be aware that things like CPAN modules, Java libraries, and Linux packages change over time, so the instructions in the tutorial will slowly drift out of date. Newer versions of tutorials will be posted as they become available.

The Generic Synteny Browser

GBrowse_syn, as implemented at WormBase

GBrowse_syn, or the Generic Synteny Browser, is a GBrowse-based synteny browser designed to display multiple genomes, with a central reference species compared to two or more additional species.  It can be used to view multiple sequence alignment data, synteny or co-linearity data from other sources against genome annotations provided by GBrowse. GBrowse_syn is included with the standard GBrowse package (version 1.69 and later).  Working examples can be seen at TAIR, WormBase, and SGN.

Gbrowse_syn Introduction

GBrowse_syn Documentation

There is detailed documentation on the GMOD wiki for how to install, configure and use GBrowse_syn. To get started, browse these pages:

Whole Genome Alignments

The focus of this section of the course is on dealing with alignment or synteny data and using GBrowse_syn. However, generating whole genome alignments, identifying orthologous regions, and related tasks are subjects of considerable interest, so some background reading is listed below:

Installing GBrowse_syn

GBrowse_syn is part of the GBrowse package and was pre-installed when you went through the GBrowse installation.

Installing Gbrowse_syn

This is the same as installing GBrowse; most of the prerequisites are already loaded. We will use the script gbrowse_netinstall.pl to load bioperl-live, GBrowse 1.7 and Bio::Graphics.

$ cd ~/build
$ sudo perl gbrowse_netinstall.pl -d

NOTE: the -d flag is essential to get the latest code.

Configuration of GBrowse_syn

The example we will use is a two-species comparison of rice (Oryza sativa) and one of its wild relatives*

*Data courtesy of Bonnie Hurwitz; sequences and names have been obfuscated to protect unpublished data

The instructions for downloading these data to the Ubuntu virtual disk:

$ tar xjvf gbrowse_syn_PAG.tar.bz2
$ cd ~/data/gbrowse_syn

Create a MySQL database

  • GBrowse_syn uses a "joining" database to store all of the alignment data
  • The first thing we need to do is create a MySQL alignment database using the command-line incantation below:
$ mysql -uroot -e 'create database rice_synteny'
  • Then make sure the web user "nobody" can read the database. Pay special attention to the quotes!
$ mysql -uroot -e "GRANT SELECT on rice_synteny.* to 'nobody'@'localhost'"

Loading the alignment data

The alignment data file

Have a look at the input data in clustalw format:

 $ cd ~/data/gbrowse_syn/rice
 $ more data/rice.aln

CLUSTAL W(1.81) multiple sequence alignment W(1.81)


rice-3(+)/16598648-16600199      ggaggccggccgtctgccatgcgtgagccagacggggcgggccggagacaggccacgtgg
wild_rice-3(+)/14467855-14469373 gggggccgg------------------------------------agacaggccacgtgg
                                 ** ******                                    ***************


rice-3(+)/16598648-16600199      ccctgccccgggctgttgacccactggcacccctgtcccgggttgtcgccctcctttccc
wild_rice-3(+)/14467855-14469373 ccctgccccgggctgttgacccactggcacccctgtcccgggttgtcgccctcctttccc
                                 ************************************************************


rice-3(+)/16598648-16600199      cgccatgctctaagtttgctcctcttctcgaacttctctctttgattcttcacgtcctct
wild_rice-3(+)/14467855-14469373 cgccatgctctaagtttgctcctcttctcgaacttctctctttgattcttcacgtcctct
                                 ************************************************************



rice-3(+)/16598648-16600199      tggagcctccccttctagctcgatcacgctctgctcttccgcttggaggctggcaaaact
wild_rice-3(+)/14467855-14469373 tggagcctccccttctagctcgatcgcgctctgctcttccgcttggaggctggcaaaact
                                 ************************* **********************************

Note on CLUSTALW

These data are in clustalw format. The scripts used to process these data will recognize clustalw and other commonly used formats recognized by BioPerl's AlignIO parser. This does not mean that clustalw is the actual program used to generate the alignment data.

Note on the sequence ID syntax

The sequence ID in this clustal file is overloaded to contain information about the species, strand and coordinates. This information is essential:

 rice-3(+)/16598648-16600199
 species-refseq(strand)/start-end
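
As an illustration of the ID syntax above, the following is a minimal sketch that splits such an ID into its five parts. The function name parse_seq_id is hypothetical and is not part of GBrowse_syn; the loading script does its own parsing.

```shell
# Split an overloaded sequence ID of the form
#   species-refseq(strand)/start-end
# into its parts. parse_seq_id is a hypothetical helper for illustration only.
parse_seq_id() {
    parsed=$(printf '%s\n' "$1" | sed -n \
        's|^\(..*\)-\([^-()][^-()]*\)(\([+-]\))/\([0-9][0-9]*\)-\([0-9][0-9]*\)$|species=\1 refseq=\2 strand=\3 start=\4 end=\5|p')
    if [ -z "$parsed" ]; then
        # Nothing was printed by sed, so the ID did not match the expected syntax
        echo "unrecognized sequence ID: $1" >&2
        return 1
    fi
    printf '%s\n' "$parsed"
}

parse_seq_id 'rice-3(+)/16598648-16600199'
parse_seq_id 'wild_rice-3(+)/14467855-14469373'
```

Because the species name itself may contain underscores (wild_rice), the sketch treats the last hyphen before the strand as the separator between species and reference sequence.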

The database loading script

Then, we will load the database.

We are using the options:
-u root          -- username is root
-d rice_synteny  -- use database rice_synteny
-f clustalw      -- use clustalw format
-c               -- initialize a new (or overwrite the old) database
-v               -- print information about what is happening
Other available options that we do not need here:
-p password      -- not used because the root user has no password
                    in this implementation
-n               -- do not calculate map coordinates (faster)


We will be running the script with this command line incantation (see below):

$ ../bin/load_alignments_msa.pl -u root -d rice_synteny -format clustalw -c -v data/rice.aln

Running in the background with the linux screen command

Using screen: Running the script as we are below is time-consuming, so we will use a screen session to run it in the background while we turn our attention to downstream tasks. [more information on 'screen'...]

  • When entering screen mode, hit 'space' to clear the first screen if a message appears.
  • If your backspace key does not work in screen mode, use ^H (ctrl key + H key).
gmod@ubuntu:~/data/gbrowse_syn/rice/data$ screen -S load1
gmod@ubuntu:~/data/gbrowse_syn/rice/data$ ~/data/gbrowse_syn/bin/load_alignments_msa.pl -u root -d rice_synteny -format clustalw -v rice.aln -c
Processing alignment file rice.aln...
Processing Multiple Sequence Alignment 1 (length 1557)
Processing Multiple Sequence Alignment 2 (length 11275)
Processing Multiple Sequence Alignment 3 (length 3526)
Processing Multiple Sequence Alignment 4 (length 5992)
Processing Multiple Sequence Alignment 5 (length 24267)
Processing Multiple Sequence Alignment 6 (length 697)
Processing Multiple Sequence Alignment 7 (length 6798)
Processing Multiple Sequence Alignment 8 (length 4760)
Processing Multiple Sequence Alignment 9 (length 4595)
Processing Multiple Sequence Alignment 10 (length 95)
Processing Multiple Sequence Alignment 11 (length 479)
Processing Multiple Sequence Alignment 12 (length 9123)
Processing Multiple Sequence Alignment 13 (length 80)
Processing Multiple Sequence Alignment 14 (length 11864)
Processing Multiple Sequence Alignment 15 (length 775)
etc...
  • This will go on for some time (there are 1800 alignments), so we will let the screen run in the background and work on our other tasks. We do this like so:
  1. hit ^A (ctrl key + A key), then release
  2. hit the D key, which will detach the screen (continues to run in the background)
  • We can check back later like so:
$ screen -r load1
  • If the job is done, we can exit the session by typing 'exit' at the command prompt.
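
To judge how far the loader has progressed, it can help to know how many alignments the input file holds. The sketch below counts header lines; it assumes that every alignment in the file starts with its own "CLUSTAL" header line, and count_alignments is a hypothetical helper name. The sample file is a made-up stand-in so the sketch runs anywhere.

```shell
# Count the alignments in a concatenated clustalw file by counting header lines.
# Assumption: each alignment block begins with its own "CLUSTAL ..." header.
count_alignments() {
    grep -c '^CLUSTAL' "$1"
}

# Made-up two-alignment stand-in file; use data/rice.aln for the real data.
sample=$(mktemp)
cat > "$sample" <<'EOF'
CLUSTAL W(1.81) multiple sequence alignment

rice-3(+)/1-10       acgtacgtac
wild_rice-3(+)/5-14  acgtacgtac
                     **********

CLUSTAL W(1.81) multiple sequence alignment

rice-3(+)/20-29      ttgcattgca
wild_rice-3(+)/31-40 ttgcattgca
                     **********
EOF

count_alignments "$sample"
rm -f "$sample"
```

Comparing this count with the number in the most recent "Processing Multiple Sequence Alignment" line gives a rough progress estimate.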

Setting up the species' databases

GFF3

Let's have a look at the GFF3 data:

$ more rice.gff3
##gff-version 3
##sequence-region 3 1 19401704
3       ensembl gene    78      1849    .       -       .       ID=3_FG2548;Name=3_FG2548;biotype=protein_coding
3       ensembl mRNA    78      1849    .       -       .       ID=3_FGT2548;Parent=3_FG2548;Name=3_FGT2548;biotype=protein_coding
3       ensembl CDS     1645    1849    .       -       0       Parent=3_FGT2548;Name=CDS.12
3       ensembl CDS     1444    1547    .       -       1       Parent=3_FGT2548;Name=CDS.13
3       ensembl CDS     999     1144    .       -       0       Parent=3_FGT2548;Name=CDS.14
3       ensembl CDS     799     913     .       -       2       Parent=3_FGT2548;Name=CDS.15
3       ensembl CDS     646     786     .       -       0       Parent=3_FGT2548;Name=CDS.16
3       ensembl CDS     78      215     .       -       0       Parent=3_FGT2548;Name=CDS.17
3       ensembl gene    4910    5518    .       +       .       ID=3_FG2546;Name=3_FG2546;biotype=protein_coding
3       ensembl mRNA    4910    5518    .       +       .       ID=3_FGT2546;Parent=3_FG2546;Name=3_FGT2546;biotype=protein_coding
3       ensembl CDS     4910    5518    .       +       0       Parent=3_FGT2546;Name=CDS.19
3       ensembl gene    5743    6351    .       -       .       ID=3_FG2565;Name=3_FG2565;biotype=protein_coding
3       ensembl mRNA    5743    6351    .       -       .       ID=3_FGT2565;Parent=3_FG2565;Name=3_FGT2565;biotype=protein_coding
3       ensembl CDS     5743    6351    .       -       0       Parent=3_FGT2565;Name=CDS.21
3       ensembl gene    10979   16914   .       +       .       ID=3_FG2570;Name=3_FG2570;biotype=protein_coding
3       ensembl mRNA    10979   16914   .       +       .       ID=3_FGT2570;Parent=3_FG2570;Name=3_FGT2570;biotype=protein_coding
3       ensembl CDS     10979   11592   .       +       0       Parent=3_FGT2570;Name=CDS.29
3       ensembl CDS     11670   13317   .       +       2       Parent=3_FGT2570;Name=CDS.30
3       ensembl CDS     13390   14204   .       +       0       Parent=3_FGT2570;Name=CDS.31
3       ensembl CDS     14433   16914   .       +       2       Parent=3_FGT2570;Name=CDS.32

Some key things to note:

The ##sequence-region directive 
is used to create a reference sequence named 3, which is the scaffold on which all of the other features in the file are located
The 'gene' features 
are the top-level parent features. The 'mRNA' and 'CDS' features are children of the gene. The containment hierarchy is organized using the 'Parent' tag. The CDSs are children of the mRNA, which is in turn a child of the gene. For display purposes, we only need to worry about the gene.
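
The containment hierarchy described above can be made visible by printing the ID and Parent of each feature. This is a minimal sketch; show_hierarchy is a hypothetical helper name, and the sample records are copied from the listing above and rebuilt with real tabs, since GFF3 columns are tab-separated.

```shell
# Print "type<TAB>ID<TAB>Parent" for every feature in a GFF3 file.
# show_hierarchy is a hypothetical helper name used only in this sketch.
show_hierarchy() {
    awk -F '\t' '
        /^#/ || NF < 9 { next }   # skip directives, comments and short lines
        {
            id = "-"; parent = "-"
            n = split($9, attrs, ";")
            for (i = 1; i <= n; i++) {
                if (attrs[i] ~ /^ID=/)     id = substr(attrs[i], 4)
                if (attrs[i] ~ /^Parent=/) parent = substr(attrs[i], 8)
            }
            printf "%s\t%s\t%s\n", $3, id, parent
        }
    ' "$1"
}

# Three records from rice.gff3; use the real file in place of this stand-in.
sample=$(mktemp)
{
    printf '##gff-version 3\n'
    printf '3\tensembl\tgene\t4910\t5518\t.\t+\t.\tID=3_FG2546;Name=3_FG2546;biotype=protein_coding\n'
    printf '3\tensembl\tmRNA\t4910\t5518\t.\t+\t.\tID=3_FGT2546;Parent=3_FG2546;Name=3_FGT2546;biotype=protein_coding\n'
    printf '3\tensembl\tCDS\t4910\t5518\t.\t+\t0\tParent=3_FGT2546;Name=CDS.19\n'
} > "$sample"

show_hierarchy "$sample"
rm -f "$sample"
```

A "-" in the last column marks a top-level feature (the gene); a "-" in the ID column marks a feature without an ID of its own, such as the CDS records here.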

Loading

Note: before we load the GFF3 databases, we need to create a database for each species and give the web user 'nobody' read privileges. Let's create a little SQL script to make this easier:

  • This is just a list of SQL commands that give instructions to the mysql database manager, which we can pass via STDIN
  • create a file create_species_dbs.sql with the contents below.

CREATE DATABASE rice;
CREATE DATABASE wild_rice;
GRANT SELECT on rice.* TO 'nobody'@'localhost';
GRANT SELECT on wild_rice.* TO 'nobody'@'localhost';
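
With more species (for example the five-genome setup in the optional section), typing these statements by hand becomes tedious. The sketch below generates the same statements from a list of database names; make_species_sql is a hypothetical helper, not something shipped with GBrowse.

```shell
# Emit CREATE DATABASE and GRANT statements for each database name given.
# make_species_sql is a hypothetical helper; its output mirrors create_species_dbs.sql.
make_species_sql() {
    for db in "$@"; do
        printf 'CREATE DATABASE %s;\n' "$db"
    done
    for db in "$@"; do
        printf "GRANT SELECT on %s.* TO 'nobody'@'localhost';\n" "$db"
    done
}

make_species_sql rice wild_rice
```

Redirecting the output (make_species_sql rice wild_rice > create_species_dbs.sql) would produce a file with the same four statements.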

  • Then we can run the commands like so:
gmod@ubuntu:~/data/gbrowse_syn/rice/data$ mysql -uroot <create_species_dbs.sql
  • Make sure we are in the location of the GFF data files
$ cd ~/data/gbrowse_syn/rice/data
  • The script we need is bp_seqfeature_load.pl, which comes pre-installed with bioperl-live
  • The -f option means "fast load"
  • The -c option means complete (or destructive) load. It would overwrite previously loaded 'rice' databases

Load the rice data...

gmod@ubuntu:~/data/gbrowse_syn/rice/data$ bp_seqfeature_load.pl -u root -d rice -c -f rice.gff3
loading rice.gff3...
Building object tree... 0.53s4s
Loading bulk data into database... 0.65s
load time: 11.74s

and repeat for wild rice...

gmod@ubuntu:~/data/gbrowse_syn/rice/data$ bp_seqfeature_load.pl -u root -d wild_rice -c -f wild_rice.gff3
loading wild_rice.gff3...
Building object tree... 0.55s7s
Loading bulk data into database... 0.66s
load time: 11.98s
  • The alignment database loading should also be done by now. We can check like so:
gmod@ubuntu:~/data/gbrowse_syn/rice/data$ screen -r load1

Setting up the Configuration Files

Copy the configuration files to the installation directory. Note that you will need root privileges to do this.

Change to the conf directory and make sure we have the files...

gmod@ubuntu:~/data/gbrowse_syn/rice/data$ cd ../conf
gmod@ubuntu:~/data/gbrowse_syn/rice/conf$ ls
header.txt  oryza.synconf  rice_synteny.conf  wild_rice_synteny.conf

  • The default configuration location for Ubuntu Linux is /etc/apache2/gbrowse.conf. Copy the files there:

gmod@ubuntu:~/data/gbrowse_syn/rice/conf$ sudo cp *conf /etc/apache2/gbrowse.conf
[sudo] password for gmod:

A Species Config File

File: rice_synteny.conf

[GENERAL]
description   = Domestic rice chromosome 3
db_adaptor    = Bio::DB::SeqFeature::Store
db_args       = -adaptor DBI::mysql
                     -dsn     dbi:mysql:rice;host=localhost
                     -user    nobody

# examples to show in the introduction
examples = 3:51418..52015
           3:67260..67704

# what image widths to offer
image widths  = 450 640 800 1024

# default width of detailed view (pixels)
default width = 1024

initial landmark = 3:200000..300000

# Web site configuration info
stylesheet  = /gbrowse/gbrowse.css
buttons     = /gbrowse/images/buttons
tmpimages   = /gbrowse/tmp

# max and default segment sizes for detailed view
max segment      = 5000000
default segment  = 5000

# zoom levels
zoom levels      = 50 100 200 1000 2000 5000 10000 20000 40000 50000 100000 500000 1000000 5000000

# colors of the overview, detailed map and key
overview bgcolor = lightgrey
detailed bgcolor = lightgoldenrodyellow
key bgcolor      = beige
default features = EG
balloon tips     = 1

[TRACK DEFAULTS]
glyph         = generic
height        = 10
bgcolor       = lightgrey
fgcolor       = black
font2color    = blue
label density = 25
link          = AUTO
link_target   = _blank
title         = Hello, my name is $name!

################## TRACK CONFIGURATION ####################
# the remainder of the sections configure individual tracks
###########################################################

[EG]
feature      = gene:ensembl
glyph        = gene
height       = 10
bgcolor      = peachpuff
fgcolor      = hotpink
description  = 0
label        = 0
category     = Transcripts
key          = ensembl gene
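
The line feature = gene:ensembl in the [EG] track selects features by type and source, which are columns 3 and 2 of the GFF3 file. A quick way to check that such features exist is to count the type:source pairs in the file. This is a sketch only; count_type_source is a hypothetical helper name, and the sample records are copied from the rice.gff3 listing and rebuilt with real tabs.

```shell
# Count features per "type:source" pair (GFF3 columns 3 and 2).
# count_type_source is a hypothetical helper name used only in this sketch.
count_type_source() {
    awk -F '\t' '
        !/^#/ && NF >= 9 { n[$3 ":" $2]++ }
        END { for (k in n) printf "%s\t%d\n", k, n[k] }
    ' "$1" | LC_ALL=C sort
}

# Four records from rice.gff3; use the real file in place of this stand-in.
sample=$(mktemp)
{
    printf '3\tensembl\tgene\t10979\t16914\t.\t+\t.\tID=3_FG2570;Name=3_FG2570;biotype=protein_coding\n'
    printf '3\tensembl\tmRNA\t10979\t16914\t.\t+\t.\tID=3_FGT2570;Parent=3_FG2570;Name=3_FGT2570;biotype=protein_coding\n'
    printf '3\tensembl\tCDS\t10979\t11592\t.\t+\t0\tParent=3_FGT2570;Name=CDS.29\n'
    printf '3\tensembl\tCDS\t11670\t13317\t.\t+\t2\tParent=3_FGT2570;Name=CDS.30\n'
} > "$sample"

count_type_source "$sample"
rm -f "$sample"
```

If gene:ensembl is missing from the output for a real file, the [EG] track would have nothing to draw.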

The GBrowse_syn Config File

File: oryza.synconf

#Note the include statement below.
#include header.txt

# example searches to display
examples = rice 3:157000..200000
           rice 3:16050173..16064974
           wild_rice 3:1..400000

zoom levels = 5000 10000 25000 50000 100000 200000 400000

# species-specific databases
[rice_synteny]
tracks    = EG
color     = blue

[wild_rice_synteny]
tracks    = EG
color     = red
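
The section names in oryza.synconf ([rice_synteny], [wild_rice_synteny]) match the names of the species configuration files copied earlier. The sketch below checks that pairing. It assumes that each section name is the basename of a GBrowse species config file in the same directory; check_syn_sources is a hypothetical helper and the demo directory is a throwaway stand-in for /etc/apache2/gbrowse.conf.

```shell
# List the [section] names in a .synconf file and report whether a matching
# <section>.conf file exists in the given configuration directory.
# Assumption: each section name is the basename of a species config file.
check_syn_sources() {
    synconf=$1
    confdir=$2
    sed -n 's/^\[\(.*\)\][[:space:]]*$/\1/p' "$synconf" | while read -r name; do
        if [ -f "$confdir/$name.conf" ]; then
            echo "ok      $name.conf"
        else
            echo "MISSING $name.conf"
        fi
    done
}

# Throwaway demo directory in which only one of the two species files exists.
demo=$(mktemp -d)
printf '[rice_synteny]\ntracks = EG\n\n[wild_rice_synteny]\ntracks = EG\n' > "$demo/oryza.synconf"
touch "$demo/rice_synteny.conf"
check_syn_sources "$demo/oryza.synconf" "$demo"
rm -rf "$demo"
```

On the tutorial image the call would be check_syn_sources /etc/apache2/gbrowse.conf/oryza.synconf /etc/apache2/gbrowse.conf.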

This should complete the installation. Time to test it out...

Testing the rice and wild_rice data sources in GBrowse

  • If things have worked out, you should see something like the image below when you point your browser to:
http://localhost/cgi-bin/gbrowse/rice

Note you will use 'localhost' if you are running your browser within the VMware player.

Rice in gbrowse.png

Viewing the data in GBrowse_syn

  • Cross your fingers and point your browser to:
http://localhost/cgi-bin/gbrowse_syn/oryza


Ihopethisworks2.png

Optional Advanced Section

We will set up a five-genome database if time permits.