GBrowse Tutorial 2012

From GMOD
Jump to: navigation, search

This GBrowse tutorial was taught at the 2012 GMOD Summer School by Scott Cain.

To follow along with the tutorial, you will need to use AMI ID: ami-5bab1c32, name: GMOD 2012 day 4 start, available in the US East (N. Virginia) region. See the GMOD Cloud Tutorial for information on how to get this AMI.

Prerequisites

Installed before using apt or cpan.

Install GBrowse

Easily installed via the cpan shell:*

 sudo cpan
 cpan> install Bio::Graphics::Browser2

Which gets all of the prereqs that aren't installed on the machine.

*Small caveat: the 2.49 release has a small bug that makes installing it slightly tricky; this will be fixed in the upcoming 2.50 release. For more information about the bug, you can see the mailing list archive.

Tutorial

Go to http://ec2-##-##-##-##.compute-1.amazonaws.com/gbrowse2

Basic Chado Configuration (if we have time)

Bio::DB::Das::Chado was installed when we created the image. We'll start with the sample Chado configuration that is in /etc/gbrowse:

 sudo cp 07.chado.conf pythium.conf

Some simple tweaks and additions:

  • Change description
  • Remove or change examples (yeast examples don't help anybody)
  • Add initial landmark (initial landmark = scf1117875582023:1..10000)
  • Change the connection stuff to connect to drupal db

(Re)loading GFF data

So as it turns out, there is an incompatibility between the not-yet-released Tripal 1.0 GFF loader and the GBrowse Chado adaptor. While I will work with Steven et al. to figure out where the problem is (and I'm not pointing any fingers, since it could very well be my fault). In addition, the Chado adaptor doesn't really care for the GFF that MAKER is producing either, so we're going to munge it a little. Make a perl script named munge.pl with this contents:

#!/usr/bin/perl
use strict;
use warnings;
while (<>) {
   if (/\tmatch(_part)?\t/) {
       s/Name=[^;]+?;//;
   }
   print;
}

and run it like this:

 perl munge.pl scf1117875582023.gff > new.gff

And we'll load it into an existing Chado database like this:

 gmod_bulk_load_gff3.pl --org dicty --noexon --an -g new.gff --dbprof chado

Where:

 --org dicty    #yes, I know, but I didn't want to create a new org entry
 --noexon       #tells loader not to create exon feature from the CDS features because they're already there
 --an           #this is an analysis result (save the score)
 -g             #load this gff file
 --dbprof chado #use the database profile called chado (which is the name of the database)

After we do that, we have to change the db connection parameters to use the database named chado:

[main:database]
db_adaptor    = Bio::DB::Das::Chado
db_args       = -dsn dbi:Pg:dbname=chado
               -user ubuntu
               -srcfeatureslice 1
         #      -fulltext 1
               -organism dicty

Add a BAM data source

[bam_sample:database]
db_adaptor     = Bio::DB::Sam
db_args        = -fasta /var/lib/gbrowse2/databases/pythium/scf1117875582023.fasta
                 -bam   /var/lib/gbrowse2/databases/pythium/simulated-sorted.bam
search options = default

Add track defaults

[TRACK DEFAULTS]
glyph       = generic
database    = main
height      = 8
bgcolor     = cyan
fgcolor     = black
label density = 25
bump density  = 100

Note particularly the "database" entry--for most tracks we'll be using the main (ie, chado) database, but the bam_sample data source will be available when we want it.

Add some tracks

[Genes]
feature      = gene
glyph        = gene
ignore_sub_part = polypeptide
#bgcolor      = yellow
forwardcolor = yellow
reversecolor = turquoise
label        = sub { my $f = shift;
                    my $name = $f->display_name;
                    my @aliases = sort $f->attributes('Alias');
                    $name .= " (@aliases)" if @aliases;
                    $name;
  }
height       = 6
description  = 0
key          = Named gene
[CDS]
feature      = mRNA
glyph        = cds
description  = 0
ignore_sub_part = polypeptide exon
height       = 26
sixframe     = 1
label        = sub {shift->name . " reading frame"}
key          = CDS
citation     = This track shows CDS reading frames.
[repeats]
feature      = match:repeatmasker
glyph        = generic
bgcolor      = black
key          = Repeats
[ests]
feature      = expressed_sequence_match
glyph        = segments
stranded     = 1
bgcolor      = green
key          = EST matches
[proteins]
feature      = protein_match
glyph        = segments
stranded     = 1
bgcolor      = pink
fgcolor      = red
key          = protein matches
[CoverageXyplot]
feature        = coverage
glyph          = wiggle_xyplot
database       = bam_sample
height         = 50
fgcolor        = black
bicolor_pivot  = 20
pos_color      = blue
neg_color      = red
key            = Coverage (xyplot)
[Reads]
feature        = match
glyph          = segments
draw_target    = 1
show_mismatch  = 1
mismatch_color = red
database       = bam_sample
bgcolor        = blue
fgcolor        = white
height         = 5
label density  = 50
bump           = fast
key            = Reads
[Pair]
feature       = read_pair
glyph         = segments
database      = bam_sample
draw_target   = 1
show_mismatch = 1
bgcolor       = sub {
                my $f = shift;
                return $f->attributes('M_UNMAPPED') ? 'red' : 'green';
                }
fgcolor       = green
height        = 3
label         = sub {shift->display_name}
label density = 50
bump          = fast
connector     = dashed
balloon hover = sub {
                my $f     = shift;
                return '' unless $f->type eq 'match';
                return 'Read: '.$f->display_name.' : '.$f->flag_str;
                }
key           = Read Pairs

Add our new database to the GBrowse.conf

To let GBrowse know that there is a new database available, we have to add a few lines to GBrowse.conf. Add this to the bottom:

[pythium]
description   = Pythium ultimum
path          = pythium.conf

Add semantic zooming for the BAM tracks

Not doing this for very dense data (like BAM) is probably the number one performance killers for GBrowse; asking GBrowse to draw a track that has thousands of glyphs is time consuming (and ultimately, probably not very informative).

[Reads:5001]
feature        = coverage
glyph          = wiggle_density
height         = 15
[Pair:5001]
feature        = coverage
glyph          = wiggle_density
height         = 15
bgcolor        = purple

Add "show summary" functionality

For other tracks, when zoomed way out (100kb or 1MB), performance can similarly suffer, with a decreasing "information" content. Newer versions of GBrowse provide the ability to automatically generate density plots when zoomed out. This functionality is available from Chado and Bio::DB::SeqFeature::Store data adaptors. To prepare our Chado database to do this semantic zooming, we need to run a script that comes with Bio::DB::Das::Chado:

 cd ~/GBrowse-Adaptors/Chado
 git pull
 perl bin/gmod_create_summary_statistics.pl --dbprof chado

and then add to the pythium.conf file, somewhere near the top (i.e., not in the track definitions):

 show summary = 99999

Enabling full text searching

If we try searching for "gene 0.27", we'll get "Not Found" as a result, even though maker-scf1117875582023-snap-gene-0.27 does exist. To look for partial strings, we need to enable full text searching. To do so, we need to run another script that comes with Bio::DB::Das::Chado:

 perl bin/gmod_chado_fts_prep.pl --dbprof chado

This does several things (including poorly estimating how long it will take to finish), including creating materialized views, using a tool provided by SOL Genomics Network (SGN). In practice, it would be a good idea to read the documentation of gmod_materialized_view_tool.pl for information on keeping the view up to date.

We also have to tell GBrowse that this Chado database can now do full text searching, by adding this to the Chado database stanza:

 -fulltext 1

Now we can search for "gene 0.27" and we'll find our gene (plus its mRNA and exons) and we can click on the gene to see it in GBrowse.