August 2009 GMOD Meeting

6-7 August, 2009
Oxford UK

Part of GMOD Europe 2009, five days of GMOD including a GMOD Summer School



This GMOD Community Meeting was held 6-7 August, 2009, in Oxford UK. The meeting was a part of GMOD Europe 2009, a week-long event that also included a GMOD Summer School. This was the first time a GMOD meeting had been held in Europe.

As with previous GMOD meetings, this meeting had a mixture of project talks, component talks, and user talks. The agenda was driven by attendee suggestions. The two previous meetings were the January 2009 and July 2008 meetings. GMOD meetings are an excellent way to meet GMOD developers and users, and to learn (and affect) what's coming in the project.


Schedule

Heng Li
Wellcome Trust Sanger Institute


Dr Heng Li of the Sanger Institute was the special guest speaker. Heng discussed his recent work on SAMtools, a set of file formats and scripts for efficiently storing and accessing next generation sequence data. Heng is a developer on several projects focused on next generation sequencing, including SAMtools, BWA, and MAQ.
Date Time Session Link(s)
Thursday
6 August
8:30-12:00 Last half day of 2009 GMOD Summer School - Europe
13:30-14:30 Scott Cain - Introductions and the State of GMOD Prezi, PPT, PDF, Summary
14:30-15:00 Dave Clements - GMOD Help Desk Stuff PPT, PDF, Summary
15:00-15:30 Jun Zhao - Linked Data for GMOD Databases PDF, Summary
15:30-15:45 Coffee Break
15:45-16:15 Steve Taylor - GMOD in the Trenches PDF, Summary
16:15-16:30 Scott Cain (for Robert Buels) - A DBIx::Class layer for Chado S5 Slides, Summary
16:30-17:00 Ed Lee - GMOD Biological Object Layer PDF, Summary
17:00-17:30 Josh Goodman - A Restful interface for MODs Summary
17:30 Dinner (on your own)
Friday
7 August
8:45-9:15 Heng Li - Quest for Standard: Sequence alignment/map format (SAM) and SAMtools PDF, Summary
9:15-9:45 Dave Clements - Visualising NGS Data in GBrowse 2 PPT, PDF, Summary
9:45-10:15 Erick Antezana & Frederic Potier - GBrowse: Lessons Learned and Statement of Interest PDF, Summary
10:15-10:45 Ian Holmes - JBrowse PDF, Summary
10:45-11:00 Coffee Break
11:00-11:30 Sheldon McKay - GBrowse_syn PDF, Summary
11:30-12:30 Discussion: NextGen data and GMOD: What do we do (and not do)?
12:30-13:30 Catered Lunch
13:30-14:00 Alessandra Bilardi - GBrowse.org PDF, Summary
14:00-14:30 Jonathan Warren - DAS update PPT, PDF, Summary
14:30-15:00 Julie Sullivan - InterMine update Summary
15:00-18:00 Show and Tell, Discussion Summary

Presentations

GMOD Project Talks

Scott

Scott Cain, Ontario Institute for Cancer Research, Intro, What's New PPT, PDF
Dave Clements, NESCent, Help Desk Update, PPT, PDF

HHMI Science Education Alliance

The Howard Hughes Medical Institute's Science Education Alliance (SEA) is using GMOD tools to teach annotation to college freshmen. They isolate and sequence phage samples. The sequence is then stored in Chado, annotated with Apollo and visualized with GBrowse. The program is in production at 12 colleges across the US.

What's new

Chado (GMOD) 1.1 is coming
  • Minor schema changes
  • Minor fixes to GFF scripts
  • Addition of Chris Mungall's script to create views based on CV terms.
GBrowse
  • GBrowse 2
    • Distributed databases and render servers
    • AJAX track loading
    • Improved configuration management
    Next Generation Sequencing in GBrowse
    • Support for SAM/BAM databases - see SAMtools
    • Coverage XY-plot, Confidence density plot, Individual alignments, Paired reads
    • Currently Alpha in GBrowse 2; may work in GBrowse 1 with some DAS magic.
    Circular chromosome support
    • Can scroll through origin, and features can span origin
    • Coming in GBrowse 1.71
    • Developed by Nathan Liles of the EcoliWiki project.
JBrowse
  • Another complete rearchitecture
  • Uses AJAX for client side rendering
GBrowse_syn
  • Distributed with GBrowse 1.70
  • Makes use of data adaptors/databases that GBrowse uses
Tripal
DIYA

DIYA is a gene prediction pipeline for prokaryotes. It complements MAKER, a pipeline for eukaryotes. DIYA is actually a generic, lightweight pipeline framework which was initially built to produce gene predictions. DIYA is becoming part of GMOD.

Atlases and Aniseed

Aniseed is converting its schema to Chado. One of Aniseed's particular strengths is atlases for expression, anatomy, and cell fate. They are extending Chado to better support atlases, and will also make their web front end available as a part of GMOD.

GMOD Summer School

2008

2008 GMOD Summer School - first school ever offered

2009

2009 GMOD Summer School - Americas

  • 4 days at NESCent
  • 8 GMOD Components covered; 9 instructors
  • 52 applications for 25 slots

2009 GMOD Summer School - Europe

  • 3 1/2 days at University of Oxford
  • 7 GMOD Components covered; 10 instructors
  • 58 applications for 25 slots

That's an increase in interest of over 350% from 2008.

We'll do another summer school at NESCent in 2010. We are also considering one in Asia/Pacific in 2010.

Outreach

GMOD Community Surveys

GMOD is now surveying the community every year. The 2008 GMOD Community Survey had 89 responses and is very informative about how GMOD is used. The 2009 survey will be in October.

Upcoming GMOD Hackathon ?

There may be a GMOD hackathon this coming spring (March to May) at the US National Evolutionary Synthesis Center (NESCent) in Durham, NC, USA. If this happens, the focus will be on extending GMOD for evolutionary biology. Contact Dave if you want to be on the organizing committee or participate.

Linked Data for GMOD Databases

Jun Zhao

Jun Zhao, Department of Zoology, University of Oxford, PDF

Jun first introduced the Resource Description Framework (RDF) and the SPARQL language for querying it.

OpenFlyData

Jun discussed her group's efforts to build an RDF triple store from several very different data sources: FlyBase (a Chado database), BDGP, FlyTED, FlyAtlas, and Affymetrix data sources. The integrated triple-store can be accessed at OpenFlyData.

They used

  • D2RQ mapping to load FlyBase and BDGP, using conservative mapping with minimum interpretation.
  • OAI2SPARQL to harvest N3 RDF metadata via the OAI-PMH protocol, using built-in support by Eprints, and further info from ESWC2008 paper.
  • Custom Python program to get FlyAtlas data.

Some performance numbers:

  • Loading: Our datasets ~175 million triples
  • Querying:
    • Good enough for real time user interaction, e.g., <1s for single gene search, 1-4s for multigene search (unions)
    • No significant slowdown when scaling from 10m to 175m triples
  • Text matching and case insensitive search
    • Problems with using SPARQL regex filter, the only mechanism for case-insensitive search in SPARQL
    • Pre-generated lower-case gene names and loaded into the FlyBase RDF DB
    • Tried with OpenLink Virtuoso, still ~10 seconds for a case-insensitive search
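
The lower-casing workaround can be illustrated outside of any triple store. The following is a minimal Python sketch, not the OpenFlyData code: the identifiers and gene names are made up for illustration, and a plain dictionary stands in for the pre-generated lower-case names loaded into the RDF database. It contrasts a case-insensitive regex scan over every record with an exact lookup on lower-cased keys.

```python
import re

# Hypothetical (identifier, gene name) pairs standing in for RDF triples.
GENES = [
    ("gene:0001", "Adh"),
    ("gene:0002", "dpp"),
    ("gene:0003", "ADH-related"),
]


def regex_search(name):
    """Scan every record with a case-insensitive regex (analogous to a regex filter)."""
    pattern = re.compile("^" + re.escape(name) + "$", re.IGNORECASE)
    return [gene_id for gene_id, label in GENES if pattern.match(label)]


# Pre-generate lower-case names once, at load time.
LOWER_INDEX = {}
for gene_id, label in GENES:
    LOWER_INDEX.setdefault(label.lower(), []).append(gene_id)


def indexed_search(name):
    """Exact match against the pre-generated lower-case names."""
    return LOWER_INDEX.get(name.lower(), [])
```

Both functions return the same identifiers; the second avoids touching every record at query time, at the cost of storing an extra lower-case value per name.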

Jun used OpenFlyData to:

  • Search by gene, gene expression mashup: (go)
  • Search gene expression by gene batch (go)
  • Search gene expression by tissue expression profile (go)

Open-BioMed

Jun also described a second effort, Open-BioMed, that uses the same technologies to connect knowledge about alternative medicine and western drugs. Open-BioMed demonstrates the value of Linked Data, and shows a novel technique for creating interlinks between datasets on a large scale. This is a joint effort of the BioRDF and LODD (Linked Open Drug Data) task forces of the World Wide Web Consortium (W3C) Health Care Life Science Interest Group. Jun used Open-BioMed to search for herbs associated with a particular disease.

RDF & SPARQL: Benefits & Risks

Some identified benefits:

  • RDF provides a uniform and flexible data model
    • RDF dump is cheaper and quicker
    • Maintaining a separate SPARQL endpoint for each data source makes it easier than a data warehouse approach for handling data updates
  • RDF facilitates data re-use and re-purposing
  • SPARQL raises the point of departure for an application
    • Expressive, open-ended query protocol
    • Support for unanticipated queries

and risks:

  • Mapping data to RDF requires expertise and experience
  • Expressive query protocol is a double-edged sword
  • Performance is good for some queries, not for others ...

GMOD in the Trenches

Stephen Taylor

Stephen Taylor, Computational Biology Research Group, University of Oxford, PDF

The Computational Biology Research Group (CBRG) provides bioinformatics support to researchers at the University of Oxford. They are heavy GMOD users and have used GBrowse, Citrina, BioMart, and Apollo (along with Artemis).

GBrowse at CBRG

Back in 2004, the CBRG wanted to pull data together to make a lab resource, and the genome is a useful data organiser. The CBRG evaluated these platforms: UCSC, Ensembl, AceDB, and GBrowse. Each had advantages and disadvantages, but GBrowse looked like it was built to be distributed and used elsewhere. Ease of installation was not a priority for the others.

The CBRG now supports over 50 different GBrowse databases. Data is mainly human, mouse or bacterial, and data types include time series, arrays, and ChIP-on-Chip. They visualize a lot of Next Generation Sequencing data, including histone modifications, ChIP-Seq, cis/trans interaction data, PCR amplified regions, and RNA-Seq.

The CBRG actively manages data flow to its GBrowse instances. Each production GBrowse instance has a matching development instance where updates and changes are staged and tested before pushing them to production. They also use core and satellite databases. Core databases are built for human and mouse using public source data. To meet individual groups' needs they then clone a core database, load custom data that is specific to that group, and then run a script to merge the core and satellite GBrowse configuration files. They use Apache to restrict access to the satellite instances.

The CBRG strives to encourage power users. Data is available for download, and they have regular meetings to discuss best practices.

Extending GBrowse

In the future they would like to use GBrowse as a workbench. To do this they need flexible ways to import and export features. For example, you can define a temporary track by uploading a GFF3 file, or by connecting via DAS to an outside source, or to another GBrowse. It would be nice to have a method to commit a temporary track and make it permanent. This requires some sort of user authentication.

Steve also walked through an example of how it would be useful to support querying and visualizing data from multiple loci at the same time.

Make Existing GBrowse More Useful to External Developers

Finally Steve listed these 5 ways to make GBrowse more useful to external developers:

  • Document general structure of GBrowse perl modules
  • Tips on debugging
  • Document / define API
  • Central Glyph page
  • Include a copy of BioPerl inside GBrowse

A DBIx::Class layer for Chado

Sol Genomics Network

Scott Cain, OICR, for Robert Buels, Sol Genomics Network (SGN), S5 Slides

Chado needs middleware, a layer of software between the application (e.g. a website) and the database. Chado's flexible design makes for complex queries and a steep learning curve. It is also hard to get good performance. This talk introduces a Perl DBIx::Class layer for use with Chado, which can be used as the basis for many applications, including the next generation of Modware.

DBIx::Class is an object-relational mapping framework for Perl, and is the de facto standard. It has powerful features for:

  • query building (the magic of chainable ResultSets)
  • cross-database deployment (using SQL::Translator in the backend)
  • testing with Fixtures

Middleware can help by storing and/or automating complex queries, codifying best practices with both code and unified, high-level documentation. Some performance optimizations can be put in middleware and it can assist in creating indexes and materialized views.


The Bio-Chado-Schema project has been set up by Robert Buels, with source control at GitHub, and releases available on CPAN. This contains DBIx::Class modules for every Chado table that should work with all database platforms that are supported by Chado. The project uses automated tools to keep the modules in sync with changes in the Chado schema. The project is actively looking for development help, and CPAN releases are currently intended for developers. Future goals include API support for common querying and loading patterns, interoperation with BioPerl objects, forming the basis for a future version of Modware, and more.

Rob says:

  • other people should start building features onto and into it
    • and do some of the other things on the slides
  • make a new version of Modware based on it
  • do you think somebody could get funding to work on it full time?

GMOD Biological Object Layer

Ed Lee

Ed Lee, BBOP, PDF

Ed has been working with E.O. Stinson and Robert Bruggner at BBOP, and Robin Houston and Adrian Tivey at Sanger to create a Java based biological object layer (GBOL) for genomic features.

GBOL Architecture

GBOL is the top layer of a multilayer architecture:

Biological Object Layer (GBOL)

This layer defines an object at a biological level of interest, say a gene. It aggregates together all of the information about that high level concept into a single, programmatically accessible entity. It hides all of the information about how and where the underlying data is stored.

This layer is inspired by Chado, but is not necessarily built on top of Chado.

Biological Object I/O Layer

This layer ...

Simple Object Layer

This layer knows about basic biological concepts, but does not directly know how or where this information is stored.

Simple Object I/O Layer

This is the bottom layer of the stack and it is closely tied to how and where the data is stored. This layer knows if it is talking to a Chado database, a GFF3 file, or some other data source.

This layer can do simple aggregation such as "return all features in this range", but does not perform aggregation based on biological models. That type of aggregation is performed by higher levels.
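
The division of labour between the layers can be sketched in a few lines. GBOL itself is Java based; the Python below is only an illustration under stated assumptions, and every class and function name in it (`Feature`, `InMemoryFeatureIO`, `Gene`, `genes_in_range`) is hypothetical rather than part of GBOL. The I/O class is the only piece that knows where data lives and offers range-based retrieval, while the biological aggregation (attaching exons to their gene) happens above it.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    """Simple object layer: a basic feature with no knowledge of storage."""
    seqid: str
    type: str
    start: int
    end: int
    parent: str = ""
    id: str = ""


class InMemoryFeatureIO:
    """Simple object I/O layer: the only layer that knows where data lives."""

    def __init__(self, features):
        self._features = list(features)

    def features_in_range(self, seqid, start, end):
        """Simple aggregation: return all features overlapping a closed range."""
        return [f for f in self._features
                if f.seqid == seqid and f.start <= end and f.end >= start]


@dataclass
class Gene:
    """Biological object layer: a gene aggregated with its exons."""
    feature: Feature
    exons: list = field(default_factory=list)


def genes_in_range(io, seqid, start, end):
    """Biological aggregation, performed above the I/O layer."""
    found = io.features_in_range(seqid, start, end)
    genes = {f.id: Gene(f) for f in found if f.type == "gene"}
    for f in found:
        if f.type == "exon" and f.parent in genes:
            genes[f.parent].exons.append(f)
    return list(genes.values())
```

Swapping `InMemoryFeatureIO` for a class backed by a database or a GFF3 file would leave `genes_in_range` unchanged, which is the point of the layering.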

Biological Layer Configuration

<?xml version="1.0" encoding="UTF-8"?>
<gbol_mappings>
 <feature_mappings>
  <type cv="SO" term="gene" default="true">
   <read_class>Gene</read_class>
  </type>
  <type cv="SO" term="transcript" default="true">
   <read_class>Transcript</read_class>
  </type>
  <type cv="SO" term="my_transcript">
   <read_class>Transcript</read_class>
  </type>
 </feature_mappings>
 <relationship_mappings>
  <type cv="relationship" term="part_of" default="true">
   <read_class>PartOf</read_class>
  </type>
 </relationship_mappings>
</gbol_mappings>
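
As a rough illustration of how such a mapping file could be consumed, the following Python sketch reads the feature mappings into a dictionary keyed by (cv, term). This is not GBOL code (GBOL is Java based); the function name is hypothetical, the embedded XML is a shortened copy of the configuration above, and the sketch deliberately does not interpret the `default` attribute, whose semantics are not described here.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the mapping configuration shown above.
MAPPINGS_XML = """<gbol_mappings>
 <feature_mappings>
  <type cv="SO" term="gene" default="true">
   <read_class>Gene</read_class>
  </type>
  <type cv="SO" term="my_transcript">
   <read_class>Transcript</read_class>
  </type>
 </feature_mappings>
 <relationship_mappings>
  <type cv="relationship" term="part_of" default="true">
   <read_class>PartOf</read_class>
  </type>
 </relationship_mappings>
</gbol_mappings>
"""


def load_feature_mappings(xml_text):
    """Return {(cv, term): read_class name} for every feature mapping."""
    root = ET.fromstring(xml_text)
    mappings = {}
    for type_el in root.findall("./feature_mappings/type"):
        key = (type_el.get("cv"), type_el.get("term"))
        mappings[key] = type_el.findtext("read_class")
    return mappings
```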

Future Developments

  • Continued development on Biological layer
  • Inference of data: infer introns from exon structure
  • New format handlers: Chado XML, GAME XML, BioPerl bridge
  • Configuration of common relationship variations such as ESTs aligned to the genome directly vs having a "match" feature


A Restful interface for MODs

Josh Goodman, FlyBase

Josh talked about the progress of the GMOD REST API group that was started at the January 2009 GMOD Meeting.


Quest for Standard: Sequence alignment/map format (SAM) and SAMtools

Heng Li

Heng Li, Wellcome Trust Sanger Institute, PDF

Heng spoke about SAM/BAM and SAMtools, a platform agnostic set of file formats and programs for next generation sequence data.

SAM/BAM is a generic nucleotide alignment format that

  • is simple to understand, easy to generate and easy to parse
  • is compact in file size
  • is streamable
  • supports fast random access

Quest for Standards

There had been no standardized and computationally efficient way to store the volumes of data that next generation sequencing produces. Several formats such as phrap ACE and GFF existed, but these were unable to scale up.

SAM Format

The Sequence Alignment / Map (SAM) format is motivated by short read alignment but also works with long reads and de novo assemblies. SAM uses a GFF3-like tab-delimited format with 11 mandatory fields for key information, and variable optional fields and predefined tags for non-standard information. It is designed to be simple to generate and to parse. It uses an extended CIGAR string for various types of alignments. The extended CIGAR string format adds support for clipped, spliced, multi-part, and padded alignments. See the SAM Format Specification for details.
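
To make the layout concrete, here is a minimal Python sketch that splits one alignment line into its 11 mandatory fields and decodes the CIGAR string. It is an illustration, not SAMtools code. The field names follow the current SAM specification (some differed in 2009), the `=` and `X` operators are later additions to the specification, and the example read in the usage note is made up.

```python
import re
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int      # 1-based leftmost position
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: tuple   # optional TAG:TYPE:VALUE fields, left unparsed


CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # operations that advance along the reference


def parse_sam_line(line):
    """Split a tab-delimited alignment line into mandatory and optional fields."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("a SAM alignment line needs 11 mandatory fields")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                     f[6], int(f[7]), int(f[8]), f[9], f[10], tuple(f[11:]))


def parse_cigar(cigar):
    """Return [(length, op), ...]; '*' means the CIGAR is unavailable."""
    if cigar == "*":
        return []
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]


def reference_span(cigar):
    """Number of reference bases covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in REF_CONSUMING)
```

For example, a read with CIGAR `3S5M2D4M` has 3 soft-clipped bases, 5 aligned bases, a 2-base deletion from the reference, and 4 more aligned bases, so it spans 11 reference bases while carrying 12 read bases.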

BAM Format

SAM is a text format. The Binary Alignment/Map (BAM) format is an exact binary representation of SAM. It has Zlib/gzip compatible compression (and can be decompressed by zlib/gzip). BAM is space efficient, achieving 1 byte per raw base pair, including sequence, quality, read name, position and meta info. BAM is also streamable: programs can process alignments without loading the entire alignment into memory. BAM is usually sorted by the leftmost chromosomal position. BAM is indexed, supports random access, and can quickly retrieve sequences overlapping a specified region.

BAM uses BGZF, a generic indexable compression format. The standard gzip/zlib format is not block-wise, so indexing it is intricate and inefficient. A BGZF file is instead separated into multiple standalone gzip/zlib blocks (64kB each).
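
The block-wise idea can be demonstrated with Python's standard gzip module. This is a sketch of the principle only, not the BGZF format: it keeps a separate list of block offsets (an assumption made for simplicity), and the function names are hypothetical. Each chunk is compressed as a standalone gzip member, so any one block can be decompressed without reading the ones before it, while the concatenation remains a valid gzip stream.

```python
import gzip

BLOCK_SIZE = 64 * 1024  # uncompressed bytes per block, echoing the 64kB figure


def compress_blockwise(data, block_size=BLOCK_SIZE):
    """Compress data as concatenated standalone gzip members.

    Returns (compressed_bytes, offsets), where offsets[i] is the position
    of block i within compressed_bytes.
    """
    out = bytearray()
    offsets = []
    for i in range(0, len(data), block_size):
        offsets.append(len(out))
        out += gzip.compress(data[i:i + block_size])
    return bytes(out), offsets


def read_block(compressed, offsets, index):
    """Decompress a single block without touching the others."""
    start = offsets[index]
    end = offsets[index + 1] if index + 1 < len(offsets) else len(compressed)
    return gzip.decompress(compressed[start:end])
```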

BAM indexing uses binning plus a linear index for alignments sorted by the leftmost coordinates. B-trees and pure linear indexes are inefficient for resolving ‘overlap’ queries. R-trees and pure binning indexes have difficulty with streaming. For short read alignments, retrieving the reads in a region typically takes one seek function call (more efficient than R-trees). The approach also produces small index files (e.g., ~9MB for deep human resequencing).
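
The binning half of the index can be sketched with the bin-numbering function published in the SAM specification, which uses a hierarchy of bins whose smallest level is 16 kb wide. The Python below is a transcription of that function for illustration; it does not cover the linear index or the lookup of overlapping bins.

```python
def reg2bin(beg, end):
    """Return the smallest bin that fully contains the interval.

    Coordinates are zero-based and half-open: [beg, end).
    """
    end -= 1
    if beg >> 14 == end >> 14:                      # fits in one 16 kb bin
        return ((1 << 15) - 1) // 7 + (beg >> 14)
    if beg >> 17 == end >> 17:                      # 128 kb bin
        return ((1 << 12) - 1) // 7 + (beg >> 17)
    if beg >> 20 == end >> 20:                      # 1 Mb bin
        return ((1 << 9) - 1) // 7 + (beg >> 20)
    if beg >> 23 == end >> 23:                      # 8 Mb bin
        return ((1 << 6) - 1) // 7 + (beg >> 23)
    if beg >> 26 == end >> 26:                      # 64 Mb bin
        return ((1 << 3) - 1) // 7 + (beg >> 26)
    return 0                                        # the single 512 Mb bin
```

A short read almost always falls into one of the smallest bins; only a read that straddles a 16 kb boundary is promoted to a larger bin.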

APIs, Implementations and Supported Platforms

Several assembly programs can now produce SAM directly, and SAMtools comes with scripts to convert the output of several other assemblers to SAM format.

SAMtools also has native HTTP/FTP support. Programs can retrieve alignments overlapping a specified region from a remote file via HTTP/FTP. Simply replace the input BAM file name with a URL (http/ftp only). This partial load approach greatly reduces data transfer for applications such as genome browsers, which typically need only small regions of an assembly at any time.

Several implementations using SAMtools are available. The SAMtools package itself includes command line tools and C APIs for:

  • Conversion from other formats
  • SAM ⇔ BAM, indexing, sorting, merging, pileup, SNP/indel calling, alignment viewer ...
  • Native HTTP/FTP support

There are also implementations in Java (Picard and GATK), and Perl (Bio::DB::Sam, which is what GBrowse uses - see the next talk).

Displaying Alignments

An alignment viewer is a great help for method development:

  • Visually understand the alignment: the error rate, the depth, etc.
  • Validate aligner results: even read depth? right coordinates? right gaps?
  • Validate SNP/indel calls: human eyes are always better.
  • Validate structural variations: pair-end information

SAMtools comes with a Text Alignment Viewer, tview, which uses the GNU ncurses library. tview retrieves alignments using FTP/HTTP and is fairly simple. It shows alignments, but not annotation, paired-end information, multiple tracks, ...

The Broad Institute's Java-based Integrative Genomics Viewer (IGV) also works with data in BAM format.

And you can view SAM/BAM in GBrowse using the Bio::DB::Sam Perl adaptor (based on SAMtools C APIs). For SAM/BAM, GBrowse is a lightweight and versatile shared alignment viewer supporting multiple tracks and gene annotations.

For GBrowse, SAM/BAM can provide an efficient way to access large-scale new sequencing data, store various types of alignment (EST, mRNA, etc.) as an alternative to SQL databases, and possibly realize distributed alignment resources. GBrowse already pulls in data from remote sources using DAS. It could be extended to pull in remote SAM/BAM data using FTP/HTTP.

Are distributed alignments feasible? There is already native HTTP/FTP support in SAMtools. This could be added to Bio::DB::Sam as well. Alignment files are compressed. For short reads, one seek call (establishing network connection) is required to get alignments in a region. This would require very little configuration at the server hosting alignments, and compressed data transfer between file servers and the GBrowse server.

There are some major obstacles. The index files have to sit on local disks at the GBrowse server, and matching the reference sequences may be an issue. Bandwidth and caching also have to be addressed.


Visualising NGS Data in GBrowse 2

Dave Clements, NESCent, PPT, PDF

Lincoln Stein has written a GBrowse adaptor, Bio::DB::Sam, for Next Generation Sequencing data stored in the BAM format that Heng Li described in his talk. This is currently in Alpha release, and works only with GBrowse 2. It is available in the gbrowse-adaptors project of GMOD's CVS repository. Short read, next generation sequence data can be directly represented in GFF3, but the amount of data makes it very slow, and requires a very large database to back it. Using Bio::DB::Sam on top of BAM files makes visualizing individual reads both computationally tractable and manageable.

The talk used an example of 4 E. coli strains: an ancestral strain for which a reference sequence is available, a manipulated strain, and then two strains with phage resistance that evolved from the manipulated strain. Whole genome resequencing was performed on the manipulated and evolved lines. The resequencing was done on an Illumina GA2 and then assembled with the MAQ aligner. The MAQ alignments were then converted to SAM using a SAMtools script, and then to BAM.

Dave then showed how to configure GBrowse to be a short read viewer using Bio::DB::Sam, including an example callback to show alignment quality using color. However, the utility of showing short reads quickly declines as you zoom out past 100-200 bp. You can also use Bio::DB::Sam to show summary statistics such as coverage depth. Dave will work on documenting the Bio::DB::Sam adaptor and its interface to SAMtools in the coming months.
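
Coverage depth is a summary that stays useful at zoom levels where individual reads do not. As an illustration of what such a track summarizes, here is a minimal Python sketch that computes per-base depth from read intervals. It is not how Bio::DB::Sam or SAMtools compute coverage; the function name and the zero-based, half-open coordinate convention are assumptions made for this example.

```python
def coverage_depth(reads, region_start, region_end):
    """Per-base depth over [region_start, region_end).

    reads is an iterable of (start, end) half-open intervals on the reference.
    """
    length = region_end - region_start
    delta = [0] * (length + 1)  # +1 where a read begins, -1 just past its end
    for start, end in reads:
        lo = max(start, region_start)
        hi = min(end, region_end)
        if lo < hi:  # the read overlaps the region
            delta[lo - region_start] += 1
            delta[hi - region_start] -= 1
    depth = []
    running = 0
    for change in delta[:length]:
        running += change
        depth.append(running)
    return depth
```

Because each read only touches two array cells, the cost grows with the number of reads plus the region length rather than with their product.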

The talk then showed several other visualizations that can be done with next generation sequence data that don't display the short reads themselves. This included a number of ways to show allele and genotype frequencies (including showing them on a geolocation map).

Finally, if you are planning on starting to use NGS data, make sure you have a lot of bioinformatics infrastructure in place first.


GBrowse: Lessons Learned and Statement of Interest

Erick

Erick Antezana & Frederic Potier, Bayer CropScience, PDF

History and Current GBrowse Infrastructure

Bayer CropScience uses GBrowse 1.70 and GBrowse 2, CMap, Galaxy, and Ergatis. They have been a GBrowse user since 2004. They also evaluated Chado and chose not to use it because of performance issues. They currently use GBrowse 2 and mainly Bio::DB::GFF databases, focused mainly on plants. They have publicly available plant genomes, private genomes, and increasingly frequent annotation updates. Their requirements include minor data reformatting, fast data loading and querying, customizable application, and a high level of integrity.

Bayer currently has more than 30 databases with public data, at around 30GB. Their in house data includes next generation sequence data (stored in BAM and accessed in GBrowse 2 via Bio::DB::Sam), genome annotation (stored in a Bio::DB::GFF database), and molecular mapping visualized with CMap. They are also considering supporting user annotation / manual curation with Apollo and/or Artemis. Their automated annotation workflow produces GFF and generates GBrowse configuration files.

Bayer has extended GBrowse in several ways, including user authentication, permissions, and tracking.

Also

  • On the fly visualization
  • Blast anchoring/Sequence homology search
    • blast homologies are uploaded as user annotations
  • Plugins
    • data export
    • links to in house applications
  • In house keyword search engine
    • fast search utility
    • cross databases search
  • Gateway
    • centralised access point

Statement of Interest: Requirements and Needs

Bayer CropScience would also like to see GMOD extended in a number of areas.

GBrowse Database Adaptors

  • NGS adaptor (Bio::DB::Sam) is a key priority
  • Memory adaptor: would like to be able to specify a file name or a complete path via a parameter, so the adaptor doesn't need to load all the GFF files in the directory
  • Chado adaptor: portability to Oracle; ability to store user-specific annotation / manual curation; a system to track versions and history of the annotations; and management of user access rights
  • SeqFeature::Store: portability to Oracle (c.f. user access rights via VPD) and faster loading time
  • Compatibility with other genome browsers' databases, for instance Ensembl databases?

GBrowse User Interaction

  • Authentication
    • To track user sessions
    • To enable user access rights management
  • User Annotation Management
    • To store the user annotations in a database or in a file on the server, so that users can get their annotations when connecting from different machines
    • To automatically send a user’s annotations to GBrowse via a URL parameter
  • Integration with CMap

GBrowse Configuration Files

The current format is error prone, difficult to debug, has a steep learning curve, and is time consuming to maintain. Bayer (and CBRG and modENCODE and ...) partially work around this by having scripts generate their configuration files.

A better solution would be to have a better representation of the configuration file, XML for instance. (JBrowse addresses this issue by using JSON for its configuration files - Dave)

Would also like the ability to configure the global layout to enable or disable components, for example to disable the custom tracks or display settings components.

Would also like to have a standardized way to specify metadata in the configuration files. For example, species and assembly versions:

#################################
# database definitions
#################################
 
[TAIR_Arabidopsis_V8:database]
db_adaptor         = Bio::DB::GFF
db_args            = -adaptor DBI::mysql
                     -dsn dbi:mysql:TAIR_Arabidopsis_V8
species            = Arabidopsis thaliana
assembly.source    = TAIR
assembly.version   = 8
annotation.source  = TAIR
annotation.version = 8

Metadata Web Services

Web services could be used to query and report on metadata such as: list of reference sequences, annotation version, assembly version, and list of available feature types.

Suggestion:

<browser>
  <species>Arabidopsis</species>
  <assembly>bayer</assembly>
  <annotation>1.0</annotation>
  <reference-sequence>chr1</reference-sequence>
  <reference-sequence>chr2</reference-sequence>
  <feature-type>fgenesh:mRNA</feature-type>
  <feature-type>splign:mRNA</feature-type>
</browser>

This information could be defined in the config file:

[TAIR_Arabidopsis_V8:database]
db_adaptor    = Bio::DB::GFF
db_args       = -adaptor DBI::mysql
                -dsn dbi:mysql:TAIR_Arabidopsis_V8
species=Arabidopsis thaliana
assembly.source=TAIR
assembly.version=8
annotation.source=TAIR
annotation.version=8
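
To show how the two suggestions could fit together, here is a small Python sketch that reads the metadata keys from a database stanza and emits an XML report in the shape suggested above. Several things in it are assumptions: Python's configparser is used as a stand-in for GBrowse's own configuration parser, the function name is hypothetical, and mapping the `assembly` and `annotation` elements to the `.version` keys is one possible reading of the proposal.

```python
import configparser
import xml.etree.ElementTree as ET

CONFIG_TEXT = """\
[TAIR_Arabidopsis_V8:database]
db_adaptor         = Bio::DB::GFF
db_args            = -adaptor DBI::mysql
                     -dsn dbi:mysql:TAIR_Arabidopsis_V8
species            = Arabidopsis thaliana
assembly.source    = TAIR
assembly.version   = 8
annotation.source  = TAIR
annotation.version = 8
"""


def metadata_xml(config_text, section):
    """Build a <browser> metadata report from one database stanza."""
    # Only "=" separates keys from values, so "::" and ":" in values are kept.
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    parser.read_string(config_text)
    stanza = parser[section]
    browser = ET.Element("browser")
    ET.SubElement(browser, "species").text = stanza["species"]
    ET.SubElement(browser, "assembly").text = stanza["assembly.version"]
    ET.SubElement(browser, "annotation").text = stanza["annotation.version"]
    return ET.tostring(browser, encoding="unicode")
```

Reference sequences and feature types are left out because they would come from the database rather than from the configuration file.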

Conclusion / Discussion

GBrowse 2 is a tool that can be used in a production environment. It is intensively used within the Bayer Bioinformatics platform to facilitate a high level of data integration. It is easy to maintain.

Our priorities for further developments:

  • Adaptors performance
  • Need to focus on user interaction
  • GBrowse.conf representation
  • Native integration of other GMOD tools (e.g. CMap)

JBrowse

Ian Holmes

Ian Holmes, University of California - Berkeley, PDF

JBrowse was initially going to look and feel very much like GBrowse, but with pre-rendered, tiled images, a la Google Maps. A prototype was built, but this approach did not scale:

D. melanogaster at pixel resolution is an order of magnitude wider than the continental US.

Prerendering also prohibits things like user uploaded data. The original approach was abandoned and JBrowse now uses JavaScript based client side rendering. This approach is several orders of magnitude faster to generate the tracks, and takes several orders of magnitude less disk space to store them.

JBrowse uses nested containment lists (NCList) to store features. This approach is 5-500 times faster than competing methods such as R-trees, and B-trees with binning.
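
To illustrate the idea behind nested containment lists, here is a minimal Python sketch (JBrowse itself is JavaScript, and this is not its implementation). Intervals that are fully contained in another interval are moved into that interval's sublist, so within any one list both starts and ends are sorted and an overlap query can binary-search to the first candidate. The half-open interval convention and the function names are assumptions made for this example.

```python
def build_nclist(intervals):
    """Build nested lists of (start, end, children) from (start, end) pairs."""
    # Sort by start ascending, then end descending, so that a containing
    # interval always comes before the intervals it contains.
    ordered = sorted(intervals, key=lambda iv: (iv[0], -iv[1]))
    top = []
    stack = []  # chain of currently open containers, outermost first
    for start, end in ordered:
        node = (start, end, [])
        # Drop containers that do not fully contain this interval.
        while stack and not (stack[-1][0] <= start and end <= stack[-1][1]):
            stack.pop()
        (stack[-1][2] if stack else top).append(node)
        stack.append(node)
    return top


def _first_ending_after(nodes, pos):
    """Binary search for the first node whose end is greater than pos."""
    lo, hi = 0, len(nodes)
    while lo < hi:
        mid = (lo + hi) // 2
        if nodes[mid][1] > pos:
            hi = mid
        else:
            lo = mid + 1
    return lo


def query(nodes, qstart, qend):
    """Return all intervals overlapping the half-open range [qstart, qend)."""
    hits = []
    i = _first_ending_after(nodes, qstart)
    while i < len(nodes) and nodes[i][0] < qend:
        start, end, children = nodes[i]
        hits.append((start, end))
        hits.extend(query(children, qstart, qend))  # contained intervals
        i += 1
    return hits
```

Sublists are only visited when their containing interval overlaps the query, which is what lets a query skip most of the data.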

Ian demonstrated a TWiki plugin for JBrowse that demonstrated an easy way for users to upload their own tracks.

Some "imminent" developments for JBrowse:

  • Lazily-loaded NCLists
  • Text autocompletion; “proper” search
  • Nextgen sequence data
    • Start with basic summarization, then custom tracks
  • Community annotation
    • Persistent upload & sharing of tracks
    • Editing/curation over the web (ackles...)
  • Documented image-track API
  • Synteny browser (c.f. GBrowse_syn)
  • Much more at jbrowse.lighthouseapp.com

Ian closed with a very strong acknowledgment of Mitch Skinner's contribution to this work.


GBrowse_syn

Sheldon McKay

Sheldon McKay, Cold Spring Harbor Laboratory (CSHL), PDF

Synteny browsers have display elements in common with genome browsers. They use sequence alignments, orthology or co-linearity data to highlight different genomes, strains, etc., and they usually display co-linearity relative to a reference genome.


Other GMOD Synteny Viewers

GMOD has several supported synteny browsers, in addition to GBrowse_syn:

SynView

SynView is an add-on to the native GBrowse package. It uses GFF3 or DAS1 compliant data adapters. The GFF requires special tags (but they are allowed by the spec). The reference panel appears on the top.

SynBrowse

SynBrowse uses the same core libraries as GBrowse. Uses the Bio::DB::GFF (GFF2) adaptor. The GFF uses standard 'Target' syntax. It currently supports only two species.

Sybil

Sybil is not GBrowse-based. It uses a Chado database as a backend and provides whole genome and detailed views.

CMap

CMap is a comparative map viewer and can be used to show alignments between markers and regions on any type of map.

Apollo

Apollo (and Artemis too) provides an embedded synteny viewer.

GBrowse_syn

GBrowse_syn is different from the other browsers in a number of ways:

  • Does not rely on perfect co-linearity across the entire displayed region (no orphan alignments)
  • Offers on the fly alignment chaining
  • No upward limit on the number of species
  • Uses grid lines to trace fine-scale sequence gain/loss
  • Seamless integration with GBrowse data sources
  • Ongoing support and development
  • Some people think it looks nice

GBrowse_syn is part of the GBrowse distribution. It uses native (GBrowse-compliant) GFF2/GFF3 or Chado adapters for individual species' data, and synteny data are stored in a separate joining database. The databases form a hub and spoke (or star), with the joining database at the hub, and the individual species databases as the spokes.

At run time, GBrowse_syn reads the species databases, the joining/alignment database, and configuration files for each species and an overall config file.

Where do I get data for GBrowse_syn?

You have to make it.

GBrowse_syn helps you visualize multiple sequence alignment data, but it does not generate it for you. This is a non-trivial task and is not for the faint of heart. Sheldon provided a high level overview of one possible process and possible tools you could use in that process.

Raw genomic sequences

  • Step: Mask repeats
    ex. tools: RepeatMasker, Tandem Repeats Finder, nmerge
  • Step: Identify orthologous regions (→ GBrowse_syn)
    ex. tools: ENREDO, MERCATOR, orthocluster
  • Step: Nucleotide-level alignment (→ GBrowse_syn)
    ex. tools: PECAN, MAVID
  • Step: Further processing (→ GBrowse)

Once you have the data, you need to get it into a format that is supported by the GBrowse_syn load scripts.

Using GBrowse_syn

GBrowse_syn's user interface looks very much like GBrowse's interface. After selecting a reference assembly, GBrowse_syn displays each aligned sequence as a track, with every other track being the reference assembly. Aligned regions can be shown with or without connecting ribbons. Ribbons are twisted to indicate strand reversal, and strands can be reversed in the display to untwist them. Alignment ribbons can be shown with or without embedded grid lines. Grid lines show a finer level of alignment than plain ribbons, allowing the user to easily identify regions with indels and to visualize gene structure evolution or gene loss. Grid lines require nucleotide-level alignment data.

GBrowse_syn can show the same breadth of features as GBrowse. However, for a clearer display, users are strongly encouraged to limit what they show. As in GBrowse, arbitrary annotations can be added to any feature and shown with popups or linked pages.

GBrowse_syn also provides direct visual feedback on the likely quality of assemblies and can be used for guidance on refining them. For closely related species, a region in the reference should link to only a few regions in the other sequences. If it links to many different regions, the assembly likely needs significant additional work.
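
As a rough illustration of that check, the sketch below counts how many distinct target sequences each reference region links to. The input tuples, the function names, and the threshold of 3 are assumptions made for this example; GBrowse_syn presents this information visually and is not described as computing such a score.

```python
from collections import defaultdict


def fragmentation(links):
    """Map each reference region to the set of target sequences it links to.

    `links` is an iterable of (reference_region, target_species, target_seq_id).
    """
    targets = defaultdict(set)
    for ref_region, species, seq_id in links:
        targets[ref_region].add((species, seq_id))
    return targets


def suspicious_regions(links, max_targets=3):
    """Reference regions linking to more than `max_targets` target sequences."""
    return sorted(r for r, t in fragmentation(links).items() if len(t) > max_targets)


# Hypothetical alignment links: one tidy region and one scattered region.
links = [
    ("chr1:1-50000", "speciesB", "chr1"),
    ("chr1:1-50000", "speciesB", "chr1"),
    ("chr2:1-50000", "speciesB", "contig12"),
    ("chr2:1-50000", "speciesB", "contig98"),
    ("chr2:1-50000", "speciesB", "contig431"),
    ("chr2:1-50000", "speciesB", "contig7"),
]

flagged = suspicious_regions(links)
```

Here only the second region is flagged, because it scatters across four contigs while the first maps to a single sequence.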

If all you have is orthology data, GBrowse_syn can show that. However, the utility of GBrowse_syn declines if the aligned sequences are too far apart. It does faithfully show the results of the alignment, but the visualization often highlights that the alignments are of poor quality.

Finally, if your alignment data has regions aligning to multiple regions in other species, say because of recent duplications, GBrowse_syn will visualize this correctly.

Future Developments

  • Integration with GBrowse 2.0
  • "On the fly" sequence alignment view
  • AJAX-based user interface and navigation
  • High-level graphical overviews


GBrowse.org

Alessandra Bilardi, CRIBI Biotech Center Padua University, PDF

Alessandra created GBrowse.org to facilitate exchange of data, configuration files, and best practices between GBrowse users. The web site links to GBrowse instances and data download pages. It is based on the MediaWiki wiki package and makes extensive use of category tags to make information accessible in many different ways.

GBrowse.org is updated through a mixture of automated and manual mechanisms. Entrez's EFetch utility is used to initially create pages with their genome sequencing status. Each organism's page includes links to browsers, downloads, and sites and pages about that organism. If information is available on how the sequence and annotation data were produced, that is included as well.

GBrowse.org is not limited to just GBrowse sites. It also links to Ensembl, UCSC, and several other browser types.

Future plans for GBrowse.org include:

  • complete automation
  • test and edit links
  • edit sequencing and annotation methods
  • generate GBrowses and pages about all genomes with sequencing completed
  • divide GBrowses and genome pages in different sites (optional)

Finally, if you have a GBrowse site, you are encouraged to notify Alessandra for inclusion on GBrowse.org.


DAS update

Jonathan Warren, Sanger Institute, PPT, PDF

Jonathan started with an introduction to DAS. DAS:

  • Stops us from suffering under too much data to manage.
  • Allows us to download annotations for regions of interest rather than for whole genomes or databases.
  • Allows data providers to stay in control of the annotations they display to the world and to keep them up to date for users.

DAS stands for Distributed Annotation System. It allows data providers to provide their data over the web in a common format. It is based on HTTP and XML. Apollo and GBrowse, and many other popular packages, can speak DAS. DAS client programs request a list of DAS sources, and can then request regions of interest from those sources.
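
A DAS 1.x request is an ordinary HTTP GET whose URL names the server, the data source, the command, and (for region queries) a segment. The sketch below builds such URLs and parses a features response with Python's standard library. The server `das.example.org`, the source name `my_genome`, and the sample XML are made up for illustration, and the XML is trimmed to a few elements rather than being a complete response.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def das_url(server, command, source=None, segment=None):
    """Build a DAS 1.x URL, e.g. <server>/das/<source>/features?segment=1:100,200."""
    parts = [server.rstrip("/"), "das"]
    if source is not None:
        parts.append(source)
    parts.append(command)
    url = "/".join(parts)
    if segment is not None:
        seq_id, start, stop = segment
        # safe=":," keeps the segment in its usual readable form.
        url += "?" + urlencode({"segment": f"{seq_id}:{start},{stop}"}, safe=":,")
    return url


def parse_features(xml_text):
    """Extract (id, type, start, end) tuples from a DASGFF features document."""
    root = ET.fromstring(xml_text)
    features = []
    for feat in root.iter("FEATURE"):
        features.append((
            feat.get("id"),
            feat.findtext("TYPE"),
            int(feat.findtext("START")),
            int(feat.findtext("END")),
        ))
    return features


# Hypothetical, heavily trimmed response used in place of a live server.
SAMPLE = """<DASGFF>
  <GFF>
    <SEGMENT id="1" start="100" stop="2000">
      <FEATURE id="gene0001" label="gene0001">
        <TYPE id="gene">gene</TYPE>
        <START>150</START>
        <END>900</END>
      </FEATURE>
    </SEGMENT>
  </GFF>
</DASGFF>"""

sources_url = das_url("http://das.example.org", "sources")
features_url = das_url("http://das.example.org", "features", "my_genome", ("1", 100, 2000))
features = parse_features(SAMPLE)
```

A client would typically fetch the sources URL first to discover what is available, then request features only for the segment it wants to display.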

DAS 1.6E

DAS exists in several versions. It was originally published in 2001. Over the years the DAS standard bifurcated into the DAS 1.x and DAS2 lines, and DAS 1.x has proved more popular than DAS2. The current standard is 1.53E, but a DAS 1.6E standard came out of a workshop in March 2009. DAS 1.6E is expected to provide the functionality that many DAS2 users desired. The 1.6 spec has new features and consolidates the way DAS is being used; 1.6E adds extensions that are still being developed.

Some DAS 1.5/1.6 commands: Sources, Features, Sequence, Types, Stylesheet, Structure, Alignment, and Interaction.

Some extensions in DAS 1.6:

  • Represent features with more than two levels
  • Reliably relate feature types to a more structured ontology.
  • Identify when two DAS servers are using the same coordinate system.
  • A standard way to create and edit DAS features.
  • Verification of DAS servers for standards compliance.

DAS Registry

The DAS Registry is increasing its validation capability for the 1.53E spec and the upcoming 1.6E spec. A RelaxNG schema has been created to support this.

Current and Future Work

  • More validation (headers and feature by id).
  • Capability of bulk uploading/mirroring DAS sources to the Registry (sources command).
    • Adding all of Ensembl Genomes (bacteria and viruses) as DAS sources and to the registry.
  • Completing the 1.6 spec: hierarchies, nextFeature.
  • Updating client libraries and servers to work with both the 1.53 and 1.6 specs.
  • New user interface to the registry for faster searching using Lucene. A limited version is also available from the Sanger and EBI sites.
  • Greater support for ontologies, for example "give me all DAS sources that provide genes".

Some Implementations

DAS Libraries

  • Perl
    • Proserver, LDAS - servers
    • Bio::Das::Lite - client library
  • Java
    • Dazzle, MyDAS - servers
    • Dasobert - client library

DAS Servers

  • Affymetrix
  • BioSapiens servers
  • Ensembl server
  • KEGG DAS
  • Sanger DAS server
  • EBI Genomic DAS server
  • EBI Protein DAS server
  • Uniprot DAS server
  • TIGR's listing of servers
  • UCSC server

DAS Clients

  • Ensembl
  • Spice
  • Dasty
  • Pfam
  • STRAP
  • DASher


InterMine update

Julie Sullivan

Some bullet points from Julie's talk on InterMine:

  • InterMine has RESTful web services
  • Web service can return HTML.
  • FlyMine started in 2002. It has 5 developers and releases about 10 times a year.

Mines4Mods

The Mines4Mods project started in May 2009. It is funded by a 2 year grant. RGD, SGD, and ZFIN are all participating, and each has half a developer working on it. The project is aiming for interoperability between InterMine instances. The hope is to port results from one InterMine to another, and then use them in a query in their new location.


Show and Tell, Discussion

Daniel Sobral and Baptiste Brault of INRA Versailles demonstrated the Aniseed website, particularly the anatomy and gene expression atlas parts of it. Aniseed is currently in the process of converting their schema to Chado and is planning on making their web interface available to the GMOD community.

Agenda Suggestions

If you have items that you would like to discuss (or be discussed) at this meeting, please add them here.

Location

The meeting was held at the Medical Science Teaching Centre (MSTC) at the University of Oxford, in Oxford, United Kingdom.

Lodging

See the Lodging section of the GMOD Europe 2009 page for information on lodging for both the summer school and this meeting.

Cost and Registration

The cost was £50, which included a catered lunch on Friday. Space was limited to the first 50 people to register.

Mailing List

The meeting has a mailing list that all meeting related correspondence will be sent to:

august2009gmodmeeting@gmod.org

Any meeting participant can send an email to the list.

CBRG

We would like to thank the Computational Biology Research Group (CBRG) of the University of Oxford for hosting and financially supporting the week's events.

I would particularly like to thank Stephen Taylor, Simon McGowan and Zong-Pei Han for their help and support during the entire week of GMOD Europe 2009. We could not have done this without you. -- Dave C.

Attendees

First Name Last Name Affiliation
Ambrose Andongabo Rothamsted Research
ERICK ANTEZANA BAYER BIOSCIENCE NV
T. Grant Belgard MRC FGU
Alessandra Bilardi CRIBI - University of Padova
Dan Bolser Dundee University
Baptiste Brault INRA Versailles
Tim Burgis Imperial College- London
Scott Cain Ontario Institute for Cancer Research
Maria Cartolano University of Oxford
Dave Clements NESCent
Ros Cutts Imperial College
Etienne P de Villiers ILRI
Phil East Cancer Research UK
Matt Eldridge Cancer Research UK- Cambridge Research Institute
Ben Elsworth University of Edinburgh
Josh Goodman FlyBase (Indiana University)
Cyprien GUERIN INRA
Zong-Pei Han Computational Biology Research Group, Oxford
Andreas Heger MRC FGU
Ian Holmes UC Berkeley
Jim Hughes MRC
Bernd Jagla Institut Pasteur
Baptiste Laporte IBDML
Ed Lee Lawrence Berkeley National Laboratory
Jacob Lemieux Computational Biology Research Group
Siu-wai Leung University of Macau
Christopher Love Rothamsted Research
Emanuele Marchi University of Oxford
Simon McGowan Computational Biology Research Group, Oxford
Sheldon McKay Cold Spring Harbor Laboratory
FREDERIC POTIER BAYER BIOSCIENCE NV
Peter Rice European Bioinformatics Institute
Kim Rutherford University of Cambridge
Michelle Simon Medical Research Council
Daniel Sobral IBDML
Aengus Stewart London Research Institute CRUK
Julie Sullivan InterMine- Dept of Genetics- Cambridge
Steve Taylor Computational Biology Research Group, Oxford
Adrian Tivey Wellcome Trust Sanger Institute
Giles Velarde Wellcome Trust Sanger Institute
Pieter Emiel Ver Loren van Themaat Max Planck Institute for Plant Breeding Research
Jonathan Warren The Sanger Institute
Xikun Wu Institute for Animal Health
Jun Zhao University of Oxford
Pinglei Zhou Harvard University/FlyBase

Feedback

Attendees were asked to provide feedback at the end of the meeting.


Q: Would you recommend GMOD meetings to others?

Yes Maybe No
100% 0% 0%


Q: Please rate the meeting(s) using the following scale: 1 (not at all) to 3 (reasonably) to 5 (exceptionally).

1 2 3 4 5
How useful was the meeting? 0% 0% 23% 53% 23%
Was the meeting well run and organized? 0% 0% 18% 47% 35%


Q: Was the meeting what you expected?

No. Yes. Yes!
0% 86% 14%

Longer responses:

  • Yes of course! The meeting was really interesting!
  • yes and it was good to for me to meet the developers
  • Yes, pretty much. It was in part this time just a good way to meet up with particular collaborators.
  • Yes, but I was hoping to learn more about Chado
  • Very very useful.


Q: Which presentations and sessions at this meeting were the most useful or interesting?


Q: Do you have suggestions for improving GMOD meetings in the future?

  • Another one in Europe please. We could host one in Hinxton but I am prepared to travel
  • I was able to come to the meeting because it was in Europe, so more meetings in Europe would be very helpful
  • Maybe some people can present posters during Coffee Breaks for the next GMOD meeting.
  • more sessions
  • Less instruction copying, more problem solving
  • no
  • I do think a informal or formal drinks or meal in the evening is a good idea, even if it's just - 'we are going to this pub to get a meal' which delegates can go to or not and then pay for themselves?
  • Better time keeping
  • Somewhere drier ;-) Seriously, it didn't seem to have the energy of some of the other 2 I've been to - maybe me or maybe people tired from the course
  • Try encouraging outsiders to bring non-genomic information to GMOD E.g. people from BDGP, ZFIN expression data, 4Dxpress, BGee, etc...


Additional feedback, suggestions, criticism, and praise.

  • This is the first time ever to learn to make use of so many useful bioinformatics tools from the developers and experts of them.
  • Thanks for the meeting.
  • Thanks very much to the organisers for their hard work - I definitely thought it was worth it

Next Meeting: January 2010 in San Diego California

The next GMOD Community Meeting was held January 14-15, 2010 in San Diego, California, United States, immediately following PAG 2010.