GMOD Evo Hackathon Proposal

Revision as of 23:39, 12 April 2010

NESCent Hackathon on GMOD Tools for Evolutionary Biology

The GMOD Evo Hackathon aims to bring together experts in evolutionary biology, software, and bioinformatics to design and implement enhancements for GMOD tools, improving their support for evolutionary biology.



Motivation

The GMOD project is a confederation of open-source projects developing software tools for storing, managing, curating, and publishing biological data. GMOD tools are used by many large and small biological databases, and increasingly by individual research labs, for the dissemination of the results of experimental research and curated knowledge.

While these software tools provide a powerful and feature-rich basis for working with biological data, many GMOD tools still lack features needed to effectively support evolutionary biology. GMOD's current strengths lie in genomic data. In the past, GMOD has not emphasized areas that are traditionally important in evolutionary biology, such as phenotypes, phylogenetics, population genetics, and natural diversity. Conversely, evolutionary researchers have historically had little access to genomic data because of its cost.

However, two trends are now bringing the interests of these communities together in ways that can benefit both. First, high-throughput sequencing technologies have made large-scale, multi-specimen/multi-organism sequencing affordable, even for small labs. This trend has also vastly increased the volume and diversity of public genome and transcriptome data, creating tempting opportunities for evolutionary and comparative analysis. GMOD provides important tools for working with this data: GBrowse and JBrowse for visualization; Chado for storage, indexing, and as a backend for analyses; MAKER for high-throughput standardized annotation; and Galaxy as a comparative genomics workbench. Second, GMOD's existing user base is increasingly pursuing research in areas such as phenotypes and population biology, in which evolutionary biologists have a great deal of experience.

A hackathon is a way to bring these communities together so that GMOD tools can be enhanced to:

  1. better serve the needs of evolutionary biologists for data types GMOD already handles well, and
  2. better support data types that evolutionary biologists have a longstanding interest in, but that are new to GMOD.

We are seeking NESCent's support and hosting for this event. NESCent has good facilities in a pleasant setting and, more importantly, has significant experience hosting events of this nature.

Specific objectives

The organizers have identified the following broad objectives to guide work at the event. They are based on our own experience and on interaction with others in the GMOD and evolutionary biology communities, including insights gained by the recent Tools for Emerging Model Systems working group (EMS WG) at NESCent. This group consisted of evolutionary biologists working on non-model organisms and struggling with how best to exploit their data and connect their communities.

During the hackathon, participants will refine and distill these and other options into concrete implementation objectives.

The hackathon concentrates on writing code. All code and documentation will be made available immediately and freely to the community under an OSI-approved open source license.

Better GMOD support for alignment metadata

Sequence Alignment Map (SAM) format has become the de facto standard format for representing short-read genome alignments, but it still has only limited support in GMOD tools. One objective of the proposed hackathon is to design and implement improvements to GMOD tools to give them excellent support for SAM data, particularly for cross-species alignments and views. This includes extending Chado to properly store experimental and alignment metadata, which is vital for identifying the source and target genome builds in comparative analysis. We expect the improved analysis and storage tools resulting from this work to make cross-species comparative analysis of large-scale datasets much more accessible.
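
As a rough illustration of the kind of metadata involved: the @SQ lines of a SAM header can carry, for each reference sequence, a name (SN), a length (LN), and optionally a genome assembly identifier (AS) and species (SP). The sketch below pulls those tags out of a header. It is not part of any GMOD tool; the function name and the sample header are made up for illustration.

```python
def parse_sq_lines(header_text):
    """Return one dict of tag -> value for each @SQ line in a SAM header."""
    records = []
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # After the record type, header fields are tab-separated TAG:VALUE pairs.
        fields = line.split("\t")[1:]
        records.append(dict(field.split(":", 1) for field in fields))
    return records


# Made-up header for illustration only.
example_header = (
    "@HD\tVN:1.0\tSO:coordinate\n"
    "@SQ\tSN:chr1\tLN:1000\tAS:buildA\tSP:Example species\n"
    "@SQ\tSN:chr2\tLN:2000\tAS:buildA\tSP:Example species\n"
)

for rec in parse_sq_lines(example_header):
    # AS and SP are optional in SAM, so use .get() for them.
    print(rec["SN"], rec["LN"], rec.get("AS"), rec.get("SP"))
```

Values are kept as strings here; a loader for Chado would presumably convert LN to an integer and map AS and SP onto whatever tables the extended schema ends up defining.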

Note: Next generation sequencing (NGS) data in SAM format pops up repeatedly in this proposal. This is a function of both the widespread adoption of SAM as the standard way to represent this data, and of the importance of NGS data in evolutionary biology. The EMS working group identified 'working with NGS data' as their number one concern.

GBrowse_syn compatibility with SAM data

The GBrowse_syn comparative genomics viewer does not currently support SAM data. GBrowse_syn runs on the GBrowse 1.x platform and needs to be upgraded to the GBrowse 2.x platform before it can support SAM data. We may also want to extend basic SAM functionality to show per-base information.

GBrowse_syn database backend scalability

Currently, each pairwise comparison between organisms is stored in a separate database to drive GBrowse_syn. This quickly becomes unwieldy for large numbers of genomes. A hackathon objective could be to address this scalability issue.
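
One possible direction, sketched here only as a starting point for discussion, is to keep all pairwise comparisons in a single database and key each aligned block by its source and target genome. The table and column names below are hypothetical and do not reflect the actual GBrowse_syn schema; SQLite is used just to keep the example self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE genome (
    genome_id INTEGER PRIMARY KEY,
    name      TEXT UNIQUE NOT NULL
);
-- One row per aligned block, for any pair of genomes (hypothetical layout).
CREATE TABLE aligned_block (
    block_id      INTEGER PRIMARY KEY,
    src_genome_id INTEGER NOT NULL REFERENCES genome(genome_id),
    src_seq       TEXT NOT NULL,
    src_start     INTEGER NOT NULL,
    src_end       INTEGER NOT NULL,
    tgt_genome_id INTEGER NOT NULL REFERENCES genome(genome_id),
    tgt_seq       TEXT NOT NULL,
    tgt_start     INTEGER NOT NULL,
    tgt_end       INTEGER NOT NULL
);
CREATE INDEX aligned_block_pair ON aligned_block (src_genome_id, tgt_genome_id);
""")

conn.executemany(
    "INSERT INTO genome (name) VALUES (?)",
    [("genomeA",), ("genomeB",), ("genomeC",)],
)

# Made-up blocks: genomeA aligned against genomeB and against genomeC.
conn.executemany(
    """
    INSERT INTO aligned_block
        (src_genome_id, src_seq, src_start, src_end,
         tgt_genome_id, tgt_seq, tgt_start, tgt_end)
    VALUES ((SELECT genome_id FROM genome WHERE name = ?), ?, ?, ?,
            (SELECT genome_id FROM genome WHERE name = ?), ?, ?, ?)
    """,
    [
        ("genomeA", "chr1", 100, 500, "genomeB", "chrX", 150, 550),
        ("genomeA", "chr1", 100, 500, "genomeC", "scaf7", 10, 410),
    ],
)


def blocks_between(conn, src_name, tgt_name):
    """Return aligned blocks for one ordered genome pair."""
    rows = conn.execute(
        """
        SELECT b.src_seq, b.src_start, b.src_end,
               b.tgt_seq, b.tgt_start, b.tgt_end
        FROM aligned_block b
        JOIN genome s ON s.genome_id = b.src_genome_id
        JOIN genome t ON t.genome_id = b.tgt_genome_id
        WHERE s.name = ? AND t.name = ?
        ORDER BY b.src_seq, b.src_start
        """,
        (src_name, tgt_name),
    )
    return rows.fetchall()


print(blocks_between(conn, "genomeA", "genomeB"))
```

With this layout, adding a genome means adding rows rather than creating a new database per pair. Whether it would perform acceptably at GBrowse_syn scale is an open question for the hackathon.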

Whole-Genome Comparison Visualization

This functionality could be built from scratch in GBrowse_syn, or we could add an "export to tool x" feature to hand the data to an existing tool. There has been some talk of trying to bring Circos or MizBee into the GMOD fold. We could approach those projects to pursue interoperability.

Either way, we currently don't have an easy way to do whole genome visualization for data in GBrowse_syn.

Phylogenetics Visualization

SGN has a nice web-based multiple alignment and tree browser. One possible implementation objective would be to extract it as a GMOD component.

Evolutionary phenotype data in Chado

This task could include several items. Support for phenotype data in Chado needs to be rationalized, as it currently supports two distinct models (an older prototype, a more robust followup) that use overlapping sets of tables. Ideally, we will settle on one set of well defined tables to facilitate future work, as well as come up with migration plans for those using the old model. Included in this would be the ability to support both EAV (Entity-Attribute-Value, used in the "old" schema) and EQ (Entity-Quality), both of which can leverage PATO and other ontologies for phenotype term specificity. In particular, we will want to make sure that the Phenotype module is congruent with the Natural Diversity module (below), so that proper links are made between the recorded phenotypes, and the environments in which they are observed.
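
To make the difference between the two representations concrete, here is a minimal sketch. It assumes that in EQ a single quality term stands in for the attribute/value pair of EAV, with the attribute implied by where the quality sits in an ontology such as PATO. The class names, the lookup table, and the term labels are all hypothetical and are not Chado tables or real ontology identifiers.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class EAVPhenotype:
    entity: str      # what was observed, e.g. an anatomy term
    attribute: str   # the property that was scored
    value: str       # the observed value of that property


@dataclass(frozen=True)
class EQPhenotype:
    entity: str      # what was observed
    quality: str     # one quality term covering attribute and value together


# Hypothetical lookup from (attribute, value) to a quality term label.
QUALITY_FOR: Dict[Tuple[str, str], str] = {
    ("color", "red"): "red",
    ("shape", "round"): "round",
}


def eav_to_eq(p: EAVPhenotype,
              quality_for: Dict[Tuple[str, str], str]) -> Optional[EQPhenotype]:
    """Convert an EAV record to EQ, or return None if no quality term is known."""
    quality = quality_for.get((p.attribute, p.value))
    if quality is None:
        return None
    return EQPhenotype(entity=p.entity, quality=quality)


old_record = EAVPhenotype(entity="petal", attribute="color", value="red")
print(eav_to_eq(old_record, QUALITY_FOR))
```

A migration plan for users of the old model would need something like this mapping at its core, plus a policy for records that have no matching quality term.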

We could also add Phenote support. At least one current GMOD user has written data adapters to load Phenote-generated annotations into Chado and retrieve them again. This may be a suitable foundation for a general-purpose program for doing this data transfer. Similarly, Phenex is a tool for curating evolutionary character trait data across multiple species. It is built on the same code base as Phenote, although it currently uses a different database backend for storage. Chado could be enhanced or adapted to support this type of data as well.

Population diversity support for Chado and associated application connectivity

(a la GDPDM)

Phenotypic diversity data is also very useful for evolutionary studies. In-depth analysis of this data requires proper representation, handling, and storage: specific phenotypes, environmental conditions, population details, and other experimental metadata all must be tracked, and more importantly cross-referenced with known genomic and genetic information. Developers at this hackathon will work to add ... One of the best conceptual tools for representing this type of information in machine-readable form is ontologies, and GMOD's open-source Chado database schema is the most mature, flexible, and feature-rich storage engine for storing ontology-based data. However, it lacks specific support for evolutionary phenotype data or natural diversity data. Earlier this year, a working group was formed to work on the design of a new Natural Diversity module for Chado, and one of the objectives for this hackathon will be to finalize and integrate the group's work into the larger Chado schema, and to make sure it integrates well with the existing or new Phenotype module (discussed in the previous aim).

Web Interfaces to Evolutionary Data in Chado

The ANISEED project has an atlas/image-based web interface for phenotype, gene expression, and cell fate data. They are currently developing version 3 of this interface, called NISEED, that will be based on Chado for the first time.

Tripal is a Drupal-based web interface to Chado databases. It supports interfaces for several popular data types, but does not currently support phylogenies, phenotypes, expression, or natural diversity data. We could extend it to evolutionary data types as part of the hackathon.

Natural Diversity / Population Genetics / Multidimensional Data Visualization in a Genomic Context

The Barley1k project (Eyal Fridman) is an example dataset that should be supportable by GMOD. They gathered a thousand wild samples of barley and recorded many local environmental conditions. The Natural Diversity module will allow us to store this type of data. However, we lack tools to visualize such multi-dimensional data in a genomic context (e.g., GBrowse, JBrowse, GBrowse_syn). This could be solved either with specific new glyphs and plugins, or with generic interfaces to statistical, geolocation, or image-based visualization packages.

Discussion / Development Topics

This section contains early-stage ideas that merit discussion and serious consideration by the attendees of the hackathon, but are not yet developed enough for specific implementation objectives.

Post-Reference Genome Tools

This is a great example of how evolutionary biology can help lead the rest of the GMOD community.

The concept of a reference genome has been an extremely valuable tool in model organisms. The importance of a reference genome is not diminishing, but the need for an additional framework is on the rise.

To explain this, let's contrast evolutionary and developmental biologists. Developmental biologists embrace and strive for similarity as a means of controlling experimental conditions. Inbred lines of organisms do not usually occur in nature, but are usually preferred for developmental biology work. Developmental biology has anatomy ontologies and staging series based on stereotypic progression of anatomical development in single inbred lines. Developmental biologists strive to eliminate genetic and environmental diversity in order to create controlled experimental conditions. The concept of a reference genome historically has fit very well into this paradigm.

In contrast, evolutionary biologists embrace and study genetic and environmental diversity. Evolutionary biologists typically study populations rather than individual lines. They characterize and analyze differences, rather than eliminate them. For evolutionary biologists, a reference genome is much less of a central tool than it is for developmental biologists.

Second-generation sequencing now allows evolutionary biologists to exploit genomic data for populations or large numbers of individuals. It also allows every other kind of biologist to do the same. We currently have tools to show linkage disequilibrium, and genotype and allele frequencies, but these still typically show data in the context of a reference genome.

By some estimates, three years from now many projects will have thousands of full genomes. In such an environment, does the concept of a reference genome still remain relevant? How should GMOD tools change, grow, and adapt?

High-throughput Imaging / Phenotyping

Adoption of high-throughput imaging and phenotyping technologies is increasing. What software exists for working with this type of data, and how should the GMOD community participate?

Third-generation Sequencing

What challenges will the GMOD community face in handling third-generation (single-molecule) sequencing data, and how can we prepare for them? Second-generation sequencing technologies typically produce many short reads at very high coverage. The high coverage is necessary to compensate for lower accuracy. Challenges with second-generation data include 1) dealing with the huge amount of data that comes with the high coverage, 2) distinguishing read and amplification errors from signal, and 3) assembling short reads.

Third-generation technologies will have some commonalities and some key differences. First, they will continue to produce large volumes of data. The nature of the data will change significantly, though. Third-generation technologies are expected to be significantly less error-prone, thus reducing the need for high coverage. This will also reduce cost and turnaround time. While the average depth of the data (coverage per sample) will decrease, the width of the data (number of samples) will greatly increase. The technology will enable more samples to be sequenced, and at greater accuracy. The improved accuracy and longer read length will also make assembly easier.
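
The depth/width trade-off can be made concrete with simple arithmetic. Mean coverage is roughly the number of reads times the read length, divided by the genome size. For a fixed sequencing budget, lowering the coverage needed per sample raises the number of samples that fit. The sketch below uses made-up round numbers, not figures for any real instrument, and the function names are illustrative.

```python
def mean_coverage(num_reads, read_length, genome_size):
    """Approximate mean depth of coverage for one sample."""
    return num_reads * read_length / genome_size


def samples_per_run(total_bases, genome_size, target_coverage):
    """How many samples fit in a run at a given per-sample coverage."""
    return total_bases // (genome_size * target_coverage)


# Illustrative numbers only (assumptions, not instrument specifications).
GENOME_SIZE = 100_000_000        # a 100 Mb genome
RUN_OUTPUT = 30_000_000_000      # 30 Gb of sequence per run

# 30 million reads of 100 bases over a 100 Mb genome.
print(mean_coverage(30_000_000, 100, GENOME_SIZE))

# Same run output, deep versus shallow coverage per sample.
print(samples_per_run(RUN_OUTPUT, GENOME_SIZE, 30))
print(samples_per_run(RUN_OUTPUT, GENOME_SIZE, 10))
```

Under these assumptions, cutting the required coverage to a third triples the number of samples per run, which is the shift from deep to wide data described above.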

Areas of discussion include:

  • data modeling and storage
  • graphical visualization
  • online display and searching

Subgroups

Participants will split into subgroups at the event. The composition and tasks of the subgroups will be guided by the overall objectives, but will otherwise emerge and be self-determined by the participants both prior to and at the event.

Participants

Participation will be arranged by invitation and by self-nomination followed by review. If you are interested in participating, please contact one of the organizers.

Participants List

Organization

Organizing Committee: Nicole Washington, Hilmar Lapp, Sheldon McKay, Scott Cain, Robert Buels

Time & Venue: The hackathon is tentatively scheduled to take place June 7-11, 2010 at NESCent in Durham, North Carolina.

Agenda: The agenda of the event will be posted here once developed by the participants.

Suggestions

  • add suggestions here as bullet points