January 2009 GMOD Meeting

January 15-16, 2009

Following PAG 2009
San Diego, California, USA
San Diego Convention and Visitors Bureau


This GMOD community meeting was held January 15-16, 2009, in San Diego, immediately following Plant and Animal Genome (PAG 2009). There were over 50 participants at the meeting.




Agenda

Thursday, January 15

Time Topic Presenter(s) Links
10:00 AM Registration
10:30 AM Introductions Scott Cain
11:00 AM The State of GMOD Scott Cain PPT, Summary
11:30 AM A variety of GMOD Help Desk stuff Dave Clements PDF, Summary
12:00 PM Lunch one hour 30 minutes
1:30 PM Drupal and MarineGenomics.org Stephen Ficklin PDF, Summary
2:00 PM Artemis and Chado at GeneDB Robin Houston PPT, PDF, Summary
2:30 PM modENCODE: extending Chado, BIR-TAB, & GBrowse for automating data validation & display Nicole Washington PDF, Summary
3:00 PM Break
3:30 PM A RESTful interface for MODs? Josh Goodman PPT, PDF, Summary, Discussion
4:00 PM Metadata Input and Submission Tool and GIS linked metagenomic database Iddo Friedberg and Christopher Condit PDF, Summary
4:30 PM Data Representation in Chado: Best Practices Joshua Orvis and/or Scott Cain Summary, Discussion
5:00 PM Dinner (on your own)

Friday, January 16

Time Topic Presenter(s) Links
9:00 AM Chado and GUS at SBRI Dhileep Sivam and Isabelle Phan PPT, PDF, Summary, Discussion
9:30 AM BioMart Arek Kasprzyk PDF, Summary
10:00 AM BeeSpace Barry Sanders, Dave Arcoleo PPT, Summary
10:30 AM Break
11:00 AM WebGBrowse: GBrowse configuration management Ram Podicheti PPT, PDF, Summary
11:30 AM JBrowse (aka GBrowse 3.0) Mitch Skinner ODP, PDF, Summary
12:00 PM Lunch one hour 30 minutes
1:30 PM EcoliWiki and TableEdit Daniel Renfro Slides not uploaded (the PDF export was too large to upload; contact the presenter for a copy), Summary
2:00 PM Generic Gene Page XML Scott Cain PPT, PDF, Summary, Discussion
2:30 PM GMODWeb and package management Brian O'Connor PPT, PDF, Summary
3:00 - 5:00 PM MIGS and MIMS Iddo Friedberg Summary
Bovine Genome Database Justin Reese and Chris Childers PPT, PDF, Summary
GNPAnnot Pierre Larmande ChadoControler, Summary

Themes and Discussions

Several themes ran throughout the meeting.

Data Sharing

Several presentations touched on this:

A common question during these talks was how much should we do? Should we implement a comprehensive data sharing protocol or start with very modest goals, or should we aim for a sweet spot in the middle? Should we emphasize robustness or ease of implementation? Should GMOD support semantic web efforts?

Josh Goodman argued that RESTful interfaces provide the highest payoff for the least amount of effort -- that RESTful was both useful and the easiest to implement.

The semantic web in general, and SSWAP in particular, were discussed. Rex Nelson of SoyBase pointed out that SoyBase's map data is now available through SSWAP. RDF, the Bio2RDF project, and the Swoogle semantic web search engine were also mentioned.

Josh Goodman, Rob Buels, Rex Nelson, and Kevin Clancy formed the Web services working group and will continue and expand this discussion with the GMOD community.

Joshua Orvis's "Data Representation in Chado: Best Practices" session dealt with the same issue, this time in the context of representing biology within Chado in the same way across organizations. In this session we proposed converging on common representations by having organizations post their current Chado practices to the wiki, discussing them on the wiki or on the GMOD-Schema mailing list, and then converging on a common set of Chado best practices. Common practices would enable both data sharing and common tools. Joshua got the ball rolling by describing IGS's Chado practices on the IGS Data Representation page.

RDF got additional discussion during Dhileep Sivam and Isabelle Phan's session on Chado and GUS at SBRI. UniProt uses RDF to represent its data. XML gives you a tree representation, while RDF gives you a graph. Graphs often better reflect what is being described. SPARQL is the standard query language for accessing RDF.
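The tree-versus-graph contrast can be illustrated with a minimal sketch in plain Python (no RDF library is used). Every identifier here, such as `gene:abc1` or `encodes`, is made up for illustration, and the `match` helper only loosely imitates a single SPARQL triple pattern; it is not SPARQL.

```python
# Each RDF statement is a (subject, predicate, object) triple.
# All identifiers below are hypothetical, for illustration only.
triples = [
    ("gene:abc1", "encodes", "protein:P1"),
    ("gene:abc2", "encodes", "protein:P1"),  # a second edge into the same node
    ("protein:P1", "participates_in", "pathway:glycolysis"),
    ("gene:abc1", "located_on", "chromosome:2"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def parents(triples, node):
    """Subjects pointing at `node`. A strict tree allows at most one."""
    return sorted({s for s, _, o in triples if o == node})


print(match(triples, p="encodes"))
print(parents(triples, "protein:P1"))
```

Here `protein:P1` is reached from two different genes, which a single XML tree can only express by duplicating the element or by adding cross-references.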

Presentations

The presentations are listed here in a very approximate order:

  • GMOD Project Presentations
  • GMOD Components
  • GMOD User Experiences

The State of GMOD

Activity since the July 2008 GMOD Meeting:

Releases

  • GBrowse 1.69 released
  • Apollo 1.9.? released
  • CMap 1.01 released
  • Generic Gene Page released
  • GMOD 1.1 (Chado)
    • Minor Schema changes
    • Minor fix to GFF3 preprocessing tools
    • Chris Mungall's Materialized views of controlled vocabulary terms.
  • Bio::Graphics split from BioPerl
    • Both for releases and source code repository.
    • As soon as BioPerl 1.6 is released, installing Bio::Graphics from CPAN should work.

Howard Hughes Medical Institute Science Education Alliance

A program set up by the Science Education Alliance (SEA) staff of the Howard Hughes Medical Institute (HHMI), in close collaboration with Ed Lee of Lawrence Berkeley National Laboratory, uses GMOD Components to teach sequencing from sample collection to annotation to submission to GenBank. College freshmen at 12 schools isolate and then sequence Mycobacterium smegmatis phages, manage the sequence with Chado, annotate it with Apollo, and display it with GBrowse. At the end of the course, each school submits a newly sequenced and annotated phage genome to GenBank. Matt Conte, the SEA bioinformatics specialist at the time the workflow was developed and implemented, attended the 2008 GMOD Summer School.

GBrowse Update

Roadmap:

  • 1.69 released in August. Lots of new stuff
  • Popups (from Sheldon McKay)
  • Vertical dragging of tracks
  • Rubber banding (also Sheldon)
  • Quantitative data (Wiggle tracks)
  • Conservation data
  • Track sharing
  • Galaxy integration

1.70 Release:

2.0 Release:

  • Parallelizable track data sources and rendering.
  • User interface changes also make it feel quicker, even when running on a single CPU
    • Tracks rendered as soon as they are done.

3.0 Release


GMOD Help Desk

Dave gave an update on what the GMOD Help Desk has been up to since the July 2008 GMOD Meeting, and also talked about future plans.

The 2008 GMOD Community Survey was conducted in October. Dave summarized the results of the survey.

2009 will be the year that natural diversity support becomes an integral part of GMOD:

  • Dave will polish the natural diversity module and fold it into production Chado. This was originally written for HeliconiusDB and is based on GDPDM, a data model used by Gramene and MaizeGenetics. This work may involve tweaking existing modules like Phenotype.
  • Ben Faga is working on adding geolocation visualization to GBrowse
  • Dave is involved in a NESCent working group with the goal of extending GMOD to better support evolutionary and ecological research.
  • Dave is hoping to organize a hackathon later this year to specifically address natural diversity needs in GMOD.

Some upcoming meetings and courses:

See below for more.

Documentation and Web Site:

  • A GBrowse user tutorial will be released by OpenHelix later this month. This is a Flash based tutorial that goes into great detail on how to use GBrowse's basic and advanced user features.
  • GMOD for High-throughput sequencing
    • Dave will be putting time into how GMOD supports this and outreach to the community that is doing this.
  • Web site upgrades
    • Went to MediaWiki 1.12 in August 2008. Got better search ability and uniform URLs.
    • Planning to upgrade to 1.13 or 1.14 this spring, and maybe give the site a new look.

And some delayed tasks that Dave talked about at the July 2008 GMOD Meeting:

Finally, Dave solicited feedback on what he should be doing. Some highlights:

  • Document what parts of the GFF3 standard GBrowse can't currently deal with.
  • Document what variables are available in GBrowse callbacks.

(Russell Smithies also mentioned that if you put GBrowse callbacks in a separate file and then include them, the GBrowse parser becomes much less fragile.)

A RESTful Interface for MODs?

Josh Goodman of the FlyBase project proposed that GMOD support a RESTful interface for biological data. Josh started by listing many of the APIs that already exist in GMOD or at GMOD user sites, and then asked the question "Why have another one?" Here's why:

  • None of the existing APIs are compatible with each other.
  • Use different data models
  • Some assume the Chado data model
    • and not everyone uses Chado, nor will they
  • Some assume a particular language like Java or Perl.

Why go with REST?

  • If adoption costs are high, people won't use a technology
  • REST tends to have a low cost of adoption
  • It's language neutral

What should a GMOD REST implementation have?

  • Simple and lightweight (low cost of adoption)
  • URL based
  • Versioned URLs for stability
    • Versions refer to format/API, not to data.
  • Data model neutral (that is, don't assume Chado)
  • Result lists in XML or JSON
  • Gene records in Generic Gene Page XML
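As a rough illustration of the wish list above, the sketch below builds URLs that carry an API version and a result format. The path layout, the `v1` label, the `example.org` host, and the `gene0001` identifier are all assumptions made for this example; they are not taken from the proposed GMOD REST API.

```python
from urllib.parse import quote, urlencode

# Hypothetical: the version names the format/API, not a data release.
API_VERSION = "v1"


def gmod_rest_url(base, resource, identifier=None, fmt="json", **filters):
    """Build a versioned, data-model-neutral URL (hypothetical layout)."""
    parts = [base.rstrip("/"), API_VERSION, quote(resource)]
    if identifier is not None:
        parts.append(quote(identifier, safe=""))
    url = "/".join(parts) + "." + fmt
    if filters:
        # Sort the filters so the same query always yields the same URL.
        url += "?" + urlencode(sorted(filters.items()))
    return url


# A single gene record, requested as XML.
print(gmod_rest_url("http://example.org/rest", "gene", "gene0001", fmt="xml"))
# A result list, requested as JSON and filtered by organism.
print(gmod_rest_url("http://example.org/rest", "gene", organism="fly"))
```

Nothing in these URLs mentions a table, a schema, or a programming language, which is what keeps the interface neutral with respect to the data model behind it.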

See Data Sharing for the discussion that Josh's talk spawned, and GMOD REST API for a proposed standard.

Data Representation in Chado: Best Practices

Joshua Orvis's talk was the second to specifically address common data representations and formats.

The same data can be stored in Chado in wildly different ways. There are few commonly agreed on or documented best practices. Common practices are vital for both data and tool sharing.

An example undocumented/unestablished best practice would describe how to version features in Chado. Should you use naming conventions? Should the Chado Audit Module be used for this? Either approach (or others) may be viable, but without a standard any one group's implementation is likely to be incompatible with every other group's implementation.

This led to two questions:

Q: How do changes to the Chado schema happen?

Scott: Post your ideas to the GMOD-Schema mailing list, people will discuss them, and then Scott (or you) will fold the changes into Chado.

Q: How do we establish Chado best/common practices?

The general response here was:

  • Ask organizations to describe what they currently do, and how they think things should be represented.
    • Joshua created a page, IGS Data Representation, that describes how IGS uses Chado. This page is an excellent template for other organizations to use as a starting point for describing their practices.
  • Use these documents as discussion points for converging on a common set of best practices.

Dave pointed out that once we have standards, a Chado validator could be written to report any data in a Chado database that does not conform to best practices.
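A minimal sketch of what such a validator might look like is shown below. No best practices had been agreed on at the time of the meeting, so the rule here (every `uniquename` must end in a `.N` version suffix) is purely hypothetical, and an in-memory SQLite table with only two columns stands in for Chado's much larger `feature` table.

```python
import re
import sqlite3

# Hypothetical rule: uniquenames must be non-empty and end in ".N".
VERSIONED = re.compile(r".+\.\d+$")


def find_violations(conn):
    """Return (feature_id, uniquename) rows that break the hypothetical rule."""
    rows = conn.execute(
        "SELECT feature_id, uniquename FROM feature ORDER BY feature_id"
    )
    return [(fid, name) for fid, name in rows if not VERSIONED.match(name or "")]


# Toy stand-in for Chado's feature table (the real table has many more columns).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE feature (feature_id INTEGER PRIMARY KEY, uniquename TEXT)")
conn.executemany(
    "INSERT INTO feature VALUES (?, ?)",
    [(1, "geneA.1"), (2, "geneB"), (3, "geneC.12"), (4, "")],
)

for feature_id, uniquename in find_violations(conn):
    print(f"feature {feature_id}: uniquename {uniquename!r} is not versioned")
```

A real validator would presumably hold many such rules, one per agreed practice, and run them against a PostgreSQL Chado instance.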

Generic Gene Page XML

A Perl implementation of the Generic Gene Page XML was released after the July 2008 GMOD Meeting.

There are now at least three servers that generate GMOD Generic Gene Page XML:

The Perl implementation has 11 or so abstract classes that need to be implemented, and the new() method also needs to be overridden.

The XML generated by this package does not conform to Chado XML. It also does not share tags with NCBI's Gene XML. It is very close to UniProt XML. Each column 9 attribute in GFF becomes a comment whose type is the attribute tag name (e.g., "note").
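The column 9 mapping can be sketched as follows. This is an illustration of the idea in Python, not the Perl package's code: the function name is hypothetical, GFF3 attribute syntax is assumed, and whether the real package treats every tag (including `ID`) this way is not stated in the talk.

```python
from urllib.parse import unquote


def attributes_to_comments(column9):
    """Turn a GFF3 column 9 string into (type, text) comment pairs."""
    comments = []
    for field in column9.strip().split(";"):
        if not field:
            continue  # tolerate a trailing semicolon
        tag, _, value = field.partition("=")
        # GFF3 allows several comma-separated values per tag,
        # with reserved characters percent-encoded.
        for item in value.split(","):
            comments.append((tag.strip(), unquote(item)))
    return comments


print(attributes_to_comments("ID=gene01;note=putative kinase,needs review;Alias=abc1"))
```

Each resulting pair carries the tag name as the comment type, so a `note` attribute with two values yields two comments of type "note".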

We don't currently have an XML-Schema or documentation on this format. The Perl package is what currently defines it. We also don't have any code that consumes this XML. BioPerl code could be written that would eat the XML and produce a Bio::SeqFeature object.

We discussed adding sequence to the XML, but did not reach a conclusion on this.

See Data Sharing above for more on this topic.

BioMart

Arek Kasprzyk described the BioMart project with a particular emphasis on tighter integration with other GMOD Components.

BioMart can handle very large datasets and can be configured as a centralized data warehouse or as a federated one. BioMart also enables you to hide your own mart while federating with others. Arek expects a lot of future demand for BioMart to be an in-house bioinformatics portal.

BioMart is being used by the International Cancer Genome Consortium, which expects to do 50,000 human genomes. With this amount of data you have to use the federated data model. They are using OpenID for authentication and have support for versioning.

Three key concepts in BioMart are datasets, filters, and actions. BioMart makes its data available through a web interface, a Perl API, and web services. The three core concepts are used in all of those interfaces.

In the past, Chado has only been supported as a data source to feed BioMart (and a problematic one - see below). However, BioMart's interfaces could also be used as a data source for many GMOD components such as GBrowse, CMap, and Galaxy. BioMart is interested in pursuing this, and Arek floated the idea of a hackathon to help achieve it.

Arek described the current Chado to BioMart mapping as challenging because of the extensive cross-referencing in the schema. This makes Chado very flexible, but it makes it difficult for BioMart to tease out relationships in the data.

Arek was asked to summarize differences between BioMart and InterMine, another similar GMOD component. There are many differences but two key ones are how optimization is done and the data model that each uses.

JBrowse

JBrowse, formerly known as GBrowse 3.0, is a reimplementation of GBrowse functionality in JavaScript.

JBrowse can load data from a number of data sources including GFF files and Bio::DasI implementations like Bio::DB::GFF (GFF2) and Bio::DB::SeqFeature::Store (GFF3) databases. It then translates them into nested containment lists in JSON format.
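For readers unfamiliar with nested containment lists, the sketch below shows the general technique in Python. It is not JBrowse's code or its JSON layout: the dictionary keys, the function names, and the half-open coordinates are assumptions made for this example.

```python
import json


def build_nclist(intervals):
    """Nest each interval under the nearest preceding interval that contains it."""
    # Sort by start ascending, then end descending, so containers come first.
    ordered = sorted(intervals, key=lambda iv: (iv[0], -iv[1]))
    top, stack = [], []
    for start, end, name in ordered:
        node = {"start": start, "end": end, "name": name, "sublist": []}
        # Close every open interval that does not contain the new one.
        while stack and not (stack[-1]["start"] <= start and end <= stack[-1]["end"]):
            stack.pop()
        (stack[-1]["sublist"] if stack else top).append(node)
        stack.append(node)
    return top


def query(nclist, qstart, qend):
    """Names of intervals overlapping the half-open range [qstart, qend)."""
    # Within one list no interval contains another, so ends are sorted too
    # and a binary search finds the first interval ending after qstart.
    lo, hi = 0, len(nclist)
    while lo < hi:
        mid = (lo + hi) // 2
        if nclist[mid]["end"] <= qstart:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    for node in nclist[lo:]:
        if node["start"] >= qend:
            break
        hits.append(node["name"])
        hits.extend(query(node["sublist"], qstart, qend))
    return hits


features = [(100, 500, "gene"), (100, 200, "exon1"), (300, 500, "exon2"), (600, 900, "gene2")]
nclist = build_nclist(features)
print(json.dumps(nclist, indent=2))  # the nested structure serializes directly to JSON
print(query(nclist, 150, 350))
```

Because the structure is plain nested lists, it can be generated once and served as static files, which fits the pre-generated JSON approach described in this section.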

This strategy has a number of benefits:

  • Using pre-generated JSON means no CGI is needed for browsing
    • Thus, installation is easier, and
    • It scales to large genomes and large numbers of users.
  • JSON is also cached by web browsers, making it fast.

The JBrowse configuration file also uses JSON syntax.

Several new features have been added since the last report at the July 2008 GMOD Meeting:

  • Name/ID searching
    • Names and IDs are stored in JSON using a Trie structure (a string prefix tree) and subtrees are loaded lazily.
  • Quantitative tracks
  • Subfeature support
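The name search idea can be sketched with a small in-memory trie. This Python version is only an illustration: the class and function names are hypothetical, and it keeps the whole tree in memory, whereas the lazy loading described above would fetch each subtree only when a user types far enough to need it.

```python
class TrieNode:
    """One node of a string prefix tree."""

    def __init__(self):
        self.children = {}  # next character -> TrieNode
        self.names = []     # complete names that end at this node


def insert(root, name):
    """Add a name to the trie, indexed case-insensitively."""
    node = root
    for ch in name.lower():
        node = node.children.setdefault(ch, TrieNode())
    node.names.append(name)


def complete(root, prefix):
    """Return all stored names that start with `prefix`, sorted."""
    node = root
    for ch in prefix.lower():
        node = node.children.get(ch)
        if node is None:
            return []
    found, stack = [], [node]
    while stack:  # walk the whole subtree under the prefix
        current = stack.pop()
        found.extend(current.names)
        stack.extend(current.children.values())
    return sorted(found)


root = TrieNode()
for name in ["abc1", "abc2", "abd1", "xyz9"]:
    insert(root, name)
print(complete(root, "ab"))
```

Only the nodes along the typed prefix and the subtree beneath it are ever visited, which is why splitting the subtrees into separately loaded pieces is attractive for large name sets.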

A development version of JBrowse is available for download.

GMODWeb and Package Management

Brian O'Connor from the Nelson Lab at UCLA spoke about GMODWeb and RPM package management.

GMODWeb is a GMOD component for generating web sites that are driven by a Chado database. It is based on Turnkey. Turnkey takes a schema and turns it into a website. GMODWeb adds several GMOD specific widgets that add things like support for GBrowse. Brian divided GMODWeb into two distinct phases. First, generating the website. Second, load it into Apache and mod_perl and bring that web site up. Users (see Stephen Ficklin's talk below for an example) generally have little trouble with the first step. However, in the second step users often find themselves in Perl dependency hell. GMODWeb has over 100 Perl dependencies and user success at getting them all lined up correctly depends on a tenacious GMOD Systems Administrator.

Can we do anything about this? Perhaps if we had package management support ...

The second part of Brian's talk was about BioPackages.net, a repository of biology software RPMs for the CentOS (based on Red Hat Enterprise Linux) and Fedora (also part of the Red Hat suite) Linux distributions. RPMs are software packages that clearly specify what other RPMs they depend on. They greatly simplify software management.

BioPackages.net was created by and is currently managed by the Nelson Lab at UCLA. It has a build farm backing it that is used to generate the packages. They would like to see the BioPackages build farm be replicated somewhere else. Build farms have different virtual servers for each version of Linux that is supported. BioPackages currently has a rich, but somewhat dated, library of CentOS4 RPM packages. They are now transitioning to CentOS5, and are currently packaging Chado 1.0 releases, as well as a DAS/2 reference server.

RPMs, while good on paper, have a number of implementation challenges. It is rare that you can find all your needed software in RPM format. Sometimes you have to install from CPAN, or from source, and as soon as that happens the RPM infrastructure starts to lose its integrity. One way around this is to use virtual machines. The group is currently working on a prototype CentOS4 machine with Chado 1.0, recent BioPerl, and Turnkey/GMODWeb 1.4 installed. After this meeting Dave Clements is going to UCLA for two days to learn about RPM generation and the BioPackages.net infrastructure.


Brian would like people to contact him ...

  • Turnkey/GMODWeb: looking to expand Java producer to eliminate Perl dependency problem
  • BioPackages: looking for RPM developers (or Debian package builders for Ubuntu)
  • Virtual Machines: looking to create CentOS5 machines
    • Pre-configured GMOD demo/dev kit
    • Pre-configured Biopackages dev kit
  • Anyone who is using GMOD tools for Next Gen Sequencing (Dave C would also like to know if you are doing this)


EcoliWiki and TableEdit

Daniel Renfro of the Jim Hu lab spoke about EcoliWiki and the TableEdit MediaWiki extension.

EcoliWiki is a wiki for the E. coli community. One goal of wikis is to enable any interested user to make small updates and corrections to the web site. Wikis enable users to easily enter plain text and they support simplified markup languages to enable users to do some basic formatting like bolding or italicization without having to learn the more complex (and none too intuitive) HTML or CSS markup. Wikis enable making the unit of submission very small.

However, even simplified wiki markup for tables is tedious at best and impenetrable at worst. TableEdit is an extension to the MediaWiki wiki package that protects users from dealing with MediaWiki table markup. TableEdit also supports templates, so the same table format can be reused in multiple places in a wiki (e.g., on every gene page, as on EcoliWiki), and integration with backend databases such as Chado.

Recent work on TableEdit includes:

  • Added a button for inserting TableEdit text.
  • Templates can now pull data from a table on another page.
  • Better documentation.
  • Support for really big tables.
  • Conflict detection (simultaneous updates)
  • Help links are 100% editable.
  • Bulk loader

Work continues on TableEdit, and version 2 is expected to support Chado round-tripping. That is, data from Chado can be displayed in TableEdit tables in a wiki, and you can use the TableEdit wiki interface to update data in a Chado database.

Iddo asked about uploading Excel spreadsheets. This function is not currently planned.

WebGBrowse: GBrowse Configuration Management

Ram Podicheti spoke on WebGBrowse, a new web interface for configuring and hosting GBrowse instances that was developed at Indiana University.

WebGBrowse was developed to ease GBrowse configuration. One reason for the popularity of GBrowse is the wide range of glyphs it supports, and the configurability of those glyphs. However, most glyphs are not well documented and some are not documented at all. When documentation does exist, it is aimed at Perl programmers, and to learn all the options supported by each glyph you need to look at the Perl code. GBrowse is now being used in smaller organizations that may not have personnel with relevant experience.

The aim of WebGBrowse is to make GBrowse available to biologists without the installation or support costs. WebGBrowse allows users to upload their own GFF3 datasets and to use a web interface to configure GBrowse. It currently supports configuration options for 42 glyphs. The information about, and options for, each glyph are stored in YAML.

WebGBrowse is available both at IU and as downloadable software.

The next release of WebGBrowse will add more functionality:

  • Support uploading of GFF3 as tar balls
  • Expand the glyph library
  • Allow loading of pre-existing configuration files as a starting point
  • Support GENERAL section configuration
  • Balloons, plugins, etc.
  • Allow group feature configuration
  • Categorize the glyphs
  • Perl callbacks.

In a discussion afterward on how best to document glyphs, Rob Buels (I think) suggested having the glyphs be self-documenting. That is, Bio::Graphics::Glyph would be extended to make it possible to ask a glyph what it can do and what options it supports. This facility could be used by WebGBrowse to learn about glyphs, or to automatically produce wiki documentation for each glyph, or by any other program that cares.

Rob will look into this.
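The self-documenting idea can be illustrated with a small Python analogy. This is not the Bio::Graphics::Glyph API, which is Perl and had no such facility at the time of the meeting; the class names, the `describe` method, and the option descriptions are all hypothetical.

```python
class Glyph:
    """Hypothetical base class; an analogy, not the Bio::Graphics::Glyph API."""

    # Each class declares only the options it adds, with a short description.
    OPTIONS = {
        "fgcolor": "Foreground (outline) color",
        "bgcolor": "Background (fill) color",
    }

    @classmethod
    def describe(cls):
        """Merge option documentation from the class hierarchy, base classes first."""
        merged = {}
        for klass in reversed(cls.__mro__):
            merged.update(vars(klass).get("OPTIONS", {}))
        return merged


class ArrowGlyph(Glyph):
    OPTIONS = {"tick": "Whether to draw tick marks along the arrow"}


# A tool such as a configuration UI or a wiki page generator could call this.
for option, doc in sorted(ArrowGlyph.describe().items()):
    print(f"{option}: {doc}")
```

Because each subclass lists only what it adds, the documentation stays next to the code that implements the option and inherited options are reported automatically.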

Drupal and MarineGenomics.org

Stephen Ficklin of the Clemson University Genomics Institute (CUGI) spoke about his group's work using GMOD components to power a number of web sites.

Most of these sites use Chado as the back end database and either GMODWeb or Drupal to implement the web interface. Stephen elaborated on the benefits and challenges of using Drupal with Chado.

Drupal

CUGI chose Drupal because:

  • Quicker development
  • Easy user contribution
  • Well documented
  • Large user community
  • Plugins
  • Easy to customize look and feel
  • Social networking abilities

Drupal uses:

  • Menus, node, and blocks for organization
  • PHP, CSS, jQuery, and AJAX for the user interface.

Drupal choices:

Drupal and Chado

They are using the sequence, organism, and library modules from Chado.

They kept the Chado and Drupal schemas separate, but Drupal needs to know what is in the Chado database. To bridge the two, they implemented a Chado_feature node in Drupal that correlates each feature node with a Chado feature ID. They also provide forms for updating Chado.

Need to synchronize Drupal and Chado databases. Some data (GFF) is added to Chado first and then copied to Drupal, while other data (EST pipeline results) are added to Drupal first and then copied to Chado.

They have large putative data sets and their BLAST results are not stored in Chado - they are just too big. Instead they use XML-formatted results for each feature, which Drupal then indexes. The result files are stored in the filesystem, organized by DB ID and feature ID.

Tripal

CUGI is calling its Drupal on Chado implementation Tripal and they are planning to release it to the GMOD Community in Spring of 2009 so that others can use it to power their web sites.

Artemis and Chado at GeneDB

GeneDB is a core part of the Sanger Institute Pathogen Sequencing Unit. GeneDB currently has data on 50+ pathogens and expects that number to grow by orders of magnitude in the coming years. GeneDB curators use Artemis to do manual genome annotation. (Artemis serves the same purpose as Apollo, and like Apollo is also implemented in Java.)

GeneDB is currently in the process of moving its data to Chado. GeneDB is also upgrading its web site to pull data from Chado indirectly. Data will be read from Chado and cached as serialized Java objects in a BerkeleyDB database. This approach results in a very responsive (as in darn near instantaneous) web site. The new web site will launch in the first half of 2009.

GeneDB developed a Hibernate mapping for Chado. The feature hierarchy is represented using single table inheritance.
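Single table inheritance means that rows from one table are mapped to different classes according to a discriminator column. The sketch below shows the pattern in Python rather than Java/Hibernate, and it is not GeneDB's mapping: the class names are hypothetical, and a plain string stands in for the feature type, which Chado stores as a reference to a controlled vocabulary term.

```python
class Feature:
    """Base class for every row of the (toy) feature table."""

    def __init__(self, feature_id, name):
        self.feature_id = feature_id
        self.name = name


class Gene(Feature):
    """Rows whose type is 'gene'."""


class Exon(Feature):
    """Rows whose type is 'exon'."""


# Discriminator value -> class. All rows come from the same table.
TYPE_TO_CLASS = {"gene": Gene, "exon": Exon}


def load(rows):
    """Instantiate the subclass named by each row's type; fall back to Feature."""
    return [TYPE_TO_CLASS.get(ftype, Feature)(fid, name) for fid, name, ftype in rows]


rows = [(1, "abc1", "gene"), (2, "abc1:exon:1", "exon"), (3, "contig7", "contig")]
for obj in load(rows):
    print(type(obj).__name__, obj.feature_id, obj.name)
```

The appeal of this pattern for Chado is that the schema keeps its single generic feature table while application code still works with typed objects.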

Chado and GUS at SBRI

Dhileep Sivam introduced how GMOD is used at the Seattle Biomedical Research Institute (SBRI). They currently use Nimblegen's viewer and particularly like its smooth scrolling abilities.

Chado is used in the SSGCID project to store Nimblegen microarray data. Challenges in working with this data include normalization, scaling, feature level aggregations, remapping and visualization. SBRI has a pipeline for discovering protein structure that looks at 60 different resources. Dhileep spends a lot of time writing scripts to parse Nimblegen data, tools for BLAST searches, and scripts to export data from Chado.

This work has led SBRI to use and extend Chado in several new ways:

  • Complexity of querying BLAST searches and microarray data.
    • Use materialized views for both
  • Grouping of genes
    • Use DBXREFs
  • Gene Models
    • Use the simplest possible model.

Isabelle Phan discussed how SBRI uses Chado, GUS, BioMart, Galaxy, Ergatis, Apollo and Manatee to produce, annotate, and manage their data.

SBRI uses Chado and GUS for different purposes. Chado is used (in collaboration with IGS) to store annotation from Apollo and Manatee, and the results of Ergatis workflows. It is used to manage internal data production. GUS (in collaboration with UPenn) powers the web front end and is used for external data access. SBRI uses Chado because of its data model, and GUS because of its strong software engineering and flexibility.

SBRI, like many others, would like to have standardized object-relational mappings (ORMs) for mapping biological data to Chado. They want RDBMS-free data mining, and use BioMart and Galaxy for this rather than Chado and GUS. (Isabelle commented that it took 5 minutes to install Galaxy.) Another option is RDF in combination with a triple store (RDF represents everything with triples) plus Lucene. They want the ability to take what they need from Chado (as little as 6 tables) and map it to ORMs. (See the Data Sharing discussion above for more on this.)

modENCODE: extending Chado, BIR-TAB, & GBrowse for automating data validation & display

The modENCODE project is working to identify all functional elements and find evidence for every gene prediction in worm and fly. They were originally using ChIP-chip, but have now switched to ChIP-Seq technology.

Nicole works at the modENCODE Data Coordination Center (DCC). The DCC is a central collection and validation point for modENCODE's many data providers. They also provide project statistics. The DCC uses Chado, GBrowse, and InterMine. In addition they are also developing many new tools.

They do extensive data and metadata validation before loading both into Chado. They link between metadata and the resulting features and have added methods to add and drop data easily. Protocols have been formalized in BIR-TAB, which is based on MAGE. Protocol inputs and outputs are typed, but internals of the protocol are a black box. This was added to Chado with a custom protocol extension.

modENCODE uses GBrowse 2.0 for visualization. The backing database is very large and they have added methods to easily add and drop datasets. They would like to be able to use PostgreSQL for storing Bio::SeqFeature::Store (GFF3) databases.

modENCODE also does track finding. Chado is scanned for features that belong together and should be shown in the same track, and heuristics are used to group them. This can produce GFF3 or Wiggle or both. The track finding code is written in Ruby.

Finally, they have a submission and publishing pipeline that is written with Ruby on Rails. It also uses the GoogleGraph API.

Many of these tools are available at the modENCODE BIR-TAB svn repository:

svn://public-svn.modencode.org/modencode

BeeSpace

Barry Sanders and Dave Arcoleo gave us an update on what's happened with the BeeSpace project since the July 2008 GMOD Meeting.

First, they migrated from a Java and JSF platform to a Python and Django platform with lots of AJAX support thrown in. Django directly consumes web services and directly supports RESTful interfaces. They are really happy with the usability and productivity of Django. BeeSpace also uses the Ext JavaScript Library (Ext JS), and the team is quite happy with it. Ext JS costs real money, unless your software is GPL, in which case it is free. They also sang the praises of the Firefox Firebug extension for JavaScript development and debugging.

BeeSpace now supports interactive, iterative collection building and automatic collection version management. Currently collection sharing is BeeSpace's only social feature, but they plan to add more in the future.

Gene Summarizer is a part of the BeeSpace software suite that does automated curation of papers. It attempts to mimic what human annotators do. In BeeSpace it currently analyzes only abstracts. Gene Summarizer can analyze full text, but currently does not because of licensing issues, not technical issues.

Gene Summarizer accepts text as input and produces complex output in JSON format. It is available both as a command line program and as a web application.

In the future BeeSpace plans to add gene ontology search tools and provide analysis and clustering of their data. They will also add set operation support to their collections.

Metadata Input and Submission Tool and GIS linked metagenomic database

Iddo and Christopher presented their work on the CAMERA project. CAMERA is a metagenomics project. Iddo and Chris's aim was to have the GMOD community consider how GMOD can help metagenomics projects with their data, and how metagenomics projects might expand what GMOD can do.

Metagenomics involves sequencing whole communities of organisms, usually taken from a spatial sample like a cube of clay, a liter of water, or an area of the human gut. In metagenomics you frequently don't know what organism a DNA fragment came from. The data is huge, noisy, and partial. Metadata is also key: microbes are enormously affected by their habitat and you need to store as much environmental data as possible.

CAMERA uses a number of data standards:

Things to think about:

  • Visualization
    • How do we look at "disembodied" sequence data?
    • "Fragment recruitment" track
    • Visualization of sequence data <--> metadata associations
  • Database: association of metadata and sequence data; queries by metadata

Iddo closed by emphasizing that the new high-throughput technologies change everything and that we need to start thinking about the challenges that come with that right now.

Christopher Condit showed some ways to visualize metadata associated with each sample, and general climatic data for context. Chris used the NASA MODIS data to show both global averages and data for a specific day, down to 4km square resolution.

Bovine Genome Database

Bovine Genome Database, Justin Reese and Chris Childers

Justin and Chris presented the architecture and workflow of the Bovine Genome Database.

  • Use two Chado databases, one "main Chado" to hold semi-stable data (SNPs, ESTs, protein alignments, gene calls) and one "incoming Chado" to hold incoming annotations
  • One GBrowse MySQL database (to serve GBrowse)
  • Flat files to serve BLAST
  • Annotation system
    • ~400 or so annotators, ~4,000 or so annotations so far
    • Annotators pull directly from Chado
    • No Apollo write-backs to Chado yet; users save annotations in Apollo as Chado XML and upload them to the project's servers via user management/upload CGI scripts that the team wrote
    • Curators curate incoming annotations, resolve conflicts, and submit to NCBI
    • Annotations will be synchronized periodically between incoming Chado and main Chado
  • Glean is used to generate a non-redundant set of gene calls from a handful of different automated gene calls (e.g., Fgenesh, GeneMark). This simplifies community annotation by giving annotators an easy starting point (often they can just correct minor errors in a Glean gene model and promote it to a manual annotation).

GNPAnnot

ChadoControler, Pierre Larmande

GNPAnnot is a green genomics project that aims to develop a system for structural and functional annotation, supported by comparative genomics, for plant and bio-aggressor genomes. The system allows both automatic prediction and manual curation of genomic objects. Four community annotation systems are released on three sites: monocots (CIRAD / Bioversity at Montpellier), insects (INRA at Rennes), fungi (BIOGER at Versailles) and wheat / grapevine (URGI at Versailles).

Pierre gave an overview of how GNPAnnot Monocots uses GMOD. Like GeneDB, they use both Artemis and Apollo to do manual annotation, and the output of both is loaded into Chado.

They are evaluating the GMOD synteny viewers.

However, off-the-shelf Chado is missing several features that they need:

  • Access privileges
  • Revision History (would the Chado Audit Module work?)
  • Coordination of concurrent access
  • Client compatibility checks
  • Network security

Finally, they are developing an MVC architecture for their website.

Registration

Thanks to the generous support of Doreen Ware and USDA-ARS, registration for this meeting was free.

Meeting Participants

Name Affiliation(s)
Dave Arcoleo BeeSpace
Saravanaraj Ayyampalayam University of Georgia
Hugo Berube National Research Council Canada
Robert Buels SGN
Ramesh Buyyarapu Alabama A&M University
Scott Cain GMOD, Ontario Institute for Cancer Research (OICR)
Jing Chen UCSD
Chris Childers Georgetown University
Kevin Clancy Life Technologies
Dave Clements NESCent / GMOD
Christopher Condit University of California San Diego
Heather Estrella Pfizer
Kathleen Falls FlyBase
Stephen Ficklin Clemson University Genomics Institute / MarineGenomics.org, Fagaceae.org
Brian Fox UCSD
Iddo Friedberg University of California San Diego
Carol Germain Pfizer
Josh Goodman FlyBase
Dong He CalTech, SpBase
Christopher Hemmerich Indiana University - Center for Genomics and Bioinformatics
Ian Holmes UC Berkeley
Robin Houston Pathogen Informatics, Sanger Institute
Jim Hu Texas A&M University/EcoliWiki and GONUTS
Ying Huang University of California, San Diego
Arek Kasprzyk OICR/BioMart
Andrei Kouranov Protein Data Bank
Daniel Lang University of Freiburg, cosmoss.org
Pierre Larmande Joint Research Unit Plant Development and Genetic Improvement
Ping Ling Pfizer
Dorrie Main Genome Database for Rosaceae
Weidong Mao Virginia State University
Sheldon McKay Cold Spring Harbor Laboratory iPlant/GMOD
Rex Nelson SoyBase
Brian O'Connor UCLA
Joshua Orvis Institute for Genome Sciences
Georgios Pappas, Jr EMBRAPA/Brazil
Isabelle Phan SBRI
Ram Podicheti Center for Genomics and Bioinformatics
Gowthaman Ramasamy Seattle Biomedical Research Institute (SBRI), Seattle, WA
Robert Reed U.C. Irvine
Justin Reese Georgetown University, BeeBase and Bovine Genome Database
Daniel Renfro EcoliWiki
Peter Rose UCSD - Protein Data Bank
Barry Sanders BeeSpace
Andy Schroeder FlyBase
Dhileep Sivam University of Washington & Seattle Biomedical Research Institute (SBRI)
Mitch Skinner UC Berkeley, JBrowse project
Russell Smithies AgResearch
Weijia Su Tyler Applied Systems, Inc.
Shulei Sun UCSD
Randall Svancara Genome Database for Rosaceae
Adrian Tivey Pathogen Informatics, Sanger Institute
Nicole Washington LBNL, modENCODE, GBrowse, Phenote
John Westbrook PSIKB / PDB
Geoff Winsor Pseudomonas Genome Database (Simon Fraser University)
Zi Yang Pfizer
Andreas Zimmer University of Freiburg/cosmoss.org

Feedback

"It's tempting to see bioinformatics as a collection of potential problems. Being at a GMOD meeting helps us see bioinformatics as a collection of potential solutions."
Isabelle Phan, SBRI, and January 2009 GMOD Meeting Participant

Attendees were asked to fill out a one page evaluation of the meeting.

Q: Please rate the meeting(s) using the following scale: 1 (not at all) to 3 (reasonably) to 5 (exceptionally).

1 2 3 4 5
How useful was the meeting? 0% 0% 0% 25% 75%
Was the meeting well run and organized? 0% 0% 0% 25% 75%

Q: Was the meeting what you expected?

Yes Much better/more than expected.
72% 18%

Q: Would you recommend GMOD meetings to others?

Yes No
100% 0%

Q: Do you have suggestions for improving GMOD meetings in the future?

  • I think it is important to keep it at the same time as the PAG (or the ISMB) as it might be difficult for some to attend if it's not related to some other big conference.
  • Overall the meeting was pretty good and more useful for me than I expected. I do not think people would mind paying a small registration fee in order to help around with the food cost (and maybe have some croissant and such in the morning).
  • I think speakers shouldn't be disturbed during their speech, it is better to question them after their presentation is done.
  • Keep running them like this. Having the catered lunch on site was a huge plus for networking with other attendees, which is why we attend. Planned social events for the express purpose of mixing/networking would be a plus.
  • I'd like to see more pure bench biologists there. It's unclear to me how we could accomplish this, but I think GMOD's success or failure will ultimately depend on the degree to which we are able to reach the rank and file (non-informatics) scientist.
  • I think at this stage GMOD meetings should focus on a wide variety of subject matters and deal with people with widely different levels of experience.
  • Include a tutorial for first time users.
  • More presentations on GBrowse (since it's the most popular?)
  • Better networking would have been nice.

Additional feedback, suggestions, criticism, and praise.

  • It was a goldmine.
  • Thanks for helping to organize the meeting, we really got a lot out of it.
  • The organizers really know how to keep things casual, and approachable.
  • We were glad to see such large attendance. The seating arrangement worked out well for the presentation format. Proximity to PAG was a benefit. Looking forward to next year!
  • I liked the fact that it was a mix between presentation and open discussion. This way presentation usually lead to interesting discussions related to it.
  • Thanks again for a great meeting!
  • I am not a GMOD user (yet) and I came to this meeting to learn about GMOD. Besides clear and good presentations exploring many facets of GMOD, the GMOD users and developers were very accommodating in explaining the more basic points of GMOD.
  • I was really looking forward to it, and I ended up enjoying it immensely.
  • I think one of the more important tasks that GMOD has to face is finding user interface tools that will allow biologists to comfortably interact with data stored in a Chado database.
  • The meeting is more helpful for those who have used GMOD before than for first time users.
  • I like the addition of the lunch. It made things smoother.

Next Meeting: August 2009 at Oxford

GMOD Week Europe

The next GMOD meeting was held 6-7 August 2009 in Oxford, UK, as part of GMOD Europe 2009, immediately following the 2009 GMOD Summer School - Europe.