November 2007 GMOD Meeting
A list of suggested topics, raised in advance by GMOD community members.
- community annotation - FlyBase seconds this topic
- Chado standard on ortholog/paralog/synteny storage.
- The state of GFF tools in BioPerl. Some of the auditing and examples are on a Bioperl wiki page.
- GMOD releases and packaging
- How hard would it be to bundle specific releases of popular GMOD components into a named/numbered release that has gone through some level of compatibility testing?
- How much pain does a lack of such a release currently cause users?
- How much might the community annotation server help with this?
There was a $25 registration fee to cover meals and other costs associated with the meeting. Please contact Scott Cain firstname.lastname@example.org if you need a receipt for your payment.
- James Abbott, Imperial College, London
- Sam Angiuoli, University of Maryland Medical School
- Tim Burgis, Imperial College, London
- Scott Cain, GMOD Coordinator
- Mike Caudy, CSHL
- Dave Clements, GMOD Help Desk, NESCent
- Norie de la Cruz, WormBase
- Quenfen Dong, Indiana University
- Dave Emmert, FlyBase
- Ben Faga, CSHL
- Kathleen Falls, FlyBase
- Steve Fischer, ApiDB
- Don Gilbert
- Josh Goodman, FlyBase - Indiana University
- Jay Hannah, University of Nebraska
- Todd Harris, WormBase - Cold Spring Harbor Laboratory
- Sven Heinicke, Princeton
- Kevin Galens, JCVI
- Gregg Helt, DAS/2
- Chris Hemmerich, FlyBase
- Hideya Kiwaji, Riken
- Ed Lee, Lawrence Berkeley Labs
- Suzi Lewis, National Center for Biomedical Ontology
- Sheldon McKay, WormBase/modENCODE - Cold Spring Harbor Laboratory
- Lukas Mueller, Sol Genomics Network
- Joshua Orvis, University of Maryland Medical Center
- Suzanne Paley, EcoCyc
- Chinmay Patel, GeneDB, Sanger Institute
- David Riley, University of Maryland Medical Center
- Andy Schroeder, FlyBase
- Taner Sen, MaizeGDB
- Linda Sperling, ParameciumDB - CNRS
- Jason Stajich
- Lincoln Stein, CSHL
- Victor Strelets, FlyBase
- Haiming Wang, ApiDB.org
- Robert Wilson, FlyBase
- Haiyan Zhang, FlyBase
- Pinglei Zhou, FlyBase
We spent some time on our first day discussing which topics attendees would like to cover. This list of topics helped shape the meeting agenda.
- Community Annotation
- DAS, Apollo, genome-Wiki
- Comparative Genomics
- Synteny viewers
- Chado data storage
- See Chado Comparative Schema.
- BioPerl and GFF(2/3)
- GFF Questions
- Postgres Tuning / Materialized views
- Performance Strategies
- Apollo-Chado Connection
- Performance - See PostgreSQL Performance Tips.
- Too many JDBC Adaptors
- ID Generation
- Moving away from Postgres
- Missing Chado pieces (phylogenetics)
- What Should GMOD Focus On (What's Missing)
- What should GMOD Help Desk do?
- UIs: Picture Intensive
- What should be the outcome of this meeting?
1:00 Shuttle from Grace Auditorium to Woodbury
5:30 Shuttle from Woodbury to Grace Auditorium
8:50 Shuttle from Grace Auditorium to Woodbury
9:15 ? Scott
10:15 Community Annotation
- Linda Sperling - ParameciumDB
- Lukas Mueller - SGN
- Michael Caudy - FlyBase Drupal
1:00 Standards and applications for storing comparative genome data
- Steve Fischer - GBrowse: SynView and the Generic database adaptor
- Victor Strelets - FlyBase Orthoview (GBrowse)
- Sheldon McKay - gbrowse_syn
5:30 Shuttle from Woodbury to Grace Auditorium
8:50 Shuttle from Grace Auditorium to Woodbury
- GFF3 tools
- Sequence Ontology
12:00 Shuttle from Woodbury to Grace Auditorium
- GMOD Indiana update slides, Don Gilbert
- WormBase update, Todd Harris; Slides: Keynote, Powerpoint, PDF, Mov
- ApiDB GBrowse update slides, Haiming Wang
- CMap/CMAE Progress Report, Ben Faga
- Gbrowse_syn Sheldon McKay
- Community Annotation Linda Sperling
- Community Annotation Chinmay Patel
- Modeling and Displaying Synteny w/ SynView Steve Fischer
- Recent Developments in Pathway Tools, Suzanne Paley
The minutes here are based on Dave Clements' notes from the meeting. They are far from complete and you are encouraged to expand and correct them.
The minutes are not chronological. Rather they are broken up into 3 sections:
We had several discussions about the big picture.
Don Gilbert pointed out that cheap short-read sequencers are now available. Lots of people have inexpensive sequences, but there is still no way to do cheap annotation.
Current GMOD clients are species or family centered. Want to make it easy to integrate multiple species. ApiDB is at the point of opening new species databases and web sites with relatively little effort.
Comparative genomics came up over and over again, both across species and within species.
As data grows and is consolidated, issues of who owns the data and who's responsible for the annotation become more problematic.
How does GMOD want to deal with integration issues?
How close to the sequencer does GMOD want to get? We don't want to pull the data off the sequencer.
Should we position GMOD as something that can feed data into places like Ensembl? Ensembl does not have the curation expertise of the MODs. Even if NCBI is wonderful at consolidation, they won't have quality curation. GMOD sits right there, supporting curation. So, we doubt that Ensembl or NCBI will swallow us whole.
Releases and Bundles
We need to figure out what components we want and what we are pushing. If we focus on a core set of packages then life gets easier for the project.
There was discussion of better release management for components, and of the VMWare Community Annotation Server package. Are GMOD bundles the way of the future? The belief was that binary packages are generally not going to work for GMOD unless someone is willing to put a lot of time into maintaining them.
Comparative genomics came up over and over again, both across species and within species. The GBrowse_syn talk in particular spawned a discussion on this.
First, can Chado represent relationships that have more than two members? Yes: the featureloc table has a rank column. Do we want collections in Chado?
Jason suggested a working group on how to do this. Dave from UMD volunteered to manage a wiki page on this, with the end goal of establishing a document that defines how to store comparative genomes.
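As a rough illustration of the rank idea raised in this discussion, the sketch below uses an in-memory SQLite table loosely modeled on Chado's featureloc. The column set is heavily simplified and the feature and sequence names are invented; this is not the real Chado schema, only one way to picture a single alignment feature pointing at three located sequences by giving each location its own rank.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Simplified stand-in for Chado's featureloc table. Assumption: the real
# table has more columns and uses integer foreign keys rather than names.
conn.execute(
    """
    CREATE TABLE featureloc (
        feature    TEXT NOT NULL,     -- the alignment/match feature
        srcfeature TEXT NOT NULL,     -- the sequence it is located on
        fmin       INTEGER NOT NULL,
        fmax       INTEGER NOT NULL,
        rank       INTEGER NOT NULL,  -- distinguishes the members
        UNIQUE (feature, rank)
    )
    """
)

# One alignment with three members (all names and coordinates are invented).
rows = [
    ("align_1", "speciesA_chr1", 100, 500, 0),
    ("align_1", "speciesB_chr3", 2200, 2610, 1),
    ("align_1", "speciesC_scaffold7", 40, 455, 2),
]
conn.executemany("INSERT INTO featureloc VALUES (?, ?, ?, ?, ?)", rows)

# Recover every member of the relationship, in rank order.
members = conn.execute(
    "SELECT srcfeature, fmin, fmax FROM featureloc "
    "WHERE feature = ? ORDER BY rank",
    ("align_1",),
).fetchall()

for src, fmin, fmax in members:
    print(f"{src}:{fmin}-{fmax}")
```

Whether a convention like this is enough, or whether explicit collections are needed, is the kind of question the proposed working group would have to settle.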
Talks on synteny are spread throughout this document.
GMOD Components / Functions
- A GFF3 adapter
- Speeding up Apollo when it uses Chado as a backend (or, just speeding up Chado).
- Communicating with more than one Chado instance.
- Undo/Redo support.
ID Generation and JDBC Drivers
Apollo can talk directly to a database or it can use XML files instead. FlyBase, VectorBase, BeeBase, and BovineBase are all believed to take the XML approach.
Apollo currently has two choices for database adaptors:
- One that uses Postgres database triggers to set IDs.
- One that does not.
The trigger version is used in the Community Annotation Server and on the Dolan-Rice project. We could not think of anywhere else it was used. The triggerless version is used everywhere else that we knew of.
The trigger version is Postgres specific. The triggerless version stores multiple copies of shared exons.
Notes from Tuesday: Decided to actively discourage use of the trigger version. Best thing may be to go through trigger code and externalize the logic.
Notes from Wednesday: Apollo - Chado - No short term decision. Long term probably move to Crabtree.
As you may have noticed, those notes disagree.
There was a discussion of BioPerl and how it relates to GMOD.
Jason Stajich created a slimmed-down feature Perl package based on arrays instead of hashes: Bio::SeqFeature::Slim. This is 70% faster for reading a GFF file. Bio::FeatureIO only supports GFF3. It is slow, uses heavy objects, and is strongly typed. Jason wants to spend more time on middleware speed. He also wants a converter into a common object model, and code to get data back out to any supported format.
6 to 8 people are currently contributing to BioPerl.
GFF3 has an ID field. ID is not clearly defined in earlier versions. GFF2 supports arbitrary feature types. GFF3 requires SO types (but you can always ignore that). Keep detailed alignment data in a separate database, not in GFF3, and indicate in the GFF3 that the data is stored elsewhere. CIGAR strings could also be stored in GFF3; the spec supports that.
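To make the GFF3 points concrete, here is a minimal sketch of reading the ID attribute and a CIGAR-style Gap attribute from a single GFF3 line. The sample line is invented for illustration, and the `Dbxref=AlignDB:aln42` value is a hypothetical pointer to an external alignment store, not a real database. The helper names `parse_gff3_line` and `parse_gap` are ours and are not part of BioPerl or any GMOD tool.

```python
from urllib.parse import unquote


def parse_gff3_line(line):
    """Split one GFF3 feature line into a few columns and an attribute dict."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    attributes = {}
    for pair in cols[8].split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        # Multiple values are comma separated; values may be URL-escaped.
        attributes[key] = [unquote(v) for v in value.split(",")]
    return {
        "seqid": cols[0],
        "type": cols[2],
        "start": int(cols[3]),
        "end": int(cols[4]),
        "attributes": attributes,
    }


def parse_gap(gap_value):
    """Turn a Gap value such as 'M8 D3 M6' into (operation, length) pairs."""
    return [(tok[0], int(tok[1:])) for tok in gap_value.split()]


# Invented example line: a match with an ID, a Gap string, and a
# hypothetical Dbxref indicating that full alignment data lives elsewhere.
line = (
    "chr3\t.\tMatch\t1\t23\t.\t.\t.\t"
    "ID=Match1;Target=EST23 1 21;Gap=M8 D3 M6 I1 M6;Dbxref=AlignDB:aln42"
)

feature = parse_gff3_line(line)
gap_ops = parse_gap(feature["attributes"]["Gap"][0])
print(feature["attributes"]["ID"][0], gap_ops)
```

A real parser would also handle comment lines, directives, and embedded FASTA, which this sketch ignores.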
There was a request to make Chado more database neutral, rather than Postgres-specific.
The slowness of Chado databases came up in several contexts. David from UMD Medical Center started a Postgres performance page on the wiki.
Scott described a potential way to implement materialized views in Chado that gets us most of the benefits of DBMS-supported materialized views. Store
- the SQL to create it in a table,
- a run time schedule for when the table should be rebuilt,
- an enabled/disabled flag that is disabled by default.
The question was raised whether genome metadata fits into the current Chado. The belief was that it does not.
Jason Stajich wants a better idea of who is responsible for what in terms of Chado modules. Dave C will take this on.
The table level and column level documentation for Chado is in a good state. Enhanced basic, big picture documentation was requested. Josh Goodman is thinking of providing a mapping from Chado DB columns to FlyBase report columns. Mike Caudy pointed out we should have multiple examples of implementation, not just FlyBase.
We discussed if a Chado database validator would be worthwhile. A validator would check a Chado database to see if it conforms to the canonical model for a Chado database. There was no consensus on the value or practicality of this. There was consensus that no one was willing to volunteer to write it.
Ben suggested that if and when we do this, we use the GFF3 to Chado validator as a starting point.
There was a request to make Chado more database neutral, rather than Postgres-specific. Someone also asked if there was an SQLite adapter for GBrowse.
Slow performance of Chado Postgres implementations came up repeatedly.
- Specify the locale. ASCII-US runs fast; UTF-8, which is the default, is slow. The locale is specified for the server, at server start.
- A lot of time has been spent on making the queries go fast.
- RTree indexes are in the core.
- Allen's FRange functions are in the DB, but aren't used by default queries.
New CMap release (1.0) is on its way. Will have an assembly editor. Includes a dot plot, new glyphs, and an install script based on the GBrowse install script.
Ben will ask users to do beta testing, and hopes to start that before the end of 2007. Ben is looking for a project that is doing large-scale assembly, to test CMap for assembly correction.
This was a popular motif in the meeting.
Community Annotation at ParameciumDB
Linda Sperling discussed ParameciumDB. Paramecium is a small community with few resources and no dedicated curators.
Paramecium curators are a small set of people who must do their annotation from fixed IP addresses. Curator annotations are kept in addition to existing Genoscope predictions. These annotations are not validated when they are submitted. Annotators cannot change annotations made by other people. There are two databases: one backing the website, and one where annotation goes. Once a month the new annotation is pushed to the web site. Validation happens prior to release.
They are also using ParameciumDB to teach annotation at two colleges, and some annotation comes from that. The bulk of annotations come from 2 curators, with the other curators all making a small number of annotations.
ParameciumDB uses the Java WebStart version of Apollo. Annotators click on a link and Apollo starts up. Apollo talks directly to Chado, using the triggerless database adapter.
Community Annotation at JGI
Don Gilbert briefly described community annotation at JGI. They have a web interface for simple annotations and use Apollo for complex annotations. Anyone can promote any gene model, but they can't delete other models. Use the Wikipedia model: Whoever annotates last is correct.
Community Annotation at SGN
Lukas Mueller discussed SGN.
SGN has data for tomato, potato, eggplant, and many other species. SGN is locus centric. Each locus has (or can have) a single person who is the editor/owner of that locus. The locus editor can change anything about that locus that they want. The name of the locus editor is displayed on the locus page. Every locus has a "request editor privileges" link, whether or not that locus has already been assigned.
All edits are logged, and nothing is ever truly deleted. 'Deleted' items are retained but flagged as obsolete and are no longer shown.
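The "log everything, delete nothing" policy can be pictured with a small sketch. The class and field names below are invented for illustration and do not reflect SGN's actual schema or code; the point is only that a delete flips an obsolete flag and writes a log entry rather than removing the record.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Annotation:
    text: str
    obsolete: bool = False  # "deleted" items are flagged, never removed


@dataclass
class Locus:
    name: str
    annotations: List[Annotation] = field(default_factory=list)
    log: List[str] = field(default_factory=list)

    def add(self, editor, text):
        self.annotations.append(Annotation(text))
        self.log.append(f"{editor} added: {text}")

    def delete(self, editor, text):
        # Flag matching annotations as obsolete and record who did it.
        for ann in self.annotations:
            if ann.text == text and not ann.obsolete:
                ann.obsolete = True
                self.log.append(f"{editor} obsoleted: {text}")

    def visible(self):
        # Only non-obsolete annotations are shown to users.
        return [a.text for a in self.annotations if not a.obsolete]


locus = Locus("example_locus")
locus.add("editor1", "fruit size QTL")
locus.add("editor1", "duplicate entry")
locus.delete("editor1", "duplicate entry")
print(locus.visible(), len(locus.annotations), len(locus.log))
```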
SGN supports tagging of loci. Tags are free text that are rationalized after they are created. The tagging metaphor for curation also came up in several contexts during the Genome Informatics meeting.
Community Annotation Server (CAS)
Scott Cain spoke about this. It is almost ready to go. The Community Annotation Server (CAS) is meant to be "GMOD in a box". Currently it consists of:
- A VMWare image, containing
- Ubuntu Linux, version 6.10 LTS.
- Picked Ubuntu LTS over CentOS because LTS stands for long term support, so it will be supported for a while.
- A Chado database with DictyBase data in it.
- An empty Chado database
- Apollo - Uses the JDBC adaptor with triggers. This is a Java WebStart version.
- MediaWiki - includes Cite, ProcessCite and TableEdit extensions.
- Cite extensions make it easy to provide literature annotations. Provide a PubMed ID and the extension finds and grabs an extract from PubMed.
This can run on any Intel machine, including Apple. Very little performance hit is caused by virtualization.
An online trial version of the Community Annotation Server was requested and was already on the way.
Distributed Annotation System/2 (DAS/2)
Gregg Helt attended with the goal of bringing the Distributed Annotation System, version 2 (DAS/2) into the GMOD family.
Preserving DAS/1 Strengths in DAS/2
- Keep focus on location-based annotation of biological sequences.
- Protocol, not an implementation.
- HTTP for transport,
- URLs for queries
- XML for responses
- REST-like style.
- No required central authority.
- Couple XML response to URL request formats.
- XML has been shortened, but big gain comes from client-server content format negotiation, including binary. Empty elements dropped.
- Uses HTTP caching in the client.
- IGB (Integrated Genome Browser) - reference client for DAS/2.
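The protocol style listed above (HTTP transport, URLs for queries, XML for responses) can be sketched as follows. The base URL, the query parameter names, and the XML element and attribute names are all invented for this illustration; they are not taken from the DAS/2 specification, and no network request is made.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET


def build_query_url(base_url, segment, start, end, feature_type):
    """Encode a location-based feature query as a URL (hypothetical names)."""
    params = {
        "segment": segment,
        "overlaps": f"{start}:{end}",
        "type": feature_type,
    }
    return f"{base_url}/features?{urlencode(params)}"


# A made-up XML response of the kind a server might return for the query.
SAMPLE_RESPONSE = """<FEATURES>
  <FEATURE id="gene0001" type="gene" segment="chrI" start="1200" end="3400"/>
  <FEATURE id="gene0002" type="gene" segment="chrI" start="5000" end="7100"/>
</FEATURES>"""


def parse_features(xml_text):
    """Extract (id, start, end) tuples from the XML response."""
    root = ET.fromstring(xml_text)
    return [
        (f.get("id"), int(f.get("start")), int(f.get("end")))
        for f in root.iter("FEATURE")
    ]


url = build_query_url("http://example.org/das2/genome", "chrI", 1000, 8000, "gene")
features = parse_features(SAMPLE_RESPONSE)
print(url)
print(features)
```

Because the query is just a URL, ordinary HTTP caching can apply to repeated requests, which is one reason the REST-like style is attractive for this kind of protocol.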
Allen Day built a DAS2 server on top of Chado. That is in CVS.
There is a validation suite for server responses to different queries.
Spec has not changed in over a year.
Scott would like anyone who installs Chado to also get BioMart and DAS2; that is, they would get that access by default. Gregg would like to see GBrowse get a DAS/2 adapter.
- Is in pre-release state.
- drag tracks vertically
- quantitative data
- multiple alignment and conservation tracks.
- Release by end of year
- Rubberbanding (zoom by selecting a rectangle with mouse)
- Release in early 2008
- Major performance and scalability enhancements.
- e.g., each track can be drawn by different server or CPU.
- 3.0 (subsequently renamed to JBrowse)
- Released sometime in 2008
- Google maps type interface.
- e.g., zooming and panning via mouse.
Version 3.0 (now called JBrowse) is a fork of the code, and versions 2 and 3 are expected to co-exist 'forever'. Some shops won't have the horsepower to run version 3, and Lincoln wants to keep version 2 as an easy-to-install tool.
Jason S argues that GBrowse slows down when it does BioPerl object creation. These are relatively heavyweight objects. He has just written a Slim version that is up to 70% faster.
Browser speed was also the number one issue (with all browsers) at the Genome Browsers Birds-of-a-Feather meeting at Genome Informatics.
Presentation: GMOD Indiana update slides, Don Gilbert
Don Gilbert spoke about Genome grid.
Genome Grid is middleware to enable easy use of TeraGrid for genome analysis tasks. Don is looking for genomes that need compute-intensive analysis. He is also interested in applying BioMart and Ergatis to these problems.
Dave Clements introduced himself and the goals of the GMOD Help Desk position.
Presentation: Recent Developments in Pathway Tools
Suzanne Paley talked about recent developments in Pathway Tools, including:
- Advanced Query Form
- Richer representation of regulation
- Pathlogic over-infers pathways. Pathways now have to be tagged to be shown.
- Dataset diffs and incremental updates.
His talk raised a number of issues that have come up with recent extensions to SynView.
This is a MediaWiki extension by Jim Hu. It does two things. First, it makes it easier to update tables in MediaWiki, by presenting a nicer interface for altering wiki tables. Secondly, it supports synchronizing MediaWiki tables from database tables and vice versa.
Turnkey, GMODweb, DrupalFly
These are all web interface layers that sit on top of Chado databases.
Michael Caudy argued that even if GMODWeb did work right now, it is not extensible enough to support complex queries and presentation. Mike presented Drupal, Drupal Views, and PHPTemplate as an alternative web framework for providing a web interface to Chado databases. Mike demonstrated a prototype called DrupalFly that presents FlyBase data in an alternative organization.
GMOD Participating Organizations
A number of organizations talked about their recent work.
Presentation: ApiDB GBrowse update, Haiming Wang
Steve Fischer talked about ApiDB. ApiDB uses GUS as their schema. They do multispecies comparative analysis. They have a database adapter link from GBrowse to GUS. It is based on the Chado adapter. They use materialized views in Oracle 10G and it is still relatively slow.
Synteny at ApiDB
See SynView above for details on SynView.
Syntenic maps at ApiDB are produced with Mercator. The maps are based on gene orthology. Gene orthologs are generated using OrthoMCL. All alignments are pairwise, rather than multiple. Orthology is represented outside standard GUS schema. In the synteny schema, everything is defined relative to the reference sequence. Also need a table to define anchors.
Steve Fischer showed an 11 track page, which has about 5000 popups in it.
ApiDB has a release cycle. They discard and recalculate synteny with every new release.
Berkeley National Labs
Synteny at FlyBase
Victor also presented the genetic interactions viewer, a fast way of visualizing gene interactions. It does not run directly off of the Chado database.
Presentation: Community Annotation, Chinmay Patel
Chinmay Patel spoke about a week-long annotation project at Sanger involving 40 people all annotating the same genome.
They used the Artemis annotation editor (instead of Apollo), but Artemis was talking to a Chado database using an Artemis-Chado Ibatis-based (instead of Hibernate-based) adapter. The adapter is not yet released. (But it is now: see Artemis-Chado Integration Tutorial.)
Imperial College London
Using GMOD to support a fungal sequencing project. Using:
JCVI (nee TIGR)
Using Chado as database schema.
Taner Sen from MaizeGDB was at the meeting. Maize has multiple groups generating different gene models. It would be nice to display each group's models in a separate track. MaizeGDB is evaluating genome browsers and is considering using GBrowse.
Use GMOD for almost everything:
Paramecium is an odd critter (unicellular eukaryote, ciliate clade):
- 72 Mbp
- 40K gene models
- 12,500 computationally identified potential errors.
- At least 3 whole genome duplication events.
- Nuclear dimorphism. Germline nucleus (not yet sequenced) and somatic nucleus (sequenced) which is a rearranged version of the germline nucleus, streamlined for gene expression.
There are fewer than 20 Paramecium molecular biology labs in the world. The database is supported with 1.5 staff.
It is important that people be able to click on a link, launch Apollo, add some curation and save it. Their Apollo talks directly to Chado (no triggers). See Community Annotation above for more.
Riken uses GBrowse.
University of Maryland Medical Center
WormBase / CSHL
WormBase is migrating to Chado slowly. There is currently very little Chado there.
Sheldon McKay talked about GBrowse_syn, a prototype extension to GBrowse for viewing synteny. Goal is to have a sequence alignment viewer that can look at more than two species at a time. GBrowse_syn is based purely on sequence alignments. It does not know about genes or orthologs per se.
Used PECAN for the alignments. Maps are precomputed in a very CPU-intensive step.
Chado may or may not support multiple alignments.