September 2010 GMOD Meeting
13-14 September 2010
This GMOD community meeting was held 13-14 September 2010, in Cambridge, UK, as part of GMOD Europe 2010, which also included Satellite Meetings, an InterMine Workshop, and a BioMart Workshop. The meeting was sponsored and hosted by the Cambridge Computational Biology Institute at the University of Cambridge.
GMOD Meetings are a mix of user and developer presentations, and are a great place to find out what is happening in the project, what's coming up, and what others are doing. The January 2010 GMOD Meeting was the previous event. The next meeting is likely to be held in spring 2011.
- 1 Agenda
- 2 Presentations
- 2.1 By Topic
- 2.2 The State of GMOD
- 2.3 Help Desk Update
- 2.4 The Open Microscopy Environment: Open Informatics for Biological Imaging
- 2.5 PSICQUIC: The PSI Common QUery Interface
- 2.6 MolGenIS and XGAP
- 2.7 The GMOD Chado Natural Diversity Module
- 2.8 Cosmic GBrowse: Visualising cancer mutations in genomic context
- 2.9 GMOD Projects at the Center for Genomics and Bioinformatics
- 2.10 GMOD RPC API: The almost RESTful GMOD API
- 2.11 Overview of current resources and update on DAS Meeting Cambridge 2010
- 2.12 InterMine: new Mines and new features
- 2.13 Literature Curation in GMOD
- 2.14 Towards a GO Annotation Tool: Curation Accelerator Software
- 2.15 BioPivot
- 2.16 CRAWL (Chado RESTful Access Web-service Layer)
- 2.17 Lessons from the Apache Community
- 2.18 Lightning Talks
- 3 Participants
- 4 Logistics
- 5 Sponsor: Cambridge Computational Biology Institute
- 6 Feedback
- 7 The Next Meeting
The Open Microscopy Environment: Open Informatics for Biological Imaging
- Professor, Wellcome Trust Centre for Gene Regulation and Expression, University of Dundee
- Principal Investigator, Open Microscopy Environment (OME)
The meeting's guest speaker was Prof Jason Swedlow, who discussed his work with the Open Microscopy Environment (OME), an open international consortium that develops and releases data specifications and management tools for biological imaging. OME metadata enables image sharing, analysis, and integration with other data types.
Dr Swedlow is a Professor at the Wellcome Trust Centre for Gene Regulation and Expression at the University of Dundee. His research focuses on mechanisms and regulation of chromosome segregation during mitotic cell division.
Monday, 13 September
|10:00||The State of GMOD||Scott Cain||PDF, Summary|
|11:00||Help Desk Update||Dave Clements||PDF, PPT, Summary|
|11:30||Keynote: The Open Microscopy Environment: Open Informatics for Biological Imaging||Jason Swedlow||PDF, PPT, Summary|
|13:45||PSICQUIC: The PSI Common QUery Interface||Bruno Aranda||PDF, Summary|
|14:15||MolGenIS and XGAP||Morris Swertz||PDF, Summary|
|14:45||The GMOD Chado Natural Diversity Module||Bob MacCallum||PDF, PPT, gdoc, Summary|
|15:45||Cosmic GBrowse: Visualising cancer mutations in genomic context||David Beare||PDF, PPT, Summary|
|16:15||GMOD Projects at the Center for Genomics and Bioinformatics||Chris Hemmerich||PDF, PPT, Summary|
Tuesday, 14 September
|09:15||GMOD RPC API: The almost RESTful GMOD API||Josh Goodman||PDF, Summary|
|09:45||Overview of current resources and update on DAS Meeting Cambridge 2010||Jonathan Warren||PDF, PPT, Summary|
|10:15||InterMine: new Mines and new features||Richard Smith||PDF, Summary|
|11:00||Literature Curation in GMOD||Daniel Renfro||PDF, PPT, Summary|
|11:30||Towards a GO Annotation Tool: Curation Accelerator Software||Helen Field||PDF, KEY, Summary|
|12:00||BioPivot: Applying Microsoft Live Labs Pivot to Problems in Bioinformatics||Steve Taylor||PDF, PPT, Summary|
|13:45||CRAWL (Chado RESTful Access Web-service Layer)||Giles Velarde||PDF, Summary|
|14:15||Lessons the GMOD community can glean from the Apache Software Foundation||Summary|
Wednesday & Thursday, 15-16 September
GMOD Europe 2010 continued after the GMOD meeting, starting with the Satellite Meetings (topics were Post Reference Genome Tools and Community Annotation) and the InterMine Workshop, and finishing with the BioMart Workshop. See GMOD Europe 2010 for a complete schedule.
Presentations spanned two days and covered a wide variety of topics.
The talks can be roughly categorized:
The State of GMOD
GMOD is:
- A set of interoperable open-source software components for visualizing, annotating, and managing biological data.
- An active community of developers and users asking diverse questions, and facing common challenges, with their biological data.
These two things are equally important.
GMOD is used by
- hundreds of organizations
- large and small
- corporate and academic
- all over the world
- across the tree of life
- 1.70, 2.14
- Rubberband region selection
- Drag and drop track ordering
- Collapsible tracks
- Popup balloons
- Allele/genotype frequency
- Geolocation popups
- Circular genome support (1.71)
- Asynchronous updates (2.0)
- User authentication
- Multiple server support (2.0)
- SQLite, SAMtools (NGS) adaptors
- GMOD's 2nd Generation Genome Browser
- It's fast
- Completely new genome browser implementation:
- GBrowse based comparative genomics viewer
- Shows a reference sequence compared to 2+ others
- Can also show any GBrowse-based annotations
- Syntenic blocks do not have to be colinear
- Can also show duplications
- Chado is the GMOD schema; it is modular and extensible, allowing the addition of new data types “easily.” Covered data types include ontologies, organisms, sequence features, genotypes, phenotypes, libraries, stocks, and microarrays, with natural diversity recently rolled into the schema (but not yet released).
- 1.0 Release solidified the Chado that most people were already using from source.
- 1.1 Introduced support for GBrowse to use full text searching and “summary statistics” (i.e., feature density plots). Version 0.30 of Bio::DB::Das::Chado is needed for these functions.
- New (2009) web front end for Chado databases
- Set of Drupal modules
- Modules approximately correspond to Chado modules
- Easy to create new modules
- Includes user authentication, job management, curation support
- A MediaWiki extension (MediaWiki software used at Wikipedia, GMOD.org)
- Provides graphical user interface (GUI) to wiki tables
- Can also provide GUI to database tables
- Work in progress to use this with Chado
- Potential to give wiki access to a Chado database
- See http://ecoliwiki.net
- BioMart is a query-oriented data management system
- Provides a web based query interface
- Strong data federation
- BioMart Workshop on Thursday.
- InterMine is a query-oriented data management system
- Provides a web based query interface
- Very flexible queries and query optimization
- InterMine Workshop on Wednesday
- Genome annotation pipeline for creating gene models
- Output can be loaded into GBrowse, Apollo, Chado, …
- SNAP, RepeatMasker, exonerate, BLAST, Augustus, FGENESH, GeneMark, MPI
- Other capabilities
- Map existing annotation onto new assemblies
- Merge multiple legacy annotation sets into a consensus set
- Update existing annotations with new evidence
- Integrate raw InterProScan results
- MAKER Online in beta
- Java-based GUI application for browsing and annotating genomic sequences
- Can be installed via WebStart (i.e., by clicking on a link)
- Can read/write to Chado, GFF3, GenBank, GAME XML
Next GMOD Meeting?
- Next Spring Sometime:
- ABRF: Association of Biomolecular Resource Facilities
- Feb. 19-22, San Antonio, TX
- Biology of Genomes
- May 10-14, Cold Spring Harbor Lab, NY
Help Desk Update
Mailing List Archives
GMOD mailing lists are all over. Many are hosted at SourceForge, but several are elsewhere (EBI, Bluehost, Berkeley, ...). Some don't have public archives, and those that do are spread around. The lists at SourceForge have searchable archives, but the search interface is frustrating.
Since May/June 2010, all emails to GMOD mailing lists have been archived in a single searchable hierarchy at Nabble. Nabble has a functional search capability and you can now search all lists, or just a single list.
GMOD Membership Requirements
Version 1 Requirements:
- Meets a common need
- Useful over time
- Configurable and Extensible
- Open source license for all users
- Interoperable with existing GMOD Components
- Commitment of support
For next version, want to add:
- Support mailing list that is publicly archived
- Publicly accessible code repository
Discussion favored these additions. The issue of incompatible open source licenses also came up. GMOD currently requires any OSI approved license. However, some of those licenses are not compatible with each other, meaning that components under them can't be bundled together.
Help spread the word about GMOD components and the GMOD project.
- Increased visibility leads to
- → Increased adoption, which leads to
- → more projects contributing back
- Increased adoption & development leads to
- → increased funding
- Cite GMOD, GMOD Components in your papers, presentations, grants
- Powered by GMOD icons
- Speakers at your event; not just Scott and Dave. PIs and developers are also available.
- Graphics & slides for your presentations, posters
- Presentation and event promotion
- Brochures (GMOD project, events)
The GMOD Promotion page launched in July 2010.
Nine projects got new logos in the Spring 2010 Logo Program. Logos were done by John Aikman's Spring 2010 Advanced Design class at Linn-Benton Community College, Albany, Oregon, United States. Each project worked with 2-3 students during the quarter to produce the selected logos.
We might do this again in 2011.
2010 GMOD Community Survey
The 2008 GMOD Community Survey covered components and project wide topics. The 2009 GMOD Community Survey focused on genome and comparative genomics browsing. The 2010 GMOD Community Survey will cover components and project wide topics. We may use it to produce a GMOD Project publication.
These surveys help guide the project and also show potential and current GMOD users what the larger community is doing.
Look for the 2010 survey in October.
The satellites at the January 2010 GMOD Meeting were such a success that we decided to do them again. Satellites are birds of a feather discussions where participants with a common interest discuss that topic. The satellites at this meeting covered Post Reference Genome Tools and Community Annotation.
See the satellite meeting pages for summaries of the discussion.
In 2010 we held our 4th summer school in May at NESCent, in Durham, North Carolina, US. We had 62 applicants for 25 slots.
The 2011 course will likely be at NESCent again. However, starting in 2011, summer school expenses will no longer be covered by a grant (see below). This means that we will start charging tuition, and that we will also start seeking sponsors.
Summer school sessions become online tutorials that include starting and ending VMware images, step by step instructions, and example datasets.
- Other Upcoming Events of Note
- Biocuration 2010
- October, Tokyo, Japan
- Pathway Tools Workshop
- October, Menlo Park, California, US
- GMOD Evo Hackathon
- November, Durham, North Carolina, US
- Computational and Comparative Genomics
- November, Cold Spring Harbor, New York, US
- Plant and Animal Genome
- January, San Diego, California, US
- Workshop on Molecular Evolution
- January, Cesky Krumlov, Czech Republic
- Galaxy Developers Conference
- 2011, Europe
- 1.1 just released
- Scalability: very large data sets, including NGS reads, human EST/SNP tracks
- Extensibility: custom tracks
- Backward incompatible JSON format
- 1.2 Release (December 2010)
- improved NGS display (paired-end reads, possibly read-to-genome alignments)
- reduced memory usage for NGS
- minor UI enhancements including y-axis labels for wiggle tracks
JBrowse Grant Proposal
The proposal was submitted this summer; if approved, work will start around February 2011.
- JBrowse concepts have proven themselves
- Scalable to coming data set sizes
- GBrowse development will wind down during the grant.
- New Features
- JBrowse ecosystem on par with what GBrowse has
- DAS and web services support
- Scalability and NGS
- Large numbers of tracks
- Community annotation (upload/publish, tagging, comment, …)
- Mobile device support?
- GBrowse → JBrowse Migration Support
- Migration Scripts: Config files, data (data is easy)
- Simultaneous GBrowse and JBrowse support
- JBrowse running on top of GBrowse config and data
- Bioinformatics pipeline service software built on Ergatis
- Newest GMOD component
- WebGBrowse - Ram Podicheti et al. at Indiana U.
- SOBA - Ginger Fan et al. U of Utah
- GFF3 file analysis and reporting
- Tabular and graphical reports
- Nominated and approved, code being refactored
- Drupal based toolkit for building organism web sites
- Submitted for publication; not yet nominated
Some Interesting Documents
- File:How to load chado to biomart.pdf
- AO Keliet, J Amselem, S Derozie, and D Steinbach, all @ INRA URGI
- TZ Sen, LC Harper, ML Schaeffer, CM. Andorf, TE Seigfried, DA Campbell, and CJ. Lawrence
- How and why MaizeGDB picked GBrowse
- Appeared in Database: The Journal of Biological Databases and Curation
- Visualizing biological data - now and in the future, SI O'Donoghue, et al.
- Visualizing genomes: techniques and challenges, CB Nielsen, et al.
- Visualization of multiple alignments, phylogenies and gene family evolution, JB Proctor, et al.
- Visualization of image data from cells to organisms, T Walker, et al.
- Visualization of macromolecular structures, SI O'Donoghue, et al.
- Visualization of omics data for systems biology, N Gehlenborg, et al.
GMOD on the Web
- Moving from CSHL to OICR, real soon now
- MediaWiki upgrade
- Probably lots of new extensions
- Maybe a modified skin
- Look into adding
- User log section
- Scrapbook for contributed code
- Membership directory (TableEdit based)
- Semi-automated publication listing/linking
- Should GMOD have a social presence?
GMOD already has mailing lists, wiki, GMOD News (RSS), and IRC. Should GMOD have a presence in social media as well? If so, what should the goals be? Outreach? Community building or forums? Social bookmarking? Which tools should we use: Twitter, Facebook, Connotea, StumbleUpon, Technorati, Nature Network…
This generated some discussions and some conclusions:
- Community bookmarking may be worthwhile.
- If you can automatically tweet page updates and news items, do it.
- Don't manually post stuff to Twitter.
- Don't build community through Facebook. There are better time investments.
The Open Microscopy Environment: Open Informatics for Biological Imaging
Images are pretty pictures and measurements.
The Open Microscopy Environment (OME) was founded in 2000, by people who had a strong need, but weren't necessarily image metadata experts. OME has grown considerably over the years by embracing collaborators.
OME software is released primarily under the GPL and, to a lesser extent, the LGPL. The commercial company Glencoe Software was founded in 2005. Glencoe is a GPL based company that provides customization.
As a project, OME is very public in building and publishing project roadmaps, and in making its bug tracking system publicly accessible. The OME community meets annually in Paris in the spring (which undoubtedly encourages collaboration). They also care a great deal about quality: OME software includes automatic reporting of exceptions.
They use Hudson for continuous integration. All remote communication is done through ICE from ZeroC, because OME needed to talk to everything in use out there, from Ruby to Matlab. ICE is also used for job distribution.
Data Standards and Formats
OME also produces, and its software is based on, an open data model for biological images, called OME-XML. This format can be dropped into the header of a TIFF file.
Bio-Formats.org is another OME effort. Bio-Formats provides tools to convert about 95 image formats into a single common format that standardized tools (including OME tools) can work with. OME is trying to establish Bio-Formats as the default library for reading biological image data.
Metadata matters. Researchers need computational access to image data in the real world. OME uses structured annotations that can be attached to any part of image data.
The genome community knows what its baseline is. It has a basic framework: the location on the genome. With an image, it's not at all clear what the baseline is.
Imaging Culture and Sustainability
Most biologists view image output as a result, not as shared data. Most images that are created are not useful and are destined to sit on DVDs, CDs, floppies, ... until they are finally thrown away when the media can no longer be read, or the researcher moves.
How do we make sure that useful data sticks around? Images take up a lot of storage, relative to many data types. Images in publications are of high quality, but the resolution is relatively low. (And research indicates that at least 25% of published articles have images that have been manipulated in some way, although usually just cleaned up.) OME started as an attempt to solve a local, one-lab-at-a-time problem.
More to come ...
PSICQUIC: The PSI Common QUery Interface
The Proteomics Standards Initiative (PSI) Common Query Interface (PSICQUIC, pronounced like "psychic" most of the time) standardizes access to molecular interaction data. PSICQUIC is a web service specification based on PSI standards. Resources that implement PSICQUIC are listed in a public registry. There are currently more than 14 million binary interactions from at least 12 different resources (IntAct, Reactome, ChEMBL, ...) available using PSICQUIC. This widespread adoption allows client programs that speak PSICQUIC to access all of this data uniformly, no matter where it is located.
PSI talked for many years about standards and formats and how to share data. They spent 2002-2006 thinking about standards. They found that it was very complicated to agree on something, but that it has been easy to implement. Most PSICQUIC implementations came out of three biohackathons.
PSICQUIC Web Services
Several methods are supported:
- getByInteraction - Retrieves interactions by using an interaction AC.
- getByInteractionList - Retrieves interactions by using a list of interaction ACs.
- getByInteractor - Retrieves interactions by using a participant identifier.
- getByInteractorList - Retrieves interactions by using a list of participant identifiers.
- getByQuery - Retrieves interactions by using a Molecular Interaction Query Language (MIQL) query (full text searches)
- getVersion - Returns the version of the web service implementation.
- getSupportedDbAcs - Returns the supported database identifiers.
- getSupportedReturnTypes - Returns the list of available format types for the results.
Only a limited number of interactions can be fetched per request, but large datasets can be retrieved using pagination. Most methods have two additional parameters:
- First result: Index for the first result to retrieve.
- Max results: Number of interactions returned per query.
IMEx Consortium and UniProt identifiers are currently being used. There is not yet one single identifier.
- SOAP and REST
As PSICQUIC is a Web Service, you can access the data:
- Via SOAP
- A WSDL file exists, and it is the same for all the databases.
- IntAct has developed a Java client, but any other language can be used.
- The SoapUI client uses this.
- However, SOAP's future in PSICQUIC is uncertain, and it may go away.
- Via REST
- Retrieving data directly by using a URL
- Easy to access and data can be obtained just using an internet browser.
- Effective for scripting.
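Because REST access is just a URL, a script only needs to assemble the URL correctly. The sketch below is illustrative: the base URL is a placeholder, and the `/query/<MIQL>` path layout and the `firstResult` / `maxResults` parameter names are assumptions that should be checked against the service being used.

```python
from urllib.parse import quote, urlencode


def build_query_url(base_url: str, miql: str,
                    first_result: int = 0, max_results: int = 50) -> str:
    """Build a REST URL for a query.

    Assumption: the service exposes <base_url>/query/<MIQL> and accepts
    firstResult / maxResults as query-string parameters.
    """
    # Percent-encode the whole query so spaces, colons, and quotes are safe.
    path = f"{base_url.rstrip('/')}/query/{quote(miql, safe='')}"
    params = urlencode({"firstResult": first_result, "maxResults": max_results})
    return f"{path}?{params}"


# Placeholder base URL; substitute one listed in the PSICQUIC registry.
url = build_query_url("https://example.org/psicquic/search",
                      "brca2 AND species:human")
```

The resulting string can be pasted into a browser or passed to any HTTP client.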
PSICQUIC has two standard formats: PSI-MI XML and PSI-MI TAB. The XML is more complete, and therefore more verbose. PSI-MI TAB is a tabular format.
- Try these queries at IntAct:
Other formats are in progress:
- BioPAX (IntAct example)
- rdf-xml (IntAct example)
- rdf-n3 (IntAct example)
- rdf-n3-triple (IntAct example)
- rdf-turtle (IntAct example)
As these formats are works in progress, some of these links may fail.
- PSICQUIC Registry
The PSICQUIC registry contains a list of the PSICQUIC services available from different providers. It is a web service itself, and it can be accessed remotely using REST. Information can be found about the services, such as the URLs to use, number of interactions provided, versioning, etc. The registry classifies the different services with tags from a PSI ontology. Querying by tags is a work in progress. Instructions on using the registry are at Google Code.
PSICQUIC also defines the Molecular Interactions Query Language (MIQL). MIQL allows more powerful and flexible queries and is the default query syntax for PSICQUIC. It is designed for fast and effective searches on PSI-MI TAB files. All fields (columns) can be searched with specific queries. MIQL is a consensus between the different databases, so you should be able to use the same query across different repositories.
The MIQL syntax is based on the Lucene syntax. A query is broken into terms and operators:
- Terms: single words or phrases (a group of words surrounded by quotes). E.g. brca2 AND "pull down"
- Fields: search in specific columns. E.g. brca2 AND species:human
- Term modifiers: wildcard searches, fuzzy searches, proximity and range searches. E.g. brc*
- Operators: OR (or space), AND, NOT, +, -. E.g.
- brca2 AND rpa1 / brca2 NOT mouse / +brca2 -mouse -expansion:spoke
- Grouping and field grouping: brca2 AND (mouse "in vitro")
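Queries like the ones above are easy to assemble programmatically. This is a small Python sketch with hypothetical helper names (`term`, `all_of`, `any_of`); it only handles phrase quoting, field prefixes, and grouping, and does not escape special characters inside terms.

```python
def term(value: str, field: str = "") -> str:
    """Quote multi-word phrases and optionally restrict the term to a field."""
    text = f'"{value}"' if " " in value else value
    return f"{field}:{text}" if field else text


def all_of(*parts: str) -> str:
    """Group parts so that every one of them must match."""
    return "(" + " AND ".join(parts) + ")"


def any_of(*parts: str) -> str:
    """Group parts so that at least one of them must match."""
    return "(" + " OR ".join(parts) + ")"


query = all_of(term("brca2"), term("human", field="species"), term("pull down"))
```

Here `query` is the string `(brca2 AND species:human AND "pull down")`, which combines a plain term, a field search, and a phrase.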
Creating a PSICQUIC Service
Simplest recipe to implement PSICQUIC
- PSI-MITAB compliant file.
- Subversion: to get the source code.
- Maven: to run the scripts and start the service.
- Generate the MITAB compliant file.
- Get the Reference Implementation (RI)
- Run the script to index the file.
- Start the service with the script provided.
PSICQUIC is already implemented in several existing applications, including Cytoscape 2.7.x, PSICQUIC View, Envision2, and PSICQUIC Client for Android.
There is not currently anything in the GMOD suite that uses PSICQUIC. Should there be?
- Smart PSICQUICs: Identification and removal of redundancy
- Merger and Cluster PSICQUIC services
- PSICQUIC 2.0
- Overcome the current limitations and add many fancy features:
- Queries using CV terms not possible in the reference implementation (it is possible in IntAct).
- PSI-MI XML is created from the MITAB, so no n-ary interactions.
- New features:
- Redundancy detection mechanism. ROG/RIG ids by default.
- Built from PSI-MI XML, so complex data available.
- A GMOD component?
FlyBase is using the Chado interaction format. E. coli has lots of interaction data. Can we have a Chado service that talks PSICQUIC?
Following the talk, a couple of possible actions arose:
- Exporting from Chado to MITAB, so we can just create PSICQUIC services from any Chado-based application.
- Creating a component / adding interaction information to existing components.
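The export idea can be sketched without committing to any particular Chado query. The Python below assumes interactions have already been pulled out of the database into plain dictionaries (the dictionary keys are hypothetical names chosen here) and writes them in the 15-column PSI-MI TAB 2.5 layout, using `|` between multiple values and `-` for empty columns. The identifiers in the example are placeholders, not real data.

```python
from typing import Dict, List

# Shortened names for the 15 columns of PSI-MI TAB 2.5, in order.
MITAB25_COLUMNS = [
    "id_a", "id_b", "alt_id_a", "alt_id_b", "alias_a", "alias_b",
    "detection_method", "first_author", "publication", "taxid_a", "taxid_b",
    "interaction_type", "source_db", "interaction_id", "confidence",
]


def to_mitab_row(interaction: Dict[str, List[str]]) -> str:
    """Format one interaction as a tab-separated MITAB line."""
    cells = []
    for name in MITAB25_COLUMNS:
        values = interaction.get(name, [])
        # Multiple values are joined with '|'; a missing column becomes '-'.
        cells.append("|".join(values) if values else "-")
    return "\t".join(cells)


# Placeholder record standing in for data queried from a Chado database.
example = {
    "id_a": ["uniprotkb:EXAMPLE_A"],
    "id_b": ["uniprotkb:EXAMPLE_B"],
    "alias_a": ["name-one", "name-two"],
}
row = to_mitab_row(example)
```

A file of such rows is what the PSICQUIC reference implementation would then index.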
Bruno is unfamiliar with Chado, but if someone wants to give it a shot, he is more than willing to help and participate. All information about PSICQUIC can be found at Google Code.
And some basic information about the MITAB format may help.
Morris Swertz, PDF
MolGenIS is a flexible bioinformatics application toolkit for data management and interfacing. XGAP is an eXtensible Genotype And Phenotype system that was generated with MolGenIS to store and visualize xQTL and GWAS data.
MolGenIS has been used to generate systems for many different types of applications and datatypes. MolGenIS based systems and users include GEN2PHEN, XGAP, UMCG, FIMM, Sysgenet, and many others.
MolGenIS is a system generator. It addresses the recurring issue of generating custom databases for each new application that comes along. The traditional approach requires database design, backend (server) coding, API development, and user interface coding, all of which is bioinformatician intensive. This approach does not have reusability and interoperability as a natural byproduct of development. With MolGenIS, system developers provide a system definition, which MolGenIS then uses to automatically instantiate a system that implements the definition. Writing a system definition requires learning new skills, but is still much less time intensive than creating a system from scratch.
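The general idea of generating a system from a definition can be shown in a few lines. This Python sketch is only an illustration of the approach: the dictionary-based definition format and the `generate_ddl` function are invented here and are not the MolGenIS definition language or its generator.

```python
import sqlite3
from typing import Dict, List

# Hypothetical system definition: entity name -> {field name: SQL type}.
DEFINITION: Dict[str, Dict[str, str]] = {
    "sample": {"id": "INTEGER PRIMARY KEY", "name": "TEXT NOT NULL"},
    "measurement": {"id": "INTEGER PRIMARY KEY",
                    "sample_id": "INTEGER",
                    "value": "REAL"},
}


def generate_ddl(definition: Dict[str, Dict[str, str]]) -> List[str]:
    """Turn each entity in the definition into a CREATE TABLE statement."""
    statements = []
    for entity, fields in definition.items():
        columns = ", ".join(f"{name} {sql_type}"
                            for name, sql_type in fields.items())
        statements.append(f"CREATE TABLE {entity} ({columns});")
    return statements


# Instantiate the generated schema in an in-memory database.
conn = sqlite3.connect(":memory:")
for statement in generate_ddl(DEFINITION):
    conn.execute(statement)
tables = sorted(row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"))
```

A real generator would produce server code and user interfaces from the same definition, which is where the savings over hand coding come from.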
MolGenIS includes built in support for many features:
- database generation
- server code generation
- User interface generation, including edit interfaces and audit trails
- Import/Export to Excel
- R interoperability
- workflow ready web services using REST, SOAP and RDF
- UML documentation of underlying models
MolGenIS also comes with extensive documentation, including a development manual.
Generated systems can also be customized. The user interface can be extended with plugins, each implemented as a Java class and a layout definition. Similarly, plugins can be added to the server side by defining a Java class.
The database backend currently uses a custom object-relational mapping (ORM). Hibernate was considered six years ago, but was lacking key features. The long term hope is to migrate to a standard ORM such as Hibernate.
XGAP (eXtensible Genotype And Phenotype) was developed for xQTL and GWAS data.
The data is logically in a series of matrices with a different matrix for each datatype (e.g., genotype, microarray, LC/MS, ...). The initial idea was to create a database table for each datatype, but this would have led to a proliferation of structurally similar database tables, and would require schema changes with the addition of each new type in the future. (Imagine Chado's feature table split into gene, ssr, snp, exon, etc. tables.)
XGAP addresses this by embracing a generic matrix model: any trait X any subject. All matrices are stored in a common database table where each row corresponds to a single element in a matrix. Schema changes are not required to add new matrices or new columns to existing matrices. This is all done by adding matrix and column definitions to definition tables in the database.
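A generic matrix store of this kind can be sketched with two tables. The Python and SQLite example below is a simplified illustration, not the XGAP schema: the table and column names are invented here, traits are treated as rows and subjects as columns, and all values are stored as text.

```python
import sqlite3
from typing import Dict

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE matrix (
    id INTEGER PRIMARY KEY,
    name TEXT UNIQUE NOT NULL,
    value_type TEXT NOT NULL
);
CREATE TABLE element (
    matrix_id INTEGER NOT NULL REFERENCES matrix(id),
    trait TEXT NOT NULL,
    subject TEXT NOT NULL,
    value TEXT,
    PRIMARY KEY (matrix_id, trait, subject)
);
""")


def add_matrix(name: str, value_type: str) -> int:
    """Register a new matrix; no schema change is needed."""
    cursor = conn.execute(
        "INSERT INTO matrix (name, value_type) VALUES (?, ?)",
        (name, value_type))
    return cursor.lastrowid


def set_value(matrix_id: int, trait: str, subject: str, value: object) -> None:
    """Store one matrix element as one row."""
    conn.execute("INSERT OR REPLACE INTO element VALUES (?, ?, ?, ?)",
                 (matrix_id, trait, subject, str(value)))


def get_trait(matrix_id: int, trait: str) -> Dict[str, str]:
    """Return one matrix row as a subject -> value mapping."""
    rows = conn.execute(
        "SELECT subject, value FROM element WHERE matrix_id = ? AND trait = ?",
        (matrix_id, trait))
    return dict(rows.fetchall())


genotypes = add_matrix("genotype", "text")
expression = add_matrix("microarray", "decimal")
set_value(genotypes, "marker1", "individual1", "A")
set_value(genotypes, "marker1", "individual2", "B")
set_value(expression, "probe1", "individual1", 7.25)
```

Adding the second matrix required only a new row in `matrix`, which is the property the generic model is after.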
FuGE (Functional Genomics Experiment) is a standard model for this type of information. XGAP builds on top of this.
GMOD Link Ideas
- XGAP harmonization towards Chado?
- MolGenIS 4 Chado? Did BioSQL a few years ago.
- BioMart / InterMine
- Consume BioMart data to auto-annotate experimental data?
- Export XGAP experiments into MART/MINE query environments?
The GMOD Chado Natural Diversity Module
- Manage phenotypic and genotypic data for both field collected and captive bred organisms
- Store collection site information for growing "next gen"-based variation data
- Leverage existing/future Chado modules, GMOD tools and know-how
- Early version:
- HeliconiusDB @ NESCent (National Evolutionary Synthesis Center)
- Inspired by GDPDM (The Genomic Diversity and Phenotype Data Model)
- August 2010
- Natural Diversity module merged into Chado svn trunk
The module makes use of the pre-existing stock module. It adds support for Experiment, Geolocation, and Genotype and Phenotype (reusing some existing tables). The talk walked through how three specific use cases would be implemented:
- Cross experiment
- Field collection
- Phenotype assay
CV Terms and APIs
The schema is very flexible. nd_experiment.type and nd_experiment_stock.type are key. There are several ways to do the same thing. The working group is hoping to agree on core CV terms to aid API development. VectorBase is planning a simplified API that abstracts the module's tables into:
- experiments, for which we propose at least three subclasses:
- field collections
- phenotyping experiments
- genotyping experiments
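One way such an abstraction could look is a base class with a subclass per experiment type, chosen from the experiment's type term. The Python below is a hypothetical sketch, not the planned VectorBase API: the class names, the `wrap` function, and the CV term names in the lookup table are all assumptions, since the core CV terms had not yet been agreed.

```python
from dataclasses import dataclass
from typing import Dict, Type


@dataclass
class Experiment:
    """Generic wrapper around one experiment record."""
    experiment_id: int
    type_name: str


class FieldCollection(Experiment):
    """Experiment whose type marks it as a field collection."""


class PhenotypingExperiment(Experiment):
    """Experiment whose type marks it as a phenotype assay."""


class GenotypingExperiment(Experiment):
    """Experiment whose type marks it as a genotype assay."""


# Hypothetical CV term names; the real terms were still under discussion.
SUBCLASS_BY_TYPE: Dict[str, Type[Experiment]] = {
    "field collection": FieldCollection,
    "phenotype assay": PhenotypingExperiment,
    "genotype assay": GenotypingExperiment,
}


def wrap(experiment_id: int, type_name: str) -> Experiment:
    """Pick the subclass from the type term, falling back to Experiment."""
    cls = SUBCLASS_BY_TYPE.get(type_name, Experiment)
    return cls(experiment_id, type_name)


collection = wrap(1, "field collection")
other = wrap(2, "cross")
```

This shows why agreed CV terms matter: the dispatch table only works if every database uses the same type names.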
Cosmic GBrowse: Visualising cancer mutations in genomic context
The Cancer Genome Project (CGP) started in 2000. COSMIC, the Catalogue Of Somatic Mutations In Cancer was launched on 4 February 2004. COSMIC is a website and backing Oracle database. COSMIC mutation data comes from several sources.
- Three curators who read and annotate publications.
- Other databases, e.g. the TP53 database at IARC (International Agency for Research on Cancer)
- Sequencing/mutation detection
The project is planning on launching COSMIC GBrowse on 22 September 2010.
- GBrowse and CGP
|Q.||How could we visualise the data deluge from next generation sequencing?|
|A.||GBrowse. (See Keiran Raine's presentation at the January 2010 GMOD Meeting.) A near instant solution to the problem (days/weeks, rather than months/years for an in house solution). Looked at lots of options. GBrowse looked like the clear winner - it's configurable and meets needs.|
|Q.||COSMIC was designed to be gene centric but what about sequencing whole cancer genomes and visualising mutations in genomic context?|
- Reference genome (GRCh37) + cytogenetic bands
- Ensembl annotations (e! 58)
- Cosmic Transcripts
- Mutations (substitutions, insertions/deletions)
- Copy Number Profiles
- analysis of SNP6 microarray data over 800 cell lines
- % samples which have copy number features (amplification, homozygous deletion, LOH, change)
Configuration and Setup
- 5 Virtual Machines (Debian Linux, 2 GB RAM)
- dev + master + renderfarm slaves (2) + PostgreSQL. The Master talks to the two slaves, both of which talk to the reference and mutations databases.
- 2 databases: ‘Reference’ and ‘Cosmic’
- scripts to query/format/populate these databases
- cosmic css/theme
- perl callbacks: glyphs, colours, hyperlinks, popups/tooltips
COSMIC GBrowse shows:
- genes, COSMIC transcripts, non-coding RNA
- breakpoints with lightning (!) and detailed popups
- Copy number change, with color, and links to CONAN.
- LOH, with color
- Mutations density
- Mutation details (intronic, nonsense, missense, silent, non-coding, frameshift, in frame, complex, deletion, insertion), with colors and shapes, a key, and detailed popups
- See slides for screenshots.
- At COSMIC
- Embed cosmic GBrowse in some cosmic web pages - replace old and slow drawing code and extend functionality.
- The current version is a summarised view of the whole COSMIC dataset. We need to be able to display subsets of data. How can we display all mutations for a specific sample or group of samples, or from a specific tissue or tumour type? There are too many for a static list of data sources, but there is a neat trick ...
- Define data source in the URL, eg sample COLO-829: http://www.sanger.ac.uk/fgb2/gbrowse/sample_COLO-829
- GBrowse.conf ... (need at least 2.09)
 [=~sample_(.+)]
 description = Cosmic Database v48 (sample filtered)
 path = /gbrowse/bin/source_config.pl -sample $1 |
 # path points to a script which generates the config
 # sample name 'COLO-829' is passed to the script from the regular expression
 # track configuration generated for data source COLO-829
 …
 [Mutations]
 remote feature = http://…/cosmic_export.cgi?sample=COLO-829
 # cgi script returns COLO-829 mutation data from COSMIC
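The mechanics of the trick are a regular expression match on the data source name, with the matched sample name handed to a script. The Python sketch below mimics that step for illustration only; it is not GBrowse code, and it assumes the pattern captures the sample name in a group so that it can be passed on as `$1` is in the configuration above.

```python
import re
from typing import List, Optional

# Assumption: the sample name is whatever follows the "sample_" prefix.
SOURCE_PATTERN = re.compile(r"sample_(.+)")


def sample_from_source(source_name: str) -> Optional[str]:
    """Return the captured sample name, or None for other data sources."""
    match = SOURCE_PATTERN.fullmatch(source_name)
    return match.group(1) if match else None


def config_command(source_name: str) -> Optional[List[str]]:
    """Build the command line that would generate the source's config."""
    sample = sample_from_source(source_name)
    if sample is None:
        return None
    return ["/gbrowse/bin/source_config.pl", "-sample", sample]


command = config_command("sample_COLO-829")
```

For the data source `sample_COLO-829`, the captured sample name is `COLO-829`, so one configuration stanza can serve any number of samples.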
- GBrowse Development
- remote feature - perl callbacks cannot be used until Safe::World is fixed
- init_code - perl callbacks defined with init_code not accessible from slaves
- BAM/SAM read sorting by similarity to reference
- GC plots can give >100% values
CGP committed to using GBrowse as its internal browser for next gen sequencing data, and an external browser for COSMIC data (genomic view of mutations, breakpoints and copy number data). COSMIC GBrowse to be released soon (22/9/2010?). CGP is also involved in GBrowse development. A new developer has been recruited, but details are still being discussed.
GMOD Projects at the Center for Genomics and Bioinformatics
A Simple Web Interface for Configuring GBrowse: WebGBrowse
(By Ram Podicheti, as channeled by Chris)
WebGBrowse is a web interface for configuring GBrowse installations. You can upload GFF files and optionally upload an existing GBrowse config file to use as a starting point. From there, you can add, edit, and remove tracks using web forms. WebGBrowse comes with extensive help embedded in the forms and includes a tutorial. Users can preview their changes in GBrowse at any point. WebGBrowse makes GBrowse more feasible for small projects that can figure out configuration, but don't have the resources to set up their own server.
WebGBrowse can be downloaded and locally installed. There is a mailing list for support, feature requests, and contributions. We want to help you help us add support for more features. WebGBrowse has passed the nomination process and is now a pending GMOD component, waiting only on migration of the development environment to a public repository.
WebGBrowse has supported GBrowse 2 for quite a while. It does not support callbacks yet (and this is hard due to security considerations).
Web-based Bioinformatics Pipelines for Biologists: ISGA
(By Chris, Aaron Buechlein, Ram, Jeong-Hyeon Choi, and Boshu Liu as channeled by Chris)
ISGA is a workflow management system that can meet the needs of a small sequencing center. It supports flexible pipeline definition for new pipelines, and for incorporating new programs as components. ISGA supports distributed computing environments, if you have a potential need to grow beyond local computing resources. ISGA was created at CGB to minimize CGB staff involvement in running pipelines. ISGA frees up staff resources for building new pipelines.
ISGA is built on top of Ergatis. Ergatis is developed and supported by the Institute for Genome Sciences, U. Maryland. Ergatis enables building pipelines from existing programs, supports distributed computing environments, and has robust monitoring of pipeline execution. Ergatis comes with 10+ readily available pipelines, and there are more available in the community. There are currently 220 tool/component definitions that come with Ergatis, and again, there are more in the community. Components and pipelines are defined in XML. XML/BSML is the common data exchange format. XML/BSML is optional, but recommended for reusable components. Ergatis includes conversion tools for FASTA, GFF, Chado, etc. This isolates format changes from other programs. Ergatis runs on Condor out of the box.
Ergatis's interface assumes that a computationally savvy biologist will be using it. In practice, this can lead to the informatics staff being the practical interface between biologists and Ergatis. CGB had several goals when developing ISGA:
- Wanted to support single-lab biologists that are self-sufficient but have limited bioinformatics resources and that embrace tools that don’t require extensive training
- Ability for biologists to run pre-configured pipelines quickly
- Option to customize specific tools in a pipeline
- An interface that encourages exploration:
- Remove complexity and information biologists don’t need
- Inline help
- Immediately detect errors and allow biologists to correct them
- Return output in useful formats
- Simple tools for visualizing and searching large result sets
ISGA does this and several other things too. First, it simplifies pipelines by hiding housekeeping components and by grouping components into clusters representing processes. ISGA supports customization: users can disable components, replace components with pre-computed data, and edit scientifically-active program parameters. It also provides help and validation for all forms, and incorporates visualization and analysis tools. In addition, ISGA supports the concepts of users and data privacy, and users can upload and download data.
Why develop ISGA as a separate package?
ISGA only re-implements the web interface of Ergatis. Ergatis libraries, component definitions, and methods of running and monitoring pipelines are used by ISGA as-is. ISGA adds and removes Ergatis features such as accessing component information and building pipelines from components. ISGA biologist users need to be given limited functionality for simplicity and security. Ergatis bioinformatician users need full functionality and a complex interface to work efficiently. A hybrid ISGA/Ergatis interface wouldn’t serve anyone.
Present and Future
ISGA at Indiana has run over 100 pipelines, and has more than 60 users. There are two external sites evaluating their own ISGA installation that CGB knows of.
Recent developments in ISGA include
- Celera assembly pipeline
- Ability to accept parameters with pipeline inputs
- Ability to iterate components over a list of pipeline inputs
- Conversion scripts for Hawkeye visualization
- Installation instructions :shame
- Administration improvements
- Online configuration
- User classes and pipeline quotas
And there is more in the works:
- SHORE SNP Calling (ISGA)
- Gene clustering over Microbial phylogenies (Ergatis)
- Transcriptome annotation pipeline (Ergatis)
- Methyl-seq (Ergatis)
- Pipeline reproducibility and provenance
- User groups and sharing
- Modular pipeline and toolbox installation
- ISGA pipelines as standalone Ergatis templates
- ISGA pipeline over Amazon EC2 via CloVR
- Cloud Resources through CloVR
- Execute Ergatis Pipelines over an SGE instance hosted on Amazon EC2 machine images
- CloVR manages creation and shutdown of cloud images as part of pipeline
- Upload input as part of pipeline or access data hosted at Amazon
- Results are retrieved to local machine
- Ergatis assumes a shared filesystem, so some modification is required to manage file transfers
- Using CloVR with ISGA
- ISGA/Ergatis pipelines can be ported to ISGA/CloVR
- ISGA installation communicates with local Ergatis and CloVR
- EC2 presents challenges for billing customers
GMOD RPC API: The almost RESTful GMOD API
Josh started with this scenario:
Fetch me all genes annotated with GO:0003677 (DNA Binding) from D. melanogaster, C. elegans, T. castaneum, and B. mori. Then fetch the current ID, symbol and list of orthologs for each.
We currently do this with a mixture of file downloads, SQL calls to different DB systems, a patchwork of parsing scripts, and screen scraping. Instead, we should be doing:
$ curl http://flybase.org/gmodrpc/v1.1/ontology/gene/GO:0003677
$ curl http://wormbase.org/gmodrpc/v1.1/ontology/gene/GO:0003677
This idea was motivated by a discussion at the July 2008 GMOD Meeting where a simple request, like the one above, required screen scraping. This work uses REST over HTTP to gather information. REST is an alternative or successor to CORBA, a heavyweight protocol for sharing information, and to SOAP, which is more recent but still too heavy for our purposes.
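The same request can be prepared from a script. The sketch below only builds the URLs, following the pattern in the two curl commands above; it does not contact any server, since these were proposed endpoints. The function name `genes_by_go_url` and the `MOD_HOSTS` list are illustrative assumptions.

```python
from urllib.parse import quote

# Hosts taken from the curl examples above; other MODs would be added here
# once they offer the same service (an assumption, not a given).
MOD_HOSTS = ["flybase.org", "wormbase.org"]
API_VERSION = "v1.1"


def genes_by_go_url(host, go_id, version=API_VERSION):
    """Build the proposed 'genes annotated with a GO term' URL for one MOD."""
    # Keep the colon in identifiers such as GO:0003677 unescaped.
    term = quote(go_id, safe=":")
    return f"http://{host}/gmodrpc/{version}/ontology/gene/{term}"


urls = [genes_by_go_url(host, "GO:0003677") for host in MOD_HOSTS]
for url in urls:
    print(url)
```

Because every MOD would expose the same path layout, the scenario above reduces to one loop over hosts rather than one custom parser per database.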
The GMOD RPC API proposal supports a number of information services:
- Full text search
- Gene ontology
- Fetch common gene page
In an ideal world each MOD would provide these services.
The idea is to provide top level classes. FlyBase will provide a specific Chado/Perl based implementation. However, the proposal is trying to be agnostic in terms of what data types are expected. Josh is working on the Perl implementation. Others are working on PHP (Jim Hu) and Java implementations.
- Strict MVC separation.
- Moose used for the model
- Moose is much better than Perl 5 objects.
- GMOD RPC API will provide base code and utility functions. You extend base class of each service to implement based on your environment.
- Template::Toolkit for the view
- Perl’s Dancer for the controller
- Simple and clean with minimal dependencies. Perl implementation of Ruby's Sinatra.
- Easy to install. Decided against Catalyst because of installation and dependencies. Want something simple to get this off the ground.
- Can be run under CGI, PSGI (Plack), and FastCGI on a variety of web servers (Apache, Nginx and lighttpd)
- Log::Log4perl for logging
- Standard Test::More unit tests
- Short term
- Alpha release by end of October 2010
- Beta release by end of December 2010
- Long term
- DAS tie in
- Validation for XML formats
- Java, PHP and Python APIs
- Evaluate additional API features
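The extension model described for the implementation (shared base code per service, with each site subclassing it for its own environment) can be sketched in Python as follows. This is an analogy only, not the Perl/Moose code: the class names, the `handle` and `fetch_genes` methods, and the in-memory data are all hypothetical.

```python
from abc import ABC, abstractmethod


class OntologyGeneService(ABC):
    """Hypothetical base class: request handling shared by all sites."""

    def handle(self, go_id):
        go_id = go_id.strip().upper()
        if not go_id.startswith("GO:"):
            raise ValueError(f"not a GO identifier: {go_id!r}")
        # The site-specific part is delegated to the subclass.
        return {"go_id": go_id, "genes": sorted(self.fetch_genes(go_id))}

    @abstractmethod
    def fetch_genes(self, go_id):
        """Return gene identifiers annotated with go_id (site-specific)."""


class InMemoryOntologyGeneService(OntologyGeneService):
    """Stand-in for a site implementation (e.g. one backed by a database)."""

    def __init__(self, annotations):
        self._annotations = annotations

    def fetch_genes(self, go_id):
        return self._annotations.get(go_id, [])


service = InMemoryOntologyGeneService({"GO:0003677": ["geneB", "geneA"]})
result = service.handle("go:0003677")
print(result)
```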
How to participate
- Subscribe to gmod‐devel
- GMOD REST API
- SVN code repos
The plan is to keep old API versions around, that is, to keep old URLs accessible by making URLs constant and stable.
There is a mechanism to query what services are available - returned as an XML list of services from above list (organism, ...).
Queries can use taxonomy id, or just genus and species. Can get sequence by asking for a gene and SO terms
Can this REST interface be made a standard feature in a GBrowse implementation? That's ideally what we should shoot for. This is where this should go.
Is there a way to pull this in whole, instead of at a retail level? Not currently, but FlyBase puts gene reports into XML, uses XSLT to generate the web pages, and then saves the XML in Lucene for full text searching.
Overview of current resources and update on DAS Meeting Cambridge 2010
DAS stands for Distributed Annotation System. It is based on HTTP and XML. From a user perspective, you run a client program (e.g., GBrowse, Apollo, Ensembl, ...), pick a coordinate system, connect to a DAS registry to get a DAS server list, and then request a region of interest from the reference and many annotations from the DAS servers.
Some DAS 1.5/1.6 Commands
- Features - used by genomic clients
- Stylesheet - give tips to the client about how the data should be displayed
- Interaction (has largely been superseded by PSICQUIC).
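To make the features command concrete, here is a minimal Python sketch of what a client does: build a request for a region and read the features out of the XML reply. The server name, source name, and the sample response are made up for illustration, and the response is a heavily simplified stand-in for a real DAS reply; no network access is performed.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def features_url(server, source, reference, start, stop):
    """Build a DAS features request for one segment of the reference."""
    query = urlencode({"segment": f"{reference}:{start},{stop}"}, safe=":,")
    return f"{server}/das/{source}/features?{query}"


# Hand-written, simplified example of a features response (illustrative only).
SAMPLE_RESPONSE = """<DASGFF>
  <GFF>
    <SEGMENT id="1" start="1000" stop="2000">
      <FEATURE id="feat1" label="example feature">
        <TYPE id="gene">gene</TYPE>
        <START>1200</START>
        <END>1800</END>
      </FEATURE>
    </SEGMENT>
  </GFF>
</DASGFF>"""


def parse_features(xml_text):
    """Extract id, type, start and end for every FEATURE element."""
    root = ET.fromstring(xml_text)
    features = []
    for feature in root.iter("FEATURE"):
        features.append({
            "id": feature.get("id"),
            "type": feature.findtext("TYPE"),
            "start": int(feature.findtext("START")),
            "end": int(feature.findtext("END")),
        })
    return features


print(features_url("http://example.org", "mysource", "1", 1000, 2000))
print(parse_features(SAMPLE_RESPONSE))
```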
Why use DAS 1.6 over 1.5?
- Clarification of the way DAS is being used - should promote interoperability
- Represent features with more than two levels (1.6)
- Represent Genes → Transcript → Exons
- GFF3 will be a supported format (Adapters for servers and databases).
- MyDAS server will support this without the need for a database
- Reliably relate feature types to a more structured ontology
- cvId attributes in the xml for SO: or ECO ids - use of these may become mandatory in a future specification.
- Already have servers that produce 1.6 data.
- Lots of clarification in the standard makes it easier to write DAS client code
- Cleaner ontology support. Hope is people can more easily use the DAS registry.
- Sources documents advantages:
- have coordinate systems which mean you are mapping annotations to the correct genomes/sequences.
- Smoother running of Ensembl and other DAS clients.
- You can automatically load many DAS sources to the DAS registry using your sources document and the registry should keep in sync with new additions/deletions/alterations.
- MyDAS and Proserver support the use of sources and all other 1.6 specification commands and responses.
- Some example queries
- keywords parameter to sources cmds e.g.
- keywords parameter to coordinatesystem command
- added total, start, end attributes to coordinatesystem request response if rows specified
Easy DAS is a DAS server hosted at the EBI. Easy DAS removes the need to have your own servers and databases. It accepts file uploads in a number of formats. You can use EBI as your server.
- Can specify what formats your server can give to clients.
- Can make up your own format
- Ask that you write up a description of your format on DAS wiki.
- DAS Writeback (implemented, Create Read, Update Delete)
- Apollo could use these.
- Longer genomic alignments
- Dalliance Thomas Down (example)
- IGV Broad Institute
- Karyodas (Decipher, mykaryoview)
- Apollo - new DAS DataAdapter release soon. (Written by Jonathan about 6 months ago. Needs to be finished and sent to Ed Lee.)
- JBrowse? - in current grant proposal.
Other DAS Clients:
- Ensembl uses DAS to pull in genomic, gene and protein annotations. It also provides data via DAS.
- GBrowse is a generic genome browser, and is both a consumer and provider of DAS.
- IGB is a desktop application for viewing genomic data.
- SPICE is an application for projecting protein annotations onto 3D structures.
- Dasty2 is a web-based viewer for protein annotations
- Jalview is a multiple alignment editor.
- PeppeR is a graphical viewer for 3D electron microscopy data.
- DASMI is an integration portal for protein interaction data.
- DASher is a Java-based viewer for protein annotations.
- EpiC presents structure-function summaries for antibody design.
- STRAP is a STRucture-based sequence Alignment Program.
Jonathan doesn't think it's hard to write a DAS adaptor for existing databases.
InterMine: new Mines and new features
InterMine is a query-optimised data warehouse system. It is a means to integrate all your existing databases into a single coherent database. InterMine is written in Java and uses an object-based data model. It is implemented on top of PostgreSQL, is free and open source (LGPL).
InterMine comes with tools for loading data from popular data sources such as UniProt, Ensembl, Chado, PSI, InParanoid, and many more. InterMine also provides Java and Perl APIs for loading custom data sources, FASTA, XML, and GFF3. If there are conflicts in data from one or more sources, you can write rules to prioritize one data point over another.
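The idea of priority rules can be sketched as follows. This is not InterMine's configuration format or API; the source names, field names, and the `merge_records` function are hypothetical and only illustrate resolving a conflict field by field according to a per-field source ranking.

```python
# Hypothetical rules: for each field, sources listed from most to least trusted.
PRIORITY = {
    "symbol": ["SourceA", "SourceB"],
    "length": ["SourceB", "SourceA"],
}


def merge_records(records_by_source, priority=PRIORITY):
    """Merge per-source records for one object, field by field."""
    merged = {}
    fields = {field for record in records_by_source.values() for field in record}
    for field in sorted(fields):
        ranked = list(priority.get(field, []))
        # Sources without an explicit rank are tried last, in name order.
        ranked += [s for s in sorted(records_by_source) if s not in ranked]
        for source in ranked:
            value = records_by_source.get(source, {}).get(field)
            if value is not None:
                merged[field] = value
                break
    return merged


records = {
    "SourceA": {"symbol": "abc1", "length": 100},
    "SourceB": {"symbol": "ABC-1", "length": 120, "name": "x"},
}
merged = merge_records(records)
print(merged)
```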
InterMine's web interface works for any data model and includes advanced functionality for bench biologists. It is highly configurable from within the web interface. The web interface includes:
- QueryBuilder - custom queries, advanced
- Template queries - pre-defined queries
- Report pages - configurable, templates (GO enrichment, publication enrichment, ...)
- Lists - upload, use in queries, analysis widgets
- Export & API - exports in several formats and has a RESTful query API
- MyMine - users can create an account and save their lists and queries.
Mines for MODs
This project is creating mines at several large MODs. FlyMine was the original application for InterMine. RGD's mine has been released. SGD and ZFIN's mines are in beta. There is a common interface across all of the mines. Each organization supports its mine with 0.5 FTE.
InterMine runs alongside the existing web site. It contains MOD data plus data from other sources. The InterMine team is working on better support for embedding of InterMine, and new features for MODs. SGD is using InterMine for searching, replacing older technology. This project is providing lots of input for InterMine development!
These mines can currently be integrated through exporting of InterMine lists from one mine to another. Richard gave an example of:
- Identify a list of (141) upregulated genes in heart at FlyMine.
- Export the orthologues to RGD
- Identify the list of those genes that are related to cardiovascular disease (40).
- Export the orthologues back to FlyMine
- Look for relevant publications, export sequences, ...
They are working on enabling this type of interoperability by fetching results dynamically between mines. This would also allow people to create mashups.
This brand new project will contain data on metabolic diseases such as diabetes and obesity. InterMine fits quite well with this kind of research. Currently researchers look at 10+ web sites for their gene of interest. The idea is to put all that data in one place. This is being done for human, mouse and rat, and from lots of data sources (SNPs, HapMap, expression, ENCODE, ...), for many of which parsers already exist. Users will be able to prioritise candidate genes, compare lists, find common attributes, and upload data.
InterMine 0.94 Release
Search has been greatly expanded. The existing Templates and QueryBuilder interfaces are structured searches. 0.94 adds full text searching of the whole database, using Lucene. For FlyMine it takes 45 minutes to create a 2.5 GB index file, using parallel fetching and precomputes. Each object is a document with attributes as fields and related data (e.g. GO, pathways) included as well.
The 0.94 release also adds support for faceted search. Facets are different aspects of the current result that can be used to group and filter them. Some common facets might be organism or feature (Sequence Ontology) type. For organism you could include or exclude subsets of data based on what their organism is. You can have a facet on any property, and you may filter with multiple facets. Displaying all this information is hard. InterMine's faceted search uses BOBO from LinkedIn, and integrates with InterMine's lists and templates.
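A toy version of faceting shows the two operations involved: counting how results fall into each facet value, and filtering by including or excluding values on one or more facets. This sketch is plain Python over a made-up result list; it does not use Lucene or BOBO, and the function names are illustrative.

```python
from collections import Counter

# Made-up search results; "organism" and "type" are the facets.
RESULTS = [
    {"id": "r1", "organism": "D. melanogaster", "type": "gene"},
    {"id": "r2", "organism": "D. melanogaster", "type": "exon"},
    {"id": "r3", "organism": "C. elegans", "type": "gene"},
]


def facet_counts(results, facet):
    """Count results per value of one facet (what a facet sidebar displays)."""
    return Counter(r[facet] for r in results if facet in r)


def apply_filters(results, include=None, exclude=None):
    """Keep results matching every include facet and no exclude facet."""
    include = include or {}
    exclude = exclude or {}
    kept = []
    for r in results:
        if any(r.get(f) not in allowed for f, allowed in include.items()):
            continue
        if any(r.get(f) in banned for f, banned in exclude.items()):
            continue
        kept.append(r)
    return kept


print(facet_counts(RESULTS, "organism"))
genes_not_worm = apply_filters(
    RESULTS, include={"type": {"gene"}}, exclude={"organism": {"C. elegans"}}
)
print([r["id"] for r in genes_not_worm])
```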
InterMine 0.94 also includes
- cleaner template management and implementation
- significant performance improvements
- a GUI to simplify installation and data source management
- Galaxy integration - any query you run, you can export to Galaxy.
- Improved Perl web service API
- Automatic SO to model generation
- Upload list of genome regions
- CytoscapeWeb plugin
Literature Curation in GMOD
The goal of literature curation is to get information from papers into a format where they can be searched, grouped, reasoned with, and displayed. Curation is often done by PIs, students, the occasional bioinformaticist, and paid curators. This is comparatively time consuming, and there are lots of papers, but few people. It is also relatively costly and depends heavily on the expertise of the curators.
GMOD provides a number of tools to help with curation:
- Chado Publication Module
- Pathway Tools
- Apollo (or Artemis)
Some of these support automated annotation while others support manual annotation.
Daniel closed the talk with some details on distributed / community annotation. EcoliWiki curates PubMed entries and then notifies authors for further refinement. CACAO, a related project, teaches undergraduates how to do annotation, puts them in teams, has them compete, and then rewards them with food and drink (and grades). UniProt has an option to submit suggestions and comments. The Rfam database is a wiki. PubMed has authors tag things. Finally, the BeeSpace project did some work with fly and their difficult gene names and did pretty well.
See the summary of the Community Annotation Satellite Meeting for more on this topic.
Towards a GO Annotation Tool: Curation Accelerator Software
FlyBase (and other) literature curators read papers, extract information per paper, recording new constructs, phenotypes, and allele construction. FlyBase also does gene ontology (GO) annotation and wants to make this easy for curators or the community to do. The goal is to avoid having people do things that computers could do in order to save curation time and make it easy for a beginner or community annotator.
Each gene should have up to 3 GO terms associated with it, one for each of GO's 3 areas: molecular function, biological process, and cellular component. Each GO term has a standard definition designed to have the same biological meaning across the entire living world, i.e. for all disciplines, in any organism, in any database. GO ontology terms are designed to standardise the biology.
Creating GO annotation is a common problem in the GMOD community. There is a file standard called GO Annotation Format (GAF) for GO annotations. This format is tabular and is currently being revised to add two new columns (for 17 columns total). We wanted a tool that can create GAF entries in standard GO format or in FlyBase format.
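For orientation, here is a small Python sketch that writes one tab-separated GAF line with 17 columns. The column names follow my understanding of the revised 17-column GAF layout and should be checked against the GO Consortium's format documentation; the gene identifier, symbol, and reference are placeholders, not real records, and `gaf_line` is a hypothetical helper.

```python
# Column order for the 17-column layout (the last two are the new columns).
GAF_COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "db_reference", "evidence_code", "with_from", "aspect",
    "db_object_name", "db_object_synonym", "db_object_type", "taxon",
    "date", "assigned_by",
    "annotation_extension", "gene_product_form_id",
]


def gaf_line(annotation):
    """Serialise one annotation dict as a tab-separated GAF line."""
    unknown = set(annotation) - set(GAF_COLUMNS)
    if unknown:
        raise KeyError(f"unknown GAF columns: {sorted(unknown)}")
    # Columns that are not supplied are written as empty fields.
    return "\t".join(str(annotation.get(column, "")) for column in GAF_COLUMNS)


# Placeholder annotation: identifiers below are invented for the example.
annotation = {
    "db": "FB",
    "db_object_id": "FBgn0000000",
    "db_object_symbol": "example-gene",
    "go_id": "GO:0003677",
    "db_reference": "PMID:0000000",
    "evidence_code": "IDA",
    "aspect": "F",
    "db_object_type": "gene",
    "taxon": "taxon:7227",
    "date": "20100913",
    "assigned_by": "FlyBase",
}
line = gaf_line(annotation)
print(line)
```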
Existing GO curation tools usually have a Perl or Java front end and directly input annotation to a database. FlyBase curators edit text files. The curation process at FlyBase is independent of database access. This means the curation and database interfaces can be developed independently, with each doing their own job well.
FlyBase's strategy is to create a curator-designed GUI that lends itself to both speed and accuracy. Using a modular design, FlyBase has designed a GO "clever editor" of "GO lines" (text lines) that uses an XML configuration file to say how many textual components there are per line item and what the allowed content of each sub-line component is. An API will permit customisation of inputs. Programming is in Java. A proof of concept desktop GO tool has been created. The basic function of the GO tool is to create, edit and house existing "GO lines" - i.e. lines in GAF files. The tool shows a GAF line as it is being built. The focus is on reducing mouse-work, tool switches, ‘manual’ editing, and textual errors. The latter are avoided by using autocomplete.
There is one window, with buttons for many tools: the next adaptation will be for phenotype/anatomy Controlled Vocabulary annotation. This tool may ultimately be translated to a web service. The tool currently runs on Mac OS X (testing) 10.5 and 10.6 preAlpha_v002. It can use as little as 800MB of RAM (testing for less - but depends on inputs for autocomplete). It accepts flat file input but can also talk to PostgreSQL. Plans for the future include Chado queries, XML configuration for the clever editor, APIs in collaboration, and a JSP or J2EE/SOAP web interface.
See the talk for many screenshots of the prototype.
BioPivot
- Applying Microsoft Live Labs Pivot to Problems in Bioinformatics
Steve is interested in visualization of large numbers of genome regions, and in querying and filtering properties of genome regions. This talk covers his work on the BioPivot tool, based on Microsoft's Pivot tool. He'd also like to have an open discussion of other applications of this technology.
Steve is the head of the Computational Biology Research Group (CBRG) at the University of Oxford. CBRG provides core services for research at Oxford. They use lots of next generation sequencing for mainly human and mouse projects. They visualize all this in GBrowse. CBRG has over 50 different GBrowse databases showing time series, arrays, ChIP-Seq, and RNA-Seq, histone modification data, cis and trans Interaction data, PCR amplified regions, and exome sequencing and SNP detection.
There are challenges with this much data. For example, slight problems with experimental conditions or antibodies can cause peak finding algorithms to find a lot of false positives. These algorithms have lots of parameters and you must pick a cutoff, and you have to eyeball lots of peaks to do this. Which of my peaks overlap with genes, exons, promoters, CpG islands, etc.?
The traditional approach is to make a spreadsheet of data with links to GBrowse/UCSC regions of interest, click/filter various parameters, and add data to spreadsheet after each new analysis. This is slow, boring, and error-prone.
Deep Zoom Technology
Recent technologies that use deep zooming can help us here. Steve included pointers to
- Blaise Aguera y Arcas's TED 2007 talk on Photosynth. This is "Google Maps on steroids."
- Seadragon/Photosynth showcase
- Microsoft Live Labs’ Pivot, which is what BioPivot uses.
Bioinformatics has very few compelling interfaces. Wouldn't it be cool to use this in bioinformatics? Take thousands of regions of interest of genomes and view and filter seamlessly on metadata.
Steve then did a BioPivot demo showing over 6000 peaks found in a ChIP-Seq experiment. (Editor's note: There aren't any screen shots of the demo in the slides so this is hard to imagine. My notes during the demo simply said "Holy cow.") Facets were shown on the side, and you could filter the peaks based on their facets, or metadata.
Here are a couple of examples that can be viewed and manipulated with the Silverlight plug-in that can be used on Windows PCs and Mac OS X (though not with Chrome just yet). Eventually this will work on Linux in Moonlight.
These facets came from column 9 of the GFF3 files for the genomes. BioPivot includes tools to create Pivot metadata from GFF3, and to add annotation to a GFF3 file for things like nearest gene, exons, introns, intergenic, intragenic, and TSS/TES up and down stream regions.
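Turning column 9 into facets is mostly a matter of parsing GFF3's `tag=value` attribute syntax. The sketch below is not BioPivot's converter; the function names and the example attribute tags (`nearest_gene`, `overlaps`) are assumptions chosen to resemble the annotations listed above.

```python
from urllib.parse import unquote


def parse_attributes(column9):
    """Parse a GFF3 attributes column into {tag: [values]}."""
    attributes = {}
    for pair in column9.strip().split(";"):
        if not pair:
            continue  # tolerate a trailing semicolon
        tag, _, value = pair.partition("=")
        # GFF3 percent-encodes reserved characters; commas separate multiple values.
        attributes[unquote(tag)] = [unquote(v) for v in value.split(",")]
    return attributes


def facets_from_gff3_line(line):
    """Return the facet dictionary for one GFF3 feature line."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError("expected 9 tab-separated columns")
    return parse_attributes(columns[8])


# Invented example line for a ChIP-Seq peak.
example = (
    "chr1\tpeakfinder\tbinding_site\t1000\t1500\t42.0\t+\t.\t"
    "ID=peak1;nearest_gene=geneA;overlaps=exon,promoter"
)
facets = facets_from_gff3_line(example)
print(facets)
```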
BioPivot uses Pivot, but other Zoomable User Interfaces (ZUIs) are available. OpenZoom is one such package that includes an SDK for Flash, Flex, and AIR, as well as APIs.
There are a number of ways they would like to extend BioPivot:
- RNA-Seq parsers e.g. cufflinks, DESeq
- Get feedback from the community
- What else can we do with this tech?
See the BioPivot page at CBRG for more.
CRAWL (Chado RESTful Access Web-service Layer)
- A programmatic interface for querying pathogen genomics data
Editor's note: None of these summaries do the talks justice. However, this summary is especially short of the mark. To really appreciate the talk, download the slides.
Caveat: Vaporware Alert! This talk is about work that is currently in the rapid prototypes, proof of concepts, and requirements gathering stage. Terms and conditions apply. This is by no means a standard (yet), but it might be useful to others.
Giles works at GeneDB, a pathogen database with 45 organisms as of a couple of months ago, and getting new ones all the time. GeneDB has a focus on web services and includes a lot of Chado related code. It also includes cross-organism computed data such as orthologues and domains.
Chado is a relational database schema that underlies many GMOD installations. It is capable of representing many of the general classes of data frequently encountered in modern biology such as sequence, sequence comparisons, phenotypes, genotypes, ontologies, publications, and phylogeny. It has been designed to handle complex representations of biological knowledge and should be considered one of the most sophisticated relational schemas currently available in molecular biology. The price of this capability is that the new user must spend some time becoming familiar with its fundamentals.
- A database for very deep curation
- An integrated database
- A database that is generic enough to use for any organism
GeneDB's web front end uses Hibernate mappings to Chado, DAO caches, and Lucene search. It has weekly data updates (that take over a day to run). The code is quite meaty.
GeneDB collaborates with EupathDB, an umbrella functional genomics integrative resource. Its annotation team is at Sanger, SBRI, and the University of Georgia. GeneDB also integrates and shares data from TriTrypDB and PlasmoDB.
There are several challenges here:
- Need to know what has changed
- Need to be able to get the data
- Remote annotation
- DB-Artemis via VPN
- Current setup works for power users, but need to build Rich Internet Applications to broadly enable this.
- Need to exploit our own data as well!
- Chado-complexity - SQL is hard.
The solutions to these challenges must all support rapid prototyping: new queries must be quick to implement and must run directly on top of the database (pure SQL), since there is no time to rebuild caches. Solutions also need to be lightweight and not tax the existing website.
CRAWL uses Python, Jython, CherryPy, and Ropy to address these challenges. CherryPy is a multi-threaded web app server in which it is simple to reuse controller classes in different contexts. Code is built as a library first, implementing the model and controller parts of the MVC architecture. There is a conscious choice to ignore the view part of MVC early on. This enforced a decoupling of the data layer from the view. It also lends itself to building a command line wrapper application, and to supporting unit testing.
The essential purpose of CRAWL is to mask the complexity of the SQL as much as possible, and allow you to get on with data analysis development.
EupathDB queries GeneDB daily using a "what's new" web service implemented with CRAWL. This gathers recent annotation changes which result in links back to GeneDB in EupathDB web sites.
The command line wrapper is used locally at Sanger to extract all sorts of information. This capability could be used by non-local bioinformaticians as well.
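The "library first" approach can be illustrated with a small sketch: one query function holds the SQL, and a thin command line wrapper (or, equally, a web controller) calls it. This is not CRAWL's code. It uses an in-memory SQLite table as a stand-in for Chado's `feature` table (only the `uniquename` and `timelastmodified` columns are modelled), and the function names and data are hypothetical.

```python
import argparse
import json
import sqlite3


def recent_changes(conn, since):
    """Library function: features modified on or after `since` (ISO date)."""
    rows = conn.execute(
        "SELECT uniquename, timelastmodified FROM feature "
        "WHERE timelastmodified >= ? ORDER BY timelastmodified",
        (since,),
    )
    return [{"uniquename": name, "changed": when} for name, when in rows]


def demo_connection():
    """Build a tiny stand-in database; a real deployment would connect to Chado."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE feature (uniquename TEXT, timelastmodified TEXT)")
    conn.executemany(
        "INSERT INTO feature VALUES (?, ?)",
        [("gene_old", "2010-01-05"), ("gene_new", "2010-09-01")],
    )
    return conn


def main(argv):
    """Command line wrapper: parses arguments, then delegates to the library."""
    parser = argparse.ArgumentParser(description="List recently changed features")
    parser.add_argument("--since", default="2010-08-01", help="ISO date")
    args = parser.parse_args(argv)
    print(json.dumps(recent_changes(demo_connection(), args.since), indent=2))


# Fixed arguments keep the sketch self-contained; a real script would pass
# the process arguments instead.
main(["--since", "2010-08-01"])
```

Because the SQL lives in `recent_changes` only, the same function can back a web service, a command line tool, and unit tests without duplication.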
Collaborative Interfaces and Rich Internet Applications
Rich Internet Applications (RIAs) use AJAX techniques such as data refreshing without page reloads, and autocompletes of text boxes. Google Maps and Flickr are well known examples of RIAs. These fundamentally depend on web services.
Giles then showed several screenshots from what some folks call Web-Artemis, a genome annotation editor implemented in a web browser. This has several potential uses. Genomics visualisation and community annotation are two of them. Since it's web based, annotators won’t have to use a desktop application via a VPN. This is the kind of application we are moving towards. The whole point is that people can build applications without having to use Giles. It is possible (and Giles showed an example, SNP-Mashup) to use things like Web-Artemis in a mashup, embedding it as a widget with other independently written widgets, integrating software that integrates data.
The way of CRAWL
- Built as a library first
- Deployed as
- Standalone Web services app
- Command line app
- Used for
- Collaborating with EupathDB
- Query multi-organism data sets in house without going through WS
- Building RIAs and (s)mashups
CRAWL currently uses the Ropy Python REST framework. It speaks Chado. Is it time to implement the GMOD REST interface? GeneDB’s Chado may have little differences. Any work must be tested on other databases.
Lessons from the Apache Community
There was a thought-provoking talk by Ross Gardler at BOSC this year on "Community Development at the Apache Software Foundation". There might be some lessons we can apply to the GMOD community.
- All technical discussions are made in public.
- You have to have a community before a project.
- If you contribute to a project, you get a vote.
- You can approve new projects, and volunteer to work on it.
- You can veto, but you have to say why.
- Make components generic enough so they could be used outside their domain.
- Jakarta managed all the Java projects under Apache. Switched to flat model. Jakarta no longer exists. Every project is now at the same organizational level.
- Really fine grained control of downloads.
- Early uploading of files and sharing.
- Can have user accounts.
- Chado GBrowse now supports full text searching.
- feature table gets a new column, other tables get a new column, and there's a new materialized view, with a new column.
CpG Island and STR Annotator Plugins
Two GBrowse plugins were presented that were designed by Joey Bullard, an undergraduate student at RIT and a summer student at the Laboratory of Genomic Diversity in Frederick, Maryland, under the direction of Joan Pontius. CpG islands in mammalian genomes are often an indication of a well conserved region, such as those of genes and promoter regions. Short Tandem Repeats are useful in experiments for mapping phenotypes to a genetic locus.
- Finds total number of CG dimers in a window (default size of 400 bp) and also calculates the expected number of CG dimers based on the basepair composition of the window. A "segments" glyph is used to display the CpG islands, with the height of the glyph representing the number of CG dimers, and the color representing the ratio of observed counts to expected.
- A plugin that uses the segments glyph to show Short Tandem Repeats, 2mer-5mers occurring in tandem a minimum of 5 times. The user can configure the number of tandem repeats.
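The two calculations behind these plugins can be sketched in Python. This is not the plugins' Perl code. The expected CG count is assumed here to be (number of C × number of G) / window length, a common convention that the talk summary does not spell out, and the function names are illustrative.

```python
import re


def cpg_stats(window):
    """Return (observed CG dimers, expected CG dimers, observed/expected)."""
    window = window.upper()
    if not window:
        return 0, 0.0, 0.0
    observed = window.count("CG")
    # Assumed formula: expectation from the base composition of the window.
    expected = window.count("C") * window.count("G") / len(window)
    ratio = observed / expected if expected else 0.0
    return observed, expected, ratio


def scan_cpg(sequence, window_size=400):
    """Yield (start, observed, expected, ratio) for consecutive windows."""
    last_start = max(len(sequence) - window_size, 0)
    for start in range(0, last_start + 1, window_size):
        yield (start,) + cpg_stats(sequence[start:start + window_size])


def find_strs(sequence, min_copies=5):
    """Find 2-mers to 5-mers repeated in tandem at least min_copies times.

    Returns (start, repeat unit, copy number) tuples. Note that runs of a
    single base also match, as a repeated 2-mer.
    """
    pattern = re.compile(r"([ACGT]{2,5}?)\1{%d,}" % (min_copies - 1))
    hits = []
    for match in pattern.finditer(sequence.upper()):
        unit = match.group(1)
        hits.append((match.start(), unit, len(match.group(0)) // len(unit)))
    return hits


print(cpg_stats("CGCGAATT"))
print(find_strs("GGACACACACACTT"))
```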
Also presented was a CGI script that generates and displays, in real time, the scatterplot of a GWA study SNP. The CGI script uses an offset file that has, for each SNP, the offset in the Affymetrix or Illumina file of the genotype calls and fluorescence values of that SNP. The CGI takes the SNP id as an argument, looks up the offset, retrieves the data, and generates the scatterplot as a PNG file.
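The offset-file idea is what makes the lookup fast: instead of scanning a large genotype file, the script jumps straight to the right byte position. A minimal sketch, assuming a simple tab-separated file with one SNP per line (real Affymetrix or Illumina files are laid out differently) and leaving out the plotting step; the SNP ids and values are invented.

```python
import io


def build_offset_index(handle):
    """Map each SNP id to the byte offset of its line (built once, offline)."""
    index = {}
    handle.seek(0)
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        snp_id = line.split(b"\t", 1)[0].decode("ascii")
        index[snp_id] = offset
    return index


def fetch_snp_record(handle, index, snp_id):
    """Seek directly to one SNP's line and return (id, remaining fields)."""
    handle.seek(index[snp_id])
    fields = handle.readline().decode("ascii").rstrip("\n").split("\t")
    return fields[0], fields[1:]


# In-memory stand-in for a genotype file: id, call, two intensity values.
data = io.BytesIO(b"rs0001\tAA\t0.91\t0.08\nrs0002\tAB\t0.52\t0.49\n")
index = build_offset_index(data)
print(index)
print(fetch_snp_record(data, index, "rs0002"))
```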
The CGI can be used in GBrowse to display the scatterplots for individual SNPs as a mouseover. By embedding the call to the CGI inside image tags (<img>), and having the <img> tags inside a balloon click, the scatterplot can be displayed to the GBrowse display instantly (less than a second) and without generating a temporary flat file.
GMOD Sustainability and Organization
The GMOD project has no central organization per se, and has no legal status whatsoever. The GMOD core, such as it is, is not particularly well funded. It currently consists of Scott Cain, the GMOD Project Coordinator, and Dave Clements, the GMOD Help Desk. Scott's funding comes entirely from WormBase, while Dave is only 60% funded (via the JBrowse/GBrowse grant) to work on the Help Desk. This is not particularly good ground to stand on for the long term.
NSF is moving towards funding software infrastructure projects such as GMOD:
- Software Infrastructure for Sustained Innovation (SI2), specifically the Scientific Software Integration (SSI) and Scientific Software Innovation Institutes (S2I2) awards.
What can we do to take advantage of these opportunities, and what other opportunities are there?
Anup Mahurkar suggested also pursuing funding from NIGMS. Joan Pontius wondered if there might be support for a Google Genomes project.
Steve Taylor pointed out that in Britain, the BBSRC has a program for funding this sort of thing. Being the BBSRC, it must have some sort of British focus. (Subsequent followup also identified BILAT-USA and Link2US as possible sources of general European funding.)
Josh Goodman suggested having a hosted GMOD service as a possible way to support this. This comes up every year and would be a possible way to initially get funding and then generate revenue from support fees.
We also discussed if GMOD events should strive to become larger, and if the GMOD organization should aim to become more "official". Daniel Renfro stressed that going for larger events should not be done at the expense of the relatively small (~40-60 people), informal and networking-intensive GMOD Meetings that we currently have. Scott Cain posed the question of becoming a sister organization to the International Society for Biocuration. That organization is the natural complement to GMOD, and they have (larger) annual meetings and are a legal entity.
Chris Hemmerich raised the possibility of having more frequent hackathons. This question was spawned by the upcoming GMOD Evo Hackathon (November 2010). That hackathon is a unique event, given its extensive organizational and financial support from NESCent. The 2007 Hackathon and future hackathons are unlikely to have that degree of support. That said, GMOD is quite willing to organize future hackathons on particular topics, given that they need to be done with probably minimal organizational support and no financial support.
Jerven Bolleman suggested teaming up with user interface specialists and setting up testing suites and doing integration testing. Giles Velarde suggested education and training as a core deliverable. These are all areas that span GMOD and would benefit from more central support, and the need for training funds is increasing, given that funds for this were greatly reduced in the upcoming Help Desk grant.
Finally, Jerven suggested extending GMOD past its core genomic strength as another growth area.
Participants
|Dave Clements||NESCent, GMOD||http://nescent.org http://gmod.org|
|Josh Goodman||FlyBase - Indiana University||http://flybase.org|
|Richard Smith||Cambridge University||http://www.intermine.org|
|Anup Mahurkar||Institute for Genome Sciences University of Maryland School of Medicine|
|Joan Pontius||SAIC-NCI-FREDERICK Laboratory of Genomic Diversity||http://lgd.abcc.ncifcrf.gov/cgi-bin/gbrowse/cat/|
|Christelle Robert||The Roslin Institute The University of Edinburgh|
|Matthew Eldridge||Cancer Research UK - Cambridge Research Institute|
|Fengyuan Hu||Department of Genetics, University of Cambridge|
|Daniel Renfro||EcoliWiki, SubtilisWiki, Hu lab - Texas A&M University||EcoliWiki, SubtilisWiki, GONUTS|
|Ellen Adlem||Cambridge University Cambridge Institute of Medical Research||http://www.t1dbase.org|
|Kerstin Koch||KWS Saat AG Bioinformatics Grimsehlstr.|
|Oliver Burren||Cambridge University||http://www.t1dbase.org|
|Chris Jiggins||University of Cambridge||http://heliconius.zoo.cam.ac.uk/|
|Jason Swedlow||Wellcome Trust Centre for Gene Regulation and Expression, University of Dundee, The Open Microscopy Environment (OME)||http://gre.lifesci.dundee.ac.uk/staff/jason_swedlow.html, http://www.openmicroscopy.org/|
|Dave Beare||Cancer Genome Project, Wellcome Trust Sanger Institute||http://www.sanger.ac.uk/research/projects/cancergenome.html|
|Seth Redmond||Imperial College / Vectorbase|
|Emmanuel Quevillon||Institut Pasteur||http://www.pasteur.fr/ip/easysite/go/03b-00000m-0q8/recherche/logiciels-et-banques-de-donnees|
|Bob MacCallum||VectorBase Imperial College London||http://www.vectorbase.org|
|Ewan Mollison||Tun Abdul Razak Research Centre, Hertford||http://www.tarrc.co.uk|
|Jen Harrow||Wellcome Trust Sanger Institute|
|Gos Micklem||University of Cambridge||http://www.sysbiol.cam.ac.uk/index.php?page=dr-gos-micklem|
|Malcolm Hinsley||Wellcome Trust Sanger Institute|
|Gemma Barson||Wellcome Trust Sanger Institute||http://www.sanger.ac.uk/|
|Brett Whitty||Michigan State University||http://buell-lab.plantbiology.msu.edu, http://solanaceae.plantbiology.msu.edu, http://potatogenome.net|
|Morris Swertz||Genomics Coordination Center, University Medical Center Groningen EMBL - European Bioinformatics Institute||http://www.molgenis.org|
|Jerven Bolleman||UniProt Swiss-Prot|
|Alex Kalderimis||InterMine, Cambridge University||http://www.intermine.org, http://www.flymine.org|
|Oksana Riba Grognuz||Swiss Institute of Bioinformatics (SIB) Department of Ecology and Evolution, University of Lausanne|
|Dr Helen Imogen Field||FlyBase Dept Genetics University of Cambridge||http://www.gen.cam.ac.uk/research/flybase.html|
|Kim Rutherford||Cambridge Systems Biology Centre||http://www.pombase.org/|
|Robert Wilson||National Institute for Medical Research, London|
|Gerd Anders||Public research institute: Max-Delbrueck-Centrum Berlin (MDC), Researcher and database developer||http://www.mdc-berlin.de/en/research/core_facilities/cf_massspectromety_bimsb/teammember/index.html http://www.mdc-berlin.de/en/research/core_facilities/cf_bioinformatic/teammember/index.html|
|Joeri van der Velde||University of Groningen, GBIC UMGC, dept. of Genetics Genomics Coordination Center|
|Jonathan Warren||The Sanger Institute||http://www.dasregistry.org|
|Stephen Taylor||CBRG, Oxford University||http://www.cbrg.ox.ac.uk/|
|Bruno Aranda||EMBL-EBI||http://www.ebi.ac.uk/intact, http://psicquic.googlecode.com|
|Mahmut Uludag||European Bioinformatics Institute|
|Giles Velarde||The Sanger Centre||http://www.genedb.org, http://www.sanger.ac.uk|
|Andy Jenkinson||European Bioinformatics Institute|
|Kevin Howe||Wellcome Trust Sanger Institute|
Logistics
This meeting was held in the Biffen Lecture Theatre, in the Department of Genetics on the University of Cambridge campus.
Thanks to Ian Clark, the Biffen Lecture Theatre had wireless. Dave Judge requested accounts for all attendees for the duration of GMOD Europe 2010.
The Biffen Lecture Theatre did not have power outlets throughout the room. To help us through the days, Gos Micklem secured a 15-socket extension strip which was placed at the back of the room.
- Transportation and Lodging
The GMOD Meeting had a registration fee (£50 early, £65 late) to cover catered lunches, coffee/tea breaks, and other expenses.
Sponsor: Cambridge Computational Biology Institute
The September 2010 GMOD Meeting was sponsored by the Cambridge Computational Biology Institute, which hosted the meeting and is also the home of InterMine. The CCBI is "set up to bring together the unique strengths of Cambridge in medicine, biology, mathematics and the physical sciences. Its aim is to create a centre of excellence in research and teaching and to promote collaborations both within the Cambridge area and beyond."
Please thank Gos Micklem, Shelley Lawson, and Richard Smith for hosting the event. We could not have done this without their support, effort and time.
Feedback
Attendees were asked to provide feedback at the end of the meeting.
Q: Would you recommend GMOD meetings to others?
- if they are already aware of most of the tools out there - or if more introductory sessions are to be provided.
Q: Please rate the meeting(s) using the following scale: 1 (not at all) to 3 (reasonably) to 5 (exceptionally).
|Question||1||2||3||4||5|
|How useful was the meeting?||0%||6%||13%||63%||19%|
|Was the meeting well run and organized?||0%||0%||25%||25%||50%|
Q: Was the meeting what you expected?
- Smaller and less formal than expected - but have usually attended biology meetings.
- More or less. I was expecting a bigger turnout from experienced UK GMOD developers. This group was mostly fairly new to GMOD, which is good to see from another point of view.
- Almost yes, was attended more around gmod tools dev and future dev
- No, better more interesting
- Every GMOD meeting is different. It was very useful to talk with the attendees from the big European facilities.
- The meeting was not quite what I was expecting in that I would have liked more introductory sessions before the specific jargons of each talk - but having said that I learnt a lot, so I'm not complaining ;-)
Q: Which presentations and sessions at this meeting were the most useful or interesting?
- All were informative
- Jason Swedlow was an inspiration! it is good to learn about collaborative (successful) efforts.
- DAS, all the web service talks, GBrowse updates and InterMine.
- GMOD and PSICQUIC, GMOD RPC (aka REST API), CRAWL
- The State of GMOD, GMOD RPC (aka REST API), BioPivot: Applying Microsoft Live Labs Pivot to Problems in Bioinformatics
- Keynote: The Open Microscopy Environment
- They were all pretty good.
- GMOD RPC, InterMine, Literature Curation in GMOD, Update from the Help Desk, The State of GMOD
- DAS, InterMine
- the presentations that included demo's and code
- Update from the Help Desk
- The keynote was extremely interesting. The presentations on web services were useful and I hope to see work continue on this. Finally, the talks on visualization and web interfaces were useful.
- the keynote speakers
- GBrowse talks + BioPivot + InterMine
Q: Do you have suggestions for improving GMOD meetings in the future?
- Cambridgshire is a good location for these due to the large bioinformatics community - lecture theatre was ok - food not so good.
- It's not your fault, but the savoury food was a bit nasty - it had been out of the fridge for an hour, but it would have been better if the 'hot' items had been out of the oven for an hour (or less).
- I liked the idea of engaging in a community. It would be good to have a central location, publication as it took this meeting to get my head around what GMOD was about (it is not really obvious from the website). It would be a nice service for GMOD to use BioPivot to create an OSI bioinformatics 'one-stop-shop': using as images the output screen shots. Sorting perhaps by data type (stuff for microarrays, stuff to annotate genomes, stuff to integrate annotation).
- Great job as always!
- I liked the open discussion format of the satellite meetings.
- It's a very friendly and open forum for discussing work.
- Thanks for the meeting.
- As always, the mix of GMOD developers and users as well as prospective users and external developers makes for a lot of interesting discussion.
- Please can you have at least the next meeting venue decided each time: this will determine who can come. If you had this decided before the meeting you might then also be able to set the agenda - and encourage the right audience to show up. eg [a GMOD veteran] felt things were too low-level this time, but for me as a newbie the agenda was a brilliant introduction to the community and the tools that are around, and how they can enhance bioinformatics services... So advance planning would really keep the ship steering well. Thanks, I enjoyed meeting everyone and learned a lot - and hope to contribute
The Next Meeting
The next GMOD Meeting will be held in March 2011 in Durham, North Carolina at NESCent, as part of GMOD Americas 2011. This event will also include Satellite Meetings and the 2011 GMOD Spring Training GMOD School.