Post Reference Genome Tools
September 2010 GMOD Meeting
15 September 2010
How are we going to visualize and exploit (or even cope with) the world three years from now, when small labs may be able to fully sequence 500 individuals or species (or more) in a month? How can we visualize and link together 500, 1000, or 10,000 genomes? Many existing tools assume a reference genome. Will a reference make sense in the future, or will it hold us back?
|Dave Clements||NESCent, GMOD||http://nescent.org http://gmod.org|
|Fengyuan Hu||Department of Genetics, University of Cambridge|
|Ellen Adlem||Cambridge University Cambridge Institute of Medical Research||http://www.t1dbase.org|
|Seth Redmond||Imperial College / VectorBase|
|Jerven Bolleman||UniProt Swiss-Prot|
|Oksana Riba Grognuz||Swiss Institute of Bioinformatics (SIB) Department of Ecology and Evolution, University of Lausanne|
|Kim Rutherford||Cambridge Systems Biology Centre||http://www.pombase.org/|
|Stephen Taylor||CBRG, Oxford University||http://www.cbrg.ox.ac.uk/|
|Joan Pontius||SAIC-NCI-FREDERICK Laboratory of Genomic Diversity||http://lgd.abcc.ncifcrf.gov/cgi-bin/gbrowse/cat/|
|Don Gilbert||Indiana University (Don participated in a key pre-meeting discussion)||http://arthropods.eugenes.org/|
The world is shifting away from the concept of a single reference genome in a number of ways:
- For many applications we are moving to a network model ("islands of stability in a sea of variation") where navigation and thinking are likely to shift from today's linear, top-down paradigm, to a bottom-up, networked view. This will require tools that are significantly different from what we have today.
- Some applications will shift from the concept of a single static reference to a dynamically selected reference: "This is my genome of interest, show me how it relates to other information." This will not be as radical a cognitive change as the shift to a networked model. Tools that operate on this model may also have visualizations that are recognizable from today's tools, but the underlying algorithms and data structures will be quite different.
- Cutting across all areas will be the need to support data analysis and multi-dimensional querying embedded in tools of all kinds. The datasets are simply getting too big and too varied to not have sophisticated tools for narrowing and analyzing them.
The discussion went on for over three hours. It is divided into two sections here:
- How does our conceptual model of the information change in this new world?
- Towards Solutions: some ideas on how we might tackle this new world with software.
How does our conceptual model change in this new world?
Non Linear Thinking
In the current model, the reference chromosome is the dominant frame of reference. Many projects may have ESTs or contigs, but the hope is usually to eventually map them to a reference or use them to build a reference. This top-down, global frame of reference has many advantages, not the least of which is that it is easy to understand.
However, in a post-reference world, it may be more useful to shift to a bottom-up view of genomic data, where regions of interest are identified and then we investigate surrounding regions. As we zoom out, the view around the edges starts to reflect that consensus becomes more tenuous as the region under consideration expands. One participant described this as "a graph with islands of stability." This "islands of stability in a sea of variation" model is closer to actual biology, and having tools that reflect it may lead to better biology.
This model is not a radical departure from what biologists often do now. Genomes are often navigated via BLAST or by typing in a gene or EST name, going to the matching region, and then zooming out. What is different is that there may not be a canonical version of the matching region, but rather a weighted statistical view of that region across the population, and that as you zoom out, you will start to see a non-linear view of the matching regions.
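One way to picture a "weighted statistical view" of a region is a per-position frequency profile across individuals. The sketch below is only an illustration under strong assumptions: the sequences are already aligned to equal length, and the function name `column_frequencies` and the toy data are hypothetical, not part of any GMOD tool.

```python
from collections import Counter


def column_frequencies(aligned_seqs):
    """For each alignment column, return {base: fraction of individuals}.

    Assumes the sequences are pre-aligned and of equal length.
    """
    if not aligned_seqs:
        return []
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("sequences must be pre-aligned to equal length")
    n = len(aligned_seqs)
    profile = []
    for i in range(length):
        # Count how many individuals carry each base at column i.
        counts = Counter(s[i] for s in aligned_seqs)
        profile.append({base: c / n for base, c in counts.items()})
    return profile


# Toy region from four hypothetical individuals.
region = ["ACGT", "ACGA", "ACTT", "ACGT"]
profile = column_frequencies(region)
for position, freqs in enumerate(profile):
    print(position, freqs)
```

Columns where one base has a frequency of 1.0 correspond to the stable parts of the region; columns with split frequencies are where no single canonical version exists.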
There is already some work being done in microbial research on pangenomes. Pangenomes describe the genetic complement of a set of organisms, often a species, rather than an individual. They describe both the core set of genetic material that exists in every studied individual as well as the genetic material that exists in various subsets of those individuals. Subsets can be formed by external factors (geography), or internal factors (has this haplotype). There are many issues with pangenomes that also exist in a post-reference paradigm: How do you store and visualize commonality? How about differences?
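The core/subset split described for pangenomes can be sketched with plain set operations. This is a minimal illustration, assuming each individual is represented by a set of gene-family identifiers; the function name `pangenome_partition` and the strain data are hypothetical.

```python
def pangenome_partition(gene_sets):
    """Split a pangenome into core genes and genes found in subsets.

    gene_sets: dict mapping individual id -> set of gene family ids.
    Returns (core, accessory), where accessory maps each subset of
    individuals (a frozenset) to the genes found in exactly that subset.
    """
    all_sets = list(gene_sets.values())
    pan = set().union(*all_sets)
    core = set.intersection(*all_sets) if all_sets else set()
    accessory = {}
    for gene in pan - core:
        carriers = frozenset(
            ind for ind, genes in gene_sets.items() if gene in genes
        )
        accessory.setdefault(carriers, set()).add(gene)
    return core, accessory


# Three hypothetical strains with made-up gene family ids.
strains = {
    "s1": {"g1", "g2", "g3"},
    "s2": {"g1", "g2", "g4"},
    "s3": {"g1", "g3", "g4"},
}
core, accessory = pangenome_partition(strains)
print("core:", sorted(core))
for carriers, genes in accessory.items():
    print(sorted(carriers), "->", sorted(genes))
```

Grouping accessory genes by the exact set of carriers is one design choice among several; grouping by an external factor such as geography would need extra metadata per individual.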
How much does order matter? In this new world view, chromosomes, with their complete ordering of everything on them, are deemphasized, and regions where commonalities (or uniqueness) occur are emphasized in their place. Information isn't lost: It should still be possible to reconstruct a linear ordering for any individual for which you have enough data. However, this type of amalgamation carries the danger of drawing conclusions about abstract genomes that would never actually exist in nature.
Another option is not to abandon the concept of a reference, but rather to make it dynamic: Users can specify "this is my genome of interest; show me how it relates to other information." In this paradigm the frame of reference changes to align with whichever individual the user chooses. With this approach, reference-based tools that exist now could still be used (subject to computational tractability), but with quite different preprocessing.
How do these new concepts affect how we will visualize information?
We have lots of genomes - can we just stack them up? One can imagine a Sybil-like display where the "reference" genome is selected by the user and then clusters of other genomes are dynamically ordered in the display, based on similarity to the selected reference. This could work for a whole genome view, although summarization would be required once the number of genomes exceeds screen resolution. The interface could also support "please sort genomes by similarity to this region." It would probably not be a straight sort of individual genomes, but rather clusters of similarity.
User-driven reference selection, with other genomes sorted relative to it, could be scaled all the way down to the sequence level. The specific visualization and the metric of similarity might change as you drill down, but the basic concept could stay the same.
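As a rough illustration of "sort genomes by similarity to this reference," the sketch below ranks sequences by Jaccard similarity of their k-mer sets. The metric, the function names, and the toy sequences are all assumptions chosen for brevity; the notes do not prescribe a metric, and this is the simpler straight sort of individuals rather than the clustered ordering suggested above.

```python
def kmers(seq, k=3):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def rank_by_similarity(genomes, reference_id, k=3):
    """Order all other genomes by k-mer similarity to the chosen reference."""
    ref = kmers(genomes[reference_id], k)
    scored = [
        (jaccard(ref, kmers(seq, k)), name)
        for name, seq in genomes.items()
        if name != reference_id
    ]
    # Most similar first; ties broken by name for a stable display order.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [(name, score) for score, name in scored]


# Hypothetical toy "genomes"; the user picks "ref" as the reference.
genomes = {
    "ref": "ACGTACGT",
    "a": "ACGTACGA",
    "b": "TTTTTTTT",
    "c": "ACGTACGT",
}
ranked = rank_by_similarity(genomes, "ref")
print(ranked)
```

Changing `reference_id` reorders the whole display, which is the core of the dynamic-reference idea; the same function could be applied to a subregion by slicing the sequences first.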
This is an inherently linear approach. The data taken as a whole, however, is not linear - it is a graph from which any individual's linear genome can be reconstructed by following connections through islands of common sequence (edges to nodes in graph parlance) that are shared with other individuals, and through islands that are unique to that individual.
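The graph view can be made concrete with a small data structure. In this sketch, which is an assumption about one possible representation rather than anything agreed at the meeting, islands of sequence are nodes, adjacencies between islands are edges, and each individual is a path through the graph. The class name `SequenceGraph` and all identifiers are hypothetical.

```python
class SequenceGraph:
    """Toy model: nodes are sequence islands, paths are individuals."""

    def __init__(self):
        self.segments = {}  # node id -> sequence of that island
        self.paths = {}     # individual id -> ordered list of node ids

    def add_segment(self, node_id, sequence):
        self.segments[node_id] = sequence

    def add_path(self, individual, node_ids):
        missing = [n for n in node_ids if n not in self.segments]
        if missing:
            raise KeyError(f"unknown segments: {missing}")
        self.paths[individual] = list(node_ids)

    def reconstruct(self, individual):
        """Rebuild one individual's linear sequence by walking its path."""
        return "".join(self.segments[n] for n in self.paths[individual])

    def edges(self):
        """All adjacencies (consecutive node pairs) used by any individual."""
        return {(a, b) for p in self.paths.values() for a, b in zip(p, p[1:])}

    def shared_by_all(self):
        """Islands that every individual passes through."""
        node_sets = [set(p) for p in self.paths.values()]
        return set.intersection(*node_sets) if node_sets else set()

    def unique_to(self, individual):
        """Islands that only this individual passes through."""
        others = set()
        for name, p in self.paths.items():
            if name != individual:
                others.update(p)
        return set(self.paths[individual]) - others


graph = SequenceGraph()
for node_id, seq in [("n1", "ACGT"), ("n2", "GG"), ("n3", "TT"), ("n4", "CCA")]:
    graph.add_segment(node_id, seq)
graph.add_path("alice", ["n1", "n2", "n4"])
graph.add_path("bob", ["n1", "n3", "n4"])

print(graph.reconstruct("alice"))
print(graph.reconstruct("bob"))
print(sorted(graph.shared_by_all()), sorted(graph.unique_to("alice")))
```

Storing each individual as a path keeps the linear ordering recoverable, while the node sets answer population-level questions without reference to any one genome.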
Future visualizations may take advantage of this graph to show relationships between different sequence regions. It may be that linear approaches will remain popular for questions that start with "Tell me what is similar to this," and graph based approaches for group or population based queries.
We also need tools for specifying what subsets of individuals and regions we want to see/use. Current tools use sequence similarity (e.g., BLAST) and nomenclature/orthology to select subregions. Future tools should also support selection on a wide variety of facets in the data. For example, "show me regions that have these n characteristics from the m detected characteristics of the current region." Selection can be made arbitrarily complex by seamless integration with analysis tools such as Galaxy, BioMart, and InterMine.
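The "n characteristics from the m detected characteristics" query can be sketched as a set-overlap filter. This is a minimal illustration, assuming each region has already been annotated with a set of facet labels; the function name, region ids, and facet names are hypothetical, and a real tool would more likely delegate this to a query system such as those named above.

```python
def regions_sharing_facets(regions, query_id, min_shared):
    """Find regions sharing at least min_shared facets with the query region.

    regions: dict mapping region id -> set of facet labels.
    Returns a list of (region id, sorted shared facets), best match first.
    """
    query = regions[query_id]
    hits = []
    for region_id, facets in regions.items():
        if region_id == query_id:
            continue
        shared = query & facets
        if len(shared) >= min_shared:
            hits.append((region_id, sorted(shared)))
    # Most shared facets first; ties broken by region id.
    hits.sort(key=lambda h: (-len(h[1]), h[0]))
    return hits


# Hypothetical regions and facet labels.
regions = {
    "r1": {"kinase", "conserved", "high_gc", "expressed_in_liver"},
    "r2": {"kinase", "conserved", "low_gc"},
    "r3": {"high_gc"},
    "r4": {"kinase", "conserved", "high_gc"},
}
hits = regions_sharing_facets(regions, "r1", min_shared=2)
print(hits)
```

Here `r1` has m = 4 characteristics and the query asks for regions matching at least n = 2 of them.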
Future visualizations are not constrained by current ones, but we can learn from them. The group also discussed the UCSC Cancer Genome Browser, and walked through the video tutorial. There was much discussion on gene sets, and on viewing regions that have many characteristics in common. (Might be able to use GBrowse_syn code to show related regions.) The UCSC Cancer Genome Browser has a very useful option to sort its "wiggleplots" according to some aspect of the sample that each row represents, so that any trends in the heatplot (for example, in expression or chromosome rearrangements) can be more easily seen. Imagine a TopoView glyph with the ability to reorder data. GBrowse might be able to do something like that with subtracks.
One participant suggested using a Prezi style of navigation to move between related regions. Another suggested gaining inspiration from current network tools such as Cytoscape or Pathway Tools. Mauve, a comparative genomics browser, was also mentioned, as was a "Flickr for genomes." Ortholog databases such as OMA, eggNOG, ergononome, HOVERGEN, InParanoid, OrthoDB, and PhylomeDB can also provide guidance. An Amazon-style "people interested in this region were also interested in these regions" model could also be adopted.
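The "people interested in this region were also interested in these regions" idea amounts to co-occurrence counting over browsing sessions. The sketch below is a bare-bones illustration; the session log format, the region naming, and the function name `also_viewed` are assumptions made for this example only.

```python
from collections import Counter


def also_viewed(sessions, region, top_n=3):
    """Rank regions that co-occur with `region` in the same session.

    sessions: list of lists of region ids viewed together.
    Returns up to top_n (region id, session count) pairs.
    """
    counts = Counter()
    for viewed in sessions:
        if region in viewed:
            # Count each co-viewed region once per session.
            counts.update(r for r in set(viewed) if r != region)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]


# Hypothetical browsing sessions.
sessions = [
    ["chr2:100-200", "chr5:10-90"],
    ["chr2:100-200", "chr5:10-90", "chrX:1-50"],
    ["chr2:100-200", "chrX:1-50"],
    ["chr7:5-15", "chr5:10-90"],
]
suggestions = also_viewed(sessions, "chr2:100-200")
print(suggestions)
```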
Key ideas here are network representation and navigation.
Clade databases can also provide some insights here. An excerpt from an email from Don Gilbert:
Part of the answer likely involves clade genomics, i.e. don't peg your new genome to one reference, but to a consensus of several related. We do that to some effect w/ the insects / arthropods.
Drosophila melanogaster turns out to be a poor reference genome for non-dipteran insects, as it has diverged quite a bit. Its extensive functional, expt. literature is critical to understanding any genomes that are related. Biologists need to be careful about the exceptions and differences, and be able to identify where a single species reference helps and where it may not, which a clade database can help with.
This shows my approach w/ euGenes/ Arthropod genomes.
The basics are similar to Ensembl's and others in collecting related species genes, doing orthology analysis and grouping genes with some consensus annotation, to serve as a reference for new genomes. The expanding collection of plant genomes are showing I think where one can go with many related genomes, drawing on them all instead of just Arabidopsis as a reference. USDA seems to be shepherding the plant databases to work in sync. Gramene and other plant databases may be good examples for post-reference genome informatics.