GSoC/IDEA 9

From GMOD
Revision as of 08:56, 24 March 2011 by DanBolser (Talk | contribs)

Page for project discussion, ideas and some other third thing.


Similar projects

Correspondence

I found there was a lot of 'invisible' information floating round by email. Here it is for reference:

Email 01

I suggest you start work immediately by researching the common assembly format, ACE, and its implementation in BioPerl (recently overhauled by Rob Buels). We then need a preliminary appraisal of whether this format is suitable for 'community assembly' based on NGS technology, covering its limitations, possible improvements, etc.

Do you know of any other genome assembly formats? I guess AGP is another... there are currently no BioPerl modules for handling AGP, but I have some AGP code that could be added to BioPerl, and I know Rob Buels does too.

A very concrete sub-project would be to refactor the assembly manipulation tools available in BioPerl. I've CC'ed Chris Fields for his input on that.

Also, we need to work out how to integrate structured data with inherently 'flat' version control such as Git.

P.S. You may like

Reply

BioPerl supports the following assembly formats:

-http://www.bioperl.org/wiki/Module:Bio::Assembly::IO



Email 02

I should stress
that, if you decided to work on this project, I wouldn't be able to
provide much time in the form of mentoring. Sorry about that, but I'm
just being realistic. The main requirement, therefore, would be that
you could work independently with only 'high level' guidance. This
isn't a challenge, it's for you to decide what is best for you.


> I am writing to express my interest in the GSOC 2011 project of GMOD. I am
> particularly interested in the Idea 9: Develop collaborative genome assembly
> tools and databases. I would rather appreciate if you could introduce more
> about this idea. Here are some details I would like to know:

> 1. As far as I know, current sequence databases, such as dbSNP and UCSC
> Genome Browser, do have central databases and versioning (although I do not
> know how they implement the versioning). Why we need another one?

Well... a big organization putting a version on a certain database
release is very different from what I'm proposing, which is a version
control system specifically for genome assembly. The human genome
build version is handled by a large centralised organization with very
well laid out policy and guidelines and they are working on a very
important genome. Here there is no problem, and we all can respect
their authority and use their version codes.

As sequencing gets cheaper, however, specialists get more disparate,
and less and less investment goes into generating the data. In such
cases, it behoves us to have a good way to coordinate community
activities behind the best 'working draft' assembly. There are no
organizations ready to step in, no money to spend, and no policy in
place.

I'm thinking of something like Chado (which is for genome annotation)
for genome assembly.


> 2. From my understanding, this tool is designed to be used by a community of
> researchers, who may have access to certain genome assembly. This community
> contributes to the completion and analysis on the same genome. Is it right?

Yes. In fact Chado (and GMOD) does a very good job of housing genome
analysis results (such as new metabolomics, proteomics or
transcriptomics experiments). However, each genome sits behind an
institutional facade. I'm thinking of a community based approach (like
Wikipedia).


> So it is not designed to integrate genome assembly from various sources? Am
> I right?

Sounds right. Rather it hosts the results of such integration. I'm not
suggesting to implement an assembler, just a 'wiki-style' tool to
store and edit an assembly.


> 3. From my background, I am strong in Java, as well as the object oriented
> principles. I have also...

<snip, to protect the innocent>

> However, I have never used Git, Chado and Catalyst. I am
> confident that I can learn very fast based on my background and capability.
> But I am a little bit concerned that whether these skills are compulsory or
> not.

Certainly nothing is compulsory. Unlike some of the other ideas, this
idea is very speculative and open ended. I'm relying on a high level
of competence and independence.


> 4. And Finally, how is the current progress of this project? Have you
> designed any model or architecture for the tool? Where should I start if I
> would like to learn more about the implementation details?

Unlike the other ideas, this is currently a pure 'vapour-ware'
project. However, I think there are several core components in place
that you can begin to research. Firstly, the Chado database system
could form the core of an 'assembly data model'. Secondly, many of the
Bio* projects (BioPerl, BioJava, BioEtc) have 'assembly' objects.
Learning about how to handle assemblies in these languages will surely
be important. Thirdly, I think we can build on the MediaWiki,
Semantic-MediaWiki and Semantic Forms tools for creating the 'wiki'
component.

One distinct sub project would be to focus on an 'annotation wiki',
which I think could serve the community very well, if done in a
sufficiently generic and useful way (i.e. not just another 'me too'
project).

Email 03

OK, think about a genome assembly... how is it done? How may it be edited and improved over time? Think about biologists with specific interests in specific regions of the genome doing work to refine those regions. How will they want to contribute their work back to the 'community genome'?

Think about (and research) what a genome assembly is, how it is built, the important information it carries, and the kinds of ways that people may want to edit it.

This comes down to thinking about data structures, databases and data models, and about algorithms for editing those data structures.

So far so good... Now we want to layer provenance and version control on top, creating a community maintained structured assembly database.

I think this is a very ambitious proposal, so it will need a huge amount of work to get even close to creating something useful. I don't want to put you off, but we need to be realistic!

IRC LOG

Edited for clarity

19:36 <@rbuels> mmlevitt: chado and chadoxml probably isn't something that would be good for 
                storing and versioning an assembly
19:37 <@rbuels> mmlevitt: xml isn't really a good format for holding *data*, it's good for 
                *documents* that have kind of a looser structure
19:37 <@rbuels> mmlevitt: dbolser and i have discussed this version-control-for-genomes thing 
                before, i think it's a good idea
19:38 <@rbuels> mmlevitt: and git is probably the right version control system to be looking at
19:38 <@rbuels> mmlevitt: as for file formats ... it depends.
19:39  * rbuels thinks
19:40 <@rbuels> i dunno.  there is a lot of variation in how to represent genome assemblies.
19:41 <@rbuels> you would need to have something that almost anything could be stored as ...
19:42 <@rbuels> NCBI, for file formats, seems to be standardizing on AGP for representing the 
                finished assembly of how the contigs fit together
19:42 <@rbuels> as for the contigs themselves, i'm not sure if they want to store how the 
                contigs are assembled from reads
19:43 <@rbuels> lots of file formats for storing that kind of thing, .ace might be the most 
                common
19:44 <@rbuels> AGP is probably the emerging standard for representing how the 
                contigs fit together into the finished assembly
19:44 <@rbuels> mmlevitt: and the contigs themselves, if you are looking at their 
                sub-assemblies, .ace (ACE) is a very common format, but there are lots of other 
                common ones

19:46 <@rbuels> if you're going to make an assembly-versioning system, you probably don't want 
                to use a relational database like chado
19:47 <@rbuels> and certainly not a highly-normalized, super-general relational database 
                schema, which chado also is.
19:47 <@rbuels> performance is going to be a big deal here ... your typical assembly runs into 
                many GB of data.
19:48 <@rbuels> at least eukaryotic ones do
19:48 <@rbuels> i guess if you're doing bacteria genomes and such, it's smaller
19:48 <@rbuels> but still big
19:49 <@rbuels> the key to all of this, as i see it, is *how do you represent and visualize the 
                differences between assemblies*
19:49 <@rbuels> and how do you do that quickly and easily
19:49 <@rbuels> that's a tough one.
19:51 <@rbuels> cause with git, you can make all kinds of diffs between different places in the 
                history
19:51 <@rbuels> git's super powerful for that
19:52 <@rbuels> a colorized text diff like git makes is great for telling differences in source 
                code
19:52 <@rbuels> but an assembly is not source code.
19:53 <@rbuels> you can certainly store the assembly in git, but git isn't going to help that 
                much in visualizing the differences between the assemblies.
19:53 <@rbuels> between the versions of the assemblies, that is.


05:03 < dbolser> I've been using dnadiff within MUMmer to compare assemblies. It's fast, but it 
                 fails at visualizing the differences clearly.
05:04 < dbolser> for storing the reads / contigs, I'd recommend BAM
05:04 < dbolser> BAM is web-scale, ace is pre-2007
05:05 < dbolser> Seriously though, the really nice thing about BAM is that you can stream it 
                 over an FTP / HTTP connection very efficiently using its index

14:14 < dbolser> I only just got mmlevitt's idea ... BioForge would be like SourceForge, but 
                 instead of hosting source code projects, it'd host genome assemblies! Pretty 
                 nice idea sir!
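
rbuels's point in the log above, that a line-based text diff is not the right view of a change between assembly versions, could be sketched roughly as follows. This is only an illustration in Python, not anything discussed in the channel: the names `Placement` and `diff_assemblies` are hypothetical, the data is a toy example, and the sketch assumes every contig is placed at most once per assembly. It compares where each contig sits in two versions and reports changes in assembly terms (added, removed, moved, flipped).

```python
from typing import Dict, List, NamedTuple


class Placement(NamedTuple):
    """Where one contig sits in an assembly (hypothetical record type)."""
    obj: str        # chromosome or scaffold name
    start: int      # 1-based start on the object
    end: int        # 1-based end on the object
    component: str  # contig identifier
    strand: str     # "+" or "-"


def diff_assemblies(old: List[Placement],
                    new: List[Placement]) -> Dict[str, List[str]]:
    """Report contig-level differences between two assembly versions.

    Assumes each contig identifier appears at most once per version.
    """
    old_by_id = {p.component: p for p in old}
    new_by_id = {p.component: p for p in new}
    report: Dict[str, List[str]] = {
        "added": [], "removed": [], "moved": [], "flipped": []}
    for cid in sorted(old_by_id.keys() | new_by_id.keys()):
        before, after = old_by_id.get(cid), new_by_id.get(cid)
        if before is None:
            report["added"].append(cid)
        elif after is None:
            report["removed"].append(cid)
        else:
            # A contig can be both moved and flipped, so check separately.
            if (before.obj, before.start, before.end) != \
                    (after.obj, after.start, after.end):
                report["moved"].append(cid)
            if before.strand != after.strand:
                report["flipped"].append(cid)
    return report


# Toy data, not taken from any real assembly.
old_version = [
    Placement("chr1", 1, 1000, "contig_A", "+"),
    Placement("chr1", 1101, 1600, "contig_B", "-"),
    Placement("chr1", 1701, 2000, "contig_C", "+"),
]
new_version = [
    Placement("chr1", 1, 1000, "contig_A", "+"),
    Placement("chr1", 1101, 1600, "contig_B", "+"),
    Placement("chr2", 1, 400, "contig_D", "+"),
]
report = diff_assemblies(old_version, new_version)
print(report)
```

A report like this is small enough to render in a web page or a commit message, whereas the underlying files can still be stored in Git as rbuels suggests.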


Email 04

Perhaps it's worth distinguishing between assembly data (AGP for
pseudomolecules / genome scale structures and BAM or ACE for contigs
and scaffolds) and assembly metadata, that could be used for more
rapid assembly-assembly comparison.
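
To make the data / metadata distinction a little more concrete, here is a minimal Python sketch (Python rather than BioPerl purely for illustration). It reads tab-delimited AGP lines, assuming the nine-column layout in which a component type of N or U marks a gap line, and derives a small per-object summary that could serve as metadata for quick comparison. The function names `read_agp` and `summarise`, the column labels, and the toy input are all assumptions of this sketch, not part of any existing tool.

```python
import csv
import io
from collections import Counter
from typing import Dict, Iterable, Iterator

# Column labels are our own names for the nine AGP columns.
COMPONENT_COLS = ["object", "object_beg", "object_end", "part_number",
                  "component_type", "component_id", "component_beg",
                  "component_end", "orientation"]
GAP_COLS = COMPONENT_COLS[:5] + ["gap_length", "gap_type", "linkage",
                                 "linkage_evidence"]
GAP_TYPES = {"N", "U"}  # component_type values that denote a gap line


def read_agp(handle: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per AGP line, skipping comments and blank lines."""
    for fields in csv.reader(handle, delimiter="\t"):
        if not fields or fields[0].startswith("#"):
            continue
        cols = GAP_COLS if fields[4] in GAP_TYPES else COMPONENT_COLS
        yield dict(zip(cols, fields))


def summarise(rows: Iterable[Dict[str, str]]) -> Dict[str, Counter]:
    """Collect simple per-object metadata: length, component and gap counts."""
    summary: Dict[str, Counter] = {}
    for row in rows:
        stats = summary.setdefault(row["object"], Counter())
        stats["length"] = max(stats["length"], int(row["object_end"]))
        if row["component_type"] in GAP_TYPES:
            stats["gaps"] += 1
            stats["gap_bases"] += (int(row["object_end"])
                                   - int(row["object_beg"]) + 1)
        else:
            stats["components"] += 1
    return summary


# Toy input, not real data.
EXAMPLE_AGP = (
    "# toy example, not real data\n"
    "chr1\t1\t1000\t1\tW\tcontig_A\t1\t1000\t+\n"
    "chr1\t1001\t1100\t2\tN\t100\tscaffold\tyes\tpaired-ends\n"
    "chr1\t1101\t1600\t3\tW\tcontig_B\t1\t500\t-\n"
)
rows = list(read_agp(io.StringIO(EXAMPLE_AGP)))
summary = summarise(rows)
print(dict(summary["chr1"]))
```

Two versions of an assembly whose summaries are identical may still differ in detail, so a summary like this is only a cheap first check before a full comparison.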

I recommend that you learn a bit about assembly (how it is done),
think a bit about data structures for assembly data (how it could be
stored), and read a bit about AGP, ACE and BAM.

The next step is to think about how to edit these data structures
and how to represent the edits in a biologically meaningful and
computationally tractable way. This is where assembly metadata may
come into it.
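
As one possible reading of 'representing the edits', the Python sketch below treats an edit as a small record that names a biological operation (here, reversing the orientation of a contig) together with who made it and why, and rebuilds any version by replaying the edits over a base assembly. Everything here is hypothetical: `Placement`, `FlipEdit` and `replay` are names invented for this sketch, and a real system would need many more edit types (moving contigs, resizing gaps and so on).

```python
from dataclasses import dataclass, replace
from typing import Iterable, List


@dataclass(frozen=True)
class Placement:
    """Where one contig sits in an assembly (hypothetical record type)."""
    obj: str
    start: int
    end: int
    component: str
    strand: str  # "+" or "-"


@dataclass(frozen=True)
class FlipEdit:
    """An edit that reverses one contig's orientation, with provenance."""
    component: str
    author: str
    reason: str

    def apply(self, assembly: List[Placement]) -> List[Placement]:
        """Return a new assembly with the named contig flipped."""
        result = []
        found = False
        for p in assembly:
            if p.component == self.component:
                p = replace(p, strand="-" if p.strand == "+" else "+")
                found = True
            result.append(p)
        if not found:
            raise ValueError(f"no such component: {self.component}")
        return result


def replay(base: List[Placement], edits: Iterable[FlipEdit]) -> List[Placement]:
    """Rebuild a version of the assembly by applying edits in order."""
    current = list(base)
    for edit in edits:
        current = edit.apply(current)
    return current


# Toy data and a made-up contributor name.
base = [
    Placement("chr1", 1, 1000, "contig_A", "+"),
    Placement("chr1", 1101, 1600, "contig_B", "-"),
]
history = [FlipEdit("contig_B", "example_user", "orientation fixed by new mate pairs")]
latest = replay(base, history)
print(latest[1].strand)
```

Storing the edit records rather than only the resulting files keeps the provenance (author and reason) attached to each change, which is the kind of metadata a plain file history would not capture on its own.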

I think there are some more fundamental
decisions to be made (including simply judging the feasibility of this
proposal) before we worry too much about details of Git vs. SVN.