GMOD

MWAS Tutorial

1 MAKER Web Annotation Service
2 Understanding MWAS
3 Getting Started with MWAS

MAKER Web Annotation Service

The MAKER Web Annotation Service (MWAS) is an easily configurable, web-accessible genome annotation pipeline. Its purpose is to allow research groups with small to intermediate amounts of eukaryotic and prokaryotic genome sequence (e.g. BAC clones, small whole genomes, preliminary sequencing data) to independently annotate and analyse their data and produce output that can be loaded into a genome database. MWAS is built on the stand-alone genome annotation pipeline MAKER, and users who wish to annotate datasets that are too large to submit to MWAS are free to download MAKER for use on their own systems.

Understanding MWAS

The first half of this page gives general background on genome annotation and describes validation data for the MAKER Web Annotation Service (MWAS). The stand-alone annotation pipeline MAKER is at the heart of MWAS, and MWAS has been configured to present the user with configuration options that match those of the command-line program MAKER as closely as possible.

Introduction to Genome Annotation

What Are Annotations?

Annotations are descriptions of different features of the genome, and they can be either structural or functional in nature.

Examples:

Structural Annotations: exons, introns, UTRs, splice forms etc.
Functional Annotations: process a gene is involved in (metabolism), molecular function (hydrolase), location of expression (expressed in the mitochondria), etc.

It is especially important that every genome annotation carries an evidence trail that describes in detail the evidence used to both suggest and support it. This assists in quality control and downstream management of genome annotations.

Examples of evidence supporting a structural annotation:

Ab initio gene predictions
ESTs
Protein homology

Importance of Genome Annotations

Why should the average biologist care about genome annotations? Genome sequence by itself is not very useful. The main question when any genome is sequenced is, “where are the genes?” To identify the genes we need to annotate the genome. And while most researchers probably don’t give annotations a lot of thought, they use them every day.

Examples of Annotation Databases:

Every time we use techniques such as RNAi, PCR, gene expression arrays, targeted gene knockout, or ChIP we are basing our experiments on information derived from a digitally stored genome annotation. If the annotation is correct, then these experiments should succeed; however, if the annotation is incorrect, these experiments are bound to fail. Which brings up a major point:

Incorrect and incomplete genome annotations poison every experiment that uses them.

Quality control and evidence management are therefore essential components of any annotation process.

Effect of Next Generation Sequencing on the Annotation Process

It’s generally accepted that within the next few years it will be possible to sequence even human-sized genomes for as little as $1,000 and in a short time frame. When these expectations become reality, whole genome sequencing will likely become routine for even small laboratories. Unfortunately, advances in annotation technology have not kept pace with genome sequencing, and annotation is rapidly becoming a major bottleneck affecting modern genomics research.

For example:

As of October 2009, 222 eukaryotic genomes were fully sequenced yet unpublished (this is an ever growing backlog).
Currently (Jan 2010) there are over 900 eukaryotic genome projects underway. Assuming 10,000 genes per genome, that’s 9,000,000 new annotations; with this many new annotations, quality control and maintenance become an issue.
While there are organizations dedicated to producing and distributing genome annotations (e.g. Ensembl and VectorBase), the sheer volume of newly sequenced genomes exceeds both their capacity and stated purview.
Many small research groups (which often lack bioinformatics experience) must therefore confront the difficulties associated with genome annotation on their own.

The MAKER Web Annotation Service is a tool to assist research groups in converting the mountain of genomic data provided by next generation sequencing technologies into a usable resource. For larger datasets, research groups can use a local installation of the annotation pipeline MAKER.

What does MWAS do?

Identifies and masks out repeat elements
Aligns ESTs to the genome
Aligns proteins to the genome
Produces ab initio gene predictions
Synthesizes these data into final annotations
Produces evidence-based quality values for downstream annotation management

MAKER generated annotations, shown in Apollo.
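To make the first step in the list above concrete, the following is a minimal Python sketch of hard masking: replacing known repeat regions with N so that later alignment and prediction steps ignore them. It is an illustration only, not how MWAS or MAKER implement masking; the function name `mask_repeats` and the 0-based, end-exclusive interval convention are assumptions made for this example.

```python
def mask_repeats(sequence, repeats, mask_char="N"):
    """Return `sequence` with every repeat interval replaced by `mask_char`.

    `repeats` is an iterable of (start, end) pairs, 0-based and
    end-exclusive (an assumed convention for this sketch).
    """
    masked = list(sequence)
    for start, end in repeats:
        if not 0 <= start <= end <= len(sequence):
            raise ValueError(f"interval ({start}, {end}) is outside the sequence")
        # Overwrite the repeat with the same number of mask characters,
        # so downstream coordinates stay unchanged.
        masked[start:end] = mask_char * (end - start)
    return "".join(masked)


if __name__ == "__main__":
    # Toy contig: 8 unique bases, an 8-base simple repeat, 4 unique bases.
    contig = "ACGTACGT" + "T" * 8 + "ACGT"
    print(mask_repeats(contig, [(8, 16)]))
```

Masking preserves sequence length, so coordinates reported by later steps still refer to positions in the original sequence.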

What sets MAKER and MWAS apart from other tools (ab initio gene predictors etc.)?

MAKER is an annotation pipeline, not a gene predictor. MAKER does not predict genes; rather, it leverages existing software tools (some of which are gene predictors) and integrates their output to produce what MAKER believes to be the best possible gene model for a given location based on evidence alignments.

gene prediction ≠ gene annotation

Gene predictions are gene models.
Gene annotations are also gene models, but they should include a documented evidence trail supporting the model, in addition to quality control metrics.
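The distinction can be pictured as a data structure. The Python sketch below is purely illustrative: the class names, fields and the `supported_exon_fraction` metric are hypothetical and are not MAKER's data model or its quality values. It shows only that an annotation wraps a gene model together with its evidence and a metric derived from that evidence.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 0-based, end-exclusive (assumed)


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two half-open intervals share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


@dataclass
class GeneModel:
    """What a gene predictor emits: coordinates only."""
    seq_id: str
    strand: str
    exons: List[Interval]


@dataclass
class Evidence:
    """One piece of supporting evidence aligned to the genome."""
    kind: str       # e.g. "est", "protein", "ab_initio"
    span: Interval


@dataclass
class GeneAnnotation:
    """A gene model plus the evidence trail that supports it."""
    model: GeneModel
    evidence: List[Evidence] = field(default_factory=list)

    def supported_exon_fraction(self) -> float:
        """Hypothetical quality metric: share of exons touched by evidence."""
        if not self.model.exons:
            return 0.0
        supported = sum(
            1 for exon in self.model.exons
            if any(overlaps(exon, ev.span) for ev in self.evidence)
        )
        return supported / len(self.model.exons)


if __name__ == "__main__":
    model = GeneModel("contig1", "+", [(100, 200), (300, 400), (500, 600)])
    annotation = GeneAnnotation(
        model,
        [Evidence("est", (90, 210)), Evidence("protein", (310, 390))],
    )
    print(annotation.supported_exon_fraction())  # two of three exons supported
```

A bare `GeneModel` carries no way to judge its own reliability, whereas the `GeneAnnotation` can be filtered or prioritised by how well its evidence supports it.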

This may seem like just a matter of semantics, since the primary output of both ab initio gene predictors and the MAKER pipeline is the same: a collection of gene models. However, the differences between these programs have a few very significant consequences that I will explain shortly.

Emerging vs. Model Genomes

Emerging model organism genomes each come with their own set of issues that are not necessarily found in classic model genomes. These include difficulties associated with repeat identification, gene finder training, and other complex analyses. Unfortunately, emerging model organisms are often studied by very small research communities that lack the resources and bioinformatics experience necessary to tackle these issues.

Classic Model Organisms	Emerging Model Organisms
Well-developed experimental systems	New experimental systems; the genome will be the central resource for work in these systems
Much prior knowledge about genome	Little prior knowledge about genome; usually no genetics
Large community	Small communities
Big $	Less $
Examples: D. melanogaster, C. elegans, human, etc.	Examples: oomycetes, flatworms, cone snails, etc.

Comparison of Algorithm Performance on Model vs. Emerging Genomes

If you have ever looked at comparisons of gene predictor performance on classic model organisms such as C. elegans, you would conclude that ab initio gene predictors match or even outperform state-of-the-art annotation pipelines, and the truth is that, with enough training data, they do. However, it is important to keep in mind that ab initio gene predictors have been specifically optimized to perform well on model organisms such as Drosophila and C. elegans, organisms for which we have large amounts of pre-existing data to both train and tweak the prediction parameters.

Table: MAKER's Performance on the C. elegans genome (SN = sensitivity, SP = specificity)
Performance Category	Measure	SNAP (ab initio)	Augustus (ab initio)	MAKER (evidence based)	Gramene (evidence based)
Genomic Overlap (gene)	SP	82.48	88.09	91.69	93.49
Genomic Overlap (gene)	SN	95.44	96.78	89.81	88.74
Exon Overlap	SP	18.88	22.87	25.58	27.38
Exon Overlap	SN	87.63	93.09	91.17	94.84
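This page does not say exactly how the SN and SP values in the table were computed. As a hedged illustration, the Python sketch below uses one common convention from gene-prediction evaluation: sensitivity is the share of reference positions recovered by the prediction, and specificity is the share of predicted positions that fall inside the reference. The function names and the base-level overlap definition are assumptions for this example and may differ from the overlap categories used in the table.

```python
def covered_positions(intervals):
    """Expand (start, end) intervals (0-based, end-exclusive) into a set of positions."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end))
    return positions


def sensitivity_specificity(reference, predicted):
    """Return (SN, SP) as percentages from base-level overlap.

    SN = shared positions / reference positions
    SP = shared positions / predicted positions
    """
    ref = covered_positions(reference)
    pred = covered_positions(predicted)
    shared = len(ref & pred)
    sn = 100.0 * shared / len(ref) if ref else 0.0
    sp = 100.0 * shared / len(pred) if pred else 0.0
    return sn, sp


if __name__ == "__main__":
    # Toy example: a 100-base reference gene and an 80-base prediction
    # that shares 50 bases with it.
    sn, sp = sensitivity_specificity([(0, 100)], [(50, 130)])
    print(f"SN = {sn:.2f}, SP = {sp:.2f}")
```

Under this convention a predictor that over-predicts raises SN at the cost of SP, which is why the two values are reported together.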

What about emerging model organisms for which little data is available? Gene prediction in classic model organisms is relatively simple because there are already a large number of experimentally determined and verified gene models, but with emerging model organisms we are lucky to have a handful of gene models to train with. As a result, ab initio gene predictors generally perform very poorly on emerging genomes.

MAKER will:

Identify the legacy annotation most consistent with new data
Automatically revise it in light of the new data
If no annotation exists, create a new one

Last updated at 22:03 on 3 October 2012.
Content is available under a GNU Free Documentation License unless otherwise noted.