MWAS Tutorial

From GMOD
Revision as of 23:19, 31 December 2009 by Carsonholt (Talk | contribs)



MAKER Web Annotation Service Session



This tutorial walks you through running the MAKER Web Annotation Service.


MAKER Overview

The first half of this page describes the basics of MAKER - the easy-to-use genome annotation pipeline.


Introduction to Genome Annotation

What Are Annotations?

Annotations are descriptions of different features of the genome, and they can be either structural or functional in nature.

Examples:

  • Structural Annotations: exons, introns, UTRs, splice forms etc.
  • Functional Annotations: process a gene is involved in (metabolism), molecular function (hydrolase), location of expression (expressed in the mitochondria), etc.


It is especially important that every genome annotation carries an evidence trail that describes in detail the evidence used to both suggest and support it. This assists in quality control and downstream management of genome annotations.

Examples of evidence supporting a structural annotation:

  • Ab initio gene predictions
  • ESTs
  • Protein homology

Importance of Genome Annotations

Why should the average biologist care about genome annotations? Genome sequence itself is not very useful. The main question when any genome is sequenced is, "where are the genes?" To identify the genes we need to annotate the genome. And while most researchers probably don't give annotations a lot of thought, they use them every day.


Examples of Annotation Databases:


Every time we use techniques such as RNAi, PCR, gene expression arrays, targeted gene knockout, or ChIP we are basing our experiments on the information derived from a digitally stored genome annotation. If the annotation is correct, then these experiments should succeed; however, if an annotation is incorrect these experiments are bound to fail. This brings up a major point:

  • Incorrect and incomplete genome annotations poison every experiment that uses them.

Quality control and evidence management are therefore essential components of any annotation process.

Effect of Next Generation Sequencing on the Annotation Process

It’s generally accepted that within the next few years it will be possible to sequence even human-sized genomes for as little as $1,000 and in a short time frame. Pacific Biosciences is claiming they will be able to sequence a human-sized genome in fifteen minutes by 2013. If the hype is to be believed, then whole-genome sequencing will become routine for even small labs in the not-so-distant future. Unfortunately, advances in annotation technology have not kept pace with genome sequencing, and annotation is rapidly becoming a major bottleneck affecting modern genomics research.

For example:

  • As of February 2009, 173 eukaryotic genomes were fully sequenced yet unpublished (this is an ever growing backlog).
  • Currently there are over 1,000 eukaryotic genome projects underway. Assuming 10,000 genes per genome, that’s 10,000,000 new annotations (with this many new annotations, quality control and maintenance become an issue).
  • While there are organizations dedicated to producing and distributing genome annotations (e.g. Ensembl and VectorBase), the sheer volume of newly sequenced genomes exceeds both their capacity and stated purview.
  • Many small research groups (which often lack bioinformatics experience) must therefore confront the difficulties associated with genome annotation on their own.


MAKER is an easy-to-use annotation pipeline designed to help smaller research groups convert the mountain of genomic data provided by next generation sequencing technologies into a usable resource.


What does MAKER do?

  • Identifies and masks out repeat elements
  • Aligns ESTs to the genome
  • Aligns proteins to the genome
  • Produces ab initio gene predictions
  • Synthesizes these data into final annotations
  • Produces evidence-based quality values for downstream annotation management


Figure: MAKER generated annotations, shown in Apollo.


What sets MAKER apart from other tools (ab initio gene predictors etc.)?

MAKER is an annotation pipeline, not a gene predictor. MAKER does not predict genes itself. Rather, it leverages existing software tools (some of which are gene predictors) and integrates their output to produce what MAKER believes to be the best possible gene model for a given location based on evidence alignments.


gene prediction ≠ gene annotation

  • gene predictions are gene models.
  • gene annotations are gene models but should include a documented evidence trail supporting the model in addition to quality control metrics.


This may seem like just a matter of semantics, since the primary output of both ab initio gene predictors and the MAKER pipeline is the same: a collection of gene models. However, there are a few very significant consequences to the differences between these programs that I will explain shortly.


Emerging vs. Model Genomes

Emerging model organism genomes each come with their own set of issues that are not necessarily found in classic model genomes. These include difficulties associated with repeat identification, gene finder training, and other complex analyses. Unfortunately, emerging model organisms are often studied by very small research communities that lack the resources and bioinformatics experience necessary to tackle these issues.

Classic Model Organisms                               Emerging Model Organisms
Well developed experimental systems                   New experimental systems (genome will be the central resource for work in these systems)
Much prior knowledge about genome                     Little prior knowledge about genome (usually no genetics)
Large community                                       Small communities
Big $                                                 Less $
Examples: D. melanogaster, C. elegans, human, etc.    Examples: oomycetes, flat worms, cone snail, etc.

Comparison of Algorithm Performance on Model vs. Emerging Genomes

If you have ever looked at comparisons of gene predictor performance on classic model organisms such as C. elegans, you would conclude that ab initio gene predictors match or even outperform state-of-the-art annotation pipelines, and the truth is that, with enough training data, they do. However, it is important to keep in mind that ab initio gene predictors have been specifically optimized to perform well on model organisms such as Drosophila and C. elegans, organisms for which we have large amounts of pre-existing data to both train and tweak the prediction parameters.


Table: MAKER's Performance on the C. elegans genome

Performance Category           Ab initio            Evidence Based
                               SNAP     Augustus    MAKER    Gramene
Genomic Overlap (gene)   SP    82.48    88.09       91.69    93.49
                         SN    95.44    96.78       89.81    88.74
Exon Overlap             SP    18.88    22.87       25.58    27.38
                         SN    87.63    93.09       91.17    94.84

What about emerging model organisms for which little data is available? Gene prediction in classic model organisms is relatively simple because there are already a large number of experimentally determined and verified gene models, but with emerging model organisms, we are lucky to have a handful of gene models to train with. As a result ab initio gene predictors generally perform very poorly on emerging genomes.

Figure: MAKER's Performance on the S. mediterranea Emerging Model Organism Genome. Pfam domain content of gene models determined using rpsblast


By using ab initio gene predictors inside of the MAKER pipeline instead of as stand-alone applications you gain several benefits:

  • MAKER provides gene models together with an evidence trail for quality control and manual curation.
  • MAKER provides a mechanism to train and retrain ab initio gene predictors for even better performance.
  • Output can be easily loaded into a GMOD compatible database for annotation distribution (including evidence associations).
  • Annotations can be automatically updated with new evidence by simply passing existing annotation sets back into the pipeline.


Getting Started with MWAS

Registration

MWAS is free to all users and has no login requirement, but registration is recommended as it allows for easier file and job management and registered users can upload more sequence.

Running MWAS with Example Data

MWAS comes with some example files to familiarize the user with how to run MAKER. You can pre-load the fields for a new job by selecting one of the examples from the drop down menu on the "New Job" page.


Next we need to tell MAKER all the details about how we want the annotation process to proceed. Because there can be many variables and options involved in annotation you will need to review each option carefully. At the very least you should provide a genome sequence file, an EST sequence file, and a protein homology sequence file.

Details of What is Going on Inside of MAKER

Repeat Masking

The first step in MAKER is repeat masking, but why do we need to do this? Repetitive elements can make up a significant portion of the genome. Some of these repeats are simple/low-complexity repeats where you have runs of C's or G's or maybe even something like AAGGAAGGAAGG. Other repeats are more complex, such as transposable elements. These high-complexity repeats often encode real proteins like reverse transcriptase or even Gag, Pol, and Env viral proteins. Because they encode real proteins, they can play havoc with ab initio gene predictors. For example, a transposable element that occurs next to or even within the intron of a real protein encoding gene might cause a gene predictor to include extra exons as part of a gene model, sequence which really only belongs to the transposable element and not to the coding sequence of the gene. You will also get hundreds of instances where identical transposable element proteins get annotated as being part of an organism's proteome. In addition to these issues, low-complexity repeat regions can align with high statistical significance to low-complexity protein regions, creating a false sense of homology throughout the genome. To avoid these complications it is convenient to identify and mask any repeat elements before doing other analyses.


MAKER identifies repeats in two steps.

  • First a program called RepeatMasker is used to identify low-complexity and high-complexity repeats that match entries in the RepBase repeat library, or any species specific repeat library supplied by the user.
  • Next MAKER uses RepeatRunner to identify transposable element and viral proteins from the RepeatRunner protein database. Because protein sequence diverges at a slower rate than nucleotide sequence, this step helps pick up the most problematic regions of divergent repeats that are missed by RepeatMasker, which searches in nucleotide space.


Regions identified during repeat analysis are masked out so as not to complicate other downstream annotation analyses.

  • High-complexity repeats are hard-masked, a technique in which nucleotide sequence is replaced with the letter N to prohibit any alignments to that region.
  • Low-complexity regions are soft-masked, a technique in which nucleotides are made lower case so they can be treated as masked under certain situations without losing sequence information. I will discuss some of the applications and effects of soft-masking later.
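The difference between the two masking styles can be sketched in a few lines of Python. This is only an illustration of the idea, not MAKER's own code: the function names, the toy sequence, and the repeat coordinates (0-based, end-exclusive) are all made up for the example.

```python
def hard_mask(seq, regions):
    """Replace every base inside the given regions with N."""
    chars = list(seq)
    for start, end in regions:
        chars[start:end] = "N" * (end - start)
    return "".join(chars)


def soft_mask(seq, regions):
    """Lower-case every base inside the given regions, keeping the sequence."""
    chars = list(seq)
    for start, end in regions:
        chars[start:end] = seq[start:end].lower()
    return "".join(chars)


# Toy contig: positions 14-25 hold the simple repeat AAGGAAGGAAGG.
contig = "ATGCGTACGTTAGCAAGGAAGGAAGGTTCA"

# Hypothetical repeat calls (coordinates invented for illustration).
high_complexity = [(3, 9)]
low_complexity = [(14, 26)]

masked = soft_mask(hard_mask(contig, high_complexity), low_complexity)
print(masked)
```

After hard-masking the original bases are gone, while soft-masked bases can be restored simply by upper-casing them again.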


Masking out sequence might seem on the surface like losing a lot of information, and it is true that some proteins have integrated repeats into their structure, so repeat masking will affect our ability to annotate them. However, these proteins are rare, and the number of gene models and homology alignments improved by this step far exceeds the few gene models that may be negatively affected. If repeat masking worries you, you do have the option to run ab initio gene predictors on both the masked and unmasked sequence by setting unmask:1 in the maker_opts.ctl configuration file.

Ab Initio Gene Prediction

Following repeat masking, MAKER runs ab initio gene predictors specified by the user to produce preliminary gene models. Ab initio gene predictors produce gene predictions based on underlying mathematical models describing patterns of intron/exon structure and consensus start signals. Gene models are not produced by directly using experimental evidence. Because the patterns of gene structure are going to differ from organism to organism, you must train gene predictors before you can use them. I will discuss how to do this later on.


MWAS currently supports:

  • SNAP
  • Augustus
  • GeneMark
  • FGENESH (Not shown on public site)


You must specify in the maker_opts.ctl file the training parameters file you want to use when running each of these algorithms.


EST and Protein Evidence Alignment

A simple way to indicate if a sequence region is likely associated with a gene is to identify (A) if the region is actively being transcribed or (B) if the region has homology to a known protein. This can be done by aligning Expressed Sequence Tags (ESTs) and proteins to the genome using alignment algorithms.

  • ESTs are sequences derived from a cDNA library. Because of the difficulties associated with working with mRNA and depending on how the cDNA library was prepared, EST databases usually represent bits and pieces of transcribed mRNAs with only a few full length transcripts. MAKER aligns these sequences to the genome using BLASTN. If ESTs from the organism being annotated are unavailable or sparse, you can use ESTs from a closely related organism. However, ESTs from closely related organisms are unlikely to align using BLASTN since nucleotide sequences can diverge quite rapidly. For these ESTs, MAKER uses TBLASTX to align them in protein space.
  • Protein sequence generally diverges quite slowly over large evolutionary distances. As a result, proteins from even evolutionarily distant organisms can be aligned against raw genomic sequence to try and identify regions of homology. MAKER does this using BLASTX.


Remember that we are aligning against the repeat-masked genomic sequence. How is this going to affect our alignments? For one thing, we won't be able to align against low-complexity regions. Some real proteins contain low-complexity regions and it would be nice to identify those, but if I let anything align to a low-complexity region, then I will get spurious alignments all over the genome. Wouldn't it be nice if there was a way to allow BLAST to extend alignments through low-complexity regions, but only if there is already an alignment somewhere else? You can do this with soft-masking. As you may remember, soft-masking uses lower case letters to mask sequence without losing the sequence information. BLAST allows you to use soft-masking to keep alignments from seeding in low-complexity regions while still letting them extend through those regions. This of course will allow some of the spurious alignments you were trying to avoid, but overall you still end up suppressing the majority of poor alignments while letting through enough real alignments to justify the cost. If this behavior bothers you, you can turn it off by setting softmask:0 in the maker_bopts.ctl file.
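To make the seed-versus-extend distinction concrete, here is a deliberately simplified Python sketch. It is not how BLAST is implemented (there is no scoring, no gaps, and no statistics); it only shows exact word seeds being refused inside lower-case sequence while ungapped extension is allowed to run through it. The function name, the word size, and the toy sequences are assumptions made for the example.

```python
def seed_and_extend(query, subject, k=4):
    """Toy ungapped matcher: seeds must avoid soft-masked (lower-case) subject
    bases, but extension compares bases case-insensitively."""
    q = query.upper()
    s_upper = subject.upper()
    hits = set()
    for qi in range(len(q) - k + 1):
        for si in range(len(subject) - k + 1):
            window = subject[si:si + k]
            if not window.isupper():
                continue  # seed would touch soft-masked sequence
            if window != q[qi:qi + k]:
                continue  # not an exact word match
            # Extend left, then right, ignoring case (masking no longer matters).
            ql, sl = qi, si
            while ql > 0 and sl > 0 and q[ql - 1] == s_upper[sl - 1]:
                ql, sl = ql - 1, sl - 1
            qr, sr = qi + k, si + k
            while qr < len(q) and sr < len(subject) and q[qr] == s_upper[sr]:
                qr, sr = qr + 1, sr + 1
            hits.add((sl, sr))
    return sorted(hits)


# Subject with a soft-masked low-complexity stretch in the middle.
subject = "TTTACGTACaaggaaggCCGGTATTT"

# A query anchored in unmasked sequence extends through the masked stretch.
print(seed_and_extend("ACGTACAAGGAAGGCCGGTA", subject))

# A purely low-complexity query finds no place to seed.
print(seed_and_extend("AAGGAAGG", subject))
```

The first query is reported as a single hit spanning the masked region, while the second produces nothing, which is the trade-off described above.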


Polishing Evidence Alignments

Because of oddities associated with how BLAST statistics work, BLAST alignments are not as informative as they could be. BLAST will align regions anywhere it can, even if the algorithm aligns regions out of order, with multiple overlapping alignments in the exact same region, or with slight overhangs around splice sites.


To get more informative alignments MAKER uses the program Exonerate to polish BLAST hits. Exonerate realigns each sequence identified by BLAST around splice sites and forces the alignments to occur in order. The result is a high quality alignment that can be used to suggest near exact intron/exon positions. Polished alignments are produced using the est2genome and protein2genome options for Exonerate.


One of the benefits of polishing EST alignments is the ability to identify the strand an EST derives from. Because of amplification steps involved in building an EST library and limitations involved in some high throughput sequencing technologies, you don't necessarily know whether you're really aligning the forward or reverse transcript of an mRNA. However, if you take splice sites into account, you can only align to one strand correctly.
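The reasoning about strand can be illustrated with a small Python sketch. It assumes only the canonical GT...AG intron signal and a made-up toy sequence; Exonerate's actual splice model is richer than this, so treat the function below as an illustration rather than a description of what Exonerate does.

```python
def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def intron_strand(genome, start, end):
    """Guess the transcript strand from the gap genome[start:end] between two
    aligned EST blocks (0-based, end-exclusive).

    Returns '+' if the gap looks like GT...AG on the forward strand,
    '-' if it looks like GT...AG on the reverse strand, and '?' otherwise.
    """
    intron = genome[start:end].upper()
    if intron.startswith("GT") and intron.endswith("AG"):
        return "+"
    rc = revcomp(intron)
    if rc.startswith("GT") and rc.endswith("AG"):
        return "-"
    return "?"


# Toy locus: exon, canonical intron, exon (sequence invented for illustration).
locus = "ATGGCC" + "GTAAGTTTTTCAG" + "GGATCC"

print(intron_strand(locus, 6, 19))           # gap read on the forward strand
print(intron_strand(revcomp(locus), 6, 19))  # same locus seen from the other strand
```

Because a GT...AG gap reads as CT...AC on the opposite strand, a spliced alignment is consistent with only one orientation, which is why polishing can recover the strand even when the EST library does not record it.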


Integrating Evidence to Synthesize Final Annotations

Once you have ab initio predictions, EST alignments, and protein alignments you can integrate this evidence to produce even better gene predictions. MAKER does this by "talking" to the gene prediction programs. MAKER takes all the evidence, generates "hints" to where splice sites and protein coding regions are located, and then passes these "hints" to programs that will accept them.


MAKER produces hint-based predictions using:

  • SNAP
  • Augustus
  • FGENESH
  • GeneMark (under development)


MAKER then takes the entire pool of ab initio and evidence-informed gene predictions, updates features such as 5' and 3' UTRs based on EST evidence, tries to determine alternative splice forms where EST data permits, and produces quality control metrics for each gene model (these are included in the output). Finally, MAKER chooses from among all the gene model possibilities the one that best matches the evidence, using a modified sensitivity/specificity distance metric.
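The exact modification MAKER applies to the metric is not described on this page, so the Python sketch below shows only the plain, unmodified idea: compute nucleotide-level sensitivity and specificity of each candidate model against the evidence, then prefer the model closest to a perfect score. The Euclidean distance, the function names, and the coordinates are all assumptions for illustration. Note that in the gene prediction literature "specificity" usually means the fraction of predicted bases that are supported, which is the convention used here.

```python
import math


def covered(intervals):
    """Set of positions covered by half-open (start, end) intervals."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end))
    return positions


def sn_sp(model_exons, evidence_exons):
    """Sensitivity = shared / evidence bases; specificity = shared / model bases."""
    model, evidence = covered(model_exons), covered(evidence_exons)
    shared = len(model & evidence)
    sn = shared / len(evidence) if evidence else 0.0
    sp = shared / len(model) if model else 0.0
    return sn, sp


def distance_from_perfect(sn, sp):
    """Plain Euclidean distance from SN = SP = 1 (an assumed stand-in metric)."""
    return math.hypot(1.0 - sn, 1.0 - sp)


# Invented evidence and candidate gene models for one locus.
evidence = [(100, 200), (300, 400)]
candidates = {
    "model_a": [(100, 200), (300, 400)],  # matches the evidence exactly
    "model_b": [(100, 200), (300, 450)],  # over-extends the last exon
    "model_c": [(100, 150)],              # misses most of the evidence
}

scores = {name: distance_from_perfect(*sn_sp(exons, evidence))
          for name, exons in candidates.items()}
best = min(scores, key=scores.get)
print(best, scores)
```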


MAKER's Output

Once your job is finished and you download the data, you will see that MAKER has created an output directory called something like 2434.maker.output. The name of the output directory is based off of the job id assigned to your sequence file.


You should now see a list of directories and files created by MAKER.

drwxr-xr-x 3 gmod gmod 4096 2009-07-12 23:23 2434_datastore
-rw-r--r-- 1 gmod gmod  135 2009-07-12 23:27 2434_master_datastore_index.log
-rw-r--r-- 1 gmod gmod 1579 2009-07-12 23:23 maker_bopts.log
-rw-r--r-- 1 gmod gmod 1250 2009-07-12 23:23 maker_exe.log
-rw-r--r-- 1 gmod gmod 4016 2009-07-12 23:23 maker_opts.log
drwxr-xr-x 2 gmod gmod 4096 2009-07-12 23:23 mpi_blastdb
  • The maker_opts.log, maker_exe.log, and maker_bopts.log files are logs of the control files used for this run of MAKER.
  • The mpi_blastdb directory contains fasta indexes and BLAST database files created from the input EST, protein, and repeat databases.
  • The 2434_master_datastore_index.log contains information on both the run status of individual contigs and information on where individual contig data is stored.
  • The 2434_datastore directory contains a set of subfolders, each containing the final MAKER output for individual contigs from the genomic fasta file.


Once a MAKER run is finished the most important file to look at is the 2434_master_datastore_index.log to see if there were any failures.

less 2434_master_datastore_index.log

MWAS provides a summary of this file when you click on results to download a job. MWAS also displays run errors under the log button on the MWAS main queue page.


If everything proceeded correctly you should see the following in your 2434_master_datastore_index.log file.

contig-dpp-500-500      2434_datastore/contig-dpp-500-500 STARTED
contig-dpp-500-500      2434_datastore/contig-dpp-500-500 FINISHED


There are only entries describing a single contig because there was only one contig in the example file. These lines indicate that the contig 'contig-dpp-500-500' STARTED and then FINISHED without incident. Other possible entries include:

  • DIED - indicates a failed run on this contig, MAKER will retry these
  • RETRY - indicates that MAKER is retrying a contig that failed
  • SKIPPED_SMALL - indicates the contig was too short
  • DIED_SKIPPED_PERMANENT - indicates a failed contig that MAKER will not attempt to retry
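For a genome with many contigs you will not want to read this file by eye. The Python sketch below summarizes it, assuming each line holds three whitespace-separated fields (contig name, datastore path, status) as in the example above, and that later lines supersede earlier ones for the same contig. The second contig in the sample text is invented to show a failure.

```python
FAILED_STATES = {"DIED", "DIED_SKIPPED_PERMANENT"}


def final_status(log_text):
    """Map each contig to its (path, last reported status)."""
    status = {}
    for line in log_text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        contig, path, state = fields
        status[contig] = (path, state)  # later entries overwrite earlier ones
    return status


def failed_contigs(status):
    """Sorted names of contigs whose last status indicates a failure."""
    return sorted(c for c, (_, state) in status.items() if state in FAILED_STATES)


# Sample log text; 'contig-example-2' is a hypothetical failing contig.
sample = """\
contig-dpp-500-500      2434_datastore/contig-dpp-500-500 STARTED
contig-dpp-500-500      2434_datastore/contig-dpp-500-500 FINISHED
contig-example-2        2434_datastore/contig-example-2 STARTED
contig-example-2        2434_datastore/contig-example-2 DIED
"""

summary = final_status(sample)
for contig, (path, state) in summary.items():
    print(contig, state, path)
print("failed:", failed_contigs(summary))
```

To run it on a real job, read the log with open("2434_master_datastore_index.log").read() and pass the text to final_status.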


The entries in the 2434_master_datastore_index.log file also indicate that the output files for this contig are stored in the directory 2434_datastore/contig-dpp-500-500/. Knowing where the output is stored may seem rather trivial; however, input genome fasta files can contain thousands or even hundreds of thousands of contigs, and many file-systems have performance problems with large numbers of sub-directories and files within a single directory. Even when the underlying file-systems handle things gracefully, access via network file-systems can be an issue. To deal with this situation, MAKER uses a datastore module to create a hierarchy of sub-directory layers, starting from a 'base', and mapping identifiers to corresponding sub-directories. For situations where the input genome fasta file contains more than 1,000 contigs, the datastore structure is used automatically, and the master_datastore_index.log file becomes essential for identifying where the output for a given contig is stored.
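The general idea of mapping identifiers onto layered sub-directories can be sketched as follows. The hashing scheme here (two layers named after hex digits of an MD5 digest) is purely an assumption chosen for illustration and should not be taken as MAKER's actual layout, which is why the index log remains the authoritative place to look up a contig.

```python
import hashlib


def datastore_path(base, contig_id, levels=2):
    """Map a contig ID to base/<xx>/<yy>/contig_id using digits of its MD5 hash.

    Hypothetical scheme: each layer is named after two hex digits, so any one
    directory holds at most 256 sub-directories per layer.
    """
    digest = hashlib.md5(contig_id.encode("utf-8")).hexdigest()
    layers = [digest[2 * i:2 * i + 2] for i in range(levels)]
    return "/".join([base, *layers, contig_id])


# The same ID always maps to the same place, so lookups need no directory scan.
print(datastore_path("2434_datastore", "contig-dpp-500-500"))
```

Spreading contigs across layers this way keeps every directory small regardless of how many contigs the genome file contains.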


Now let's take a look at what MAKER produced for the contig 'contig-dpp-500-500'.

cd 2434_datastore/contig-dpp-500-500
ls -l

The directory should contain a number of files.

-rw-r--r-- 1 gmod gmod 47437 2009-07-12 23:27 contig-dpp-500-500.gff
-rw-r--r-- 1 gmod gmod   189 2009-07-12 23:27 contig-dpp-500-500.maker.non_overlapping_ab_initio.proteins.fasta
-rw-r--r-- 1 gmod gmod   399 2009-07-12 23:27 contig-dpp-500-500.maker.non_overlapping_ab_initio.transcripts.fasta
-rw-r--r-- 1 gmod gmod   704 2009-07-12 23:27 contig-dpp-500-500.maker.proteins.fasta
-rw-r--r-- 1 gmod gmod   901 2009-07-12 23:27 contig-dpp-500-500.maker.snap_masked.proteins.fasta
-rw-r--r-- 1 gmod gmod  4837 2009-07-12 23:27 contig-dpp-500-500.maker.snap_masked.transcripts.fasta
-rw-r--r-- 1 gmod gmod  4430 2009-07-12 23:27 contig-dpp-500-500.maker.transcripts.fasta


  • The contig-dpp-500-500.gff contains all annotations and evidence alignments in GFF3 format. This is the important file for use with Apollo or GBrowse.
  • The contig-dpp-500-500.maker.transcripts.fasta and contig-dpp-500-500.maker.proteins.fasta files contain the transcript and protein sequences for MAKER produced gene annotations.
  • The contig-dpp-500-500.maker.snap_masked.transcripts.fasta and contig-dpp-500-500.maker.snap_masked.proteins.fasta files contain the transcript and protein sequences for all SNAP ab initio gene predictions. If you use other ab initio gene predictors, those sequence files will follow a similar naming pattern.
  • The contig-dpp-500-500.maker.non_overlapping_ab_initio.transcripts.fasta and contig-dpp-500-500.maker.non_overlapping_ab_initio.proteins.fasta files contain the set of best ab initio gene predictions that do not overlap a MAKER gene annotation. These files can be analyzed to see if there is any reason to promote them to the status of gene annotations. For example: you can run iprscan to see if they contain known protein domains.


Viewing MAKER Annotations

Viewing the raw GFF3 file produced by MAKER really isn't that meaningful.


For sanity checking purposes it would be nice to have a graphical view of what's in the GFF3 file. To do this GFF3 files can be loaded into programs like Apollo and GBrowse. MWAS allows you to view the files in Apollo directly on the website. You also get summary statistics from SOBA.
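If you just want a quick textual sanity check before opening a viewer, a feature tally is easy to produce yourself. The Python sketch below counts features by the source and type columns of a GFF3 file, relying only on the standard nine tab-separated columns. The sample lines, including the source labels, are invented for illustration and are not real MAKER output.

```python
from collections import Counter


def tally_features(gff3_text):
    """Count GFF3 features by (source, type), i.e. columns 2 and 3."""
    counts = Counter()
    for line in gff3_text.splitlines():
        if line.startswith("##FASTA"):
            break  # any embedded sequence section follows; stop here
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a feature line
        counts[(cols[1], cols[2])] += 1
    return counts


# Invented example lines in GFF3 layout (not real MAKER output).
sample = "\n".join([
    "##gff-version 3",
    "contig-dpp-500-500\tmaker\tgene\t100\t900\t.\t+\t.\tID=gene1",
    "contig-dpp-500-500\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=mrna1;Parent=gene1",
    "contig-dpp-500-500\tmaker\texon\t100\t400\t.\t+\t.\tParent=mrna1",
    "contig-dpp-500-500\tmaker\texon\t600\t900\t.\t+\t.\tParent=mrna1",
    "contig-dpp-500-500\tblastn\tmatch\t120\t380\t50\t+\t.\tID=hit1",
    "##FASTA",
    ">contig-dpp-500-500",
    "ACGT",
])

for (source, ftype), n in sorted(tally_features(sample).items()):
    print(source, ftype, n)
```

For a real run, pass the contents of contig-dpp-500-500.gff to tally_features instead of the sample text.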


Apollo

Select the file in Apollo, and open it. You will notice that there are a number of bars representing the gene annotations and the evidence alignments supporting those annotations. Annotations are in the middle light colored panel, and evidence alignments are in the dark panels at the top and bottom.


All the evidence in the dark panels is in the same color, which makes it difficult to identify the source of each piece of evidence without manually clicking on them. MAKER comes with a configuration file for Apollo which gives a more colorful view of MAKER produced annotations and evidence. Let's close Apollo, copy this configuration file and then reload the annotations.

You can now see the annotations and evidence in nice color. Click on each piece of evidence and you will see its source in the table at the bottom of the Apollo screen.

Possible Sources Include:

  • BLASTN - BLASTN alignment of EST evidence
  • BLASTX - BLASTX alignment of protein evidence
  • TBLASTX - TBLASTX alignment of EST evidence from closely related organisms
  • EST2Genome - Polished EST alignment from Exonerate
  • Protein2Genome - Polished protein alignment from Exonerate
  • SNAP - SNAP ab initio gene prediction
  • GENEMARK - GeneMark ab initio gene prediction
  • Augustus - Augustus ab initio gene prediction
  • FgenesH - FGENESH ab initio gene prediction
  • Repeatmasker - RepeatMasker identified repeat
  • Blastx:Repeatmask - RepeatRunner identified repeat from the repeat protein database


Basic Input Files

All the basic input files for MAKER should be in fasta format.


  • genome - Genomic sequence file
  • est - ESTs from the same organism or from a very closely related organism (e.g. chimpanzee to human). These are aligned first via BLASTN with very strict filtering, so any sequence divergence can prohibit the alignment.
  • altest - These are ESTs from other closely related organisms (e.g. mouse to human). They are aligned via TBLASTX in protein space, so greater sequence divergence is permitted.
  • protein - proteins from the same or other organisms. These are aligned via BLASTX against the genome. Proteins that align to a region will not necessarily be orthologous or paralogous. The alignment may just be based on short regions such as a shared domain. You may also get alignments to pseudogenes. Polishing BLASTX hits with Exonerate helps identify what are likely true paralogs and orthologs.


Repeat Masking Options

Repeat masking is important for improving gene predictor performance and avoiding protein alignments to what are likely just transposons. You should also expect a certain amount of genomic contamination in the EST database, and much of this contamination maps back to repeat regions. By repeat masking we can avoid issues with all types of input data.


  • model_org - This is a RepeatMasker option that lets you limit the repeat database to specific organisms or groups of organisms (e.g. vertebrates, Nematodes, Drosophila, primates etc). By default MAKER sets this to 'all'.
  • repeat_protein - This is a fasta file of transposon and virus related proteins. MAKER has an internal RepeatRunner database it uses by default.
  • rmlib - This is a fasta file of nucleotide repeats provided by the user. You can create a species specific repeat database using programs like PILER.


Gene Prediction Options

Gene prediction options affect the final gene annotations more than any other option type. This brings up the point that electronically produced gene annotations will only be as good as the gene predictions they are based on.


  • predictor - This tells MAKER what programs to run for generating annotations.
    • est2genome - Allows high quality spliced Exonerate EST alignments to become gene annotations. This only happens when there is no gene prediction overlapping the region. This is useful for generating gene annotations in the absence of a trained gene predictor.
    • model_gff - This allows user defined models to be used
    • snap
    • augustus
    • genemark
    • fgenesh
  • unmask - Produce ab initio gene predictions for unmasked sequence as well as for masked sequence
  • snaphmm - SNAP training file (SNAP has some species files already available in the snap/HMM/ directory)
  • gmhmm - GeneMark training file (GeneMark self-trains and produces the resulting training file in the output mod/ directory)
  • augustus_species - Augustus species ID (Augustus uses an internal species index rather than a simple set of training files. Type 'augustus --species=help' to see the values you can choose)
  • fgenesh_par_file - FGENESH training file
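To see how these options fit together, the Python sketch below parses a small block of control-file style text into a dictionary. It assumes the key:value syntax suggested by the unmask:1 and softmask:0 examples earlier on this page, and treats '#' as starting a comment, which is also an assumption. The option names come from the lists above, but every file name and value in the sample is hypothetical.

```python
def parse_ctl(text):
    """Parse 'key:value' lines into a dict, ignoring blanks and '#' comments."""
    options = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line or ":" not in line:
            continue
        key, value = line.split(":", 1)
        options[key.strip()] = value.strip()
    return options


# Hypothetical settings; file names and values are placeholders.
sample_ctl = """\
genome:my_genome.fasta       # genomic sequence file
est:my_ests.fasta            # ESTs from the same organism
protein:my_proteins.fasta    # protein homology evidence
model_org:all                # RepeatMasker repeat database selection
predictor:snap               # program used to generate annotations
snaphmm:my_species.hmm       # SNAP training file
unmask:0                     # also predict on unmasked sequence if 1
"""

opts = parse_ctl(sample_ctl)
missing = [k for k in ("genome", "est", "protein") if not opts.get(k)]
print(opts["predictor"], "missing inputs:", missing)
```

A check like the last two lines is a cheap way to confirm that the three recommended inputs (genome, EST, and protein files) are all set before submitting a job.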


Other MAKER Options

  • evaluate - runs an experimental annotation quality analysis program (Evaluator) on each annotation. Provides quantitative metrics for ranking annotations and identifying the features most in need of review. I'd like to emphasize that this is experimental.
  • max_dna_len - sets the length for dividing up contigs into chunks for processing. Larger chunks require more memory; smaller chunks require less memory. Allows the user to control system memory usage.
  • min_contig - sets the minimum length a contig must have or else it will be skipped.
  • min_protein - sets the minimum length a predicted protein must have (in amino acids) to be annotated.
  • split_hit - sets the expected max intron size for evidence alignments
  • pred_flank - sets the length for the sequence surrounding clusters of EST and protein evidence that will be used when building hint based gene predictions.
  • single_exon - tells MAKER to consider single exon EST evidence when generating annotations. Single exon ESTs are more likely to be genomic contamination.
  • single_length - sets the minimum length required for single exon ESTs if 'single_exon' is enabled
  • keep_preds - adds non-overlapping ab initio gene predictions to the final annotation set rather than pushing them off into a separate file for the user to analyze. These predictions by definition do not overlap any form of supporting evidence.
  • retry - sets the number of times to retry a contig if there is a failure
  • clean_try - removes all data from previous MAKER runs before retrying a contig
  • clean_up - removes theVoid directory with individual raw analysis files at the end of the MAKER run
  • TMP - specifies a directory other than the system default temporary directory (/tmp) for writing temporary files. On some Linux systems the primary hard drive that also holds the default temporary directory is small, and most of the system's storage space is located on secondary hard drives mounted in directories elsewhere on the system. This is often true of computer clusters where each node has its own small hard drive for booting purposes, and most storage space is network mounted. Temporary files created by MAKER are deleted as the program advances, but individual files related to BLAST jobs can be quite large, so setting TMP to another location can be useful.
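The interaction between min_contig and max_dna_len can be pictured with a short Python sketch. It assumes simple, non-overlapping chunks, which may differ from how MAKER actually divides contigs; the function name, contig names, and lengths are invented for the example.

```python
def plan_chunks(contig_lengths, max_dna_len, min_contig):
    """Return {contig: [(start, end), ...]} with half-open chunk coordinates.

    Contigs shorter than min_contig get an empty list (they would be skipped).
    """
    plan = {}
    for name, length in contig_lengths.items():
        if length < min_contig:
            plan[name] = []
            continue
        plan[name] = [(start, min(start + max_dna_len, length))
                      for start in range(0, length, max_dna_len)]
    return plan


# Invented contig lengths and deliberately tiny settings for readability.
lengths = {"contig_long": 250, "contig_short": 40}
plan = plan_chunks(lengths, max_dna_len=100, min_contig=50)
for name, chunks in plan.items():
    print(name, chunks if chunks else "skipped (too short)")
```

Lowering max_dna_len produces more, smaller chunks per contig, which is the lever the option gives you for trading memory use against the number of pieces to process.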


Training ab initio Gene Predictors

If you are involved in a genome project for an emerging model organism, you should already have an EST database which would have been generated as part of the original sequencing project. A protein database can be collected from closely related organism genome databases or by using the UniProt/SwissProt protein database or the NCBI NR protein database. However a trained ab initio gene predictor is a much more difficult thing to generate. Gene predictors require existing gene models on which to base prediction parameters. However, with emerging model organisms there are no pre-existing gene models. So how then are you supposed to train your gene prediction programs?


MAKER gives the user the option to produce gene annotations directly from the EST evidence. You can then use these imperfect gene models to train a gene predictor. Once you have re-run MAKER with the newly trained gene predictor, you can use the second set of gene annotations to train the gene predictor yet again. This bootstrap process allows you to iteratively improve the performance of ab initio gene predictors.

GFF3 Pass-through

What if I'm not working on a new genome project, but rather have an existing annotation set that I just want to update to reflect new protein and EST evidence? Here you can use a feature in MAKER called GFF3 pass-through, which allows you to pass existing annotations into the program and combine them w

mRNAseq

mRNAseq is a high throughput technique for sequencing the entire transcriptome, and it holds the promise of allowing researchers to identify all exons and alternative splice forms for every gene in the genome with a single experiment. It may soon make gene predictors (mostly) a thing of the past.

  • Still need to de-convolute reads & evidence (for now)
  • Still need to archive, manage, and distribute annotations




We are currently working on native support for mRNAseq data within the MAKER pipeline. However, because of the GFF3 pass-through option, there is a way to take advantage of mRNAseq reads right now. By mapping mRNAseq reads using Bowtie and TopHat, you can create GFF3 files of read islands and junctions. These data can then be passed in as EST evidence and will be used for generating hint-based gene predictions and for choosing final annotations.


Merge/Resolve Legacy Annotations

Legacy annotations

  • Many are no longer maintained by original creators
  • In some cases more than one group has annotated the same genome, using very different procedures, even different assemblies
  • Many investigators have their own genome-scale data and would like a private set of annotations that reflect these data
  • There will be a need to revise, merge, evaluate, and verify legacy annotation sets in light of RNA-seq and other data




MAKER will:

  • Identify legacy annotation most consistent with new data
  • Automatically revise it in light of new data
  • If no existing annotation, create new one