Difference between revisions of "MWAS Tutorial"

From GMOD
m (reformatting, removing #icon stuff)
 
(31 intermediate revisions by one other user not shown)
Line 1: Line 1:
{| class="tutorialheader"
+
==Maker Web Annotation Service==
| align="right" | {{#icon: MAKERLogo.png|MAKER|200|MAKER}}<br />
+
The MAKER Web Annotation Service (MWAS) is an easily configurable, web-accessible genome annotation pipeline. Its purpose is to allow research groups with small to intermediate amounts of eukaryotic and prokaryotic genome sequence (e.g. BAC clones, small whole genomes, preliminary sequencing data) to independently annotate and analyse their data and produce output that can be loaded into a genome database. MWAS is built on the standalone genome annotation pipeline [[MAKER]], and users who wish to annotate datasets that are too large to submit to MWAS are free to [http://www.yandell-lab.org/software/ download MAKER] for use on their own systems.
<br />
+
| {{TutorialTitleLine|[[MAKER Web Annotation Service]]}}<br />
+
|}
+
__NOTITLE__
+
 
+
 
+
This [[:Category:Tutorials|tutorial]] walks you through setting up and running the [[MAKER Web Annotation Service]].
+
 
+
 
+
__TOC__
+
 
+
 
+
= Caveats =
+
 
+
{{TutorialCaveats}}
+
  
=Maker Overview, Installation, and Basic Configuration for Annotating Genomic Sequence=
 
The first half of this page describes the basics of [[MAKER]] - the easy-to-use genome [[annotation]] pipeline.
 
  
<span style="font-size: 80%">[[Media:2009SumSchMAKER.pdf|See accompanying MAKER presentation (~16 MB)]].</span>
+
==Understanding MWAS==
 +
The first half of this page gives general background on genome annotation and describes validation data for the [[MAKER]] Web Annotation Service, MWAS.  The standalone annotation pipeline MAKER is at the heart of MWAS, and MWAS has been configured to present the user with configuration options that match those of the command line program MAKER as closely as possible.
  
  
==Introduction to Genome Annotation==
+
===Introduction to Genome Annotation===
===What Are Annotations?===
+
====What Are Annotations?====
 
Annotations are descriptions of different features of the genome, and they can be either structural or functional in nature.
  
Line 39: Line 23:
 
*Protein homology
  
===Importance of Genome Annotations===
+
====Importance of Genome Annotations====
 
Why should the average biologist care about genome annotations?  Genome sequence by itself is not very useful.  The main question when any genome is sequenced is, "where are the genes?"  To identify the genes we need to annotate the genome.  And while most researchers probably don't give annotations a lot of thought, they use them every day.
  
Line 54: Line 38:
 
Quality control and evidence management are therefore essential components to any annotation process.
  
===Effect of [[Next Generation Sequencing]] on the Annotation Process===
+
====Effect of [[Next Generation Sequencing]] on the Annotation Process====
It’s generally accepted that within the next few years it will be possible to sequence even human-sized genomes for as little as $1,000 and in a short time frame.  Pacific Biosciences is claiming they will be able to sequence a human-sized genome in fifteen minutes by 2013.  If the hype is to be believed, then whole genome sequencing will become ''routine'' for even small labs in the not-so-distant future.  Unfortunately, however, advances in annotation technology have not kept pace with genome sequencing, and annotation is rapidly becoming a major bottleneck affecting modern genomics research.
+
It’s generally accepted that within the next few years it will be possible to sequence even human-sized genomes for as little as $1,000 and in a short time frame.  When these expectations finally become reality, whole genome sequencing will likely become ''routine'' for even small laboratories.  Unfortunately, advances in annotation technology have not kept pace with genome sequencing, and annotation is rapidly becoming a major bottleneck affecting modern genomics research.
  
 
For example:
*As of February 2009, 173 eukaryotic genomes were fully sequenced yet unpublished (an ever-growing backlog).
+
*As of October 2009, 222 eukaryotic genomes were fully sequenced yet unpublished (an ever-growing backlog).
*Currently there are over 1,000 eukaryotic genome projects underway. Assuming 10,000 genes per genome, that’s 10,000,000 new annotations (with this many new annotations, quality control and maintenance become an issue).
+
*Currently ''(Jan 2010)'' there are over 900 eukaryotic genome projects underway. Assuming 10,000 genes per genome, that’s 9,000,000 new annotations (with this many new annotations, quality control and maintenance become an issue).
 
*While there are organizations dedicated to producing and distributing genome annotations (e.g. ENSEMBL and VectorBase), the sheer volume of newly sequenced genomes exceeds both their capacity and stated purview.
 
*Many small research groups (which often lack bioinformatics experience) must therefore confront the difficulties associated with genome annotation on their own.
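
The arithmetic behind the annotation estimate in the list above can be checked in the shell. This is only a sketch: the function name <tt>estimate_annotations</tt> is made up for illustration, and the figure of 10,000 genes per genome is the rough assumption already stated in the list.

```shell
#!/bin/sh
# Rough estimate of new annotations: projects * genes per genome.
# estimate_annotations is a hypothetical helper, not part of MAKER or MWAS.
estimate_annotations() {
    projects=$1
    genes_per_genome=$2
    echo $((projects * genes_per_genome))
}

# Figures taken from the list above (January 2010 project count).
estimate_annotations 900 10000
```

Changing either argument shows how quickly the number of annotations to maintain grows with the number of genome projects.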
  
  
 +
The MAKER Web Annotation Service is a tool to assist research groups in converting the mountain of genomic data provided by next generation sequencing technologies into a usable resource. For larger datasets, research groups can use a local installation of the annotation pipeline MAKER.
  
MAKER is an easy-to-use annotation pipeline designed to help smaller research groups convert the mountain of genomic data provided by next generation sequencing technologies into a usable resource.
+
===What does MWAS do?===
 
+
==MAKER Overview==
+
[[Image:MAKERLogo.png]]
+
 
+
The easy-to-use  annotation pipeline.
+
 
+
{| class="wikitable"
+
! User Requirements:
+
| Can be run by a single individual with little bioinformatics experience
+
|-
+
! System Requirements:
+
| Can run on laptop or desktop computers running [[:Category:Linux|Linux]] or Mac OS X
+
|-
+
! Program Output:
+
| Output is compatible with popular GMOD annotation tools like [[Apollo]] and [[GBrowse]]
+
|-
+
! Availability:
+
| Free open source application (for academic use)
+
|}
+
 
+
 
+
===What does MAKER do?===
+
 
*Identifies and masks out repeat elements
 
*Aligns ESTs to the genome
Line 96: Line 59:
  
  
{| cellpadding="5px"
+
[[File:Apollo_view.jpg|thumb|MAKER generated annotations, shown in [[Apollo]].]]
|-
+
| valign="top" style="border: 1px solid gray" | [[Image:Apollo_view.jpg | border ]]
+
|-
+
|valign="top" align="center" |MAKER generated annotations, shown in [[Apollo]].
+
|}
+
  
  
===What sets MAKER apart from tools (''ab initio'' gene predictors etc.)?===
+
===What sets MAKER and MWAS apart from other tools (''ab initio'' gene predictors etc.)?===
 
MAKER is an annotation pipeline, not a gene predictor.  MAKER does not predict genes; rather, it leverages existing software tools (some of which are gene predictors) and integrates their output to produce what MAKER believes to be the best possible gene model for a given location based on evidence alignments.
  
Line 116: Line 74:
  
  
===Emerging vs. Model Genomes===
+
====Emerging vs. Model Genomes====
 
Emerging model organism genomes each come with their own set of issues that are not necessarily found in classic model genomes.  These include difficulties associated with repeat identification, gene finder training, and other complex analyses.  Unfortunately, emerging model organisms are often studied by very small research communities that lack the resources and bioinformatics experience necessary to tackle these issues.
  
Line 146: Line 104:
 
|}
 
|}
  
===Comparison of Algorithm Performance on Model vs. Emerging Genomes===
+
====Comparison of Algorithm Performance on Model vs. Emerging Genomes====
 
If you have ever looked at comparisons of gene predictor performance on classic model organisms such as ''C. elegans'' you would conclude that ''ab initio'' gene predictors match or even outperform state-of-the-art annotation pipelines, and the truth is that, with enough training data, they do.  However, it is important to keep in mind that ''ab initio'' gene predictors have been specifically optimized to perform well on model organisms such as ''Drosophila'' and ''C. elegans'', organisms for which we have large amounts of pre-existing data to both train and tweak the prediction parameters.
  
Line 198: Line 156:
 
What about emerging model organisms for which little data is available?  Gene prediction in classic model organisms is relatively simple because there are already a large number of experimentally determined and verified gene models, but with emerging model organisms, we are lucky to have a handful of gene models to train with.  As a result ''ab initio'' gene predictors generally perform very poorly on emerging genomes.
  
{|
+
[[File:Maker_performance.jpg|thumb|848px|MAKER's Performance on the ''S. mediterranea'' Emerging Model Organism Genome. Pfam domain content of gene models determined using rpsblast]]
| |
+
[[Image:Maker_performance.jpg|thumb|848px|'''Figure:''' MAKER's Performance on the ''S. mediterranea'' Emerging Model Organism Genome. Pfam domain content of gene models determined using rpsblast]]
+
|}
+
  
  
Line 210: Line 165:
 
*Annotations can be automatically updated with new evidence by simply passing existing annotation sets back into the pipeline
  
==Installation==
+
==Getting Started with MWAS==
 +
====Registration====
 +
MWAS is free to all users for academic use and has no login requirement, but registration is recommended as it allows for easier file and job management and registered users are allowed to upload more sequence.
  
===Prerequisites===
+
===Running MWAS with Example Data===
*[http://www.perl.org/ Perl] 5.8.0 or Higher
+
*[http://www.bioperl.org/ BioPerl] 1.6 or higher
+
*[http://homepage.mac.com/iankorf/ SNAP] version 2009-02-03  or higher
+
*[http://www.repeatmasker.org/ RepeatMasker] 3.1.6  or higher
+
*[http://www.ebi.ac.uk/~guy/exonerate/ Exonerate] 1.4  or higher
+
  
 +
MWAS comes with some example files to familiarize the user with how to run an annotation job. You can pre-load the fields for an example job by selecting one of the examples from the drop down menu on the "New Job" page and then selecting "Load".  This will fill out options on the "New Job" form for you.  Review the options carefully, and then submit the example job for execution by pressing the "Submit to Queue" button at the bottom of the page.
  
You must also install one of the following:
+
Start with the "Drosophila melanogaster : DPP example". This will load the region of the ''D. melanogaster'' genome encoding decapentaplegic along with cDNA and protein evidence overlapping the region. Select "Drosophila melanogaster : DPP example" from the drop-down example menu, then select "Load" to fill in the form.
*[http://blast.wustl.edu/ WU-BLAST] 2.0 or higher (Now [http://www.advbiocomp.com/ AB-BLAST])
+
*[http://www.ncbi.nlm.nih.gov/Ftp/ NCBI BLAST] 2.2.X or higher
+
  
 +
If you scroll down through the form, you will notice that the genome file, EST file, protein file, and prediction method sections have been filled out for you.  Click on "Submit to Queue" to start the job.
  
Optional Components:
+
You should be redirected to the MWAS start page upon submission, and the job you have submitted should be visible in the job status section. Click "Refresh Job Status" to update the run status of your job. Within a few moments, your job will complete, at which point you can view the results.
*[http://augustus.gobics.de/ Augustus] 2.0 or higher
+
*[http://exon.biology.gatech.edu/ GeneMark-ES] 2.3a or higher
+
*[http://www.softberry.com/ FGENESH] 2.6 or higher
+
  
 +
Click on "View Results".  You can now download the results for local analysis on your own system, or you can click on "View in Apollo" to see gene models loaded directly in the Apollo genome browser.  This option will install a Java Web Start version of Apollo if it is not already installed.  You can also view summary statistics of the annotation from the Sequence Ontology's SOBA tool by clicking on "SOBA Statistics".
  
Required for optional MPI support:
 
*[http://www.mcs.anl.gov/research/projects/mpich2/ MPICH2]
 
  
  
 +
===Details of What is Going on Inside of MWAS===
  
===The MAKER Package===
+
====Repeat Masking====
Because of the number of prerequisites, we will not cover the details of installing these other programs; they have already been installed for you.  But even though I did pre-install most programs for you, I'm still going to have you perform basic post-installation configuration, so let's get started.
+
MWAS runs MAKER internally, and the first step in MAKER is repeat masking, but why do we need to do this?  Repetitive elements can make up a significant portion of the genome.  Some of these repeats are simple/low-complexity repeats where you have runs of C's or G's or maybe even something like AAGGAAGGAAGG.  Other repeats are more complex, ''i.e.'' transposable elements.  These high-complexity repeats often encode real proteins like reverse transcriptase or even Gag, Pol, and Env viral proteins.  Because they encode real proteins, they can play havoc with ''ab initio'' gene predictors.  For example, a transposable element that occurs next to or even within the intron of a real protein-encoding gene might cause a gene predictor to include extra exons as part of a gene model, sequence which really only belongs to the transposable element and not to the coding sequence of the gene.  You will also get hundreds of instances where identical transposable element proteins get annotated as being part of an organism's proteome.  In addition to these issues, low-complexity repeat regions can align with high statistical significance to low-complexity protein regions, creating a false sense of homology throughout the genome.  To avoid these complications it is convenient to identify and mask any repeat elements before doing other analyses.
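
To make the low-complexity case concrete, here is a toy sketch that hard-masks tandem AAGG repeats like the one mentioned above. It is not how MAKER or RepeatMasker find repeats; real repeat masking relies on dedicated tools and repeat libraries. The helper name <tt>hard_mask_aagg</tt> and the threshold of three tandem units are assumptions chosen purely for illustration.

```shell
#!/bin/sh
# Toy illustration only: real repeat masking is done by dedicated tools,
# not by a regular expression. hard_mask_aagg is a hypothetical helper.
# It replaces every run of three or more tandem AAGG units with the same
# number of N characters, so sequence coordinates are preserved.
hard_mask_aagg() {
    awk '
        /^>/ { print; next }               # keep FASTA header lines as they are
        {
            seq = $0
            out = ""
            while (match(seq, /AAGGAAGG(AAGG)+/)) {
                mask = sprintf("%" RLENGTH "s", "")   # RLENGTH spaces
                gsub(/ /, "N", mask)                  # turn them into N
                out = out substr(seq, 1, RSTART - 1) mask
                seq = substr(seq, RSTART + RLENGTH)
            }
            print out seq
        }
    '
}

# A short made-up sequence with four tandem AAGG units in the middle.
echo "ATGCAAGGAAGGAAGGAAGGTTCA" | hard_mask_aagg
```

Because the masked region keeps its original length, any gene model or alignment coordinates computed on the masked sequence still refer to the same positions in the unmasked genome.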
 
+
 
+
MAKER can be downloaded from:
+
*http://www.yandell-lab.org/ - but don't do it
+
 
+
 
+
To keep everyone from hitting the server at once though, I placed a tarball in the <tt>~/software/maker/</tt> directory.  Let's unpack this to the directory <tt>/usr/local/</tt>.
+
cd /usr/local
+
sudo tar -zxvf ~/software/maker/maker.tar.gz
+
cd maker
+
ls -l
+
 
+
 
+
You should now see the following:
+
drwxr-xr-x  2 gmod gmod  4096 2009-03-25 13:24 Apollo
+
drwxr-xr-x  3 gmod gmod  4096 2009-07-12 22:50 bin
+
drwxr-xr-x  3 gmod gmod  4096 2009-07-12 23:37 data
+
-rw-r--r--  1 gmod gmod  7746 2009-07-12 22:50 INSTALL
+
drwxr-xr-x 18 gmod gmod  4096 2009-07-12 22:50 lib
+
drwxr-xr-x  3 gmod gmod  4096 2009-07-12 22:50 MPI
+
drwxr-xr-x  7 gmod gmod  4096 2009-07-12 23:07 perl
+
-rw-r--r--  1 gmod gmod 18653 2009-07-12 22:50 README
+
 
+
 
+
There are two files in particular that you would want to look at when installing MAKER: <tt>INSTALL</tt> and <tt>README</tt>.  <tt>INSTALL</tt> gives a brief overview of MAKER and pre-requisite installation.  Let's take a look at this.
+
less INSTALL
+
 
+
 
+
You can see there is a step by step guide for installing pre-requisites as well as MAKER.  Since the pre-requisites are already installed, jump to the MAKER installation section (press space bar to scroll down).
+
 
+
7.  Install MAKER.  Download from http://www.yandell-lab.org
+
 
+
  a.  Unpack the MAKER tar file into the directory of your choice (i.e.
+
      /usr/local).
+
  b.  Change to the directory maker/perl and run Install.PL by typing:
+
      perl Install.PL
+
  c.  Now add the following to your .bash_profile if you haven't already:
+
        export WUBLASTFILTER="where_wublast_is/filter"
+
        export WUBLASTMAT="where_wublast_is/matrix"
+
        export ZOE="where_snap_is/Zoe"
+
        export AUGUSTUS_CONFIG_PATH="where_augustus_is/config"
+
  d.  Add the location where you installed MAKER to your PATH variable in
+
      .bash_profile (i.e. export PATH="/usr/local/maker/bin:$PATH").
+
  e.  You can now run a test of MAKER by following the instructions in the MAKER
+
      README file.
+
 
+
 
+
  See the README file for details on installing mpi_maker
+
 
+
According to the documentation we need to run the <tt>Install.PL</tt> script.
+
cd perl
+
sudo perl Install.PL
+
 
+
 
+
Now we're going to need to add a few entries to your user profile.  So let's open it in a [[Linux Text Editors|text editor]] (I use emacs, you can use whatever you want).
+
emacs ~/.profile
+
{{TextEditorLink|emacs}}
+
 
+
Add the following to your user profile.
+
 
+
For bash:
+
PATH=$PATH:/usr/local/NCBI_blast/bin
+
PATH=$PATH:/usr/local/RepeatMasker
+
PATH=$PATH:/usr/local/exonerate/bin
+
PATH=$PATH:/usr/local/snap
+
PATH=$PATH:/usr/local/augustus/bin
+
PATH=$PATH:/usr/local/gmes
+
PATH=$PATH:/usr/local/maker/bin
+
export PATH
+
 
+
export ZOE=/usr/local/snap
+
export AUGUSTUS_CONFIG_PATH=/usr/local/augustus/config
+
 
+
 
+
Now reload your profile.
+
source ~/.profile
+
 
+
 
+
MAKER should now be installed.  Let's test the executable.  We should see the usage statement.
+
maker -help
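
If the usage statement does not appear, a quick way to narrow down the problem is to check which programs the shell can actually find. The sketch below is an illustration only: <tt>check_on_path</tt> is a made-up helper, and the executable names passed to it are assumptions that may differ on your installation.

```shell
#!/bin/sh
# check_on_path is a hypothetical helper: it reports which of the given
# program names the shell can find on PATH, and returns non-zero if any
# are missing.
check_on_path() {
    missing=0
    for prog in "$@"; do
        if command -v "$prog" >/dev/null 2>&1; then
            echo "found:   $prog"
        else
            echo "missing: $prog"
            missing=1
        fi
    done
    return "$missing"
}

# Executable names below are assumptions; adjust them to your installation.
check_on_path maker RepeatMasker exonerate snap augustus ||
    echo "Fix the PATH entries in ~/.profile and reload it."
```

Anything reported as missing points to a PATH entry in the profile that is absent or misspelled.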
+
 
+
==Getting Started with MAKER==
+
===Running MAKER with Example Data===
+
MAKER comes with some example files to test the installation and to familiarize the user with how to run MAKER.  The example files are found in the <tt>maker/data</tt> directory.
+
ls -l /usr/local/maker/data
+
 
+
-rw-r--r-- 1 gmod gmod    32712 2009-03-25 13:24 dpp_contig.fasta
+
-rw-r--r-- 1 gmod gmod    3045 2009-03-25 13:24 dpp_proteins.fasta
+
-rw-r--r-- 1 gmod gmod    19138 2009-03-25 13:24 dpp_transcripts.fasta
+
-rw-r--r-- 1 gmod gmod 19744232 2009-07-12 22:50 te_proteins.fasta
+
 
+
For convenience we are going to copy these files before running MAKER.  First we need to make a new directory that will hold all MAKER input and output files.
+
mkdir ~/maker_run1
+
cd ~/maker_run1
+
 
+
Now copy the example files to the new directory.
+
cp /usr/local/maker/data/dpp* ~/maker_run1
+
 
+
 
+
Next we need to tell MAKER all the details about how we want the annotation process to proceed.  Because there can be many variables and options involved in annotation, command line options would be too numerous and cumbersome.  Instead MAKER uses a set of configuration files which guide each run.  You can create a set of generic configuration files in the current working directory by typing the following.
+
maker -CTL
+
 
+
 
+
This creates three files.
+
{|
+
! <tt>maker_exe.ctl</tt>
+
| contains the path information for needed executables.
+
|-
+
! <tt>maker_bopts.ctl</tt>
+
| contains filtering statistics for BLAST and Exonerate
+
|-
+
! <tt>maker_opts.ctl</tt>
+
| contains all other information for MAKER, including the location of the input genome file.
+
|}
+
 
+
Control files are run-specific, and separate control files will need to be built for each genome given to MAKER. MAKER will look for control files in the current working directory, so it is recommended that MAKER be run in a separate directory containing unique control files for each genome.
+
 
+
Let's take a look at the <tt>maker_exe.ctl</tt> file.
+
emacs maker_exe.ctl
+
{{TextEditorLink|emacs}}
+
 
+
You will see the names of a number of MAKER-supported executables as well as the paths to their locations.  If you followed the installation instructions correctly, including the instructions for installing pre-requisite programs, all executable paths should show up automatically for you.  However, if the location of any of the executables is not set in your PATH environment variable, as per the installation instructions, you will have to add it manually to the <tt>maker_exe.ctl</tt> file every time you run MAKER.
+
 
+
Lines in the MAKER control files have the format key:value, with no spaces before or after the colon (:).  If the value is a file name, you can use relative paths and environment variables, ''e.g.'' <tt>snap:$HOME/snap</tt>.  Note that in all control files, the comments written to help users begin with a pound sign (#).  The option names before the colon cannot be changed.
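
The key:value layout described above is simple enough to read with standard tools. The following is a minimal sketch, assuming only the format stated in this section; <tt>ctl_get</tt> is a made-up helper, not a MAKER command, and the demo file contents (including the SNAP path) are invented for the example.

```shell
#!/bin/sh
# ctl_get is a hypothetical helper, not part of MAKER. It prints the value
# stored under a key in a key:value control file, skipping '#' comments.
ctl_get() {
    key=$1
    file=$2
    awk -v key="$key" '
        /^#/ { next }                      # whole-line comment
        {
            line = $0
            sub(/[ \t]*#.*$/, "", line)    # strip trailing comment
            i = index(line, ":")
            if (i > 0 && substr(line, 1, i - 1) == key) {
                print substr(line, i + 1)
                exit
            }
        }
    ' "$file"
}

# Demo on a throwaway file that mimics the format described above.
tmp=$(mktemp)
printf '%s\n' '# MAKER executables' 'snap:/usr/local/snap/snap #path to SNAP' > "$tmp"
ctl_get snap "$tmp"
rm -f "$tmp"
```

A check like this can be handy for confirming which executable path or input file a run directory is really configured to use before starting a long job.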
+
 
+
Now let's take a look at the <tt>maker_bopts.ctl</tt> file.
+
emacs maker_bopts.ctl
+
{{TextEditorLink|emacs}}
+
 
+
In this file you will find values you can edit for downstream filtering of BLAST and Exonerate alignments.  At the very top of the file you will see that I have the option to tell MAKER whether I prefer to use WU-BLAST or NCBI-BLAST.  We want to set this to NCBI-BLAST, since that is what we have installed.  We can just leave the remaining values as the default.
+
blast_type:ncbi
+
 
+
Now let's take a look at the <tt>maker_opts.ctl</tt> file.
+
emacs maker_opts.ctl
+
{{TextEditorLink|emacs}}
+
 
+
This is the primary configuration file for MAKER specific options.  Here we need to set the location of the genome, EST, and protein input files we will be using.  These come from the supplied example files.  We also need to set repeat masking options, as well as a number of other configurations.  We'll discuss these options in more detail later on, but for now just adjust the following values.
+
genome:dpp_contig.fasta
+
est:dpp_transcripts.fasta
+
protein:dpp_proteins.fasta
+
snaphmm:fly
+
predictor:snap
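
If you prefer not to edit the file by hand, the same values can be set from a script. This is a sketch under the assumption that each option sits on its own line beginning with the key and a colon; <tt>ctl_set</tt> is a made-up helper rather than anything MAKER provides, and the demo file contents are invented to mimic that layout. Note that it replaces the whole line, so any trailing comment on that line is dropped.

```shell
#!/bin/sh
# ctl_set is a hypothetical helper, not part of MAKER. It rewrites the line
# that starts with "key:" so it reads "key:value", leaving other lines alone.
# A temporary file is used instead of "sed -i" because the in-place flag
# differs between GNU and BSD sed.
ctl_set() {
    key=$1
    value=$2
    file=$3
    tmp=$(mktemp)
    awk -v key="$key" -v value="$value" '
        index($0, key ":") == 1 { print key ":" value; next }
        { print }
    ' "$file" > "$tmp" && mv "$tmp" "$file"
}

# Demo on a throwaway file; in practice the target would be maker_opts.ctl.
demo=$(mktemp)
printf '%s\n' 'genome: #genome sequence file' 'est: #EST file' > "$demo"
ctl_set genome dpp_contig.fasta "$demo"
ctl_set est dpp_transcripts.fasta "$demo"
cat "$demo"
rm -f "$demo"
```

Scripting the edits makes it easier to keep one run directory per genome with reproducible settings.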
+
 
+
 
+
Now let's run MAKER.
+
maker maker_exe.ctl maker_opts.ctl maker_bopts.ctl
+
 
+
 
+
You should now see a large amount of status information flowing past your screen.  If you don't want to see this you can run MAKER with the <tt>-q</tt> option for "quiet" on future runs.
+
 
+
==Details of What is Going on Inside of MAKER==
+
 
+
===Repeat Masking===
+
The first step in MAKER is repeat masking, but why do we need to do this?  Repetitive elements can make up a significant portion of the genome.  Some of these repeats are simple/low-complexity repeats where you have runs of C's or G's or maybe even something like AAGGAAGGAAGG.  Other repeats are more complex, ''i.e.'' transposable elements.  These high-complexity repeats often encode real proteins like reverse transcriptase or even Gag, Pol, and Env viral proteins.  Because they encode real proteins, they can play havoc with ''ab initio'' gene predictors.  For example, a transposable element that occurs next to or even within the intron of a real protein-encoding gene might cause a gene predictor to include extra exons as part of a gene model, sequence which really only belongs to the transposable element and not to the coding sequence of the gene.  You will also get hundreds of instances where identical transposable element proteins get annotated as being part of an organism's proteome.  In addition to these issues, low-complexity repeat regions can align with high statistical significance to low-complexity protein regions, creating a false sense of homology throughout the genome.  To avoid these complications it is convenient to identify and mask any repeat elements before doing other analyses.
+
  
  
Line 404: Line 199:
  
  
Now the idea of masking out sequence might seem on the surface like we're losing a lot of information, and it is true that there can be proteins that have integrated repeats into their structure, so repeat masking will affect our ability to annotate these proteins.  However, these proteins are rare and the number of gene models and homology alignments improved by this step far exceeds the few gene models that may be negatively affected.  If repeat masking worries you, you do have the option to run ''ab initio'' gene predictors on both the masked and unmasked sequence.  You do this by setting <tt>unmask:1</tt> in the <tt>maker_opts.ctl</tt> configuration file.
+
Now the idea of masking out sequence might seem on the surface like we're losing a lot of information, and it is true that there can be proteins that have integrated repeats into their structure, so repeat masking will affect our ability to annotate these proteins.  However, these proteins are rare and the number of gene models and homology alignments improved by this step far exceed the few gene models that may be negatively affected.
  
===''Ab Initio'' Gene Prediction===
+
====''Ab Initio'' Gene Prediction====
 
Following repeat masking, MAKER runs ''ab initio'' gene predictors specified by the user to produce preliminary gene models.  ''Ab initio'' gene predictors produce gene predictions based on underlying mathematical models describing patterns of intron/exon structure and consensus start signals.  Gene models are not produced by directly using experimental evidence.  Because the patterns of gene structure are going to differ from organism to organism, you must train gene predictors before you can use them.  I will discuss how to do this later on.
  
Line 414: Line 209:
 
*Augustus
 
*GeneMark
*FGENESH
+
*FGENESH (Disabled on public MWAS site)
  
  
You must specify in the <tt>maker_opts.ctl</tt> file the training parameters file you want to use when running each of these algorithms.
+
You must specify the HMM files you want to use when running each of these algorithms.
  
 
+
====EST and Protein Evidence Alignment====
===EST and Protein Evidence Alignment===
+
 
A simple way to indicate if a sequence region is likely associated with a gene is to identify (A) if the region is actively being transcribed or (B) if the region has homology to a known protein.  This can be done by aligning Expressed Sequence Tags (ESTs) and proteins to the genome using alignment algorithms.
 
*ESTs are sequences derived from a cDNA library.  Because of the difficulties associated with working with mRNA and depending on how the cDNA library was prepared, EST databases usually represent bits and pieces of transcribed mRNAs with only a few full length transcripts.  MAKER aligns these sequences to the genome using BLASTN.  If ESTs from the organism being annotated are unavailable or sparse, you can use ESTs from a closely related organism.  However, ESTs from closely related organisms are unlikely to align using BLASTN since nucleotide sequences can diverge quite rapidly.  For these ESTs, MAKER uses TBLASTX to align them in protein space.
Line 426: Line 220:
  
  
Remember now that we are aligning against the repeat-masked genomic sequence.  How is this going to affect our alignments?  For one thing we won't be able to align against low-complexity regions.  Some real proteins contain low-complexity regions and it would be nice to identify those, but if I let anything align to a low-complexity region, then I will get spurious alignments all over the genome.  Wouldn't it be nice if there was a way to allow BLAST to extend alignments through low-complexity regions, but only if there is already an alignment somewhere else?  You can do this with soft-masking.  Recall that soft-masking uses lower-case letters to mask sequence without losing the sequence information.  BLAST allows you to use soft-masking to keep alignments from seeding in low-complexity regions, while still allowing them to extend through those regions.  This of course will allow some of the spurious alignments you were trying to avoid, but overall you still end up suppressing the majority of poor alignments while letting through enough real alignments to justify the cost.  If this behavior bothers you, you can turn it off by setting <tt>softmask:0</tt> in the <tt>maker_bopts.ctl</tt> file.
+
Remember now that we are aligning against the repeat-masked genomic sequence.  How is this going to affect our alignments?  For one thing we won't be able to align against low-complexity regions.  Some real proteins contain low-complexity regions and it would be nice to identify those, but if I let anything align to a low-complexity region, then I will get spurious alignments all over the genome.  Wouldn't it be nice if there was a way to allow BLAST to extend alignments through low-complexity regions, but only if there is already an alignment somewhere else?  You can do this with soft-masking.  Recall that soft-masking uses lower-case letters to mask sequence without losing the sequence information.  BLAST allows you to use soft-masking to keep alignments from seeding in low-complexity regions, while still allowing them to extend through those regions.  This of course will allow some of the spurious alignments you were trying to avoid, but overall you still end up suppressing the majority of poor alignments while letting through enough real alignments to justify the cost.
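
The difference between soft-masking and hard-masking is easy to see on a toy sequence. The sketch below assumes the usual convention described above (lower case means masked) and uses two made-up helpers, <tt>soft_to_hard</tt> and <tt>count_soft_masked</tt>; neither is part of MAKER or BLAST, and only the bases a, c, g, t and n are handled.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; neither is part of MAKER or BLAST.

# soft_to_hard: turn soft-masked (lower case) bases into N, leaving FASTA
# header lines (starting with ">") untouched. The sequence information in
# the masked region is lost after this step.
soft_to_hard() {
    awk '/^>/ { print; next } { gsub(/[acgtn]/, "N"); print }'
}

# count_soft_masked: number of lower-case bases in the sequence lines.
count_soft_masked() {
    awk '!/^>/ { n += gsub(/[acgtn]/, "") } END { print n + 0 }'
}

# A made-up soft-masked record: the lower-case run is the masked repeat.
printf '%s\n' '>toy_contig' 'ATGCaaggaaggaaggTTCA' | soft_to_hard
printf '%s\n' '>toy_contig' 'ATGCaaggaaggaaggTTCA' | count_soft_masked
```

With the soft-masked form an aligner can still read the bases under the mask when extending an alignment, which is exactly what the hard-masked form prevents.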
 
+
  
===Polishing Evidence Alignments===
+
====Polishing Evidence Alignments====
 
Because of oddities associated with how BLAST statistics work, BLAST alignments are not as informative as they could be.  BLAST will align regions anywhere it can, even if the algorithm aligns regions out of order, with multiple overlapping alignments in the exact same region, or with slight overhangs around splice sites.
 
Because of oddities associated with how BLAST statistics work, BLAST alignments are not as informative as they could be.  BLAST will align regions anywhere it can, even if the algorithm aligns regions out of order, with multiple overlapping alignments in the exact same region, or with slight overhangs around splice sites.
  
Line 439: Line 232:
  
  
===Integrating Evidence to Synthesize Final Annotations===
+
====Integrating Evidence to Synthesize Final Annotations====
 
Once you have ''ab initio'' predictions, EST alignments, and protein alignments you can integrate this evidence to produce even better gene predictions.  MAKER does this by "talking" to the gene prediction programs.  MAKER takes all the evidence, generates "hints" to where splice sites and protein coding regions are located, and then passes these "hints" to programs that will accept them.
 
Once you have ''ab initio'' predictions, EST alignments, and protein alignments you can integrate this evidence to produce even better gene predictions.  MAKER does this by "talking" to the gene prediction programs.  MAKER takes all the evidence, generates "hints" to where splice sites and protein coding regions are located, and then passes these "hints" to programs that will accept them.
  
Line 453: Line 246:
  
  
==MAKER's Output==
 
If you look in the current working directory, you will see that MAKER has created an output directory called <tt>dpp_contig.maker.output</tt>.  The name of the output directory is based off of the input genomic sequence file, which in this case was <tt>dpp_contig.fasta</tt>.
 
  
  
Now let's see what's inside the output directory.
+
===Running MWAS with your Own Data===
  cd dpp_contig.maker.output
+
When using your own data, you need to tell MWAS all the details about how you want the annotation process to proceed.  Because there can be many variables and options involved in annotation you will need to review each option carefully. At the very least you should provide a genome sequence file, an EST sequence file, and a protein homology sequence file for new annotation jobs.
  ls -l
+
 
 +
===MWAS Job Configuration===
 +
====Basic Input Files====
 +
All the basic input files for MWAS should be in fasta format.
 +
 
 +
*''genome'' - Genomic sequence file
 +
*''est'' - ESTs from the same organism or from a very closely related organism (e.g. chimpanzee to human).  These are aligned first via BLASTN with very strict filtering, so any sequence divergence can prevent the alignment.
 +
*''altest'' - These are ESTs from other closely related organisms (i.e. mouse to human).  They are aligned via TBLASTX in protein space, so greater sequence divergence is permitted.
 +
*''protein'' - proteins from the same or other organisms.  These are aligned via BLASTX against the genome.  Proteins that align to a region will not necessarily be orthologous or paralogous.  The alignment may just be based on short regions such as a shared domain.  You may also get alignments to pseudogenes.  Polishing BLASTX hits with Exonerate helps identify what are likely true paralogs and orthologs.
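Before uploading, it can help to confirm that each input really is in fasta format. The following is a minimal sketch, not an MWAS tool: the file names and the <tt>check_fasta</tt> helper are hypothetical, and the check only looks at whether the file starts with a <tt>&gt;</tt> header line and how many records it contains.

```shell
# Hypothetical file names standing in for the genome, EST and protein uploads.
printf '>contig1\nACGTACGT\n>contig2\nGGCCGGCC\n' > genome_demo.fasta
printf '>est1\nACGTAC\n' > est_demo.fasta
printf '>prot1\nMKVLAAGI\n' > protein_demo.fasta

# Report the number of records in a file, or complain if it does not start with '>'.
check_fasta() {
    first=$(head -c 1 "$1")
    if [ "$first" != ">" ]; then
        echo "$1: does not look like fasta"
        return 1
    fi
    echo "$1: $(grep -c '^>' "$1") sequence(s)"
}

for f in genome_demo.fasta est_demo.fasta protein_demo.fasta; do
    check_fasta "$f"
done
```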
 +
 
 +
 
 +
====Repeat Masking Options====
 +
Repeat masking is important for improving gene predictor performance and avoiding protein alignments to what are likely just transposons.  You should also expect a certain amount of genomic contamination in the EST database, and much of this contamination maps back to repeat regions.  By repeat masking we can avoid issues with all types of input data.
 +
 
 +
 
 +
*''RepeatMasker'' - Performs repeat masking using the RepBase libraries.
 +
*''RepeatRunner'' - This is a fasta file of transposon and virus related proteins. The server provides an internal database to use by default.
 +
*Users can also supply a fasta file of species specific nucleotide repeats or a GFF3 file of pre-defined repeat regions. A species specific repeat database can be built using programs like PILER and uploaded for use with MAKER.
 +
 
 +
 
 +
====Gene Prediction Options====
 +
Gene prediction options affect the final gene annotations more than any other option type. This brings up the point that electronically produced gene annotations will only be as good as the gene predictions they are based on.
 +
 
 +
 
 +
*''Predictor Options'' - Tell MWAS which programs to use when generating gene models.
 +
**SNAP
 +
**Augustus
 +
**GeneMark
 +
**Est2Genome - Allows high quality spliced Exonerate EST alignments to become gene annotations.  This only happens when there is no gene prediction overlapping the region.  This is useful for generating gene annotations in the absence of a trained gene predictor.
 +
**Protein2Genome - Used only for Prokaryotic genomes. Will try and build gene models based solely on the presence of open reading frames and protein alignments to other species.
 +
**User supplied gene predictions - These are gene predictions in GFF3 format from any source you have available to you. They will be treated the same as any gene predictions derived from MWAS supported sources.
 +
**User supplied gene models - These are pre-existing gene models from the same assembly as the contigs being annotated.  They can be integrated and automatically updated by MAKER to reflect new evidence (i.e. add UTR etc.).  MAKER can also pull names forward from these pre-existing gene models onto new updated genome annotations.
 +
 
 +
====Other MAKER Options====
 +
*Sets the minimum length a contig must have or else it will be skipped.
 +
*Sets the minimum length a predicted protein must have (in amino acids) to be annotated.
 +
*Sets the expected max intron size for evidence alignments.
 +
*Tells MAKER to consider single exon EST evidence when generating annotations.  Single exon ESTs are more likely to be genomic contamination.
 +
*Sets the minimum length required for single exon ESTs if 'single_exon' is enabled.
 +
 
 +
===MWAS Results===
 +
The results provided to the user from MWAS can either be downloaded or directly viewed online using a Java Web Start version of the Apollo genome annotation curation tool.
 +
 
 +
If you choose to download your data you will be presented with a tarball that when unpacked will produce an output directory called something like <tt>2434.maker.output</tt>.  The name of the output directory is based off of the job id assigned to your sequence file.
  
  
You should now see a list of directories and files created by MAKER.
+
When you examine the contents of this directory, you should see a list of directories and files created by MAKER.
  drwxr-xr-x 3 gmod gmod 4096 2009-07-12 23:23 dpp_contig_datastore
+
  drwxr-xr-x 3 gmod gmod 4096 2009-07-12 23:23 2434_datastore
  -rw-r--r-- 1 gmod gmod  135 2009-07-12 23:27 dpp_contig_master_datastore_index.log
+
  -rw-r--r-- 1 gmod gmod  135 2009-07-12 23:27 2434_master_datastore_index.log
 
  -rw-r--r-- 1 gmod gmod 1579 2009-07-12 23:23 maker_bopts.log
 
  -rw-r--r-- 1 gmod gmod 1579 2009-07-12 23:23 maker_bopts.log
 
  -rw-r--r-- 1 gmod gmod 1250 2009-07-12 23:23 maker_exe.log
 
  -rw-r--r-- 1 gmod gmod 1250 2009-07-12 23:23 maker_exe.log
Line 472: Line 306:
 
*The <tt>maker_opt.log</tt>, <tt>maker_exe.log</tt>, and <tt>maker_bopts.log</tt> files are logs of the control files used for this run of MAKER.
 
*The <tt>maker_opt.log</tt>, <tt>maker_exe.log</tt>, and <tt>maker_bopts.log</tt> files are logs of the control files used for this run of MAKER.
 
*The <tt>mpi_blastdb</tt> directory contains [[Glossary#FASTA|fasta]] indexes and BLAST database files created from the input EST, protein, and repeat databases.
 
*The <tt>mpi_blastdb</tt> directory contains [[Glossary#FASTA|fasta]] indexes and BLAST database files created from the input EST, protein, and repeat databases.
*The <tt>dpp_contig_master_datastore_index.log</tt> contains information on both the run status of individual contigs and information on where individual contig data is stored.
+
*The <tt>2434_master_datastore_index.log</tt> contains information on both the run status of individual contigs and information on where individual contig data is stored.
*The <tt>dpp_contig_datastore</tt> directory contains a set of subfolders, each containing the final MAKER output for individual contigs from the genomic fasta file.
+
*The <tt>2434_datastore</tt> directory contains a set of subfolders, each containing the final MAKER output for individual contigs from the genomic fasta file.
  
  
Once a MAKER run is finished the most important file to look at is the <tt>dpp_contig_master_datastore_index.log</tt> to see if there were any failures.
+
Once a MAKER run is finished the most important file to look at is the <tt>2434_master_datastore_index.log</tt> to see if there were any failures.
  less dpp_contig_master_datastore_index.log
+
  less 2434_master_datastore_index.log

MWAS provides a summary of this file when you click on results to download a job.  MWAS also displays run errors in the log option button that you can click on when in the MWAS main queue page.
  
 
+
If everything proceeded correctly you should see the following in your 2434_master_datastore_index.log file.
If everything proceeded correctly you should see the following.
+
  contig-dpp-500-500      2434_datastore/contig-dpp-500-500 STARTED
  contig-dpp-500-500      dpp_contig_datastore/contig-dpp-500-500 STARTED
+
  contig-dpp-500-500      2434_datastore/contig-dpp-500-500 FINISHED
  contig-dpp-500-500      dpp_contig_datastore/contig-dpp-500-500 FINISHED
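For jobs with many contigs, scanning the index log by eye becomes tedious. The sketch below is one way to list contigs that have a STARTED entry but never reached FINISHED. It assumes only the three whitespace-separated columns shown above (contig name, datastore path, status); the log file name and the second contig are made up for the example.

```shell
# Hypothetical index log; the third column is the run status shown above.
printf 'contig-dpp-500-500\t2434_datastore/contig-dpp-500-500\tSTARTED\n'  > demo_master_datastore_index.log
printf 'contig-dpp-500-500\t2434_datastore/contig-dpp-500-500\tFINISHED\n' >> demo_master_datastore_index.log
printf 'contig-demo-2\t2434_datastore/contig-demo-2\tSTARTED\n'            >> demo_master_datastore_index.log

# Remember the last status seen for each contig and print those that never reached FINISHED.
unfinished=$(awk '{ last[$1] = $3 } END { for (c in last) if (last[c] != "FINISHED") print c }' \
    demo_master_datastore_index.log | sort)
echo "contigs without a FINISHED entry: ${unfinished:-none}"
```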
+
  
  
Line 492: Line 325:
  
  
The entries in the <tt>dpp_contig_master_datastore_index.log</tt> file also indicate that the output files for this contig are stored in the directory dpp_contig_datastore/contig-dpp-500-500/.  Knowing where the output is stored may seem rather trivial; however, input genome fasta files can contain thousands even hundreds-of-thousands of contigs, and many file-systems have performance problems with large numbers of sub-directories and files within a single directory.  Even when the underlying file-systems handle things gracefully, access via network file-systems can be an issue.  To deal with this situation, MAKER uses a datastore module to create a hierarchy of sub-directory layers, starting from a 'base', and mapping identifiers to corresponding sub-directories.  For situations where the input genome fasta file contains more than 1,000 contigs, the datastore structure is used automatically, and the <tt>master_datastore_index.log</tt> file becomes essential for identifying where the output for a given contig is stored.
+
The entries in the <tt>2434_master_datastore_index.log</tt> file also indicate that the output files for this contig are stored in the directory 2434_datastore/contig-dpp-500-500/.  Knowing where the output is stored may seem rather trivial; however, input genome fasta files can contain thousands or even hundreds of thousands of contigs, and many file-systems have performance problems with large numbers of sub-directories and files within a single directory.  Even when the underlying file-systems handle things gracefully, access via network file-systems can be an issue.  To deal with this situation, MAKER uses a datastore module to create a hierarchy of sub-directory layers, starting from a 'base', and mapping identifiers to corresponding sub-directories.  For situations where the input genome fasta file contains more than 1,000 contigs, the datastore structure is used automatically, and the <tt>master_datastore_index.log</tt> file becomes essential for identifying where the output for a given contig is stored.
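When the datastore is nested several layers deep, the index log can be used as a lookup table. The following sketch defines a small helper, <tt>contig_dir</tt> (a hypothetical name, not a MAKER script), that prints the directory recorded for a given contig. It assumes the column layout shown in the log excerpt above and uses a made-up log file name.

```shell
# Hypothetical copy of the index log entries shown above.
printf 'contig-dpp-500-500\t2434_datastore/contig-dpp-500-500\tSTARTED\n'  > lookup_demo_index.log
printf 'contig-dpp-500-500\t2434_datastore/contig-dpp-500-500\tFINISHED\n' >> lookup_demo_index.log

# Print the datastore directory recorded for one contig (first matching entry).
contig_dir() {
    awk -v id="$1" '$1 == id { print $2; exit }' "$2"
}

contig_dir contig-dpp-500-500 lookup_demo_index.log
```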
  
  
 
Now let's take a look at what MAKER produced for the contig 'contig-dpp-500-500'.
 
Now let's take a look at what MAKER produced for the contig 'contig-dpp-500-500'.
  cd dpp_contig_datastore/contig-dpp-500-500
+
  cd 2434_datastore/contig-dpp-500-500
 
  ls -l
 
  ls -l
  
Line 507: Line 340:
 
  -rw-r--r-- 1 gmod gmod  4837 2009-07-12 23:27 contig-dpp-500-500.maker.snap_masked.transcripts.fasta
 
  -rw-r--r-- 1 gmod gmod  4837 2009-07-12 23:27 contig-dpp-500-500.maker.snap_masked.transcripts.fasta
 
  -rw-r--r-- 1 gmod gmod  4430 2009-07-12 23:27 contig-dpp-500-500.maker.transcripts.fasta
 
  -rw-r--r-- 1 gmod gmod  4430 2009-07-12 23:27 contig-dpp-500-500.maker.transcripts.fasta
drwxr-xr-x 2 gmod gmod  4096 2009-07-12 23:27 theVoid.contig-dpp-500-500
 
  
  
Line 514: Line 346:
 
*The <tt>contig-dpp-500-500.maker.snap_masked.transcripts.fasta</tt> and <tt>contig-dpp-500-500.maker.snap_masked.proteins.fasta</tt> files contain the transcript and protein sequences for all SNAP ''ab initio'' gene predictions.  If you use other ''ab initio'' gene predictors, those sequence files will follow a similar naming pattern.
 
*The <tt>contig-dpp-500-500.maker.snap_masked.transcripts.fasta</tt> and <tt>contig-dpp-500-500.maker.snap_masked.proteins.fasta</tt> files contain the transcript and protein sequences for all SNAP ''ab initio'' gene predictions.  If you use other ''ab initio'' gene predictors, those sequence files will follow a similar naming pattern.
 
*The <tt>contig-dpp-500-500.maker.non_overlapping_ab_initio.transcripts.fasta</tt> and <tt>contig-dpp-500-500.maker.non_overlapping_ab_initio.proteins.fasta</tt> files contain the set of best ''ab initio'' gene predictions that do not overlap a MAKER gene annotation.  These files can be analyzed to see if there is any reason to promote them to the status of gene annotations.  For example: you can run iprscan to see if they contain known protein domains.
 
*The <tt>contig-dpp-500-500.maker.non_overlapping_ab_initio.transcripts.fasta</tt> and <tt>contig-dpp-500-500.maker.non_overlapping_ab_initio.proteins.fasta</tt> files contain the set of best ''ab initio'' gene predictions that do not overlap a MAKER gene annotation.  These files can be analyzed to see if there is any reason to promote them to the status of gene annotations.  For example: you can run iprscan to see if they contain known protein domains.
*The directory <tt>theVoid.contig-dpp-500-500</tt> contains raw output files from all the programs MAKER wraps around (BLAST, SNAP, RepeatMasker, etc.).  You can usually ignore this directory and its contents.
 
  
 
+
===Viewing MAKER Annotations===
==Viewing MAKER Annotations==
+
 
Viewing the raw [[GFF3]] file produced by MAKER really isn't that meaningful.
 
Viewing the raw [[GFF3]] file produced by MAKER really isn't that meaningful.
  
  
For sanity checking purposes it would be nice to have a graphical view of what's in the GFF3 file.  To do this GFF3 files can be loaded into programs like [[Apollo]] and [[GBrowse]].
+
For sanity checking purposes it would be nice to have a graphical view of what's in the GFF3 file.  To do this GFF3 files can be loaded into programs like [[Apollo]] and [[GBrowse]]. MWAS allows you to view the files in Apollo directly on the website.  You can also get summary statistics of annotation features using the tool SOBA from the Sequence Ontology Consortium.
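For a quick look without any graphical tool, the feature types in a GFF3 file can be tallied from the command line. This is only a rough sketch and not a replacement for SOBA: the file name and its contents are invented, and <tt>count_by_type</tt> is a hypothetical helper that counts the values in the third (type) column while skipping comment lines and any embedded sequence.

```shell
# Hypothetical GFF3 content for illustration only.
{
    printf '##gff-version 3\n'
    printf 'contig-demo\tmaker\tgene\t100\t900\t.\t+\t.\tID=gene1\n'
    printf 'contig-demo\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=mrna1;Parent=gene1\n'
    printf 'contig-demo\tmaker\texon\t100\t400\t.\t+\t.\tID=exon1;Parent=mrna1\n'
    printf 'contig-demo\tmaker\texon\t600\t900\t.\t+\t.\tID=exon2;Parent=mrna1\n'
    printf 'contig-demo\tblastx\tprotein_match\t120\t880\t.\t+\t.\tID=hit1\n'
} > features_demo.gff

# Tally the feature type column; lines with fewer than 9 fields (comments, sequence) are ignored.
count_by_type() {
    awk -F '\t' '!/^#/ && NF >= 9 { n[$3]++ } END { for (t in n) printf "%s %d\n", t, n[t] }' "$1" | sort
}

count_by_type features_demo.gff
```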
 
+
 
+
===Apollo===
+
Let's load the <tt>contig-dpp-500-500.gff</tt> into [[Apollo]] and take a look at what MAKER produced. Copy the <tt>contig-dpp-500-500.gff</tt> file to your home directory to make it easy to locate, and then let's start Apollo.
+
cp contig-dpp-500-500.gff ~
+
 
+
 
+
Select the file in Apollo, and open it.  You will notice that there are a number of bars representing the gene annotations and the evidence alignments supporting those annotations.  Annotations are in the middle light colored panel, and evidence alignments are in the dark panels at the top and bottom.
+
 
+
 
+
All the evidence in the dark panels is in the same color, which makes it difficult to identify the source of each piece of evidence without manually clicking on them.  MAKER comes with a configuration file for Apollo which gives a more colorful view of MAKER produced annotations and evidence.  Let's close Apollo, copy this configuration file and then reload the annotations.
+
 
+
 
+
The configuration file should be placed in the <tt>~/.apollo</tt> directory.  Create this directory if it does not exist.
+
cd ~
+
mkdir .apollo
+
  
  
Now copy the configuration file to that directory.
+
====Apollo====
  cp /usr/local/maker/Apollo/gff3.tiers ~/.apollo/
+
On the results screen choose a contig from a job and click "View in Apollo".  A Java Web Start version of Apollo will then install itself automatically on your computer, if not already installed. Apollo will then automatically load the contig you indicated into the browser. You will notice that there are a number of bars representing the gene annotations and the evidence alignments supporting those annotations.  Annotations are in the middle light colored panel, and evidence alignments are in the dark panels at the top and bottom.
  
  
Open the <tt>contig-dpp-500-500.gff</tt> file again in Apollo.  You can now see the annotations and evidence in nice color.  Click on each piece of evidence and you will see its source in the table at the bottom of the Apollo screen.
+
All the evidence in the dark panels will be a different color depending on the source each piece of evidence was derived from (e.g. RepeatMasker, BLASTX, etc.).  To identify which source a feature belongs to, just click on it and the name of the source will be displayed in the table at the bottom of the Apollo screen.
  
 
Possible Sources Include:
 
Possible Sources Include:
Line 557: Line 371:
 
*FgenesH  - FGENESH ''ab initio'' gene prediction
 
*FgenesH  - FGENESH ''ab initio'' gene prediction
 
*Repeatmasker - RepeatMasker identified repeat
 
*Repeatmasker - RepeatMasker identified repeat
*Blastx:Repeatmask - RepeatRunner identified repeat from the repeat protein database
+
*RepeatRunner - RepeatRunner identified repeat from the repeat protein database
  
 
+
===Training ''ab initio'' Gene Predictors===
===GBrowse===
+
Previous versions of [[GBrowse]] required explicit UTR features in the [[GFF3]] file.  This may or may not still be the case.  If you need these features, there is a MAKER accessory script you can use.
+
add_utr_gff.pl <gff3_directory>
+
 
+
 
+
The directory can contain multiple GFF3 files.
+
 
+
 
+
''(See [[GBrowse]] documentation to set up GBrowse)''
+
 
+
 
+
 
+
=Advanced MAKER Configuration, Re-annotation Options, and Improving Annotation Quality =
+
The remainder of this page mainly presents issues that can be encountered during the annotation process.  I then describe how MAKER can be used to resolve each issue.
+
 
+
<span style="font-size: 80%">[[Media:2009SumSchMAKER.pdf|See accompanying MAKER presentation (~16 MB)]].</span>
+
 
+
==Configuration Files in Detail==
+
Let's take a closer look at the configuration options in the maker_opts.ctl file.
+
 
+
 
+
===Basic Input Files===
+
All the basic input files for MAKER should be in fasta format.
+
 
+
 
+
*''genome'' - Genomic sequence file
+
*''est'' - ESTs from the same organism or from a very closely related organism (e.g. chimpanzee to human).  These are aligned first via BLASTN with very strict filtering, so any sequence divergence can prevent the alignment.
+
*''altest'' - These are ESTs from other closely related organisms (i.e. mouse to human).  They are aligned via TBLASTX in protein space, so greater sequence divergence is permitted.
+
*''protein'' - proteins from the same or other organisms.  These are aligned via BLASTX against the genome.  Proteins that align to a region will not necessarily be orthologous or paralogous.  The alignment may just be based on short regions such as a shared domain.  You may also get alignments to pseudogenes.  Polishing BLASTX hits with Exonerate helps identify what are likely true paralogs and orthologs.
+
 
+
 
+
===Repeat Masking Options===
+
Repeat masking is important for improving gene predictor performance and avoiding protein alignments to what are likely just transposons.  You should also expect a certain amount of genomic contamination in the EST database, and much of this contamination maps back to repeat regions.  By repeat masking we can avoid issues with all types of input data.
+
 
+
 
+
*''model_org'' - This is a RepeatMasker option that lets you limit the repeat database to specific organisms or groups of organisms (i.e. vertebrates, Nematodes, ''Drosophila'', primates etc).  By default MAKER sets this to 'all'.
+
*''repeat_protein'' - This is a fasta file of transposon and virus related proteins.  MAKER has an internal RepeatRunner database it uses by default.
+
*''rmlib'' - This is a fasta file of nucleotide repeats provided by the user.  You can create a species specific repeat database using programs like PILER.
+
 
+
 
+
===Gene Prediction Options===
+
Gene prediction options affect the final gene annotations more than any other option type.  This brings up the point that electronically produced gene annotations will only be as good as the gene predictions they are based on.
+
 
+
 
+
*''predictor'' - This tells MAKER what programs to run for generating annotations.
+
**est2genome - Allows high quality spliced Exonerate EST alignments to become gene annotations.  This only happens when there is no gene prediction overlapping the region.  This is useful for generating gene annotations in the absence of a trained gene predictor.
+
**model_gff - This allows user defined models to be used
+
**snap
+
**augustus
+
**genemark
+
**fgenesh
+
*''unmask'' - Produce ''ab initio'' gene predictions for unmasked sequence as well as for masked sequence
+
*''snaphmm'' - SNAP training file (SNAP has some species files already available in the snap/HMM/ directory)
+
*''gmhmm'' -  GeneMark training file (GeneMark self-trains and produces the resulting training file in the output mod/ directory)
+
*''augustus_species'' - Augustus species ID (Augustus uses an internal species index rather than a simple set of training files.  Type 'augustus --species=help' to see the values you can choose)
+
*''fgenesh_par_file'' - FGENESH training file
+
 
+
 
+
===Other MAKER Options===
+
*''evaluate'' - runs an experimental annotation quality analysis program (Evaluator) on each annotation.  Provides quantitative metrics for ranking annotations and identifying the features most in need of review.  I'd like to emphasize that this is experimental.
+
*''max_dna_len'' -  sets the length for dividing up contigs into chunks for processing.  Larger chunks require more memory; smaller chunks require less memory.  Allows the user to control system memory usage.
+
*''min_contig'' - sets the minimum length a contig must have or else it will be skipped.
+
*''min_protein'' - sets the minimum length a predicted protein must have (in amino acids) to be annotated.
+
*''split_hit'' - sets the expected max intron size for evidence alignments
+
*''pred_flank'' - sets the length for the sequence surrounding clusters of EST and protein evidence that will be used when building hint based gene predictions.
+
*''single_exon'' - tells MAKER to consider single exon EST evidence when generating annotations.  Single exon ESTs are more likely to be genomic contamination.
+
*''single_length'' - sets the minimum length required for single exon ESTs if 'single_exon' is enabled
+
*''keep_preds'' - adds non-overlapping ''ab initio'' gene predictions to the final annotation set rather than pushing them off into a separate file for the user to analyse.  These predictions by definition do not overlap any form of supporting evidence.
+
*''retry'' - sets the number of times to retry a contig if there is a failure
+
*''clean_try'' - removes all data from previous MAKER runs before retrying a contig
+
*''clean_up'' - removes theVoid directory with individual raw analysis files at the end of the MAKER run
+
*''TMP'' - specifies a directory other than the system default temporary directory (/tmp) for writing temporary files.  On some Linux systems the primary hard drive that also holds the default temporary directory is small, and most of the system's storage space is located on secondary hard drives mounted in directories elsewhere on the system.  This is often true of computer clusters where each node has its own small hard drive for booting purposes, and most storage space is network mounted.  Temporary files created by MAKER are deleted as the program advances, but individual files related to BLAST jobs can be quite large, so setting TMP to another location can be useful.
+
 
+
 
+
==Training ''ab initio'' Gene Predictors==
+
 
If you are involved in a genome project for an emerging model organism, you should already have an EST database which would have been generated as part of the original sequencing project.  A protein database can be collected from closely related organism genome databases or by using the UniProt/SwissProt protein database or the NCBI NR protein database.  However a trained ''ab initio'' gene predictor is a much more difficult thing to generate.  Gene predictors require existing gene models on which to base prediction parameters.  However, with emerging model organisms there are no pre-existing gene models.  So how then are you supposed to train your gene prediction programs?
 
If you are involved in a genome project for an emerging model organism, you should already have an EST database which would have been generated as part of the original sequencing project.  A protein database can be collected from closely related organism genome databases or by using the UniProt/SwissProt protein database or the NCBI NR protein database.  However a trained ''ab initio'' gene predictor is a much more difficult thing to generate.  Gene predictors require existing gene models on which to base prediction parameters.  However, with emerging model organisms there are no pre-existing gene models.  So how then are you supposed to train your gene prediction programs?
  
  
MAKER gives the user the option to produce gene annotations directly from the EST evidence.  You can then use these imperfect gene models to train a gene predictor program.  Once you have re-run MAKER with the newly trained gene predictor, you can use the second set of gene annotations
+
MWAS gives the user the option to produce gene annotations directly from the EST evidence.  You can then use these imperfect gene models to train a gene predictor program.  Once you have re-run MWAS with the newly trained gene predictor, you can use the second set of gene annotations
 
to train the gene predictors yet again.  This boot-strap process allows you to iteratively improve the performance of ''ab initio'' gene predictors.
 
to train the gene predictors yet again.  This boot-strap process allows you to iteratively improve the performance of ''ab initio'' gene predictors.
  
 +
===GFF3 Pass-through===
 +
What if I'm not working on a new genome project, but rather I have an existing annotation set, and I just want to update my genome database to reflect new protein and EST evidence?  Here you can use a feature in MAKER called GFF3 pass-through, which allows you to pass existing annotations into the program and combine them with new evidence for use in the annotation process.
  
I've created an example file set so you can learn to train the gene predictor SNAP using this procedure.
+
===mRNAseq===
 
+
 
+
First let's copy the data and setup a working directory.
+
cd ~
+
tar -zxf ~/software/maker/train.tar.gz
+
cd train
+
ls -al
+
 
+
 
+
You should see four files in the directory
+
genome.fasta
+
est.fasta
+
protein.fasta
+
repeat_protein.fasta
+
 
+
 
+
We need to build maker configuration files and populate the appropriate values.
+
maker -CTL
+
emacs maker_opts.ctl
+
{{TextEditorLink|emacs}}
+
 
+
 
+
Edit the following:
+
genome:genome.fasta
+
est:est.fasta
+
protein:protein.fasta
+
repeat_protein:repeat_protein.fasta
+
predictor:est2genome
+
 
+
 
+
MAKER is now configured to generate annotations from the EST data, so start the program.
+
maker
+
 
+
 
+
Now load the file genome.maker.output/genome_datastore/scf1117875581239.gff into Apollo.  You will see that there are far more regions with evidence alignments than there are gene annotations.  This is because there are so few spliced ESTs that are capable of generating gene models.
+
 
+
 
+
Now exit Apollo. We now need to convert the GFF3 gene models to ZFF format.  This is the format SNAP requires for training.  To do this we need to collect all GFF3 files into a single directory.
+
mkdir gff
+
find genome.maker.output -name 'scf*.gff' -exec cp {} gff \;
+
cd gff
+
maker2zff.pl . Pult
+
ls -l
+
 
+
 
+
There should now be two new files. The first is the ZFF format file and the second is a fasta file that the coordinates can be referenced against. These will be used to train SNAP.
+
Pult.ann
+
Pult.dna
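Since so few spliced ESTs produce gene models, it is worth checking how many models actually made it into the training set before running SNAP's training tools. The sketch below is a rough check under stated assumptions: it assumes the ZFF annotation file has <tt>&gt;</tt> header lines followed by feature lines whose fourth column is the gene model name, and it uses an invented file (<tt>zff_demo.ann</tt>) and a hypothetical helper name rather than the real <tt>Pult.ann</tt>.

```shell
# Hypothetical ZFF annotation: label, begin, end, model name (layout is an assumption).
cat > zff_demo.ann <<'EOF'
>scf_demo
Einit 201 325 model-1
Exon 401 510 model-1
Eterm 601 700 model-1
Esngl 1500 2100 model-2
EOF

# Count distinct gene model names in the fourth column, ignoring sequence header lines.
count_zff_models() {
    grep -v '^>' "$1" | awk '{ print $4 }' | sort -u | wc -l | tr -d ' '
}

echo "gene models available for training: $(count_zff_models zff_demo.ann)"
```

A very small count suggests the first round of training will be weak, which is consistent with treating the first HMM as a starting point only.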
+
 
+
 
+
Training SNAP.
+
fathom -categorize 1000 Pult.ann Pult.dna
+
fathom -export 1000 -plus uni.ann uni.dna
+
forge export.ann export.dna
+
hmm-assembler.pl Pult . > ../Pult.hmm
+
cd ..
+
 
+
 
+
The final training parameters file is Pult.hmm.  We do not expect SNAP to perform that well with this training file; however, it is a good starting point for further training.
+
 
+
 
+
We need to run MAKER again.
+
emacs maker_opts.ctl
+
{{TextEditorLink|emacs}}
+
 
+
predictor:snap,est2genome
+
snaphmm:Pult.hmm
+
 
+
maker
+
 
+
 
+
Now let's look at the output once again in Apollo.
+
 
+
 
+
Close Apollo, retrain SNAP, and run MAKER again.
+
rm gff/*
+
find genome.maker.output -name 'scf*.gff' -exec cp {} gff \;
+
cd gff
+
maker2zff.pl . Pult
+
fathom -categorize 1000 Pult.ann Pult.dna
+
fathom -export 1000 -plus uni.ann uni.dna
+
forge export.ann export.dna
+
hmm-assembler.pl Pult . > ../Pult2.hmm
+
cd ..
+
emacs maker_opts.ctl
+
 
+
 
+
Change configuration file.
+
snaphmm:Pult2.hmm
+
 
+
 
+
Run maker.
+
maker
+
 
+
 
+
Let's examine the GFF3 file one last time in Apollo.  As you can see, there is a marked degree of improvement in the gene models.
+
 
+
==GFF3 Pass-through==
+
What if I'm not working on a new genome project, but rather I have an existing annotation set, and I just want to update my genome database to reflect new protein and EST evidence?  Here you can use a feature in MAKER called GFF3 pass-through, which allows you to pass existing annotations into the program and combine them with updated EST and protein alignments.
+
 
+
 
+
Let's begin by copying the GFF-passthrough example data and preparing MAKER.
+
cd ~
+
tar -zxf ~/software/maker/pass.tar.gz
+
cd pass
+
ls -al
+
 
+
 
+
You will see a number of files.  Not all of them are important (for now).
+
genome.fasta
+
est.fasta
+
protein.fasta
+
repeat_protein.fasta
+
model.gff
+
est.gff
+
pred.gff
+
Pult.hmm
+
 
+
 
+
We now need to generate MAKER configuration files and edit them.
+
maker -CTL
+
emacs maker_opts.ctl
+
 
+
genome:genome.fasta
+
est:est.fasta
+
protein:protein.fasta
+
repeat_protein:repeat_protein.fasta
+
model_gff:model.gff
+
predictor:model_gff
+
{{TextEditorLink|emacs}}
+
 
+
Now run MAKER.
+
maker
+
 
+
 
+
Load the output GFF3 file into Apollo.  You will see that the annotations and the updated evidence have all been bundled together.  The results can now be loaded into the genome database for distribution.
+
 
+
 
+
What if I also want to modify existing annotations to take into account the updated evidence?  Can that be done? Yes.  We just need to modify the configuration parameters.
+
emacs maker_opts.ctl
+
 
+
 
+
Now change these values.
+
predictor:model_gff,snap
+
snaphmm:Pult.hmm
+
{{TextEditorLink|emacs}}
+
 
+
 
+
MAKER is now configured to produce SNAP gene models that will compete against the existing passed through GFF3 models.
+
 
+
 
+
Start MAKER.
+
maker
+
 
+
 
+
Load the resulting GFF3 output file into Apollo and you will see that new annotations replace old annotations where the evidence was sufficient to suggest a different model.  Note that if you want to maintain old gene names when models are replaced, set map_forward:1 in the maker_opt.ctl file.  You can then run maker again and view the results in Apollo.  You will see that the gene models are the same as the previous example, but the legacy names have been pulled forward into the updated models.
+
 
+
 
+
You've seen how GFF3 pass-through lets you use existing gene models, but if you can pass through existing gene models, wouldn't it be nice to be able to pass through any type of data?
+
 
+
 
+
MAKER also allows you to pass through existing EST, protein, repeat, and prediction data in GFF3 format.  Even though the data may have originated from other programs, MAKER treats it as if it originated from within the pipeline.  MAKER even has an other_gff option, so you can pass through features that don't necessarily fit into categories that MAKER can use.  These get passed straight through into the output file, so it's an easy way to keep user-defined features.
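Before passing a GFF3 file from another program into MAKER, it can help to check what it actually contains.  The sketch below relies only on the standard GFF3 layout (nine tab-separated columns, with the source in column 2 and the feature type in column 3).  The function name is hypothetical, and this is not a MAKER script.

```python
from collections import Counter


def count_gff3_features(lines):
    """Count GFF3 features by (source, type).

    Comment and directive lines are skipped, and parsing stops at an
    embedded ##FASTA section if one is present.
    """
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # everything after this directive is sequence data
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line
        counts[(cols[1], cols[2])] += 1
    return counts
```

Calling it on an open file handle gives a quick summary of which programs produced which feature types.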
+
 
+
 
+
With the GFF3 pass-through option, you can now imagine including gene predictions from programs like TwinScan or EST alignments from programs like BLAT, both of which are unsupported by MAKER.  Let's do that.
+
emacs maker_opt.ctl
+
 
+
 
+
Change the configuration options.
+
predictor:model_gff,snap,pred_gff
+
est_gff:est.gff
+
pred_gff:pred.gff
+
 
+
 
+
Run maker.
+
maker
+
{{TextEditorLink|emacs}}
+
 
+
 
+
Now examine the output in Apollo.  You will see new evidence features from TwinScan and BLAT.  There are even a few annotations that now derive from the TwinScan predictions.
+
 
+
==mRNAseq==
+
 
mRNAseq is a high throughput technique for sequencing the entire transcriptome, and it holds the promise of allowing researchers to identify all exons and alternative splice forms for every gene in the genome with a single experiment.  It may soon make gene predictors (mostly) a thing of the past.
 
 
*Still need to de-convolute reads & evidence (for now)
 
Line 830: Line 389:
  
  
[[Image:MRNAseq.jpg | 843px | ]]
+
[[File:MRNAseq.jpg|500px]]
  
  
Line 836: Line 395:
  
  
==Merge/Resolve Legacy Annotations==
+
===Merge/Resolve Legacy Annotations===
 
Legacy annotations
 
 
*Many are no longer maintained by original creators
 
Line 844: Line 403:
  
  
[[Image:Legacy.png]]
+
[[File:Legacy.png|500px]]
  
  
Line 852: Line 411:
 
*If no existing annotation, create new one
 
  
 
+
[[Category:MAKER]]
Let's look at an example: ~/software/maker/legacy.tar.gz
+
[[Category:Tutorials]]
cd ~
+
tar -zxvf ~/software/maker/legacy.tar.gz
+
cd legacy
+
ls -l
+
 
+
genome.fasta
+
est.fasta
+
protein.fasta
+
repeat_protein.fasta
+
legacy1.gff
+
legacy2.gff
+
Pult.hmm
+
 
+
 
+
You need to merge the legacy GFF3 files since MAKER only accepts one input model_gff file.  In future versions of MAKER you will be able to use a comma-separated list.
+
gff3_merge legacy1.gff legacy2.gff -o legacy.gff
+
 
+
 
+
Edit configuration files.
+
maker -CTL
+
emacs maker_opts.ctl
+
 
+
 
+
Change the following configuration values. We are going to use the legacy annotations in conjunction with SNAP.  SNAP can then create and update annotations whenever the evidence permits.
+
genome:genome.fasta
+
est:est.fasta
+
protein:protein.fasta
+
repeat_protein:repeat_protein.fasta
+
model_gff:legacy.gff
+
predictor:model_gff,snap
+
snaphmm:Pult.hmm
+
 
+
 
+
Copy the Pult.hmm file to your current working directory from the previous GFF3 pass-through example.  We need this file for SNAP.
+
cp ../pass/Pult.hmm .
+
 
+
 
+
Now run MAKER.
+
maker
+
 
+
==MAKER Accessory Scripts==
+
MAKER comes with a number of accessory scripts that are meant to assist in manipulations of the MAKER input and output files.
+
 
+
 
+
Scripts:
+
*''add_utr_gff.pl'' - Adds explicit 5' and 3' UTR features to the GFF3 output file
+
add_utr_gff.pl <gff3_directory>
+
 
+
 
+
*''add_utr_start_stop_gff'' - Adds explicit 5' and 3' UTR as well as start and stop codon features to the GFF3 output file
+
add_utr_start_stop_gff <gff3_file>
+
 
+
 
+
*''fasta_merge'' - Collects all of MAKER's fasta file output for each contig and merges them to make genome level fastas
+
fasta_merge -d <datastore_index> -o <outfile>
+
 
+
 
+
*''gff3_merge'' - Collects all of MAKER's GFF3 file output for each contig and merges them to make a single genome level GFF3
+
gff3_merge -d <datastore_index> -o <outfile>
+
 
+
 
+
*''gff3_2_gtf'' - Converts MAKER GFF3 files to GTF format (run add_utr_start_stop_gff first to get UTR features)
+
gff3_2_gtf <gff3_file>
+
 
+
 
+
*''gff3_preds2models'' - Converts the gene prediction match/match_part format to annotation gene/mRNA/exon/CDS format
+
gff3_preds2models <gff3 file> <pred list>
+
 
+
 
+
*''iprscan2gff3'' - Takes InterProScan (iprscan) output and generates GFF3 features representing domains. Interesting tier for GBrowse.
+
iprscan2gff3 <iprscan_file> <gff3_fasta>
+
 
+
 
+
*''iprscan_batch'' - Wrapper for iprscan to take advantage of multiprocessor systems.
+
iprscan_batch <file_name> <cpus> <log_file>
+
 
+
 
+
*''ipr_update_gff'' - Takes InterProScan (iprscan) output and maps domain IDs and GO terms to the Dbxref and Ontology_term attributes in the GFF3 file.
+
ipr_update_gff <gff3_file> <iprscan_file>
+
 
+
 
+
*''maker2zff.pl'' - Pulls out MAKER gene models from the MAKER GFF3 output and converts them into ZFF format for SNAP training.
+
maker2zff.pl <gff3_file>
+
 
+
 
+
*''maker_functional_fasta'' - Maps putative functions identified from BLASTP against UniProt/SwissProt to the MAKER produced transcript and protein fasta files.
+
maker_functional_fasta <uniprot_fasta> <blast_output> <fasta1> <fasta2> <fasta3> ...
+
 
+
 
+
*''maker_functional_gff'' - Maps putative functions identified from BLASTP against UniProt/SwissProt to the MAKER produced GFF3 files in the Note attribute.
+
maker_functional_gff <uniprot_fasta> <blast_output> <gff3_1>
+
 
+
 
+
*''maker_map_ids'' - Builds shorter IDs/Names for MAKER genes and transcripts following the NCBI suggested naming format.
+
maker_map_ids --prefix PYU1_ --justify 6 genome.all.gff > genome.all.id.map
+
 
+
 
+
*''map_fasta_ids'' - Maps short IDs/Names to MAKER fasta files.
+
map_fasta_ids <map_file> <fasta_file>
+
 
+
 
+
*''map_gff_ids'' - Maps short IDs/Names to MAKER GFF3 files; old IDs/Names are mapped to the Alias attribute.
+
map_gff_ids <map_file> <gff3_file>
+
 
+
 
+
*''split_fasta'' - Splits multi-fasta files into the number of files specified by the user.  Useful for breaking up MAKER jobs.
+
split_fasta [count] <input_fasta>
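To give a feel for what a script like split_fasta from the list above has to do, here is a minimal Python sketch.  It is not the MAKER script itself: the function names are made up, and the round-robin distribution is an assumption, since the real script may balance its output files differently.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_records(records, count):
    """Distribute records round-robin into `count` groups."""
    groups = [[] for _ in range(count)]
    for i, record in enumerate(records):
        groups[i % count].append(record)
    return groups
```

Each group could then be written to its own fasta file and annotated as a separate MAKER job.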
+
 
+
==MPI Support==
+
MAKER optionally supports Message Passing Interface (MPI), a parallel computation communication protocol primarily used on computer clusters.  This allows for MAKER jobs to be broken up across multiple nodes/processors for increased performance and scalability.
+
 
+
 
+
[[Image:Mpi_maker.png]]
+
 
+
<div class="emphasisbox">
+
The steps below should get MPI to work on your machine.  However, we did not actually run them during the [[Americas]] course, so MPI does not work on the VMware images produced by that course.</div>
+
 
+
To use this feature, you must have MPICH2 installed with the --enable-sharedlibs flag set during installation (see the MPICH2 Installer's Guide).  I have installed this for you, so let's set up MPI_MAKER and run the example file that comes with MAKER.  We cannot install via sudo because it does not preserve the PATH environment variable that tells where the MPICH2 executables are installed, so instead we need to install explicitly as the root user.
+
sudo su
+
source /home/gmod/.bashrc
+
source /home/gmod/.profile
+
cd /usr/local/maker/MPI/
+
perl Install.pl
+
 
+
 
+
Now press control and d together (^d) to exit the root user.
+
 
+
 
+
You should now see the executable mpi_maker listed among the other MAKER scripts.  Let's run some example data to see if MPI_MAKER is working properly.
+
cd ~
+
mkdir ~/maker_run2
+
cd maker_run2
+
cp /usr/local/data/dpp* ~/maker_run2
+
maker -CTL
+
emacs maker_opt.ctl
+
 
+
 
+
Set values in maker configuration files.
+
genome:dpp_contig.fasta
+
est:dpp_transcripts.fasta
+
protein:dpp_proteins.fasta
+
predictor:snap
+
snaphmm:fly
+
 
+
 
+
We need to set up a few more things for MPI to work.  Type mpd to see a list of instructions.
+
mpd
+
 
+
 
+
You should see the following.
+
configuration file /home/gmod/mpd.conf not found
+
A file named .mpd.conf file must be present in the user's home
+
directory (/etc/mpd.conf if root) with read and write access
+
only for the user, and must contain at least a line with:
+
MPD_SECRETWORD=<secretword>
+
One way to safely create this file is to do the following:
+
  cd $HOME
+
  touch .mpd.conf
+
  chmod 600 .mpd.conf
+
and then use an editor to insert a line like
+
  MPD_SECRETWORD=mr45-j9z
+
into the file.  (Of course use some other secret word than mr45-j9z.)
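If you want to script the instructions printed by mpd, the same steps can be sketched in Python.  This is just an illustration of the printed instructions, assuming a POSIX system.  The function name write_mpd_conf is hypothetical, and the random secret word comes from Python's standard secrets module, not from anything in MPICH2.

```python
import os
import secrets


def write_mpd_conf(home_dir, secret_word=None):
    """Create .mpd.conf in home_dir, readable and writable only by the owner."""
    path = os.path.join(home_dir, ".mpd.conf")
    if secret_word is None:
        secret_word = secrets.token_hex(8)  # random 16-character word
    # Create the file with mode 600 from the start.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(f"MPD_SECRETWORD={secret_word}\n")
    # Enforce mode 600 even if the file already existed with other permissions.
    os.chmod(path, 0o600)
    return path
```

Calling write_mpd_conf(os.path.expanduser("~")) would have the same effect as the touch, chmod, and edit steps listed above.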
+
 
+
 
+
Follow the instructions to set this file up, and start the mpi environment with mpdboot.  Then run mpi_maker through the MPI manager mpiexec.
+
mpdboot
+
mpiexec -n 2 mpi_maker
+
 
+
 
+
mpiexec is a wrapper that handles the MPI environment.  The -n 2 flag tells mpiexec to use 2 CPUs/nodes when running mpi_maker.  For a large cluster, this could be set to something like 100.  You should now know how to start a MAKER job via MPI.
+
 
+
==MAKER Web-Service==
+
If you don't want to install MAKER, there is also a MAKER Web-Service that makes the annotation process even easier.  So now you can annotate a genome from your iPhone (there's an app for that :-).
+
 
+
 
+
There are still quite a few bugs, but you can experiment and give me feedback if you want.
+
 
+
 
+
[[Image:MAKERWeb.jpg]]
+

Latest revision as of 22:03, 3 October 2012

MAKER Web Annotation Service

The MAKER Web Annotation Service (MWAS) is an easily configurable, web-accessible genome annotation pipeline. Its purpose is to allow research groups with small to intermediate amounts of eukaryotic and prokaryotic genome sequence (e.g. BAC clones, small whole genomes, or preliminary sequencing data) to independently annotate and analyse their data and produce output that can be loaded into a genome database. MWAS is built on the stand-alone genome annotation pipeline MAKER, and users who wish to annotate datasets that are too large to submit to MWAS are free to download MAKER for use on their own systems.


Understanding MWAS

The first half of this page gives general background on genome annotation and describes validation data for the MAKER Web Annotation Service (MWAS). The stand-alone annotation pipeline MAKER is at the heart of MWAS, and MWAS has been configured to present the user with configuration options that match those of the command-line program MAKER as closely as possible.


Introduction to Genome Annotation

What Are Annotations?

Annotations are descriptions of different features of the genome, and they can be either structural or functional in nature.

Examples:

  • Structural Annotations: exons, introns, UTRs, splice forms etc.
  • Functional Annotations: process a gene is involved in (metabolism), molecular function (hydrolase), location of expression (expressed in the mitochondria), etc.


It is especially important that every genome annotation include an evidence trail that describes in detail the evidence used to both suggest and support it. This assists in quality control and downstream management of genome annotations.

Examples of evidence supporting a structural annotation:

  • Ab initio gene predictions
  • ESTs
  • Protein homology

Importance of Genome Annotations

Why should the average biologist care about genome annotations? Genome sequence itself is not very useful. The main question when any genome is sequenced is, "where are the genes?" To identify the genes we need to annotate the genome. And while most researchers probably don't give annotations a lot of thought, they use them every day.


Examples of Annotation Databases:


Every time we use techniques such as RNAi, PCR, gene expression arrays, targeted gene knockout, or ChIP, we are basing our experiments on information derived from a digitally stored genome annotation. If the annotation is correct, then these experiments should succeed; however, if an annotation is incorrect, these experiments are bound to fail. Which brings up a major point:

  • Incorrect and incomplete genome annotations poison every experiment that uses them.

Quality control and evidence management are therefore essential components to any annotation process.

Effect of Next Generation Sequencing on the Annotation Process

It’s generally accepted that within the next few years it will be possible to sequence even human-sized genomes for as little as $1,000 and in a short time frame. When these expectations finally become reality, whole genome sequencing will likely become routine for even small laboratories. Unfortunately, advances in annotation technology have not kept pace with genome sequencing, and annotation is rapidly becoming a major bottleneck affecting modern genomics research.

For example:

  • As of October 2009, 222 eukaryotic genomes were fully sequenced yet unpublished (this is an ever growing backlog).
  • Currently (Jan 2010) there are over 900 eukaryotic genome projects underway; assuming 10,000 genes per genome, that’s 9,000,000 new annotations (with this many new annotations, quality control and maintenance become an issue).
  • While there are organizations dedicated to producing and distributing genome annotations (e.g. ENSEMBL and VectorBase), the sheer volume of newly sequenced genomes exceeds both their capacity and stated purview.
  • Many small research groups (which often lack bioinformatics experience) must therefore confront the difficulties associated with genome annotation on their own.


The MAKER Web Annotation Service is a tool to assist research groups in converting the mountain of genomic data provided by next generation sequencing technologies into a usable resource. For larger datasets, research groups can use a local installation of the annotation pipeline MAKER.

What does MWAS do?

  • Identifies and masks out repeat elements
  • Aligns ESTs to the genome
  • Aligns proteins to the genome
  • Produces ab initio gene predictions
  • Synthesizes these data into final annotations
  • Produces evidence-based quality values for downstream annotation management


File:Apollo view.jpg
MAKER generated annotations, shown in Apollo.


What sets MAKER and MWAS apart from other tools (ab initio gene predictors etc.)?

MAKER is an annotation pipeline, not a gene predictor. MAKER does not predict genes; rather, it leverages existing software tools (some of which are gene predictors) and integrates their output to produce what MAKER believes to be the best possible gene model for a given location based on evidence alignments.


gene prediction ≠ gene annotation

  • gene predictions are gene models.
  • gene annotations are gene models but should include a documented evidence trail supporting the model in addition to quality control metrics.


This may seem like just a matter of semantics since the primary output for both ab initio gene predictors and the MAKER pipeline is the same, a collection of gene models. However there are a few very significant consequences to the differences between these programs that I will explain shortly.


Emerging vs. Model Genomes

Emerging model organism genomes each come with their own set of issues that are not necessarily found in classic model genomes. These include difficulties associated with repeat identification, gene finder training, and other complex analyses. Unfortunately, emerging model organisms are often studied by very small research communities, which often lack the resources and bioinformatics experience necessary to tackle these issues.

{| class="wikitable"
! Classic Model Organisms !! Emerging Model Organisms
|-
| Well developed experimental systems
| New experimental systems
* Genome will be the central resource for work in these systems
|-
| Much prior knowledge about genome
| Little prior knowledge about genome
* Usually no genetics
|-
| Large community || Small communities
|-
| Big $ || Less $
|-
| Examples: D. melanogaster, C. elegans, human, etc.
| Examples: oomycetes, flat worms, cone snail, etc.
|}

Comparison of Algorithm Performance on Model vs. Emerging Genomes

If you have ever looked at comparisons of gene predictor performance on classic model organisms such as C. elegans, you would conclude that ab initio gene predictors match or even outperform state-of-the-art annotation pipelines, and the truth is that, with enough training data, they do. However, it is important to keep in mind that ab initio gene predictors have been specifically optimized to perform well on model organisms such as Drosophila and C. elegans, organisms for which we have large amounts of pre-existing data to both train and tweak the prediction parameters.


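{| class="wikitable"
|+ MAKER's Performance on the C. elegans genome
! rowspan="2" colspan="2" | Performance Category
! colspan="2" | Ab initio
! colspan="2" | Evidence Based
|-
! SNAP !! Augustus !! MAKER !! Gramene
|-
! rowspan="2" | Genomic Overlap (gene)
! SP
| 82.48 || 88.09 || 91.69 || 93.49
|-
! SN
| 95.44 || 96.78 || 89.81 || 88.74
|-
! rowspan="2" | Exon Overlap
! SP
| 18.88 || 22.87 || 25.58 || 27.38
|-
! SN
| 87.63 || 93.09 || 91.17 || 94.84
|}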
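The table reports sensitivity (SN) and specificity (SP) without spelling out how they are computed.  As a rough illustration only, the sketch below uses the convention common in the gene prediction literature, where SN is the fraction of the reference covered by the prediction and SP is the fraction of the prediction covered by the reference, measured here per nucleotide.  Whether the values above were computed exactly this way is an assumption, and the function names are made up.

```python
def covered_positions(intervals):
    """Expand 1-based, inclusive (start, end) intervals into a set of positions."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end + 1))
    return positions


def sensitivity_specificity(predicted, reference):
    """Return (SN, SP) at the nucleotide level.

    SN = shared / reference length, SP = shared / predicted length.
    """
    pred = covered_positions(predicted)
    ref = covered_positions(reference)
    shared = len(pred & ref)
    sn = shared / len(ref) if ref else 0.0
    sp = shared / len(pred) if pred else 0.0
    return sn, sp
```

A prediction that is too long keeps a high SN but loses SP, while a prediction that is too short does the opposite, which is why both numbers are reported.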

What about emerging model organisms for which little data is available? Gene prediction in classic model organisms is relatively simple because there are already a large number of experimentally determined and verified gene models, but with emerging model organisms, we are lucky to have a handful of gene models to train with. As a result ab initio gene predictors generally perform very poorly on emerging genomes.

MAKER's Performance on the S. mediterranea Emerging Model Organism Genome. Pfam domain content of gene models determined using rpsblast


By using ab initio gene predictors inside of the MAKER pipeline instead of as stand-alone applications, you get certain benefits:

  • Provides gene models as well as evidence trail correlations for quality control and manual curation.
  • Provides a mechanism to train and retrain ab initio gene predictors for even better performance.
  • Output can be easily loaded into a GMOD compatible database for annotation distribution (including evidence associations).
  • Annotations can be automatically updated with new evidence by simply passing existing annotation sets back into the pipeline

Getting Started with MWAS

Registration

MWAS is free to all users for academic use and has no login requirement, but registration is recommended as it allows for easier file and job management and registered users are allowed to upload more sequence.

Running MWAS with Example Data

MWAS comes with some example files to familiarize the user with how to run an annotation job. You can pre-load the fields for an example job by selecting one of the examples from the drop down menu on the "New Job" page and then selecting "Load". This will fill out options on the "New Job" form for you. Review the options carefully, and then submit the example job for execution by pressing the "Submit to Queue" button at the bottom of the page.

Start with the "Drosophila melanogaster : DPP example". This will load the region of the D. melanogaster genome encoding decapentaplegic along with cDNA and protein evidence overlapping the region. Select "Drosophila melanogaster : DPP example" from the drop down example menu. Then select "Load" to fill in the form.

If you scroll down through the form, you will notice that the genome file, EST file, protein file, and prediction method sections have been filled out for you. Click on "Submit to Queue", to start the job.

You should be redirected to the MWAS start page upon submission, and the job you have submitted should be visible in the job status section. Click "Refresh Job Status" to update the run status of your job. Within a few moments, your job will complete, at which point you can view the results.

Click on "View Results". You can now download the results for local analysis on your own system, or you can click on "View in Apollo" to see gene models loaded directly in the Apollo genome browser. This option will install a Java Web Start version of Apollo if it is not already installed. You can also view summary statistics of the annotation from the Sequence Ontology's SOBA tool by clicking on "SOBA Statistics".


Details of What is Going on Inside of MWAS

Repeat Masking

MWAS runs MAKER internally, and the first step in MAKER is repeat masking. Why do we need to do this? Repetitive elements can make up a significant portion of the genome. Some of these repeats are simple/low-complexity repeats where you have runs of C's or G's or maybe even something like AAGGAAGGAAGG. Other repeats are more complex, i.e. transposable elements. These high-complexity repeats often encode real proteins like reverse transcriptase or even Gag, Pol, and Env viral proteins. Because they encode real proteins, they can play havoc with ab initio gene predictors. For example, a transposable element that occurs next to or even within the intron of a real protein-encoding gene might cause a gene predictor to include extra exons as part of a gene model, sequence which really only belongs to the transposable element and not to the coding sequence of the gene. You will also get hundreds of instances where identical transposable element proteins get annotated as being part of an organism's proteome. In addition to these issues, low-complexity repeat regions can align with high statistical significance to low-complexity protein regions, creating a false sense of homology throughout the genome. To avoid these complications, it is convenient to identify and mask any repeat elements before doing other analyses.


MAKER identifies repeats in two steps.

  • First a program called RepeatMasker is used to identify low-complexity and high-complexity repeats that match entries in the RepBase repeat library, or any species specific repeat library supplied by the user.
  • Next MAKER uses RepeatRunner to identify transposable element and viral proteins from the RepeatRunner protein database. Because protein sequence diverges at a slower rate than nucleotide sequence, this step helps pick up the most problematic regions of divergent repeats that are missed by RepeatMasker, which searches in nucleotide space.


Regions identified during repeat analysis are masked out so as not to complicate other downstream annotation analyses.

  • High-complexity repeats are hard-masked, a technique in which nucleotide sequence is replaced with the letter N to prohibit any alignments to that region.
  • Low-complexity regions are soft-masked, a technique in which nucleotides are made lower case so they can be treated as masked under certain situations without losing sequence information. I will discuss some of the applications and effects of soft-masking later.
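The difference between the two masking styles described above is easy to see in code.  The following is a minimal sketch, not MAKER's implementation.  The function names and the 0-based, half-open region coordinates are choices made for this example.

```python
def hard_mask(sequence, regions):
    """Replace every base in the given regions with N (sequence is lost)."""
    seq = list(sequence)
    for start, end in regions:  # 0-based, half-open
        for i in range(start, min(end, len(seq))):
            seq[i] = "N"
    return "".join(seq)


def soft_mask(sequence, regions):
    """Lower-case every base in the given regions (sequence is kept)."""
    seq = list(sequence)
    for start, end in regions:  # 0-based, half-open
        for i in range(start, min(end, len(seq))):
            seq[i] = seq[i].lower()
    return "".join(seq)
```

Note that a soft-masked sequence can always be restored by upper-casing it, whereas a hard-masked region cannot be recovered.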


Now the idea of masking out sequence might seem on the surface like we're losing a lot of information, and it is true that there can be proteins that have integrated repeats into their structure, so repeat masking will affect our ability to annotate these proteins. However, these proteins are rare and the number of gene models and homology alignments improved by this step far exceed the few gene models that may be negatively affected.

Ab Initio Gene Prediction

Following repeat masking, MAKER runs ab initio gene predictors specified by the user to produce preliminary gene models. Ab initio gene predictors produce gene predictions based on underlying mathematical models describing patterns of intron/exon structure and consensus start signals. Gene models are not produced by directly using experimental evidence. Because the patterns of gene structure are going to differ from organism to organism, you must train gene predictors before you can use them. I will discuss how to do this later on.


MAKER currently supports:

  • SNAP
  • Augustus
  • GeneMark
  • FGENESH (Disabled on public MWAS site)


You must specify the HMM files you want to use when running each of these algorithms.

EST and Protein Evidence Alignment

A simple way to indicate if a sequence region is likely associated with a gene is to identify (A) if the region is actively being transcribed or (B) if the region has homology to a known protein. This can be done by aligning Expressed Sequence Tags (ESTs) and proteins to the genome using alignment algorithms.

  • ESTs are sequences derived from a cDNA library. Because of the difficulties associated with working with mRNA and depending on how the cDNA library was prepared, EST databases usually represent bits and pieces of transcribed mRNAs with only a few full length transcripts. MAKER aligns these sequences to the genome using BLASTN. If ESTs from the organism being annotated are unavailable or sparse, you can use ESTs from a closely related organism. However, ESTs from closely related organisms are unlikely to align using BLASTN since nucleotide sequences can diverge quite rapidly. For these ESTs, MAKER uses TBLASTX to align them in protein space.
  • Protein sequence generally diverges quite slowly over large evolutionary distances; as a result, proteins from even evolutionarily distant organisms can be aligned against raw genomic sequence to try and identify regions of homology. MAKER does this using BLASTX.


Remember that we are aligning against the repeat-masked genomic sequence. How is this going to affect our alignments? For one thing, we won't be able to align against low-complexity regions. Some real proteins contain low-complexity regions and it would be nice to identify those, but if we let anything align to a low-complexity region, we will get spurious alignments all over the genome. Wouldn't it be nice if there were a way to allow BLAST to extend alignments through low-complexity regions, but only if there is already an alignment somewhere else? You can do this with soft-masking. As you may remember, soft-masking uses lower case letters to mask sequence without losing the sequence information. BLAST allows you to use soft-masking to keep alignments from seeding in low-complexity regions while still allowing them to extend through those regions. This of course will allow some of the spurious alignments you were trying to avoid, but overall you still end up suppressing the majority of poor alignments while letting through enough real alignments to justify the cost.
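The seed-then-extend behaviour described above can be illustrated with a toy example.  This is a deliberately simplified sketch and not how BLAST is implemented: it uses exact word matches and ungapped extension only, and both function names are invented for the illustration.

```python
def find_seeds(query, subject, k):
    """Return (query_pos, subject_pos) pairs of exact k-mer matches that lie
    entirely in unmasked (upper-case) subject sequence."""
    seeds = []
    for s in range(len(subject) - k + 1):
        word = subject[s:s + k]
        if not word.isupper():
            continue  # the word touches soft-masked sequence, so no seed here
        for q in range(len(query) - k + 1):
            if query[q:q + k] == word:
                seeds.append((q, s))
    return seeds


def extend_right(query, subject, q, s):
    """Extend an ungapped match to the right, ignoring case so that the
    extension is allowed to run through soft-masked bases."""
    length = 0
    while (q + length < len(query) and s + length < len(subject)
           and query[q + length].upper() == subject[s + length].upper()):
        length += 1
    return length
```

A query made up only of low-complexity sequence never produces a seed, while a query that seeds in unmasked sequence can still be extended across the soft-masked stretch.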

Polishing Evidence Alignments

Because of oddities associated with how BLAST statistics work, BLAST alignments are not as informative as they could be. BLAST will align regions anywhere it can, even if the algorithm aligns regions out of order, with multiple overlapping alignments in the exact same region, or with slight overhangs around splice sites.


To get more informative alignments, MAKER uses the program Exonerate to polish BLAST hits. Exonerate realigns each sequence identified by BLAST around splice sites and forces the alignments to occur in order. The result is a high quality alignment that can be used to suggest near exact intron/exon positions. Polished alignments are produced using the est2genome and protein2genome options for Exonerate.


One of the benefits of polishing EST alignments is the ability to identify the strand an EST derives from. Because of amplification steps involved in building an EST library and limitations involved in some high throughput sequencing technologies, you don't necessarily know whether you're really aligning the forward or reverse transcript of an mRNA. However, if you take splice sites into account, you can only align to one strand correctly.
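To see why splice sites reveal the strand, recall that canonical introns begin with GT and end with AG on the transcribed strand.  An intron from a gene on the reverse strand therefore reads CT...AC on the forward genomic sequence.  The sketch below applies just that rule.  It is a simplification of what a splice-aware aligner such as Exonerate does, and the function name and coordinate convention are assumptions of this example.

```python
def infer_strand(genome, introns):
    """Guess the transcript strand from intron boundary dinucleotides.

    introns are 0-based, half-open (start, end) coordinates on the forward
    genomic sequence. Returns '+', '-', or '.' when undetermined.
    """
    plus = minus = 0
    for start, end in introns:
        intron = genome[start:end].upper()
        if intron.startswith("GT") and intron.endswith("AG"):
            plus += 1   # canonical GT...AG on the forward strand
        elif intron.startswith("CT") and intron.endswith("AC"):
            minus += 1  # reverse complement of GT...AG
    if plus > minus:
        return "+"
    if minus > plus:
        return "-"
    return "."
```

An unspliced (single exon) alignment has no introns, so the function returns "." for it, which mirrors why single exon ESTs are harder to orient.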


Integrating Evidence to Synthesize Final Annotations

Once you have ab initio predictions, EST alignments, and protein alignments you can integrate this evidence to produce even better gene predictions. MAKER does this by "talking" to the gene prediction programs. MAKER takes all the evidence, generates "hints" to where splice sites and protein coding regions are located, and then passes these "hints" to programs that will accept them.


MAKER can produce hint-based predictions with:

  • SNAP
  • Augustus
  • FGENESH
  • GeneMark (under development)


MAKER then takes the entire pool of ab initio and evidence-informed gene predictions, updates features such as 5' and 3' UTRs based on EST evidence, tries to determine alternative splice forms where EST data permits, and produces quality control metrics for each gene model (these are included in the output). MAKER then chooses, from among all the gene model possibilities, the one that best matches the evidence. This is done using a modified sensitivity/specificity distance metric.
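The exact form of the distance metric is not given here, so the following sketch should be read as an illustration of the idea rather than as MAKER's actual formula.  It assumes a simple distance of 1 minus the average of sensitivity and specificity, computed per nucleotide against the pooled evidence.  The function names and the example model names are hypothetical.

```python
def evidence_distance(model_positions, evidence_positions):
    """Distance between a gene model and the evidence, both given as sets of
    genomic positions. 0.0 means perfect agreement, 1.0 means none."""
    if not model_positions or not evidence_positions:
        return 1.0
    shared = len(model_positions & evidence_positions)
    sn = shared / len(evidence_positions)  # evidence covered by the model
    sp = shared / len(model_positions)     # model supported by the evidence
    return 1.0 - (sn + sp) / 2.0


def choose_best_model(candidates, evidence_positions):
    """Return the name of the candidate model closest to the evidence.

    candidates maps a model name to its set of exon positions.
    """
    return min(
        candidates,
        key=lambda name: evidence_distance(candidates[name], evidence_positions),
    )
```

For example, with evidence covering positions 1 to 10, a candidate covering exactly 1 to 10 beats both a candidate that overshoots (1 to 20) and one that falls short (6 to 10).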



Running MWAS with your Own Data

When using your own data, you need to tell MWAS all the details about how you want the annotation process to proceed. Because there can be many variables and options involved in annotation you will need to review each option carefully. At the very least you should provide a genome sequence file, an EST sequence file, and a protein homology sequence file for new annotation jobs.

MWAS Job Configuration

Basic Input Files

All the basic input files for MWAS should be in fasta format.

  • genome - Genomic sequence file
  • est - ESTs from the same organism or from a very closely related organism (e.g. chimpanzee to human). These are aligned first via BLASTN with very strict filtering, so any sequence divergence can prohibit the alignment.
  • altest - These are ESTs from other closely related organisms (e.g. mouse to human). They are aligned via TBLASTX in protein space, so greater sequence divergence is permitted.
  • protein - proteins from the same or other organisms. These are aligned via BLASTX against the genome. Proteins that align to a region will not necessarily be orthologous or paralogous. The alignment may just be based on short regions such as a shared domain. You may also get alignments to pseudogenes. Polishing BLASTX hits with Exonerate helps identify what are likely true paralogs and orthologs.


Repeat Masking Options

Repeat masking is important for improving gene predictor performance and avoiding protein alignments to what are likely just transposons. You should also expect a certain amount of genomic contamination in the EST database; much of this contamination maps back to repeat regions. By repeat masking, we can avoid issues with all types of input data.


  • RepeatMasker - Performs repeat masking using the RepBase libraries.
  • RepeatRunner - This is a fasta file of transposon and virus related proteins. The server provides an internal database to use by default.
  • Users can also supply a fasta file of species specific nucleotide repeats or a GFF3 file of pre-defined repeat regions. Species specific repeat databases can be built using programs like PILER and uploaded for use with MAKER.


Gene Prediction Options

Gene prediction options affect the final gene annotations more than any other option type. This brings up the point that electronically produced gene annotations will only be as good as the gene predictions they are based on.


  • Predictor Options - Tell MWAS which programs to use when generating gene models.
    • SNAP
    • Augustus
    • GeneMark
    • Est2Genome - Allows high quality spliced Exonerate EST alignments to become gene annotations. This only happens when there is no gene prediction overlapping the region. This is useful for generating gene annotations in the absence of a trained gene predictor.
    • Protein2Genome - Used only for prokaryotic genomes. Will try to build gene models based solely on the presence of open reading frames and protein alignments to other species.
    • User supplied gene predictions - These are gene predictions in GFF3 format from any source you have available to you. They will be treated the same as any gene predictions derived from MWAS supported sources.
    • User supplied gene models - These are pre-existing gene models from the same assembly as the contigs being annotated. They can be integrated and automatically updated by MAKER to reflect new evidence (e.g. by adding UTRs). MAKER can also pull names forward from these pre-existing gene models onto the new, updated genome annotations.
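
The Est2Genome rule above (an EST alignment becomes an annotation only when no gene prediction overlaps the region) can be pictured with a small interval check. This is only a sketch of the idea, not MAKER's actual code: the function names are hypothetical, coordinates are assumed to be 1-based inclusive (start, end) pairs on a single contig, and strand is ignored.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based inclusive, same contig


def overlaps(a: Interval, b: Interval) -> bool:
    """Return True if the two intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def est_alignments_to_promote(
    est_alignments: List[Interval], gene_predictions: List[Interval]
) -> List[Interval]:
    """Keep only EST alignments that no gene prediction overlaps."""
    return [
        est
        for est in est_alignments
        if not any(overlaps(est, pred) for pred in gene_predictions)
    ]


# Hypothetical coordinates, for illustration only.
ests = [(100, 900), (5000, 6200)]
predictions = [(50, 1200)]
print(est_alignments_to_promote(ests, predictions))  # [(5000, 6200)]
```

The first EST alignment is dropped because a prediction already covers that region; the second has no overlapping prediction and would be the kind of alignment that can stand in as an annotation.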

Other MAKER Options

  • Sets the minimum length a contig must have or else it will be skipped.
  • Sets the minimum length a predicted protein must have (in amino acids) to be annotated.
  • Sets the expected maximum intron size for evidence alignments.
  • Tells MAKER to consider single exon EST evidence when generating annotations. Single exon ESTs are more likely to be genomic contamination.
  • Sets the minimum length required for single exon ESTs if 'single_exon' is enabled.

MWAS Results

The results provided to the user by MWAS can either be downloaded or viewed directly online using a Java Web Start version of the Apollo genome annotation curation tool.

If you choose to download your data you will be presented with a tarball that, when unpacked, will produce an output directory called something like 2434.maker.output. The name of the output directory is based on the job id assigned to your sequence file.


When you examine the contents of this directory, you should see a list of directories and files created by MAKER.

drwxr-xr-x 3 gmod gmod 4096 2009-07-12 23:23 2434_datastore
-rw-r--r-- 1 gmod gmod  135 2009-07-12 23:27 2434_master_datastore_index.log
-rw-r--r-- 1 gmod gmod 1579 2009-07-12 23:23 maker_bopts.log
-rw-r--r-- 1 gmod gmod 1250 2009-07-12 23:23 maker_exe.log
-rw-r--r-- 1 gmod gmod 4016 2009-07-12 23:23 maker_opts.log
drwxr-xr-x 2 gmod gmod 4096 2009-07-12 23:23 mpi_blastdb
  • The maker_opts.log, maker_exe.log, and maker_bopts.log files are logs of the control files used for this run of MAKER.
  • The mpi_blastdb directory contains fasta indexes and BLAST database files created from the input EST, protein, and repeat databases.
  • The 2434_master_datastore_index.log contains information on both the run status of individual contigs and information on where individual contig data is stored.
  • The 2434_datastore directory contains a set of subfolders, each containing the final MAKER output for individual contigs from the genomic fasta file.
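
If you want to check which settings a run actually used, the logged control files can be read with a few lines of Python. The sketch below assumes the usual control file layout of one "name=value #comment" entry per line; the option names in the example text are illustrative rather than copied from a real log, and parse_control_text is a hypothetical helper name.

```python
def parse_control_text(text):
    """Parse 'name=value #comment' lines into a dict of option -> value."""
    options = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line or "=" not in line:
            continue  # skip blank lines and comment-only lines
        name, value = line.split("=", 1)
        options[name.strip()] = value.strip()
    return options


# Illustrative content; a real maker_opts.log will list many more options.
example = """\
#-----Example options (illustrative)
min_contig=1 #skip contigs below this length
single_exon=0 #consider single exon EST evidence
est= #no value set for this option
"""

for name, value in parse_control_text(example).items():
    print(name, "=", repr(value))
```

To use it on a downloaded job, read the file first, for example parse_control_text(open("maker_opts.log").read()).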


Once a MAKER run is finished the most important file to look at is the 2434_master_datastore_index.log to see if there were any failures.

less 2434_master_datastore_index.log

MWAS provides a summary of this file when you click on results to download a job. MWAS also displays run errors under the log button on the MWAS main queue page.

If everything proceeded correctly you should see the following in your 2434_master_datastore_index.log file.

contig-dpp-500-500      2434_datastore/contig-dpp-500-500 STARTED
contig-dpp-500-500      2434_datastore/contig-dpp-500-500 FINISHED


There are only entries describing a single contig because there was only one contig in the example file. These lines indicate that the contig 'contig-dpp-500-500' STARTED and then FINISHED without incident. Other possible entries include:

  • DIED - indicates a failed run on this contig; MAKER will retry these
  • RETRY - indicates that MAKER is retrying a contig that failed
  • SKIPPED_SMALL - indicates the contig was too short
  • DIED_SKIPPED_PERMANENT - indicates a failed contig that MAKER will not attempt to retry
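
For jobs with many contigs, scanning this log by eye becomes tedious. The following sketch tallies the final status of each contig. It assumes each line holds whitespace-separated fields with the contig name first and the status last, as in the example above, and that lines are appended in chronological order so that the last entry for a contig is its current state. The function name final_status is hypothetical.

```python
from collections import Counter


def final_status(index_text):
    """Map each contig name to the last status recorded for it."""
    status = {}
    for line in index_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        status[fields[0]] = fields[-1]  # later lines overwrite earlier ones
    return status


example = """\
contig-dpp-500-500      2434_datastore/contig-dpp-500-500 STARTED
contig-dpp-500-500      2434_datastore/contig-dpp-500-500 FINISHED
"""

status = final_status(example)
print(Counter(status.values()))  # how many contigs ended in each state
unfinished = sorted(name for name, state in status.items() if state != "FINISHED")
print(unfinished)  # contigs that need attention; empty for the example
```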


The entries in the 2434_master_datastore_index.log file also indicate that the output files for this contig are stored in the directory 2434_datastore/contig-dpp-500-500/. Knowing where the output is stored may seem rather trivial; however, input genome fasta files can contain thousands or even hundreds of thousands of contigs, and many file systems have performance problems with large numbers of sub-directories and files within a single directory. Even when the underlying file system handles things gracefully, access via network file systems can be an issue. To deal with this situation, MAKER uses a datastore module to create a hierarchy of sub-directory layers, starting from a 'base', and mapping identifiers to corresponding sub-directories. For situations where the input genome fasta file contains more than 1,000 contigs, the datastore structure is used automatically, and the master_datastore_index.log file becomes essential for identifying where the output for a given contig is stored.
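
Because the index records a directory for every contig, it can be used to locate output files without knowing how the datastore hierarchy was built. The sketch below returns the expected GFF3 path for each finished contig. It assumes the second field of each line is a directory relative to the unpacked output directory and that the GFF3 file is named after the contig with a .gff extension, as in the example job; contig names containing unusual characters may be stored differently. contig_gff_paths is a hypothetical helper name.

```python
import posixpath


def contig_gff_paths(index_text, output_dir):
    """Map each FINISHED contig to the expected path of its GFF3 file."""
    paths = {}
    for line in index_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[-1] == "FINISHED":
            contig, rel_dir = fields[0], fields[1]
            paths[contig] = posixpath.join(output_dir, rel_dir, contig + ".gff")
    return paths


example = """\
contig-dpp-500-500      2434_datastore/contig-dpp-500-500 STARTED
contig-dpp-500-500      2434_datastore/contig-dpp-500-500 FINISHED
"""

for contig, path in contig_gff_paths(example, "2434.maker.output").items():
    print(contig, path)
```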


Now let's take a look at what MAKER produced for the contig 'contig-dpp-500-500'.

cd 2434_datastore/contig-dpp-500-500
ls -l

The directory should contain a number of files.

-rw-r--r-- 1 gmod gmod 47437 2009-07-12 23:27 contig-dpp-500-500.gff
-rw-r--r-- 1 gmod gmod   189 2009-07-12 23:27 contig-dpp-500-500.maker.non_overlapping_ab_initio.proteins.fasta
-rw-r--r-- 1 gmod gmod   399 2009-07-12 23:27 contig-dpp-500-500.maker.non_overlapping_ab_initio.transcripts.fasta
-rw-r--r-- 1 gmod gmod   704 2009-07-12 23:27 contig-dpp-500-500.maker.proteins.fasta
-rw-r--r-- 1 gmod gmod   901 2009-07-12 23:27 contig-dpp-500-500.maker.snap_masked.proteins.fasta
-rw-r--r-- 1 gmod gmod  4837 2009-07-12 23:27 contig-dpp-500-500.maker.snap_masked.transcripts.fasta
-rw-r--r-- 1 gmod gmod  4430 2009-07-12 23:27 contig-dpp-500-500.maker.transcripts.fasta


  • The contig-dpp-500-500.gff contains all annotations and evidence alignments in GFF3 format. This is the important file for use with Apollo or GBrowse.
  • The contig-dpp-500-500.maker.transcripts.fasta and contig-dpp-500-500.maker.proteins.fasta files contain the transcript and protein sequences for MAKER produced gene annotations.
  • The contig-dpp-500-500.maker.snap_masked.transcripts.fasta and contig-dpp-500-500.maker.snap_masked.proteins.fasta files contain the transcript and protein sequences for all SNAP ab initio gene predictions. If you use other ab initio gene predictors, those sequence files will follow a similar naming pattern.
  • The contig-dpp-500-500.maker.non_overlapping_ab_initio.transcripts.fasta and contig-dpp-500-500.maker.non_overlapping_ab_initio.proteins.fasta files contain the set of best ab initio gene predictions that do not overlap a MAKER gene annotation. These files can be analyzed to see if there is any reason to promote them to the status of gene annotations. For example: you can run iprscan to see if they contain known protein domains.
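
A quick way to get a feel for these fasta files is to count the records and look at their lengths, for example to see how many non-overlapping ab initio predictions were left over. This is a minimal sketch using only the standard library; the record names in the example are hypothetical, and fasta_lengths is not part of MAKER.

```python
def fasta_lengths(fasta_text):
    """Return a dict mapping each record name to its sequence length."""
    lengths = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""  # first word of the header
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)  # sequences may span several lines
    return lengths


# Hypothetical records, for illustration only.
example = """\
>prediction-1 some description
MKTLLVAG
SSW
>prediction-2
MAAR
"""

lengths = fasta_lengths(example)
print(len(lengths), "records")
for name, length in lengths.items():
    print(name, length)
```

To apply it to a real file, pass in the file contents, for example fasta_lengths(open("contig-dpp-500-500.maker.proteins.fasta").read()).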

Viewing MAKER Annotations

The raw GFF3 file produced by MAKER is difficult to interpret by eye.


For sanity checking purposes it would be nice to have a graphical view of what's in the GFF3 file. To do this, GFF3 files can be loaded into programs like Apollo and GBrowse. MWAS allows you to view the files in Apollo directly on the website. You can also get summary statistics of annotation features using the tool SOBA from the Sequence Ontology Consortium.


Apollo

On the results screen choose a contig from a job and click "View in Apollo". A Java Web Start version of Apollo will then install itself automatically on your computer, if not already installed. Apollo will then automatically load the contig you indicated into the browser. You will notice that there are a number of bars representing the gene annotations and the evidence alignments supporting those annotations. Annotations are in the middle light colored panel, and evidence alignments are in the dark panels at the top and bottom.


Each piece of evidence in the dark panels is colored according to the source it was derived from (e.g. RepeatMasker, BLASTX, etc.). To identify which source a feature belongs to, just click on it and the name of the source will be displayed in the table at the bottom of the Apollo screen.

Possible Sources Include:

  • BLASTN - BLASTN alignment of EST evidence
  • BLASTX - BLASTX alignment of protein evidence
  • TBLASTX - TBLASTX alignment of EST evidence from closely related organisms
  • EST2Genome - Polished EST alignment from Exonerate
  • Protein2Genome - Polished protein alignment from Exonerate
  • SNAP - SNAP ab initio gene prediction
  • GENEMARK - GeneMark ab initio gene prediction
  • Augustus - Augustus ab initio gene prediction
  • FgenesH - FGENESH ab initio gene prediction
  • Repeatmasker - RepeatMasker identified repeat
  • RepeatRunner - RepeatRunner identified repeat from the repeat protein database
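
The same source information is stored in the second column of the GFF3 file, so you can also tally features per source without opening Apollo. The sketch below counts features by (source, type) using the standard nine-column, tab-separated GFF3 layout. The exact source labels written to the file may differ from the names shown in Apollo, and the example lines are made up for illustration.

```python
from collections import Counter


def count_by_source_and_type(gff3_text):
    """Count GFF3 features by (source, type), i.e. columns 2 and 3."""
    counts = Counter()
    for line in gff3_text.splitlines():
        if line.startswith("##FASTA"):
            break  # an embedded sequence section may follow the features
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments, and directives
        columns = line.split("\t")
        if len(columns) != 9:
            continue  # not a well-formed feature line
        counts[(columns[1], columns[2])] += 1
    return counts


# Made-up feature lines; source labels here are illustrative only.
example = "\n".join([
    "##gff-version 3",
    "contig-dpp-500-500\tsourceA\tgene\t100\t900\t.\t+\t.\tID=g1",
    "contig-dpp-500-500\tsourceA\tmRNA\t100\t900\t.\t+\t.\tID=m1;Parent=g1",
    "contig-dpp-500-500\tsourceB\tmatch\t150\t400\t.\t+\t.\tID=h1",
    "contig-dpp-500-500\tsourceB\tmatch\t500\t800\t.\t+\t.\tID=h2",
])

for (source, feature_type), n in sorted(count_by_source_and_type(example).items()):
    print(source, feature_type, n)
```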

Training ab initio Gene Predictors

If you are involved in a genome project for an emerging model organism, you should already have an EST database which would have been generated as part of the original sequencing project. A protein database can be collected from the genome databases of closely related organisms, or by using the UniProt/SwissProt protein database or the NCBI NR protein database. However, a trained ab initio gene predictor is much more difficult to obtain. Gene predictors require existing gene models on which to base prediction parameters, and with emerging model organisms there are no pre-existing gene models. So how then are you supposed to train your gene prediction programs?


MWAS gives the user the option to produce gene annotations directly from the EST evidence. You can then use these imperfect gene models to train a gene predictor program. Once you have re-run MWAS with the newly trained gene predictor, you can use the second set of gene annotations to train the gene predictor yet again. This bootstrap process allows you to iteratively improve the performance of ab initio gene predictors.

GFF3 Pass-through

What if I'm not working on a new genome project, but rather I have an existing annotation set, and I just want to update my genome database to reflect new protein and EST evidence? Here you can use a feature in MAKER called GFF3 pass-through, which allows you to pass existing annotations into the program and combine them with new evidence for use in the annotation process.

mRNAseq

mRNAseq is a high throughput technique for sequencing the entire transcriptome, and it holds the promise of allowing researchers to identify all exons and alternative splice forms for every gene in the genome with a single experiment. It may soon make gene predictors (mostly) a thing of the past.

  • Still need to de-convolute reads & evidence (for now)
  • Still need to archive, manage, and distribute annotations


MRNAseq.jpg


We are currently working on native support for mRNAseq data within the MAKER pipeline. However, because of the GFF3 pass-through option, there is a way to take advantage of mRNAseq reads right now. By mapping mRNAseq reads using Bowtie and TopHat, you can create GFF3 files of read islands and junctions. These data can then be passed in as EST evidence and will be used for generating hint-based gene predictions and for choosing final annotations.
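
As a rough illustration of the conversion step, the sketch below turns one junction record in 12-column BED format (the layout commonly used for splice junction files) into GFF3 lines. Several things here are assumptions rather than documented MAKER requirements: the feature types expressed_sequence_match and match_part, the source label "tophat", and the ID/Parent naming scheme. Check the MAKER documentation for the exact GFF3 layout it expects before using anything like this on real data.

```python
def bed12_to_gff3(bed_line, source="tophat"):
    """Convert one BED12 record into a parent GFF3 line plus one line per block."""
    f = bed_line.rstrip("\n").split("\t")
    chrom, start, end = f[0], int(f[1]), int(f[2])
    name, score, strand = f[3], f[4], f[5]
    sizes = [int(x) for x in f[10].split(",") if x]
    offsets = [int(x) for x in f[11].split(",") if x]

    # BED starts are 0-based and half-open; GFF3 is 1-based and inclusive.
    rows = ["\t".join([
        chrom, source, "expressed_sequence_match",
        str(start + 1), str(end), score, strand, ".",
        "ID={0};Name={0}".format(name),
    ])]
    for i, (size, offset) in enumerate(zip(sizes, offsets), 1):
        block_start = start + offset + 1
        block_end = start + offset + size
        rows.append("\t".join([
            chrom, source, "match_part",
            str(block_start), str(block_end), score, strand, ".",
            "ID={0}:part{1};Parent={0}".format(name, i),
        ]))
    return rows


# Hypothetical junction: two blocks flanking an intron.
example = "contig1\t100\t300\tJUNC1\t5\t+\t100\t300\t255,0,0\t2\t50,60\t0,140"
print("##gff-version 3")
for row in bed12_to_gff3(example):
    print(row)
```

For the example record, the parent feature spans 101 to 300 and the two blocks span 101 to 150 and 241 to 300, which leaves the implied intron between positions 151 and 240.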


Merge/Resolve Legacy Annotations

Legacy annotations

  • Many are no longer maintained by original creators
  • In some cases more than one group has annotated the same genome, using very different procedures, even different assemblies
  • Many investigators have their own genome-scale data and would like a private set of annotations that reflect these data
  • There will be a need to revise, merge, evaluate, and verify legacy annotation sets in light of RNA-seq and other data


Legacy.png


MAKER will:

  • Identify the legacy annotation most consistent with the new data
  • Automatically revise it in light of the new data
  • Create a new annotation where no existing annotation is present