Difference between revisions of "MWAS Tutorial"

From GMOD
Line 184: Line 184:
 
*Annotations can be automatically updated with new evidence by simply passing existing annotation sets back into the pipeline
 
*Annotations can be automatically updated with new evidence by simply passing existing annotation sets back into the pipeline
  
==Installation==
 
  
===Prerequisites===
+
==Getting Started with MWAS==
*[http://www.perl.org/ Perl] 5.8.0 or higher
+
===Registration===
*[http://www.bioperl.org/ BioPerl] 1.6 or higher
+
MWAS is free to all users and does not require a login.  Registration is recommended, however: it makes file and job management easier, and registered users can upload more sequence data.
*[http://homepage.mac.com/iankorf/ SNAP] version 2009-02-03  or higher
+
*[http://www.repeatmasker.org/ RepeatMasker] 3.1.6  or higher
+
*[http://www.ebi.ac.uk/~guy/exonerate/ Exonerate] 1.4  or higher
+
  
 +
===RUNNING MWAS WITH EXAMPLE DATA===
 +
MWAS comes with some example files to familiarize the user with how to run MAKER. You can pre-load the fields for a new job by selecting one of the examples from the drop-down menu on the "New Job" page.
  
You must also install one of the following:
 
*[http://blast.wustl.edu/ WU-BLAST] 2.0 or higher (Now [http://www.advbiocomp.com/ AB-BLAST])
 
*[http://www.ncbi.nlm.nih.gov/Ftp/ NCBI BLAST] 2.2.X or higher
 
  
 
+
Next we need to tell MAKER all the details about how we want the annotation process to proceed.  Because there can be many variables and options involved in annotation, you will need to review each option carefully. At the very least you should provide a genome sequence file, an EST sequence file, and a protein homology sequence file.
Optional Components:
+
*[http://augustus.gobics.de/ Augustus] 2.0 or higher
+
*[http://exon.biology.gatech.edu/ GeneMark-ES] 2.3a or higher
+
*[http://www.softberry.com/ FGENESH] 2.6 or higher
+
 
+
 
+
Required for optional MPI support:
+
*[http://www.mcs.anl.gov/research/projects/mpich2/ MPICH2]
+
 
+
 
+
 
+
===The MAKER Package===
+
Because of the number of prerequisites, we will not cover the details of installing these other programs; they have already been installed for you.  But even though I pre-installed most programs for you, I'm still going to have you perform basic post-installation configuration, so let's get started.
+
 
+
 
+
MAKER can be downloaded from:
+
*http://www.yandell-lab.org/ - but don't do it
+
 
+
 
+
To keep everyone from hitting the server at once though, I placed a tarball in the <tt>~/software/maker/</tt> directory.  Let's unpack this to the directory <tt>/usr/local/</tt>.
+
cd /usr/local
+
sudo tar -zxvf ~/software/maker/maker.tar.gz
+
cd maker
+
ls -l
+
 
+
 
+
You should now see the following:
+
drwxr-xr-x  2 gmod gmod  4096 2009-03-25 13:24 Apollo
+
drwxr-xr-x  3 gmod gmod  4096 2009-07-12 22:50 bin
+
drwxr-xr-x  3 gmod gmod  4096 2009-07-12 23:37 data
+
-rw-r--r--  1 gmod gmod  7746 2009-07-12 22:50 INSTALL
+
drwxr-xr-x 18 gmod gmod  4096 2009-07-12 22:50 lib
+
drwxr-xr-x  3 gmod gmod  4096 2009-07-12 22:50 MPI
+
drwxr-xr-x  7 gmod gmod  4096 2009-07-12 23:07 perl
+
-rw-r--r--  1 gmod gmod 18653 2009-07-12 22:50 README
+
 
+
 
+
There are two files in particular that you will want to look at when installing MAKER: <tt>INSTALL</tt> and <tt>README</tt>.  <tt>INSTALL</tt> gives a brief overview of MAKER and prerequisite installation.  Let's take a look at it.
+
less INSTALL
+
 
+
 
+
You can see there is a step by step guide for installing pre-requisites as well as MAKER.  Since the pre-requisites are already installed, jump to the MAKER installation section (press space bar to scroll down).
+
 
+
7.  Install MAKER.  Download from http://www.yandell-lab.org
+
 
+
  a.  Unpack the MAKER tar file into the directory of your choice (i.e.
+
      /usr/local).
+
  b.  Change to the directory maker/perl and run Install.PL by typing:
+
      perl Install.PL
+
  c.  Now add the following to your .bash_profile if you haven't already:
+
        export WUBLASTFILTER="where_wublast_is/filter"
+
        export WUBLASTMAT="where_wublast_is/matrix"
+
        export ZOE="where_snap_is/Zoe"
+
        export AUGUSTUS_CONFIG_PATH="where_augustus_is/config"
+
  d.  Add the location where you installed MAKER to your PATH variable in
+
      .bash_profile (i.e. export PATH="/usr/local/maker/bin:$PATH").
+
  e.  You can now run a test of MAKER by following the instructions in the MAKER
+
      README file.
+
 
+
 
+
  See the README file for details on installing mpi_maker
+
 
+
According to the documentation we need to run the <tt>Install.PL</tt> script.
+
cd perl
+
sudo perl Install.PL
+
 
+
 
+
Now we're going to need to add a few entries to your user profile.  So let's open it in a [[Linux Text Editors|text editor]] (I use emacs, you can use whatever you want).
+
emacs ~/.profile
+
{{TextEditorLink|emacs}}
+
 
+
Add the following to your user profile.
+
 
+
For bash:
+
PATH=$PATH:/usr/local/NCBI_blast/bin
+
PATH=$PATH:/usr/local/RepeatMasker
+
PATH=$PATH:/usr/local/exonerate/bin
+
PATH=$PATH:/usr/local/snap
+
PATH=$PATH:/usr/local/augustus/bin
+
PATH=$PATH:/usr/local/gmes
+
PATH=$PATH:/usr/local/maker/bin
+
export PATH
+
 
+
export ZOE=/usr/local/snap
+
export AUGUSTUS_CONFIG_PATH=/usr/local/augustus/config
+
 
+
 
+
Now reload your profile.
+
source ~/.profile
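Before testing MAKER itself, it can help to confirm that the PATH entries above took effect.  The following is only a minimal sketch: the function name <tt>missing_tools</tt> is made up for this example, and the executable names in the demo call (<tt>blastall</tt>, <tt>RepeatMasker</tt>, <tt>exonerate</tt>, <tt>snap</tt>, <tt>maker</tt>) are assumptions that may differ on your installation.

```shell
#!/bin/sh
# missing_tools NAME...
# Print the name of every listed program that cannot be found on PATH.
# Returns 0 when all of them are found and 1 otherwise.
# (Hypothetical helper written for this tutorial, not part of MAKER.)
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            rc=1
        fi
    done
    return "$rc"
}

# The executable names below are assumptions; adjust them to your install.
if missing=$(missing_tools blastall RepeatMasker exonerate snap maker); then
    echo "all tools found on PATH"
else
    # $missing is left unquoted on purpose so the names print on one line.
    echo "not found on PATH:" $missing
fi
```

If a program is reported as not found, recheck the corresponding PATH line in your profile and reload it.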
+
 
+
 
+
MAKER should now be installed.  Let's test the executable.  We should see the usage statement.
+
maker -help
+
 
+
==Getting Started with MAKER==
+
===RUNNING MAKER WITH EXAMPLE DATA===
+
MAKER comes with some example files to test the installation and to familiarize the user with how to run MAKER.  The example files are found in the <tt>maker/data</tt> directory.
+
ls -l /usr/local/maker/data
+
 
+
-rw-r--r-- 1 gmod gmod    32712 2009-03-25 13:24 dpp_contig.fasta
+
-rw-r--r-- 1 gmod gmod    3045 2009-03-25 13:24 dpp_proteins.fasta
+
-rw-r--r-- 1 gmod gmod    19138 2009-03-25 13:24 dpp_transcripts.fasta
+
-rw-r--r-- 1 gmod gmod 19744232 2009-07-12 22:50 te_proteins.fasta
+
 
+
For convenience we are going to copy these files before running MAKER.  First we need to make a new directory that will hold all MAKER input and output files.
+
mkdir ~/maker_run1
+
cd ~/maker_run1
+
 
+
Now copy the example files to the new directory.
+
cp /usr/local/maker/data/dpp* ~/maker_run1
+
 
+
 
+
Next we need to tell MAKER all the details about how we want the annotation process to proceed.  Because there can be many variables and options involved in annotation, command line options would be too numerous and cumbersome.  Instead MAKER uses a set of configuration files which guide each run.  You can create a set of generic configuration files in the current working directory by typing the following.
+
maker -CTL
+
 
+
 
+
This creates three files.
+
{|
+
! <tt>maker_exe.ctl</tt>
+
| contains the path information for needed executables.
+
|-
+
! <tt>maker_bopts.ctl</tt>
+
| contains filtering statistics for BLAST and Exonerate
+
|-
+
! <tt>maker_opts.ctl</tt>
+
| contains all other information for MAKER, including the location of the input genome file.
+
|}
+
 
+
Control files are run-specific, and separate control files will need to be built for each genome given to MAKER. MAKER looks for control files in the current working directory, so it is recommended that each genome be run in its own directory containing its own control files.
+
 
+
Let's take a look at the <tt>maker_exe.ctl</tt> file.
+
emacs maker_exe.ctl
+
{{TextEditorLink|emacs}}
+
 
+
You will see the names of a number of MAKER-supported executables as well as the paths to their locations.  If you followed the installation instructions correctly, including the instructions for installing prerequisite programs, all executable paths should be filled in automatically.  However, if the location of any executable is not in your PATH environment variable, as described in the installation instructions, you will have to add it manually to the <tt>maker_exe.ctl</tt> file every time you run MAKER.
+
 
+
Lines in the MAKER control files have the format key:value, with no spaces before or after the colon (:).  If the value is a file name, you can use relative paths and environment variables, ''e.g.'' <tt>snap:$HOME/snap</tt>.  Note that in all control files, the comments written to help users begin with a pound sign (#).  In addition, the option names before the colon cannot be changed.
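Since every option is a single key:value line, values can also be set from a script rather than in a text editor.  The sketch below is one possible way to do that; <tt>set_ctl</tt> is a hypothetical helper (not a MAKER command), and the demo control file contents are made up.  It replaces the whole matching line, so anything else on that line is dropped.

```shell
#!/bin/sh
# set_ctl FILE KEY VALUE
# Rewrite the line that starts with "KEY:" so that it reads "KEY:VALUE".
# Comment lines (starting with #) and all other keys are left untouched.
# Assumption: VALUE contains no backslashes (awk -v would interpret them).
# (Hypothetical helper written for this tutorial, not part of MAKER.)
set_ctl() {
    ctl_file=$1
    ctl_key=$2
    ctl_value=$3
    ctl_tmp=$(mktemp) || return 1
    awk -v k="$ctl_key" -v v="$ctl_value" '
        index($0, k ":") == 1 { print k ":" v; next }   # matching key: replace line
        { print }                                       # everything else: keep as is
    ' "$ctl_file" > "$ctl_tmp" &&
    cat "$ctl_tmp" > "$ctl_file"    # overwrite in place, keeping file permissions
    rm -f "$ctl_tmp"
}

# Demo on a throwaway file with made-up contents.
demo_dir=$(mktemp -d)
printf '%s\n' '# example control file' 'genome:' 'est:' 'predictor:est2genome' \
    > "$demo_dir/maker_opts.ctl"
set_ctl "$demo_dir/maker_opts.ctl" genome dpp_contig.fasta
set_ctl "$demo_dir/maker_opts.ctl" predictor snap
cat "$demo_dir/maker_opts.ctl"
rm -rf "$demo_dir"
```

Editing the file by hand, as we do in this tutorial, is just as valid; a helper like this is mainly useful when you prepare many runs.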
+
 
+
Now let's take a look at the <tt>maker_bopts.ctl</tt> file.
+
emacs maker_bopts.ctl
+
{{TextEditorLink|emacs}}
+
 
+
In this file you will find values you can edit for downstream filtering of BLAST and Exonerate alignments.  At the very top of the file you will see an option that tells MAKER whether to use WU-BLAST or NCBI-BLAST.  We want to set this to NCBI-BLAST, since that is what we have installed.  We can leave the remaining values at their defaults.
+
blast_type:ncbi
+
 
+
Now let's take a look at the <tt>maker_opts.ctl</tt> file.
+
emacs maker_opts.ctl
+
{{TextEditorLink|emacs}}
+
 
+
This is the primary configuration file for MAKER specific options.  Here we need to set the location of the genome, EST, and protein input files we will be using.  These come from the supplied example files.  We also need to set repeat masking options, as well as a number of other configurations.  We'll discuss these options in more detail later on, but for now just adjust the following values.
+
genome:dpp_contig.fasta
+
est:dpp_transcripts.fasta
+
protein:dpp_proteins.fasta
+
snaphmm:fly
+
predictor:snap
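A mistyped file name in the control file is an easy mistake to make, so it can be worth checking the input entries before starting a run.  The following is a rough sketch only: <tt>check_inputs</tt> is a hypothetical helper, it handles plain file paths (it does not expand environment variables such as <tt>$HOME</tt>), and the demo files are empty placeholders.

```shell
#!/bin/sh
# check_inputs FILE KEY...
# For each KEY, read its value from a key:value control file and report
# whether that value names an existing file.  Returns 1 if any KEY is
# unset or points at a missing file, 0 otherwise.
# (Hypothetical helper written for this tutorial, not part of MAKER.)
check_inputs() {
    ctl=$1
    shift
    rc=0
    for key in "$@"; do
        # Take the text after "key:" from the first matching line.
        value=$(sed -n "s/^${key}://p" "$ctl" | head -n 1)
        if [ -z "$value" ]; then
            echo "$key: not set"
            rc=1
        elif [ ! -f "$value" ]; then
            echo "$key: missing file $value"
            rc=1
        else
            echo "$key: ok ($value)"
        fi
    done
    return "$rc"
}

# Demo in a throwaway directory; the protein file is deliberately absent.
demo_dir=$(mktemp -d)
(
    cd "$demo_dir" || exit 1
    : > dpp_contig.fasta
    : > dpp_transcripts.fasta
    printf '%s\n' 'genome:dpp_contig.fasta' 'est:dpp_transcripts.fasta' \
        'protein:dpp_proteins.fasta' > maker_opts.ctl
    check_inputs maker_opts.ctl genome est protein ||
        echo "fix the entries above before running maker"
)
rm -rf "$demo_dir"
```

Relative paths are resolved against the current directory, which matches how we run MAKER from the directory holding the control files.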
+
 
+
 
+
Now let's run MAKER.
+
maker maker_exe.ctl maker_opts.ctl maker_bopts.ctl
+
 
+
 
+
You should now see a large amount of status information flowing past your screen.  If you don't want to see this you can run MAKER with the <tt>-q</tt> option for "quiet" on future runs.
+
  
 
==Details of What is Going on Inside of MAKER==
 
==Details of What is Going on Inside of MAKER==
Line 384: Line 217:
  
  
MAKER currently supports:
+
MWAS currently supports:
 
*SNAP
 
*SNAP
 
*Augustus
 
*Augustus
 
*GeneMark
 
*GeneMark
*FGENESH
+
*FGENESH (Not shown on public site)
  
  
Line 428: Line 261:
  
 
==MAKER's Output==
 
==MAKER's Output==
If you look in the current working directory, you will see that MAKER has created an output directory called <tt>dpp_contig.maker.output</tt>.  The name of the output directory is based off of the input genomic sequence file, which in this case was <tt>dpp_contig.fasta</tt>.
+
Once your job is finished and you download the data, you will see that MAKER has created an output directory called something like <tt>2434.maker.output</tt>.  The name of the output directory is based off of the job id assigned to your sequence file.
 
+
 
+
Now let's see what's inside the output directory.
+
cd dpp_contig.maker.output
+
ls -l
+
  
  
 
You should now see a list of directories and files created by MAKER.
 
You should now see a list of directories and files created by MAKER.
  drwxr-xr-x 3 gmod gmod 4096 2009-07-12 23:23 dpp_contig_datastore
+
  drwxr-xr-x 3 gmod gmod 4096 2009-07-12 23:23 2434_datastore
  -rw-r--r-- 1 gmod gmod  135 2009-07-12 23:27 dpp_contig_master_datastore_index.log
+
  -rw-r--r-- 1 gmod gmod  135 2009-07-12 23:27 2434_master_datastore_index.log
 
  -rw-r--r-- 1 gmod gmod 1579 2009-07-12 23:23 maker_bopts.log
 
  -rw-r--r-- 1 gmod gmod 1579 2009-07-12 23:23 maker_bopts.log
 
  -rw-r--r-- 1 gmod gmod 1250 2009-07-12 23:23 maker_exe.log
 
  -rw-r--r-- 1 gmod gmod 1250 2009-07-12 23:23 maker_exe.log
Line 446: Line 274:
 
*The <tt>maker_opts.log</tt>, <tt>maker_exe.log</tt>, and <tt>maker_bopts.log</tt> files are logs of the control files used for this run of MAKER.

*The <tt>maker_opts.log</tt>, <tt>maker_exe.log</tt>, and <tt>maker_bopts.log</tt> files are logs of the control files used for this run of MAKER.
 
*The <tt>mpi_blastdb</tt> directory contains [[Glossary#FASTA|fasta]] indexes and BLAST database files created from the input EST, protein, and repeat databases.
 
*The <tt>mpi_blastdb</tt> directory contains [[Glossary#FASTA|fasta]] indexes and BLAST database files created from the input EST, protein, and repeat databases.
*The <tt>dpp_contig_master_datastore_index.log</tt> contains information on both the run status of individual contigs and information on where individual contig data is stored.
+
*The <tt>2434_master_datastore_index.log</tt> contains information on both the run status of individual contigs and information on where individual contig data is stored.
*The <tt>dpp_contig_datastore</tt> directory contains a set of subfolders, each containing the final MAKER output for individual contigs from the genomic fasta file.
+
*The <tt>2434_datastore</tt> directory contains a set of subfolders, each containing the final MAKER output for individual contigs from the genomic fasta file.
  
  
Once a MAKER run is finished the most important file to look at is the <tt>dpp_contig_master_datastore_index.log</tt> to see if there were any failures.
+
Once a MAKER run is finished the most important file to look at is the <tt>2434_master_datastore_index.log</tt> to see if there were any failures.
  less dpp_contig_master_datastore_index.log
+
  less 2434_master_datastore_index.log
MWAS provides a summary of this file when you click on results to download a job.  MWAS also displays run errors under the log button on the main MWAS queue page.
  
  
If everything proceeded correctly you should see the following.
+
If everything proceeded correctly you should see the following in your 2434_master_datastore_index.log file.
  contig-dpp-500-500      dpp_contig_datastore/contig-dpp-500-500 STARTED
+
  contig-dpp-500-500      2434_datastore/contig-dpp-500-500 STARTED
  contig-dpp-500-500      dpp_contig_datastore/contig-dpp-500-500 FINISHED
+
  contig-dpp-500-500      2434_datastore/contig-dpp-500-500 FINISHED
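For a genome with many contigs, reading the index log by eye gets tedious.  The sketch below lists every contig whose most recent entry is anything other than FINISHED.  It assumes the layout shown in the listing above (contig name first, status last, separated by whitespace); <tt>unfinished_contigs</tt> and the second contig name in the demo are made up for illustration.

```shell
#!/bin/sh
# unfinished_contigs INDEX_LOG
# Print "contig<TAB>status" for each contig whose most recent status in the
# index log is not FINISHED.  Assumes whitespace-separated fields with the
# contig name first and the status last.
# (Hypothetical helper written for this tutorial, not part of MAKER.)
unfinished_contigs() {
    awk '
        NF >= 3 { last[$1] = $NF }          # remember the latest status per contig
        END {
            for (id in last)
                if (last[id] != "FINISHED")
                    print id "\t" last[id]
        }
    ' "$1" | sort
}

# Demo on a throwaway log; "contig-example-2" is an invented contig name.
demo_log=$(mktemp)
printf '%s\t%s\t%s\n' \
    contig-dpp-500-500 2434_datastore/contig-dpp-500-500 STARTED \
    contig-dpp-500-500 2434_datastore/contig-dpp-500-500 FINISHED \
    contig-example-2   2434_datastore/contig-example-2   STARTED \
    > "$demo_log"
unfinished_contigs "$demo_log"
rm -f "$demo_log"
```

An empty result means every contig in the log reached FINISHED.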
  
  
Line 466: Line 294:
  
  
The entries in the <tt>dpp_contig_master_datastore_index.log</tt> file also indicate that the output files for this contig are stored in the directory <tt>dpp_contig_datastore/contig-dpp-500-500/</tt>.  Knowing where the output is stored may seem trivial; however, input genome fasta files can contain thousands, even hundreds of thousands, of contigs, and many file systems perform poorly when a single directory holds a large number of sub-directories and files.  Even when the underlying file system handles this gracefully, access over a network file system can be a problem.  To deal with this, MAKER uses a datastore module to create a hierarchy of sub-directory layers, starting from a 'base', and maps identifiers to the corresponding sub-directories.  When the input genome fasta file contains more than 1,000 contigs, the datastore structure is used automatically, and the <tt>master_datastore_index.log</tt> file becomes essential for identifying where the output for a given contig is stored.
+
The entries in the <tt>2434_master_datastore_index.log</tt> file also indicate that the output files for this contig are stored in the directory <tt>2434_datastore/contig-dpp-500-500/</tt>.  Knowing where the output is stored may seem trivial; however, input genome fasta files can contain thousands, even hundreds of thousands, of contigs, and many file systems perform poorly when a single directory holds a large number of sub-directories and files.  Even when the underlying file system handles this gracefully, access over a network file system can be a problem.  To deal with this, MAKER uses a datastore module to create a hierarchy of sub-directory layers, starting from a 'base', and maps identifiers to the corresponding sub-directories.  When the input genome fasta file contains more than 1,000 contigs, the datastore structure is used automatically, and the <tt>master_datastore_index.log</tt> file becomes essential for identifying where the output for a given contig is stored.
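Because the index log maps each contig to its directory, a lookup can be scripted instead of searching the datastore by hand.  This is a minimal sketch that assumes the contig name is the first field and the directory the second, as in the example entries; <tt>contig_dir</tt> is a hypothetical helper name.

```shell
#!/bin/sh
# contig_dir INDEX_LOG CONTIG_ID
# Print the datastore directory recorded for CONTIG_ID (first match only).
# Assumes the contig name is field 1 and its directory is field 2.
# (Hypothetical helper written for this tutorial, not part of MAKER.)
contig_dir() {
    awk -v id="$2" '$1 == id { print $2; exit }' "$1"
}

# Demo on a throwaway index log that copies the layout of the example entries.
demo_log=$(mktemp)
printf '%s\t%s\t%s\n' \
    contig-dpp-500-500 2434_datastore/contig-dpp-500-500 STARTED \
    contig-dpp-500-500 2434_datastore/contig-dpp-500-500 FINISHED \
    > "$demo_log"
contig_dir "$demo_log" contig-dpp-500-500
rm -f "$demo_log"
```

The function prints nothing when the contig does not appear in the log.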
  
  
 
Now let's take a look at what MAKER produced for the contig 'contig-dpp-500-500'.

Now let's take a look at what MAKER produced for the contig 'contig-dpp-500-500'.
  cd dpp_contig_datastore/contig-dpp-500-500
+
  cd 2434_datastore/contig-dpp-500-500
 
  ls -l
 
  ls -l
  
Line 481: Line 309:
 
  -rw-r--r-- 1 gmod gmod  4837 2009-07-12 23:27 contig-dpp-500-500.maker.snap_masked.transcripts.fasta
 
  -rw-r--r-- 1 gmod gmod  4837 2009-07-12 23:27 contig-dpp-500-500.maker.snap_masked.transcripts.fasta
 
  -rw-r--r-- 1 gmod gmod  4430 2009-07-12 23:27 contig-dpp-500-500.maker.transcripts.fasta
 
  -rw-r--r-- 1 gmod gmod  4430 2009-07-12 23:27 contig-dpp-500-500.maker.transcripts.fasta
drwxr-xr-x 2 gmod gmod  4096 2009-07-12 23:27 theVoid.contig-dpp-500-500
 
  
  
Line 488: Line 315:
 
*The <tt>contig-dpp-500-500.maker.snap_masked.transcripts.fasta</tt> and <tt>contig-dpp-500-500.maker.snap_masked.proteins.fasta</tt> files contain the transcript and protein sequences for all SNAP ''ab initio'' gene predictions.  If you use other ''ab initio'' gene predictors, those sequence files will follow a similar naming pattern.
 
*The <tt>contig-dpp-500-500.maker.snap_masked.transcripts.fasta</tt> and <tt>contig-dpp-500-500.maker.snap_masked.proteins.fasta</tt> files contain the transcript and protein sequences for all SNAP ''ab initio'' gene predictions.  If you use other ''ab initio'' gene predictors, those sequence files will follow a similar naming pattern.
 
*The <tt>contig-dpp-500-500.maker.non_overlapping_ab_initio.transcripts.fasta</tt> and <tt>contig-dpp-500-500.maker.non_overlapping_ab_initio.proteins.fasta</tt> files contain the set of best ''ab initio'' gene predictions that do not overlap a MAKER gene annotation.  These files can be analyzed to see if there is any reason to promote them to the status of gene annotations.  For example: you can run iprscan to see if they contain known protein domains.
 
*The <tt>contig-dpp-500-500.maker.non_overlapping_ab_initio.transcripts.fasta</tt> and <tt>contig-dpp-500-500.maker.non_overlapping_ab_initio.proteins.fasta</tt> files contain the set of best ''ab initio'' gene predictions that do not overlap a MAKER gene annotation.  These files can be analyzed to see if there is any reason to promote them to the status of gene annotations.  For example: you can run iprscan to see if they contain known protein domains.
*The directory <tt>theVoid.contig-dpp-500-500</tt> contains raw output files from all the programs MAKER wraps around (BLAST, SNAP, RepeatMasker, etc.).  You can usually ignore this directory and its contents.
 
  
  
Line 495: Line 321:
  
  
For sanity checking purposes it would be nice to have a graphical view of what's in the GFF3 file.  To do this GFF3 files can be loaded into programs like [[Apollo]] and [[GBrowse]].
+
For sanity checking purposes it would be nice to have a graphical view of what's in the GFF3 file.  To do this GFF3 files can be loaded into programs like [[Apollo]] and [[GBrowse]]. MWAS allows you to view the files in Apollo directly on the website.  You also get summary statistics from SOBA.
  
  
 
===Apollo===
 
===Apollo===
Let's load the <tt>contig-dpp-500-500.gff</tt> into [[Apollo]] and take a look at what MAKER produced. Copy the <tt>contig-dpp-500-500.gff</tt> file to your home directory to make it easy to locate, and then let's start Apollo.
 
cp contig-dpp-500-500.gff ~
 
 
 
 
Select the file in Apollo, and open it.  You will notice that there are a number of bars representing the gene annotations and the evidence alignments supporting those annotations.  Annotations are in the middle light colored panel, and evidence alignments are in the dark panels at the top and bottom.
 
Select the file in Apollo, and open it.  You will notice that there are a number of bars representing the gene annotations and the evidence alignments supporting those annotations.  Annotations are in the middle light colored panel, and evidence alignments are in the dark panels at the top and bottom.
  
Line 508: Line 330:
 
All the evidence in the dark panels is in the same color, which makes it difficult to identify the source of each piece of evidence without manually clicking on it.  MAKER comes with a configuration file for Apollo which gives a more colorful view of MAKER-produced annotations and evidence.  Let's close Apollo, copy this configuration file and then reload the annotations.

All the evidence in the dark panels is in the same color, which makes it difficult to identify the source of each piece of evidence without manually clicking on it.  MAKER comes with a configuration file for Apollo which gives a more colorful view of MAKER-produced annotations and evidence.  Let's close Apollo, copy this configuration file and then reload the annotations.
  
 
+
You can now see the annotations and evidence in nice color.  Click on each piece of evidence and you will see its source in the table at the bottom of the Apollo screen.
The configuration file should be placed in the <tt>~/.apollo</tt> directory.  Create this directory if it does not exist.
+
cd ~
+
mkdir .apollo
+
 
+
 
+
Now copy the configuration file to that directory.
+
cp /usr/local/maker/Apollo/gff3.tiers ~/.apollo/
+
 
+
 
+
Open the <tt>contig-dpp-500-500.gff</tt> file again in Apollo.  You can now see the annotations and evidence in nice color.  Click on each piece of evidence and you will see its source in the table at the bottom of the Apollo screen.
+
  
 
Possible Sources Include:
 
Possible Sources Include:
Line 533: Line 345:
 
*Blastx:Repeatmask - RepeatRunner identified repeat from the repeat protein database
 
*Blastx:Repeatmask - RepeatRunner identified repeat from the repeat protein database
  
 
===GBrowse===
 
Previous versions of [[GBrowse]] required explicit UTR features in the [[GFF3]] file.  This may or may not still be the case.  If you need these features, there is a MAKER accessory script you can use.
 
add_utr_gff.pl <gff3_directory>
 
 
 
The directory can contain multiple GFF3 files.
 
 
 
''(See [[GBrowse]] documentation to set up GBrowse)''
 
 
 
 
=Advanced MAKER Configuration, Re-annotation Options, and Improving Annotation Quality =
 
The remainder of this page mainly presents issues that can be encountered during the annotation process.  I then describe how MAKER can be used to resolve each issue.
 
 
<span style="font-size: 80%">[[Media:2009SumSchMAKER.pdf|See accompanying MAKER presentation (~16 MB)]].</span>
 
 
==Configuration Files in Detail==
 
Let's take a closer look at the configuration options in the <tt>maker_opts.ctl</tt> file.
 
  
  
Line 614: Line 406:
 
MAKER gives the user the option to produce gene annotations directly from the EST evidence.  You can then use these imperfect gene models to train a gene predictor program.  Once you have re-run MAKER with the newly trained gene predictor, you can use the second set of gene annotations

MAKER gives the user the option to produce gene annotations directly from the EST evidence.  You can then use these imperfect gene models to train a gene predictor program.  Once you have re-run MAKER with the newly trained gene predictor, you can use the second set of gene annotations
 
to train the gene predictors yet again.  This boot-strap process allows you to iteratively improve the performance of ''ab initio'' gene predictors.
 
to train the gene predictors yet again.  This boot-strap process allows you to iteratively improve the performance of ''ab initio'' gene predictors.
 
 
I've created an example file set so you can learn to train the gene predictor SNAP using this procedure.
 
 
 
First let's copy the data and set up a working directory.
 
cd ~
 
tar -zxf ~/software/maker/train.tar.gz
 
cd train
 
ls -al
 
 
 
You should see four files in the directory
 
genome.fasta
 
est.fasta
 
protein.fasta
 
repeat_protein.fasta
 
 
 
We need to build maker configuration files and populate the appropriate values.
 
maker -CTL
 
emacs maker_opts.ctl
 
{{TextEditorLink|emacs}}
 
 
 
Edit the following:
 
genome:genome.fasta
 
est:est.fasta
 
protein:protein.fasta
 
repeat_protein:repeat_protein.fasta
 
predictor:est2genome
 
 
 
MAKER is now configured to generate annotations from the EST data, so start the program.
 
maker
 
 
 
Now load the file genome.maker.output/genome_datastore/scf1117875581239.gff into Apollo.  You will see that there are far more regions with evidence alignments than there are gene annotations.  This is because there are so few spliced ESTs that are capable of generating gene models.
 
 
 
Now exit Apollo. We need to convert the GFF3 gene models to ZFF format, which is the format SNAP requires for training.  To do this we need to collect all GFF3 files into a single directory.
 
mkdir gff
 
find genome.maker.output -name "scf*.gff" -exec cp {} gff \;
 
cd gff
 
maker2zff.pl . Pult
 
ls -l
 
 
 
There should now be two new files. The first is the ZFF format file, and the second is a fasta file that the ZFF coordinates refer to. These will be used to train SNAP.
 
Pult.ann
 
Pult.dna
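If you want to try the GFF3 collection step on its own before touching real output, the sketch below rebuilds it on a throwaway directory tree.  It relies only on the standard <tt>-name</tt> and <tt>-exec</tt> options of <tt>find</tt>; the helper name <tt>collect_gff</tt> and the demo file names are made up.  Note that files with the same name in different sub-directories would overwrite each other in the destination.

```shell
#!/bin/sh
# collect_gff SRC_DIR DEST_DIR PATTERN
# Copy every regular file under SRC_DIR whose name matches PATTERN into
# DEST_DIR, creating DEST_DIR if needed.  DEST_DIR should not be inside
# SRC_DIR, otherwise find may visit the copies as well.
# (Hypothetical helper written for this tutorial, not part of MAKER.)
collect_gff() {
    mkdir -p "$2" &&
    find "$1" -type f -name "$3" -exec cp {} "$2" \;
}

# Demo on a throwaway tree with empty placeholder files.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/out/datastore/scf1" "$demo_dir/out/datastore/scf2"
: > "$demo_dir/out/datastore/scf1/scf1.gff"
: > "$demo_dir/out/datastore/scf2/scf2.gff"
: > "$demo_dir/out/datastore/scf2/scf2.fasta"
# The pattern is quoted so the shell passes it to find unexpanded.
collect_gff "$demo_dir/out" "$demo_dir/gff" 'scf*.gff'
ls "$demo_dir/gff"
rm -rf "$demo_dir"
```

Quoting the pattern matters: without the quotes the shell may expand <tt>scf*.gff</tt> against the current directory before <tt>find</tt> ever sees it.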
 
 
 
Training SNAP.
 
fathom -categorize 1000 Pult.ann Pult.dna
 
fathom -export 1000 -plus uni.ann uni.dna
 
forge export.ann export.dna
 
hmm-assembler.pl Pult . > ../Pult.hmm
 
cd ..
 
 
 
The final training parameters file is Pult.hmm.  We do not expect SNAP to perform that well with this training file; however, it is a good starting point for further training.
 
 
 
We need to run MAKER again.
 
emacs maker_opts.ctl
 
{{TextEditorLink|emacs}}
 
 
predictor:snap,est2genome
 
snaphmm:Pult.hmm
 
 
maker
 
 
 
Now let's look at the output once again in Apollo.
 
 
 
Close Apollo, retrain SNAP, and run MAKER again.
 
rm gff/*
 
find genome.maker.output -name "scf*.gff" -exec cp {} gff \;
 
cd gff
 
maker2zff.pl . Pult
 
fathom -categorize 1000 Pult.ann Pult.dna
 
fathom -export 1000 -plus uni.ann uni.dna
 
forge export.ann export.dna
 
hmm-assembler.pl Pult . > ../Pult2.hmm
 
cd ..
 
emacs maker_opts.ctl
 
 
 
Change configuration file.
 
snaphmm:Pult2.hmm
 
 
 
Run maker.
 
maker
 
 
 
Let's examine the GFF3 file one last time in Apollo.  As you can see, there is a marked improvement in the gene models.
 
  
 
==GFF3 Pass-through==
 
==GFF3 Pass-through==
What if I'm not working on a new genome project, but rather I have an existing annotation set, and I just want to update my genome database to reflect new protein and EST evidence.  Here you can use a feature in MAKER called GFF3 pass-through, which allows you to pass existing annotations into the program and combine them with updated EST and protein alignments.
+
What if I'm not working on a new genome project, but rather I have an existing annotation set, and I just want to update my genome database to reflect new protein and EST evidence.  Here you can use a feature in MAKER called GFF3 pass-through, which allows you to pass existing annotations into the program and combine them w
 
+
 
+
Let's begin by copying the GFF-passthrough example data and preparing MAKER.
+
cd ~
+
tar -zxf ~/software/maker/pass.tar.gz
+
cd pass
+
ls -al
+
 
+
 
+
You will see a number of files.  Not all of them are important (for now).
+
genome.fasta
+
est.fasta
+
protein.fasta
+
repeat_protein.fasta
+
model.gff
+
est.gff
+
pred.gff
+
Pult.hmm
+
 
+
 
+
We now need to generate MAKER configuration files and edit them.
+
maker -CTL
+
emacs maker_opts.ctl
+
 
+
genome:genome.fasta
+
est:est.fasta
+
protein:protein.fasta
+
repeat_protein:repeat_protein.fasta
+
model_gff:model.gff
+
predictor:model_gff
+
{{TextEditorLink|emacs}}
+
 
+
Now run MAKER.
+
maker
+
 
+
 
+
Load the output GFF3 file into Apollo.  You will see that the annotations and the updated evidence have all been bundled together.  The results can now be loaded into the genome database for distribution.
+
 
+
 
+
What if I also want to modify existing annotations to take into account the updated evidence?  Can that be done? Yes.  We just need to modify the configuration parameters.
+
emacs maker_opts.ctl
+
 
+
 
+
Now change these values.
+
predictor:model_gff,snap
+
snaphmm:Pult.hmm
+
{{TextEditorLink|emacs}}
+
 
+
 
+
MAKER is now configured to produce SNAP gene models that will compete against the existing passed through GFF3 models.
+
 
+
 
+
Start MAKER.
+
maker
+
 
+
 
+
Load the resulting GFF3 output file into Apollo and you will see that new annotations replace old annotations where the evidence was sufficient to suggest a different model.  Note that if you want to maintain old gene names when models are replaced, set map_forward:1 in the <tt>maker_opts.ctl</tt> file.  You can then run MAKER again and view the results in Apollo.  You will see that the gene models are the same as in the previous example, but the legacy names have been pulled forward into the updated models.
+
 
+
 
+
You've seen how GFF3 pass-through lets you use existing gene models, but if I can pass through existing gene models, wouldn't it be nice to have the ability to pass through any type of data?
+
 
+
 
+
MAKER also allows you to pass through existing EST, protein, repeat, and prediction data in GFF3 format.  Even though the data may have originated from other programs, MAKER treats it as if it originated from within the pipeline.  MAKER even has an other_gff option, so you can pass through features that don't necessarily fit into categories that MAKER can use.  These get passed straight through into the output file, so it's an easy way to keep user-defined features.
+
 
+
 
+
With the GFF3 pass-through option, you can now imagine including gene predictions from programs like TwinScan or EST alignments from programs like BLAT, both of which are unsupported by MAKER.  Let's do that.
+
emacs maker_opts.ctl
+
 
+
 
+
Change the configuration options.
+
predictor:model_gff,snap,pred_gff
+
est_gff:est.gff
+
pred_gff:pred.gff
+
 
+
 
+
Run maker.
+
maker
+
{{TextEditorLink|emacs}}
+
 
+
 
+
Now examine the output in Apollo.  You will see new evidence features from TwinScan and BLAT. There are even a few annotations that now derive from the TwinScan predictions.
+
  
 
==mRNAseq==
 
==mRNAseq==
Line 825: Line 437:
 
*Automatically revise it in light of new data
 
*Automatically revise it in light of new data
 
*If no existing annotation, create new one
 
*If no existing annotation, create new one
 
 
Let's look at an example: ~/software/maker/legacy.tar.gz
 
cd ~
 
tar -zxvf ~/software/maker/legacy.tar.gz
 
cd legacy
 
ls -l
 
 
genome.fasta
 
est.fasta
 
protein.fasta
 
repeat_protein.fasta
 
legacy1.gff
 
legacy2.gff
 
Pult.hmm
 
 
 
You need to merge the legacy GFF3 files since maker only accepts one input model_gff file.  In future versions of MAKER you will be able to use a comma separated list.
 
gff3_merge legacy1.gff legacy2.gff -o legacy.gff
 
 
 
Edit configuration files.
 
maker -CTL
 
emacs maker_opts.ctl
 
 
 
Change the following configuration values. We are going to use the legacy annotations in conjunction with SNAP.  SNAP can then create and update annotations whenever the evidence permits.
 
genome:genome.fasta
 
est:est.fasta
 
protein:protein.fasta
 
repeat_protein:repeat_protein.fasta
 
model_gff:legacy.gff
 
predictor:model_gff,snap
 
snaphmm:Pult.hmm
 
 
 
Copy the Pult.hmm file to your current working directory from the previous GFF3 pass-through example.  We need this file for SNAP.
 
cp ../pass/Pult.hmm .
 
 
 
Now run MAKER.
 
maker
 
 
==MAKER Accessory Scripts==
 
MAKER comes with a number of accessory scripts that are meant to assist in manipulations of the MAKER input and output files.
 
 
 
Scripts:
 
*''add_utr_gff.pl'' - Adds explicit 5' and 3' UTR features to the GFF3 output file
 
add_utr_gff.pl <gff3_directory>
 
 
 
*''add_utr_start_stop_gff'' - Adds explicit 5' and 3' UTR as well as start and stop codon features to the GFF3 output file
 
add_utr_start_stop_gff <gff3_file>
 
 
 
*''fasta_merge'' - Collects all of MAKER's fasta file output for each contig and merges them to make genome level fastas
 
fasta_merge -d <datastore_index> -o <outfile>
 
 
 
*''gff3_merge'' - Collects all of MAKER's GFF3 file output for each contig and merges them to make a single genome level GFF3
 
gff3_merge -d <datastore_index> -o <outfile>
 
 
 
*''gff3_2_gtf'' - Converts MAKER GFF3 files to GTF format (run add_utr_start_stop_gff first to get UTR features)
 
gff3_2_gtf <gff3_file>
 
 
 
*''gff3_preds2models'' - Converts the gene prediction match/match_part format to annotation gene/mRNA/exon/CDS format
 
gff3_preds2models <gff3 file> <pred list>
 
 
 
*''iprscan2gff3'' - Takes InterProScan (iprscan) output and generates GFF3 features representing domains. Interesting tier for GBrowse.
 
iprscan2gff3 <iprscan_file> <gff3_fasta>
 
 
 
*''iprscan_batch'' - Wrapper for iprscan to take advantage of multiprocessor systems.
 
iprscan_batch <file_name> <cpus> <log_file>
 
 
 
*''ipr_update_gff'' - Takes InterProScan (iprscan) output and maps domain IDs and GO terms to the Dbxref and Ontology_term attributes in the GFF3 file.
 
ipr_update_gff <gff3_file> <iprscan_file>
 
 
 
*''maker2zff.pl'' - Pulls out MAKER gene models from the MAKER GFF3 output and converts them into ZFF format for SNAP training.
 
maker2zff.pl <gff3_file>
 
 
 
*''maker_functional_fasta'' - Maps putative functions identified from BLASTP against UniProt/SwissProt to the MAKER produced transcript and protein fasta files.
 
maker_functional_fasta <uniprot_fasta> <blast_output> <fasta1> <fasta2> <fasta3> ...
 
 
 
*''maker_functional_gff'' - Maps putative functions identified from BLASTP against UniProt/SwissProt to the MAKER produced GFF3 files in the Note attribute.
 
maker_functional_gff <uniprot_fasta> <blast_output> <gff3_1>
 
 
 
*''maker_map_ids'' - Build shorter IDs/Names for MAKER genes and transcripts following the NCBI suggested naming format.
 
maker_map_ids --prefix PYU1_ --justify 6 genome.all.gff > genome.all.id.map
 
 
 
*''map_fasta_ids'' - Maps short IDs/Names to MAKER fasta files.
 
map_fasta_ids <map_file> <fasta_file>
 
 
 
*''map_gff_ids'' - Maps short IDs/Names to MAKER GFF3 files; old IDs/Names are mapped to the Alias attribute.
 
map_gff_ids <map_file> <gff3_file>
 
 
 
*''split_fasta'' - Splits multi-fasta files into the number of files specified by the user.  Useful for breaking up MAKER jobs.
 
split_fasta [count] <input_fasta>
 
 
==MPI Support==
 
MAKER optionally supports Message Passing Interface (MPI), a parallel computation communication protocol primarily used on computer clusters.  This allows for MAKER jobs to be broken up across multiple nodes/processors for increased performance and scalability.
 
 
 
[[Image:Mpi_maker.png]]
 
 
<div class="emphasisbox">
 
The steps below should get MPI to work on your machine.  However, we did not actually run them during the [[Americas]] course, so MPI does not work on the VMware images produced by that course.</div>
 
 
To use this feature, you must have MPICH2 installed with the --enable-sharedlibs flag set during installation (See MPICH2 Installer's Guide).  I have installed this for you.  So let's set up MPI_MAKER and run the example file that comes with MAKER.  For some reason we cannot install via sudo because it destroys the PATH environmental variable that tells where MPICH2 executables are installed, so instead we need to install explicitly as the root user.
 
sudo su
 
source /home/gmod/.bashrc
 
source /home/gmod/.profile
 
cd /usr/local/maker/MPI/
 
perl Install.pl
 
 
 
Now press control and d together (^d) to exit the root user.
 
 
 
You should now see the executable mpi_maker listed among the other MAKER scripts.  Let's run some example data to see if MPI_MAKER is working properly.
 
cd ~
 
mkdir ~/maker_run2
 
cd maker_run2
 
cp /usr/local/data/dpp* ~/maker_run2
 
maker -CTL
 
emacs maker_opts.ctl
 
 
 
Set values in maker configuration files.
 
genome:dpp_contig.fasta
 
est:dpp_transcripts.fasta
 
protein:dpp_proteins.fasta
 
predictor:snap
 
snaphmm:fly
 
 
 
We need to set up a few more things for MPI to work.  Type mpd to see a list of instructions.
 
mpd
 
 
 
You should see the following.
 
configuration file /home/gmod/mpd.conf not found
 
A file named .mpd.conf file must be present in the user's home
 
directory (/etc/mpd.conf if root) with read and write access
 
only for the user, and must contain at least a line with:
 
MPD_SECRETWORD=<secretword>
 
One way to safely create this file is to do the following:
 
  cd $HOME
 
  touch .mpd.conf
 
  chmod 600 .mpd.conf
 
and then use an editor to insert a line like
 
  MPD_SECRETWORD=mr45-j9z
 
into the file.  (Of course use some other secret word than mr45-j9z.)
 
 
 
Follow the instructions to set this file up, and start the mpi environment with mpdboot.  Then run mpi_maker through the MPI manager mpiexec.
 
mpdboot
 
mpiexec -n 2 mpi_maker
 
 
 
mpiexec is a wrapper that handles the MPI environment.  The -n 2 flag tells mpiexec to use 2 cpus/nodes when running mpi_maker.  For a large cluster, this could be set to something like 100.  You should now know how to start a MAKER job via MPI.
 
 
==MAKER Web-Service==
 
If you don't want to install MAKER, there is also a MAKER Web-Service that makes the annotation process even easier.  So now you can annotate a genome from your iPhone (there's an app for that :-) ...
 
 
 
There are still quite a few bugs, but you can experiment and give me feedback if you want.
 
 
 
[[Image:MAKERWeb.jpg]]
 

Revision as of 23:19, 31 December 2009

{{#icon: MAKERLogo.png|MAKER|200|MAKER}}


MAKER Web Annotation Service Session

__NOTITLE__


This tutorial walks you through running the MAKER Web Annotation Service.


Maker Overview

The first half of this page describes the basics of MAKER - the easy-to-use genome annotation pipeline.


Introduction to Genome Annotation

What Are Annotations?

Annotations are descriptions of different features of the genome, and they can be either structural or functional in nature.

Examples:

  • Structural Annotations: exons, introns, UTRs, splice forms etc.
  • Functional Annotations: process a gene is involved in (metabolism), molecular function (hydrolase), location of expression (expressed in the mitochondria), etc.


It is especially important that all genome annotations include with themselves an evidence trail that describes in detail the evidence that was used to both suggest and support each annotation. This assists in quality control and downstream management of genome annotations.

Examples of evidence supporting a structural annotation:

  • Ab initio gene predictions
  • ESTs
  • Protein homology

Importance of Genome Annotations

Why should the average biologist care about genome annotations? Genome sequence itself is not very useful. The main question when any genome is sequenced is, "where are the genes?" To identify the genes we need to annotate the genome. And while most researchers probably don't give annotations a lot of thought, they use them every day.


Examples of Annotation Databases:


Every time we use techniques such as RNAi, PCR, gene expression arrays, targeted gene knockout, or ChIP we are basing our experiments on the information derived from a digitally stored genome annotation. If the annotation is correct, then these experiments should succeed; however, if an annotation is incorrect these experiments are bound to fail. Which brings up a major point:

  • Incorrect and incomplete genome annotations poison every experiment that uses them.

Quality control and evidence management are therefore essential components to any annotation process.

Effect of Next Generation Sequencing on the Annotation Process

It’s generally accepted that within the next few years it will be possible to sequence even human sized genomes for as little as $1,000 and in a short time frame. Pacific Biosciences is claiming they will be able to sequence a human sized genome in fifteen minutes by 2013. If the hype is to be believed, then whole genome sequencing will become routine for even small labs in the not so distant future. Unfortunately, however, advances in annotation technology have not kept pace with genome sequencing, and annotation is rapidly becoming a major bottleneck affecting modern genomics research.

For example:

  • As of February 2009, 173 eukaryotic genomes were fully sequenced yet unpublished (this is an ever growing backlog).
  • Currently there are over 1,000 eukaryotic genome projects underway; assuming 10,000 genes per genome, that’s 10,000,000 new annotations (with this many new annotations, quality control and maintenance become an issue).
  • While there are organizations dedicated to producing and distributing genome annotations (e.g. ENSEMBL and VectorBase), the sheer volume of newly sequenced genomes exceeds both their capacity and stated purview.
  • Many small research groups (which often lack bioinformatics experience) must therefore confront the difficulties associated with genome annotation on their own.


MAKER is an easy-to-use annotation pipeline designed to help smaller research groups convert the mountain of genomic data provided by next generation sequencing technologies into a usable resource.


What does MAKER do?

  • Identifies and masks out repeat elements
  • Aligns ESTs to the genome
  • Aligns proteins to the genome
  • Produces ab initio gene predictions
  • Synthesizes these data into final annotations
  • Produces evidence-based quality values for downstream annotation management


File:Apollo view.jpg
MAKER generated annotations, shown in Apollo.


What sets MAKER apart from other tools (ab initio gene predictors etc.)?

MAKER is an annotation pipeline, not a gene predictor. MAKER does not predict genes; rather, MAKER leverages existing software tools (some of which are gene predictors) and integrates their output to produce what MAKER believes to be the best possible gene model for a given location based on evidence alignments.


gene prediction ≠ gene annotation

  • gene predictions are gene models.
  • gene annotations are gene models but should include a documented evidence trail supporting the model in addition to quality control metrics.


This may seem like just a matter of semantics since the primary output for both ab initio gene predictors and the MAKER pipeline is the same, a collection of gene models. However there are a few very significant consequences to the differences between these programs that I will explain shortly.


Emerging vs. Model Genomes

Emerging model organism genomes each come with their own set of issues that are not necessarily found in classic model genomes. These include difficulties associated with repeat identification, gene finder training, and other complex analyses. Unfortunately emerging model organisms are often studied by very small research communities which often lack the resources and bioinformatics experience necessary to tackle these issues.

Classic Model Organisms | Emerging Model Organisms
Well developed experimental systems | New experimental systems (genome will be the central resource for work in these systems)
Much prior knowledge about genome | Little prior knowledge about genome (usually no genetics)
Large community | Small communities
Big $ | Less $
Examples: D. melanogaster, C. elegans, human, etc. | Examples: oomycetes, flat worms, cone snail, etc.

Comparison of Algorithm Performance on Model vs. Emerging Genomes

If you have ever looked at comparisons of gene predictor performance on classic model organisms such as C. elegans you would conclude that ab initio gene predictors match or even outperform state of the art annotation pipelines, and the truth is that, with enough training data, they do. However, it is important to keep in mind that ab initio gene predictors have been specifically optimized to perform well on model organisms such as Drosophila and C. elegans, organisms for which we have large amounts of pre-existing data to both train and tweak the prediction parameters.


Table: MAKER's Performance on the C. elegans genome

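Performance Category | SNAP (ab initio) | Augustus (ab initio) | MAKER (evidence based) | Gramene (evidence based)
Genomic Overlap (gene), SP | 82.48 | 88.09 | 91.69 | 93.49
Genomic Overlap (gene), SN | 95.44 | 96.78 | 89.81 | 88.74
Exon Overlap, SP | 18.88 | 22.87 | 25.58 | 27.38
Exon Overlap, SN | 87.63 | 93.09 | 91.17 | 94.84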

What about emerging model organisms for which little data is available? Gene prediction in classic model organisms is relatively simple because there are already a large number of experimentally determined and verified gene models, but with emerging model organisms, we are lucky to have a handful of gene models to train with. As a result ab initio gene predictors generally perform very poorly on emerging genomes.

Figure: MAKER's Performance on the S. mediterranea Emerging Model Organism Genome. Pfam domain content of gene models determined using rpsblast


By using ab initio gene predictors inside of the MAKER pipeline instead of as stand-alone applications you get certain benefits:

  • Provide gene models as well as an evidence trail for quality control and manual curation
  • Provide a mechanism to train and retrain ab initio gene predictors for even better performance.
  • Output can be easily loaded into a GMOD compatible database for annotation distribution (including evidence associations).
  • Annotations can be automatically updated with new evidence by simply passing existing annotation sets back into the pipeline


Getting Started with MWAS

Registration

MWAS is free to all users and has no login requirement, but registration is recommended as it allows for easier file and job management and registered users can upload more sequence.

RUNNING MWAS WITH EXAMPLE DATA

MWAS comes with some example files to familiarize the user with how to run MAKER. You can pre-load the fields for a new job by selecting one of the examples from the drop down menu on the "New Job" page.


Next we need to tell MAKER all the details about how we want the annotation process to proceed. Because there can be many variables and options involved in annotation you will need to review each option carefully. At the very least you should provide a genome sequence file, an EST sequence file, and a protein homology sequence file.

Details of What is Going on Inside of MAKER

Repeat Masking

The first step to MAKER is repeat masking, but why do we need to do this? Repetitive elements can make up a significant portion of the genome. Some of these repeats are simple/low-complexity repeats where you have runs of C's or G's or maybe even something like AAGGAAGGAAGG. Other repeats are more complex, i.e. transposable elements. These high-complexity repeats often encode real proteins like reverse transcriptase or even Gag, Pol, and Env viral proteins. Because they encode real proteins, they can play havoc with ab initio gene predictors. For example, a transposable element that occurs next to or even within the intron of a real protein-encoding gene might cause a gene predictor to include extra exons as part of a gene model, sequence which really only belongs to the transposable element and not to the coding sequence of the gene. You will also get hundreds of instances where identical transposable element proteins get annotated as being part of an organism's proteome. In addition to these issues, low-complexity repeat regions can align with high statistical significance to low-complexity protein regions, creating a false sense of homology throughout the genome. To avoid these complications it is convenient to identify and mask any repeat elements before doing other analyses.


MAKER identifies repeats in two steps.

  • First a program called RepeatMasker is used to identify low-complexity and high-complexity repeats that match entries in the RepBase repeat library, or any species specific repeat library supplied by the user.
  • Next MAKER uses RepeatRunner to identify transposable element and viral proteins from the RepeatRunner protein database. Because protein sequence diverges at a slower rate than nucleotide sequence, this step helps pick up the most problematic regions of divergent repeats that are missed by RepeatMasker, which searches in nucleotide space.


Regions identified during repeat analysis are masked out so as not to complicate other downstream annotation analyses.

  • High-complexity repeats are hard-masked, a technique in which nucleotide sequence is replaced with the letter N to prohibit any alignments to that region.
  • Low-complexity regions are soft-masked, a technique in which nucleotides are made lower case so they can be treated as masked under certain situations without losing sequence information. I will discuss some of the applications and effects of soft-masking later.
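
The difference between the two masking styles can be sketched on a toy sequence. This is only an illustration of the idea, not how RepeatMasker or MAKER implement masking; mask_region is a hypothetical helper, and the sequence and coordinates (1-based, inclusive) are made up.

```shell
# Toy sketch of hard- vs. soft-masking one interval of a sequence.
# mask_region is a hypothetical helper and is not part of MAKER.
mask_region() {
    # usage: mask_region <sequence> <start> <end> <hard|soft>
    awk -v seq="$1" -v s="$2" -v e="$3" -v mode="$4" 'BEGIN {
        left  = substr(seq, 1, s - 1)
        mid   = substr(seq, s, e - s + 1)
        right = substr(seq, e + 1)
        if (mode == "hard") gsub(/./, "N", mid)   # sequence information is lost
        else mid = tolower(mid)                    # sequence information is kept
        print left mid right
    }'
}

mask_region ACGTAAGGAAGGAAGGTTCA 5 16 hard   # prints ACGTNNNNNNNNNNNNTTCA
mask_region ACGTAAGGAAGGAAGGTTCA 5 16 soft   # prints ACGTaaggaaggaaggTTCA
```

In the soft-masked output the repeat is still readable, which is what lets an aligner skip it when seeding but extend through it later.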


Now the idea of masking out sequence might seem on the surface like we're losing a lot of information, and it is true that there can be proteins that have integrated repeats into their structure, so repeat masking will affect our ability to annotate these proteins. However, these proteins are rare and the number of gene models and homology alignments improved by this step far exceeds the few gene models that may be negatively affected. You do have the option to run ab initio gene predictors on both the masked and unmasked sequence if repeat masking worries you though. You do this by setting unmask:1 in the maker_opts.ctl configuration file.

Ab Initio Gene Prediction

Following repeat masking, MAKER runs ab initio gene predictors specified by the user to produce preliminary gene models. Ab initio gene predictors produce gene predictions based on underlying mathematical models describing patterns of intron/exon structure and consensus start signals. Gene models are not produced by directly using experimental evidence. Because the patterns of gene structure are going to differ from organism to organism, you must train gene predictors before you can use them. I will discuss how to do this later on.


MWAS currently supports:

  • SNAP
  • Augustus
  • GeneMark
  • FGENESH (Not shown on public site)


You must specify in the maker_opts.ctl file the training parameters file you want to use when running each of these algorithms.


EST and Protein Evidence Alignment

A simple way to indicate if a sequence region is likely associated with a gene is to identify (A) if the region is actively being transcribed or (B) if the region has homology to a known protein. This can be done by aligning Expressed Sequence Tags (ESTs) and proteins to the genome using alignment algorithms.

  • ESTs are sequences derived from a cDNA library. Because of the difficulties associated with working with mRNA and depending on how the cDNA library was prepared, EST databases usually represent bits and pieces of transcribed mRNAs with only a few full length transcripts. MAKER aligns these sequences to the genome using BLASTN. If ESTs from the organism being annotated are unavailable or sparse, you can use ESTs from a closely related organism. However, ESTs from closely related organisms are unlikely to align using BLASTN since nucleotide sequences can diverge quite rapidly. For these ESTs, MAKER uses TBLASTX to align them in protein space.
  • Protein sequence generally diverges quite slowly over large evolutionary distances; as a result, proteins from even evolutionarily distant organisms can be aligned against raw genomic sequence to try and identify regions of homology. MAKER does this using BLASTX.


Remember now that we are aligning against the repeat-masked genomic sequence. How is this going to affect our alignments? For one thing we won't be able to align against low-complexity regions. Some real proteins contain low-complexity regions and it would be nice to identify those, but if I let anything align to a low-complexity region, then I will get spurious alignments all over the genome. Wouldn't it be nice if there was a way to allow BLAST to extend alignments through low-complexity regions, but only if there is already an alignment somewhere else? You can do this with soft-masking. If you remember, soft-masking uses lower case letters to mask sequence without losing the sequence information. BLAST allows you to use soft-masking to keep alignments from seeding in low-complexity regions, but allows you to extend through them. This of course will allow some of the spurious alignments you were trying to avoid, but overall you still end up suppressing the majority of poor alignments while letting through enough real alignments to justify the cost. You can turn this behavior off though if it bothers you by setting softmask:0 in the maker_bopts.ctl file.


Polishing Evidence Alignments

Because of oddities associated with how BLAST statistics work, BLAST alignments are not as informative as they could be. BLAST will align regions anywhere it can, even if the algorithm aligns regions out of order, with multiple overlapping alignments in the exact same region, or with slight overhangs around splice sites.


To get more informative alignments MAKER uses the program Exonerate to polish BLAST hits. Exonerate realigns each sequence identified by BLAST around splice sites and forces the alignments to occur in order. The result is a high quality alignment that can be used to suggest near exact intron/exon positions. Polished alignments are produced using the est2genome and protein2genome options for Exonerate.


One of the benefits of polishing EST alignments is the ability to identify the strand an EST derives from. Because of amplification steps involved in building an EST library and limitations involved in some high throughput sequencing technologies, you don't necessarily know whether you're really aligning the forward or reverse transcript of an mRNA. However, if you take splice sites into account, you can only align to one strand correctly.


Integrating Evidence to Synthesize Final Annotations

Once you have ab initio predictions, EST alignments, and protein alignments you can integrate this evidence to produce even better gene predictions. MAKER does this by "talking" to the gene prediction programs. MAKER takes all the evidence, generates "hints" to where splice sites and protein coding regions are located, and then passes these "hints" to programs that will accept them.


MAKER produces hint based predictors for:

  • SNAP
  • Augustus
  • FGENESH
  • GeneMark (under development)


MAKER then takes the entire pool of ab initio and evidence informed gene predictions, updates features such as 5' and 3' UTRs based on EST evidence, tries to determine alternative splice forms where EST data permits, produces quality control metrics for each gene model (this is included in the output), and then MAKER chooses from among all the gene model possibilities the one that best matches the evidence. This is done using a modified sensitivity/specificity distance metric.


MAKER's Output

Once your job is finished and you download the data, you will see that MAKER has created an output directory called something like 2434.maker.output. The name of the output directory is based off of the job id assigned to your sequence file.


You should now see a list of directories and files created by MAKER.

drwxr-xr-x 3 gmod gmod 4096 2009-07-12 23:23 2434_datastore
-rw-r--r-- 1 gmod gmod  135 2009-07-12 23:27 2434_master_datastore_index.log
-rw-r--r-- 1 gmod gmod 1579 2009-07-12 23:23 maker_bopts.log
-rw-r--r-- 1 gmod gmod 1250 2009-07-12 23:23 maker_exe.log
-rw-r--r-- 1 gmod gmod 4016 2009-07-12 23:23 maker_opts.log
drwxr-xr-x 2 gmod gmod 4096 2009-07-12 23:23 mpi_blastdb
  • The maker_opts.log, maker_exe.log, and maker_bopts.log files are logs of the control files used for this run of MAKER.
  • The mpi_blastdb directory contains fasta indexes and BLAST database files created from the input EST, protein, and repeat databases.
  • The 2434_master_datastore_index.log contains information on both the run status of individual contigs and information on where individual contig data is stored.
  • The 2434_datastore directory contains a set of subfolders, each containing the final MAKER output for individual contigs from the genomic fasta file.


Once a MAKER run is finished the most important file to look at is the 2434_master_datastore_index.log to see if there were any failures.

less 2434_master_datastore_index.log

MWAS provides a summary of this file when you click on results to download a job. MWAS also displays run errors in the log option button that you can click on when in the MWAS main queue page.


If everything proceeded correctly you should see the following in your 2434_master_datastore_index.log file.

contig-dpp-500-500      2434_datastore/contig-dpp-500-500 STARTED
contig-dpp-500-500      2434_datastore/contig-dpp-500-500 FINISHED


There are only entries describing a single contig because there was only one contig in the example file. These lines indicate that the contig 'contig-dpp-500-500' STARTED and then FINISHED without incident. Other possible entries include:

  • DIED - indicates a failed run on this contig, MAKER will retry these
  • RETRY - indicates that MAKER is retrying a contig that failed
  • SKIPPED_SMALL - indicates the contig was too short
  • DIED_SKIPPED_PERMANENT - indicates a failed contig that MAKER will not attempt to retry
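
For a genome with many contigs it is easier to scan the index log with a small script than by eye. The sketch below assumes every log line has three whitespace-separated fields (contig ID, output directory, status), as in the example above; list_unfinished is a hypothetical helper and is not shipped with MAKER.

```shell
# Print every contig whose most recent status is not FINISHED.
# Assumes three whitespace-separated fields per line: ID, directory, status.
# list_unfinished is a hypothetical helper, not a MAKER script.
list_unfinished() {
    # usage: list_unfinished <master_datastore_index.log>
    awk '{ last[$1] = $3 }                      # later lines overwrite earlier ones
         END {
             for (id in last)
                 if (last[id] != "FINISHED") print id "\t" last[id]
         }' "$1" | sort
}
```

Running list_unfinished 2434_master_datastore_index.log on the example run should print nothing, since the only contig ends with FINISHED.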


The entries in the 2434_master_datastore_index.log file also indicate that the output files for this contig are stored in the directory 2434_datastore/contig-dpp-500-500/. Knowing where the output is stored may seem rather trivial; however, input genome fasta files can contain thousands or even hundreds of thousands of contigs, and many file-systems have performance problems with large numbers of sub-directories and files within a single directory. Even when the underlying file-systems handle things gracefully, access via network file-systems can be an issue. To deal with this situation, MAKER uses a datastore module to create a hierarchy of sub-directory layers, starting from a 'base', and mapping identifiers to corresponding sub-directories. For situations where the input genome fasta file contains more than 1,000 contigs, the datastore structure is used automatically, and the master_datastore_index.log file becomes essential for identifying where the output for a given contig is stored.


Now let's take a look at what MAKER produced for the contig 'contig-dpp-500-500'.

cd 2434_datastore/contig-dpp-500-500
ls -l

The directory should contain a number of files.

-rw-r--r-- 1 gmod gmod 47437 2009-07-12 23:27 contig-dpp-500-500.gff
-rw-r--r-- 1 gmod gmod   189 2009-07-12 23:27 contig-dpp-500-500.maker.non_overlapping_ab_initio.proteins.fasta
-rw-r--r-- 1 gmod gmod   399 2009-07-12 23:27 contig-dpp-500-500.maker.non_overlapping_ab_initio.transcripts.fasta
-rw-r--r-- 1 gmod gmod   704 2009-07-12 23:27 contig-dpp-500-500.maker.proteins.fasta
-rw-r--r-- 1 gmod gmod   901 2009-07-12 23:27 contig-dpp-500-500.maker.snap_masked.proteins.fasta
-rw-r--r-- 1 gmod gmod  4837 2009-07-12 23:27 contig-dpp-500-500.maker.snap_masked.transcripts.fasta
-rw-r--r-- 1 gmod gmod  4430 2009-07-12 23:27 contig-dpp-500-500.maker.transcripts.fasta


  • The contig-dpp-500-500.gff contains all annotations and evidence alignments in GFF3 format. This is the important file for use with Apollo or GBrowse.
  • The contig-dpp-500-500.maker.transcripts.fasta and contig-dpp-500-500.maker.proteins.fasta files contain the transcript and protein sequences for MAKER produced gene annotations.
  • The contig-dpp-500-500.maker.snap_masked.transcripts.fasta and contig-dpp-500-500.maker.snap_masked.proteins.fasta files contain the transcript and protein sequences for all SNAP ab initio gene predictions. If you use other ab initio gene predictors, those sequence files will follow a similar naming pattern.
  • The contig-dpp-500-500.maker.non_overlapping_ab_initio.transcripts.fasta and contig-dpp-500-500.maker.non_overlapping_ab_initio.proteins.fasta files contain the set of best ab initio gene predictions that do not overlap a MAKER gene annotation. These files can be analyzed to see if there is any reason to promote them to the status of gene annotations. For example: you can run iprscan to see if they contain known protein domains.


Viewing MAKER Annotations

Viewing the raw GFF3 file produced by MAKER really isn't that meaningful.


For sanity checking purposes it would be nice to have a graphical view of what's in the GFF3 file. To do this GFF3 files can be loaded into programs like Apollo and GBrowse. MWAS allows you to view the files in Apollo directly on the website. You also get summary statistics from SOBA.


Apollo

Select the file in Apollo, and open it. You will notice that there are a number of bars representing the gene annotations and the evidence alignments supporting those annotations. Annotations are in the middle light colored panel, and evidence alignments are in the dark panels at the top and bottom.


All the evidence in the dark panels is in the same color, which makes it difficult to identify the source of each piece of evidence without manually clicking on them. MAKER comes with a configuration file for Apollo which gives a more colorful view of MAKER produced annotations and evidence. Let's close Apollo, copy this configuration file and then reload the annotations.

You can now see the annotations and evidence in nice color. Click on each piece of evidence and you will see its source in the table at the bottom of the Apollo screen.

Possible Sources Include:

  • BLASTN - BLASTN alignment of EST evidence
  • BLASTX - BLASTX alignment of protein evidence
  • TBLASTX - TBLASTX alignment of EST evidence from closely related organisms
  • EST2Genome - Polished EST alignment from Exonerate
  • Protein2Genome - Polished protein alignment from Exonerate
  • SNAP - SNAP ab initio gene prediction
  • GENEMARK - GeneMark ab initio gene prediction
  • Augustus - Augustus ab initio gene prediction
  • FgenesH - FGENESH ab initio gene prediction
  • Repeatmasker - RepeatMasker identified repeat
  • Blastx:Repeatmask - RepeatRunner identified repeat from the repeat protein database


Basic Input Files

All the basic input files for MAKER should be in fasta format.


  • genome - Genomic sequence file
  • est - ESTs from the same organism or from a very closely related organism (e.g. chimpanzee to human). These are aligned first via BLASTN with very strict filtering so any sequence divergence can prohibit the alignment.
  • altest - These are ESTs from other closely related organisms (e.g. mouse to human). They are aligned via TBLASTX in protein space, so greater sequence divergence is permitted.
  • protein - proteins from the same or other organisms. These are aligned via BLASTX against the genome. Proteins that align to a region will not necessarily be orthologous or paralogous. The alignment may just be based on short regions such as a shared domain. You may also get alignments to pseudogenes. Polishing BLASTX hits with Exonerate helps identify what are likely true paralogs and orthologs.
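
Before uploading, it can help to confirm that each input really is in fasta format. The following is a minimal sketch, not a full validator: it only checks that the first non-empty line is a header starting with '>' and then counts headers. fasta_summary is a hypothetical helper name.

```shell
# Rough sanity check for a fasta input file.
# fasta_summary is a hypothetical helper, not part of MAKER or MWAS.
fasta_summary() {
    # usage: fasta_summary <file>
    # Prints the number of records; returns non-zero if the file does not look like fasta.
    first=$(awk 'NF { print; exit }' "$1")      # first non-empty line
    case "$first" in
        '>'*) grep -c '^>' "$1" ;;
        *)    echo "$1: first non-empty line does not start with '>'" >&2
              return 1 ;;
    esac
}
```

For example, fasta_summary dpp_contig.fasta should print 1 for a file holding a single contig.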


Repeat Masking Options

Repeat masking is important for improving gene predictor performance and avoiding protein alignments to what are likely just transposons. You also expect a certain amount of genomic contamination in the EST database, and much of this contamination maps back to repeat regions. By repeat masking we can avoid issues with all types of input data.


  • model_org - This is a RepeatMasker option that lets you limit the repeat database to specific organisms or groups of organisms (i.e. vertebrates, Nematodes, Drosophila, primates etc). By default MAKER sets this to 'all'.
  • repeat_protein - This is a fasta file of transposon and virus related proteins. MAKER has an internal RepeatRunner database it uses by default.
  • rmlib - This is a fasta file of nucleotide repeats provided by the user. You can create a species specific repeat database using programs like PILER.


Gene Prediction Options

Gene prediction options affect the final gene annotations more than any other option type, because electronically produced gene annotations can only be as good as the gene predictions they are based on.


  • predictor - This tells MAKER what programs to run for generating annotations.
    • est2genome - Allows high quality spliced Exonerate EST alignments to become gene annotations. This only happens when there is no gene prediction overlapping the region. This is useful for generating gene annotations in the absence of a trained gene predictor.
    • model_gff - Allows user-defined models to be used
    • snap
    • augustus
    • genemark
    • fgenesh
  • unmask - Produce ab initio gene predictions for unmasked sequence as well as for masked sequence
  • snaphmm - SNAP training file (SNAP has some species files already available in the snap/HMM/ directory)
  • gmhmm - GeneMark training file (GeneMark self-trains and produces the resulting training file in the output mod/ directory)
  • augustus_species - Augustus species ID (Augustus uses an internal species index rather than a simple set of training files. Type 'augustus --species=help' to see the values you can choose)
  • fgenesh_par_file - FGENESH training file
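
Each ab initio predictor in the list above has a matching training setting, so a job that selects a predictor without its training information is incomplete. The sketch below checks for that. It is an illustration built only from the option names listed here: the dictionary layout, the function name, and the assumption that est2genome and model_gff need no training setting are not taken from MAKER itself.

```python
# Training setting needed by each ab initio predictor, per the list above.
REQUIRED_SETTING = {
    "snap": "snaphmm",
    "genemark": "gmhmm",
    "augustus": "augustus_species",
    "fgenesh": "fgenesh_par_file",
}
# Predictor values treated as needing no training setting (assumption).
NO_TRAINING = {"est2genome", "model_gff"}


def missing_training(options):
    """Return (predictor, setting) pairs whose training setting is empty.

    'options' is a plain dict standing in for the job form (hypothetical).
    """
    problems = []
    for name in options.get("predictor", []):
        if name in NO_TRAINING:
            continue
        if name not in REQUIRED_SETTING:
            raise ValueError("unknown predictor: " + name)
        setting = REQUIRED_SETTING[name]
        if not options.get(setting):
            problems.append((name, setting))
    return problems
```

An empty result means every selected predictor has the training file or species ID it depends on.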


Other MAKER Options

  • evaluate - runs an experimental annotation quality analysis program (Evaluator) on each annotation. Provides quantitative metrics for ranking annotations and identifying the features most in need of review. I'd like to emphasize that this is experimental.
  • max_dna_len - sets the length for dividing up contigs into chunks for processing. Larger chunks require more memory; smaller chunks require less memory. Allows the user to control system memory usage.
  • min_contig - sets the minimum length a contig must have or else it will be skipped.
  • min_protein - sets the minimum length a predicted protein must have (in amino acids) to be annotated.
  • split_hit - sets the expected max intron size for evidence alignments
  • pred_flank - sets the length of the sequence surrounding clusters of EST and protein evidence that will be used when building hint-based gene predictions.
  • single_exon - tells MAKER to consider single exon EST evidence when generating annotations. Single exon ESTs are more likely to be genomic contamination.
  • single_length - sets the minimum length required for single exon ESTs if 'single_exon' is enabled
  • keep_preds - adds non-overlapping ab initio gene predictions to the final annotation set rather than pushing them off into a separate file for the user to analyse. These predictions by definition do not overlap any form of supporting evidence.
  • retry - sets the number of times to retry a contig if there is a failure
  • clean_try - removes all data from previous MAKER runs before retrying a contig
  • clean_up - removes theVoid directory with individual raw analysis files at the end of the MAKER run
  • TMP - specifies a directory other than the system default temporary directory (/tmp) for writing temporary files. On some Linux systems the primary hard drive that also holds the default temporary directory is small, and most of the system's storage space is located on secondary hard drives mounted in directories elsewhere on the system. This is often true of computer clusters where each node has its own small hard drive for booting purposes, and most storage space is network mounted. Temporary files created by MAKER are deleted as the program advances, but individual files related to BLAST jobs can be quite large, so setting TMP to another location can be useful.
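
The interaction between min_contig and max_dna_len can be sketched as follows: contigs shorter than min_contig are skipped, and longer ones are processed in pieces of at most max_dna_len bases. This is a simplified illustration, not MAKER's implementation; the function name is hypothetical, and the real pipeline may treat chunk boundaries differently (for example by overlapping chunks).

```python
def chunk_contig(length, max_dna_len, min_contig=1):
    """Return 1-based inclusive (start, end) chunks for a contig.

    Returns an empty list when the contig is shorter than min_contig.
    Simplified sketch: chunks do not overlap.
    """
    if max_dna_len <= 0:
        raise ValueError("max_dna_len must be positive")
    if length < min_contig:
        return []  # contig is skipped entirely
    return [
        (start + 1, min(start + max_dna_len, length))
        for start in range(0, length, max_dna_len)
    ]
```

A smaller max_dna_len yields more, shorter chunks, which is why it lowers memory use per chunk.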


Training ab initio Gene Predictors

If you are involved in a genome project for an emerging model organism, you should already have an EST database, which would have been generated as part of the original sequencing project. A protein database can be collected from the genome databases of closely related organisms, or by using the UniProt/SwissProt protein database or the NCBI NR protein database. A trained ab initio gene predictor, however, is much more difficult to generate. Gene predictors require existing gene models on which to base prediction parameters, but with emerging model organisms there are no pre-existing gene models. So how then are you supposed to train your gene prediction programs?


MAKER gives the user the option to produce gene annotations directly from the EST evidence. You can then use these imperfect gene models to train a gene predictor program. Once you have re-run MAKER with the newly trained gene predictor, you can use the second set of gene annotations to train the gene predictor yet again. This bootstrap process allows you to iteratively improve the performance of ab initio gene predictors.
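
The bootstrap process described above amounts to a simple loop. The sketch below shows only its control flow: run_annotation and train_predictor are hypothetical callables supplied by the caller (for example, wrappers around a MAKER run and a predictor's training procedure), and none of these names come from MAKER.

```python
def bootstrap_training(run_annotation, train_predictor, rounds=2):
    """Alternate annotation and training for a number of rounds.

    run_annotation(model) -> annotations; model is None on the first
    pass, when gene models come directly from EST evidence.
    train_predictor(annotations) -> newly trained model.
    """
    model = None
    annotations = run_annotation(model)  # first pass: no trained predictor
    for _ in range(rounds):
        model = train_predictor(annotations)  # train on current gene models
        annotations = run_annotation(model)   # re-annotate with new model
    return model, annotations
```

How many rounds are worthwhile is left to the user; the default of two here is only a placeholder.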

GFF3 Pass-through

What if I'm not working on a new genome project, but instead have an existing annotation set and just want to update my genome database to reflect new protein and EST evidence? Here you can use a feature in MAKER called GFF3 pass-through, which allows you to pass existing annotations into the program and combine them w

mRNAseq

mRNAseq is a high throughput technique for sequencing the entire transcriptome, and it holds the promise of allowing researchers to identify all exons and alternative splice forms for every gene in the genome with a single experiment. It may soon make gene predictors (mostly) a thing of the past.

  • Still need to de-convolute reads & evidence (for now)
  • Still need to archive, manage, and distribute annotations




We are currently working on native support for mRNAseq data within the MAKER pipeline. However, because of the GFF3 pass-through option, there is a way to take advantage of mRNAseq reads right now. By mapping mRNAseq reads using Bowtie and TopHat, you can create GFF3 files of read islands and junctions. These data can then be passed in as EST evidence and will be used for generating hint-based gene predictions and for choosing final annotations.
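
As a rough illustration of what such a GFF3 record looks like, the sketch below formats one splice junction as a nine-column, tab-separated GFF3 line. The column order follows the GFF3 format, but the function name, the "tophat" source label, the "match" feature type, and the ID scheme are assumptions made for this example; they are not the output of any particular conversion script, and the feature types MAKER expects should be checked against its documentation.

```python
def junction_to_gff3(seqid, start, end, strand, reads, source="tophat"):
    """Format one junction as a GFF3 line (1-based inclusive coordinates).

    'reads' (number of supporting reads) is written to the score column.
    Feature type and attribute naming are illustrative assumptions.
    """
    if start > end:
        raise ValueError("start must not exceed end")
    if strand not in ("+", "-", "."):
        raise ValueError("strand must be '+', '-' or '.'")
    attributes = "ID=junction_%s_%d_%d" % (seqid, start, end)
    columns = [
        seqid, source, "match",          # seqid, source, type
        str(start), str(end),            # start, end
        str(reads), strand, ".",         # score, strand, phase
        attributes,                      # attributes
    ]
    return "\t".join(columns)
```

Writing one such line per junction, preceded by the "##gff-version 3" header line, gives a file in the general shape that pass-through evidence takes.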


Merge/Resolve Legacy Annotations

Legacy annotations

  • Many are no longer maintained by original creators
  • In some cases more than one group has annotated the same genome, using very different procedures, even different assemblies
  • Many investigators have their own genome-scale data and would like a private set of annotations that reflect these data
  • There will be a need to revise, merge, evaluate, and verify legacy annotation sets in light of RNA-seq and other data




MAKER will:

  • Identify the legacy annotation most consistent with the new data
  • Automatically revise it in light of the new data
  • Create a new annotation if no existing one is available