MAKER Tutorial 2010
- 1 VMware
- 2 Caveats
- 3 Maker Overview, Installation, and Basic Configuration for Annotating Genomic Sequence
- 3.1 About MAKER
- 3.2 Introduction to Genome Annotation
- 3.3 MAKER Overview
- 3.4 Installation
- 3.5 Getting Started with MAKER
- 3.6 Details of What is Going on Inside of MAKER
- 3.7 MAKER's Output
- 3.8 Viewing MAKER Annotations
- 4 Advanced MAKER Configuration, Re-annotation Options, and Improving Annotation Quality
- 4.1 Configuration Files in Detail
- 4.2 Training ab initio Gene Predictors
- 4.3 MAKER Web Annotation Service
- 4.4 mRNAseq
- 4.5 Merge/Resolve Legacy Annotations
- 4.6 MPI Support
- 4.7 User Interface for Local MAKER Installation
- 4.8 Appendix: MAKER Accessory Scripts
This tutorial was taught using a VMware system image as a starting point. If you want to start with the same system, download and install the start image (below). See VMware for what software you need to use a VMware system image and for directions on how to get the image up and running on your machine.
This tutorial describes the world as it existed on the day the tutorial was given. Please be aware that things like CPAN modules, Java libraries, and Linux packages change over time, and that the instructions in the tutorial will slowly drift out of date. Newer versions of tutorials will be posted as they become available.
Maker Overview, Installation, and Basic Configuration for Annotating Genomic Sequence
The first half of this page describes the basics of MAKER - the easy-to-use genome annotation pipeline.
MAKER is an easy-to-use genome annotation pipeline designed to be usable by small research groups with little bioinformatics experience; however, MAKER is also designed to be scalable and is appropriate for projects of any size even including use by large sequence centers. MAKER can be used for de novo annotation of newly sequenced genomes, for updating existing annotations to reflect new evidence, or just to combine annotations, evidence, and quality control statistics for use in other GMOD programs like GBrowse, JBrowse, Chado, and Apollo.
Introduction to Genome Annotation
What Are Annotations?
Annotations are descriptions of different features of the genome, and they can be structural or functional in nature.
- Structural Annotations: exons, introns, UTRs, splice forms etc.
- Functional Annotations: process a gene is involved in (metabolism), molecular function (hydrolase), location of expression (expressed in the mitochondria), etc.
It is especially important that every genome annotation carries an evidence trail that describes in detail the evidence used to both suggest and support it. This assists in quality control and downstream management of genome annotations.
Examples of evidence supporting a structural annotation:
- Ab initio gene predictions
- Protein homology
Importance of Genome Annotations
Why should the average biologist care about genome annotations?
Genome sequence itself is not very useful. The main question when any genome is sequenced is, "where are the genes?" To identify the genes we need to annotate the genome. And while most researchers probably don't give annotations a lot of thought, they use them every day.
Examples of Annotation Databases:
Every time we use techniques such as RNAi, PCR, gene expression arrays, targeted gene knockout, or ChIP we are basing our experiments on the information derived from a digitally stored genome annotation. If an annotation is correct, then these experiments should succeed; however, if an annotation is incorrect then the experiments that are based on that annotation are bound to fail. This brings up a major point:
- Incorrect and incomplete genome annotations poison every experiment that uses them.
Quality control and evidence management are therefore essential components of any annotation process.
Effect of NextGen Sequencing on the Annotation Process
It’s generally accepted that within the next few years it will be possible to sequence even human sized genomes for as little as $1,000 and in a short time frame. Pacific Biosciences is claiming they will be able to sequence a human sized genome in fifteen minutes by 2013. If the hype is to be believed, then whole genome sequencing will become "routine" for even small labs in the not so distant future. Unfortunately, however, advances in annotation technology have not kept pace with genome sequencing, and annotation is rapidly becoming a major bottleneck affecting modern genomics research.
- As of April 2010, 401 genomes were fully sequenced yet unpublished.
- Currently there are over 1300 eukaryotic genome projects underway. Assuming 10,000 genes per genome, that’s 13,000,000 new annotations (with this many new annotations, quality control and maintenance become an issue).
- While there are organizations dedicated to producing and distributing genome annotations (e.g. ENSEMBL and VectorBase), the sheer volume of newly sequenced genomes exceeds both their capacity and stated purview.
- Small research groups are affected disproportionately by the difficulties related to genome annotation, primarily because they often lack bioinformatics experience and must confront the difficulties associated with genome annotation on their own.
MAKER is an easy-to-use annotation pipeline designed to help smaller research groups convert the coming tsunami of genomic data provided by next generation sequencing technologies into a usable resource.
The easy-to-use annotation pipeline.
|User Requirements:||Can be run by a single individual with little bioinformatics experience|
|System Requirements:||Can run on laptop or desktop computers running Linux or Mac OS X (also cluster compatible)|
|Program Output:||Output is compatible with popular GMOD annotation tools like Apollo and GBrowse|
|Availability:||Free open source application (for academic use)|
What does MAKER do?
- Identifies and masks out repeat elements
- Aligns ESTs to the genome
- Aligns proteins to the genome
- Produces ab initio gene predictions
- Synthesizes these data into final annotations
- Produces evidence-based quality values for downstream annotation management
What sets MAKER apart from other tools (ab initio gene predictors etc.)?
MAKER is an annotation pipeline, not a gene predictor. MAKER does not predict genes; rather, it leverages existing software tools (some of which are gene predictors) and integrates their output to produce what MAKER believes to be the best possible gene model for a given location based on evidence alignments.
gene prediction ≠ gene annotation
- gene predictions are gene models.
- gene annotations are gene models but should include a documented evidence trail supporting the model in addition to quality control metrics.
This may seem like just a matter of semantics since the primary output for both ab initio gene predictors and the MAKER pipeline is the same, a collection of gene models. However there are a few very significant consequences to the differences between these programs that I will explain shortly.
Emerging vs. Classic Model Genomes
Emerging model organism genomes each come with their own set of issues that are not necessarily found in classic model genomes. These include difficulties associated with repeat identification, gene finder training, and other complex analyses. Unfortunately, emerging model organisms are often studied by very small research communities which often lack the resources and bioinformatics experience necessary to tackle these issues.
|Classic Model Organisms||Emerging Model Organisms|
|Well developed experimental systems||New experimental systems|
|Much prior knowledge about genome||Little prior knowledge about genome|
|Large community||Small communities|
|Big $||Less $|
|Examples: D. melanogaster, C. elegans, human, etc.||Examples: oomycetes, flat worms, cone snail, etc.|
Comparison of Algorithm Performance on Model vs. Emerging Genomes
If you have ever looked at comparisons of gene predictor performance on classic model organisms such as C. elegans you would conclude that ab initio gene predictors match or even outperform state of the art annotation pipelines, and the truth is that, with enough training data, they do. However, it is important to keep in mind that ab initio gene predictors have been specifically optimized to perform well on model organisms such as Drosophila and C. elegans, organisms for which we have large amounts of pre-existing data to both train and tweak the prediction parameters.
|Table: MAKER's Performance on the C. elegans genome|
|Ab initio||Evidence Based|
|Genomic Overlap (gene)|
What about emerging model organisms for which little data is available? Gene prediction in classic model organisms is relatively simple because there are already a large number of experimentally determined and verified gene models, but with emerging model organisms, we are lucky to have a handful of gene models to train with. As a result ab initio gene predictors generally perform very poorly on emerging genomes.
By using ab initio gene predictors inside of the MAKER pipeline instead of as stand-alone applications you get certain benefits:
- Provide gene models as well as an evidence trail for quality control and manual curation
- Provide a mechanism to train and retrain ab initio gene predictors for even better performance.
- Output can be easily loaded into a GMOD compatible database for annotation distribution (including evidence associations).
- Annotations can be automatically updated with new evidence by simply passing existing annotation sets back into the pipeline
- threads (Optional, for MPI scripts)
- IO::All (Optional, for accessory scripts)
- IO::Prompt (Optional, for accessory scripts)
- Perl 5.8.0 or Higher
- BioPerl 1.6 or higher
- SNAP version 2009-02-03 or higher
- RepeatMasker 3.1.6 or higher
- Exonerate 1.4 or higher
You must also install one of the following:
Required for optional MPI support:
(Working on Amazon EC2 support. Can also start MAKER multiple times and get parallelization without MPI. Subsequent MAKER instances will detect already running instances and integrate seamlessly.)
The MAKER Package
Because of the number of prerequisites, we will not cover the details of installing these other programs; they have already been installed for you. But even though I did pre-install most programs for you, I'm still going to have you perform basic post-installation configuration, so let's get started.
MAKER can be downloaded from:
- http://www.yandell-lab.org/ - but it should already be on the image
To keep everyone from hitting the server at once though, I have already placed MAKER in the ~/Documents/Software/maker/ directory. Let's take a look at the package's contents.
cd Documents/Software/maker/
ls -1
Note: That is a dash one (-1), not a dash el (-l), on the ls command.
You should now see the following:
Apollo/
bin/
data/
INSTALL
JBrowse/
lib/
MPI/
MWAS/
NCBI/
README
There are two files in particular that you would want to look at when installing MAKER - INSTALL and README. INSTALL gives a brief overview of MAKER and prerequisite installation. Let's take a look at this.
You can see there is a step by step guide for installing prerequisites as well as MAKER. Since the prerequisites are already installed, jump to the MAKER installation section (press space bar to scroll down).
7. Install MAKER. Download from http://www.yandell-lab.org
a. Unpack the MAKER tar file into the directory of your choice (i.e. /usr/local).
b. Now add the following to your .profile if you haven't already:
export ZOE="where_snap_is/Zoe"
export AUGUSTUS_CONFIG_PATH="where_augustus_is/config"
c. Add the location where you installed MAKER and all prerequisite programs to your PATH variable in .profile. (i.e. export PATH=/usr/local/maker/bin:$PATH).
d. You can now run a test of MAKER by following the instructions in the MAKER README file.
See the README file for details on installing mpi_maker
According to the documentation we need to add a few entries to your user profile. So let's open it in a text editor.
Add the following to your user profile.
PATH=/home/gmod/Documents/Software/maker/bin:$PATH
PATH=/usr/local/ncbi-blast/bin:$PATH
PATH=/usr/local/exonerate/bin:$PATH
PATH=/usr/local/augustus/bin:$PATH
PATH=/usr/local/snap:$PATH
PATH=/usr/local/gm_es:$PATH
PATH=/usr/local/RepeatMasker:$PATH
export PATH
export ZOE=/usr/local/snap
export AUGUSTUS_CONFIG_PATH=/usr/local/augustus/config
Now reload your profile to make the changes take hold.
MAKER should now be installed. Let's test the executable. We should see the usage statement.
Getting Started with MAKER
Before we begin with any examples, I want everyone to note that all finished examples are located in ~/Documents/Data/maker, so if you fall behind you can always find MAKER control files, datasets, and final results in there.
Let's just quickly take a look
cd ~/Documents/Data/maker
ls -1
You should see five example folders
example1_dmel
example2_pyu
example3_mRNAseq
example4_legacy
example5_ecoli
Let's look inside example1
ls -1 example1_dmel
You will see a directory called finished.maker.output which contains all the final results for the example. Each of the other examples will contain a similar directory.
Now let's get started!
RUNNING MAKER WITH EXAMPLE DATA
MAKER comes with some example input files to test the installation and to familiarize the user with how to run MAKER. The example files are found in the maker/data directory.
ls -1 /home/gmod/Documents/Software/maker/data
dpp_contig.fasta
dpp_proteins.fasta
dpp_transcripts.fasta
te_proteins.fasta
The example files are in FASTA format. MAKER requires FASTA format for its input files. Let's take a look at one of these files to see what the format looks like.
cat dpp_proteins.fasta
>dpp-CDS-5
MRAWLLLLAVLATFQTIVRVASTEDISQRFIAAIAPVAAHIPLASASGSGSGRSGSRSVG
ASTSTALAKAFNPFSEPASFSDSDKSHRSKTNKKPSKSDANRQFNEVHKPRTDQLENSKN
KSKQLVNKPNHNKMAVKEQRSHHKKSHHHRSHQPKQASASTESHQSSSIESIFVEEPTLV
LDREVASINVPANAKAIIAEQGPSTYSKEALIKDKLKPDPSTLVEIEKSLLSLFNMKRPP
KIDRSKIIIPEPMKKLYAEIMGHELDSVNIPKPGLLTKSANTVRSFTHKDSKIDDRFPHH
HRFRLHFDVKSIPADEKLKAAELQLTRDALSQQVVASRSSANRTRYQVLVYDITRVGVRG
QREPSYLLLDTKTVRLNSTDTVSLDVQPAVDRWLASPQRNYGLLVEVRTVRSLKPAPHHH
VRLRRSADEAHERWQHKQPLLFTYTDDGRHKARSIRDVSGGEGGGKGGRNKRQPRRPTRR
KNHDDTCRRHSLYVDFSDVGWDDWIVAPLGYDAYYCHGKCPFPLADHFNSTNHAVVQTLV
NNMNPGKVPKACCVPTQLDSVAMLYLNDQSTVVLKNYQEMTVVGCGCR
FASTA format is fairly simple. It contains a definition line starting with '>' that contains a name for a sequence followed by the actual sequence in nucleotide or amino acid format. The file we are looking at contains protein sequences, so the sequence uses the single letter code for amino acids. A minimal input file set for MAKER would generally consist of a FASTA file for the genomic sequence, a FASTA file of ESTs derived from the transcriptome, and a FASTA file of protein sequences from the same or related organisms. I'll describe in more detail exactly what MAKER does with each data file shortly.
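As a quick sanity check on any FASTA input before handing it to MAKER, something like the following can list each record and its sequence length. This is only a minimal sketch: the function name fasta_lengths is made up for this tutorial and is not part of MAKER, and it assumes the record name is the first word after the '>' character.

```shell
# Print "<name><TAB><length>" for every record in a FASTA file.
# fasta_lengths is a hypothetical helper, not a MAKER script.
fasta_lengths() {
    awk '
        /^>/ {
            # A new definition line: report the previous record, if any.
            if (name != "") print name "\t" len
            name = substr($1, 2)   # first word, without the leading ">"
            len = 0
            next
        }
        {
            # Sequence line: ignore stray whitespace and add up the residues.
            gsub(/[ \t\r]/, "")
            len += length($0)
        }
        END { if (name != "") print name "\t" len }
    ' "$1"
}
```

For example, running fasta_lengths dpp_proteins.fasta in the maker/data directory would print one line per protein in the file.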
Now we are going to copy the example files to the example1_dmel directory we looked at earlier before running MAKER.
cd /home/gmod/Documents/Data/maker/example1_dmel
cp /home/gmod/Documents/Software/maker/data/dpp* .
Next we need to tell MAKER all the details about how we want the annotation process to proceed. Because there can be many variables and options involved in annotation, command line options would be too numerous and cumbersome. Instead MAKER uses a set of configuration files which guide each run. You can create a set of generic configuration files in the current working directory by typing the following.
This creates three files (type ls -l to see).
- maker_exe.ctl - contains the path information for needed executables.
- maker_bopts.ctl - contains filtering statistics for BLAST and Exonerate
- maker_opts.ctl - contains all other information for MAKER, including the location of the input genome file.
Control files are run specific and separate control files will need to be built for each genome given to MAKER. MAKER will look for control files in the current working directory, so it is recommended that MAKER should be run in a separate directory containing unique control files for each genome.
Let's take a look at the maker_exe.ctl file.
You will see the names of a number of MAKER supported executables as well as the path to their location. If you followed the installation instructions correctly, including the instructions for installing prerequisite programs, all executable paths should show up automatically for you. However if the location to any of the executables is not set in your PATH environmental variable, as per installation instructions, you will have to add these manually to the maker_exe.ctl file every time you run MAKER.
Lines in the MAKER control files have the format key:value with no spaces before or after the colon (:). If the value is a file name, you can use relative paths and environment variables, e.g. snap:$HOME/snap. Note that in all control files, the comments written to help users begin with a pound sign (#). In addition, the option names before the colon cannot be changed.
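To double-check what a control file currently says for a given option, a small helper along these lines can be handy. It is a sketch only: ctl_get is a hypothetical name, not a MAKER command, and it assumes the key:value layout and '#' comments described above, including comments that follow a value on the same line.

```shell
# ctl_get FILE KEY: print the value set for KEY in a MAKER control file.
# Hypothetical helper; assumes "key:value" lines and "#" comments.
ctl_get() {
    awk -v key="$2" '
        { sub(/#.*/, "") }                  # drop comments
        {
            i = index($0, ":")              # split on the first colon only
            if (i == 0) next
            if (substr($0, 1, i - 1) == key) {
                val = substr($0, i + 1)
                sub(/[ \t\r]+$/, "", val)   # trim trailing whitespace
                print val
                exit
            }
        }
    ' "$1"
}
```

For example, ctl_get maker_opts.ctl genome would print the genome file configured for the current run, or an empty line if the option is present but unset.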
Now let's take a look at the maker_bopts.ctl file.
In this file you will find values you can edit for downstream filtering of BLAST and Exonerate alignments. At the very top of the file you will see that I have the option to tell MAKER whether I prefer to use WU-BLAST or NCBI-BLAST. We want to set this to NCBI-BLAST, since that is what is installed. We can just leave the remaining values as the default.
Now let's take a look at the maker_opts.ctl file.
This is the primary configuration file for MAKER specific options. Here we need to set the location of the genome, EST, and protein input files we will be using. These come from the supplied example files. We also need to set repeat masking options, as well as a number of other configurations. We'll discuss these options in more detail later on, but for now just adjust the following values.
genome:dpp_contig.fasta
est:dpp_transcripts.fasta
protein:dpp_proteins.fasta
predictor:est2genome
Now let's run MAKER.
You should now see a large amount of status information flowing past your screen. If you don't want to see this you can run MAKER with the -q option for "quiet" on future runs.
Details of What is Going on Inside of MAKER
The first step to MAKER is repeat masking, but why do we need to do this? Repetitive elements can make up a significant portion of the genome. Some of these repeats are simple/low-complexity repeats where you have runs of C's or G's or maybe even dinucleotide repeats. Other repeats are more complex, e.g. transposable elements. These high-complexity repeats often encode real proteins like reverse transcriptase or even Gag, Pol, and Env viral proteins. Because they encode real proteins, they can play havoc with ab initio gene predictors. For example, a transposable element that occurs next to or even within the intron of a real protein-encoding gene might cause a gene predictor to include extra exons as part of a gene model, sequence which really only belongs to the transposable element and not to the coding sequence of the gene. You will also get hundreds of instances where identical transposable element proteins get annotated as being part of an organism's proteome. In addition to these issues, low-complexity repeat regions can align with high statistical significance to low-complexity protein regions, creating a false sense of homology throughout the genome. To avoid these complications it is convenient to identify and mask any repeat elements before doing other analyses.
MAKER identifies repeats in two steps.
- First a program called RepeatMasker is used to identify low-complexity and high-complexity repeats that match entries in the RepBase repeat library, or any species specific repeat library supplied by the user.
- Next MAKER uses RepeatRunner to identify transposable element and viral proteins from the RepeatRunner protein database. Because protein sequence diverges at a slower rate than nucleotide sequence, this step helps pick up the most problematic regions of divergent repeats that are missed by RepeatMasker, which searches in nucleotide space.
Regions identified during repeat analysis are masked out so as not to complicate other downstream annotation analyses.
- High-complexity repeats are hard-masked, a technique in which nucleotide sequence is replaced with the letter N to prohibit any alignments to that region.
- Low-complexity regions are soft-masked, a technique in which nucleotides are made lower case so they can be treated as masked under certain situations without losing sequence information. I will discuss some of the applications and effects of soft-masking later.
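The difference between the two masking styles is easy to see on a toy sequence. The sketch below is illustrative only: mask_region is a made-up helper that masks one region given 1-based inclusive coordinates, whereas MAKER itself relies on RepeatMasker and RepeatRunner to decide which regions to mask.

```shell
# mask_region SEQ START END MODE
# Mask positions START..END (1-based, inclusive) of SEQ.
# MODE "hard" replaces the bases with N; any other MODE lower-cases them.
# Hypothetical helper for illustration, not part of MAKER.
mask_region() {
    printf '%s\n' "$1" | awk -v s="$2" -v e="$3" -v mode="$4" '
        {
            left  = substr($0, 1, s - 1)
            mid   = substr($0, s, e - s + 1)
            right = substr($0, e + 1)
            if (mode == "hard") gsub(/./, "N", mid)   # sequence is lost
            else                mid = tolower(mid)    # sequence is kept
            print left mid right
        }'
}
```

With the sequence ACGTACGTAC and the region 3 to 6, hard-masking gives ACNNNNGTAC while soft-masking gives ACgtacGTAC, so only the soft-masked version can still be aligned through.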
Now the idea of masking out sequence might seem on the surface like we're losing a lot of information, and it is true that there can be proteins that have integrated repeats into their structure, so repeat masking will affect our ability to annotate these proteins. However, these proteins are rare and the number of gene models and homology alignments improved by this step far exceeds the few gene models that may be negatively affected. You do have the option to run ab initio gene predictors on both the masked and unmasked sequence if repeat masking worries you though. You do this by setting unmask:1 in the maker_opts.ctl configuration file.
Ab Initio Gene Prediction
Following repeat masking, MAKER runs ab initio gene predictors specified by the user to produce preliminary gene models. Ab initio gene predictors produce gene predictions based on underlying mathematical models describing patterns of intron/exon structure and consensus start signals. Because the patterns of gene structure are going to differ from organism to organism, you must train gene predictors before you can use them. I will discuss how to do this later on.
MAKER currently supports:
You must specify in the maker_opts.ctl file the training parameters file you want to use when running each of these algorithms.
EST and Protein Evidence Alignment
A simple way to indicate if a sequence region is likely associated with a gene is to identify (A) if the region is actively being transcribed or (B) if the region has homology to a known protein. This can be done by aligning Expressed Sequence Tags (ESTs) and proteins to the genome using alignment algorithms.
- ESTs are sequences derived from a cDNA library. Because of the difficulties associated with working with mRNA and depending on how the cDNA library was prepared, EST databases usually represent bits and pieces of transcribed mRNAs with only a few full length transcripts. MAKER aligns these sequences to the genome using BLASTN. If ESTs from the organism being annotated are unavailable or sparse, you can use ESTs from a closely related organism. However, ESTs from closely related organisms are unlikely to align using BLASTN since nucleotide sequences can diverge quite rapidly. For these ESTs, MAKER uses TBLASTX to align them in protein space.
- Protein sequence generally diverges quite slowly over large evolutionary distances. As a result, proteins from even evolutionarily distant organisms can be aligned against raw genomic sequence to try and identify regions of homology. MAKER does this using BLASTX.
Remember now that we are aligning against the repeat-masked genomic sequence. How is this going to affect our alignments? For one thing we won't be able to align against low-complexity regions. Some real proteins contain low-complexity regions and it would be nice to identify those, but if I let anything align to a low-complexity region, then I will get spurious alignments all over the genome. Wouldn't it be nice if there was a way to allow BLAST to extend alignments through low-complexity regions, but only if there is already an alignment somewhere else? You can do this with soft-masking. If you remember, soft-masking uses lower case letters to mask sequence without losing the sequence information. BLAST allows you to use soft-masking to keep alignments from seeding in low-complexity regions, but allows you to extend through them. This of course will allow some of the spurious alignments you were trying to avoid, but overall you still end up suppressing the majority of poor alignments while letting through enough real alignments to justify the cost. You can turn this behavior off though if it bothers you by setting softmask:0 in the maker_bopts.ctl file.
Polishing Evidence Alignments
Because of oddities associated with how BLAST statistics work, BLAST alignments are not as informative as they could be. BLAST will align regions anywhere it can, even if the algorithm aligns regions out of order, with multiple overlapping alignments in the exact same region, or with slight overhangs around splice sites.
To get more informative alignments MAKER uses the program Exonerate to polish BLAST hits. Exonerate realigns each sequence identified by BLAST around splice sites and forces the alignments to occur in order. The result is a high quality alignment that can be used to suggest near exact intron/exon positions. Polished alignments are produced using the est2genome and protein2genome options for Exonerate.
One of the benefits of polishing EST alignments is the ability to identify the strand an EST derives from. Because of amplification steps involved in building an EST library and limitations involved in some high throughput sequencing technologies, you don't necessarily know whether you're really aligning the forward or reverse transcript of an mRNA. However, if you take splice sites into account, you can only align to one strand correctly.
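The strand argument can be made concrete with the canonical GT...AG intron signal. An intron of a forward-strand transcript begins with GT and ends with AG in the genomic sequence, while an intron of a reverse-strand transcript appears as CT...AC, which is the reverse complement. The sketch below only illustrates this reasoning: intron_strand is a made-up helper, and Exonerate uses a much richer splice site model than a two-base check.

```shell
# intron_strand SEQ: guess the transcript strand from the ends of an
# intron, given as it appears on the forward strand of the genome.
# Prints "+", "-", or "?" when neither canonical pattern matches.
# Hypothetical helper for illustration only.
intron_strand() {
    # Upper-case the input so soft-masked (lower-case) sequence also works.
    upper=$(printf '%s' "$1" | tr 'a-z' 'A-Z')
    case "$upper" in
        GT*AG) echo "+" ;;   # canonical donor/acceptor on the forward strand
        CT*AC) echo "-" ;;   # reverse complement of GT...AG
        *)     echo "?" ;;
    esac
}
```

For example, intron_strand GTAAGTTTTCAG reports the forward strand and intron_strand CTGAAAACTTAC reports the reverse strand.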
Integrating Evidence to Synthesize Annotations
Once you have ab initio predictions, EST alignments, and protein alignments you can integrate this evidence to produce even better gene predictions. MAKER does this by "talking" to the gene prediction programs. MAKER takes all the evidence, generates "hints" as to where splice sites and protein coding regions are located, and then passes these "hints" to programs that will accept them.
MAKER produces hint based predictors for:
- GeneMark (under development)
Selecting and Revising the Final Gene Model
MAKER then takes the entire pool of ab initio and evidence-informed gene predictions, updates features such as 5' and 3' UTRs based on EST evidence, tries to determine alternative splice forms where EST data permits, and produces quality control metrics for each gene model (these are included in the output). MAKER then chooses from among all the gene model possibilities the one that best matches the evidence. This is done using a modified sensitivity/specificity distance metric.
MAKER can use evidence from EST alignments to revise gene models to include features such as 5' and 3' UTRs.
Finally MAKER calculates quality control statistics to assist in future downstream management and curation of gene models outside of MAKER.
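To give a feel for what a sensitivity/specificity distance looks like, here is a toy calculation. It is not MAKER's actual formula, which this tutorial does not spell out. The sketch assumes sensitivity is the fraction of evidence bases covered by the gene model, specificity is the fraction of model bases supported by evidence, and the distance is one minus their average, so 0 means perfect agreement. The function name sn_sp_distance is hypothetical.

```shell
# sn_sp_distance OVERLAP EVIDENCE_LEN MODEL_LEN
# Toy distance between a gene model and its evidence (assumed formula,
# not taken from MAKER): 1 - (sensitivity + specificity) / 2.
sn_sp_distance() {
    awk -v o="$1" -v e="$2" -v m="$3" 'BEGIN {
        sn = o / e      # fraction of evidence bases covered by the model
        sp = o / m      # fraction of model bases supported by evidence
        printf "%.3f\n", 1 - (sn + sp) / 2
    }'
}
```

Under these assumptions, a model of 160 bases that shares 80 bases with 100 bases of evidence scores 0.350, and a competing model with a smaller distance would be the better match to the evidence.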
If you look in the current working directory, you will see that MAKER has created an output directory called dpp_contig.maker.output. The name of the output directory is based off of the input genomic sequence file, which in this case was dpp_contig.fasta.
Now let's see what's inside the output directory.
cd dpp_contig.maker.output
ls -1
You should now see a list of directories and files created by MAKER.
dpp_contig_datastore
dpp_contig_master_datastore_index.log
maker_bopts.log
maker_exe.log
maker_opts.log
mpi_blastdb
- The maker_opts.log, maker_exe.log, and maker_bopts.log files are logs of the control files used for this run of MAKER.
- The mpi_blastdb directory contains FASTA indexes and BLAST database files created from the input EST, protein, and repeat databases.
- The dpp_contig_master_datastore_index.log contains information on both the run status of individual contigs and information on where individual contig data is stored.
- The dpp_contig_datastore directory contains a set of subfolders, each containing the final MAKER output for individual contigs from the genomic fasta file.
Once a MAKER run is finished the most important file to look at is the dpp_contig_master_datastore_index.log to see if there were any failures.
If everything proceeded correctly you should see the following.
contig-dpp-500-500 dpp_contig_datastore/contig-dpp-500-500 STARTED
contig-dpp-500-500 dpp_contig_datastore/contig-dpp-500-500 FINISHED
There are only entries describing a single contig because there was only one contig in the example file. These lines indicate that the contig contig-dpp-500-500 STARTED and then FINISHED without incident. Other possible entries include:
- FAILED - indicates a failed run on this contig; MAKER will retry these
- RETRY - indicates that MAKER is retrying a contig that failed
- SKIPPED_SMALL - indicates the contig was too short to annotate (minimum contig length is specified in maker_opts.ctl)
- DIED_SKIPPED_PERMANENT - indicates a failed contig that MAKER will not attempt to retry (number of times to retry a contig is specified in maker_opts.ctl)
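On a genome with many contigs it is impractical to read the index log by eye. The following sketch lists every contig whose most recent entry is anything other than FINISHED. It assumes each line holds the contig name, the output directory, and the status separated by whitespace, as in the example above, and that contig names contain no spaces. The function name unfinished_contigs is made up for this tutorial.

```shell
# unfinished_contigs LOG: print "<contig><TAB><last status>" for every
# contig whose last entry in the index log is not FINISHED.
# Hypothetical helper; assumes "name directory STATUS" on each line.
unfinished_contigs() {
    awk '
        NF >= 3 {
            # Remember the order contigs first appear in and their last status.
            if (!($1 in status)) order[++n] = $1
            status[$1] = $NF
        }
        END {
            for (i = 1; i <= n; i++)
                if (status[order[i]] != "FINISHED")
                    print order[i] "\t" status[order[i]]
        }
    ' "$1"
}
```

Running unfinished_contigs dpp_contig_master_datastore_index.log on the example run should print nothing, since the only contig finished without incident.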
The entries in the dpp_contig_master_datastore_index.log file also indicate that the output files for this contig are stored in the directory dpp_contig_datastore/contig-dpp-500-500/. Knowing where the output is stored may seem rather trivial; however, input genome fasta files can contain thousands or even hundreds of thousands of contigs, and many file-systems have performance problems with large numbers of sub-directories and files within a single directory. Even when the underlying file-systems handle things gracefully, access via network file-systems can be an issue. To deal with this situation, MAKER uses a datastore module to create a hierarchy of sub-directory layers, starting from a 'base', and mapping identifiers to corresponding sub-directories. For situations where the input genome fasta file contains more than 1,000 contigs, the datastore structure is used automatically, and the master_datastore_index.log file becomes essential for identifying where the output for a given contig is stored.
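The general idea of spreading identifiers over layers of sub-directories can be sketched as follows. This is not MAKER's datastore module, whose exact mapping is not described here. It is an assumed scheme that uses the POSIX cksum utility to turn an identifier into two directory levels of up to 256 entries each, and the function name datastore_path is hypothetical.

```shell
# datastore_path BASE ID: map an identifier to BASE/XX/YY/ID, where XX
# and YY are two hex bytes derived from a checksum of the identifier.
# Assumed layout for illustration; MAKER's real mapping may differ.
datastore_path() {
    # cksum prints "<crc> <byte count>"; keep only the CRC.
    crc=$(printf '%s' "$2" | cksum | cut -d ' ' -f 1)
    printf '%s/%02X/%02X/%s\n' "$1" $((crc % 256)) $((crc / 256 % 256)) "$2"
}
```

Because the path depends only on the identifier, the same contig always lands in the same place, and no single directory has to hold every contig. In a real MAKER run, the master_datastore_index.log file remains the authoritative record of where each contig's output lives.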
Now let's take a look at what MAKER produced for the contig 'contig-dpp-500-500'.
cd dpp_contig_datastore/contig-dpp-500-500
ls -1
The directory should contain a number of files and a directory.
contig-dpp-500-500.gff
contig-dpp-500-500.maker.proteins.fasta
contig-dpp-500-500.maker.transcripts.fasta
run.log
theVoid.contig-dpp-500-500
- The contig-dpp-500-500.gff contains all annotations and evidence alignments in GFF3 format. This is the important file for use with Apollo or GBrowse.
- The contig-dpp-500-500.maker.transcripts.fasta and contig-dpp-500-500.maker.proteins.fasta files contain the transcript and protein sequences for MAKER produced gene annotations.
- The run.log file is a log file. If you change settings and rerun MAKER on the same dataset, or if you are running a job on an entire genome and the system fails, this file lets MAKER know what analyses need to be deleted, rerun, or can be carried over from a previous run. One advantage of this is that rerunning MAKER is extremely fast, and your runs are virtually immune to all system failures.
- The directory theVoid.contig-dpp-500-500 contains raw output files from all the programs MAKER wraps around (BLAST, SNAP, RepeatMasker, etc.). You can usually ignore this directory and its contents.
Viewing MAKER Annotations
Let's take a look at the GFF3 file produced by MAKER.
As you can see, manually viewing the raw GFF3 file produced by MAKER really isn't that meaningful. While you can identify individual features such as genes, mRNAs, and exons, trying to interpret those features in the context of thousands of other genes and thousands of bases of sequence really can't be done by directly looking at the GFF3 file.
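Even without a browser, a quick tally of feature types gives a useful first impression of a GFF3 file. The sketch below counts the values in the third (type) column, skips comment lines, and stops at the ##FASTA directive that may introduce embedded sequence. The function name gff3_feature_counts is made up for this tutorial; the nine tab-separated columns are part of the GFF3 format itself.

```shell
# gff3_feature_counts FILE: print "<type><TAB><count>" for each feature
# type (column 3) in a GFF3 file, sorted by type name.
# Hypothetical helper, not a MAKER accessory script.
gff3_feature_counts() {
    awk -F '\t' '
        /^##FASTA/      { exit }     # embedded sequence follows; stop here
        /^#/ || NF < 9  { next }     # skip comments and malformed lines
        { count[$3]++ }
        END { for (t in count) print t "\t" count[t] }
    ' "$1" | LC_ALL=C sort
}
```

For example, gff3_feature_counts contig-dpp-500-500.gff would show how many genes, mRNAs, exons, and evidence features MAKER wrote for the contig.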
Let's load the contig-dpp-500-500.gff into Apollo and take a look at what MAKER produced. Copy the contig-dpp-500-500.gff file to your home directory to make it easy to locate.
cp contig-dpp-500-500.gff ~
Before starting Apollo, note that MAKER comes with a configuration file that lets Apollo display MAKER annotations and evidence in distinct colors (otherwise everything is drawn in the same white color). Copy the configuration file to the ~/.apollo directory to make it available to Apollo.
cp /home/gmod/Documents/Software/maker/Apollo/gff3.tiers ~/.apollo/
Now open Apollo and select our GFF3 file.
You will notice that there are a number of bars representing the gene annotations and the evidence alignments supporting those annotations. Annotations are in the middle light colored panel, and evidence alignments are in the dark panels at the top and bottom. As you have probably realized, this view is much easier to interpret than looking directly at the GFF3 file.
Now click on each piece of evidence and you will see its source in the table at the bottom of the Apollo screen.
Possible Sources Include:
- BLASTN - BLASTN alignment of EST evidence
- BLASTX - BLASTX alignment of protein evidence
- TBLASTX - TBLASTX alignment of EST evidence from closely related organisms
- EST2Genome - Polished EST alignment from Exonerate
- Protein2Genome - Polished protein alignment from Exonerate
- SNAP - SNAP ab initio gene prediction
- GENEMARK - GeneMark ab initio gene prediction
- Augustus - Augustus ab initio gene prediction
- FgenesH - FGENESH ab initio gene prediction
- Repeatmasker - RepeatMasker identified repeat
- RepeatRunner - RepeatRunner identified repeat from the repeat protein database
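The same source information is stored in the second column of every feature line in the GFF3 file, so you can also get a quick tally from the command line. The sketch below is not part of MAKER; the function name count_by_source is made up for this example, and the exact source labels you see will depend on which programs were run.

```shell
# Sketch: tally GFF3 feature lines by their source (column 2).
count_by_source() {
    # usage: count_by_source <gff3_file>
    awk -F '\t' '
        /^##FASTA/     { exit }   # stop at an embedded sequence section, if present
        /^#/ || NF < 9 { next }   # skip comments, directives and incomplete lines
                       { n[$2]++ }
        END            { for (s in n) print n[s] "\t" s }
    ' "$1" | sort -t "$(printf '\t')" -k1,1nr -k2,2
}

# Example call, using the file copied to the home directory earlier:
# count_by_source ~/contig-dpp-500-500.gff
```

The output lists one source per line with its count, most frequent first, which gives a rough picture of how much evidence of each kind supports the annotations.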
Advanced MAKER Configuration, Re-annotation Options, and Improving Annotation Quality
The remainder of this page mainly presents issues that can be encountered during the annotation process. I then describe how MAKER can be used to resolve each issue.
Configuration Files in Detail
Let's take a closer look at the configuration options in the maker_opts.ctl file.
cd /home/gmod/Documents/Data/maker/example1_dmel
emacs maker_opts.ctl
Basic Input Files
All the basic input files for MAKER should be in FASTA format.
- genome - Genomic sequence file
- est - ESTs from the same organism or from a very closely related organism (e.g. chimpanzee to human). These are aligned first via BLASTN with very strict filtering, so even modest sequence divergence can prevent alignment.
- altest - ESTs from other closely related organisms (e.g. mouse to human). They are aligned via TBLASTX in protein space, so greater sequence divergence is permitted.
- protein - proteins from the same or other organisms. These are aligned via BLASTX against the genome. Proteins that align to a region will not necessarily be orthologous or paralogous. The alignment may just be based on short regions such as a shared domain. You may also get alignments to pseudogenes. Polishing BLASTX hits with Exonerate helps identify what are likely true paralogs and orthologs.
Repeat Masking Options
Repeat masking is important for improving gene predictor performance and avoiding protein alignments to what are likely just transposons. You should also expect a certain amount of genomic contamination in the EST database, and much of this contamination maps back to repeat regions. By repeat masking we can avoid issues with all types of input data.
- model_org - This is a RepeatMasker option that lets you limit the repeat database to specific organisms or groups of organisms (e.g. vertebrates, nematodes, Drosophila, primates). By default MAKER sets this to 'all'.
- repeat_protein - This is a fasta file of transposon and virus related proteins. MAKER has an internal RepeatRunner database it uses by default.
- rmlib - This is a fasta file of nucleotide repeats provided by the user. You can create a species specific repeat database using programs like PILER.
Gene Prediction Options
Gene prediction options affect the final gene annotations more than any other option type. This brings up the point that electronically produced gene annotations will only be as good as the gene predictions they are based on.
- predictor - This tells MAKER what programs to run for generating annotations.
- est2genome - Allows high quality spliced Exonerate EST alignments to become gene annotations. This only happens when there is no gene prediction overlapping the region. This is useful for generating gene annotations in the absence of a trained gene predictor.
- protein2genome - Attempts to build gene models directly from protein alignments (works on prokaryotes only)
- model_gff - This allows user defined models to be used
- pred_gff - This allows user provided ab initio predictions
- unmask - Produce ab initio gene predictions for unmasked sequence as well as for masked sequence
- snaphmm - SNAP training file (SNAP has some species files already available in the snap/HMM/ directory)
- gmhmm - GeneMark training file (GeneMark self-trains and produces the resulting training file in the output mod/ directory)
- augustus_species - Augustus species ID (Augustus uses an internal species index rather than a simple set of training files. Type 'augustus --species=help' to see the values you can choose)
- fgenesh_par_file - FGENESH training file
Other MAKER Options
- evaluate - runs an experimental annotation quality analysis program (Evaluator) on each annotation. Provides quantitative metrics for ranking annotations and identifying the features most in need of review. I'd like to emphasize that this is experimental.
- max_dna_len - sets the length for dividing up contigs into chunks for processing. Larger chunks require more memory; smaller chunks require less memory. Allows the user to control system memory usage.
- min_contig - sets the minimum length a contig must have or else it will be skipped.
- min_protein - sets the minimum length a predicted protein must have (in amino acids) to be annotated.
- split_hit - sets the expected max intron size for evidence alignments
- pred_flank - sets the length for the sequence surrounding clusters of EST and protein evidence that will be used when building hint based gene predictions.
- single_exon - tells MAKER to consider single exon EST evidence when generating annotations. Single exon ESTs are more likely to be genomic contamination.
- single_length - sets the minimum length required for single exon ESTs if 'single_exon' is enabled
- keep_preds - adds non-overlapping ab initio gene predictions to the final annotation set rather than pushing them off into a separate file for the user to analyse. These predictions by definition do not overlap any form of supporting evidence.
- retry - sets the number of times to retry a contig if there is a failure
- clean_try - removes all data from previous MAKER runs before retrying a contig
- clean_up - removes theVoid directory with individual raw analysis files at the end of the MAKER run
- TMP - specifies a directory other than the system default temporary directory (/tmp) for writing temporary files. On some Linux systems the primary hard drive that also holds the default temporary directory is small, and most of the system's storage space is located on secondary hard drives mounted in directories elsewhere on the system. This is often true of computer clusters where each node has its own small hard drive for booting purposes, and most storage space is network mounted. Temporary files created by MAKER are deleted as the program advances, but individual files related to BLAST jobs can be quite large, so setting TMP to another location can be useful.
Training ab initio Gene Predictors
If you are involved in a genome project for an emerging model organism, you should already have an EST database which would have been generated as part of the original sequencing project. A protein database can be collected from closely related organism genome databases or by using the UniProt/SwissProt protein database or the NCBI NR protein database. However a trained ab initio gene predictor is a much more difficult thing to generate. Gene predictors require existing gene models on which to base prediction parameters. However, with emerging model organisms you are not likely to have any pre-existing gene models. So how then are you supposed to train your gene prediction programs?
MAKER gives the user the option to produce gene annotations directly from the EST evidence. You can then use these imperfect gene models to train a gene predictor. Once you have re-run MAKER with the newly trained gene predictor, you can use the second set of gene annotations to train the gene predictor yet again. This bootstrap process allows you to iteratively improve the performance of ab initio gene predictors.
I've created an example file set so you can learn to train the gene predictor SNAP using this procedure.
First let's move to the example directory.
cd /home/gmod/Documents/Data/maker/example2_pyu
ls -1
You should see the following files (plus others) in the directory
pyu-contig.fasta
pyu-est.fasta
pyu-protein.fasta
We need to build maker configuration files and populate the appropriate values.
maker -CTL
gedit maker_opts.ctl
Edit the following:
genome:pyu-contig.fasta
est:pyu-est.fasta
protein:pyu-protein.fasta
predictor:est2genome
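If you prefer to script these edits rather than make them in an editor, something like the following sketch could be used. It is not part of MAKER: the function name set_ctl_option is made up for this example, and it assumes each option sits at the start of a line as key:value (a key=value line is handled the same way), optionally followed by a # comment, which is kept.

```shell
# Sketch: set one option in a MAKER control file, keeping any trailing comment.
# ASSUMPTION: options look like "key:value #comment" at the start of a line.
set_ctl_option() {
    # usage: set_ctl_option <ctl_file> <key> <value>
    _tmp=$(mktemp) || return 1
    awk -v k="$2" -v v="$3" '
        {
            sep = substr($0, length(k) + 1, 1)
            if (substr($0, 1, length(k)) == k && (sep == ":" || sep == "=")) {
                p = index($0, "#")
                comment = (p > 0) ? " " substr($0, p) : ""
                print k sep v comment     # rewrite the value, keep the comment
                next
            }
            print                         # every other line passes through unchanged
        }
    ' "$1" > "$_tmp" && cat "$_tmp" > "$1"
    _status=$?
    rm -f "$_tmp"
    return $_status
}

# Example calls for the values listed above:
# set_ctl_option maker_opts.ctl genome pyu-contig.fasta
# set_ctl_option maker_opts.ctl predictor est2genome
```

The key must match the whole option name, so setting est does not touch the est2genome line.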
MAKER is now configured to generate annotations from the EST data, so start the program (this will take a minute to run).
Once finished, load the file pyu-contig.maker.output/pyu-contig_datastore/scf1117875582023/scf1117875582023.gff into Apollo. You will see that there are far more regions with evidence alignments than there are gene annotations. This is because so few of the ESTs are spliced, and only spliced ESTs are capable of generating gene models.
mkdir gff
cp pyu-contig.maker.output/pyu-contig_datastore/scf1117875582023/scf1117875582023.gff gff/
cd gff
maker2zff.pl scf1117875582023.gff
ls -1
There should now be two new files. The first is the ZFF format file and the second is a FASTA file the coordinates can be referenced against. These will be used to train SNAP.
The basic steps for training SNAP are first to filter the input gene models, then to capture the genomic sequence immediately surrounding each model locus, and finally to use those captured segments to produce the HMM. You can explore the internal SNAP documentation for more details if you wish.
fathom -categorize 1000 output.ann output.dna
fathom -export 1000 -plus uni.ann uni.dna
forge export.ann export.dna
hmm-assembler.pl Pult . > Pult.hmm
cd ..
The final training parameters file is Pult.hmm. We do not expect SNAP to perform that well with this training file because it is based on incomplete gene models; however, this file is a good starting point for further training.
We need to run MAKER again with the new HMM file we just built for SNAP.
Now let's look at the output once again in Apollo. When you examine the annotations you should notice that the final MAKER gene models, displayed in light blue, are more abundant now and are in relatively good agreement with the evidence alignments. However, the SNAP ab initio gene predictions in the evidence tier do not yet match the evidence that well. This is because SNAP predictions are based solely on the mathematical descriptions in the HMM, whereas MAKER models also use evidence alignments to help further inform gene models. This demonstrates why, for emerging model organism genomes, you get better performance by running ab initio gene predictors like SNAP inside of MAKER rather than using them to produce gene models by themselves. The fact that the MAKER models are in better agreement with the evidence than the current SNAP models also means I can use the MAKER models to retrain SNAP in a bootstrap fashion, thereby improving SNAP's performance and consequently MAKER's performance.
Close Apollo, retrain SNAP, and run MAKER again.
mkdir gff2
cp pyu-contig.maker.output/pyu-contig_datastore/scf1117875582023/scf1117875582023.gff gff2/
cd gff2
maker2zff.pl scf1117875582023.gff
fathom -categorize 1000 output.ann output.dna
fathom -export 1000 -plus uni.ann uni.dna
forge export.ann export.dna
hmm-assembler.pl Pult . > Pult2.hmm
cd ..
emacs maker_opts.ctl
Change configuration file.
Let's examine the GFF3 file one last time in Apollo. As you can see, there is now a marked improvement in both the MAKER and SNAP gene models, and the two sets of models are in closer agreement with each other.
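Because the retraining commands are the same in every round, they can be wrapped in a small shell function if you plan to repeat the bootstrap several times. The sketch below simply replays the commands used above; the function name train_snap and its arguments are made up for this example, and it assumes maker2zff.pl, fathom, forge and hmm-assembler.pl are on your PATH.

```shell
# Sketch: one round of SNAP training from a MAKER GFF3 file.
# Runs in a subshell, so the caller's working directory is left unchanged.
train_snap() (
    # usage: train_snap <maker_gff3> <work_dir> <hmm_name>
    gff=$1; work=$2; hmm=$3
    mkdir -p "$work" && cp "$gff" "$work/" && cd "$work" || exit 1
    maker2zff.pl "$(basename "$gff")" &&               # GFF3 -> ZFF + FASTA
    fathom -categorize 1000 output.ann output.dna &&   # filter the gene models
    fathom -export 1000 -plus uni.ann uni.dna &&       # capture flanking sequence
    forge export.ann export.dna &&                     # estimate parameters
    hmm-assembler.pl Pult . > "$hmm"                   # write the HMM into work_dir
)

# Example call for a third round (directory and file names are illustrative):
# train_snap pyu-contig.maker.output/pyu-contig_datastore/scf1117875582023/scf1117875582023.gff gff3 Pult3.hmm
```

Each step runs only if the previous one succeeded, so a failure part-way through does not leave you with an HMM built from stale intermediate files.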
Now that you have SNAP trained for the genome file, you can also combine the results from SNAP with another gene predictor like GeneMark. This is done by adding genemark to the list of predictors in the maker_opts.ctl file and providing MAKER with a genemark training file.
GeneMark is self-training, and the training file it produces has already been placed in the example directory for you. Unfortunately GeneMark will not run on the VMware image, so to see what the results should look like, open the finished example in the /home/gmod/Documents/Data/maker/example2_pyu/finished.maker.output directory.
You can load these results in Apollo. You will see instances where MAKER produces models based on GeneMark's gene predictions and others where models are based on SNAP's predictions.
MAKER Web Annotation Service
As you have all experienced with the previous examples, running programs on the command line can seem difficult. Many users might feel overwhelmed by trying to install and run a program like MAKER locally, especially if they are not very familiar with Linux. For those individuals, our lab has produced the MAKER Web Annotation Service (MWAS). MWAS is a website where you can run MAKER over the web without having to install any software locally, and you are provided with a much more user friendly interface for configuring MAKER and viewing results.
- Go to http://www.yandell-lab.org and select MWAS from the tabbed menu. You will see a link at the bottom of the page to access the MAKER Web Annotation Server. On the MWAS server page log in as a guest, then select 'New Job' from the top of the page.
Scrolling down the page, you should notice there are options to select the genome file, EST and protein evidence files, and choose ab initio gene predictors. At the top of the page select 'Example Jobs → D. melanogaster : Dpp' and click 'Load'.
Now if you scroll down, you should notice that the values for your genome, EST and protein files have been filled out for you. At the bottom of the page click Add Job to Queue. You will now be sent to the job status page.
You will need to click Refresh Job Status a couple of times until your job finishes. When your job is finished you will see an icon in the column marked Log. Click it. A window will come up displaying any errors that occurred for your job, so ideally this window will be blank. Next click on the View Results icon.
The results window will provide a brief summary of the status of each contig in your job, and will give you the opportunity to download the data, or view the results for individual contigs. Click on View in Apollo. This will open your data in Apollo (Ed Lee will describe just how launching Apollo over the web works during the Apollo section). Then close Apollo and click on SOBA statistics. This will open up a tool from the Sequence Ontology Consortium that provides simple summary statistics of features in a GFF3 file.
mRNAseq
mRNAseq is a high-throughput technique for sequencing the entire transcriptome, and it holds the promise of allowing researchers to identify all exons and alternative splice forms for every gene in the genome with a single experiment. It may soon make gene predictors (mostly) a thing of the past.
- Still need to de-convolute reads & evidence (for now)
- Still need to archive, manage, and distribute annotations
By mapping mRNAseq reads using programs like TopHat and CrossBow, you can create GFF3 files of read islands and junctions. This data can then be passed in as EST evidence and will be used for generating hint based gene prediction and for choosing final annotations.
Load example on MWAS site. http://derringer.genetics.utah.edu/cgi-bin/MWAS/maker.cgi
Merge/Resolve Legacy Annotations
- Many are no longer maintained by original creators
- In some cases more than one group has annotated the same genome, using very different procedures, even different assemblies
- Many investigators have their own genome-scale data and would like a private set of annotations that reflect these data
- There will be a need to revise, merge, evaluate, and verify legacy annotation sets in light of RNA-seq and other data
- Identify legacy annotation most consistent with new data
- Automatically revise it in light of new data
- If no existing annotation, create new one
Load example on MWAS class site. http://derringer.genetics.utah.edu/cgi-bin/MWAS/maker.cgi
MPI Support
MAKER optionally supports Message Passing Interface (MPI), a parallel computation communication protocol primarily used on computer clusters. This allows MAKER jobs to be broken up across multiple nodes/processors for increased performance and scalability.
To use this feature, you must have MPICH2 installed with the --enable-sharedlibs flag set during installation (see the MPICH2 Installer's Guide). I have installed this for you. So let's set up MPI_MAKER and run the example file that comes with MAKER.
cd ~/Documents/Software/maker/MPI
perl Install.PL
You should now see the executable mpi_maker listed among all the MAKER scripts (/maker/bin). Let's run some example data to see if MPI_MAKER is working properly.
cd ~
mkdir ~/maker_run2
cd maker_run2
cp ~/Documents/Software/maker/data/dpp_* ~/maker_run2
maker -CTL
gedit maker_opts.ctl
Set values in maker configuration files.
genome:dpp_contig.fasta
est:dpp_transcripts.fasta
protein:dpp_proteins.fasta
predictor:snap
snaphmm:fly
We need to set up a few more things for MPI to work. Type mpd to see a list of instructions.
You should see the following.
configuration file /home/gmod/mpd.conf not found
A file named .mpd.conf file must be present in the user's home
directory (/etc/mpd.conf if root) with read and write access
only for the user, and must contain at least a line with:
MPD_SECRETWORD=<secretword>
One way to safely create this file is to do the following:
cd $HOME
touch .mpd.conf
chmod 600 .mpd.conf
and then use an editor to insert a line like
MPD_SECRETWORD=mr45-j9z
into the file. (Of course use some other secret word than mr45-j9z.)
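The steps in that message can also be collected into a small helper, sketched below. The function name write_mpd_conf and its arguments are made up for this example; it does the same touch, chmod 600 and MPD_SECRETWORD line as the instructions, and it overwrites any existing .mpd.conf, so use it only when setting the file up for the first time.

```shell
# Sketch: create ~/.mpd.conf with owner-only access and a secret word.
# Runs in a subshell so the umask change does not leak into your session.
write_mpd_conf() (
    # usage: write_mpd_conf <secretword> [home_dir]
    secret=$1
    home=${2:-$HOME}
    conf="$home/.mpd.conf"
    umask 077                     # new files get no group/other permissions
    touch "$conf" &&
    chmod 600 "$conf" &&          # read and write access only for the user
    printf 'MPD_SECRETWORD=%s\n' "$secret" > "$conf"
)

# Example call (pick your own secret word):
# write_mpd_conf 'some-secret-word'
```

Setting the permissions before writing the secret means the word is never readable by other users, even briefly.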
Follow the instructions to set this file up, and start the mpi environment with mpdboot. Then run mpi_maker through the MPI manager mpiexec.
mpdboot
mpiexec -n 2 mpi_maker
mpiexec is a wrapper that handles the MPI environment. The -n 2 flag tells mpiexec to use 2 cpus/nodes when running mpi_maker. For a large cluster, this could be set to something like 100. You should now know how to start a MAKER job via MPI.
User Interface for Local MAKER Installation
The MWAS interface provides a very convenient method for running MAKER and viewing results; however, because compute resources are limited, users are only allowed to submit a maximum of 2 megabases of sequence per job. So while MWAS might be suitable for some analyses (e.g. annotating BACs and short preliminary assemblies), if you plan on annotating an entire genome you will need to install MAKER locally. But if you like the convenience of the MWAS user interface, you can optionally install the interface on top of a locally installed version of MAKER for use in your own lab.
First, under the maker directory there is a subdirectory called MWAS. MWAS contains all the files needed to build the MAKER web interface. The maker/MWAS/bin/mwas_server file is used to set up and run this web interface. Let's configure that now. There are three steps to setting up the server. First you must create and edit a server configuration file, then load all other configuration files, and then install all files to the appropriate web accessible directory.
cd /home/gmod/Documents/Software/maker/MWAS/
bin/mwas_server PREP
This will create a file in /maker/MWAS/config/ called server.ctl. We will need to edit this file before continuing.
apache_user:www-data
web_address:http://localhost
cgi_dir:/usr/lib/cgi-bin/maker
cgi_web:/cgi-bin/maker
html_dir:/var/www/maker
html_web:/maker
data_dir:/var/www/maker/data
use_login:0
JBROWSE_ROOT:/var/www/jbrowse/
GBROWSE_MASTER:/etc/gbrowse2/GBrowse.conf
Now we need to generate other settings that are dependent on the values in server.ctl.
Several new configuration files should now be loaded in the config/ directory. These new files define default MAKER options for the server and the location of files for the server dropdown menus.
maker_bopts.ctl
maker_exe.ctl
maker_opts.ctl
menus.ctl
We only need to edit one of these files to let MAKER know to use NCBI BLAST instead of WUBLAST.
Finally, let's copy all web related files to the appropriate web accessible directories. This must be done as root or using sudo.
sudo bin/mwas_server SETUP
If you set APOLLO_ROOT in the server.ctl file, then you can now set up a special Java Web Start version of Apollo to view results directly from the web interface. Web Start will be described in more detail in the Apollo session. This must be done as root or using sudo.
sudo bin/mwas_server APOLLO
In addition, if you have JBrowse and GBrowse 2 installed you can also use these to view results. To use these you must have JBROWSE_ROOT and GBROWSE_MASTER set in the server.ctl file (already set). Then we tell mwas_server to configure the programs.
sudo bin/mwas_server GBROWSE
sudo bin/mwas_server JBROWSE
We can now run MAKER examples using this web interface, but first we need to launch a server daemon to monitor for new job submissions.
sudo bin/mwas_server START
And then go to
Appendix: MAKER Accessory Scripts
MAKER comes with a number of accessory scripts that are meant to assist in manipulations of the MAKER input and output files.
- add_utr_gff.pl - Adds explicit 5' and 3' UTR features to the GFF3 output file
- add_utr_start_stop_gff - Adds explicit 5' and 3' UTR as well as start and stop codon features to the GFF3 output file
- fasta_merge - Collects all of MAKER's fasta file output for each contig and merges them to make genome level fastas
fasta_merge -d <datastore_index> -o <outfile>
- gff3_merge - Collects all of MAKER's GFF3 file output for each contig and merges them to make a single genome level GFF3
gff3_merge -d <datastore_index> -o <outfile>
- gff3_2_gtf - Converts MAKER GFF3 files to GTF format (run add_utr_start_stop_gff first to get UTR features)
- gff3_preds2models - Converts the gene prediction match/match_part format to annotation gene/mRNA/exon/CDS format
gff3_preds2models <gff3 file> <pred list>
- iprscan2gff3 - Takes InterProScan (iprscan) output and generates GFF3 features representing domains. Interesting tier for GBrowse.
iprscan2gff3 <iprscan_file> <gff3_fasta>
- iprscan_batch - Wrapper for iprscan to take advantage of multiprocessor systems.
iprscan_batch <file_name> <cpus> <log_file>
- ipr_update_gff - Takes InterProScan (iprscan) output and maps domain IDs and GO terms to the Dbxref and Ontology_term attributes in the GFF3 file.
ipr_update_gff <gff3_file> <iprscan_file>
- maker2zff.pl - Pulls out MAKER gene models from the MAKER GFF3 output and converts them into ZFF format for SNAP training.
- maker_functional_fasta - Maps putative functions identified from BLASTP against UniProt/SwissProt to the MAKER produced transcript and protein fasta files.
maker_functional_fasta <uniprot_fasta> <blast_output> <fasta1> <fasta2> <fasta3> ...
- maker_functional_gff - Maps putative functions identified from BLASTP against UniProt/SwissProt to the MAKER produced GFF3 files in the Note attribute.
maker_functional_gff <uniprot_fasta> <blast_output> <gff3_1>
- maker_map_ids - Build shorter IDs/Names for MAKER genes and transcripts following the NCBI suggested naming format.
maker_map_ids --prefix PYU1_ --justify 6 genome.all.gff > genome.all.id.map
- map_fasta_ids - Maps short IDs/Names to MAKER fasta files.
map_fasta_ids <map_file> <fasta_file>
- map_gff_ids - Maps short IDs/Names to MAKER GFF3 files; old IDs/Names are mapped to the Alias attribute.
map_gff_ids <map_file> <gff3_file>
- split_fasta - Splits multi-fasta files into the number of files specified by the user. Useful for breaking up MAKER jobs.
split_fasta [count] <input_fasta>