Apollo Tutorial

From GMOD
Revision as of 17:29, 28 September 2009 by Clements (Talk | contribs)

Jump to: navigation, search
Under Construction

This page or section is under construction.

{{{1}}}

{{#icon: 2009SummerSchoolAmericas170.png|2009 GMOD Summer School - Americas|120|2009 GMOD Summer School - Americas}}

{{#icon: GMOD2009Europe170.png|2009 GMOD Summer School - Europe|120|2009 GMOD Summer School - Europe}}
Apollo Session

2009 GMOD Summer School - Europe & Americas
July & August 2009
Ed Lee

__NOTITLE__


This tutorial walks you through how to use the Apollo genome annotation editor. This tutorial was originally taught by Ed Lee at the 2009 GMOD Summer School - Europe & Americas. The notes and VMware image used on this page are from the Europe course.



VMware

This tutorial was taught using a VMware system image as a starting point. If you want to start with that same system, download and install the Starting image.

See VMware for what software you need to use a VMware system image, and for directions on how to get the image setup and running on your machine.

Download
Starting Image

Ending Image


Username: gmod
Password: gmod

Caveats

Important Note

This tutorial describes the world as it existed on the day the tutorial was given. Please be aware that things like CPAN modules, Java libraries, and Linux packages change over time, and that the instructions in the tutorial will slowly drift over time. Newer versions of tutorials will be posted as they become available.

Introduction

Overview

Once we have a sequence assembled, we need to annotate that sequence, that is add features such as genes, pseudogenes, ncRNAs, etc. Otherwise we just have sequence and can't make much sense of the data. Computational analysis, such as Genscan, FGeneSH and tRNAscanSE are a great way to start the annotation process, as they help us localize regions of interest. However, these automated tools are far from perfect and the results often require manual updating from expert biologists. That is where Apollo comes in. Apollo is a sequence annotation editor and will allow you to create and edit annotations.

Annotation workflow

Architecture

Apollo is setup in a 3-tier architecture, with a presentation (GUI), logic and data layer. It is highly configurable, with most users configuring the presentation and data layers.

Apollo architecture

Presentation Layer

The presentation layer (GUI) handles displaying and gives the user an interface for creating and editing these annotations. Customization of this layer usually entails setting up how features are displayed (e.g., color, shape, labels).

Logic Layer

The logic layer handles how data is represented and the various operations you can perform on the data (e.g., creating, editing, adding information to annotations).

Data Layer

The data layer takes care of interfacing with the different data sources. Customization of the data layer usually entails setting up access to different databases (e.g., your own Chado instance) to even creating new adapters to read new data formats or schemas.

Installation

You can download Apollo from pre-built installer packages or getting the code from either SVN or tarball, both which require building the application.

Pre-built Installers

You can download OS-specific pre-built installer packages from the Apollo installer page. We provide the following installers:

Platform Optionally bundled JRE
Windows Yes
Mac OS X No
Linux< Yes
Solaris Yes (x86 version)
Unix No

Since we're installing it on our Linux virtual machine, we'll download the Linux version. Java has already been setup in these machines, so we'll get the installer without the bundled JRE. The bundled JRE option is a great solution for users who don't have Java installed or have multiple ones installed and are not sure how to select the correct one to be used. Download the Linux installer and save it to ~/to_be_installed.

The Linux installer is a shell script. If we want to install Apollo to the common application locations (such as /usr/local/bin), we'll need root access. We'll do everything from the command line since it's easier that way. Open up a terminal window and type the following:

<bash> $ cd to_be_installed $ sudo /bin/sh Apollo_unix.sh </bash>

We'll just install everything with the default options.

Checking Out the Code From SVN

You can also checkout the code from SourceForge SVN. You'll need an SVN client to do so. This will guarantee that you'll get the latest Apollo code. Note that you'll be getting the development version and as such might not be fully stable. The following commands apply to Unix based command line SVN clients for an anonymous checkout.

<bash> $ svn co https://gmod.svn.sourceforge.net/svnroot/gmod/apollo/trunk </bash>

If you're using an IDE, chances are that your IDE will have SVN support (or have a plugin available).

Getting the Source Tarball

You can also download the source tarball from SourceForge. You can get the source code for both the current and previous public releases. You won't get the latest development code as in SVN, but you should be getting code that is more stable.

Using Apollo

We'll start off with seeing some of the features that Apollo can do. We'll be connecting to our local [[Chado] instance. A customized Apollo Chado configuration has been setup for this. Don't worry, we'll cover the details on how we did that once we talk about setting up custom Chado configurations.

The following only applies to students who are having issues starting Fluxbox without root (sudo) privileges (apparently only an issue on Vista). For those students, we'll copy the custom configurations to the freshly installed Apollo. Type the following in the terminal (this will all make sense later on): <bash> $ cd ~/.apollo $ sudo cp gmod_summer_school.* chado-adapter.xml /usr/local/Apollo/conf </bash>

Let's first start by launching Apollo. Type the following in your terminal:

<bash> $ apollo </bash>

We'll see an option for which data we want to load. We'll choose Chado database as our data source. Since we still don't have annotations on our data, we can't use the gene option. Select contig in Select a region to display. Let's look at scf1117875581239 in the region between 102000 and 110000.

Chado adapter

Once loading is complete, we'll see the main Apollo window.

Apollo main window

The panels with the aqua background are for annotations and those with the black background are for computational results. The white box in the middle with the ruler represent the genomic region with the numbers being coordinates. The annotation and result panels above the genomic coordinate window are for the plus strand and the ones on the bottom are for the minus strand. The "Zoom" buttons will allow you to zoom in and out of the currently loaded region. The panels below provide information on the currently selected feature.

We only have results loaded and no annotations. So let's create a gene. We'll select maker-106017-102976. We want to create a transcript with all of the exons. So to select all of the exons, we just need to double left-click on one. You'll notice that all the exons have a red border around them.

Apollo with evidence model selected

Now that they're all selected, to create a new gene it's as easy as just dragging and dropping into the annotation panel. Voila! We have a new gene, GMOD:temp1, with transcript GMOD:temp1-transcript1.

Apollo with a single gene model

OK, let's try this again with another model. Let's do the same with fgenesh_masked-scf1117875581239-abinit-gene-1.48-mRNA-1 (man, that's a long name!). We can see that the transcript belongs to a new gene, GMOD:temp2. Makes sense, it's a obviously a separate gene.

Apollo with two gene models

But what if we were create a new feature from genemark_masked-scf1117875581239-abinit-gene-1.91-mRNA-1? Let's find out. Whoa! We can see that this new transcript was created as part of GMOD:temp2.

Apollo with splice variants

That's great, as it looks as it's a splice variant, rather than a whole new gene. So is this always the case? After all, there are overlapping genes, right? Apollo looks to see if there are any existing transcripts with in-frame overlaps to the new transcript. If that's the case, it is considered a splice variant. Otherwise, a new gene is created.

You'll notice that the newly created genes have an ID of the form GMOD:temp# and the transcripts have an ID of the form GMOD:temp#-transcript#. Apollo uses naming adapters to define how newly created features should be named. For example, FlyBase uses FBgn:temp# for genes and FBgn:temp#:chromosome#:chromosome_start-chromosome_end-R? where ? is A, B, C and so on.

Let's make sure that our Chado connectivity is working. Let's save our work using File → Save as....

File → Save as...

Make sure that Chado database is selected at the data source (should already be).

Chado save dialog

You'll notice that the IDs have changed. This is because the GMOD naming adapter follows the convention that all newly created features should have an ID of PREFIX:FEATURE_ID for a Chado database. This only occurs with newly created features. If you modify the ID of an existing feature and save that, this ID replacement will not take place.

Let's reload the data with File → Open new....

File → Open new...

Again, make sure that Chado database is the selected data source (should already be). You'll notice that all the information we put in when we first loaded from the database is already there. Apollo keeps a history of loading and saving so that you can easily access previous data sources. Click Ok. Great, the data is there!

Speaking of which, let's say that we actually want to change the gene ID to something more interesting. We can do so by selecting an exon in our feature, right clicking it, and choosing Annotation Info Editor... from the popup menu.

Annotation editor popup menu

We can see that we can add lots of interesting information for our gene and transcript.

Annotation editor

Let's go ahead and change the gene symbol to something else. We can see that this change affects all the transcripts. Just as we'd expect.

About the popup menu, you can see there are lot of things you can do your existing annotation. You can merge and split transcripts and exons, move exons from one transcript to another, and lots of other cool stuff. Let's take a look at merging exons. Select the 2 exons that you want to merge (hold down shift to allow you select multiple items), right click, and choose Merge exons.

Merge exons popup menu

Alright, it does what we'd expect it to do. Apollo-with-merged-exons.jpg But let's say that upon closer inspection, this is not what we wanted. Previously, you'd need to either manually split the exon again or remove the merged exon and then add both exons again. Kind of a pain, so traditionally users would save constantly so they could revert to a previous state. However, we have now added a much sought after feature, undo! So instead of doing all that work, we can undo our merge with Edit → Undo.

Edit → Undo

Wow, lookie here, it split the exons again. Although this looks to be a trivial operation, it's actually very complex, as one single change can lead to multiple cascading changes. In the case of of the merged exons, we can see that it changed the coding region frame for the downstream exons, thus affecting the CDS. So a single change caused other implicit changes to occur.

One cool recent addition to Apollo is the ability to do run remote analysis. We currently support BLAST and Primer-BLAST (primer identification tool) over at NCBI. Let's look at how the BLAST support works.

Select the first model we created (with ID GMOD:00014333 in this guide - the ID in your data might be different). Double-click on an exon to select the whole model. Right-click on the selected feature and choose Analyze region.

Analyze region popup menu

The Run analysis window will show up.

Run analysis

We see there is a tab for NCBI-BLAST and NCBI Primer-BLAST. We'll just run BLAST for now. We have a pull-down menu for BLAST type and can select blastn, blastx, and tblastx.

BLAST types

Let's run a blastn search. There are a number of options for running BLAST and post processing options. The post processing options are particularly useful as since we're searching against NCBI's nr database (which is very large), we'll get A LOT of results back. We'll check the following options:

  • Run options
    • Filter out low complexity sequence
    • Filter out masked sequence
  • Post processing options
    • Remove hits with an expect above threshold
    • Remove hits with a score below threshold
    • Remove HSPs with a percent identity below threshold

We can leave the default values for those options. Click Run to run the analysis. After a few seconds, a popup window will appear.

Analysis expected submission time

This gives us the estimated time before our analysis starts running (as estimated by the NCBI servers). Note that this is the estimated time for the analysis to start, NOT the expected time for the analysis to the completed. Checking for analysis completion all take place in the background, so you can feel free to continue working as usual. You will be notified when the analysis is complete.

Analysis complete

The new analysis will appear in the results panel and since we ran blastn against the nr database, the type for the result is blastn:nr.

blastn results

One last thing worth mentioning is the exon detail editor. It allows you to make edits to your models at the base level. We have also recently added the sequence aligner that allows you to make the same types of edits the exon detail editor support, but in reference to multiple alignment data. We'll come back and talk about the sequence aligner if we have time.

Unfortunately we don't have the time to go over all the sophisticated editing features for Apollo, but you can get more information on all the powerful editing features from the Apollo user's guide.

Configuring Apollo

Ok, now that we got some idea of what Apollo can do, let's talk about how to configure Apollo. First of all, be aware that all configuration files can live in two places:

  • The global Apollo configuration directory in $APOLLO_ROOT/conf where $APOLLO_ROOT is where Apollo was installed
  • User specific configurations, stored in ~/.apollo where ~ is the user home directory (different OS's handle it differently)

The configurations in the user directory take precedence over the global ones. Depending on the configuration, it will either fully overwrite the global configuration or just overwrite/append to the global one.

There are 3 sets of general configurations we'll discuss: apollo.cfg, data_source.style, data_source.tiers. You can check out the Apollo configuration section from the user guide for a more detailed description of the supported options.

apollo.cfg

This is the main Apollo configuration. Options are composed of columns delimited by white space, where the first column is the option parameter and the following columns are the specific options for the parameter. // is used for comments and everything following it (up the the new line) will be ignored. Out of all the options, the most interesting one is DataAdapterInstall, which is used to install data adapters for handling new types of data. We'll talk about it in more detail in the writing custom data adapters section. You can just add any new options or ones you wish to override in your custom apollo.cfg file. The global apollo.cfg options will be used for any options absent in your custom file.

data_source.style

Each data source has a style file associated with it. The style file contains options that are data source specific and should be shared amongst every feature. Like the apollo.cfg file, it is also composed of columns delimited by white space, where the first column is the option parameter and the following columns are the specific options for the parameter. // is also used for comments and everything following it (up to the new line) will be ignored. We've recently added a GUI for setting up the most common options. You can access it from Edit -> Preferences.

Edit -> Preference

Make sure that the Style tab is selected.

Style wizard

Be aware that the GUI only supports a subset of all the options supported. This was done as to not overwhelm users with overly complex GUIs. If you need to change anything that is not supported with the GUI, you'll need to do so by manually editing the file. Of particular interest is the Canned annotation/transcript comments section. It allows you to add predefined comments that users can add to their top level annotations and transcripts using the annotation info editor from a pull down menu.

data_source.tiers

Each data source has a tiers file associated with it. The tiers files contains options on how to display specific features. It has a completely different format than both apollo.cfg and data_source.style files. # is used for comments. A tiers file contains a set of Tier and Type records.

A tier record defines a set of feature types that will always be displayed together as a group. They will be displayed in the same row if possible when the features are expanded but as close together as possible if they overlap. A Tier record will look something like this:

[Tier]
tiername : Annotation
visible : true
expanded : true
maxrows : 0
labeled : true
curated : true
warnonedit : false

Following the Tier record is one or more Type records. A Type record specifies that different types that should appear in the given Tier. The Type record will look something like this:

[Type]
tiername : Gene Prediction
typename : Genscan
resulttype : genscan:dummy
resulttype : genscan
color : 204,153,255
usescore : true
minscore : - 1
maxscore : 50
glyph : DrawableResultFeatureSet
column : SCORE
column : GENOMIC_RANGE
column : query_frame
sortbycolumn : GENOMIC_RANGE
weburl : http://genes.mit.edu/GENSCAN.html#

Again, there are many options supported by the tiers file and it can get quite overwhelming. The current fly.tiers file is over 1500 lines long! Craziness. Luckily we've also recently added a GUI for setting the most useful options. You can access it by clicking Edit -> Preferences and selecting the Types tab.

Types wizard

If however you need to change something not supported by the GUI, you'll have to edit the file by hand. You can learn more about the configuration wizards in the Preferences section from the Apollo user guide.

Setting Up Custom Chado Configurations

Ok, so we connected to our local Chado instance before with an already existing configuration file. Now we're going to go into detail on how to set that up. The file that contains the Chado database configuration is chado-adapter.xml.

chado-adapter.xml

Like all other configuration files, it resides in $APOLLO_ROOT/conf for the global configuration and ~/.apollo for the user configurations. As you can guess from the file extension, this configuration is in XML format (nice how all the formats between the configurations are so consistent, huh? =P). It contains a <chado-adapter> root element, with at least one chadoInstance child element and at least one chadodb element. The skeleton for the XML file will look something like this:

<xml><?xml version="1.0" encoding="UTF-8"?> <chado-adapter>

 <chadoInstance>
   ...
 </chadoInstance>
  ...
 <chadodb>
   ...
 </chadodb>

</chado-adapter></xml>

chadoInstance Element

You'll need at least one chadoInstance element. It will look something like this:

<xml><chadoInstance id="gmodSummerSchoolInstance" default="true">

 <clsName>apollo.dataadapter.chado.jdbc.RiceChadoInstance</clsName>
 <sequenceTypes>
   <type>gene</type>
   <type>
     <name>contig</name>
     <useStartAndEnd>true</useStartAndEnd>
     <queryForValueList>true</queryForValueList>
     <isTopLevel>true</isTopLevel>
   </type>
 </sequenceTypes>
 <partOfCvTerm>part_of</partOfCvTerm>
 <featureCV>sequence</featureCV>
 <relationshipCV>relationship</relationshipCV>
 <propertyTypeCV>feature_property</propertyTypeCV>
 <genePredictionPrograms>
   <program>maker</program>
 </genePredictionPrograms>
 <searchHitPrograms>
   <program>blastn</program>
   <program>blastx</program>
   <program>tblastx</program>
   <program>est2genome</program>
   <program>protein2genome</program>
   <program>repeatmasker</program>
   <program>fgenesh</program>
   <program>fgenesh_masked</program>
   <program>genemark</program>
   <program>genemark_masked</program>
   <program>snap</program>
   <program>snap_masked</program>
 </searchHitPrograms>
 <searchHitsHaveFeatLocs>true</searchHitsHaveFeatLocs>
 <oneLevelAnnotTypes>
   <type>promoter</type>
   <type>transposable_element</type>
   <type>remark</type>
   <type>repeat_region</type>
 </oneLevelAnnotTypes>
 <threeLevelAnnotTypes>
   <type>gene</type>
   <type>pseudogene</type>
   <type>tRNA</type>
   <type>snRNA</type>
   <type>snoRNA</type>
   <type>ncRNA</type>
   <type>rRNA</type>
   <type>miRNA</type>
 </threeLevelAnnotTypes>

</chadoInstance></xml>

chadodb Element

You'll need at least one <chadodb> element. It contains information to connect to the database. Each <chadodb> element will have a <chadoInstance> associated with it. You'll need one <chadodb> element for each database you want to connect to (you can have multiple ones). The XML will look something like this:

<xml><chadodb>

 <name>GMOD Summer School</name>
 <adapter>apollo.dataadapter.chado.jdbc.PostgresChadoAdapter</adapter>
 <url>jdbc:postgresql://localhost:5432/chado</url>
 <dbName>chado</dbName>
 <dbUser>gmod</dbUser>
 <dbInstance>gmodSummerSchoolInstance</dbInstance>
 <style>gmod_summer_school.style</style>
 <default-command-line-db>true</default-command-line-db>

</chadodb></xml>

Setting Up a Custom WebStart Instance

One of the benefits of having Apollo as a Java application is that we can make use of Java WebStart. This is a great way to deploy Apollo with your custom modifications. If any modifications are made (either source code or configuration), it will be automatically deployed through WebStart. To setup our own WebStart instance, we'll need to compile the code ourselves. See the installation section on information on how to checkout the code.

This section talks about CVS, but GMOD has migrated to SVN since this tutorial was taught. The CVS commands shown here will still work, but they won't get you the latest and greatest version of Apollo. They will get you a version of Apollo as it existed when the CVS to Subversion Conversion was done.

We've already checked out the code from CVS in our virtual machines. The code is located in ~/software/apollo. The first thing we'll do is update the CVS to make sure that we have the most up to date code.

<bash> $ cd ~/software/apollo $ cvs update </bash>

Once the CVS update is done, we'll need to create our Apollo jar file for deployment. Before we do that, we want to make sure that our custom configurations are in the conf directory (we want it to be globally deployed, not locally). So let's copy our modified chado-adapter.xml and the style and tiers files to the conf directory.

<bash> $ cp ~/.apollo/chado-adapter.xml ~/.apollo/gmod_summer_school.* conf </bash>

Now we're ready to build our updated Apollo jar. We'll use Apache Ant to do so. ant is similar in many ways to make but has a lot of native support for Java. Like make, we can have multiple targets. We're interested in the jar target.

<bash> $ cd src/java $ ant jar </bash>

So traditionally, setting up a WebStart instance is quite a bit of work. Luckily, we have a very nice Perl script that does a lot of the magic for us! Before we can use this script, we'll need to look at the template XML file that is used for this script.

<xml><?xml version="1.0" encoding="UTF-8"?> <webstart>

 <jarsigner>
   <alias>apollo</alias>
   <keypass>apollo</keypass>
   <storepass>apollo</storepass>
   <keystore>apollo_store</keystore>
   <validity>700</validity>
   <commonName>GMOD Summer School 2009</commonName>
   <organizationUnit>GMOD Summer School 2009</organizationUnit>
   <organizationName>GMOD</organizationName>
   <localityName>Durham/Oxford</localityName>
   <stateName>NC/Oxford</stateName>
   <country>USA/UK</country>
 </jarsigner>
 <jnlp spec="1.0+">
   <information>
     <title>Apollo</title>
     <vendor>GMOD Summer School 2009</vendor>
     <description>Apollo Webstart</description>
     <homepage href="http://localhost/apollo" />
     <icon href="images/head-of-apollo.gif" kind="shortcut" />
     <shortcut online="true">
       <desktop />
     </shortcut>
     <offline-allowed />
   </information>
   <security>
     <all-permissions />
   </security>
   <resources>
     <j2se version="1.5+" initial-heap-size="64m" max-heap-size="500m" />
   </resources>
   <application-desc main-class="apollo.main.Apollo">
       <argument>-i</argument>
       <argument>chadodb</argument>
       <argument>-l</argument>
       <argument>scf1117875581239:102000-110000</argument>
   </application-desc>
 </jnlp>
 <webserver>
   <url>http://localhost/apollo/webstart</url>
   <jar_location>jars</jar_location>
 </webserver>

</webstart></xml>

The nice thing about this template is that you only need to set it up once (assuming you're not changing the URL or any other option).

The Apache web pages reside at /var/www. We'll create an Apollo directory. The directory is only writable by root, so we'll need to be root.

<bash> $ sudo -s $ cd /var/www $ mkdir -p apollo/webstart $ cd apollo/webstart </bash>

Let's create the file apollo_webstart.xml in the webstart directory we just created. Now we'll run the magical script.

<bash> $ ~/software/apollo/bin/webstart_generator.pl -i apollo_webstart.xml -d ~/software/apollo/jars -o apollo.jnlp -D jars </bash>

Voila, it was THAT easy. This script took care of signing all the jars and generating the appropriate jnlp file. Next time, when you make a change to one of the configurations, you'll just need to recompile the apollo.jar and re-run this script.

Lastly, we'll just create a simple web page to link to the Apollo WebStart instance.

<html>
  <body>
    <a href="apollo.jnlp">Launch Apollo!!!</a>
  </body>
</html>

One last note, you'll want to make sure that your web server has support for jnlp files. With our Apache install, you'll need to make sure that you have the following line in your /etc/mime.types file:

application/x-java-jnlp-file    jnlp

Writing Custom Data Adapters

There's a bit of work involved with writing a custom data adapter. It will require you to have knowledge of Java and the data model used in Apollo. We won't have enough time to write one ourselves, but we can briefly discuss how the process works.

All data adapters belong to the apollo.dataadapter package. To build our own custom data adapter, we need to implement two specific interfaces: ApolloDataAdapterI.java and org.bdgp.io.DataAdapterUI. There are abstract classes that implement common methods for both of those interfaces: AbstractApolloAdapter.java and org.bdgp.swing.AbstractDataAdapterUI respectively.

ApolloDataAdapterI does the work of parsing the input data. The most important method for the interface is getCurationSet() which does the work of returning the data to the logic layer.

ApolloDataAdapterGUI provides the GUI that we see in the Apollo: load data window. It implements the necessary interfaces and extends JPanel, so you can build your GUI directly in the class.

We'll take a look at how to get the sample adapter to work. The code for the sample adapter is located in apollo.dataadapter.sample package. Since we just finished building the jar, we have already compiled the code. So all we need to do is tell Apollo to load the data adapter plugin. We'll add the following to apollo.cfg:

DataAdapterInstall      "apollo.dataadapter.sample.SampleAdapter"   "gmod_summer_school.style"     "Sample data adapter"

The first column tells Apollo that we'll be loading a new data adapter. The second column points to the class for the data adapter. The third column tells which style to associate with this data. The fourth column provides the text that will be displayed in the data adapter pull down menu.

Now if we run Apollo, we'll see "Sample data adapter" as one of the options. Lovely!

There's a roughly written tutorial on how to create your own data adapter. It covers some information on the data adapter API and the data model. You'll want to read that if you're interested in creating one. It's located in $APOLLO_ROOT/doc/html/dataadapter_cookbook.html. You can also view the Apollo Javadoc API.