Difference between revisions of "InterMine"

m
(Loading Custom Data Sources)
Line 1: Line 1:
__NOTITLE__
+
{{SessionHead}}
 +
{| class="tutorialheader"
 +
| {{TutorialTitleLine|InterMine}}<br />
 +
[[2011 GMOD Spring Training]]<br />
 +
8-12 March 2011<br />
 +
Thursday 10th March <br />
 +
[[User:Alex@flymine.org|Alex Kalderimis]]
 +
|{{#icon: InterMineLogo.png|InterMine|300|gmod:InterMine}}
 +
|}
  
<center>{{#icon: InterMineLogo.png|InterMine||http://gmod.org/wiki/InterMine#Logo}}</center>
+
=OOOPS!=
 +
<div style="border:solid 1px;background-color:salmon">
 +
'''First things first:'''
 +
* '''I forgot to install a cpan module:'''
 +
<pre class="enter">
 +
sudo cpan Expect
 +
</pre>
 +
* '''You need to run a command:'''
 +
<pre class="enter">
 +
cp -r /home/gmod/Documents/Software/intermine/bio/sources/example-sources/malaria-gff  /home/gmod/Documents/Software/intermine/bio/sources/
 +
</pre>
 +
</div>
  
 +
=Introduction=
  
{{ComponentBox|{{InterMineResourcesBoxItem}}|<!--{{ComponentBoxEventsHeader}}|{{GMODAmericas2011BoxItem|2011 GMOD Spring Training|GMOD Spring Training|March 8-12}}-->| | | | |}}
+
[http://db.tt/yCxyJnx Intro Slides]
[http://www.intermine.org/ InterMine] makes it easy to integrate multiple data sources into a single data warehouse. It has a core data model based on the [http://sequenceontology.org Sequence Ontology] and supports several biological data formats; you just configure which organisms or data files are required. It is easy to extend the data model and integrate your own data: Java and Perl APIs and an [[Glossary#XML|XML]] format help import custom data. Currently supported formats include:
+
  
* [[Chado]], [[GFF3]], [[Glossary#FASTA|FASTA]], GO & gene association files, UniProt XML, PSI XML (protein interactions), InParanoid orthologs, Ensembl, Uniprot, and many others
+
InterMine is a project that aims to make creating, running, and maintaining massive data warehouses of integrated genomics data fast and flexible. It provides a back end database solution, a front end web-application, and a fully capable webservice API to access the data you host. InterMine already powers several websites, including [http://www.flymine.org FlyMine], [http://intermine.modencode.org/ modMine], [http://ratmine.mcw.edu/ RatMine], [http://yeastmine.yeastgenome.org YeastMine], and soon [http://www.metabolicmine.org/beta/begin.do metabolicMine] and [http://zmine.zfin.org/zfinmine/ ZFINmine] as well.
  
A web application allows creation of custom queries, includes template queries (web forms to run 'canned' queries) and can upload and operate on lists of data. It is possible to configure/create widgets to analyse lists with graphs and enrichment statistics. An admin user can publish new template queries, change report pages and create public lists at any time without any programming. Many aspects of the web app can be configured and branded.
+
InterMine is fundamentally data agnostic, and can host any data you like, but we have been funded to develop genomics tools, and you will find a wide range of utilities that make dealing with different biological sources of data easy. Dealing with the massive amount of data that genomics research produces is never really easy, but InterMine makes the straightforward simple, and the difficult possible.
  
== Chado Integration ==
+
=Overview=
  
We are developing a parser to load data from the GMOD [[Chado]] database schema into a companion InterMine warehouse. This will provide a web environment to perform rapid, complex queries on imported [[Chado]] data with minimal development effort. The eventual aim is to allow import of any [[Chado]] database with some configuration. The parser is currently used in FlyMine and the [http://www.modencode.org/ modENCODE] project to import genome annotation from [[:Category:FlyBase|FlyBase]] and [[:Category:WormBase|WormBase]] [[Chado]] databases. More information: http://www.intermine.org/wiki/ChadoDBSource
+
We aim to demonstrate three strengths of InterMine:
  
 +
#It's <del>effortless</del> straightforward to integrate data from different datasets (even your own data!) into one database.
 +
#Once you get your data into the database, you get a powerful, works-out-of-the-box webapp that makes it easy and fun to access your data.
 +
#Once you get your webapp up on the server, you get a sophisticated webservice that enables you or others to access the data via scripts, Java programs and other web-pages.
  
==Demo==
+
To do this, we will set up a stand-alone InterMine. This consists of a PostgreSQL database and a Java web-app sitting on top of it. Setting up your InterMine involves loading data into this database, and then mounting the web application in a running Tomcat instance.
  
[http://www.flymine.org FlyMine] is an example of InterMine in use.
+
=Loading Data Into Your Database=
  
 +
The database schema, and the Java classes that represent it, are generated from XML configuration files. To manage this, each mine has an associated Java project folder named after the mine. This project also contains the code that manages ''integration'', the procedure of loading data into the schema once it has been defined. The outline of this section is therefore:
 +
# Setting up the project structure
 +
# Configuring the data sources and the associated schema
 +
# Running the build process
  
==Requirements==
+
==The tutorial data==
 +
We will use the sample data set we distribute with our source. This is located at:
 +
~/Documents/Software/intermine/bio/tutorial/malariamine/malaria-data.tar.gz
  
InterMine is written in Java and uses the [http://www.postgresql.org/ PostgreSQL] database. The web application requires [http://tomcat.apache.org Apache Tomcat]. For more details on requirements see: http://www.intermine.org/wiki/Prerequisites
+
Place this data in the data directory, and extract it for use in the tutorial:
 +
<pre class="enter">
 +
mkdir ~/Documents/Data/intermine
 +
  cd ~/Documents/Data/intermine
 +
cp ~/Documents/Software/intermine/bio/tutorial/malariamine/malaria-data.tar.gz .
 +
  tar -zxvf malaria-data.tar.gz
 +
</pre>
  
 +
You should now have a directory of data available at <code>/home/gmod/Documents/Data/intermine/malaria</code>
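Before moving on, it can help to confirm that the archive unpacked as expected. The following is a minimal sketch, not part of InterMine: the function name <code>check_tutorial_data</code> is made up for illustration, and the subdirectory names are assumptions taken from the data locations used later in this tutorial.

```shell
# Hypothetical helper: report which expected tutorial data directories exist.
# The subdirectory names are assumptions based on the paths used in this tutorial.
check_tutorial_data() {
    base=$1
    rc=0
    for sub in uniprot genome/gff genome/fasta kegg; do
        if [ -d "$base/$sub" ]; then
            echo "ok      $sub"
        else
            echo "missing $sub"
            rc=1      # remember that at least one directory is absent
        fi
    done
    return "$rc"
}
```

For example, <code>check_tutorial_data ~/Documents/Data/intermine/malaria</code> prints one line per directory and returns a non-zero status if any of them is missing.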
  
==Documentation==
+
==The MineManager Graphical Installer==
 +
We are developing a graphical application to manage these stages, which we will use in this section of the workshop:
  
Documentation is available at: http://www.intermine.org/
+
The MineManager is located in our source tree at <code>SVN_ROOT/intermine/MineManager</code>.  
  
For a quick tour of the web application features go to [http://www.flymine.org FlyMine] and click 'Take a tour'.
+
===Running it from the command line===
 +
To run it open a terminal and type the command:
 +
<pre class="enter">
 +
  /home/gmod/Documents/Software/intermine/intermine/MineManager/run
 +
</pre>
  
=== Presentations ===
+
===Running it from a clickable launcher===
* [[:Image:IntermineGMOD2008.pdf‎|InterMine and Chado]] by Richard Smith at the [[July 2008 GMOD Meeting]].
+
If you would prefer a point and click interface, on standard Linux desktops, you can run the launcher installer to obtain a runnable icon:
* [[:Image:InterMine middleware.pdf|Presentation on InterMine]] by Gos Micklem at the [[GMOD Middleware]] meeting that happened at the [[January 2007 GMOD Meeting]].  This presentation is also available as a [[InterMine Presentation|wiki page]].
+
<pre class="enter">
 +
  /home/gmod/Documents/Software/intermine/intermine/MineManager/install_launcher
 +
</pre>
 +
You should then find a MineManager icon on your desktop, which you can double click to open the installer.
  
==Contact==
+
===Welcome Screen===
 +
You should see a window like this:
  
{{MailingListsFor|InterMine}}
+
[[image:MineManager-welcome.png]]
  
=== Some Interesting Email Threads ===
+
This installer will guide you through the install procedure in four steps, to the point of having a working database from which we can release a mine.
  
* {{NabbleThreadLink|Status-of-BioMart-integration-td1488484.html#a1488484|Status of BioMart integration}} - 2010/09,  [[User:Jogoodma|Josh Goodman]], [[User:Rsmith|Richard Smith]]
+
==Starting a new Mine==
 +
To start a new mine, enter a name in the box at the top and click on the '''save''' icon to the right of the text box. This will automatically open the next stage of the mine creation process.
  
==Downloads==
+
===Setting up the new Mine's properties===
 +
In order to proceed, we need to tell the installer where the InterMine source tree we are using is located. This is referred to here as '''InterMine Home'''. You can use the browse button on the top right of the Mine Information tab to select the appropriate directory:
 +
  /home/gmod/Documents/Software/intermine
  
All code is released under the LGPL license and is available in [http://www.intermine.org/wiki/SVNCheckout subversion].
+
[[image:MineManager-minesettings-info.png]]
  
== Logo ==
+
Once this form is completed and you have '''applied your changes''' you will be able to create your mine.
  
The [[:Image:InterMineLogo.png|InterMine logo]] was created by Sean O'Connell, a participant in the [[Spring 2010 Logo Program]], while a design student at [http://www.linn-benton.edu Linn-Benton Community College].
+
===The Mine project directory===
 +
Creating the mine runs the <code>make_mine</code> script, which sets up the Java project directories in the appropriate places, and then builds an initial version of the data model. The structure of the mine's project directory is:
 +
  SVN_ROOT/your_mine
 +
          |
 +
          + -- dbmodel/
 +
          + -- integrate/
 +
          + -- postprocess/
 +
          + -- webapp/
 +
          + -- project.xml
 +
          + -- default.intermine.webapp.properties
 +
          + -- default.intermine.integrate.properties
 +
The four subdirectories are separate Java projects that manage the different stages of building and running a mine, roughly in the order they appear.
  
[[Category:Database Tools]]
+
==Adding Sources to a Project==
[[Category:InterMine]]
+
The next section of the MineManager handles adding sources to a project:
[[Category:GMOD Components]]
+
 
[[Category:Java]]
+
A source here refers to the combination of a datasource and a parser that reads the data into the database. We supply a [http://intermine.org/wiki/BioSources large number of parsers] with our source code for reading in data from common biological formats (including [[Chado]], [[GFF3]]), and we supply the tools for writing your own parsers for datasources we don't support out of the box.
 +
 
 +
[[image:InterMine-dataparsing.png|700px]]
 +
 
 +
[[image:MineManager-sourcesettings-empty.png]]
 +
 
 +
===Importing Protein Data From Uniprot===
 +
 
 +
Click on '''add source''', and then select '''uniprot''' as the source type from the drop-down list. (You can choose to name each source, but in this case ''uniprot'' is fine.)
 +
 
 +
Once you add a source the section on the right will load up its specific configuration options.
 +
 
 +
[[image:MineManager-sourcesettings-uniprot.png]]
 +
 
 +
For uniprot the appropriate settings are:
 +
{|class="wikitable"
 +
!Field
 +
!Value
 +
|-
 +
|List of Organisms
 +
|36329
 +
|-
 +
|Create protein domains
 +
|TICKED
 +
|-
 +
|Create GO terms
 +
|UNTICKED
 +
|-
 +
|Location of data directory
 +
|/home/gmod/Documents/Data/intermine/malaria/uniprot
 +
|}
 +
 
 +
To save this configuration so it is used in the build, select '''save sources''' from the bottom right, or when prompted.
 +
 
 +
====Side Note: Model Additions====
 +
The Uniprot configuration section has a second tab named ''Source Model Additions'', which specifies the additions to the data model that a particular source brings with it. The uniprot source adds the following classes:
 +
* Component
 +
* UniprotFeature
 +
 
 +
And adds fields to the following classes:
 +
* Gene
 +
* GoAnnotation
 +
* Protein
 +
* ProteinDomain
 +
 
 +
If you select '''model''' &rarr; '''view model''' from the menu you can see how these classes and fields are integrated into the data model.
 +
 
 +
===The GFF3 source===
 +
InterMine includes a parser to load valid [[GFF3]] files. The creation of features, sequence features (usually chromosomes), locations and standard attributes is taken care of automatically.
 +
 
 +
The files we are loading are from PlasmoDB and contain gene, exon and mRNA features; there is one file per chromosome. Look at an example:
 +
<pre class="enter">
 +
head ~/Documents/Data/intermine/malaria/genome/gff/MAL1.gff3
 +
</pre>
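To get a quick overview of what a GFF3 file contains before loading it, you can tally its features by type (the third tab-separated column). This is a minimal sketch rather than an InterMine tool; the function name <code>count_gff3_types</code> is made up for illustration.

```shell
# Hypothetical helper: count GFF3 features by type (column 3).
count_gff3_types() {
    # Skip comment and directive lines (starting with '#') and any line
    # that does not have the nine tab-separated GFF3 columns.
    awk -F'\t' '!/^#/ && NF >= 9 { count[$3]++ }
                END { for (t in count) print t "\t" count[t] }' "$1" | sort
}
```

For example, <code>count_gff3_types ~/Documents/Data/intermine/malaria/genome/gff/MAL1.gff3</code> prints one line per feature type with its count.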
 +
 
 +
To add the GFF3 source to your MalariaMine:
 +
#Select the '''Add source''' option on the Sources menu.
 +
#Choose '''malaria-gff'''
 +
#Click the '''add''' button and '''save''' your sources.
 +
 
 +
The properties set for malaria-gff are:
 +
{|class="wikitable"
 +
! Field
 +
! Value
 +
! Notes
 +
|-
 +
|gff3.seqClsName
 +
|Chromosome
 +
|the ids in the first column represent Chromosome objects, e.g. MAL1
 +
|-
 +
|gff3.taxonId
 +
|36329
 +
|taxon id of malaria
 +
|-
 +
|gff3.dataSourceName
 +
|PlasmoDB
 +
|the data source for features and their identifiers; this is used for the DataSet (evidence) and synonyms.
 +
|-
 +
|gff3.seqDataSourceName
 +
|PlasmoDB
 +
|the source of the seqids (chromosomes) is sometimes different to the features described
 +
|-
 +
|gff3.dataSetTitle
 +
|PlasmoDB P. falciparum genome
 +
|a DataSet object is created as evidence for the features, it is linked to a DataSource (PlasmoDB)
 +
|-
 +
|Location of Data Directory
 +
|/home/gmod/Documents/Data/intermine/malaria/genome/gff
 +
|Where we unpacked the data to
 +
|}
 +
 
 +
===FASTA files===
 +
FASTA is a minimal format for representing sequence data. Each record consists of a header line, which starts with <code>></code> and carries identifier information, followed by the sequence. At present the InterMine FASTA parser loads just the first entry in the header after <code>></code> and assigns it as an attribute of the feature created. Here we will load one FASTA file for each malaria chromosome. Look at an example of the files we will load:
 +
<pre class="enter">
 +
head ~/Documents/Data/intermine/malaria/genome/fasta/MAL1.fasta
 +
</pre>
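To see which identifier would be taken from each header, you can list the first entry after <code>></code> for every record. This is a rough sketch that assumes header entries are separated by whitespace; the function name <code>fasta_ids</code> is made up for illustration and is not part of InterMine.

```shell
# Hypothetical helper: print the first whitespace-delimited entry of each
# FASTA header line, with the leading '>' removed.
fasta_ids() {
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1"
}
```

For example, <code>fasta_ids ~/Documents/Data/intermine/malaria/genome/fasta/MAL1.fasta</code> prints one identifier per sequence in the file.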
 +
 
 +
Add a fasta source to your Mine by following these steps:
 +
#Select '''Add source''' option from the Sources menu
 +
#Select the '''fasta''' type and name your source '''malaria-chromosome-fasta'''. <span style="color:grey">''Note: you must use this name, as a set of integration keys with this name is predefined for you. We will deal with keys in more detail in the custom source section.''</span>
 +
#Click the '''add''' button and '''save''' your changes.
 +
 
 +
The following properties should be defined for malaria-chromosome-fasta:
 +
{|class="wikitable"
 +
! Field
 +
! Value
 +
! Notes
 +
|-
 +
|FASTA Class Name
 +
|org.intermine.model.bio.Chromosome
 +
|the type of feature that each sequence is for
 +
|-
 +
|fasta.dataSourceName
 +
|PlasmoDB
 +
|the source of identifiers to be created
 +
|-
 +
|Dataset Name
 +
|PlasmoDB chromosome sequence
 +
|a DataSet object is created as evidence
 +
|-
 +
|Taxon ID
 +
|36329
 +
|the organism id for Plasmodium falciparum
 +
|-
 +
|Location of Data Directory
 +
|/home/gmod/Documents/Data/intermine/malaria/genome/fasta
 +
|Where we unpacked the data to before
 +
|}
 +
 
 +
===Entrez Organism===
 +
 
 +
Some sources depend on other sources, usually in order to complete the database with fields that can be derived or fetched in some way. '''Entrez Organism''' is one of these: it fetches organism names from Entrez. Add this source to the mine; it does not need any special configuration.
 +
 
 +
===Loading Custom Data Sources===
 +
 
 +
As well as the standard data loaders that ship with the InterMine source tree, we supply a tool-chain for building your own data loaders for any custom data source. There are APIs for this tool-chain in Java and Perl. The next section will walk us through loading a custom data source by using the Perl data loading API.
 +
 
 +
====Tool-chain details====
 +
* '''Java''' - data loaders are written by sub-classing one of a set of basic data loading classes (eg. <code>org.intermine.bio.dataconversion.BioFileConverter</code>), which provide a scaffold.
 +
* '''Perl''' - data is loaded in a two stage process, by first using a set of Perl modules to convert your data into our [http://intermine.org/wiki/ItemsXmlFormat XML format], which is then loaded into the database using a core dataloader.
 +
 
 +
====Installing the Perl tool-chain====
 +
The Perl modules are located in our source tree at:
 +
  SVN_ROOT/intermine/perl/InterMine-Item
 +
 
 +
and can be installed in the standard manner with the commands:
 +
<pre class="enter">
 +
  perl Build.PL
 +
  sudo ./Build installdeps   # only if you don't have the prerequisites
 +
  ./Build test   # optional
 +
  sudo ./Build install
 +
</pre>
 +
 
 +
Alternatively, the entire procedure above can be automated with your preferred CPAN client by installing <code>InterMine::Item</code>, e.g.:
 +
<pre class="enter">
 +
  cpan InterMine::Item
 +
</pre>
 +
 
 +
====Our example dataset====
 +
First let's look at the data we will be adding to the database. In this tutorial we will use data from the [http://www.genome.jp/kegg/pathway.html KEGG pathway database]. In their words:
 +
 
 +
<blockquote style="color:grey">"''KEGG PATHWAY mapping is the process to map molecular datasets, especially large-scale datasets in genomics, transcriptomics, proteomics, and metabolomics, to the KEGG pathway maps for biological interpretation of higher-level systemic functions.''"</blockquote>
 +
 
 +
Specifically, the data we have provides mappings between genes and KEGG pathways. It takes the form of two files in '''/home/gmod/Documents/Data/intermine/malaria/kegg'''; look at these now.
 +
 
 +
* <code>pfa_gene_map.tab</code> - this has two tab delimited columns:
 +
** the first is the identifier of a malaria gene; note that these are the same ids we have used for <code>Gene.primaryIdentifier</code> in other sources.
 +
** the second is a space separated list of KEGG pathway ids that the gene is involved in
 +
* <code>map_title.tab</code> - also has two tab delimited columns:
 +
** the first is a KEGG pathway identifier
 +
** the second is the descriptive name of the pathway
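To make the layout of the gene mapping file concrete, the following sketch expands it into one gene/pathway pair per line. It is only an illustration of the file format described above; the function name <code>gene_pathway_pairs</code> is made up and is not part of InterMine.

```shell
# Hypothetical helper: expand "gene<TAB>id1 id2 ..." lines into
# one "gene<TAB>id" pair per line.
gene_pathway_pairs() {
    awk -F'\t' 'NF >= 2 {
        n = split($2, ids, " ")          # pathway ids are space separated
        for (i = 1; i <= n; i++) print $1 "\t" ids[i]
    }' "$1"
}
```

Running it on <code>pfa_gene_map.tab</code> shows every link between a gene and a pathway that the file describes.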
 +
 
 +
====Our parsing strategy====
 +
 
 +
We will:
 +
#Create data items for the data source, data set, and organism
 +
#Read in the pathways file
 +
##Create a data item for each pathway in the file
 +
##Remember which item was made for each id
 +
#Read in the gene mappings file
 +
##Create a data item for each gene in the file, linked to the pathway items made earlier
 +
 
 +
 
 +
====Adding The Source====
 +
 
 +
There are a couple of custom source types; since we will be using the Perl tool-chain, add a new source of the type '''intermine-items-xml''', and set the name to something sensible, such as '''kegg-pathways'''.
 +
 
 +
====An Example Implementation of this Strategy====
 +
 
 +
Click on '''open parser to edit''' and paste in the script below:
 +
 
 +
<perl>#!/usr/bin/perl
 +
 
 +
use warnings;
 +
use strict;
 +
use InterMine::Model;
 +
use InterMine::Item::Document;
 +
 
 +
@ARGV == 4 or die "Bad arguments: we need four arguments\n$0 model-file output-file pathways-file gene-mappings-file\n";
 +
 
 +
my ( $model_file, $out_file, $pathway_file, $gene_mappings_file ) = @ARGV;
 +
 
 +
# Create the writing apparatus
 +
my $model = InterMine::Model->new( file => $model_file );
 +
my $document = InterMine::Item::Document->new(
 +
    model      => $model,
 +
    output    => $out_file,
 +
    auto_write => 1,
 +
);
 +
 
 +
my $data_source = 'Kegg';
 +
my $taxon_id = 36329;
 +
my %pathway_with;
 +
 
 +
# Create data items for the data source, data set, and organism
 +
 
 +
my $datasource_item = $document->add_item(
 +
    'DataSource',
 +
    'name' => $data_source,
 +
);
 +
 
 +
my $dataset_item = $document->add_item(
 +
    'DataSet',
 +
    name      => $data_source . ' data set for taxon id: ' . $taxon_id,
 +
    dataSource => $datasource_item,
 +
);
 +
 
 +
my $org_item = $document->add_item(
 +
    'Organism',
 +
    taxonId  => $taxon_id,
 +
);
 +
 
 +
# Read in the pathways file
 +
open(my $pathways, '<', $pathway_file) or die "Could not open $pathway_file, $!";
 +
while (<$pathways>) {    # read line by line instead of loading the whole file
 +
    chomp;
 +
    my ($id, $title) = split(/\t/);
 +
 
 +
    ## Create a data item for each pathway in the file
 +
    ## Remember which item was made for each id
 +
    $pathway_with{$id} = $document->add_item(
 +
        'Pathway',
 +
        identifier => $id,
 +
        name      => $title,
 +
    );
 +
}
 +
close $pathways or die "Could not close $pathway_file, $!";
 +
 
 +
# Read in the gene mappings file
 +
open(my $gene_mappings, '<', $gene_mappings_file) or die "Couldn't open $gene_mappings_file, $!";
 +
while (<$gene_mappings>) {    # read line by line instead of loading the whole file
 +
    chomp;
 +
    my ($gene_id, $pathway_string) = split(/\t/);
 +
    my @pathway_ids = split(' ', $pathway_string);    # split on any run of whitespace
 +
    my $pathway_items = [@pathway_with{@pathway_ids}];
 +
 
 +
    ## Create a data item for each gene in the file, linked to the pathway items made earlier
 +
    $document->add_item('Gene',
 +
      primaryIdentifier => $gene_id,
 +
      organism          => $org_item,
 +
      pathways          => $pathway_items,
 +
      dataSets          => [$dataset_item],
 +
    );
 +
}
 +
close $gene_mappings or die "Could not close $gene_mappings_file, $!";
 +
 
 +
# Close the document
 +
$document->close();
 +
 
 +
exit;</perl>
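In the script above, the hash slice <code>@pathway_with{@pathway_ids}</code> yields an undefined value for any pathway id that appears in the gene mappings file but not in the pathways file. A quick check of the two input files before running the parser can catch this. The following is a sketch only; the function name <code>missing_pathway_ids</code> is made up for illustration.

```shell
# Hypothetical helper: list pathway ids referenced in the gene mappings file
# ($2) that have no entry in the pathway titles file ($1).
missing_pathway_ids() {
    awk -F'\t' '
        NR == FNR { known[$1] = 1; next }   # first file: remember known ids
        {
            n = split($2, ids, " ")         # second file: check each referenced id
            for (i = 1; i <= n; i++)
                if (!(ids[i] in known)) print ids[i]
        }' "$1" "$2" | sort -u
}
```

Call it as <code>missing_pathway_ids map_title.tab pfa_gene_map.tab</code>; empty output means every referenced pathway has a title.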
 +
 
 +
====Additions====
 +
 
 +
Our Model currently has no class "Pathway" (as you can confirm by browsing the model). We need to add it, and the Gene &harr; Pathway link. We can do this by using the '''Source Model Additions''' editor.
 +
 
 +
#Add a new class and name it '''Pathway'''
 +
#Add an attribute in this class and name it '''name''', with the type set to '''String'''
 +
#Add an attribute in this class and name it '''identifier''', with the type set to '''String'''
 +
#Add a collection in this class and name it '''genes''', with the type set to '''Gene''' and a reverse reference named '''pathways'''
 +
#Click '''yes''' when asked if you want to make the reverse reference
 +
#Change the field-type of the new reverse reference in the Gene class to '''collection'''
 +
#Click '''no''' when asked if you want to make the reverse reference.
 +
 
 +
You should end up with:
 +
 
 +
=====Gene=====
 +
{| class="wikitable"
 +
!FieldType
 +
!Name
 +
!Type
 +
!Reverse-Reference
 +
|-
 +
|Collection
 +
|pathways
 +
|Pathway
 +
|genes
 +
|}
 +
 
 +
=====Pathway=====
 +
{| class="wikitable"
 +
!FieldType
 +
!Name
 +
!Type
 +
!Reverse-Reference
 +
|-
 +
|Attribute
 +
|name
 +
|String
 +
| -
 +
|-
 +
|Attribute
 +
|identifier
 +
|String
 +
| -
 +
|-
 +
|Collection
 +
|genes
 +
|Gene
 +
|pathways
 +
|}
 +
 
 +
Once these are added, if you reload the model, you should find the new Pathway class as part of the model.
 +
 
 +
====Dealing With Integration====
 +
 
 +
As we are adding gene data from this source in addition to the other genes already in the database, we need to make sure they play nicely together. We do this by setting up "integration keys" that tell the integration process how to identify when we are adding details about an object we already have in the database, rather than adding an entirely new one.
 +
 
 +
To do this, on the '''source properties''' tab of the source details panel, click on '''Open keys file''' to edit the integration keys.
 +
 
 +
We already have keys defined for DataSet and DataSource: we only need to add the following line:
 +
  Gene.key = primaryIdentifier
 +
 
 +
We do not need to add a key for Pathway, as we are not adding pathways data from any other source.
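The idea behind integration keys can be illustrated with a toy model: records that share a key value are merged into one object, while records with a new key value create a new object. The sketch below works on plain text files of the form <code>key&lt;TAB&gt;field=value</code> and is not how InterMine is implemented; the function name <code>merge_by_key</code> and the record format are assumptions made for illustration.

```shell
# Toy model of key-based integration (not InterMine code).
# Input lines: "key<TAB>field=value". Lines sharing a key are merged.
merge_by_key() {
    awk -F'\t' '
        !($1 in merged) { order[++n] = $1; merged[$1] = $2; next }  # new object
        { merged[$1] = merged[$1] ";" $2 }                          # existing object: add details
        END { for (i = 1; i <= n; i++) print order[i] "\t" merged[order[i]] }
    ' "$@"
}
```

Given one file that describes a gene's symbol and another that describes its pathways under the same identifier, the output contains a single line for that gene carrying both fields.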
 +
 
 +
====Generating the XML====
 +
 
 +
Now we are ready to generate the XML using our parser. First we need to generate the model:
 +
<pre class="enter">
 +
  cd ~/Documents/Software/intermine/malariamine/dbmodel
 +
  ant build-db
 +
</pre>
 +
 
 +
And now we can run our parser and generate XML
 +
<pre class="enter">
 +
  perl ~/Documents/Software/intermine/bio/sources/kegg-pathways/kegg-pathways_parser.pl \
 +
    ~/Documents/Software/intermine/malariamine/dbmodel/build/model/genomic_model.xml \
 +
    ~/Documents/Data/intermine/malaria/kegg/pathways.xml \
 +
    ~/Documents/Data/intermine/malaria/kegg/map_title.tab \
 +
    ~/Documents/Data/intermine/malaria/kegg/pfa_gene_map.tab
 +
</pre>
 +
 
 +
Finally, tell our mine where the data is by setting the Data file location in the ''kegg-pathways'' source properties section to '''/home/gmod/Documents/Data/intermine/malaria/kegg/pathways.xml'''.
 +
 
 +
==Running a Build==
 +
 
 +
The '''build''' section of the MineManager runs the build process (a front-end for our <code>project-build</code> script).
 +
 
 +
[[image:MineManager-buildscreen.png]]
 +
 
 +
If you click build and there are problems with your configuration that would prevent a successful build, the MineManager will catch that and tell you:
 +
 
 +
[[image:MineManager-cannotbuild.png]]
 +
 
 +
In this case we would need to go back to the database options in the first section ('''Mine Settings''' &rarr; '''Database''' &rarr; '''create databases''').
 +
 
 +
After a successful build, you will see a summary of the time taken at each stage:
 +
 
 +
[[image:MineManager-built.png]]
 +
 
 +
The sources we have set up above should take about 650 seconds (give or take) to integrate into the database.
 +
&there4; time for a break!
 +
 
 +
=Deployment=
 +
Once you have read access to a production database, you can build and release a web application against it.
 +
 
 +
==Configuration==
 +
If you haven't already, use the MineManager to configure the Tomcat properties ('''Mine Settings''' &rarr; '''Web Settings'''):
 +
 
 +
Use these settings for the tutorial (<span style="color:grey">''the Tomcat settings refer to a preconfigured Tomcat role''</span>).
 +
{|class="wikitable"
 +
! Field
 +
! Value
 +
! Notes
 +
|-
 +
|tomcat username
 +
|manager
 +
|The name of a tomcat administrator
 +
|-
 +
|tomcat password
 +
|manager
 +
|the password for the tomcat administrator
 +
|-
 +
|superuser username
 +
|'''choose a name'''
 +
|The name for the webapp administrator
 +
|-
 +
|superuser password
 +
|'''choose a password'''
 +
|The password for the webapp administrator
 +
|}
 +
 
 +
==UserProfile Initialisation==
 +
In addition to the ObjectStore DB which contains your data, there is a separate database which holds user information (accounts, saved preferences, query history, lists, templates, etc) and general webapp configuration (which technically all belongs to the superuser).
 +
 
 +
Since this mine is new, we need to build a new one (we will only ever need to do this '''once''' - repeating this step at a later date will <span style="color:red">'''delete all your users' data'''</span>).
 +
 
 +
To build the database:
 +
 
 +
<span style="color:red">NOTE: This command will delete any data in the userprofile database.</span>
 +
<pre class="enter">
 +
cd ~/Documents/Software/intermine/malariamine/webapp
 +
ant build-db-userprofile
 +
</pre>
 +
This command creates the SuperUser account and loads the <tt>default-template-queries.xml</tt> file.
 +
 
 +
==Starting the Tomcat Webserver==
 +
Tomcat is the webserver we use to serve InterMine webapps. Start Tomcat with this command:
 +
<pre class="enter">
 +
cd ~/Documents/Software/tomcat6
 +
bin/startup.sh
 +
</pre>
 +
 
 +
Visit the Tomcat manager at http://localhost:8080/. The username and password required to access the manager are '''manager''' and '''manager'''.
 +
 
 +
==Deploying the Webapp to Tomcat==
 +
 
 +
Run the following command to release your webapp:
 +
<pre class="enter">
 +
cd ~/Documents/Software/intermine/malariamine/webapp
 +
ant clean default remove-webapp release-webapp
 +
</pre>
 +
This will fetch the model from the database, generate the model Java code, and then remove and release the webapp. The default target forces a rebuild of the .war file. (The clean is not always necessary, but it doesn't hurt to include it, and remove-webapp is only really required when you have released the webapp before.)
 +
 
 +
Visit your newly minted mine: http://localhost:8080/malariamine
 +
 
 +
=Accessing Your Data Through the Webapp=
 +
 
 +
In this section we will look at how you can examine, analyse and aggregate your data in the webapp, looking both at the webapp you have built, and FlyMine.
 +
 
 +
==Single Objects (Report Pages)==
 +
 
 +
Each object in the database (each Gene, Chromosome, Exon, Protein, etc.) has a report page that can display:
 +
*The properties of the object
 +
*Links to other objects this object references
 +
*Widgets that display data about the object (GBrowse/Cytoscape)
 +
*Links to sites that contain information about the objects
 +
*Homologues of the object in other organisms/mines
 +
*Templates that you can run on the given object, and the number of results you can expect.
 +
 
 +
In the top right there is a search box which uses the Lucene quick-search. Enter '''ald''' to find the ''Aldolase'' gene.
 +
 
 +
[[image:report-page-props.png]]
 +
 
 +
[[image:report-page-templates.png|700px]]
 +
 
 +
==Multiple Objects (Lists)==
 +
 
 +
Lists of Objects of any type can be made and explored. The pages that display data on these lists are called "''List Report Pages''", and can display:
 +
*The properties of the objects in the list
 +
*Links to tools in other sites
 +
*Tools that convert a list into:
 +
**A list of a different type: (eg. ''gene'' &rarr; ''exon'')
 +
**A list of orthologues
 +
*Tools that aggregate data over the list:
 +
**Enrichment
 +
**Distribution
 +
**Expression
 +
**Localisation
 +
*Queries run on all objects in the list
 +
 
 +
Click the '''Lists''' tab to see the lists section. Here you can either:
 +
* '''Upload''' a list of identifiers to create a new list
 +
* '''View''' an existing list.
 +
 
 +
On the list view page you can perform set logic on lists of objects (intersection/union/etc) to derive new lists.
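For readers unfamiliar with set logic, the same operations can be tried on plain text files of identifiers, one per line. This sketch only illustrates what intersection and union mean; it does not talk to the webapp, and the function names are made up for illustration.

```shell
# Union: every identifier that appears in either file, without duplicates.
list_union() {
    sort -u "$1" "$2"
}

# Intersection: identifiers that appear in both files.
list_intersection() {
    left_sorted=$(mktemp)
    right_sorted=$(mktemp)
    sort -u "$1" > "$left_sorted"
    sort -u "$2" > "$right_sorted"
    comm -12 "$left_sorted" "$right_sorted"   # comm needs sorted input; -12 keeps common lines
    rm -f "$left_sorted" "$right_sorted"
}
```

For example, if one file lists genes from one pathway and a second file lists genes from another, <code>list_intersection</code> prints the genes involved in both.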
 +
 
 +
[[image:list-view-page.png]]
 +
 
 +
From the ''view'' sub-tab, select a list you think looks interesting:
 +
 
 +
[[image:list-analysis-page.png]]
 +
 
 +
==Exporting Data and Summarising Columns==
 +
 
 +
When viewing lists or Query results these two actions are always available:
 +
 
 +
===Exporting Data===
 +
 
 +
Data can be exported from Lists and Query results in a number of formats:
 +
* Flat file formats (TSV/CSV)
 +
* Excel .xls format
 +
* GFF3
 +
* Fasta
 +
* Galaxy (for use in a workflow)
 +
 
 +
The export link is always in the top left of the page:
 +
 
 +
[[image:export-options.png]]
 +
 
 +
===Viewing Column Summaries===
 +
 
 +
Each column header also has a summary symbol (&Sigma;) which helps you get an overview of the data contained in that column:
 +
 
 +
[[image:column-header.png]]
 +
 
 +
==Running Queries==
 +
 
 +
Queries in the webapp are created and run using the ''QueryBuilder'' interface, which helps you build queries using the data model as a guide.
 +
 
 +
#Click on the '''QueryBuilder''' tab<br/>[[image:query-builder-tab.png]]
 +
#Select '''gene''' as the type of object we want to query for<br/>[[image:query-select-gene.png]]
 +
#Click '''summary''' next to '''gene''' in the Model browser
 +
#Scroll down to pathways
 +
##Expand the pathways collection
 +
##Click '''constrain''' next to '''name'''
 +
##Type '''p''' into the value box in the pop-up
 +
##Select '''Pentose Phosphate Pathway''' from the autocomplete drop-down<br/>[[image:query-built-query.png]]
 +
#Select '''Show results'''
 +
 
 +
You should see results like this:
 +
[[image:query-results.png]]
 +
 
 +
==Making Lists from Query Results==
 +
 
 +
The query results page provides links to the report pages of individual objects, and we can create lists of the objects returned. To create a list, click the check-box in the header of the column containing the type of object you want to make into a list, here, any of the gene fields:
 +
 
 +
[[image:making-a-list1.png]]
 +
 
 +
Then name your list:
 +
 
 +
[[image:making-a-list2.png]]
 +
 
 +
And you're done.
 +
 
 +
==Running Templates==
 +
 
 +
Templates are queries that have been pre-written and saved for later re-use, either by and for a single user, or for all the users of the mine. Rather than running the same query over and over, they allow for their parameters to be changed, and they automatically present a simple web-form interface.
 +
 
 +
When we built our mine we included a number of default templates, and large mines such as FlyMine have many more. Click on the '''Templates''' tab to see what templates are available (you can type a name, or part of a name, into the box at the top to filter the list of templates):
 +
 
 +
[[image:flymine-templates.png]]
 +
 
 +
Select the '''Pathway &rarr; Genes''' template to see how the template interface differs from the query interface:
 +
 
 +
[[image:template-form.png]]
 +
 
 +
Running this query should get us the same results (more or less) as the query we wrote ourselves. To see where it might differ, we can view the underlying query by selecting '''Edit Query'''.
 +
 
 +
==Making Templates==
 +
 
 +
To make templates you and others can use later, you need to be logged in. When you are, you will be able to edit templates that belong to you, and make new templates:
 +
 
 +
Making a template is as simple as making a query, and then clicking '''start building a template query'''.
 +
 
 +
[[image:logged-in-options.png]]
 +
 
 +
You can choose what constraints are shown to the user (whether they are editable or not), and whether they are required or optional.
 +
 
 +
[[image:template-editing.png]]
 +
 
 +
=Accessing Your Data through the Webservice=
 +
 
 +
As well as the graphical webapp interface, each mine also offers a webservice that exposes an external, scriptable, programmatic API to the data (although this can be turned off at deployment). The webservice takes the form of a RESTful(''-ish'') set of resource paths that accept either GET or POST HTTP requests (for further details see [http://www.intermine.org/wiki/WebService here]).
 +
 
 +
==Raw URLs==
 +
 
 +
Anything you can do with the webservice ultimately boils down to requests to URLs, and the clients we provide are simply ways to generate and validate URLs, and manage the results they return. An example of a webservice URL is:
 +
http://preview.flymine.org/preview/service/template/results?name=Gene_Protein&constraint1=Gene&op1=LOOKUP&value1=big&extra1=&size=10&format=jsonobjects
 +
 
 +
Here the different parts are:
 +
;http://preview.flymine.org/preview/service
 +
:The base url for this service
 +
;template/results
 +
:The resource path (in this case, results for templates)
 +
;?name=Gene_Protein&constraint1=Gene&op1=LOOKUP&value1=big&extra1=&size=10&format=jsonobjects
 +
:The query string, a URL-encoded name-value pair set that tells the resource what we want to do
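
To make the breakdown concrete, the three parts can be put back together programmatically. The following is a minimal sketch that assumes only the Java standard library; the class `TemplateUrlSketch` and its methods are illustrative names and are not part of the InterMine client.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class TemplateUrlSketch {

    // Joins the base url, the resource path and the URL-encoded name-value pairs
    static String buildUrl(String baseUrl, String resourcePath, Map<String, String> params) {
        StringBuilder url = new StringBuilder(baseUrl).append('/').append(resourcePath);
        char separator = '?';
        for (Map.Entry<String, String> param : params.entrySet()) {
            url.append(separator)
               .append(encode(param.getKey()))
               .append('=')
               .append(encode(param.getValue()));
            separator = '&';
        }
        return url.toString();
    }

    // URL-encodes a single name or value (spaces become '+')
    static String encode(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is required of every Java platform
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // A LinkedHashMap keeps the parameters in the order they were added
        Map<String, String> params = new LinkedHashMap<String, String>();
        params.put("name", "Gene_Protein");
        params.put("constraint1", "Gene");
        params.put("op1", "LOOKUP");
        params.put("value1", "big");
        params.put("extra1", "");
        params.put("size", "10");
        params.put("format", "jsonobjects");

        System.out.println(buildUrl(
            "http://preview.flymine.org/preview/service", "template/results", params));
    }
}
```

Run as written, this should print the example URL shown above. Values containing spaces or punctuation, such as a pathway name, are encoded by `URLEncoder`, which is the main reason not to assemble these strings by hand.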
 +
 
 +
===Asking the Webapp to generate them for you===
 +
 
 +
Generating these URLs by hand is possible, but not straightforward. The simplest way to get a URL for a query you want to run again is to ask the webapp to generate it for you. You can do this when you are on the edit query page or a template form page by selecting the '''webservice url''' link at the bottom of the page:
 +
 
 +
[[image:getting-query-xml.png]]
 +
 
 +
This will get you a url you can use with ''wget'' or ''curl'', although it will be difficult to edit and adjust.
 +
 
 +
==The Command-Line utilities==
 +
 
 +
For very simple applications of the webservice, we also provide command line utilities that can take a query as XML or a template as a name and a list of parameters and return you the result as a flat-file. This is much more readable than simply using URLs.
 +
 
 +
(The command line utilities are installed automatically when the Perl Webservice client modules are installed)
 +
 
 +
===Getting XML from the Webapp===
 +
 
 +
Queries are represented in the webservice as XML strings, and rather than having to write them yourself (although you [http://www.intermine.org/wiki/QueryXML can]), you can again ask the webapp to generate the XML for you. Just select '''Query XML''' at the bottom of the page:
 +
 
 +
[[image:getting-query-xml.png]]
 +
 
 +
Which in the case of our pathways query would look like this:
 +
 
 +
<xml>
 +
<query name="" model="genomic" view="Pathway.identifier Pathway.name Pathway.genes.primaryIdentifier Pathway.genes.symbol"
 +
  longDescription="For a specified KEGG, REACTOME or FlyReactome pathway, list all the genes that are involved for a particular organism"
 +
  sortOrder="Pathway.identifier asc" constraintLogic="B and C and A">
 +
  <pathDescription pathString="Pathway.genes" description="Gene"/>
 +
  <constraint path="Pathway.name" code="A" op="=" value="Pentose phosphate pathway"/>
 +
  <constraint path="Pathway.dataSets.name" code="B" op="=" value="KEGG pathways data set"/>
 +
  <constraint path="Pathway.genes.organism.name" code="C" op="=" value="Drosophila melanogaster"/>
 +
</query>
 +
</xml>
 +
 
 +
To run the XML you obtained, use the ''run-im-query'' program:
 +
run-im-query --url www.flymine.org/query path/to/query.xml
 +
 
 +
===Running Templates===
 +
 
 +
To run a template all we need is the name of the template, and the parameters we want to specify. This information is all included in the query string part of the webservice url. For example, to make a command line request for the pathways &rarr; genes template we can run the following command:
 +
run-im-template --url www.flymine.org/query --title Pathway_Genes value1="Pentose phosphate pathway" value2="Drosophila melanogaster" value3="KEGG pathways data set"
 +
 
 +
==Access from Perl & Java programs==
 +
 
 +
To simplify access to the webservice from Perl and Java programs, we supply client software to run queries with. This software is included in our source tree:
 +
* '''Perl''': <code>~/Documents/Software/intermine/perl/Webservice-InterMine</code>
 +
* '''Java''': <code>~/Documents/Software/intermine/webservice/client</code>
 +
 
 +
The easiest ways to install these clients are, respectively:
 +
* '''Perl''': install with a cpan client:
 +
cpan Webservice::InterMine
 +
* '''Java''': download the client package from the appropriate webapp, by visiting the API tab
 +
 
 +
===Webapp/Webservice Integration===
 +
 
 +
Each mine now includes an API tab that provides links and guidance on using the programmatic client software. To get the Java client package for a particular webservice, make sure to click on the '''Java''' subtab (in the top-left), and then click the '''download''' link in the first section.
 +
[[image:perl-api_tab.png]]
 +
[[image:java-api_tab.png]]
 +
 
 +
In addition to this help page, every query and template you visit will offer to write a script or a java program for you that you can just save and run. To get this, click on the '''Perl''' or '''Java''' link to see the generated code:
 +
 
 +
[[image:template-form.png]]
 +
 
 +
Additional help is provided on CPAN: http://search.cpan.org/perldoc?Webservice::InterMine, or through the use of the <code>perldoc</code> command:
 +
perldoc Webservice::InterMine
 +
 
 +
===Accessing Templates===
 +
 
 +
The following is the complete code you would get by clicking on the '''Perl''' link above:
 +
 
 +
<perl>
 +
use Webservice::InterMine 0.9412 'http://www.flymine.org/release-27.0/service';
 +
 
 +
# This is an automatically generated script to run the FlyMine template
 +
# You should install the Webservice::InterMine modules to run this example, e.g. sudo cpan Webservice::InterMine
 +
 
 +
# template name - Pathway_Genes
 +
# template description - For a specified KEGG, REACTOME or FlyReactome pathway, list all the genes that are involved for a particular organism
 +
 
 +
my $template = Webservice::InterMine->template('Pathway_Genes')
 +
    or die 'Could not find template';
 +
 
 +
# You can edit the constraint values below
 +
# A    Pathway.name    Show genes in pathway:
 +
# B    Pathway.dataSets.name    From dataset (KEGG, Reactome or FlyReactome):
 +
# C    Pathway.genes.organism.name    For organism:
 +
 
 +
my $results = $template->results_with(
 +
    opA    => '=',
 +
    valueA => 'Pentose phosphate pathway',
 +
    opB    => '=',
 +
    valueB => 'KEGG pathways data set',
 +
    opC    => '=',
 +
    valueC => 'Drosophila melanogaster',
 +
    as    => 'string',
 +
);
 +
 
 +
print $results."\n";
 +
</perl>
 +
 
 +
The equivalent '''Java''' query would look like this:
 +
 
 +
<java>
 +
package flymine;
 +
 
 +
import java.util.ArrayList;
 +
import java.util.List;
 +
 
 +
import org.intermine.webservice.client.core.ServiceFactory;
 +
import org.intermine.webservice.client.services.TemplateService;
 +
import org.intermine.webservice.client.template.TemplateParameter;
 +
 
 +
/**
 +
* This is an automatically generated Java program to run the FlyMine template.
 +
* template name - Pathway_Genes
 +
* template description - For a specified KEGG, REACTOME or FlyReactome pathway, list all the genes that are involved for a particular organism
 +
*
 +
* @author FlyMine
 +
*
 +
*/
 +
public class TemplatePathwayGenes
 +
{
 +
    private static String serviceRootUrl = "http://www.flymine.org/release-27.0/service";
 +
 
 +
    /**
 +
    * @param args command line arguments
 +
    */
 +
    public static void main(String[] args) {
 +
 
 +
        TemplateService service = new ServiceFactory(serviceRootUrl, "TemplateService").getTemplateService();
 +
 
 +
        List<TemplateParameter> parameters = new ArrayList<TemplateParameter>();
 +
 
 +
        // You can edit the constraint values below
 +
        // Constraint description - Show genes in pathway:
 +
        parameters.add(new TemplateParameter("Pathway.name", "eq", "Pentose phosphate pathway"));
 +
        // Constraint description - From dataset (KEGG, Reactome or FlyReactome):
 +
        parameters.add(new TemplateParameter("Pathway.dataSets.name", "eq", "KEGG pathways data set"));
 +
        // Constraint description - For organism:
 +
        parameters.add(new TemplateParameter("Pathway.genes.organism.name", "eq", "Drosophila melanogaster"));
 +
 
 +
        // Name of a public template, private templates are not supported at the moment
 +
        String templateName = "Pathway_Genes";
 +
 
 +
        // Maximum number of results to fetch
 +
        int maxCount = 10000;
 +
        List<List<String>> result = service.getResult(templateName, parameters, maxCount);
 +
        System.out.print("Results: \n");
 +
        for (List<String> row : result) {
 +
            for (String cell : row) {
 +
                System.out.print(cell + " ");
 +
            }
 +
            System.out.print("\n");
 +
        }
 +
    }
 +
}
 +
</java>
 +
 
 +
===Accessing Queries===
 +
 +
The '''Perl''' to access the same underlying query as that above using the query service would look like this:
 +
 
 +
<perl>
 +
use Webservice::InterMine 0.9412 'http://www.flymine.org/release-27.0/service';
 +
 
 +
# This is an automatically generated script to run the FlyMine query
 +
# You should install the Webservice::InterMine modules to run this example, e.g. sudo cpan Webservice::InterMine
 +
 
 +
# query description - For a specified KEGG, REACTOME or FlyReactome pathway, list all the genes that are involved for a particular organism
 +
 
 +
my $query = Webservice::InterMine->new_query;
 +
 
 +
# The view specifies the output columns
 +
$query->add_view(qw/
 +
    Pathway.identifier
 +
    Pathway.name
 +
    Pathway.genes.primaryIdentifier
 +
    Pathway.genes.symbol
 +
/);
 +
 
 +
# Sort by
 +
$query->set_sort_order('Pathway.identifier' => 'ASC');
 +
 
 +
# You can edit the constraint values below
 +
$query->add_constraint(
 +
    path  => 'Pathway.name',
 +
    op    => '=',
 +
    value => 'Pentose phosphate pathway',
 +
    code => 'A',
 +
);
 +
 
 +
$query->add_constraint(
 +
    path  => 'Pathway.dataSets.name',
 +
    op    => '=',
 +
    value => 'KEGG pathways data set',
 +
    code => 'B',
 +
);
 +
 
 +
$query->add_constraint(
 +
    path  => 'Pathway.genes.organism.name',
 +
    op    => '=',
 +
    value => 'Drosophila melanogaster',
 +
    code => 'C',
 +
);
 +
 
 +
# Constraint Logic
 +
$query->logic('B and C and A');
 +
 
 +
print $query->results(as => 'string')."\n";
 +
</perl>
 +
 
 +
The equivalent '''Java''' would look like this:
 +
 
 +
<java>
 +
package flymine;
 +
 
 +
import java.io.IOException;
 +
import java.util.List;
 +
 
 +
import org.intermine.metadata.Model;
 +
import org.intermine.webservice.client.core.ServiceFactory;
 +
import org.intermine.webservice.client.services.ModelService;
 +
import org.intermine.webservice.client.services.QueryService;
 +
import org.intermine.pathquery.PathQuery;
 +
import org.intermine.pathquery.OrderDirection;
 +
import org.intermine.pathquery.Constraints;
 +
 
 +
/**
 +
* This is an automatically generated Java program to run the FlyMine query.
 +
*
 +
* @author FlyMine
 +
*
 +
*/
 +
public class QueryClient
 +
{
 +
    private static String serviceRootUrl = "http://www.flymine.org/release-27.0/service";
 +
 
 +
    /**
 +
    * @param args command line arguments
 +
    * @throws IOException
 +
    */
 +
    public static void main(String[] args) {
 +
        QueryService service =
 +
            new ServiceFactory(serviceRootUrl, "QueryService").getQueryService();
 +
        Model model = getModel();
 +
        PathQuery query = new PathQuery(model);
 +
 
 +
        // Add views
 +
        query.addViews("Pathway.identifier",
 +
                "Pathway.name",
 +
                "Pathway.genes.primaryIdentifier",
 +
                "Pathway.genes.symbol");
 +
 
 +
        // Add orderby
 +
        query.addOrderBy("Pathway.identifier", OrderDirection.ASC);
 +
 
 +
        // Add constraints and you can edit the constraint values below
 +
        query.addConstraint(Constraints.eq("Pathway.name", "Pentose phosphate pathway"), "A");
 +
 
 +
        query.addConstraint(Constraints.eq("Pathway.dataSets.name", "KEGG pathways data set"), "B");
 +
 
 +
        query.addConstraint(Constraints.eq("Pathway.genes.organism.name", "Drosophila melanogaster"), "C");
 +
 
 +
        // Add constraintLogic
 +
        query.setConstraintLogic("B and C and A");
 +
 
 +
        // Maximum number of results to fetch
 +
        int maxCount = 10000;
 +
        List<List<String>> result = service.getResult(query, maxCount);
 +
        System.out.print("Results: \n");
 +
        for (List<String> row : result) {
 +
            for (String cell : row) {
 +
                System.out.print(cell + " ");
 +
            }
 +
            System.out.print("\n");
 +
        }
 +
    }
 +
 
 +
    private static Model getModel() {
 +
        ModelService service = new ServiceFactory(serviceRootUrl, "ModelService").getModelService();
 +
        return service.getModel();
 +
    }
 +
}
 +
</java>
 +
 
 +
==Data Formats==
 +
 
 +
Thus far we have received all our results as tab-delimited rows of data, but there are other formats we can request:
 +
 
 +
===Row Based Formats===
 +
 
 +
;tab
 +
:The default format - simple tab separated values
 +
;csv
 +
:As above, but comma separated, and double quoted
 +
;jsonrows
 +
:Row based json format: http://intermine.org/wiki/JSONRowFormat
 +
;xml
 +
:Structured data format with the structure
 +
<xml><ResultSet><Row><i></i>...</Row>...</ResultSet></xml>
 +
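
To make the difference between the first two formats concrete, here is a small sketch that turns one tab-separated row into the comma-separated, double-quoted form. It is only an illustration: the class name `RowFormatSketch` and the sample row are invented for the example, and doubling embedded quotes is the usual CSV convention rather than something this page specifies for the webservice.

```java
public class RowFormatSketch {

    // Converts one tab-separated row into a comma-separated, double-quoted row
    static String tabToCsv(String tabRow) {
        // The -1 limit keeps trailing empty cells instead of dropping them
        String[] cells = tabRow.split("\t", -1);
        StringBuilder csv = new StringBuilder();
        for (int i = 0; i < cells.length; i++) {
            if (i > 0) {
                csv.append(',');
            }
            // Assumption: embedded double quotes are escaped by doubling them
            csv.append('"').append(cells[i].replace("\"", "\"\"")).append('"');
        }
        return csv.toString();
    }

    public static void main(String[] args) {
        // Illustrative row: a pathway identifier and a pathway name
        String tabRow = "00030\tPentose phosphate pathway";
        System.out.println(tabToCsv(tabRow));
    }
}
```

In practice there is no need to convert between the two yourself, since either can be requested directly with the format parameter; the sketch only shows how the same row looks in each.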
 
 +
===Record Based Formats===
 +
 
 +
We have one format ('''jsonobjects''') that treats records as the unit of the query, returning an object with arbitrarily deep nesting of references and collections: see http://intermine.org/wiki/JSONRowFormat for more. You can see an example of the results in this format below:
 +
 
 +
<javascript>
 +
{
 +
  'rootClass': 'Gene',
 +
  'modelName': 'genomic',
 +
  'views':    ["Gene.primaryIdentifier", "Gene.symbol", "Gene.proteins.primaryAccession", "Gene.proteins.primaryIdentifier"],
 +
  'executionTime':  '2011.01.14 13:32::14',
 +
  'results':  [
 +
    {
 +
      "primaryIdentifier": null,
 +
      "symbol":            null,
 +
      "objectId":          1719268932,
 +
      "class":            "Gene",
 +
      "proteins":          [
 +
        {
 +
          "primaryAccession":  "A2AKB2",
 +
          "primaryIdentifier": "A2AKB2_MOUSE",
 +
          "objectId":          1719574559,
 +
          "class":            "Protein"
 +
        },
 +
        {
 +
          "primaryAccession":  "P61965",
 +
          "primaryIdentifier": "WDR5_MOUSE",
 +
          "objectId":          1719268927,
 +
          "class":            "Protein"
 +
        },
 +
        {
 +
          "primaryAccession":  "Q3UNQ3",
 +
          "primaryIdentifier": "Q3UNQ3_MOUSE",
 +
          "objectId":          1719447174,
 +
          "class":            "Protein"
 +
        }
 +
      ]
 +
    }
 +
  ]
 +
}
 +
</javascript>
 +
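
The relationship between the record based and row based views can be sketched in a few lines. The example below uses plain Java maps and lists to stand in for a parsed '''jsonobjects''' record (no JSON library is assumed) and flattens one gene record into one row per protein. The class `RecordToRowsSketch` and its helper methods are hypothetical names chosen for the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RecordToRowsSketch {

    // One gene record with a nested "proteins" collection becomes one row per protein
    static List<List<Object>> toRows(Map<String, Object> gene) {
        List<List<Object>> rows = new ArrayList<List<Object>>();
        @SuppressWarnings("unchecked")
        List<Map<String, Object>> proteins = (List<Map<String, Object>>) gene.get("proteins");
        for (Map<String, Object> protein : proteins) {
            // Column order follows the 'views' list of the example result
            rows.add(Arrays.asList(
                gene.get("primaryIdentifier"),
                gene.get("symbol"),
                protein.get("primaryAccession"),
                protein.get("primaryIdentifier")));
        }
        return rows;
    }

    // Builds a stand-in for one nested protein object
    static Map<String, Object> protein(String accession, String identifier) {
        Map<String, Object> protein = new HashMap<String, Object>();
        protein.put("primaryAccession", accession);
        protein.put("primaryIdentifier", identifier);
        return protein;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> proteins = new ArrayList<Map<String, Object>>();
        proteins.add(protein("A2AKB2", "A2AKB2_MOUSE"));
        proteins.add(protein("P61965", "WDR5_MOUSE"));
        proteins.add(protein("Q3UNQ3", "Q3UNQ3_MOUSE"));

        // As in the example above, the gene's own attributes are null
        Map<String, Object> gene = new HashMap<String, Object>();
        gene.put("primaryIdentifier", null);
        gene.put("symbol", null);
        gene.put("proteins", proteins);

        for (List<Object> row : toRows(gene)) {
            System.out.println(row);
        }
    }
}
```

The gene attributes are repeated on every row, which is exactly the redundancy that the record based format avoids by nesting the proteins inside the gene.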
 
 +
===Getting the Total===
 +
 
 +
To get just the total number of results, set the format to '''count'''.
 +
 
 +
==Access From Within the Browser==
 +
 
 +
We have a javascript client as well, called IMBedding (http://www.intermine.org/imbedding) which enables queries to
 +
any Mine webservice from any browser, and display tables of data inline. Please look at the imbedding tutorial to
 +
see more, but an example is included below as a demonstration:
 +
 
 +
<pre>
 +
<head>
 +
    <!-- jQuery is hosted by Google -->
 +
    <script src="https://ajax.googleapis.com/ajax/libs/jquery/1.4.4/jquery.min.js"
 +
        type="text/javascript">
 +
    </script>
 +
    <!-- jquery-jsonp is likewise available from an online repository -->
 +
    <script src="http://jquery-jsonp.googlecode.com/files/jquery.jsonp-2.1.4.min.js"
 +
        type="text/javascript">
 +
    </script>
 +
    <!-- Similarly imbedding.js is hosted on intermine.org -->
 +
    <script src="http://www.intermine.org/lib/imbedding/0.1/imbedding.min.js"
 +
        type="text/javascript">
 +
    </script>
 +
</head>
 +
<div id="some-placeholder"></div>
 +
<script type="text/javascript">
 +
    IMBedding.setBaseUrl("http://preview.flymine.org/preview");
 +
    IMBedding.loadTemplate(
 +
        {
 +
            name:          "Gene_RegionOverlappingTFbindingsite",
 +
 
 +
            constraint1:    "Gene",
 +
            op1:            "LOOKUP",
 +
            value1:        "CG2328",
 +
            code1:          "A"
 +
        },           
 +
        '#some-placeholder'
 +
    );
 +
</script>
 +
</pre>
 +
 
 +
= Evaluation =
 +
 
 +
{{Feedback}}
 +
 
 +
{{NextSession|GBrowse|GBrowse}}

Revision as of 04:56, 11 March 2011

InterMine Session

2011 GMOD Spring Training
8-12 March 2011
Thursday 10th March
Alex Kalderimis

OOOPS!

First things first:

  • I forgot to install a cpan module:
 sudo cpan Expect
  • You need to run a command:
 cp -r /home/gmod/Documents/Software/intermine/bio/sources/example-sources/malaria-gff  /home/gmod/Documents/Software/intermine/bio/sources/

Introduction

Intro Slides

InterMine is a project that aims to make creating, running, and maintaining massive data warehouses of integrated genomics data fast and flexible. It provides a back end database solution, a front end web-application, and a fully capable webservice API to access the data you host. InterMine already powers several websites, including FlyMine, modMine, RatMine, YeastMine, and soon metabolicMine and ZFINmine as well.

InterMine is fundamentally data agnostic, and can host any data you like, but we have been funded to develop genomics tools, and you will find a wide range of utilities that make dealing with different biological sources of data easy. Dealing with the massive amount of data that genomics research produces is never really easy, but InterMine makes the straightforward simple, and the difficult possible.

Overview

We aim to demonstrate three strengths of InterMine:

  1. It's straightforward to integrate data from different datasets (even your own data!) into one database.
  2. Once you get your data into the database, you get a powerful, works-out-of-the-box webapp that makes it easy and fun to access your data.
  3. Once you get your webapp up on the server, you get a sophisticated webservice that enables you or others to access the data via scripts, Java programs and other web-pages.

To do this, we will set up a stand-alone InterMine. This consists of a PostgreSQL database and a Java web-app sitting on top of it. Setting up your InterMine involves loading data into this database, and then mounting the web application in a running Tomcat instance.

Loading Data Into Your Database

The database schema, and the Java classes that represent it, are generated from XML configuration files. To manage this, each mine has an associated Java project folder, named after itself. This project also contains the code that manages integration, that is, the procedure of loading data into the schema once it has been defined. Therefore the outline of this section is:

  1. Setting up the project structure
  2. Configuring the data sources and the associated schema
  3. Running the build process

The tutorial data

We will use the sample data set we distribute with our source. This is located at:

~/Documents/Software/intermine/bio/tutorial/malariamine/malaria-data.tar.gz

Place this data in the data directory, and extract it for use in the tutorial:

 mkdir ~/Documents/Data/intermine
 cd ~/Documents/Data/intermine
 cp ~/Documents/Software/intermine/bio/tutorial/malariamine/malaria-data.tar.gz .
 tar -zxvf malaria-data.tar.gz

You should now have a directory of data available at /home/gmod/Documents/Data/intermine/malaria

The MineManager Graphical Installer

We are developing a graphical application to manage these stages, which we will use in this section of the workshop:

The MineManager is located in our source tree at SVN_ROOT/intermine/MineManager.

Running it from the command line

To run it open a terminal and type the command:

  /home/gmod/Documents/Software/intermine/intermine/MineManager/run

Running it from a clickable launcher

If you would prefer a point and click interface, on standard Linux desktops, you can run the launcher installer to obtain a runnable icon:

  /home/gmod/Documents/Software/intermine/intermine/MineManager/install_launcher

You should then find a MineManager icon on your desktop, which you can double click to open the installer.

Welcome Screen

You should see a window like this:

MineManager-welcome.png

This installer will guide you through the install procedure in 4 steps to the point of having a working database that we can use to release a mine on.

Starting a new Mine

To do this enter a name in the box at the top and click on the save icon to the right of the text box. This will automatically open up the next stage of the mine creation process.

Setting up the new Mine's properties

In order to proceed, we need to tell the installer where the InterMine source tree we are using is located. This is referred to here as InterMine Home. You can use the browse button on the top right of the Mine Information tab to select the appropriate directory:

 /home/gmod/Documents/Software/intermine

MineManager-minesettings-info.png

Once this form is completed and you have applied your changes you will be able to create your mine.

The Mine project directory

Creating the mine runs the make_mine script, which sets up the Java project directories in the appropriate places, and then builds an initial version of the data model. The structure of the mine's project directory is:

 SVN_ROOT/your_mine
         |
         + -- dbmodel/
         + -- integrate/
         + -- postprocess/
         + -- webapp/
         + -- project.xml
         + -- default.intermine.webapp.properties
         + -- default.intermine.integrate.properties 

The four sub directories are each separate Java projects that manage the different stages of building and running a mine, pretty much in the order they appear.

Adding Sources to a Project

The next section of the MineManager handles adding sources to a project:

A source here refers to the combination of a datasource and a parser that reads the data into the database. We supply a large number of parsers with our source code for reading in data from common biological formats (including Chado, GFF3), and we supply the tools for writing your own parsers for datasources we don't support out of the box.

InterMine-dataparsing.png

MineManager-sourcesettings-empty.png

Importing Protein Data From Uniprot

Click on add source, and then select uniprot as the source type from the drop-down list. (you can choose to name each source, but in this case uniprot is fine).

Once you add a source the section on the right will load up its specific configuration options.

MineManager-sourcesettings-uniprot.png

For uniprot the appropriate settings are:

Field | Value
List of Organisms | 36329
Create protein domains | TICKED
Create GO terms | UNTICKED
Location of data directory | /home/gmod/Documents/Data/intermine/malaria/uniprot

To save this configuration so it is used in the build, select save sources from the bottom right, or when prompted.

Side Note: Model Additions

The Uniprot configuration section has a second tab named Source Model Additions, which specifies the additions to the data model that a particular source brings with it. The uniprot source adds the following classes:

  • Component
  • UniprotFeature

And adds fields to the following classes:

  • Gene
  • GoAnnotation
  • Protein
  • ProteinDomain

If you select modelview model from the menu you can see how these classes and fields are integrated into the data model.

The GFF3 source

InterMine includes a parser to load valid GFF3 files. The creation of features, sequence features (usually chromosomes), locations and standard attributes is taken care of automatically.

The files we are loading are from PlasmoDB and contain gene, exon and mRNA features; there is one file per chromosome. Look at an example:

 head ~/Documents/Data/intermine/malaria/genome/gff/MAL1.gff3

To add the GFF3 source to your MalariaMine:

  1. Select the Add source option on the Sources menu.
  2. Choose malaria-gff
  3. Click the add button and save your sources.

The properties set for malaria-gff are:

Field | Value | Notes
gff3.seqClsName | Chromosome | the ids in the first column represent Chromosome objects, e.g. MAL1
gff3.taxonId | 36329 | taxon id of malaria
gff3.dataSourceName | PlasmoDB | the data source for features and their identifiers, this is used for the DataSet (evidence) and synonyms.
gff3.seqDataSourceName | PlasmoDB | the source of the seqids (chromosomes) is sometimes different to the features described
gff3.dataSetTitle | PlasmoDB P. falciparum genome | a DataSet object is created as evidence for the features, it is linked to a DataSource (PlasmoDB)
Location of Data Directory | /home/gmod/Documents/Data/intermine/malaria/genome/gff | Where we unpacked the data to

FASTA files

FASTA is a minimal format for representing sequence data. Files comprise a header with some identifier information preceded by '>' and a sequence. At present the InterMine FASTA parser loads just the first entry in the header after '>' and assigns it to be an attribute of the feature created. Here we will load one FASTA file for each malaria chromosome. Look at an example of the files we will load:

 head ~/Documents/Data/intermine/malaria/genome/fasta/MAL1.fasta

Add a fasta source to your Mine by following these steps:

  1. Select Add source option from the Sources menu
  2. Select the fasta type and name your source malaria-chromosome-fasta. Note: you must use this name as there is a predefined set of integration keys with this name. We will deal with keys in more detail in the custom source section.
  3. Click the add button and save your changes.

The following properties should be defined for malaria-chromosome-fasta:

Field | Value | Notes
FASTA Class Name | org.intermine.model.bio.Chromosome | the type of feature that each sequence is for
fasta.dataSourceName | PlasmoDB | the source of identifiers to be created
Dataset Name | PlasmoDB chromosome sequence | a DataSet object is created as evidence
Taxon ID | 36329 | the organism id for Plasmodium falciparum
Location of Data Directory | /home/gmod/Documents/Data/intermine/malaria/genome/fasta | Where we unpacked the data to before

Entrez Organism

Some sources depend on other sources, usually in order to complete the database with fields that can be derived or fetched in some way; Entrez Organism is one of these. It fetches organism names from Entrez. Add this source to the mine; it does not need any special configuration.

Loading Custom Data Sources

As well as the standard data loaders that ship with the InterMine source tree, we supply a tool-chain for building your own data loaders for any custom data source. There are APIs for this tool-chain in Java and Perl. The next section will walk us through loading a custom data source by using the Perl data loading API.

Tool-chain details

  • Java - data loaders are written by sub-classing one of a set of basic data loading classes (eg. org.intermine.bio.dataconversion.BioFileConverter), which provide a scaffold.
  • Perl - data is loaded in a two stage process, by first using a set of Perl modules to convert your data into our XML format, which is then loaded into the database using a core dataloader.

Installing the Perl tool-chain

The Perl modules are located in our source tree at:

 SVN_ROOT/intermine/perl/InterMine-Item

and can be installed in the standard manner with the commands:

  perl Build.PL
  sudo ./Build installdeps [if you don't have the pre-requisites]
  ./Build test [optional]
  sudo ./Build install

Or the entire procedure above can be automated with your preferred CPAN client by installing InterMine::Item. eg:

  cpan InterMine::Item

Our example dataset

First let's look at the data we will be adding to the database. In this tutorial we will use data from the KEGG pathway database. In their words:

"KEGG PATHWAY mapping is the process to map molecular datasets, especially large-scale datasets in genomics, transcriptomics, proteomics, and metabolomics, to the KEGG pathway maps for biological interpretation of higher-level systemic functions."

Specifically, the data we have will provide mappings between genes and KEGG pathways. It takes the form of two files in /home/gmod/Documents/Data/intermine/malaria/kegg; look at these now.

  • pfa_gene_map.tab - this has two tab delimited columns:
    • the first is the identifier of a malaria gene, note these are the same ids we have used for `Gene.primaryIdentifier` in other sources.
    • the second is a space separated list of KEGG pathway ids that the gene is involved in
  • map_title.tab - also has two tab delimited columns:
    • the first is a KEGG pathway identifier
    • the second the descriptive name of the pathway
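
The layout of the two files can be illustrated with a short parsing sketch. It uses only the Java standard library and is independent of the InterMine loading API; the class `KeggFilesSketch`, its method names and the sample lines (including the gene identifier) are made up for the illustration and are not taken from the real files.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class KeggFilesSketch {

    // map_title.tab: "<pathway id>\t<pathway name>"
    static Map<String, String> parseTitles(List<String> lines) {
        Map<String, String> nameOf = new LinkedHashMap<String, String>();
        for (String line : lines) {
            String[] columns = line.split("\t", 2);
            nameOf.put(columns[0], columns[1]);
        }
        return nameOf;
    }

    // pfa_gene_map.tab: "<gene id>\t<space separated pathway ids>"
    static Map<String, List<String>> parseGeneMap(List<String> lines) {
        Map<String, List<String>> pathwaysOf = new LinkedHashMap<String, List<String>>();
        for (String line : lines) {
            String[] columns = line.split("\t", 2);
            pathwaysOf.put(columns[0], Arrays.asList(columns[1].trim().split("\\s+")));
        }
        return pathwaysOf;
    }

    public static void main(String[] args) {
        // Made-up sample lines in the two layouts described above
        Map<String, String> nameOf = parseTitles(Arrays.asList(
            "00010\tGlycolysis / Gluconeogenesis",
            "00030\tPentose phosphate pathway"));
        Map<String, List<String>> pathwaysOf = parseGeneMap(Arrays.asList(
            "GENE_A\t00010 00030"));

        // Resolve each gene's pathway ids to pathway names
        for (Map.Entry<String, List<String>> gene : pathwaysOf.entrySet()) {
            for (String pathwayId : gene.getValue()) {
                System.out.println(gene.getKey() + " -> " + nameOf.get(pathwayId));
            }
        }
    }
}
```

The pathway ids act as the join key between the two files, which is why a loader needs to remember the item it created for each id before it reads the gene mappings.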

Our parsing strategy

We will

  1. Create data items for the data source, data set, and organism
  2. Read in the pathways file
    1. Create a data item for each pathway in the file
    2. Remember which item was made for each id
  3. Read in the gene mappings file
    1. Create a data item for each gene in the file, linked to the pathway items made earlier


Adding The Source

There are a couple of custom source types; since we will be using the Perl toolchain, add a new source of the type intermine-items-xml, and set the name to something sensible, such as kegg-pathways.

An Example Implementation of this Strategy

Click on open parser to edit and paste in the script below:

<perl>#!/usr/bin/perl

use warnings;
use strict;
use InterMine::Model;
use InterMine::Item::Document;

@ARGV == 4 or die "Bad arguments: we need four arguments\n$0 model-file output-file pathways-file gene-mappings-file\n";

my ( $model_file, $out_file, $pathway_file, $gene_mappings_file ) = @ARGV;

# Create the writing apparatus
my $model    = InterMine::Model->new( file => $model_file );
my $document = InterMine::Item::Document->new(
    model      => $model,
    output     => $out_file,
    auto_write => 1,
);

my $data_source = 'Kegg';
my $taxon_id    = 36329;
my %pathway_with;

# Create data items for the data source, data set, and organism
my $datasource_item = $document->add_item(
    'DataSource',
    'name' => $data_source,
);

my $dataset_item = $document->add_item(
    'DataSet',
    name       => $data_source . ' data set for taxon id: ' . $taxon_id,
    dataSource => $datasource_item,
);

my $org_item = $document->add_item(
    'Organism',
    taxonId  => $taxon_id,
);

# Read in the pathways file
open(my $pathways, '<', $pathway_file) or die "Could not open $pathway_file, $!";
for (<$pathways>) {
    chomp;
    my ($id, $title) = split(/\t/);
    ## Create a data item for each pathway in the file
    ## Remember which item was made for each id
    $pathway_with{$id} = $document->add_item(
        'Pathway',
        identifier => $id,
        name       => $title,
    );
}
close $pathways or die "Could not close $pathway_file, $!";

# Read in the gene mappings file
open(my $gene_mappings, '<', $gene_mappings_file) or die "Couldn't open $gene_mappings_file, $!";
for (<$gene_mappings>) {
    chomp;
    my ($gene_id, $pathway_string) = split(/\t/);
    my @pathway_ids = split(/\s/, $pathway_string);
    my $pathway_items = [@pathway_with{@pathway_ids}];
    ## Create a data item for each gene in the file, linked to the pathway items made earlier
    $document->add_item('Gene',
        primaryIdentifier => $gene_id,
        organism          => $org_item,
        pathways          => $pathway_items,
        dataSets          => [$dataset_item],
    );
}
close $gene_mappings or die "Could not close $gene_mappings_file, $!";

# Close the document
$document->close();

exit;</perl>

Additions

Our Model currently has no class "Pathway" (as you can confirm by browsing the model). We need to add it, and the Gene ↔ Pathway link. We can do this by using the Source Model Additions editor.

  1. Add a new class and name it Pathway
  2. Add an attribute in this class and name it name, with the type set to String
  3. Add an attribute in this class and name it identifier, with the type set to String
  4. Add a collection in this class and name it genes, with the type set to Gene and a reverse reference named pathways
  5. Click yes when asked if you want to make the reverse reference
  6. Change the field-type of the new reverse reference in the Gene class to collection
  7. Click no when asked if you want to make the reverse reference.

You should end up with:

Gene
  FieldType   Name      Type     Reverse-Reference
  Collection  pathways  Pathway  genes

Pathway
  FieldType   Name        Type    Reverse-Reference
  Attribute   name        String  -
  Attribute   identifier  String  -
  Collection  genes       Gene    pathways

Once these are added, reload the model and you should find the new Pathway class as part of it.

Dealing With Integration

As we are adding gene data from this source in addition to the other genes already in the database, we need to make sure they play nicely together. We do this by setting up "integration keys" that tell the integration process how to identify when we are adding details about an object we already have in the database, rather than adding an entirely new one.

To do this, on the source properties tab of the source details panel, click on Open keys file to edit the integration keys.

We already have keys defined for DataSet and DataSource: we only need to add the following line:

Gene.key = primaryIdentifier

We do not need to add a key for Pathway, as we are not adding pathways data from any other source.
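The idea behind an integration key can be sketched in a few lines of JavaScript. This is a conceptual illustration only, not how InterMine's integration code is written: the `integrate` function, the in-memory store and the sample records are all hypothetical, and conflicts between sources are resolved here by simply letting the incoming value win.

```javascript
// Merge incoming records into a store, using one field as the integration key.
function integrate(store, incoming, keyField) {
  for (const record of incoming) {
    const key = record[keyField];
    const existing = store.get(key);
    if (existing) {
      // Same key value: add the new details to the object we already have.
      Object.assign(existing, record);
    } else {
      // Unknown key value: this is an entirely new object.
      store.set(key, { ...record });
    }
  }
  return store;
}

// A gene already loaded from another source (made-up values).
const geneStore = new Map([
  ["GENE_A", { primaryIdentifier: "GENE_A", length: 1200 }],
]);

// Gene records arriving from the pathways source (made-up values).
integrate(
  geneStore,
  [
    { primaryIdentifier: "GENE_A", pathways: ["00030"] },
    { primaryIdentifier: "GENE_B", pathways: ["00010"] },
  ],
  "primaryIdentifier"
);

console.log(geneStore.size, geneStore.get("GENE_A"));
```

Without a key, both incoming records would have been stored as new genes, leaving two separate objects for the same gene.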

Generating the XML

Now we are ready to generate the XML using our parser. First we need to generate the model:

  cd ~/Documents/Software/intermine/malariamine/dbmodel
  ant build-db

And now we can run our parser and generate the XML:

  perl ~/Documents/Software/intermine/bio/sources/kegg-pathways/kegg-pathways_parser.pl \
    ~/Documents/Software/intermine/malariamine/dbmodel/build/model/genomic_model.xml \
    ~/Documents/Data/intermine/malaria/kegg/pathways.xml \
    ~/Documents/Data/intermine/malaria/kegg/map_title.tab \
    ~/Documents/Data/intermine/malaria/kegg/pfa_gene_map.tab

Finally, tell our mine where the data is by setting the Data file location in the kegg-pathways source properties section to /home/gmod/Documents/Data/intermine/malaria/kegg/pathways.xml

Running a Build

The build section of the MineManager runs the build process (a front-end for our project-build script).

MineManager-buildscreen.png

If you click build and there are problems with your configuration that would prevent a successful build, the MineManager will catch that and tell you:

MineManager-cannotbuild.png

In this case we would need to go back to the database options in the first section (Mine Settings → Database → create databases).

After a successful build, you will see a summary of the time taken at each stage:

MineManager-built.png

The sources we have set up above should take about 650 sec (give or take) to integrate into the database ∴ time for a break!

Deployment

Once you have read access to a production database, you can build and release a web application against it.

Configuration

If you haven't already, use the MineManager to configure the tomcat properties (Mine Settings → Web Settings):

Use these settings for the tutorial (the tomcat settings refer to a preconfigured tomcat role).

Field               Value              Notes
tomcat username     manager            The name of a tomcat administrator
tomcat password     manager            The password for the tomcat administrator
superuser username  choose a name      The name for the webapp administrator
superuser password  choose a password  The password for the webapp administrator

UserProfile Initialisation

In addition to the ObjectStore DB which contains your data, there is a separate database which holds user information (accounts, saved preferences, query history, lists, templates, etc) and general webapp configuration (which technically all belongs to the superuser).

Since this mine is new, we need to build a new one (we will only ever need to do this once - repeating this step at a later date will delete all your users' data).

To build the database:

NOTE: This command will delete any data in the userprofile database.

 cd ~/Documents/Software/intermine/malariamine/webapp
 ant build-db-userprofile

This command creates the SuperUser account and loads the default-template-queries.xml file.

Starting the Tomcat Webserver

Tomcat is the webserver we use to serve InterMine webapps. Start Tomcat with this command:

 cd ~/Documents/Software/tomcat6 
 bin/startup.sh

Visit the Tomcat manager at http://localhost:8080/. The username and password required to access the manager are both manager.

Deploying the Webapp to Tomcat

Run the following command to release your webapp:

 cd ~/Documents/Software/intermine/malariamine/webapp
 ant clean default remove-webapp release-webapp

This will fetch the model from the database, generate the model Java code, and then remove and release the webapp. The default target forces a rebuild of the .war file. (The clean target is not always necessary, but it doesn't hurt to include it, and remove-webapp is only really required when you have released the webapp before.)

Visit your newly minted mine: http://localhost:8080/malariamine

Accessing Your Data Through the Webapp

In this section we will look at how you can examine, analyse and aggregate your data in the webapp, looking both at the webapp you have built, and FlyMine.

Single Objects (Report Pages)

Each object in the database (each Gene, Chromosome, Exon, Protein, etc) will have a report page that can display:

  • The properties of the object
  • Links to other objects this object references
  • Widgets that display data about the object (GBrowse/Cytoscape)
  • Links to sites that contain information about the objects
  • Homologues of the object in other organisms/mines
  • Templates that you can run on the given object, and the number of results you can expect.

In the top right there is a search box which uses the Lucene quick-search. Enter ald to find the Aldolase gene.

Report-page-props.png

Report-page-templates.png

Multiple Objects (Lists)

Lists of Objects of any type can be made and explored. The pages that display data on these lists are called "List Report Pages", and can display:

  • The properties of the objects in the list
  • Links to tools in other sites
  • Tools that convert a list into:
    • A list of a different type (e.g. gene → exon)
    • A list of orthologues
  • Tools that aggregate data over the list:
    • Enrichment
    • Distribution
    • Expression
    • Localisation
  • Queries run on all objects in the list

Click the Lists tab to see the lists section. Here you can either:

  • Upload a list of identifiers to create a new list
  • View an existing list.

On the list view page you can perform set logic on lists of objects (intersection/union/etc) to derive new lists.
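For readers less familiar with set logic, the operations offered on the list view page behave like the following JavaScript sketch. It only illustrates the concepts using plain sets of identifiers; the helper names and the gene identifiers are made up, and the webapp's own implementation is not shown here.

```javascript
// Everything that is in either list.
const union = (a, b) => new Set([...a, ...b]);
// Only what is in both lists.
const intersection = (a, b) => new Set([...a].filter((x) => b.has(x)));
// What is in the first list but not in the second.
const difference = (a, b) => new Set([...a].filter((x) => !b.has(x)));

// Two hypothetical gene lists.
const listOne = new Set(["GENE_A", "GENE_B", "GENE_C"]);
const listTwo = new Set(["GENE_B", "GENE_C", "GENE_D"]);

console.log([...union(listOne, listTwo)]);
console.log([...intersection(listOne, listTwo)]);
console.log([...difference(listOne, listTwo)]);
```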

List-view-page.png

From the view sub-tab, select a list you think looks interesting:

List-analysis-page.png

Exporting Data and Summarising Columns

When viewing lists or Query results these two actions are always available:

Exporting Data

Data can be exported from Lists and Query results in a number of formats:

  • Flat file formats (TSV/CSV)
  • Excel .xls format
  • GFF3
  • Fasta
  • Galaxy (for use in a workflow)

The export link is always in the top left of the page:

Export-options.png

Viewing Column Summaries

Each column header also has a summary symbol (Σ) which helps you get an overview of the data contained in that column:

Column-header.png

Running Queries

Queries in the webapp are created and run using the QueryBuilder interface, which helps you build queries using the data model as a guide.

  1. Click on the QueryBuilder tab
    Query-builder-tab.png
  2. Select gene as the type of object we want to query for
    Query-select-gene.png
  3. Click summary next to gene in the Model browser
  4. Scroll down to pathways
    1. Expand the pathways collection
    2. Click constrain next to name
    3. Type p into the value box in the pop-up
    4. Select Pentose Phosphate Pathway from the autocomplete drop-down
      Query-built-query.png
  5. Select Show results

You should see results like this: Query-results.png

Making Lists from Query Results

The query results page provides links to the report pages of individual objects, and we can create lists of the objects returned. To create a list, click the check-box in the header of the column containing the type of object you want to make into a list, here, any of the gene fields:

Making-a-list1.png

Then name your list:

Making-a-list2.png

And you're done.

Running Templates

Templates are queries that have been pre-written and saved for later re-use, either by and for a single user, or for all the users of the mine. Rather than running the same query over and over, they allow for their parameters to be changed, and they automatically present a simple web-form interface.

When we built our mine we included a number of default templates, and large mines such as FlyMine have many more. Click on the Templates tab to see what templates are available (you can type a name, or part of a name, into the box at the top to filter the list of templates):

Flymine-templates.png

Select the Pathway → Genes template to see how the template interface differs from the query interface:

Template-form.png

Running this query should get us the same results (more or less) as the query we wrote ourselves. To see where it might differ, we can view the underlying query by selecting Edit Query.

Making Templates

To make templates you and others can use later, you need to be logged in. When you are, you will be able to edit templates that belong to you, and make new templates:

Making a template is as simple as making a query, and then clicking start building a template query.

Logged-in-options.png

You can choose what constraints are shown to the user (whether they are editable or not), and whether they are required or optional.

Template-editing.png

Accessing Your Data through the Webservice

As well as the graphical webapp interface, each mine also offers a webservice that exposes an external, scriptable programmatic API to the data (although this can be turned off at deployment). The webservice takes the form of a RESTful(-ish) set of resource paths, that accept either GET or POST HTTP requests (for further details see here).

Raw URLs

Anything you can do with the webservice ultimately boils down to requests to URLs, and the clients we provide are simply ways to generate and validate URLs, and manage the results they return. An example of a webservice URL is:

http://preview.flymine.org/preview/service/template/results?name=Gene_Protein&constraint1=Gene&op1=LOOKUP&value1=big&extra1=&size=10&format=jsonobjects

Here the different parts are:

http://preview.flymine.org/preview/service
The base url for this service
template/results
The resource path (in this case, results for templates)
?name=Gene_Protein&constraint1=Gene&op1=LOOKUP&value1=big&extra1=&size=10&format=jsonobjects
The query string, a URL-encoded name-value pair set that tells the resource what we want to do
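The three parts can be put back together programmatically. The sketch below uses the standard `URLSearchParams` class available in browsers and Node.js to do the URL-encoding; `buildServiceUrl` is a hypothetical helper name of our own, not part of any InterMine client.

```javascript
// Join a base URL, a resource path and a set of name-value pairs into one URL.
function buildServiceUrl(baseUrl, resourcePath, params) {
  // URLSearchParams URL-encodes each name and value and joins them with "&".
  const queryString = new URLSearchParams(params).toString();
  return `${baseUrl}/${resourcePath}?${queryString}`;
}

const templateUrl = buildServiceUrl(
  "http://preview.flymine.org/preview/service",
  "template/results",
  {
    name: "Gene_Protein",
    constraint1: "Gene",
    op1: "LOOKUP",
    value1: "big",
    extra1: "",
    size: "10",
    format: "jsonobjects",
  }
);

console.log(templateUrl);
```

Letting a library do the encoding matters once values contain spaces or punctuation, as pathway names such as "Pentose phosphate pathway" do.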

Asking the Webapp to generate them for you

Generating these URLs by hand is possible, but it is not straightforward. The simplest way to get a URL for a query you want to run again is to ask the webapp to generate it for you. You can do this when you are on the edit query page or a template form page by selecting the webservice url link at the bottom of the page:

Getting-query-xml.png

This will get you a url you can use with wget or curl, although it will be difficult to edit and adjust.

The Command-Line utilities

For very simple applications of the webservice, we also provide command line utilities that can take a query as XML or a template as a name and a list of parameters and return you the result as a flat-file. This is much more readable than simply using URLs.

(The command line utilities are installed automatically when the Perl Webservice client modules are installed)

Getting XML from the Webapp

Queries are represented in the webservice as XML strings. Rather than writing them yourself (although you can), you can again ask the webapp to generate the XML for you. Just select Query XML at the bottom of the page:

Getting-query-xml.png

Which in the case of our pathways query would look like this:

<xml><query name="" model="genomic" view="Pathway.identifier Pathway.name Pathway.genes.primaryIdentifier Pathway.genes.symbol"
    longDescription="For a specified KEGG, REACTOME or FlyReactome pathway, list all the genes that are involved for a particular organism"
    sortOrder="Pathway.identifier asc" constraintLogic="B and C and A">
  <pathDescription pathString="Pathway.genes" description="Gene"/>
  <constraint path="Pathway.name" code="A" op="=" value="Pentose phosphate pathway"/>
  <constraint path="Pathway.dataSets.name" code="B" op="=" value="KEGG pathways data set"/>
  <constraint path="Pathway.genes.organism.name" code="C" op="=" value="Drosophila melanogaster"/>
</query></xml>

To run the XML you got, use the run-im-query program:

run-im-query --url www.flymine.org/query path/to/query.xml

Running Templates

To run a template all we need is the name of the template, and the parameters we want to specify. This information is all included in the query string part of the webservice url. For example, to make a command line request for the pathways → genes template we can run the following command:

run-im-template --url www.flymine.org/query --title Pathway_Genes value1="Pentose phosphate pathway" value2="Drosophila melanogaster" value3="KEGG pathways data set"

Access from Perl & Java programs

To simplify access to the webservice from Perl and Java programs, we supply client software to run queries with. This software is included in our source tree:

  • Perl: ~/Documents/Software/intermine/perl/Webservice-InterMine
  • Java: ~/Documents/Software/intermine/webservice/client

However, the easiest ways to install these clients are, respectively:

  • Perl: install with a cpan client:
cpan Webservice::InterMine
  • Java: download the client package from the appropriate webapp, by visiting the API tab

Webapp/Webservice Integration

Each mine now includes an API tab that provides links and guidance on using the programmatic client software. To get the Java client package for a particular webservice, make sure to click on the Java subtab (in the top-left), and then click the download link in the first section.

Perl-api tab.png

Java-api tab.png

In addition to this help page, every query and template you visit will offer to write a script or a java program for you that you can just save and run. To get this, click on the Perl or Java link to see the generated code:

Template-form.png

Additional help is provided on CPAN: http://search.cpan.org/perldoc?Webservice::InterMine, or through the use of the perldoc command:

perldoc Webservice::InterMine

Accessing Templates

The following is the complete code you would get by clicking on the Perl link above:

<perl>use Webservice::InterMine 0.9412 'http://www.flymine.org/release-27.0/service';

# This is an automatically generated script to run the FlyMine template
# You should install the Webservice::InterMine modules to run this example, e.g. sudo cpan Webservice::InterMine

# template name - Pathway_Genes
# template description - For a specified KEGG, REACTOME or FlyReactome pathway, list all the genes that are involved for a particular organism

my $template = Webservice::InterMine->template('Pathway_Genes')
    or die 'Could not find template';

# You can edit the constraint values below
# A    Pathway.name    Show genes in pathway:
# B    Pathway.dataSets.name    From dataset (KEGG, Reactome or FlyReactome):
# C    Pathway.genes.organism.name    For organism:

my $results = $template->results_with(
    opA    => '=',
    valueA => 'Pentose phosphate pathway',
    opB    => '=',
    valueB => 'KEGG pathways data set',
    opC    => '=',
    valueC => 'Drosophila melanogaster',
    as     => 'string',
);

print $results."\n";</perl>

The equivalent Java query would look like this:

<java>package flymine;

import java.util.ArrayList;
import java.util.List;

import org.intermine.webservice.client.core.ServiceFactory;
import org.intermine.webservice.client.services.TemplateService;
import org.intermine.webservice.client.template.TemplateParameter;

/**
 * This is an automatically generated Java program to run the FlyMine template.
 * template name - Pathway_Genes
 * template description - For a specified KEGG, REACTOME or FlyReactome pathway, list all the genes that are involved for a particular organism
 *
 * @author FlyMine
 *
 */

public class TemplatePathwayGenes {

   private static String serviceRootUrl = "http://www.flymine.org/release-27.0/service";
   /**
    * @param args command line arguments
    */
   public static void main(String[] args) {
       TemplateService service = new ServiceFactory(serviceRootUrl, "TemplateService").getTemplateService();
       List<TemplateParameter> parameters = new ArrayList<TemplateParameter>();
       // You can edit the constraint values below
       // Constraint description - Show genes in pathway:
       parameters.add(new TemplateParameter("Pathway.name", "eq", "Pentose phosphate pathway"));
       // Constraint description - From dataset (KEGG, Reactome or FlyReactome):
       parameters.add(new TemplateParameter("Pathway.dataSets.name", "eq", "KEGG pathways data set"));
       // Constraint description - For organism:
       parameters.add(new TemplateParameter("Pathway.genes.organism.name", "eq", "Drosophila melanogaster"));
       // Name of a public template, private templates are not supported at the moment
       String templateName = "Pathway_Genes";
       // Number of results are fetched
       int maxCount = 10000;
       List<List<String>> result = service.getResult(templateName, parameters, maxCount);
       System.out.print("Results: \n");
       for (List<String> row : result) {
           for (String cell : row) {
               System.out.print(cell + " ");
           }
           System.out.print("\n");
       }
   }

} </java>

Accessing Queries

The Perl to access the same underlying query as that above using the query service would look like this:

<perl>use Webservice::InterMine 0.9412 'http://www.flymine.org/release-27.0/service';

# This is an automatically generated script to run the FlyMine query
# You should install the Webservice::InterMine modules to run this example, e.g. sudo cpan Webservice::InterMine

# query description - For a specified KEGG, REACTOME or FlyReactome pathway, list all the genes that are involved for a particular organism

my $query = Webservice::InterMine->new_query;

# The view specifies the output columns
$query->add_view(qw/
    Pathway.identifier
    Pathway.name
    Pathway.genes.primaryIdentifier
    Pathway.genes.symbol
/);

# Sort by
$query->set_sort_order('Pathway.identifier' => 'ASC');

# You can edit the constraint values below
$query->add_constraint(
    path  => 'Pathway.name',
    op    => '=',
    value => 'Pentose phosphate pathway',
    code  => 'A',
);

$query->add_constraint(
    path  => 'Pathway.dataSets.name',
    op    => '=',
    value => 'KEGG pathways data set',
    code  => 'B',
);

$query->add_constraint(
    path  => 'Pathway.genes.organism.name',
    op    => '=',
    value => 'Drosophila melanogaster',
    code  => 'C',
);

# Constraint Logic
$query->logic('B and C and A');

print $query->results(as => 'string')."\n";</perl>

The equivalent Java would look like this:

<java>package flymine;

import java.io.IOException;
import java.util.List;

import org.intermine.metadata.Model;
import org.intermine.webservice.client.core.ServiceFactory;
import org.intermine.webservice.client.services.ModelService;
import org.intermine.webservice.client.services.QueryService;
import org.intermine.pathquery.PathQuery;
import org.intermine.pathquery.OrderDirection;
import org.intermine.pathquery.Constraints;

/**
 * This is an automatically generated Java program to run the FlyMine query.
 *
 * @author FlyMine
 *
 */

public class QueryClient {

   private static String serviceRootUrl = "http://www.flymine.org/release-27.0/service";
   /**
    * @param args command line arguments
    * @throws IOException
    */
   public static void main(String[] args) {
       QueryService service =
           new ServiceFactory(serviceRootUrl, "QueryService").getQueryService();
       Model model = getModel();
       PathQuery query = new PathQuery(model);
       // Add views
       query.addViews("Pathway.identifier",
               "Pathway.name",
               "Pathway.genes.primaryIdentifier",
               "Pathway.genes.symbol");
       // Add orderby
       query.addOrderBy("Pathway.identifier", OrderDirection.ASC);
       // Add constraints and you can edit the constraint values below
       query.addConstraint(Constraints.eq("Pathway.name", "Pentose phosphate pathway"), "A");
       query.addConstraint(Constraints.eq("Pathway.dataSets.name", "KEGG pathways data set"), "B");
       query.addConstraint(Constraints.eq("Pathway.genes.organism.name", "Drosophila melanogaster"), "C");
       // Add constraintLogic
       query.setConstraintLogic("B and C and A");
       // Number of results are fetched
       int maxCount = 10000;
       List<List<String>> result = service.getResult(query, maxCount);
       System.out.print("Results: \n");
       for (List<String> row : result) {
           for (String cell : row) {
               System.out.print(cell + " ");
           }
           System.out.print("\n");
       }
   }
   private static Model getModel() {
       ModelService service = new ServiceFactory(serviceRootUrl, "ModelService").getModelService();
       return service.getModel();
   }

} </java>

Data Formats

Thus far we have received all our results as tab-delimited rows of data, but there are other formats we can request:

Row Based Formats

tab
The default format - simple tab separated values
csv
As above, but comma separated, and double quoted
jsonrows
Row based json format: http://intermine.org/wiki/JSONRowFormat
xml
Structured data format with the structure

<xml><ResultSet><Row>...</Row>...</ResultSet></xml>
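As a rough guide to consuming the two flat-file formats, the JavaScript sketch below splits tab and csv responses into arrays of cells. The function names are our own, the sample rows are made up, and the csv parser assumes that an embedded double quote is written as two double quotes (the usual CSV convention), which the description above does not spell out.

```javascript
// Tab format: one row per line, cells separated by tabs.
function parseTabRows(text) {
  return text
    .split(/\r?\n/)
    .filter((line) => line !== "")
    .map((line) => line.split("\t"));
}

// CSV format as described above: every cell double quoted, cells separated by commas.
// Assumption: an embedded quote is escaped by doubling it ("").
function parseQuotedCsvRow(line) {
  const cells = [];
  const cellPattern = /"((?:[^"]|"")*)"(?:,|$)/g;
  let match;
  while ((match = cellPattern.exec(line)) !== null) {
    cells.push(match[1].replace(/""/g, '"'));
  }
  return cells;
}

// Hypothetical result rows, made up for the example.
console.log(parseTabRows("00030\tPentose phosphate pathway\n"));
console.log(parseQuotedCsvRow('"00030","Pentose phosphate pathway"'));
```

For anything beyond quick scripts, a dedicated CSV library is a safer choice than a hand-written pattern like this one.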

Record Based Formats

We have one format (jsonobjects) that treats records as the unit of the query, returning an object with arbitrarily deep nesting of references and collections: see http://intermine.org/wiki/JSONRowFormat for more. You can see an example of the results in this format below:

<javascript> {

 'rootClass': 'Gene',
 'modelName': 'genomic',
 'views':     ["Gene.primaryIdentifier", "Gene.symbol", "Gene.proteins.primaryAccession", "Gene.proteins.primaryIdentifier"],
 'executionTime':  '2011.01.14 13:32::14',
 'results':   [
   {
     "primaryIdentifier": null,
     "symbol":            null,
     "objectId":          1719268932,
     "class":             "Gene",
     "proteins":          [
       {
         "primaryAccession":  "A2AKB2",
         "primaryIdentifier": "A2AKB2_MOUSE",
         "objectId":          1719574559,
         "class":             "Protein"
       },
       {
         "primaryAccession":  "P61965",
         "primaryIdentifier": "WDR5_MOUSE",
         "objectId":          1719268927,
         "class":             "Protein"
       },
       {
         "primaryAccession":  "Q3UNQ3",
         "primaryIdentifier": "Q3UNQ3_MOUSE",
         "objectId":          1719447174, 
         "class":             "Protein"
       }
     ]
   }
 ]

} </javascript>
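Because the record format nests collections inside each object, client code walks the structure rather than reading rows. The following JavaScript sketch flattens a trimmed copy of the example above back into one row per protein; `proteinRows` is a hypothetical helper of our own, and the object ids have been left out for brevity.

```javascript
// A trimmed copy of the jsonobjects example above, written as a JavaScript object.
const response = {
  rootClass: "Gene",
  results: [
    {
      primaryIdentifier: null,
      symbol: null,
      class: "Gene",
      proteins: [
        { primaryAccession: "A2AKB2", primaryIdentifier: "A2AKB2_MOUSE", class: "Protein" },
        { primaryAccession: "P61965", primaryIdentifier: "WDR5_MOUSE", class: "Protein" },
        { primaryAccession: "Q3UNQ3", primaryIdentifier: "Q3UNQ3_MOUSE", class: "Protein" },
      ],
    },
  ],
};

// Flatten each gene record into one row per protein in its collection.
function proteinRows(result) {
  const rows = [];
  for (const gene of result.results) {
    for (const protein of gene.proteins || []) {
      rows.push([
        gene.primaryIdentifier,
        gene.symbol,
        protein.primaryAccession,
        protein.primaryIdentifier,
      ]);
    }
  }
  return rows;
}

console.log(proteinRows(response));
```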

Getting the Total

To get just the total number of results instead of the results themselves, set the format to count.
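One way to do this for an existing webservice URL is to swap its format parameter, as in this JavaScript sketch. `withFormat` is a hypothetical helper name of our own; it relies only on the standard `URL` class and makes no assumption about what the response looks like.

```javascript
// Return a copy of a webservice URL with its format parameter replaced.
function withFormat(serviceUrl, format) {
  const url = new URL(serviceUrl);
  url.searchParams.set("format", format); // adds the parameter if it is missing
  return url.toString();
}

const countUrl = withFormat(
  "http://preview.flymine.org/preview/service/template/results?name=Gene_Protein&constraint1=Gene&op1=LOOKUP&value1=big&extra1=&size=10&format=jsonobjects",
  "count"
);

console.log(countUrl);
```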

Access From Within the Browser

We also have a JavaScript client called IMBedding (http://www.intermine.org/imbedding), which enables queries to any Mine webservice from any browser and displays tables of data inline. Please look at the imbedding tutorial to see more; an example is included below as a demonstration:

 <head>
    <!-- jQuery is hosted by Google -->
    <script src="https://ajax.googleapis.com/ajax/libs/jquery/1.4.4/jquery.min.js" 
        type="text/javascript">
    </script>
    <!-- jquery-jsonp is likewise available from an online repository -->
    <script src="http://jquery-jsonp.googlecode.com/files/jquery.jsonp-2.1.4.min.js" 
        type="text/javascript">
    </script>
    <!-- Similarly imbedding.js is hosted on intermine.org -->
    <script src="http://www.intermine.org/lib/imbedding/0.1/imbedding.min.js" 
        type="text/javascript">
    </script>
 </head>
 <div id="some-placeholder"></div>
 <script type="text/javascript">
    IMBedding.setBaseUrl("http://preview.flymine.org/preview");
    IMBedding.loadTemplate(
        {
            name:           "Gene_RegionOverlappingTFbindingsite",

            constraint1:    "Gene",
            op1:            "LOOKUP",
            value1:         "CG2328",
            code1:          "A"
        },
        '#some-placeholder'
    );
 </script>

Evaluation

Please give us your comments on this session. We will ask for your feedback on each session and the course as a whole on the last day. Your comments will help guide the direction and content of future GMOD training and outreach efforts.


Next session →   GBrowse
