Artemis-Chado Integration Tutorial
This Artemis-Chado Integration tutorial was presented by Robin Houston, Tim Carver and Giles Velarde at the 2009 GMOD Summer School - Europe, August 2009. The most recent Artemis tutorial can be found at the Artemis Tutorial page.
- Username: gmod
- Password: gmod
This tutorial describes the world as it existed on the day the tutorial was given. Please be aware that things like CPAN modules, Java libraries, and Linux packages change over time, and that the instructions in the tutorial will slowly drift over time. Newer versions of tutorials will be posted as they become available.
In this tutorial we show how to install and configure Artemis and ACT for use with a Chado database. The first two sections cover installing Postgres and Chado; they are included for completeness only, and you should refer to the Chado session for more details.
Artemis is a DNA sequence browser which works with flat files (e.g. EMBL, GenBank, GFF) and more recently with Chado databases. ACT (Artemis Comparison Tool) is based on Artemis. ACT uses BLAST comparison files to highlight regions of interest between pairs of sequences. Artemis and ACT in database mode are increasingly being used in the Pathogen Genomics Group at the Sanger Institute.
Download and Install Postgres
./configure --prefix=/home/gmod/gmod_test/pgsl --with-pgport=5432 --with-includes=/Developer
make
make install
cd /home/gmod/gmod_test/pgsl
bin/initdb -D data/
Add the following line to data/postgresql.conf:
listen_addresses = 'localhost'
Create the database:
postmaster -D data &
createuser --createdb username
createlang plpgsql template1
createdb --port=5432 chado_pathogen
Download and Install Chado
- Download stable release (gmod-1.0.tar.gz)
- Install BioPerl (http://www.bioperl.org/wiki/Installing_Bioperl_for_Unix)
- Install go-perl http://search.cpan.org/~cmungall/go-perl/
- Install Bundle::GMOD from cpan
export GMOD_ROOT=/usr/local/gmod
export CHADO_DB_NAME=chado_pathogen
export CHADO_DB_USERNAME=username
export CHADO_DB_PORT=5432
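Before building, it can help to confirm that these variables are actually visible to the shell that will run the build. The following is a minimal sketch; the function name check_chado_env is hypothetical and not part of GMOD.

```shell
# Hypothetical helper (not part of GMOD): report which of the variables
# exported above are missing from the current environment.
check_chado_env() {
    missing=0
    for name in GMOD_ROOT CHADO_DB_NAME CHADO_DB_USERNAME CHADO_DB_PORT; do
        # Indirect lookup of the variable whose name is held in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name"
            missing=1
        fi
    done
    # Non-zero exit status if at least one variable is unset or empty.
    return "$missing"
}

check_chado_env || echo "export the variables listed above before building Chado"
```

The function prints nothing and returns 0 when all four variables are set.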
Now compile Chado and install the standard components (schema and ontologies):
perl Makefile.PL
make
sudo make install
make load_schema
make prepdb
make ontologies
Examples of Loading Sequences into the Database
In this section we detail how to load 3 Plasmodium sequences into Chado for viewing in Artemis and ACT. Alternatively you can use your own sequences of interest.
The GenBank files are available from Entrez with the links below. Make sure you download each record with its sequence by selecting the options 'Show sequence' and 'Update View'. Then go to the Download menu and select GenBank(Full):
- NC_004314 (Plasmodium falciparum 3D7 chromosome 10)
- NC_011907 (Plasmodium knowlesi chromosome 6)
- NC_011909 (Plasmodium knowlesi chromosome 8)
These are usually downloaded to the Desktop directory (depending on the browser) and saved under a name like sequences.gbwithparts. Rename them to NC_004314.gbk, NC_011907.gbk and NC_011909.gbk. Pfalciparum and Pknowlesi will need to be added to the organism table in Chado:
INSERT INTO organism ( abbreviation, genus, species, common_name )
VALUES
  ( 'Pfalciparum', 'Plasmodium', 'falciparum', 'Pfalciparum' ),
  ( 'Pknowlesi', 'Plasmodium', 'knowlesi', 'Pknowlesi' );
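The renaming of the downloaded GenBank files described above can also be scripted. This is a sketch only: both function names are hypothetical, the glob sequence*.gb* is an assumption about how the browser names the downloads, and the accession is read from the ACCESSION line of each GenBank record.

```shell
# Hypothetical helper: print the accession recorded in a GenBank file.
genbank_accession() {
    awk '$1 == "ACCESSION" { print $2; exit }' "$1"
}

# Hypothetical helper: rename every matching download in a directory
# (default: current directory) to <accession>.gbk.
rename_genbank_downloads() {
    dir=${1:-.}
    for f in "$dir"/sequence*.gb*; do
        # Skip the literal pattern when nothing matches.
        [ -e "$f" ] || continue
        acc=$(genbank_accession "$f")
        if [ -n "$acc" ]; then
            mv "$f" "$dir/$acc.gbk"
        fi
    done
}
```

For example, rename_genbank_downloads ~/Desktop would process the files saved to the Desktop directory.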
Use the Perl script bp_genbank2gff3.pl to convert the GenBank files to GFF3 format:
bp_genbank2gff3.pl -noCDS *.gbk
You need to modify the GFF files so that the correct SO term is used:
perl -pi~ -e 's|processed_transcript|mature_transcript|' *.gff
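To confirm that the substitution took effect, you can tally the feature types in column 3 of each GFF3 file. The helper below is a hypothetical sketch, not part of BioPerl or GMOD; it ignores comment lines and any lines that do not have the nine tab-separated GFF3 columns (such as an embedded FASTA section).

```shell
# Hypothetical helper: count features per type (GFF3 column 3).
gff_type_counts() {
    awk -F '\t' '!/^#/ && NF >= 9 { n[$3]++ }
                 END { for (t in n) print t "\t" n[t] }' "$1" | sort
}

# Example: report the counts for every GFF file in the current directory.
for gff in *.gff; do
    [ -e "$gff" ] || continue
    echo "== $gff"
    gff_type_counts "$gff"
done
```

After the substitution, processed_transcript should no longer appear in the output.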
Then load the GFF3 files that have been created:
gmod_bulk_load_gff3.pl -organism Pfalciparum -dbname chado_pathogen \
  -dbuser gmod -dbport 5432 -dbpass dd -recreate_cache < NC_004314.gbk.gff
gmod_bulk_load_gff3.pl -organism Pknowlesi -dbname chado_pathogen \
  -dbuser gmod -dbport 5432 -dbpass dd -recreate_cache < NC_011907.gbk.gff
gmod_bulk_load_gff3.pl -organism Pknowlesi -dbname chado_pathogen \
  -dbuser gmod -dbport 5432 -dbpass dd -recreate_cache < NC_011909.gbk.gff
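If you load your own sequences, the same command has to be repeated for each file. The sketch below only prints the commands so they can be reviewed first; the function name print_load_commands is hypothetical, and the database name, user, port and password are simply the values used in this tutorial.

```shell
# Hypothetical helper: print (rather than run) one loader command per file.
# Usage: print_load_commands ORGANISM FILE...
print_load_commands() {
    organism=$1
    shift
    for gff in "$@"; do
        echo "gmod_bulk_load_gff3.pl -organism $organism -dbname chado_pathogen -dbuser gmod -dbport 5432 -dbpass dd -recreate_cache < $gff"
    done
}

print_load_commands Pfalciparum NC_004314.gbk.gff
print_load_commands Pknowlesi NC_011907.gbk.gff NC_011909.gbk.gff
```

Once the printed commands look right, the output can be piped to sh to run them.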
Download Artemis and ACT
cvs -d :pserver:email@example.com:/cvsroot/pathsoft co artemis
Now compile the software:
cd artemis
make
Try running the art script in the download:
./art -Dchado="localhost:5432/chado_pathogen?gmod" -Dibatis
This opens the login window:
The Artemis Database Manager and File Manager will open once your login has been authenticated. The top part of this relates to the Chado database and the bottom comprises the file management:
Select the sequence NC_004314 and double click on it to open it up in Artemis.
There are 3 main components to the Artemis window. The two top Feature Displays show the sequence at different levels of granularity and below these is a feature list:
- the top Feature Display is a zoomed-out view of the sequence. The 3 forward and 3 reverse frames of translation are shown, with stop codons marked as black vertical lines.
- the second Feature Display shows the sequence at the nucleotide level. The amino acid translations are seen in this view.
- the Feature List shows the feature types and location. Options for displaying user defined qualifiers (e.g. Dbxref) can be accessed by right clicking on this list and selecting "Show Selected Qualifiers".
These three components are connected, so that if you select a feature in one then that feature becomes selected in the others. Double clicking on the feature centers the feature in both feature displays. The scroll bars on the right hand side of the feature displays allow you to zoom in and out.
The alternative way to open your sequence is to provide the entry (e.g. Pfalciparum:NC_004314) you want to open as a command line argument:
./art -Dchado="localhost:5432/chado_pathogen?gmod" -Dibatis \
  Pfalciparum:NC_004314
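Judging from the examples in this tutorial, the value passed to -Dchado has the form host:port/database?user. The small helper below (a hypothetical function, not part of Artemis) assembles that string from its parts, which can be convenient when switching between servers.

```shell
# Hypothetical helper: build the value for -Dchado.
# Usage: chado_location HOST PORT DATABASE USER
chado_location() {
    printf '%s:%s/%s?%s\n' "$1" "$2" "$3" "$4"
}

# Example (run from the artemis directory):
#   ./art -Dchado="$(chado_location localhost 5432 chado_pathogen gmod)" -Dibatis \
#     Pfalciparum:NC_004314
chado_location localhost 5432 chado_pathogen gmod
```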
For any of the gene features in Artemis you can select them and press the short cut key 'E' (Edit → Selected Features in Editor). This opens up the Gene Builder. Within this the Gene Model can be edited and annotation added.
It is also possible to launch the Artemis Gene Builder in a standalone mode for a particular gene:
etc/gene_builder -Dchado="localhost:5432/chado_pathogen?gmod" -Dibatis -Dshow_log PF10_0003
or in read-only mode you can open a gene in GeneDB (at the Sanger Institute):
etc/gene_builder -Dchado="db.genedb.org:5432/snapshot?genedb_ro" -Dibatis -Dshow_log -Dread_only PFA0010c
Note that the JVM option 'show_log' opens the log window.
Edit etc/options (to change settings globally) or create a file ~/.artemis_options in your home directory (for your own settings). There are various flags that can be used to configure Artemis and ACT with Chado.
chado_servers: This allows you to provide a list of available servers for the user to select from:
chado_servers = \
  Plasmodium localhost:5432/chado_pathogen?username \
  GeneDB db.genedb.org:5432/snapshot?genedb_ro
product_cv: In the Pathogen Genomics Group the product qualifiers are stored as an ontology (as a cv in feature_cvterm). This can be changed so that they are stored as featureprops by setting the product_cv option:
This will mean that the product will be shown in the "Core" section of the Artemis Gene Builder rather than the "Controlled Vocabulary" section.
synonym_cvname: If synonym types are loaded into a CV, Artemis checks this ontology.
set_obsolete_on_delete: This sets the default behaviour of Artemis when features are deleted. If set to:
the features will be made obsolete. The user is still prompted with the option to permanently delete the feature. If this line is not in the option file the default is to permanently delete features.
Selecting an alternative gene model
Artemis supports two types of gene model representation:
A) Pathogen Genomics Gene Model - implicit CDS + explicit UTRs
gene
  |
  |- part_of mRNA
        |
        |---- part_of exon
        |
        |---- derives_from polypeptide
        |
        |---- part_of five_prime_UTR
        |
        |---- part_of three_prime_UTR
B) implicit CDS + implicit UTRs
gene
  |
  |- part_of mRNA
        |
        |---- part_of exon
        |
        |---- derives_from polypeptide
The Artemis default is model A. To use model B then set:
sequence_update_features: This lists the feature types for which Artemis will maintain the feature.residues column. This is generally useful for polypeptide and transcript features.
Artemis Database Manager
The database manager provides the list of organisms that have features with residues (currently Artemis searches for these on features of type: '*chromosome*', '*sequence*', 'supercontig', 'ultra_scaffold', 'golden_path_region', 'contig'). The database manager is cached between sessions (this is on by default and can be switched off with -Ddatabase_manager_cache_off). There is an option under the File menu to clear this cache.
Adding Controlled Vocabulary Qualifiers in the Artemis Gene Builder
These use evidence codes, which are stored as feature_cvtermprops with a type_id that corresponds to cvterm.name = 'evidence'. There is a useful SQL script (etc/chado_extra.sql) in the Artemis distribution for creating this term in Chado. Run this on the chado_pathogen instance of the database:
psql -d chado_pathogen -f etc/chado_extra.sql
(This will also create other terms that are used to store literature (PMID's) qualifiers.)
GO terms can now be selected in the Controlled Vocabulary (CV) section of the Gene Builder and added to features. Additional custom CVs can also be used. For Artemis to recognise and display a custom CV, its name must be prefixed with 'CC_'. Such CVs then appear in a drop-down list when adding CV terms to a feature. Try adding a new CV:
INSERT INTO cv ( name, definition ) VALUES ( 'CC_test', 'test' );
and create a CvTerm in this CV:
INSERT INTO dbxref ( db_id, accession )
VALUES ( (SELECT db_id FROM db WHERE name = 'CCGEN'), 'test1' );
INSERT INTO cvterm ( cv_id, name, dbxref_id )
VALUES (
  (SELECT cv_id FROM cv WHERE name = 'CC_test'),
  'test1',
  (SELECT dbxref_id FROM dbxref WHERE accession = 'test1')
);
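Before re-launching Artemis you may want to check that the new term is in place. The sketch below prints a query against the standard Chado cv, cvterm and dbxref tables; the function name cc_term_query is hypothetical, and the CV name is inserted into the SQL without escaping, so it is only suitable for trusted input such as 'CC_test'.

```shell
# Hypothetical helper: print a query listing the terms in one CV.
# Usage: cc_term_query CV_NAME
cc_term_query() {
    cat <<SQL
SELECT cv.name AS cv, cvterm.name AS term, dbxref.accession
FROM cvterm
JOIN cv ON cv.cv_id = cvterm.cv_id
JOIN dbxref ON dbxref.dbxref_id = cvterm.dbxref_id
WHERE cv.name = '$1';
SQL
}

# To run it against the tutorial database:
#   cc_term_query CC_test | psql -d chado_pathogen
cc_term_query CC_test
```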
Now re-launch Artemis and open the Gene Builder at any feature and go to the 'Controlled Vocabulary' section and click the 'ADD' button. This CV (CC_test) will appear in the drop down menu:
Click on CC_test and hit the 'Next' button. This opens a keyword selection box. If you leave this blank all the terms are retrieved and displayed. If you keep clicking 'Next' this term is then added to the 'Controlled Vocabulary' section.
Transfer Annotation Tool (TAT)
This tool can be accessed from the Gene Builder - look for the TAT button. It allows you to transfer annotation between sequences. In database mode Artemis provides an editable list of genes constructed from ortholog/paralog links. These links can be added in the Gene Builder in the Match section (for example you can try creating the ortholog link between PF10_0165 in Pfalciparum and PKH_060110 in Pknowlesi).
Note that you can easily access the logging information Artemis produces. In the Artemis launch window, select 'Show Log Window' under the 'Options' menu to view the logs. Logging is controlled by etc/log4j.properties. The logs can be useful for debugging and for monitoring activity if appended to a central file. See the log4j documentation for more information.
ACT can read sequences in from the database as well. However, it currently does not read the BLAST comparisons from Chado but reads this data from files. These comparisons are displayed as the matches between the sequences. To distinguish forward and reverse matches the forward matches are red and reverse matches are blue.
For convenience the comparison files have been pre-generated for this exercise and can be downloaded from:
wget ftp://ftp.sanger.ac.uk/pub/pathogens/workshops/GMOD2009/NC_004314_v_NC_011907_tblastx.gz
wget ftp://ftp.sanger.ac.uk/pub/pathogens/workshops/GMOD2009/NC_004314_v_NC_011909_tblastx.gz
Note that both Artemis and ACT automatically open gzipped files. For details on generating these files go to ACT Comparison Files.
To run ACT use the act script:
./act -Dchado="localhost:5432/chado_pathogen?gmod" -Dibatis
From the 'File' menu select the option 'Open Database and SSH File Manager' and login. Drag and drop the Plasmodium entries from the Database Manager into the ACT selection window. Also, drag and drop the comparison files into this window, so it looks something like this (note the featureId numbers may well be different as these are the Chado feature_id):
Click on Apply to read these entries and open up ACT. You can use the right hand scroll bar to zoom in and out. If you zoom out you can identify the regions that match between these sequences.
ACT can display multiple pairwise comparisons, so the two P.knowlesi sequences can be compared to the P.falciparum sequence. From the ACT launch window go to the File menu and select 'Open Database and SSH File Manager'. Drag in the sequences and comparison files (clicking on 'more files' to add the additional sequence and comparison).
Zooming out you will see that Pfalciparum chromosome 10 matches regions in Pknowlesi chromosomes 6 and 8.
Writing Out Sequence Files
Artemis can write out EMBL and GFF files for an entry opened from the database. You can optionally flatten the gene model (i.e. gene, transcript, exon) to just a CDS feature. Also an option is given to ignore any obsolete features. For EMBL it uses mappings for conversion of the keys and qualifiers. These mappings are stored in the etc/key_mapping and etc/qualifier_mapping files.
A utility script (etc/writedb_entry) is also provided as a means of writing out multiple sequences from the database. The script takes the following options:
-h             show help
-f [y|n]       flatten the gene model, default is y
-i [y|n]       ignore obsolete features, default is y
-s             space separated list of sequences to read and write out
-o [EMBL|GFF]  output format, default is EMBL
-a [y|n]       for EMBL submission format change to n, default is y
etc/writedb_entry -Dchado="localhost:5432/chado_pathogen?gmod" NC_004314
There is an Artemis mailing list: artemis-user.