Artemis-Chado Integration Tutorial
This Artemis-Chado Integration tutorial was presented by Robin Houston, Tim Carver and Giles Velarde at the 2009 GMOD Summer School - Europe, August 2009. The most recent Artemis tutorial can be found at the Artemis Tutorial page.
- Username: gmod
- Password: gmod
This tutorial describes the world as it existed on the day the tutorial was given. Please be aware that things like CPAN modules, Java libraries, and Linux packages change over time, so the instructions in the tutorial will gradually go out of date. Newer versions of tutorials will be posted as they become available.
In this tutorial we present how to install and configure Artemis and ACT for use with a Chado database. The first two sections cover installing Postgres and Chado. They are included for completeness only; refer to the Chado session for more details.
Artemis is a DNA sequence browser which works with flat files (e.g. EMBL, GenBank, GFF) and more recently with Chado databases. ACT (Artemis Comparison Tool) is based on Artemis. ACT uses BLAST comparison files to highlight regions of interest between pairs of sequences. Artemis and ACT in database mode are increasingly being used in the Pathogen Genomics Group at the Sanger Institute.
Download and Install Postgres
./configure --prefix=/home/gmod/gmod_test/pgsl --with-pgport=5432 --with-includes=/Developer
make
make install
cd /home/gmod/gmod_test/pgsl
bin/initdb -D data/
Add the following line to data/postgresql.conf:
listen_addresses = 'localhost'
Start the server and create the database:
postmaster -D data &
createuser --createdb username
createlang plpgsql template1
createdb --port=5432 chado_pathogen
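Before moving on, it can be worth confirming that the new database is visible to the server. The following is a minimal sketch rather than part of the original tutorial: `db_in_listing` is a hypothetical helper name, and it assumes the listing comes from `psql -A -t -l`, which prints one database per line with the name as the first `|`-separated field.

```shell
#!/bin/sh
# Hypothetical helper (not part of Postgres or Chado): succeeds if the
# database named in $1 appears in a listing read from standard input.
# The listing is assumed to be the output of `psql -A -t -l`, where the
# database name is the first '|'-separated field on each line.
db_in_listing() {
    cut -d'|' -f1 | grep -qx -- "$1"
}

# Example use against the server started above (not run here):
#   psql --port=5432 -A -t -l | db_in_listing chado_pathogen && echo "found"
```

Matching the whole first field (`grep -x`) avoids a false positive from a database whose name merely starts with `chado_pathogen`.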
Download and Install Chado
- Download stable release (gmod-1.0.tar.gz)
- Install BioPerl (http://www.bioperl.org/wiki/Installing_Bioperl_for_Unix)
- Install go-perl (http://search.cpan.org/~cmungall/go-perl/)
- Install Bundle::GMOD from CPAN
export GMOD_ROOT=/usr/local/gmod CHADO_DB_NAME=chado_pathogen CHADO_DB_USERNAME=username CHADO_DB_PORT=5432
Now compile Chado and install the standard components (schema and ontologies):
perl Makefile.PL
make
sudo make install
make load_schema
make prepdb
make ontologies
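The build steps above rely on the four environment variables exported earlier. As a hedged sketch (the function name `check_chado_env` is an illustration, not something shipped with Chado), a small check like the following could be run first to catch a variable that was never set:

```shell
#!/bin/sh
# Hypothetical helper: report each required variable that is unset or
# empty on stderr, and return non-zero if any are missing.
check_chado_env() {
    missing=0
    for name in GMOD_ROOT CHADO_DB_NAME CHADO_DB_USERNAME CHADO_DB_PORT; do
        # Indirect lookup of the variable whose name is in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example (not run here):
#   check_chado_env && perl Makefile.PL
```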
Examples of Loading Sequences into the Database
In this section we detail how to load 3 Plasmodium sequences into Chado for viewing in Artemis and ACT. Alternatively you can use your own sequences of interest.
The GenBank files are available from Entrez with the links below. Make sure you download each record with its sequence by clicking on the option 'Show sequence' and then 'Update View'. Then go to the Download menu and select GenBank(Full):
- NC_004314 (Plasmodium falciparum 3D7 chromosome 10)
- NC_011907 (Plasmodium knowlesi chromosome 6) and
- NC_011909 (Plasmodium knowlesi chromosome 8).
These are usually downloaded to the Desktop directory (depending on the browser) and saved as something like sequences.gbwithparts. Rename them to NC_004314.gbk, NC_011907.gbk and NC_011909.gbk. Pfalciparum and Pknowlesi will need to be added to your organism table in Chado:
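Because each download arrives under a generic file name, it is easy to mix the three files up when renaming them by hand. The sketch below is an illustration only: the helper names are hypothetical, and it assumes each file contains a standard GenBank `ACCESSION` line whose second field is the accession. It renames a file to `<accession>.gbk` in the same directory, overwriting any existing file of that name.

```shell
#!/bin/sh
# Hypothetical helper: print the accession from the first ACCESSION line
# of a GenBank flat file (the second whitespace-separated field).
genbank_accession() {
    awk '$1 == "ACCESSION" { print $2; exit }' "$1"
}

# Hypothetical helper: rename a downloaded file to <accession>.gbk in the
# same directory. Fails without renaming if no accession is found.
rename_by_accession() {
    acc=$(genbank_accession "$1")
    [ -n "$acc" ] || { echo "no ACCESSION line in $1" >&2; return 1; }
    mv -- "$1" "$(dirname "$1")/$acc.gbk"
}

# Example (file name as saved by the browser; adjust as needed):
#   rename_by_accession ~/Desktop/sequences.gbwithparts
```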
<syntaxhighlight lang="sql">
INSERT INTO organism ( abbreviation, genus, species, common_name )
VALUES ( 'Pfalciparum', 'Plasmodium', 'falciparum', 'Pfalciparum'),
       ( 'Pknowlesi', 'Plasmodium', 'knowlesi', 'Pknowlesi');
</syntaxhighlight>

Use the Perl script <tt>bp_genbank2gff3.pl</tt> to convert the GenBank files to [[GFF3]] format:

 bp_genbank2gff3.pl -noCDS *.gbk

You need to modify the GFF files so that the correct SO term is used:

 perl -pi~ -e 's|processed_transcript|mature_transcript|' *.gff

Then load the GFF3 files that have been created:

 gmod_bulk_load_gff3.pl -organism Pfalciparum -dbname chado_pathogen \
   -dbuser gmod -dbport 5432 -dbpass dd -recreate_cache < NC_004314.gbk.gff
 
 gmod_bulk_load_gff3.pl -organism Pknowlesi -dbname chado_pathogen \
   -dbuser gmod -dbport 5432 -dbpass dd -recreate_cache < NC_011907.gbk.gff
 
 gmod_bulk_load_gff3.pl -organism Pknowlesi -dbname chado_pathogen \
   -dbuser gmod -dbport 5432 -dbpass dd -recreate_cache < NC_011909.gbk.gff

==Download Artemis and ACT==

You can download [http://www.sanger.ac.uk/Software/Artemis/ Artemis] and [http://www.sanger.ac.uk/Software/ACT/ ACT] from their home pages at the Sanger Institute. For the most up-to-date developments download the software from the [[Glossary#CVS|CVS]] server:

 cvs -d :pserver:firstname.lastname@example.org:/cvsroot/pathsoft co artemis

Now compile the software:

 cd artemis
 make

Or download the development version from the [http://www.sanger.ac.uk/Software/Artemis/#development Development section] on the Artemis home page. Note that on the Artemis web site there is also a [http://www.sanger.ac.uk/Software/Artemis/stable/ stable] release available.

==Running Artemis==

Try running the <tt>art</tt> script in the download:

 ./art -Dchado="localhost:5432/chado_pathogen?gmod" -Dibatis

This opens the login window:

[[File:ArtemisLogin.gif]]

The Artemis Database Manager and File Manager will open once your login has been authenticated.
The top part of this relates to the [[Chado]] database and the bottom comprises the file management:

[[File:DatabaseManager.gif]]

Select the sequence NC_004314 and double click on it to open it up in Artemis.

[[File:Artemis.gif]]

There are 3 main components to the Artemis window. The two top Feature Displays show the sequence at different levels of granularity and below these is a feature list:

# the '''top Feature Display''' is a zoomed out view of the sequence. The 3 forward and 3 reverse frames of translation are shown with stop codons marked as black vertical lines.
# the '''second Feature Display''' shows the sequence at the nucleotide level. The amino acid translations are seen in this view.
# the '''Feature List''' shows the feature types and location. Options for displaying user defined qualifiers (e.g. Dbxref) can be accessed by right clicking on this list and selecting "Show Selected Qualifiers".

These three components are connected, so that if you select a feature in one then that feature becomes selected in the others. Double clicking on the feature centers the feature in both feature displays. The scroll bars on the right hand side of the feature displays allow you to zoom in and out.

The alternative way to open your sequence is to provide the entry (e.g. Pfalciparum:NC_004314) you want to open as a command line argument:

 ./art -Dchado="localhost:5432/chado_pathogen?gmod" -Dibatis \
     Pfalciparum:NC_004314

For any of the gene features in Artemis you can select them and press the shortcut key 'E' (Edit → Selected Features in Editor). This opens up the Gene Builder. Within this the Gene Model can be edited and annotation added.
[[File:GeneBuilder.gif|GeneBuilder]]

It is also possible to launch the Artemis Gene Builder in a standalone mode for a particular gene:

 etc/gene_builder -Dchado="localhost:5432/chado_pathogen?gmod" -Dibatis -Dshow_log PF10_0003

Or in read-only mode you can open a gene in GeneDB (at the Sanger Institute):

 etc/gene_builder -Dchado="db.genedb.org:5432/snapshot?genedb_ro" -Dibatis -Dshow_log -Dread_only PFA0010c

Note that using the JVM option 'show_log' will open the log window.

==Configuration Options==

Edit <tt>etc/options</tt> (to change settings globally) or create a file <tt>~/.artemis_options</tt> in your home directory (for your own settings). There are various flags that can be used to configure Artemis and ACT with [[Chado]].

'''chado_servers'''

This allows you to provide a list of available servers for the user to select:

 chado_servers = \
   Plasmodium localhost:5432/chado_pathogen?username \
   GeneDB db.genedb.org:5432/snapshot?genedb_ro

'''product_cv'''

In the Pathogen Genomics Group the product qualifiers are stored as an ontology (as a cv in feature_cvterm). This can be changed so that they are stored as featureprops by setting the product_cv option:

 product_cv=no

This will mean that the product will be shown in the "Core" section of the Artemis Gene Builder rather than the "Controlled Vocabulary" section.

'''synonym_cvname'''

If synonym types are loaded into a CV, Artemis checks this ontology.

'''set_obsolete_on_delete'''

This will set the default behaviour of Artemis when features are deleted. If set to:

 set_obsolete_on_delete=yes

the features will be made obsolete. The user is still prompted with the option to permanently delete the feature. If this line is not in the option file the default is to permanently delete features.
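Single-line options such as <tt>product_cv</tt> and <tt>set_obsolete_on_delete</tt> can also be set from a script rather than by editing the file by hand. The following is a rough sketch under stated assumptions: <tt>set_artemis_option</tt> is a hypothetical helper (not part of Artemis), it handles only options that fit on one <tt>key=value</tt> line (so not the multi-line <tt>chado_servers</tt> list), and it assumes the key contains no regular expression metacharacters.

```shell
#!/bin/sh
# Hypothetical helper: set key=value in an options file ($1), replacing
# an existing single-line entry for the same key or appending a new one.
# Not suitable for multi-line (backslash-continued) options.
set_artemis_option() {
    file=$1 key=$2 value=$3
    touch "$file"
    # Drop any existing "key=..." or "key = ..." line; grep exits
    # non-zero when it selects no lines, which is not an error here.
    grep -v "^[[:space:]]*$key[[:space:]]*=" "$file" > "$file.tmp" || true
    echo "$key=$value" >> "$file.tmp"
    mv "$file.tmp" "$file"
}

# Example (not run here):
#   set_artemis_option "$HOME/.artemis_options" set_obsolete_on_delete yes
```

Rewriting to a temporary file and moving it into place keeps the other settings in the file untouched.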
'''Selecting an alternative gene model'''

Artemis supports 2 types of gene model representations:

 A) Pathogen Genomics Gene Model - implicit CDS + explicit UTRs
 gene
   |
   |- part_of mRNA
      |
      |---- part_of exon
      |
      |---- derives_from polypeptide
      |
      |---- part_of five_prime_UTR
      |
      |---- part_of three_prime_UTR
 
 B) implicit CDS + UTRs
 gene
   |
   |- part_of mRNA
      |
      |---- part_of exon
      |
      |---- derives_from polypeptide

The Artemis default is model A. To use model B, set:

 chado_infer_CDS_UTR=yes

'''sequence_update_features'''

This lists the feature types for which Artemis will maintain the feature.residues column. This is generally useful for polypeptide and transcript features.

==Artemis Database Manager==

The database manager provides the list of organisms that have features with residues (currently Artemis searches for these on features of type: '*chromosome*', '*sequence*', 'supercontig', 'ultra_scaffold', 'golden_path_region', 'contig'). The database manager is cached between sessions (this is on by default and can be switched off with <tt>-Ddatabase_manager_cache_off</tt>). There is an option under the File menu to clear this cache.

==Adding Controlled Vocabulary Qualifiers in the Artemis Gene Builder==

These use evidence codes which are stored as feature_cvtermprops with a type_id which corresponds to a cvterm.name = 'evidence'. There is a useful [[Glossary#SQL|SQL]] script (<tt>etc/chado_extra.sql</tt>) in the Artemis distribution for creating this term in [[Chado]]. Run this on the chado_pathogen instance of the database:

 psql -d chado_pathogen -f etc/chado_extra.sql

(This will also create other terms that are used to store literature (PMID) qualifiers.)

GO terms can now be selected in the Controlled Vocabulary (CV) section of the Gene Builder and added to features.

Additional custom CVs can also be used. For Artemis to recognise and display a custom CV, the name of the CV needs to be prefixed by 'CC_'. These CVs then appear in a drop down list when adding CV terms to a feature.
Try adding a new CV:

 psql chado_pathogen

<syntaxhighlight lang="sql">
INSERT INTO cv ( name, definition ) VALUES ( 'CC_test', 'test' );
</syntaxhighlight>

and create a CvTerm in this CV:

<syntaxhighlight lang="sql">
INSERT INTO dbxref ( db_id, accession )
VALUES ( (SELECT db_id FROM db WHERE name = 'CCGEN'), 'test1' );

INSERT INTO cvterm ( cv_id, name, dbxref_id )
VALUES ( (SELECT cv_id FROM cv WHERE name = 'CC_test'),
         'test1',
         (SELECT dbxref_id FROM dbxref WHERE accession = 'test1') );
</syntaxhighlight>

Now re-launch Artemis, open the Gene Builder at any feature, go to the 'Controlled Vocabulary' section and click the 'ADD' button. This CV (CC_test) will appear in the drop down menu:

[[File:AddCV.gif]]

Click on CC_test and hit the 'Next' button. This opens a keyword selection box. If you leave this blank all the terms are retrieved and displayed. If you keep clicking 'Next' this term is then added to the 'Controlled Vocabulary' section.

==Transfer Annotation Tool (TAT)==

This tool can be accessed from the Gene Builder - look for the TAT button. It allows you to transfer annotation between sequences. In database mode Artemis provides an editable list of genes constructed from ortholog/paralog links. These links can be added in the Gene Builder in the Match section (for example you can try creating the ortholog link between PF10_0165 in ''Pfalciparum'' and PKH_060110 in ''Pknowlesi'').

==Logging Information==

Note that you can easily access the logging information Artemis produces. In the Artemis launch window, under the 'Options' menu, select 'Show Log Window' to display the logs. Logging is controlled by <tt>etc/log4j.properties</tt>. The logs can be useful for debugging and for monitoring activity if appended to a central file. See the [http://logging.apache.org/log4j/ log4j] documentation for more information.

==Running ACT==

ACT can read sequences in from the database as well.
However, it currently does not read the BLAST comparisons from [[Chado]] but reads this data from files. These comparisons are displayed as the matches between the sequences. To distinguish forward and reverse matches, the forward matches are red and reverse matches are blue.

For convenience the comparison files have been pre-generated for this exercise and can be downloaded from:

 wget ftp://ftp.sanger.ac.uk/pub/pathogens/workshops/GMOD2009/NC_004314_v_NC_011907_tblastx.gz
 wget ftp://ftp.sanger.ac.uk/pub/pathogens/workshops/GMOD2009/NC_004314_v_NC_011909_tblastx.gz

Note that both Artemis and ACT automatically open gzipped files. For details on generating these files go to [[ACT Comparison Files]].

To run ACT use the <tt>act</tt> script:

 ./act -Dchado="localhost:5432/chado_pathogen?gmod" -Dibatis

From the 'File' menu select the option 'Open Database and SSH File Manager' and log in. Drag and drop the ''Plasmodium'' entries from the Database Manager into the ACT selection window. Also, drag and drop the comparison files into this window, so it looks something like this (note the featureId numbers may well be different as these are the Chado feature_id):

[[File:ActSelection2seqs.gif]]

Click on Apply to read these entries and open up ACT. You can use the right hand scroll bar to zoom in and out. If you zoom out you can identify the regions that match between these sequences.

[[File:Pf10_Pk6.gif]]

ACT can display multiple pairwise comparisons, so the two <tt>P.knowlesi</tt> sequences can be compared to the <tt>P.falciparum</tt> sequence. From the ACT launch window go to the File menu and select 'Open Database and SSH File Manager'. Drag in the sequences and comparison files (clicking on 'more files' to add the additional sequence and comparison).

[[File:ActSelection.gif]]

Zooming out you will see that ''Pfalciparum'' chromosome 10 matches to regions in ''Pknowlesi'' chromosomes 6 and 8.
[[File:Pk6_Pf10_Pk8.gif]]

==Writing Out Sequence Files==

Artemis can write out EMBL and [[GFF]] files for an entry opened from the database. You can optionally flatten the gene model (i.e. gene, transcript, exon) to just a CDS feature. An option is also given to ignore any obsolete features. For EMBL it uses mappings for conversion of the keys and qualifiers. These mappings are stored in the <tt>etc/key_mapping</tt> and <tt>etc/qualifier_mapping</tt> files.

A utility script (<tt>etc/writedb_entry</tt>) is also provided as a means of writing out multiple sequences from the database. The script takes the following options:

 -h              show help
 -f [y|n]        flatten the gene model, default is y
 -i [y|n]        ignore obsolete features, default is y
 -s              space separated list of sequences to read and write out
 -o [EMBL|GFF]   output format, default is EMBL
 -a [y|n]        for EMBL submission format change to n, default is y

Try running:

 etc/writedb_entry -Dchado="localhost:5432/chado_pathogen?gmod" NC_004314

==Mailing List==

There is an Artemis mailing list: [http://lists.sanger.ac.uk/mailman/listinfo/artemis-users artemis-users].

==References==

* [http://www.sanger.ac.uk/Software/Artemis/ Artemis home page]
* [http://www.sanger.ac.uk/Software/ACT/ ACT home page]
* [http://www.sanger.ac.uk/Software/Artemis/v11/chado/ Artemis Connecting to Chado Databases]
* [http://www.sanger.ac.uk/Software/Artemis/v11/database/chado.practical.guide.pdf User Practical Guide]

[[Category:Tutorials]]
[[Category:Annotation]]