GBrowse UCSC Plugin Install HOWTO

From GMOD
Latest revision as of 19:45, 15 January 2014

This page contains detailed installation instructions for setting up UCSC Genome Browser comparative genomics track data and GBrowse plugins to display them. It assumes that you already have a working installation of GBrowse and MySQL.


Prerequisites

  • A working installation of GBrowse.
  • A working installation of MySQL (http://www.mysql.com/).
  • Plenty of disk space, ideally local to your server.
  • For the Conservation plugin: a C compiler and the MySQL client library.

Disk space requirements vary depending on how many assemblies, alignments and types of tracks you would like to install. Things to consider:

  • Temporary disk space for MySQL dump files used to load the database tables.
  • MySQL server disk space requirements for database tables.
  • If loading a Conservation track, permanent disk space for data files indexed by the MySQL database tables.

Here are some example approximate sizes for various tracks and organisms (note that each genome assembly usually has many Chain and Net tracks, and one Conservation track):

Track description           MySQL dump files   MySQL table Data_length   External files
Worm-Worm Chain             170M               120M                      n/a
Worm-Worm Net               25M                20M                       n/a
Worm 5-way Conservation     75M                65M                       465M
Human-Mammal Chain          3,000M             2,000M                    n/a
Human-Mammal Net            510M               400M                      n/a
Human 28-way Conservation   2,530M             2,000M                    79,300M

UCSC naming conventions and other relevant subtleties

An understanding of UCSC conventions is not really a prerequisite, but will be quite helpful for understanding the installation process.

UCSC database names

UCSC builds a separate MySQL database of annotations for each genome assembly. Each database contains at least one table for each "track" (set of annotations). The database name begins with 2 or 6 letters indicating the species, followed by a sequence number which usually starts with 1 on the first assembly that UCSC processes.

Jim Kent wrote the UCSC Genome Browser while working on the Human Genome Project, so all databases for the human genome start with "hg". For example, hg18 is the database of annotations for the 18th assembly of the human genome displayed in the UCSC Genome Browser; the assembly itself is better known as NCBI Build 36.

The next species added to the browser was mouse, and the first letters of its genus and species names were used: "mm" for Mus musculus. That convention was followed for rat (rn), C. elegans (ce), D. melanogaster (dm) and several other species added before mid-2003. Then, in order to avoid name clashes among the deluge of newly sequenced species, a new convention was established: when new species are added to the browser, we use the first three letters of the genus in lowercase, and first three letters of the species with the first letter capitalized (e.g. "bosTau" for Bos taurus).
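
The six-letter convention can be sketched as a small shell function. This is only an illustration of the rule described above; the function name ucsc_db_prefix is made up for this example, and the result still needs the sequence number appended (and does not apply to the older two-letter names such as hg, mm or ce).

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of any UCSC or GBrowse tool):
# build the six-letter prefix used for species added after mid-2003.
ucsc_db_prefix() {
  local genus species g s first rest
  genus=$1
  species=$2
  # First three letters of the genus, lowercased.
  g=$(printf '%s' "$genus" | cut -c1-3 | tr '[:upper:]' '[:lower:]')
  # First three letters of the species, lowercased ...
  s=$(printf '%s' "$species" | cut -c1-3 | tr '[:upper:]' '[:lower:]')
  # ... then capitalize only the first of those three letters.
  first=$(printf '%s' "$s" | cut -c1 | tr '[:lower:]' '[:upper:]')
  rest=$(printf '%s' "$s" | cut -c2-)
  printf '%s%s%s\n' "$g" "$first" "$rest"
}

ucsc_db_prefix Bos taurus             # prints bosTau
ucsc_db_prefix Drosophila ananassae   # prints droAna
```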

Chromosome naming

Most genome assemblies use sequence names for chromosomes that match chromosome designations in the literature -- for example, 1, 2, 3 for human, and I, II, III for worm. UCSC prefixes those sequence names with "chr", to make it easier for its software to recognize chromosome sequence names. UCSC's annotations for human refer to chr1, chr2, chr3 etc., and for worm, chrI, chrII, chrIII etc.

Naming of database tables for comparative genomics tracks

For assemblies that consist of a reasonably small number of sequences (e.g. a few dozen chromosomes or mapping groups, possibly with concatenated unplaced sequences in virtual "chromosomes"), a track's data may be split into per-sequence tables instead of kept in one main table. This applies to the Chain tracks but not Net tracks, since Chain tracks contain much more data. When split, table names for a track end with the same track name, prefixed by the sequence name and _. For example, if a track named foo is split, its database tables are named chr*_foo (or using the MySQL wildcard, chr%\_foo).

On the other hand, if an assembly consists of thousands of separate unplaced sequences (e.g. WGS scaffolds), then creating one table per sequence for each split track would overwhelm MySQL. Therefore, splitting does not apply to such assemblies, and a track named foo would have one table named foo.

Chain tracks use two different types of database table to efficiently store the usually quite large amounts of data. The main tables are named chainOtherDb, and auxiliary "chainLink" tables are named chainOtherDbLink. (Or, if split, chr*_chainOtherDb and chr*_chainOtherDbLink.)

Chain and Net tracks contain pairwise alignments to a particular assembly of some other species' genome. The database table names for those tracks include the UCSC database name for that other species' assembly. For example, the chained alignments of UCSC's first D. ananassae build (the query) to UCSC's third D. melanogaster build (the target or reference) would be stored in the dm3 database, as the tables chr*_chainDroAna1 and chr*_chainDroAna1Link -- split because dm3 consists of a manageable number of chromosomes and concatenated unplaced sequences. The corresponding netted chains would be stored in netDroAna1.
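
To make the Chain table naming concrete, here is a minimal sketch that prints the expected table names for one Chain track. The function name chain_tables is an invention for this example; pass the other assembly's UCSC database name, followed by the sequence names if the track is split (or nothing if it is not).

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the table names for a Chain track.
#   $1        = UCSC database name of the other assembly (e.g. droAna1)
#   $2 ...    = sequence names if the track is split; omit if it is not
chain_tables() {
  local other=$1 first rest track seq
  shift
  # The other database name is appended with its first letter capitalized.
  first=$(printf '%s' "$other" | cut -c1 | tr '[:lower:]' '[:upper:]')
  rest=$(printf '%s' "$other" | cut -c2-)
  track="chain${first}${rest}"
  if [ "$#" -eq 0 ]; then
    # Not split: one main table and one Link table.
    printf '%s\n%s\n' "$track" "${track}Link"
  else
    # Split: one pair of tables per sequence, prefixed by the sequence name and _.
    for seq in "$@"; do
      printf '%s_%s\n%s_%s\n' "$seq" "$track" "$seq" "${track}Link"
    done
  fi
}

chain_tables droAna1 chr2L chr2R
```

For the split example this prints chr2L_chainDroAna1, chr2L_chainDroAna1Link, chr2R_chainDroAna1 and chr2R_chainDroAna1Link, one per line.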

The Conservation track combines two data types: per-base conservation scores and multi-species alignments. Both data types use not only database tables, but also external files, to store the data due to its massive size.

The scores are stored in UCSC's "wiggle" format, which utilizes both database tables and external files. The database table is usually called phastConsNway where N is the number of aligned species including the reference. The database references external files that contain binary compressed data.

The multi-species alignments are stored in external files in the MAF format; those files are indexed in the multizNway database table. The maf table refers to numeric IDs of the files; those numeric IDs and the actual file paths are stored in the extFile table.

Assembly and track meta-info

UCSC's names for databases and track tables are rather opaque by themselves, so we store information about them in separate tables. Assembly databases are described by the dbDb table in the central database, hgcentral. Each assembly database contains a trackDb table that describes all tracks in that database.


Installation Overview

  1. Download MySQL table data from UCSC.
  2. Create mysql database and load tables.
    • If adding the Conservation track, also download external data files.
  3. Download the GBrowse plugin and glyph modules
  4. Install and configure plugin(s).


Installation Details

Which UCSC database?

First, identify the UCSC database name that corresponds to the assembly version on which your GBrowse installation is built. For example, WormBase version WS170 corresponds to UCSC database ce4. This version correspondence is usually available on the assembly description page at http://genome.ucsc.edu/cgi-bin/hgGateway ; you might need to change the clade and genome in order to see your species of interest and its available assembly dates. If you still aren't sure about the correspondence, ask mailto:genome@soe.ucsc.edu for assistance.

Which tables for my track(s) of interest?

These are the database tables required for each type of track:

Track type     MySQL table(s)
all tracks     chromInfo, trackDb, hgcentral.dbDb
Chain          chainOtherDb, chainOtherDbLink
               (or if split: chr*_chainOtherDb, chr*_chainOtherDbLink)
Net            netOtherDb
Conservation   multizNway, extFile, multizNwayFrames, multizNwaySummary, phastConsNway*


Download mysql dump files from UCSC

All UCSC Genome Browser data can be downloaded from hgdownload.cse.ucsc.edu. The HTTP, FTP and rsync protocols are supported.

For each database table $TABLE, there are two files: $TABLE.sql, which contains a CREATE statement for MySQL to create the table, and $TABLE.txt.gz, a gzip-compressed tab-separated text file with the contents of the table. In these examples of how to fetch the data using different protocols, $TMPDIR is a local temporary storage directory, $DB is the database, and $TABLE is the table. In practice, you will probably write a script that loops on multiple tables and possibly even multiple databases.

# Example 1: use rsync to fetch .sql and .txt.gz files:
mkdir -p $TMPDIR/$DB
rsync -avP \
  rsync://hgdownload.cse.ucsc.edu/genome/goldenPath/$DB/database/$TABLE.{sql,txt.gz} \
  $TMPDIR/$DB/
gunzip $TMPDIR/$DB/$TABLE.txt.gz

# Example 2: use wget to fetch files using ftp:// or http:// :
# (-N is not used here because wget's timestamping has no effect together with -O)
mkdir -p $TMPDIR/$DB
wget -O $TMPDIR/$DB/$TABLE.sql \
  ftp://hgdownload.cse.ucsc.edu/genome/goldenPath/$DB/database/$TABLE.sql
wget -O $TMPDIR/$DB/$TABLE.txt.gz \
  ftp://hgdownload.cse.ucsc.edu/genome/goldenPath/$DB/database/$TABLE.txt.gz
gunzip $TMPDIR/$DB/$TABLE.txt.gz
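
A looping script of the kind mentioned above might look like the following sketch. The names run, fetch_tables, BASE_URL and DRY_RUN are made up for this example; the table list in the final comment is only an illustration for the worm database ce4. With DRY_RUN=1 the commands are printed rather than executed, which is a convenient way to check the URLs first.

```bash
#!/usr/bin/env bash
# Sketch: fetch the .sql and .txt.gz dumps for several tables of one database.
# Assumes $TMPDIR is set as in the examples above.
BASE_URL=rsync://hgdownload.cse.ucsc.edu/genome/goldenPath

# Run a command, or just print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Usage: fetch_tables DB TABLE [TABLE ...]
fetch_tables() {
  local db=$1 table
  shift
  run mkdir -p "$TMPDIR/$db"
  for table in "$@"; do
    run rsync -avP \
      "$BASE_URL/$db/database/$table.sql" \
      "$BASE_URL/$db/database/$table.txt.gz" \
      "$TMPDIR/$db/"
    # -f overwrites an older uncompressed copy from a previous run.
    run gunzip -f "$TMPDIR/$db/$table.txt.gz"
  done
}

# Example (commented out so that sourcing this file does not start a download):
# DRY_RUN=1 fetch_tables ce4 chromInfo trackDb netCb3 multiz5way extFile
```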

Also download the central database .sql, which creates and populates several assembly metadata tables.

wget -O $TMPDIR/hgcentral.sql \
  http://hgdownload.cse.ucsc.edu/admin/hgcentral.sql

If you encounter any problems while fetching the files, contact mailto:genome-mirror@soe.ucsc.edu . Please include which database and file(s), which version of downloading tool you were using, error messages if any, and any other relevant info.

Create and load the MySQL database

Having downloaded the necessary data files, create a local MySQL database hgcentral and a database for each assembly version for which you are downloading tracks. In order for the Conservation plugin to work, your MySQL genome assembly database must have the same name as the UCSC database ($DB above).

mysql -u root -pPassword -e "create database hgcentral;"
mysql -u root -pPassword hgcentral < $TMPDIR/hgcentral.sql
mysql -u root -pPassword -e "create database $DB;"

Here is how to create and populate a single table. In practice, you will most likely write a script that loops on $TABLE and possibly on $DB:

mysql $DB -u root -pPassword < $TMPDIR/$DB/$TABLE.sql
mysql $DB -u root -pPassword -e "load data local infile '$TMPDIR/$DB/$TABLE.txt' into table $TABLE"
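
The loop over $TABLE could be sketched as below. The function name load_tables and the MYSQL_OPTS variable are assumptions made for this example, and the sketch expects the dump files to be in $TMPDIR/$DB/ as in the download examples. The --local-infile=1 client option is included because some MySQL installations disable "load data local" by default; whether you need it depends on your server and client settings.

```bash
#!/usr/bin/env bash
# Sketch: create and populate several tables in one database.
# MYSQL_OPTS is a hypothetical variable holding your connection options.
MYSQL_OPTS=${MYSQL_OPTS:-"-u root -pPassword"}

# Usage: load_tables DB TABLE [TABLE ...]
load_tables() {
  local db=$1 table
  shift
  for table in "$@"; do
    if [ ! -f "$TMPDIR/$db/$table.sql" ] || [ ! -f "$TMPDIR/$db/$table.txt" ]; then
      echo "skipping $table: dump files not found in $TMPDIR/$db" >&2
      continue
    fi
    # $MYSQL_OPTS is intentionally unquoted so that it splits into separate options.
    mysql $MYSQL_OPTS "$db" < "$TMPDIR/$db/$table.sql" &&
      mysql $MYSQL_OPTS --local-infile=1 "$db" \
        -e "load data local infile '$TMPDIR/$db/$table.txt' into table $table"
  done
}

# Example: load_tables ce4 chromInfo trackDb netCb3
```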

Conservation track only: download external files

The Conservation track combines multiple alignments and per-base scores for multiple species. This is a very large amount of data, and in order to access and display it quickly, the actual data are kept in external files, and the database tables simply index those files. So, having installed the multizNway and phastConsNway database tables, you can use mysql to determine what files are needed from UCSC.

Determine the files needed for multizNway and phastConsNway by running these mysql commands:

select distinct(path) from extFile,multizNway where extFile.id = multizNway.extFile;
select distinct(file) from phastConsNway;

The paths all start with "/gbdb/", and can be downloaded from hgdownload.cse.ucsc.edu using the /gbdb/... path as follows, depending on your chosen protocol:

rsync://hgdownload.cse.ucsc.edu/gbdb/...
ftp://hgdownload.cse.ucsc.edu/gbdb/...
http://hgdownload.cse.ucsc.edu/gbdb/...

Ideally, create a local /gbdb/ and install the files in the same paths as referenced by the sql tables. If that is not possible, replace the pathnames in the sql tables with your local paths to the corresponding files.
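
As a sketch of the path-to-URL mapping described above, the following function converts one /gbdb/ path, as returned by the queries, into a download URL. The function name gbdb_url is made up for this example, and the file name in the comments is a placeholder rather than a real UCSC file.

```bash
#!/usr/bin/env bash
# Sketch: turn a /gbdb/... path from the database into a download URL.
#   $1 = path as stored in the table (must start with /gbdb/)
#   $2 = protocol: rsync (default), ftp or http
gbdb_url() {
  local path=$1 proto=${2:-rsync}
  case $path in
    /gbdb/*)
      printf '%s://hgdownload.cse.ucsc.edu%s\n' "$proto" "$path"
      ;;
    *)
      echo "not a /gbdb/ path: $path" >&2
      return 1
      ;;
  esac
}

# Example with a placeholder file name:
#   gbdb_url /gbdb/ce4/multiz5way/example.maf http
# The paths can be read straight from mysql (-N omits the column header,
# -B selects tab-separated batch output):
#   mysql -N -B $DB -u root -pPassword -e "select distinct(file) from phastConsNway" |
#     while IFS= read -r p; do gbdb_url "$p"; done
```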

Download and install the GBrowse plugin(s)

The plugin and glyph modules are available by anonymous Git:

git clone git://genome-source.cse.ucsc.edu/gmod-ucsc.git

The directory structure mirrors that of the main GBrowse repository. Copy or link all files from gmod-ucsc/Generic-Genome-Browser/ into the corresponding locations in your GBrowse installation's Generic-Genome-Browser tree, then run make and make install. NOTE: this includes Makefile.PL and MANIFEST; hand-merge them if you have local changes!

Note: In order for the Conservation plugin to work, your GBrowse Makefile.PL needs to have been configured with DO_XS enabled (compilation of C extensions).
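
One cautious way to do the copy step is sketched below: it copies each file from the checkout into the GBrowse tree, but leaves alone any file that already exists there with different contents and reports it for hand-merging. The function name overlay_tree and the example destination path are assumptions for this illustration, not part of the gmod-ucsc distribution.

```bash
#!/usr/bin/env bash
# Sketch: copy files from the gmod-ucsc checkout into a GBrowse source tree,
# skipping any file that already exists there with different contents.
#   $1 = source tree (e.g. gmod-ucsc/Generic-Genome-Browser)
#   $2 = destination tree (your GBrowse source directory)
overlay_tree() {
  local src=$1 dest=$2 rel
  ( cd "$src" && find . -type f ) | while IFS= read -r rel; do
    rel=${rel#./}
    if [ -f "$dest/$rel" ] && ! cmp -s "$src/$rel" "$dest/$rel"; then
      # The destination has its own version: leave it alone and report it.
      echo "MERGE BY HAND: $rel"
    else
      mkdir -p "$dest/$(dirname "$rel")"
      cp "$src/$rel" "$dest/$rel"
    fi
  done
}

# Example (the destination path is a placeholder):
#   overlay_tree gmod-ucsc/Generic-Genome-Browser /path/to/Generic-Genome-Browser
```

Files reported as MERGE BY HAND, most likely Makefile.PL and MANIFEST, still need to be merged manually before running make.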

These are the Perl module files required for each plugin:

plugin             Perl modules (glyph and plugin)
all                Bio/Graphics/Glyph/ucsc_glyph.pm
                   Bio/Graphics/Browser/Plugin/UcscPlugin.pm
UcscChain          Bio/Graphics/Glyph/ucsc_chain.pm
                   Generic_Genome_Browser/conf/plugins/UcscChain.pm
UcscNet            Bio/Graphics/Glyph/ucsc_net.pm
                   Generic_Genome_Browser/conf/plugins/UcscNet.pm
UcscConservation   Generic_Genome_Browser/conf/plugins/UcscWiggle.pm
                   Generic_Genome_Browser/conf/plugins/UcscConservation.pm

Conservation track only: C extension required

While UcscChain and UcscNet have pure-Perl implementations, UcscConservation requires a compiled C extension, Bio::Graphics::Browser::UcscTrackImage. The Perl, XS, C and H files are in Generic_Genome_Browser/libucsc/*. The top-level "make" and "make install" in Generic-Genome-Browser will descend into libucsc and build the extension if Makefile.PL and MANIFEST have been appropriately updated and if Makefile.PL was configured with DO_XS enabled.

Configure plugins in GBrowse

Add the plugin names (UcscChain, UcscNet, and/or UcscConservation) to the plugins setting in the appropriate .conf file(s): for GBrowse 1.52, datasource.conf; for GBrowse 2.0, GBrowse.conf (see also GBrowse_Install_HOWTO#Plugins).

Add plugin settings like the following to datasource.conf:

[UcscPlugin:plugin]
db = ce4
user = mysqlReadOnlyUser
pass = mysqlReadOnlyUsersPassword
seq_prefix = chr
split_prefix = chr

[UcscChain:plugin]
default_enable = chainCb3

[UcscNet:plugin]
default_enable = netCb3 netCaeRem2

[UcscConservation:plugin]
default_enable = multiz5way

These are the supported [UcscPlugin:plugin] settings, shared by all Ucsc plugins (but can be overridden in the section for each plugin if necessary):

setting        required?   description
db             Yes         The name of the MySQL database where UCSC tables have been loaded.
user           Yes         The name of a MySQL user that has permission to read db.
pass           Yes         The password for user.
seq_prefix     probably    If the datasource has sequence names like I, II, III or 2L, 2R, 3L, set this to chr so that they can be translated into UCSC's names: chrI, chrII, chrIII or chr2L, chr2R, chr3L etc. This is usually the case.
split_prefix   probably    If all of the sequences in the datasource are chromosome names (not separate scaffolds, contigs etc), set this to chr if seq_prefix is chr.
host           prob. not   The machine on which the MySQL server runs, if not the same machine on which GBrowse runs.
port           prob. not   The port number used by the MySQL server, if not the same as MySQL's default.
central_db     prob. not   The name of the MySQL database in which the dbDb table is stored, if not hgcentral.
central_user   prob. not   The MySQL user for central_db, if not the same as user. Must be able to read and write to central_db.
central_pass   prob. not   The password for central_user, if not the same as pass.
central_host   prob. not   The machine on which the MySQL server that serves up central_db runs, if not the same as host.
central_port   prob. not   The port number used by central_host, if not the same as port.

This setting is only for actual track plugins (e.g. [UcscChain:plugin]), not for the base class [UcscPlugin:plugin]:

setting          required?   description
default_enable   should be   A whitespace-separated list of chain or net tracks to display by default. E.g., chainOtherDb1 chainOtherDb2 in the UcscChain:plugin section, netOtherDb1 netOtherDb2 in the UcscNet:plugin section. If not specified, all tracks will be displayed by default, which could overwhelm the display.

Conservation track only: conf/UcscTrackImage.cfg file

The Conservation plugin's C extension (Bio::Graphics::Browser::UcscTrackImage) requires a file named UcscTrackImage.cfg to exist in the GBrowse configuration directory, Generic-Genome-Browser/conf/. The contents of the file have a lot in common with the GBrowse .conf settings above:

db.host=localhost
db.user=mysqlReadOnlyUser
db.password=mysqlReadOnlyUsersPassword
db.trackDb=trackDb

central.db=hgcentral
central.host=localhost
central.user=mysqlHgcentralRWUser
central.password=mysqlHgcentralRWUsersPassword

Note that the central.user account, unlike the db.user, needs to have write access -- but only to hgcentral.
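
The two permission levels could be set up with GRANT statements along these lines. This is a sketch under assumptions: the function name grant_sql is made up, the accounts are assumed to exist already (for example, created with CREATE USER), and "write access" is interpreted here as INSERT, UPDATE and DELETE, which may be more or less than the extension actually needs.

```bash
#!/usr/bin/env bash
# Sketch: print GRANT statements for the read-only and read-write accounts.
#   $1 = assembly database (e.g. ce4)
#   $2 = read-only user (db.user)
#   $3 = read-write user for hgcentral (central.user)
#   $4 = host the users connect from (default: localhost)
grant_sql() {
  local db=$1 ro_user=$2 rw_user=$3 host=${4:-localhost}
  # db.user only needs to read the assembly database.
  printf "GRANT SELECT ON %s.* TO '%s'@'%s';\n" "$db" "$ro_user" "$host"
  # central.user needs to read and write, but only in hgcentral.
  printf "GRANT SELECT, INSERT, UPDATE, DELETE ON hgcentral.* TO '%s'@'%s';\n" \
    "$rw_user" "$host"
}

# Review the output, then pipe it to mysql as an administrative user, e.g.:
#   grant_sql ce4 mysqlReadOnlyUser mysqlHgcentralRWUser | mysql -u root -pPassword
```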

Test GBrowse

Reinstall GBrowse by running 'make install' in Generic_Genome_Browser/, watch for errors at the end of the GBrowse web server's error log file, and start using GBrowse. Here is an example command to watch the log file that might work if you are using Apache:

tail -f /usr/local/apache/logs/error_log

In GBrowse, below the image, in the Tracks section, Analysis subsection, set the checkboxes by the UCSC plugin track names and click Update Image. UCSC tracks should appear in the display. If not, there may be no data in the current region; try viewing a region that contains an exon or gene. Or the error log may contain a message that indicates what is missing.


Performance tweaks

rsync mysql table binary files

If you have permissions to modify MySQL's binary files, and are investing significant effort into developing an automated regular update of track data, you may want to try rsyncing the binary files directly from UCSC. That replaces the downloading of $TABLE.{sql,txt.gz} files and loading into MySQL.

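# -n makes this a dry run that only lists what would be transferred;
# remove it to actually copy the files.
rsync -navP rsync://hgdownload.cse.ucsc.edu/mysql/$DB/$TABLE.\* \
  /var/lib/mysql/$DB/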


Bug Reports and Support Requests

Please send general questions and/or bug reports to mailto:genome@soe.ucsc.edu. If your question is specifically about fetching data from hgdownload.cse.ucsc.edu, send it to mailto:genome-mirror@soe.ucsc.edu.


AUTHORS

Angie Hinrichs mailto:angiehinrichs@users.sourceforge.net