JBrowse Configuration Guide

From GMOD
Revision as of 19:50, 3 August 2011 by Ian Holmes

Setting up JBrowse involves placing a copy of the jbrowse repository somewhere in the web-servable part of your server's filesystem, and then running several server-side scripts that use a data source (e.g. files on your computer) to produce additional JSON-format files in the data subdirectory. If these JBrowse-generated files are in a location where the webserver can present them to clients, then a user pointing their web browser at the appropriate URL will see the JBrowse interface, including sequence and feature tracks reflecting the data source.

There is a particular order that should be followed when adding data to JBrowse. Reference sequences should be added first, followed by feature tracks. Once all of the tracks have been added, it is possible to make the names of each feature searchable. While there is some flexibility in this order of events (it is possible to add additional reference sequences after feature tracks have been added, for example), the first step will always be to specify a sequence or set of sequences, and the last step will always be to make the named features searchable (assuming it is desired that all feature names are searchable).
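
The order described above can be sketched as a short shell script. This is only an illustration: the file names genome.fa and genes.gff3 and the track label genes are hypothetical, and the run helper is not part of JBrowse. It prints each command instead of executing it when the script cannot be found, so the sketch can be tried outside a JBrowse checkout.

```shell
# Hypothetical helper: run the command if the script exists and is executable,
# otherwise just print what would have been run.
run() {
    if [ -x "$1" ]; then
        "$@"
    else
        echo "(dry run) $*"
    fi
}

# 1. Reference sequences first (genome.fa is a made-up file name).
run bin/prepare-refseqs.pl --fasta genome.fa

# 2. Then feature tracks, one flatfile-to-json.pl run per track.
run bin/flatfile-to-json.pl --gff genes.gff3 --tracklabel genes --autocomplete all

# 3. Finally, index the feature names so that they become searchable.
run bin/generate-names.pl
```

Run from the top of the jbrowse directory with real input files, the same three steps would execute the scripts instead of printing them.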

User Interface

1. Location Marker: Click and drag to move to a different genomic position.
2. Hidden Tracks: Drag a track to this area to hide it.
3. Window Slider: Resize the viewing field.
4. Scroll Buttons: Click to scroll by a fixed amount at a given zoom level.
5. Viewing Field: Drag a track to this area to make it visible. Depending on the track, some zooming may be necessary.
6. Zoom Buttons: Click to zoom. Per click, the larger buttons zoom more than the smaller buttons.
7. Chromosome Selector: Choose which chromosome to view.
8. Search Bar: Browse to a certain region by searching for a location or feature name.

Reference Sequences

The reference sequence is a sequence that is representative of the feature data. It might be a consensus sequence from an alignment, or simply a sequence of interest. Reference sequences must be loaded into JBrowse before any feature tracks can be added. This is handled by the prepare-refseqs.pl script.

prepare-refseqs.pl

This script is used to input sequence data into JBrowse, and must be run prior to the addition of feature tracks. The simplest way to use it is with the --fasta option, which uses a single sequence or set of reference sequences from a FASTA file:

bin/prepare-refseqs.pl --fasta <fasta file> [options]

If the file has multiple sequences, each sequence will become a reference sequence by default. You may switch between these sequences by selecting the sequence of interest via the pull-down menu to the right of the large "zoom in" button.

You may use any alphabet you wish for your sequences (that is, you are not restricted to the nucleotides A, T, C, and G; any alphanumeric character, as well as several other characters, may be used). Hence, it is possible to browse RNA and protein in addition to DNA. However, two characters should be avoided, the hyphen and the question mark, because they cause the displayed sequence to "split": part of the sequence is cut off and continues on the next line. Unfortunately, this prevents the use of hyphens to represent gaps in a reference sequence.
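
Because hyphens and question marks cause problems, it may be worth checking a FASTA file for them before running prepare-refseqs.pl. The following is a minimal sketch using standard grep; the example file contents and the count_bad_lines helper are made up for illustration and are not part of JBrowse.

```shell
# Create a tiny example FASTA file (hypothetical content).
tmpdir=$(mktemp -d)
cat > "$tmpdir/example.fa" <<'EOF'
>seq1
ACGTACGT
>seq2
ACG-TAC?
EOF

# Count sequence lines (header lines starting with '>' are ignored)
# that contain a hyphen or a question mark.
count_bad_lines() {
    grep -v '^>' "$1" | grep -c '[-?]' || true
}

bad=$(count_bad_lines "$tmpdir/example.fa")
echo "sequence lines containing '-' or '?': $bad"
```

A count of zero suggests the file is free of the two problematic characters.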

In addition to reading from a FASTA file, prepare-refseqs.pl can read sequences from a GFF file or a database. To read sequences from a database, a config file must be used.

Syntax used to import sequences from gff files:

bin/prepare-refseqs.pl --gff <gff file with sequence information> [options]

Syntax used to import sequences with a config file:

bin/prepare-refseqs.pl --conf <config file that references a database with sequence information> --[refs|refids] <reference sequences> [options]
Option Value
fasta, gff, or conf Path to the file that JBrowse will use to import sequences. With the fasta and gff options, the sequence information is imported directly from the specified file. With the conf option, the specified config file includes the details necessary to access a database that contains the sequence information. Exactly one of these three options must be used.
out A path to the output directory (default is 'data' in the current directory)
seqdir The directory where the reference sequences are stored (default: <output directory>/seq)
noseq Causes no reference sequence track to be created. This is useful for reducing disk usage.
refs A comma-delimited list of the names of sequences to be imported as reference sequences. This option (or refids) is required when using the conf option. It is not required when the fasta or gff options are used, but it can be useful with these options, since it can be used to select which sequences JBrowse will import.
refids A comma-delimited list of the database identifiers of sequences to be imported as reference sequences. This option is useful when working with a Chado database that contains data from multiple species, and those species have at least one chromosome with the same name (e.g. chrX). In this case, the desired chromosome cannot be uniquely identified by name, so it is instead identified by ID. This ID can be found in the 'feature_id' column of the 'feature' table in a Chado database.
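
As an illustration of the refs option, the sketch below builds a comma-delimited list of sequence names from a multi-sequence FASTA file and prints the resulting command. The file contents are made up, and the sketch assumes that a sequence's name is the first word after '>' on its header line, which is the usual FASTA convention but is not stated above.

```shell
# Hypothetical multi-sequence FASTA file.
fa=$(mktemp)
cat > "$fa" <<'EOF'
>chr1 first chromosome
ACGT
>chr2
GGCC
>scaffold_9
TTAA
EOF

# Sequence names: the first word after '>' on each header line (assumed convention).
names=$(grep '^>' "$fa" | sed 's/^>//; s/[[:space:]].*$//')

# Keep only the names starting with "chr" and join them with commas.
refs=$(printf '%s\n' "$names" | grep '^chr' | paste -sd, -)

# Print (rather than run) the command that would import just those sequences.
echo "bin/prepare-refseqs.pl --fasta $fa --refs $refs"
```

Here scaffold_9 is left out of the list, so only chr1 and chr2 would be imported.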

Feature Tracks

The feature tracks are the most important components of JBrowse. They can be used to visualize information about a sequence, such as sequence conservation, RNA base pairing, and the locations of transposons. There are a number of scripts that can be used to input various types of feature tracks into JBrowse:

flatfile-to-json.pl

This script inputs a single track into JBrowse. To put multiple tracks into JBrowse, it must be executed repeatedly.

Terminology: A flat file is a database that exists entirely in a single file. For this script, the flat file must be a GFF3, GFF2, or BED file.

Basic syntax:

bin/flatfile-to-json.pl --[gff|gff2|bed] <flat file> --tracklabel <track name> [options]

Hint: flatfile-to-json.pl simplifies the process of inputting a small number of tracks into JBrowse, since it does not use a config file. If you have many tracks, you will probably want to use a config file, because its structure will make the task of editing tracks easier. In that case, the appropriate script will be biodb-to-json.pl.

Summary of flatfile-to-json.pl options.
Option Value
gff, gff2, or bed The name of the file that contains the feature data. The names of these options correspond to the file types, with the exception of gff, which uses a GFF3 file instead of a GFF file. Exactly one of these three options must be used.
tracklabel The internal name that JBrowse will give to this feature track. This option requires a value.
key The external, human-readable label seen on the feature track when it is viewed in JBrowse. The value of key defaults to the value of tracklabel.
autocomplete Dictates what the features of the track will be searchable by after running generate-names.pl. This option can be used with the arguments "label", "alias", "all", or "none". By default, "none" is used.
  • label: Make the features searchable by the viewable name that they are associated with in JBrowse. In a gff3 file, this will be the "Name" in the attributes column.
  • alias: Make the features searchable by an alternate name defined in the input file. In a gff3 file, this will be the "Alias" in the attributes column.
  • all: Make the features searchable by both their labels and their aliases.
  • none: Make the features searchable by neither their labels nor their aliases.
out A path to the output directory (default is 'data' in the current directory).
cssClass The css class that will be used to create the feature track. This option makes it possible to choose how the feature track will look by selecting a template class from genome.css. The default css class is 'feature'.
getType Causes the 'type' to be included in the output JSON file. The type is the feature that has been predicted (e.g. promoter site, gene). If a gff file is being used, the type will be in column 3.
getPhase Causes the 'phase' to be included in the output JSON file. The phase describes the reading frame of a DNA (or messenger RNA) sequence. If the phase is relevant, it can have the values 0, 1, or 2; otherwise, the value associated with the phase is '.'. If a gff file is being used, the phase will be in column 8.
getSubs Causes subfeature data to be included in the output JSON file.
getLabel Causes the 'Name' attribute associated with each feature to be included in the output JSON file. This will cause a textual name to appear below the features in the track. If a gff3 file is being used, the 'Name' attribute will be in column 9 when it is defined.
urlTemplate A url that your browser will visit when you click on a feature in this track. This is especially useful if you want to link a feature to a page with more information about that feature.
arrowheadClass When this option is used, directional features will be given an arrowhead. The presence and orientation of the arrowhead for each individual feature will depend on data in the input file. Arrowhead classes are defined in genome.css. There is only one that comes with JBrowse (transcript-arrowhead).
subfeatureClasses The css class(es) that will be used for the subfeatures of a feature track. This option makes it possible to choose how the subfeatures will appear. Any of the classes in genome.css can be used for the subfeatures. This option must be used with getSubs in order for subfeatures to appear.
clientConfig Any visual additions or edits for the main features of the track (not for subfeatures). These edits must be specified in JSON syntax.
type The type of feature that will appear in the feature track. This option is useful when the input file contains features of several different types, and you are interested in only having one type of feature (e.g. only having features that are genes) in the feature track. In gff3 files, the type is in the third column.
extraData Use additional information from the input file to create variations in the appearance or behavior of individual features. This option is meant to be used in conjunction with other options. For each feature in the track, a perl subroutine is used to extract additional information, which is then associated with a variable. The value of this variable can be different for each feature. When the name of this variable is surrounded by curly braces and used in the argument for a different option, such as urlTemplate, the feature-specific data is used.
nclChunk The NCList chunk size. This option should not be used unless an error such as "json or perl structure exceeds maximum nesting level" is encountered. If this error does occur, lower the chunk size (the default is 50000).
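
Several of the options above refer to specific GFF3 columns. As a rough illustration, the following sketch uses awk to pull the type (column 3), the phase (column 8), and the Name and Alias attributes (column 9) out of a single made-up GFF3 line. The feature line and the get_attr helper are hypothetical and are not part of JBrowse.

```shell
# One hypothetical GFF3 feature line (tab-separated, nine columns).
line=$(printf 'chr1\texample\tCDS\t100\t200\t.\t+\t0\tID=cds1;Name=geneA;Alias=gA')

# Column 3 holds the type and column 8 holds the phase.
type=$(printf '%s\n' "$line" | awk -F'\t' '{ print $3 }')
phase=$(printf '%s\n' "$line" | awk -F'\t' '{ print $8 }')

# Column 9 holds semicolon-separated key=value attributes such as Name and Alias.
get_attr() {
    printf '%s\n' "$line" | awk -F'\t' -v key="$1" '{
        n = split($9, attrs, ";")
        for (i = 1; i <= n; i++) {
            split(attrs[i], kv, "=")
            if (kv[1] == key) print kv[2]
        }
    }'
}

echo "type=$type phase=$phase name=$(get_attr Name) alias=$(get_attr Alias)"
```

In this example, a track loaded with autocomplete set to "all" would be expected to make the feature searchable by both geneA and gA once generate-names.pl has been run.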

bam-to-json.pl

This script inputs a track into JBrowse using a BAM file. Tracks added with this script are similar in appearance to tracks added by flatfile-to-json.pl.

Special dependencies: SAMtools, Bio::DB::Sam

Basic syntax:

bin/bam-to-json.pl --bam <bam file> --tracklabel <track name> [options]
Option Value
bam The name of the bam file that contains the feature data. This option requires a value.
tracklabel The internal name that JBrowse will give to this feature track. This option requires a value.
key The external, human-readable label seen on the feature track when it is viewed in JBrowse. The value of key defaults to the value of tracklabel.
out A path to the output directory (default is 'data' in the current directory).
cssClass The css class that will be used to create the feature track. This option makes it possible to choose how the feature track will look by selecting a template class from genome.css. The default css class is 'feature'.
clientConfig Any visual additions or edits for the main features of the track (not for subfeatures). These edits must be specified in JSON syntax.
nclChunk The NCList chunk size in bytes. This option should not be used unless an error such as "json or perl structure exceeds maximum nesting level" is encountered. If this error does occur, lower the chunk size (the default is 50000 bytes).
compress This option causes the output JSON files for the track (trackData.json and hist-*.json) to be compressed with gzip.

biodb-to-json.pl

This script uses a config file to produce a set of feature tracks in JBrowse. It can be used to obtain information from any database with an appropriate schema, or from flat files. Because it can produce several feature tracks in a single execution, it is useful for large-scale feature data entry into JBrowse.

Basic syntax:

bin/biodb-to-json.pl --conf <config file> [options]
Option Value
conf The name of the JSON configuration file that will be used. This option must be specified.
out A path to the output directory (default is 'data' in the current directory).
track The identifier of a single track that will be updated or added to JBrowse. In the list of key-value pairs comprising an individual track definition in the config file, the identifier will be the value associated with "track".
ref A comma-delimited list of reference sequence names, used to limit database queries to a subset of JBrowse reference sequences. By default, the database is queried for all reference sequences in JBrowse.
refid A comma-delimited list of reference sequence IDs from a Chado database, used to limit database queries to a subset of JBrowse reference sequences. By default, the database is queried for all reference sequences in JBrowse.
compress This option causes the output JSON files for the track (trackData.json and hist-*.json) to be compressed with gzip.

ucsc-to-json.pl

This script uses data from the UCSC genome annotation database. To reach this data, go to hgdownload.cse.ucsc.edu and click the link for the genome of interest. Next, click the "Annotation Database" link. The data relevant to ucsc-to-json.pl (*.sql and *.txt.gz files) can be downloaded from either this page or the FTP server described on this page.

Together, a *.sql and *.txt.gz pair of files (such as cytoBandIdeo.txt.gz and cytoBandIdeo.sql) constitute a database table. The ucsc-to-json.pl script uses the *.sql file to get the column labels, and it uses the *.txt.gz file to get the data for each row of the table. For the example pair of files above, the name of the database table is "cytoBandIdeo". This will become the name of the JBrowse track that is produced from the data in the table.

In addition to all of the feature-containing tables that you want to use as JBrowse tracks, you will also need to download the trackDb.sql and trackDb.txt.gz files for the organism of interest.

Basic syntax:

bin/ucsc-to-json.pl --in <directory with files from UCSC> --track <database table name> [options]

Hint: If you're using this approach, it might be convenient to also download the sequence(s) from UCSC. These are usually available from the "Data set by chromosome" link for the particular genome or from the FTP server.

Option Value
in A directory containing all of the *.sql and *.txt.gz data from UCSC. This directory must contain the trackDb.sql and trackDb.txt.gz files for the organism of interest, as well as all of the feature-containing tables that you wish to use as JBrowse tracks.
track The name of the database table, that is, the name of the table's files without the .sql or .txt.gz extension (e.g. cytoBandIdeo).
out A path to the output directory (default is 'data' in the current directory).
cssClass The css class that will be used to create the feature track. This option makes it possible to choose how the feature track will look by selecting a template class from genome.css. The default css class is 'feature'.
arrowheadClass When this option is used, directional features will be given an arrowhead. The presence and orientation of the arrowhead for each individual feature will depend on data in the input file. Arrowhead classes are defined in genome.css. There is only one that comes with JBrowse (transcript-arrowhead).
subfeatureClasses The css class(es) that will be used for the subfeatures of a feature track. This option makes it possible to choose how the subfeatures will appear. Any of the classes in genome.css can be used for the subfeatures.
clientConfig Any visual additions or edits for the main features of the track (not for subfeatures). These edits must be specified in JSON syntax.
nclChunk The NCList chunk size in bytes. This option should not be used unless an error such as "json or perl structure exceeds maximum nesting level" is encountered. If this error does occur, lower the chunk size (the default is 50000 bytes).
compress This option causes some of the output JSON files (trackData.json and hist-*.json) to be compressed with gzip.
sortMem The maximum amount of RAM (in bytes) to use for sorting the features. The default value is 536870912 bytes (512MiB).
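
The relationship between the downloaded files and the value of the track option can be sketched as follows. The directory here is a temporary stand-in filled with empty placeholder files, and the list_tables helper is hypothetical; the sketch only shows how table names are derived from file names and prints the commands instead of running them.

```shell
# Stand-in for a directory of files downloaded from UCSC (empty placeholder files).
indir=$(mktemp -d)
touch "$indir/trackDb.sql" "$indir/trackDb.txt.gz" \
      "$indir/cytoBandIdeo.sql" "$indir/cytoBandIdeo.txt.gz" \
      "$indir/gap.sql"    # deliberately missing its .txt.gz partner

# List table names that have both a .sql and a .txt.gz file. trackDb is skipped
# because it must be present but is not itself loaded as a feature track here.
list_tables() {
    for sql in "$1"/*.sql; do
        name=$(basename "$sql" .sql)
        if [ "$name" != "trackDb" ] && [ -f "$1/$name.txt.gz" ]; then
            echo "$name"
        fi
    done
}

# Print one ucsc-to-json.pl command per usable table.
for table in $(list_tables "$indir"); do
    echo "bin/ucsc-to-json.pl --in $indir --track $table"
done
```

Only cytoBandIdeo is listed, because gap has no matching .txt.gz file in this example.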

draw-basepair-track.pl

This script inputs a single base pairing track into JBrowse. A base pairing track is a distinctive track type that represents base pairing between nucleotides as arcs.

Basic syntax:

bin/draw-basepair-track.pl --gff <gff file> --tracklabel <track name> [options]
Summary of draw-basepair-track.pl options.
Option Value
gff The name of the gff file that will be used. This option must be specified.
tracklabel The internal name that JBrowse will give to this feature track. This option requires a value.
key The external, human-readable label seen on the feature track when it is viewed in JBrowse. The value of key defaults to the value of tracklabel.
out A path to the output directory (default is 'data' in the current directory).
tile The directory where the tiles, or images corresponding to each zoom level of the track, are stored. Defaults to data/tiles.
bgcolor The color of the track background. Specified as "RED,GREEN,BLUE" in base ten numbers between 0 and 255. Defaults to "255,255,255".
fgcolor The color of the track foreground (i.e. the base pairing arcs). Specified as "RED,GREEN,BLUE" in base ten numbers between 0 and 255. Defaults to "0,255,0".
width The width in pixels of each tile. The default value is 2000.
height The height in pixels of each tile. Changing this parameter will cause a corresponding change in the top-to-bottom height of the track in JBrowse. The default value is 100.
thickness The thickness of the base pairing arcs in the track. The default value is 2.
nolinks Disables use of file system links to compress duplicate image files.
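
The bgcolor and fgcolor options expect decimal "RED,GREEN,BLUE" values, whereas colors are often written as hexadecimal codes. The following sketch converts one form to the other with standard shell tools. The hex_to_rgb helper, the file name pairs.gff, and the track label basepairs are hypothetical, and the final command is printed, not run.

```shell
# Convert a six-digit hexadecimal color (with or without a leading '#')
# to the decimal "RED,GREEN,BLUE" form.
hex_to_rgb() {
    hex=${1#\#}                                  # drop a leading '#', if any
    r=$(printf '%s\n' "$hex" | cut -c1-2)
    g=$(printf '%s\n' "$hex" | cut -c3-4)
    b=$(printf '%s\n' "$hex" | cut -c5-6)
    printf '%d,%d,%d\n' "0x$r" "0x$g" "0x$b"     # printf reads 0x.. as hexadecimal
}

fg=$(hex_to_rgb '#00ff00')
bg=$(hex_to_rgb 'ffffff')

# Print a command that uses the converted colors.
echo "bin/draw-basepair-track.pl --gff pairs.gff --tracklabel basepairs --fgcolor $fg --bgcolor $bg"
```

The two colors used here convert to the documented defaults, "0,255,0" and "255,255,255".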

wig-to-json.pl

Using a WIG file, this script inputs a single wiggle track into JBrowse. In a wiggle track, a numeric value is associated with each nucleotide position in the reference sequence. This is represented in JBrowse as a track that looks like a histogram, where the horizontal axis is for each nucleotide position, and the vertical axis is for the number associated with that position. The vertical axis currently does not have a scale; rather, the heights for each position are relative to each other.

Special dependencies: libpng

In order to use wig-to-json.pl, the code for wig2png must be compiled. This can be done with the following command:

make

Note: If you are using Mac OS X, it might be necessary to execute 'make' in the following way:

make GCC_LIB_ARGS=-L/usr/X11/lib GCC_INC_ARGS=-I/usr/X11/include

Basic syntax:

bin/wig-to-json.pl --wig <wig file> --tracklabel <track name> [options]

Hint: If you are using this type of track to plot a measure of a prediction's quality, where the range of possible quality scores runs from some lower bound to some upper bound (for instance, between 0 and 1), you can specify these bounds with the min and max options.

Summary of wig-to-json.pl options.
Option Value
wig The name of the wig file that will be used. This option must be specified.
tracklabel The internal name that JBrowse will give to this feature track. This option requires a value.
key The external, human-readable label seen on the feature track when it is viewed in JBrowse. The value of key defaults to the value of tracklabel.
out A path to the output directory (default is 'data' in the current directory).
tile The directory where the tiles, or images corresponding to each zoom level of the track, are stored. Defaults to data/tiles.
bgcolor The color of the track background. Specified as "RED,GREEN,BLUE" in base ten numbers between 0 and 255. Defaults to "255,255,255".
fgcolor The color of the track foreground (i.e. the vertical bars of the wiggle track). Specified as "RED,GREEN,BLUE" in base ten numbers between 0 and 255. Defaults to "105,155,111".
width The width in pixels of each tile. The default value is 2000.
height The height in pixels of each tile. Changing this parameter will cause a corresponding change in the top-to-bottom height of the track in JBrowse. The default value is 100.
min The lower bound to use for the track. By default, this is the lowest value in the wiggle file.
max The upper bound to use for the track. By default, this is the highest value in the wiggle file.
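
To see why explicit bounds can matter, it helps to look at the range that the data actually covers, since that range is what the min and max defaults fall back to. The sketch below scans a small made-up fixedStep wiggle file with awk and prints its lowest and highest values. The file contents, the wig_range helper, and the track label quality are hypothetical, and the final command is printed, not run.

```shell
# Hypothetical fixedStep wiggle file with scores between 0 and 1.
wig=$(mktemp)
cat > "$wig" <<'EOF'
track type=wiggle_0 name="example"
fixedStep chrom=chr1 start=1 step=1
0.25
0.90
0.10
0.75
EOF

# Print the lowest and highest data values. Declaration and comment lines are
# skipped; the value is taken from the last field of each data line.
wig_range() {
    awk '
        /^(track|browser|fixedStep|variableStep|#)/ { next }
        NF == 0 { next }
        {
            v = $NF + 0
            if (!seen || v < min) min = v
            if (!seen || v > max) max = v
            seen = 1
        }
        END { if (seen) print min, max }
    ' "$1"
}

echo "observed range: $(wig_range "$wig")"

# If the scores can range over 0 to 1, the bounds can be stated explicitly.
echo "bin/wig-to-json.pl --wig $wig --tracklabel quality --min 0 --max 1"
```

Here the observed values only span part of the possible 0 to 1 range, so without explicit bounds the bar heights would be scaled to the narrower observed range.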

Naming

generate-names.pl

This script makes it possible to search for features by label (the visible name below a feature in JBrowse) and/or by alias (a secondary name that is not visible in the web browser, but may be present in the JSON used by the JBrowse client). For tracks that are added using flatfile-to-json.pl or biodb-to-json.pl, searchability depends on how the 'autocomplete' option is used. If a track is input with the autocomplete option set to 'alias', for instance, features will be searchable by alias after generate-names.pl is run (provided that alias names are present in the original data source). For tracks added using ucsc-to-json.pl, features will be searchable by label after running generate-names.pl.

To search for a term, use the text box at the top of the JBrowse window.

Basic syntax:

bin/generate-names.pl [options]

Note that generate-names.pl does not require any arguments. However, some options are available:

Option Value
dir A path to the output directory (default is 'data/names' in the current directory).
thresh A lower bound on the Patricia trie chunk size. Specifically, the lowest possible chunk size is (thresh + 1). The default value is 200. In this context, a chunk is a group of connected Patricia trie nodes that can be visualized as a single entity, and the chunk size is the total number of genomic features contained in a chunk. The lower the value of thresh, the more chunks there will be.
verbose This setting causes information about the division of nodes into chunks to be printed to the screen.

Removing Tracks

JBrowse does not provide a script that removes individual tracks, but there are several ways to change or remove a track:

1. Overwrite the unwanted track with a new track. This is useful when a mistake was made in preparing a track, and you are interested in removing the track only so that you can replace it with a correct track that has the same tracklabel (the 'tracklabel' is a track's internal name). This is done by writing the new information with the same value associated with the tracklabel option.

2. Remove the entire data directory. This is useful when you want to completely remove a track or set of tracks, rather than replacing them with different tracks. This is perhaps the fastest way to remove a track, but it has the obvious pitfall that you might also remove tracks that you wanted to keep. If you don't have very many feature tracks, or if biodb-to-json.pl is being used to generate most of the feature tracks (in which case most of the tracks can be recovered with a single execution of biodb-to-json.pl), this option will be fine.

3. Remove the information about the specific tracks from the data directory. This allows you to remove a track without removing every track, combining the advantages of the previous two methods for removing a set of tracks. The disadvantage is that you must manually remove an entry from a file that is interpreted by JBrowse. The important part to remove will be in trackInfo.js if you want to remove a feature track or refSeqs.js if you want to remove a sequence track.
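
Since the third method involves editing a file by hand, it is prudent to keep a backup copy and to locate the entry before changing anything. The sketch below shows one way to do that. The temporary directory stands in for the data directory, and the two-entry trackInfo.js is a simplified, hypothetical layout that may not match what JBrowse actually writes.

```shell
# Stand-in data directory with a simplified, hypothetical trackInfo.js.
data=$(mktemp -d)
cat > "$data/trackInfo.js" <<'EOF'
trackInfo = [
  {"label": "genes", "key": "Genes"},
  {"label": "repeats", "key": "Repeats"}
]
EOF

# Keep a copy so that the original can be restored if the manual edit goes wrong.
cp "$data/trackInfo.js" "$data/trackInfo.js.bak"

# Show the line number(s) of the entry for the track that is to be removed.
grep -n '"label": "repeats"' "$data/trackInfo.js"
```

After the entry has been removed by hand, the backup can be deleted once JBrowse is confirmed to load correctly.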
