Databases and GMOD

From GMOD
Jump to: navigation, search

This introduces the broad topic of databases in GMOD. It introduces some database terminology for those that are new to databases. It also covers how databases are implemented and used in GMOD, and what database management system choices are available.


Database Terminology

What's a Database?

The term database is mentioned in all sorts of contexts in GMOD. It is even part of the project's name. Despite its central role in GMOD, the term database is often used to mean four different things. This section distinguishes its various meanings and introduces more precise terms that should be (but probably aren't) used throughout GMOD.

Database

A database is any organized set of data that is readable by a computer. This may or may not use a formal database schema and may or may not be in a database management system.

A database can be implemented in a relational database management system using a defined database schema. For example, the database behind the FlyBase web site contains data on drosopholids, and uses the Chado schema and the PostgreSQL database management system.

A database can also be in regular files with a well defined format.

Database Management System

Database management systems (DBMSs) are software systems that can manage data. Oracle, MySQL, PostgreSQL, and Sybase are all examples of DBMSs. DBMSs are containers of databases. That is, they are the systems that manage databases, which is distinct from the data that they manage.

Database Schema

A database schema is the design of a particular database, independent of its contents. Chado is an example of a database schema. Designs (like Chado) can be reused across multiple databases.

Database Web Site

Web sites that feature a lot of database driven content, such as FlyBase (http://flybase.org) or ParameciumDB (http://paramecium.cgm.cnrs-gif.fr), are often referred to as databases. This is somewhat accurate as there are databases backing the web sites, but it is also misleading. These websites also show information that doesn't come from their database and they also may not show everything in their databases.

Relational Database Terminology

The brief guide to databases defines some basic database concepts, and the Relational Databases section in Overview explains many terms that are useful in understanding relational databases.

GMOD Database Components

There are three main GMOD components that are fundamentally about databases, and several more that help you manage databases or that use (or can use) databases to accomplish their purpose.

GMOD's database related components are:

Chado

Chado is the modular database schema of GMOD. Chado is about organizing your data in a database so that you can manage it and can connect other GMOD components to it (either directly or via data exports). When someone speaks of the GMOD Schema they are speaking about Chado.

BioMart

BioMart is a data warehouse package tailored for biological data. It takes existing databases (for example, the FlyBase Chado database), transforms them into a data warehouse and then provides a web interface for supporting arbitrary queries against the data.

InterMine

InterMine also integrates multiple data sources into a single data warehouse. It has a core data model based on the sequence ontology and supports several biological data formats. It is easy to extend the data model and integrate your own data, Java and Perl APIs and an XML format to help import custom data. A web application allows creation of custom queries, includes template queries (web forms to run 'canned' queries) and can upload and operate on lists of data. Many aspects of the web app can be configured and branded.

Database Tools

  • Argos, a.k.a. Flybase-NG, a.k.a. biodb, is designed to provide automatic replication, installation and updates of genome and organism databases and information servers, including FlyBase and euGenes. It should be not too difficult to add other organism/genome services to this replication structure.
  • GMODTools is a Perl package that generates Fasta, GFF, DNA and other bulk genome annotation files from Chado databases.
  • Modware is an object-oriented Perl API for Chado. It allows object-oriented querying and loading of a Chado database and returns data structures that a programmer can readily use without knowing the details of how the object is stored in the relational schema.
  • XORT is a utility written in Perl that can be used to read to and write from a relational schema using XML as an interchange format. In GMOD XORT is frequently used to transfer data to and from Chado.

GMOD Components that Require a DBMS

GMOD DBMS Choices

Several GMOD Components rely on databases to store their data. All such components have a default DBMS that the developers had in mind when they created the component. The default DBMS is most often PostgreSQL or MySQL. PostgreSQL, commonly known as Postgres, and MySQL are both open-source DBMSs with large and active user communities. It is possible to use a DBMS other than the default but it does involve more work, sometimes a lot more work.

See the component descriptions to find out if they need an underlaying database and what their default DBMS is.

Can I Use Something Besides the Default DBMS?

The answer is yes, but it will mean extra work.

You may want to do this if you are already using a DBMS that you understand. DBMS administration is non-trivial and adding one or two more DBMSs to the list you have to support may or may not be more effort than porting the component to use your DBMS of choice. However, do keep in mind that one of the reasons why MySQL and Postgres are often picked as default DBMSs is that they are comparatively easy to administer.

DBMSs in Use in the GMOD Community

Postgres and MySQL are the most popular DBMSs, but several others are in use in the GMOD community.

PostgreSQL

Postgres (officially known as PostgreSQL) is the default DBMS for Chado, GMOD's modular database schema. Galaxy can also run using Postgres. Work is being done on Ergatis to also support PostgreSQL.

See the PostgreSQL page from more information on Postgres.

MySQL

MySQL adapters exist for GBrowse and it also used by CMap

Using MySQL with Chado is discouraged. See MySQL for more information on MySQL in GMOD.

SQLite

SQLite is a software library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine. SQLite is the most widely deployed SQL database engine in the world. The source code for SQLite is in the public domain.

SQLite is used in a few places in GMOD. MAKER uses it for its data store. It is the default database when installing Galaxy, but most users then upgrade to a different DBMS, usually PostgreSQL. It can also be used with GBrowse.

DB2

Xenbase uses DB2 for their Chado installation. DB2 is a high-end database from IBM that has a free version and also a free academic licenses. DB2 is one of the big players in the commercial database market.

Oracle

Oracle is a supported DBMS for CMap and GBrowse.

ApiDB uses Oracle for its database needs, and dictyBase uses Oracle for Chado and GMODTools.

Work is being done on Ergatis to also support Oracle.

Oracle is a high-end database management system from Oracle Corporation. It is the most popular commercial database in the world.

Sybase

Due to its heritage at JCVI, the default database of the Ergatis workflow management tool was originally Sybase. Currently, it is usually used with MySQL and PostgreSQL by way of the Coati development framework.

See Also

Relational Databases in Overview.