NOTE: We are working on migrating this site away from MediaWiki, so editing pages will be disabled for now.

Difference between revisions of "Databases and GMOD"

From GMOD
Jump to: navigation, search
m (Relational Database Terminology)
m (SQLite: removing unneeded quotebox)
 
(29 intermediate revisions by 7 users not shown)
Line 1: Line 1:
This introduces the broad topic of databases in GMOD.  It introduces some [[#What Is a Database?|database terminology]] for those that are new to databases.  It also covers how databases are implemented and used in GMOD, and what database management system choices are available.
+
This introduces the broad topic of databases in GMOD.  It introduces some [[#Database Terminology|database terminology]] for those that are new to databases.  It also covers how databases are implemented and used in GMOD, and what database management system choices are available.
 +
 
  
 
  
 
== Database Terminology ==
 
== Database Terminology ==
 
  
 
=== What's a Database? ===
 
=== What's a Database? ===
  
The term ''database'' is mentioned in all sorts of contexts in GMOD.  It is even part of the project's name.  Despite its central role in GMOD, the term ''database'' is often used to mean 4 different things.  This section distinguishes its various meanings and introduces more precise terms that should be (but probably aren't) used throughout GMOD.
+
The term ''database'' is mentioned in all sorts of contexts in GMOD.  It is even part of the project's name.  Despite its central role in GMOD, the term ''database'' is often used to mean four different things.  This section distinguishes its various meanings and introduces more precise terms that should be (but probably aren't) used throughout GMOD.
  
 +
==== Database ====
 +
 +
A ''database'' is any organized set of data that is readable by a computer.    This may or may not use a formal ''database schema'' and may or may not be in a ''database management system''.
 +
 +
A database can be implemented in a ''[[Glossary#Relational Database Management System|relational database management system]]'' using a defined ''[[Glossary#Database Schema|database schema]]''.  For example, the database behind the [http://flybase.org FlyBase web site] contains data on drosopholids, and uses the [[Chado]] schema and the [[PostgreSQL]] database management system.
 +
 +
A database can also be in regular files with a well defined format.
  
 
==== Database Management System ====
 
==== Database Management System ====
  
 
''Database management systems'' (DBMSs) are software systems that can manage data.  Oracle, MySQL, PostgreSQL, and Sybase are all examples of DBMSs.  DBMSs are ''containers of databases''.  That is, they are the systems that manage databases, which is distinct from the data that they manage.
 
''Database management systems'' (DBMSs) are software systems that can manage data.  Oracle, MySQL, PostgreSQL, and Sybase are all examples of DBMSs.  DBMSs are ''containers of databases''.  That is, they are the systems that manage databases, which is distinct from the data that they manage.
 
  
 
==== Database Schema ====
 
==== Database Schema ====
  
 
A ''database schema'' is the design of a particular database, independent of its contents.  Chado is an example of a database schema.  Designs (like Chado) can be reused across multiple databases.
 
A ''database schema'' is the design of a particular database, independent of its contents.  Chado is an example of a database schema.  Designs (like Chado) can be reused across multiple databases.
 
  
 
==== Database Web Site ====
 
==== Database Web Site ====
  
Web sites that feature a lot of database driven content, such as [http://flybase.org FlyBase] or [http://parameciumdb.org ParameciumDB], are often refered to as databases.  This is somewhat accurate as there are databases backing the web sites, but it is also misleading.  These websites show information that doesn't come from their database and they also may not show everything in their databases.
+
Web sites that feature a lot of database driven content, such as [[:Category:FlyBase|FlyBase]] (http://flybase.org) or [[ParameciumDB]] (http://paramecium.cgm.cnrs-gif.fr), are often referred to as databases.  This is somewhat accurate as there are databases backing the web sites, but it is also misleading.  These websites also show information that doesn't come from their database and they also may not show everything in their databases.
 
+
 
+
==== Database ====
+
 
+
A ''database'' is any organized set of data that is readable by a computer.    This may or may not use a formal ''database schema'' and may or may not be in a ''database management system''.
+
 
+
A database can be implemented in a ''[[Glossary#Relational Database Management System|relational database management system]] using a defined ''[[Glossary#Database Schema|database schema]]''.  For example, the database behind the [http://flybase.org FlyBase web site] contains data on drosopholids, and uses the [[Chado]] schema and the [[PostgreSQL]] database management system.
+
 
+
A database can also be in regular files with a well defined format.
+
 
+
  
 
=== Relational Database Terminology ===
 
=== Relational Database Terminology ===
  
The [[GMOD for the Biologist#Relational Databases|Relational Databases]] section in [[GMOD for the Biologist]] explains many terms that are useful in understanding relational databases.
+
The [[A_Brief_Guide_to_Databases|brief guide to databases]] defines some basic database concepts, and the [[Overview#Relational_Databases|Relational Databases]] section in [[Overview]] explains many terms that are useful in understanding relational databases.
  
 
== GMOD Database Components ==
 
== GMOD Database Components ==
  
There are two main GMOD components that are fundamentally about databases, and several more that help you manage databases or that use (or can use) databases to accomplish their purpose.
+
There are three main GMOD components that are fundamentally about databases, and several more that help you manage databases or that use (or can use) databases to accomplish their purpose.
 
+
In decreasing order of database-centricity, GMOD's database related components are:
+
  
 +
GMOD's database related components are:
  
 
=== Chado ===
 
=== Chado ===
  
 
[[Chado]] is the modular [[#Database Schema|database schema]] of GMOD.  Chado is about organizing your data in a database so that you can manage it and can connect other GMOD components to it (either directly or via data exports).  When someone speaks of the GMOD Schema they are speaking about Chado.
 
[[Chado]] is the modular [[#Database Schema|database schema]] of GMOD.  Chado is about organizing your data in a database so that you can manage it and can connect other GMOD components to it (either directly or via data exports).  When someone speaks of the GMOD Schema they are speaking about Chado.
 
  
 
=== BioMart ===
 
=== BioMart ===
  
[[BioMart]] is a data warehouse package tailored for biological data.  It takes existing databases (for example, it take the FlyBase Chaod database), transforms them into a data warehouse and then provides a web interface for supporting arbitrary queries against the data.
+
[[BioMart]] is a data warehouse package tailored for biological data.  It takes existing databases (for example, the [[:Category:FlyBase|FlyBase]] [[Chado]] database), transforms them into a data warehouse and then provides a web interface for supporting arbitrary queries against the data.
  
 +
=== InterMine ===
 +
 +
[[InterMine]] also integrates multiple data sources into a single data warehouse. It has a core data model based on the sequence ontology and supports several biological data formats. It is easy to extend the data model and integrate your own data, Java and Perl APIs and an XML format to help import custom data.  A web application allows creation of custom queries, includes template queries (web forms to run 'canned' queries) and can upload and operate on lists of data. Many aspects of the web app can be configured and branded.
  
 
=== Database Tools ===
 
=== Database Tools ===
  
Links to DB tools.
+
* [[Argos]], a.k.a. Flybase-NG, a.k.a. biodb, is designed to provide automatic replication, installation and updates of genome and organism databases and information servers, including FlyBase and euGenes. It should be not too difficult to add other organism/genome services to this replication structure.
 
+
* [[GMODTools]] is a Perl package that generates Fasta, [[GFF]], DNA and other bulk genome annotation files from [[Chado]] databases.
 
+
* [[Modware]] is an object-oriented Perl API for [[Chado]]. It allows object-oriented querying and loading of a Chado database and returns data structures that a programmer can readily use without knowing the details of how the object is stored in the relational schema.
 +
* [[XORT]] is a utility written in Perl that can be used to read to and write from a [[Glossary#Schema|relational schema]] using [[Glossary#XML|XML]] as an interchange format. In GMOD XORT is frequently used to transfer data to and from [[Chado]].
  
 
=== GMOD Components that Require a DBMS ===
 
=== GMOD Components that Require a DBMS ===
  
 
* [[CMap]]
 
* [[CMap]]
* [[Ergatis]]
+
* [[GBrowse_syn]]
  
 
== GMOD DBMS Choices ==
 
== GMOD DBMS Choices ==
Line 72: Line 68:
  
 
See the component descriptions to find out if they need an underlaying database and what their default DBMS is.
 
See the component descriptions to find out if they need an underlaying database and what their default DBMS is.
 
  
 
=== Can I Use Something Besides the Default DBMS? ===
 
=== Can I Use Something Besides the Default DBMS? ===
  
The answer is yes, but it will mean extra work.
+
The answer is yes, but it will mean extra work.
  
 
You may want to do this if you are already using a DBMS that you understand.  DBMS administration is non-trivial and adding one or two more DBMSs to the list you have to support may or may not be more effort than porting the component to use your DBMS of choice.  However, do keep in mind that one of the reasons why MySQL and Postgres are often picked as default DBMSs is that they are comparatively easy to administer.
 
You may want to do this if you are already using a DBMS that you understand.  DBMS administration is non-trivial and adding one or two more DBMSs to the list you have to support may or may not be more effort than porting the component to use your DBMS of choice.  However, do keep in mind that one of the reasons why MySQL and Postgres are often picked as default DBMSs is that they are comparatively easy to administer.
Line 83: Line 78:
  
 
Postgres and MySQL are the most popular DBMSs, but several others are in use in the GMOD community.
 
Postgres and MySQL are the most popular DBMSs, but several others are in use in the GMOD community.
 
 
  
 
==== PostgreSQL ====
 
==== PostgreSQL ====
  
Postgres is the default DBMS for [[Chado]], GMOD's modular database schema.
+
[[PostgreSQL|Postgres]] (officially known as ''PostgreSQL'') is the default DBMS for [[Chado]], GMOD's modular database schema.  [[Galaxy]] can also run using Postgres.  Work is being done on [[Ergatis]] to also support PostgreSQL.
  
 
See the [[PostgreSQL]] page from more information on Postgres.
 
See the [[PostgreSQL]] page from more information on Postgres.
Line 94: Line 87:
 
==== MySQL ====
 
==== MySQL ====
  
MySQL adapters exist for [[GBrowse]] anot '''other components?'''.
+
MySQL adapters exist for [[GBrowse]] and it also used by [[CMap]]
  
See [[MySQL]] for more information on MySQL in GMOD.
+
Using MySQL with [[Chado]] is discouraged.  See [[MySQL]] for more information on MySQL in GMOD.
 +
 
 +
==== SQLite ====
 +
 
 +
[http://sqlite.org/ SQLite] is a software library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine. SQLite is the most widely deployed SQL database engine in the world. The source code for SQLite is in the public domain.
 +
 
 +
SQLite is used in a few places in GMOD.  [[MAKER]] uses it for its data store.  It is the default database when installing [[Galaxy]], but most users then upgrade to a different {{GlossaryLink|DBMS|DBMS}}, usually [[PostgreSQL]].  It can also be used with [[GBrowse]].
  
 
==== DB2 ====
 
==== DB2 ====
Line 104: Line 103:
 
==== Oracle ====
 
==== Oracle ====
  
[http://apidb.org ApiDB] uses [http://http://www.oracle.com/database/ Oracle] for its database needs.  Oracle is a high-end database from [http://oracle.com Oracle Corporation].  It is the most popular commercial database in the world.
+
Oracle is a supported DBMS for [[CMap]] and [[GBrowse]].
  
 +
[http://apidb.org ApiDB] uses [http://http://www.oracle.com/database/ Oracle] for its database needs, and [http://www.dictybase.org dictyBase] uses [[:Image:Just-050516.ppt|Oracle for Chado and GMODTools]].
  
 +
Work is being done on [[Ergatis]] to also support Oracle.
 +
 +
Oracle is a high-end database management system from [http://oracle.com Oracle Corporation].  It is the most popular commercial database in the world.
  
 
==== Sybase ====
 
==== Sybase ====
  
Due to its heritage at JCVI, the default database of the [[Ergatis]] workflow management component is Sybase.  Work is being done to also support the [[PostgreSQL]] and [[#Oracle|Oracle]] DBMSs.
+
Due to its heritage at JCVI, the default database of the [[Ergatis]] workflow management tool was originally Sybase.  Currently, it is usually used with MySQL and PostgreSQL by way of the [https://sourceforge.net/projects/coati-api/ Coati] development framework.
 
+
 
+
  
 
== See Also ==
 
== See Also ==
  
[[GMOD for the Biologist#Relational Databases|Relational Databases]] in [[GMOD for the Biologist]].
+
[[Overview#Relational Databases|Relational Databases]] in [[Overview]].
 +
 
 +
[[Category:Database Tools]]
 +
[[Category:PostgreSQL]]
 +
[[Category:MySQL]]

Latest revision as of 22:31, 25 March 2014

This introduces the broad topic of databases in GMOD. It introduces some database terminology for those that are new to databases. It also covers how databases are implemented and used in GMOD, and what database management system choices are available.


Database Terminology

What's a Database?

The term database is mentioned in all sorts of contexts in GMOD. It is even part of the project's name. Despite its central role in GMOD, the term database is often used to mean four different things. This section distinguishes its various meanings and introduces more precise terms that should be (but probably aren't) used throughout GMOD.

Database

A database is any organized set of data that is readable by a computer. This may or may not use a formal database schema and may or may not be in a database management system.

A database can be implemented in a relational database management system using a defined database schema. For example, the database behind the FlyBase web site contains data on drosopholids, and uses the Chado schema and the PostgreSQL database management system.

A database can also be in regular files with a well defined format.

Database Management System

Database management systems (DBMSs) are software systems that can manage data. Oracle, MySQL, PostgreSQL, and Sybase are all examples of DBMSs. DBMSs are containers of databases. That is, they are the systems that manage databases, which is distinct from the data that they manage.

Database Schema

A database schema is the design of a particular database, independent of its contents. Chado is an example of a database schema. Designs (like Chado) can be reused across multiple databases.

Database Web Site

Web sites that feature a lot of database driven content, such as FlyBase (http://flybase.org) or ParameciumDB (http://paramecium.cgm.cnrs-gif.fr), are often referred to as databases. This is somewhat accurate as there are databases backing the web sites, but it is also misleading. These websites also show information that doesn't come from their database and they also may not show everything in their databases.

Relational Database Terminology

The brief guide to databases defines some basic database concepts, and the Relational Databases section in Overview explains many terms that are useful in understanding relational databases.

GMOD Database Components

There are three main GMOD components that are fundamentally about databases, and several more that help you manage databases or that use (or can use) databases to accomplish their purpose.

GMOD's database related components are:

Chado

Chado is the modular database schema of GMOD. Chado is about organizing your data in a database so that you can manage it and can connect other GMOD components to it (either directly or via data exports). When someone speaks of the GMOD Schema they are speaking about Chado.

BioMart

BioMart is a data warehouse package tailored for biological data. It takes existing databases (for example, the FlyBase Chado database), transforms them into a data warehouse and then provides a web interface for supporting arbitrary queries against the data.

InterMine

InterMine also integrates multiple data sources into a single data warehouse. It has a core data model based on the sequence ontology and supports several biological data formats. It is easy to extend the data model and integrate your own data, Java and Perl APIs and an XML format to help import custom data. A web application allows creation of custom queries, includes template queries (web forms to run 'canned' queries) and can upload and operate on lists of data. Many aspects of the web app can be configured and branded.

Database Tools

  • Argos, a.k.a. Flybase-NG, a.k.a. biodb, is designed to provide automatic replication, installation and updates of genome and organism databases and information servers, including FlyBase and euGenes. It should be not too difficult to add other organism/genome services to this replication structure.
  • GMODTools is a Perl package that generates Fasta, GFF, DNA and other bulk genome annotation files from Chado databases.
  • Modware is an object-oriented Perl API for Chado. It allows object-oriented querying and loading of a Chado database and returns data structures that a programmer can readily use without knowing the details of how the object is stored in the relational schema.
  • XORT is a utility written in Perl that can be used to read to and write from a relational schema using XML as an interchange format. In GMOD XORT is frequently used to transfer data to and from Chado.

GMOD Components that Require a DBMS

GMOD DBMS Choices

Several GMOD Components rely on databases to store their data. All such components have a default DBMS that the developers had in mind when they created the component. The default DBMS is most often PostgreSQL or MySQL. PostgreSQL, commonly known as Postgres, and MySQL are both open-source DBMSs with large and active user communities. It is possible to use a DBMS other than the default but it does involve more work, sometimes a lot more work.

See the component descriptions to find out if they need an underlaying database and what their default DBMS is.

Can I Use Something Besides the Default DBMS?

The answer is yes, but it will mean extra work.

You may want to do this if you are already using a DBMS that you understand. DBMS administration is non-trivial and adding one or two more DBMSs to the list you have to support may or may not be more effort than porting the component to use your DBMS of choice. However, do keep in mind that one of the reasons why MySQL and Postgres are often picked as default DBMSs is that they are comparatively easy to administer.

DBMSs in Use in the GMOD Community

Postgres and MySQL are the most popular DBMSs, but several others are in use in the GMOD community.

PostgreSQL

Postgres (officially known as PostgreSQL) is the default DBMS for Chado, GMOD's modular database schema. Galaxy can also run using Postgres. Work is being done on Ergatis to also support PostgreSQL.

See the PostgreSQL page from more information on Postgres.

MySQL

MySQL adapters exist for GBrowse and it also used by CMap

Using MySQL with Chado is discouraged. See MySQL for more information on MySQL in GMOD.

SQLite

SQLite is a software library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine. SQLite is the most widely deployed SQL database engine in the world. The source code for SQLite is in the public domain.

SQLite is used in a few places in GMOD. MAKER uses it for its data store. It is the default database when installing Galaxy, but most users then upgrade to a different DBMS, usually PostgreSQL. It can also be used with GBrowse.

DB2

Xenbase uses DB2 for their Chado installation. DB2 is a high-end database from IBM that has a free version and also a free academic licenses. DB2 is one of the big players in the commercial database market.

Oracle

Oracle is a supported DBMS for CMap and GBrowse.

ApiDB uses Oracle for its database needs, and dictyBase uses Oracle for Chado and GMODTools.

Work is being done on Ergatis to also support Oracle.

Oracle is a high-end database management system from Oracle Corporation. It is the most popular commercial database in the world.

Sybase

Due to its heritage at JCVI, the default database of the Ergatis workflow management tool was originally Sybase. Currently, it is usually used with MySQL and PostgreSQL by way of the Coati development framework.

See Also

Relational Databases in Overview.