Difference between revisions of "Chado Manual"

Revision as of 15:01, 23 February 2007

Introduction

Modularity

The Chado schema has been designed with modularity and compartmentalization of function in mind. Groups of tables concerned with a single knowledge domain are called ''modules''. There is a core module, ''general'', concerned with data underlying all other classes; its tables store information about databases, database identifiers, and general information about Chado tables. Equally important is ''cv'', the module concerned with '''c'''ontrolled '''v'''ocabularies, or ontologies.

All other modules link to the ''general'' and ''cv'' tables but are limited in scope to specific biological domains. For example, the ''sequence'' module is concerned with protein and nucleotide sequences, the ''pub'' module is concerned with articles and publications, and so on. Besides being limited in scope, the modules are effectively free of redundancy. For example, the module called ''companalysis'', short for "computational analysis", contains the tables that describe algorithms and their output. The ''rad'' module (for microarrays) uses ''companalysis'' whenever it needs to refer to an algorithm, rather than duplicating those tables. Because each module is unique and general, anyone introducing a new module can rely on the existing modules for the functions they already cover. The acceptance of ontologies as general standards, and Chado's use of these ontologies, also make Chado a good platform for annotating biological data.


Extensibility

Chado's modular design makes it highly extensible. The clear segregation of function into modules, or sets of tables, should allow new modules to be introduced.


Ontologies

One of the more profound recent changes in biology is the adoption of ontologies, or controlled vocabularies, as a way to describe and organize data. The most popular ontologies arose from the need to describe the remarkable variety of living things, and they are both detailed and broad. At the same time, these ontologies have served to categorize and classify the contents of entire databases that had previously been atomized or only partially coherent. Chado was built from the outset to integrate with these ontologies, which makes it extremely expressive.


Associated Software

Chado is considered to be one of the key components in the GMOD suite.

Complexity and Detail

Part of the impetus for creating Chado was the need for a database that could describe '''all''' the detail uncovered by extensive research on a model organism.


Support

The community using Chado, and GMOD, is extensive and growing.


--comparison table--


Modules

We organised the tables into distinct modular components with tightly defined dependencies. This is recognised as good software engineering practice: it allows different software components to focus on the specific data compartments they require, and it allows for extensibility and schema evolution within specific modules without disrupting the rest of the schema. Finally, it allows for a mix-and-match approach. It is the authors' hope that the schema modules will be adopted by other model organism and bioinformatics groups; these groups may want to swap in their own table variants within specific modules, or add modules of their own.


Module Dependencies

general: NO DEPENDENCIES
organism: general
pub: general
cv: general pub
sequence: cv general pub
genetic: sequence cv general pub
expression: sequence cv general pub
map: sequence cv general pub


Inter-module Linking Tables

These can be thought of as floating outside of the respective modules they bridge, although they are generally bundled with one or the other module.

REVIEW - Not complete

{| class="wikitable"
! Module !! Module !! Linking table
|-
| sequence || expression || feature_expression
|-
| cv || expression || expression_cvterm
|-
| pub || expression || expression_pub
|-
| cv || genetic || phenotype_cvterm
|-
| sequence || genetic || feature_genotype
|-
| general || organism || organism_dbxref
|-
| general || pub || pub_dbxref
|-
| general || pub || journal_dbxref
|-
| pub || sequence || featureprop_pub
|-
| general || sequence || feature_dbxref
|-
| cv || sequence || feature_cvterm
|-
| organism || sequence || feature_organism
|-
| general || sequence || feature_synonym
|-
| general || sequence || gene_synonym
|}

Chado Naming Conventions

Case sensitivity

We use lowercase for all table and column names, because DBMSs differ in how they treat case sensitivity. For example, Oracle folds unquoted identifiers to uppercase. So it's best to be neutral and use lowercase.

Table names

In table names, we use underscores for linking tables; e.g. feature_dbxref is a linking table between feature and dbxref.

Where a table name is a noun phrase rather than a single noun, we concatenate the words together. For instance the table for describing feature properties is called featureprop. It could be argued this is harder to read, but it does allow consistent usage of underscores as above. FeatureProp could be used where it is known the DBMS is case insensitive.
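
As a hedged illustration of the two table-naming rules above, the following sketch declares one concatenated noun-phrase table and one underscore-joined linking table. The column lists are trimmed and partly assumed for this example; they are not the full Chado definitions.

<sql>
-- Simplified stand-ins for the real Chado tables (columns are illustrative).
CREATE TABLE feature (
    feature_id SERIAL PRIMARY KEY,
    uniquename TEXT NOT NULL
);

CREATE TABLE dbxref (
    dbxref_id SERIAL PRIMARY KEY,
    accession TEXT NOT NULL
);

-- Noun phrase "feature property": words concatenated, no underscore.
CREATE TABLE featureprop (
    featureprop_id SERIAL PRIMARY KEY,
    feature_id INT NOT NULL REFERENCES feature (feature_id),
    value TEXT
);

-- Linking table between feature and dbxref: table names joined by an underscore.
CREATE TABLE feature_dbxref (
    feature_dbxref_id SERIAL PRIMARY KEY,
    feature_id INT NOT NULL REFERENCES feature (feature_id),
    dbxref_id INT NOT NULL REFERENCES dbxref (dbxref_id)
);
</sql>

With this convention an underscore in a table name always signals a link between two tables, so feature_dbxref can be read as "feature to dbxref" without consulting the schema.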

Column names

In column names, we also use concatenated noun phrases, except in the case of primary or foreign keys, e.g. dbxref_id.

We try to keep column names unique where appropriate, which is useful for large join statements or views, in avoiding column name clash between different tables. The convention is to use an abbreviated form of the table name plus a noun describing the column, for instance fmin in the feature table. By consistently using abbreviated forms we stop column names getting too big (many DBMSs will complain about long column names).

Primary and foreign key names

We use the same column name for primary and foreign key columns - very useful for NATURAL JOIN statements.
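
A minimal sketch of why shared key names help, assuming cut-down versions of feature, dbxref and feature_dbxref (the non-key columns are chosen for illustration only). Because the key columns carry the same name on both sides, NATURAL JOIN needs no ON clause.

<sql>
-- Minimal illustrative tables; the real Chado tables have more columns.
CREATE TEMPORARY TABLE feature (feature_id INT PRIMARY KEY, uniquename TEXT);
CREATE TEMPORARY TABLE dbxref (dbxref_id INT PRIMARY KEY, accession TEXT);
CREATE TEMPORARY TABLE feature_dbxref (
    feature_dbxref_id INT PRIMARY KEY,
    feature_id INT REFERENCES feature (feature_id),
    dbxref_id INT REFERENCES dbxref (dbxref_id)
);

-- NATURAL JOIN matches on every identically named column, which here is
-- feature_id for the first join and dbxref_id for the second.
SELECT uniquename, accession
FROM feature
NATURAL JOIN feature_dbxref
NATURAL JOIN dbxref;

-- The same query written with explicit join conditions.
SELECT f.uniquename, d.accession
FROM feature AS f
JOIN feature_dbxref AS fd ON fd.feature_id = f.feature_id
JOIN dbxref AS d ON d.dbxref_id = fd.dbxref_id;
</sql>

Note that NATURAL JOIN also matches on any other column name the two tables happen to share, which is one more reason to keep non-key column names unique across tables.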

Constraints

Constraint names are a concatenation of table name, underscore, the letter c, and a digit. For example: feature_phenotype_c1.

Indexes

Index names are a concatenation of table name, underscore, the string idx, and a digit. For example: feature_phenotype_idx1.
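
The constraint and index naming rules can be sketched as follows. The columns of feature_phenotype and the choice of which columns to constrain or index are assumptions made for this example, not the definitions shipped with Chado.

<sql>
-- Hypothetical, simplified feature_phenotype table.
CREATE TABLE feature_phenotype (
    feature_phenotype_id SERIAL PRIMARY KEY,
    feature_id INT NOT NULL,
    phenotype_id INT NOT NULL,
    -- Constraint name: table name + "_c" + digit.
    CONSTRAINT feature_phenotype_c1 UNIQUE (feature_id, phenotype_id)
);

-- Index names: table name + "_idx" + digit.
CREATE INDEX feature_phenotype_idx1 ON feature_phenotype (feature_id);
CREATE INDEX feature_phenotype_idx2 ON feature_phenotype (phenotype_id);
</sql>

Predictable names make it straightforward to tell from an error message or a query plan which table a constraint or index belongs to.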

Views

The names of views are lowercase. Where a table name is a noun phrase rather than a single noun, we concatenate the words together using the underscore. For example the view used to query for nucleotide motifs is called nucleotide_motif and the view used to find exons from pseudogenes is called pseudogenic_exon.
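
As a rough sketch of what such a view might look like, the definition below selects features by the name of their type term. It assumes a Chado database in which feature.type_id references cvterm.cvterm_id; the body of the view is an illustration and may differ from the definition actually distributed with Chado.

<sql>
-- Illustrative definition only: every feature whose type term is named
-- 'nucleotide_motif' becomes a row of the view.
CREATE VIEW nucleotide_motif AS
SELECT feature.*
FROM feature
INNER JOIN cvterm ON feature.type_id = cvterm.cvterm_id
WHERE cvterm.name = 'nucleotide_motif';

-- The view can then be queried like an ordinary table.
SELECT feature_id, uniquename
FROM nucleotide_motif;
</sql>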

Design Patterns

Module System


Module Metadata

View Layers

Views can be thought of as virtual tables. They provide a powerful abstraction layer over the database. All views should be portable across all DBMSs.

Views in Chado are defined on a per-module basis. View definitions are maintained in the chado/modules/MODULE-NAME/views directory.

Included in the view directory are report views. These can usually be found in a file called chado/modules/MODULE-NAME/views/MODULE-NAME-report.sql

Collections of view definitions are bundled into packages; each package is a .sql file.


Inter-schema Bridges

GODB Bridge


BioSQL Bridge


DBMS Functions

DBMS Functions in Chado are entirely optional.

Functions in Chado are defined on a per-module basis. Function definitions are maintained in the chado/modules/MODULE-NAME/functions directory.

Collections of function definitions are bundled into packages. Each package comes with an interface description and one or more implementations.


Function Interface Definitions

The interface descriptions are stored in a *.sqlapi file. The syntax used is a variant of SQL and is intended primarily as a consistent way of providing information for humans, although it should be parseable by software.

Here is an example, taken from the top of the chado/modules/sequence/functions/subsequence.sqlapi package. This package provides basic subsequencing functions. It has dependencies on two other function packages, declared at the top of the file. The package declares multiple functions, only the first of which is shown here: a function for extracting subsequences from the sequence of a feature.

<sql>
IMPORT reverse_complement(TEXT) FROM 'sequtil';
IMPORT get_feature_relationship_type_id(TEXT) FROM 'sequence-cv-helper';

-- basic subsequencing functions --

DECLARE FUNCTION subsequence(
    srcfeature_id INT REFERENCES feature(feature_id),
    fmin INT,
    fmax INT,
    strand INT
) RETURNS TEXT;

COMMENT ON FUNCTION subsequence(INT,INT,INT,INT) IS 'extracts a subsequence from a feature referenced by srcfeature_id, within the interbase boundaries determined by fmin and fmax, reverse complementing if strand = -1. The sequence can be DNA or AA. Strand must always be >0 for AA sequences';
</sql>
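
Assuming an implementation of this interface has been loaded into the database, a call might look like the sketch below. The feature_id value 42 and the coordinates are hypothetical and chosen only for illustration.

<sql>
-- 42 is a hypothetical feature_id; substitute one from your own database.
-- In interbase coordinates, fmin = 0 and fmax = 10 span the first ten residues.
SELECT subsequence(42, 0, 10, 1) AS forward_seq;

-- strand = -1 requests the reverse complement of the same span.
SELECT subsequence(42, 0, 10, -1) AS reverse_seq;
</sql>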


Function Implementations

The goal is to provide implementations for different dialects of procedural SQL. Currently only the PostgreSQL dialect (PL/pgSQL) is supported. The PL/pgSQL implementations are stored in *.plpgsql files.
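
To give a feel for what a PL/pgSQL implementation of the subsequence interface could look like, here is a minimal sketch. It is not the implementation shipped in the Chado distribution. It assumes that the sequence is stored in the residues column of the feature table and that a reverse_complement(TEXT) function, as imported by the interface file, is already installed.

<sql>
-- Sketch only: assumes feature.residues holds the sequence and that
-- reverse_complement(TEXT) from the 'sequtil' package is available.
CREATE OR REPLACE FUNCTION subsequence(
    srcfeature_id INT,
    fmin INT,
    fmax INT,
    strand INT
) RETURNS TEXT AS $$
DECLARE
    subseq TEXT;
BEGIN
    -- Interbase coordinates are zero-based, while substr() is one-based,
    -- so the span starts at fmin + 1 and has length fmax - fmin.
    SELECT substr(f.residues, fmin + 1, fmax - fmin)
      INTO subseq
      FROM feature AS f
     WHERE f.feature_id = srcfeature_id;

    -- Reverse complement only when the minus strand is requested.
    IF strand = -1 THEN
        RETURN reverse_complement(subseq);
    END IF;

    RETURN subseq;
END;
$$ LANGUAGE plpgsql;
</sql>

The sketch returns NULL when no feature matches srcfeature_id; a production implementation would probably want to validate its arguments as well.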