Difference between revisions of "Chado General Module"

From GMOD
Jump to: navigation, search
m (Introduction)
 
(5 intermediate revisions by one other user not shown)
Line 37: Line 37:
  
 
The two main tables are [[#Table: dbxref|dbxref]] (for the identifier itself) and [[#Table:_db|db]] (for the name of the DB or ID-
 
The two main tables are [[#Table: dbxref|dbxref]] (for the identifier itself) and [[#Table:_db|db]] (for the name of the DB or ID-
granting authority). By separating the database into its own table rather than duplicating the name in
+
granting authority). Giving the database authority its own table, rather than repeatedly duplicating the name in
the dbxref we retain normalization
+
the dbxref table, retains data normalization.
  
 
A dbxref identifier has two key parts: a db id column that refers to an entry in the db table,
 
A dbxref identifier has two key parts: a db id column that refers to an entry in the db table,
Line 49: Line 49:
 
within the bioinformatics and biomedical realm. See below for more on uniqueness. This name is
 
within the bioinformatics and biomedical realm. See below for more on uniqueness. This name is
 
typically in short mnemonic (but human-friendly) form, and uniquely identifies the DB/authority
 
typically in short mnemonic (but human-friendly) form, and uniquely identifies the DB/authority
(enforced by uniqueness constraint). Examples include FlyBase, GO, MGI. Short human-friendly
+
(enforced by uniqueness constraint). Short human-friendly
 
names are encouraged, although longer names (such as full LSID prefixes) may also be used. The
 
names are encouraged, although longer names (such as full LSID prefixes) may also be used. The
 
name should be a valid XML NMTOKEN (see XML specification for details) - for example, it should
 
name should be a valid XML NMTOKEN (see XML specification for details) - for example, it should
Line 97: Line 97:
 
== Table: db ==
 
== Table: db ==
  
A database authority. Typical databases in bioinformatics are FlyBase, GO, UniProt, NCBI, MGI, etc. The authority is generally known by this shortened form, which is unique within the bioinformatics and biomedical realm. To Do - add support for URIs, URNs (e.g. LSIDs). We can do this by treating the URL as a URI - however, some applications may expect this to be resolvable - to be decided.
+
The db table contains one row per database authority, that is, one row per curator/creator of bioinformatic data collections. Typical databases in bioinformatics are FlyBase, GO, UniProt, NCBI, MGI, etc. The authority is generally known by this shortened form, which is unique within the bioinformatics and biomedical realm. To Do - add support for URIs, URNs (e.g. LSIDs). We can do this by treating the URL as a URI - however, some applications may expect this to be resolvable - to be decided.
  
 
{| border="1" cellpadding="3"
 
{| border="1" cellpadding="3"
Line 138: Line 138:
  
 
----
 
----
 
 
  
 
== Table: dbxref ==
 
== Table: dbxref ==
  
A unique, global, public, stable identifier. Not necessarily an external reference - can reference data items inside the particular chado instance being used. Typically a row in a table can be uniquely identified with a primary identifier (called dbxref_id); a table may also have secondary identifiers (in a linking table <T>_dbxref). A dbxref is generally written as <DB>:<ACCESSION> or as <DB>:<ACCESSION>:<VERSION>.
+
The dbxref table contains one row per version per collection of bioinformatic data, one row per Chado "database".  The table provides a unique, global, public, stable identifier that can be used to reference a database version. Not necessarily an external reference - can reference data items inside the particular chado instance being used. Typically a row in a table can be uniquely identified with a primary identifier (called dbxref_id); a table may also have secondary identifiers (in a linking table <T>_dbxref). A dbxref is generally written as <DB>:<ACCESSION> or as <DB>:<ACCESSION>:<VERSION>.
  
 
{| border="1" cellpadding="3"
 
{| border="1" cellpadding="3"
Line 318: Line 316:
  
 
[[Category:Chado Modules]]
 
[[Category:Chado Modules]]
 +
[[Category:!Lacking ERD]]

Latest revision as of 04:36, 18 February 2015

Introduction

General purpose tables are housed in the module general. The primary purpose of this module is to provide a means of providing data entities with stable, unique identifiers. In Chado, all identifiable data entities have bipartite identifiers, consisting of a dbname plus an accession, together with an optional version suffix.

By convention, these are normally presented using a ’:’ separator. An example of an identifier in this notation would be GO:0008045 or FlyBase:FBgn00000001. In the Chado schema the atomic units are the dbname and the accession, the separator is introduced only in the presentation layer. Each dbname uniquely identifies the authority responsible for a particular ID-space (so there cannot be two GO in any single Chado instance). The accession must be unique within the ID-space. Thus there can be two accessions 0008045, but there can only be one data artefact identified as GO:0008045.

These uniqueness constraints are encoded in the schema, so it is impossible for any Chado relational database instance to violate them.

Each identifier is stored as a row in the dbxref table, with the dbname stored in the db table. Keeping the dbname in a separate db table ensures that the Chado schema retains its commitment to normalization. Entries in other tables can refer to entries in the dbxref table by means of foreign keys.

Note that all stable identifiers are stored in the dbxref table, whether or not they refer to ’external’ data entities. Chado does not have an explicit notion of a data entity being external. Some dbxrefs have further information fully fleshed out in other tables in the database, and others are ’dangling’ dbxrefs.

Design patterns

This page or section needs to be edited. Please help by editing this page to add your revisions or additions.

  • Primary identifiers: ENTITY.dbxref id REFERENCES dbxref(dbxref id)
  • Secondary identifiers: ENTITY DBXREF.dbxref id

About the Tables

The two main tables are dbxref (for the identifier itself) and db (for the name of the DB or ID- granting authority). Giving the database authority its own table, rather than repeatedly duplicating the name in the dbxref table, retains data normalization.

A dbxref identifier has two key parts: a db id column that refers to an entry in the db table, and an accession column, that must be a locally unique identifier within the db referred to by the db id column. An optional third column is the version column. Taken together, these 3 columns constitute a unique key.

The db is a database authority. Typical dbs in bioinformatics are FlyBase, GO, UniProt, NCBI, MGI, etc. The authority is generally known by this sortened form (the db.name, which is unique within the bioinformatics and biomedical realm. See below for more on uniqueness. This name is typically in short mnemonic (but human-friendly) form, and uniquely identifies the DB/authority (enforced by uniqueness constraint). Short human-friendly names are encouraged, although longer names (such as full LSID prefixes) may also be used. The name should be a valid XML NMTOKEN (see XML specification for details) - for example, it should not start with a number. This constraint is to help syntactic interoperability with other identifier schemes. To ensure interoperability with other Chado databases, the same db.names should be used (e.g. FlyBase should be used consistently instead of FB). This will prevent duplicate dbxref rows being created if and when databases are merged. At the same time, uniqueness must be preserved: there must not be two GOs.

URLs and URIs

This page or section needs to be edited. Please help by editing this page to add your revisions or additions.

See the following for background:

Basically, a URI is an addressing scheme. The form of URI most people are familiar with are URLs; but not all URIs are URLs. Another URI addressing scheme is the URN; for example, LSIDs use URNs.

People often expect URLs to be resolvable using standard technology (e.g. a web browser) to a resource intended for humans, but this isn’t always the case. URNs may require other software to resolve them; e.g. an LSID resolver.

This column is nullable, so it is possible to defer decision on what the unique URI for a particular authority is if this information is not known up-front. See below for mechanisms for assigning URIs to DBs and ensuring uniqueness.

Note that it is perfectly acceptable for the db.name column to be the same as the url column (provided it is a valid URI). However, it is encouraged that a short form is used as the db.name.

About URL

A W3C compliant URL with the address of a website containing information about the DB or authority. For example, http://www.flybase.org, http://www.geneontology.org. The URL is intended for humans rather than software agents.

About URI

A W3C compliant URI that contains a unique namespace for the DB/authority. Some ID schemes (e.g. LSID) require this. The URI is intended for software agents rather than humans. It does not need to be a resolvable URL. However, certain DBs may prefer the URI to be a resolvable URL that has human-readable information on the other end. Other DBs may provide URNs (e.g. LSID URNs) that require software agents to be resolved.

To Do

The db table probably should have columns for both URL and URI. The former is intended just to go to a website like the FlyBase or GO home page. The latter is intended as a globally unique addressing scheme that should be interoperable with other schemes. For example GO may be a unique identifier for the Gene Ontology ID space by fiat within the bioinformatics community, but not outside. Although Chado only cares about the former, it may have to interoperate with schemes that care about truly global uniqueness, hence URIs.

Identifiers and Interoperability between Chado Instances

This page or section needs to be edited. Please help by editing this page to add your revisions or additions.

Tables

Table: db

The db table contains one row per database authority, that is, one row per curator/creator of bioinformatic data collections. Typical databases in bioinformatics are FlyBase, GO, UniProt, NCBI, MGI, etc. The authority is generally known by this shortened form, which is unique within the bioinformatics and biomedical realm. To Do - add support for URIs, URNs (e.g. LSIDs). We can do this by treating the URL as a URI - however, some applications may expect this to be resolvable - to be decided.

db Structure
F-Key Name Type Description
db_id serial PRIMARY KEY
name character varying(255) UNIQUE NOT NULL
description character varying(255)
urlprefix character varying(255)
url character varying(255)

Tables referencing this one via Foreign Key Constraints:


Table: dbxref

The dbxref table contains one row per version per collection of bioinformatic data, one row per Chado "database". The table provides a unique, global, public, stable identifier that can be used to reference a database version. Not necessarily an external reference - can reference data items inside the particular chado instance being used. Typically a row in a table can be uniquely identified with a primary identifier (called dbxref_id); a table may also have secondary identifiers (in a linking table <T>_dbxref). A dbxref is generally written as <DB>:<ACCESSION> or as <DB>:<ACCESSION>:<VERSION>.

dbxref Structure
F-Key Name Type Description
dbxref_id serial PRIMARY KEY

db

db_id integer UNIQUE#1 NOT NULL
accession character varying(255) UNIQUE#1 NOT NULL

The local part of the identifier. Guaranteed by the db authority to be unique for that db.
version character varying(255) UNIQUE#1 NOT NULL DEFAULT ''::character varying
description text

Tables referencing this one via Foreign Key Constraints:


Table: project

project Structure
F-Key Name Type Description
project_id serial PRIMARY KEY
name character varying(255) UNIQUE NOT NULL
description character varying(255) NOT NULL

Tables referencing this one via Foreign Key Constraints:



Table: tableinfo

tableinfo Structure
F-Key Name Type Description
tableinfo_id serial PRIMARY KEY
name character varying(30) UNIQUE NOT NULL
primary_key_column character varying(30)
is_view integer NOT NULL
view_on_table_id integer
superclass_table_id integer
is_updateable integer NOT NULL DEFAULT 1
modification_date date NOT NULL DEFAULT now()

Tables referencing this one via Foreign Key Constraints: