Difference between revisions of "Chado Phenotype Module at FlyBase"

From GMOD
Jump to: navigation, search
m ({{ChadoModuleTableLink|Genetic|environment}})
m (Text replace - "</sql>" to "</syntaxhighlight>")
 
(6 intermediate revisions by 2 users not shown)
Line 45: Line 45:
 
== {{ChadoModuleTableLink|Phenotype|phenotype}} ==
 
== {{ChadoModuleTableLink|Phenotype|phenotype}} ==
  
<div class="indent">{{ChadoTable_phenotype}}</div>
+
<div class="quotebox">{{ChadoTable_phenotype}}</div>
  
 
=== What is a Phenotype? ===
 
=== What is a Phenotype? ===
Line 105: Line 105:
  
  
<sql>select count(*), c.name
+
<syntaxhighlight lang="sql">select count(*), c.name
 
   from phenotype p, cvterm t, cv c
 
   from phenotype p, cvterm t, cv c
 
  where t.cvterm_id = p.observable_id
 
  where t.cvterm_id = p.observable_id
Line 111: Line 111:
 
  group by 2
 
  group by 2
 
  order by 1 desc
 
  order by 1 desc
  limit 50;</sql>
+
  limit 50;</syntaxhighlight>
  
 
  count |          name
 
  count |          name
Line 157: Line 157:
 
== {{ChadoModuleTableLink|Genetic|phenotype_comparison}} ==
 
== {{ChadoModuleTableLink|Genetic|phenotype_comparison}} ==
  
<div class="indent">{{ChadoTable_phenotype_comparison}}</div>
+
<div class="quotebox">{{ChadoTable_phenotype_comparison}}</div>
  
 
This table is good for binary comparisons.  How broadly useful is that?
 
This table is good for binary comparisons.  How broadly useful is that?
Line 205: Line 205:
 
== {{ChadoModuleTableLink|Phenotype|phenotype_cvterm}} ==
 
== {{ChadoModuleTableLink|Phenotype|phenotype_cvterm}} ==
  
<div class="indent">{{ChadoTable_phenotype_cvterm}}</div>
+
<div class="quotebox">{{ChadoTable_phenotype_cvterm}}</div>
  
 
Can imagine it just attaches a bag of terms to the phenotype.  There is no semantics on how the terms relate to each other or to the phenotype.
 
Can imagine it just attaches a bag of terms to the phenotype.  There is no semantics on how the terms relate to each other or to the phenotype.
Line 237: Line 237:
 
Find rows that use rank in the unique constraint.
 
Find rows that use rank in the unique constraint.
  
<sql>
+
<syntaxhighlight lang="sql">
 
select * from phenotype_cvterm pcv1
 
select * from phenotype_cvterm pcv1
 
  where exists (
 
  where exists (
Line 244: Line 244:
 
       and pcv1.phenotype_id = pcv2.phenotype_id
 
       and pcv1.phenotype_id = pcv2.phenotype_id
 
       and pcv1.cvterm_id = pcv2.cvterm_id) limit 10;
 
       and pcv1.cvterm_id = pcv2.cvterm_id) limit 10;
</sql>
+
</syntaxhighlight>
 
  phenotype_cvterm_id | phenotype_id | cvterm_id | rank
 
  phenotype_cvterm_id | phenotype_id | cvterm_id | rank
 
  --------------------+--------------+-----------+------
 
  --------------------+--------------+-----------+------
Line 259: Line 259:
 
== {{ChadoModuleTableLink|Genetic|phenstatement}} ==
 
== {{ChadoModuleTableLink|Genetic|phenstatement}} ==
  
<div class="indent">{{ChadoTable_phenstatement}}</div>
+
<div class="quotebox">{{ChadoTable_phenstatement}}</div>
  
 
=== {{ChadoModuleTableLink|Genetic|phenstatement}} @ Flybase ===
 
=== {{ChadoModuleTableLink|Genetic|phenstatement}} @ Flybase ===
Line 287: Line 287:
 
== {{ChadoModuleTableLink|Genetic|phendesc}} ==
 
== {{ChadoModuleTableLink|Genetic|phendesc}} ==
  
<div class="indent">{{ChadoTable_phendesc}}</div>
+
<div class="quotebox">{{ChadoTable_phendesc}}</div>
  
 
You can't link directly from a {{ChadoModuleTableLink|Phenotype|phenotype}} to {{ChadoTableName|phendesc}}.  You can link directly from a {{ChadoModuleTableLink|Genetic|phenstatement}} to a {[ChadoTableName|phendesc}} using (genotype_id, environment_id, and pub_id).  Note that this does not take full advantage of the unique index on those 3 columns plus type_id.
 
You can't link directly from a {{ChadoModuleTableLink|Phenotype|phenotype}} to {{ChadoTableName|phendesc}}.  You can link directly from a {{ChadoModuleTableLink|Genetic|phenstatement}} to a {[ChadoTableName|phendesc}} using (genotype_id, environment_id, and pub_id).  Note that this does not take full advantage of the unique index on those 3 columns plus type_id.
Line 318: Line 318:
 
==== type_id ====
 
==== type_id ====
  
<sql>
+
<syntaxhighlight lang="sql">
 
select count(*), t.name, c.name
 
select count(*), t.name, c.name
 
   from phendesc p, cvterm t, cv c
 
   from phendesc p, cvterm t, cv c
Line 326: Line 326:
 
  order by 1 desc
 
  order by 1 desc
 
  limit 50;
 
  limit 50;
</sql>
+
</syntaxhighlight>
 
  count |          name            |    name
 
  count |          name            |    name
 
  ------+---------------------------+---------------
 
  ------+---------------------------+---------------
Line 388: Line 388:
 
== {{ChadoModuleTableLink|Genetic|environment_cvterm}} ==
 
== {{ChadoModuleTableLink|Genetic|environment_cvterm}} ==
  
No doc, but you can guess.  No semantics here.  My guess is associates a list of CV terms with an environment.
+
Definition:
 +
<div class="quotebox">{{ChadoTable_environment_cvterm}}</div>
 +
 
 +
No semantics here.  My guess is associates a list of CV terms with an environment.
  
 
=== {{ChadoModuleTableLink|Genetic|environment_cvterm}} @ Flybase ===
 
=== {{ChadoModuleTableLink|Genetic|environment_cvterm}} @ Flybase ===
Line 402: Line 405:
 
== {{ChadoModuleTableLink|Genetic|genotype}} ==
 
== {{ChadoModuleTableLink|Genetic|genotype}} ==
  
: The environmental component of a phenotype description.
+
Definition:
 +
<div class="quotebox">{{ChadoTable_genotype}}</div>
  
 
=== {{ChadoModuleTableLink|Genetic|genotype}} @ Flybase ===
 
=== {{ChadoModuleTableLink|Genetic|genotype}} @ Flybase ===
Line 435: Line 439:
 
== {{ChadoModuleTableLink|Genetic|feature_genotype}} ==
 
== {{ChadoModuleTableLink|Genetic|feature_genotype}} ==
  
: No description
+
Definition:
 +
<div class="quotebox">{{ChadoTable_feature_genotype}}</div>
  
 
This is the table that connects data to features, finally.  The rank and group have some semantics.
 
This is the table that connects data to features, finally.  The rank and group have some semantics.

Latest revision as of 23:33, 8 October 2012

Under Construction

This page or section is under construction.

Should be done before Jan 1, 2011

The Chado Phenotype module was developed at FlyBase.

To try and understand the existing phenotype module during the GMOD Evo Hackathon, participants spent some time exploring the phenotype tables at FlyBase. The notes from that exploration are on this page.

Phenotypes are not cleanly implemented in a single module. The tables are spread across the Phenotype and Genetic modules. This page discusses the tables from both modules. It also covers some related tables.

These notes reflect the state of the FlyBase Chado database during the first two weeks on November 2010.

Tables

Table rows at FlyBase
feature_phenotype 0
phenotype 10,925
phenotype_comparison 100,235
phenotype_cvterm 7,714
phenstatement 20,673
phendesc 112,390

And those are the only tables that have "phen" in their name.

Important Linked Tables

And that's it?

phenotype

<protect>
Table: phenotype
Module: Phenotype

A phenotypic statement, or a single atomic phenotypic observation, is a controlled sentence describing observable effects of non-wild type function. E.g. Obs=eye, attribute=color, cvalue=red.

phenotype columns
FK Name Type Description
phenotype_id serial PRIMARY KEY
uniquename text UNIQUE

NOT NULL

cvterm observable_id integer The entity: e.g. anatomy_part, biological_process.
cvterm attr_id integer Phenotypic attribute (quality, property, attribute, character) - drawn from PATO.
value text Value of attribute - unconstrained free text. Used only if cvalue_id is not appropriate.
cvterm cvalue_id integer Phenotype attribute value (state).
cvterm assay_id integer Evidence type.

Tables referencing phenotype via foreign key constraints:

</protect>

What is a Phenotype?

From conversation with Jim Balhoff and Matt Yoder:

A phenotype is what you observe.

This means that stage or time since conception is not part of the phenotype. It is part of the experiment. (Unless you are observing development rate, in which case, stage would be the phenotype.)

PATO at FlyBase?

That table description in the schema sounds like the old PATO, not the new EQ model.

Does this table implement PATO at FlyBase?

No

  • observable_id is the Entity. Usually points to anatomy or cellular_component CVs.
  • But attr_id is not used and cvalue_id points to compound terms in the FlyBase Miscellaneous CV.

phenotype @ Flybase

Counts
Table Rows 10925
observable_id is null 1
attr_id is null 1
cvalue_id is null 1
assay_id is null 1
value is null 10925

All the CVterm FKs that are null have the uniquename of "unspecified"

uniquename and observable_id

Most uniquenames are observable_id -> cvterm.name + blank or integer, but not when name is "unspecified", uniquename can be anything. Some examples:

  • viable
  • long lived | dominant
  • fertile
  • mitotic cell cycle defective | recessive | somatic clone
  • lethal | larval stage

There are 1473 of these (every one of them unspecified).


SELECT COUNT(*), c.name
  FROM phenotype p, cvterm t, cv c
 WHERE t.cvterm_id = p.observable_id
   AND t.cv_id = c.cv_id
 GROUP BY 2
 ORDER BY 1 DESC
 LIMIT 50;
count |           name
------+--------------------------
 8914 | FlyBase anatomy CV
 1473 | FlyBase miscellaneous CV
  436 | cellular_component
   91 | biological_process
    9 | property type
    1 | SO
(6 rows)


attr_id and assay_id

attr_id is supposed to be "Phenotypic attribute (quality, property, attribute, character) - drawn from PATO." but is not.

assay_id is supposed to be "Evidence type" but it is not.


At FlyBase, in all rows (except the null one) both of these point to cvterm "unspecified"

So, this is not used for PATO at Flybase. I'm not sure why it isn't null.

cvalue_id

Schema doc says "Phenotype attribute value (state)."

There are only 118 distinct values. Some of them:

  • DNA repair defective
  • Minute
  • aging defective
  • auditory perception defective
  • auxotroph
  • bang sensitive
  • behavior defective

All from: FlyBase miscellaneous CV

value

value is null in every row in FlyBase.

phenotype_comparison

<protect>
Table: phenotype_comparison
Module: Genetic

Comparison of phenotypes e.g., genotype1/environment1/phenotype1 "non-suppressible" with respect to genotype2/environment2/phenotype2.

phenotype_comparison columns
FK Name Type Description
phenotype_comparison_id serial PRIMARY KEY
genotype genotype1_id integer UNIQUE#1

NOT NULL

environment environment1_id integer UNIQUE#1

NOT NULL

genotype genotype2_id integer UNIQUE#1

NOT NULL

environment environment2_id integer UNIQUE#1

NOT NULL

phenotype phenotype1_id integer UNIQUE#1

NOT NULL

phenotype phenotype2_id integer
pub pub_id integer UNIQUE#1

NOT NULL

organism organism_id integer NOT NULL


Tables referencing phenotype_comparison via foreign key constraints:

</protect>

This table is good for binary comparisons. How broadly useful is that?

phenotype_comparison @ Flybase

Counts
Table Rows 100,235
phenotype2_id is null 0
phenotype1_id = phenotype2_id 42,510
genotype1_id = genotype2_id 0
environment1_id = environment2_id 99,293

Every column but phenotype2_id is not nullable, and that column has no nulls in it either.

environment[12]_id

The above counts imply a lot comparisons using different genotypes in the same environment. No it does not. In every one of those 99K comparisons, environment is "unspecified".

So, FlyBase has known environments for 1% of its comparisons.

phenotype[12]_id and genotype[12]_id

The 42K comparisons where the phenotypes are the same mean that the genotype did not change this phenotype.

type_id

type_id is a phantom column. It does not exist in current Chado or at FlyBase.

How do we say how they compare without this column?

What makes two G/E/P pairs comparable?

phenotype_cvterm

<protect>
Table: phenotype_cvterm
Module: Phenotype


phenotype_cvterm columns
FK Name Type Description
phenotype_cvterm_id serial PRIMARY KEY
phenotype phenotype_id integer UNIQUE#1

NOT NULL

cvterm cvterm_id integer UNIQUE#1

NOT NULL

rank integer UNIQUE#1

NOT NULL


Tables referencing phenotype_cvterm via foreign key constraints:

  • None.</protect>

Can imagine it just attaches a bag of terms to the phenotype. There is no semantics on how the terms relate to each other or to the phenotype.

phenotype_cvterm @ Flybase

Counts
Table Rows 7714

rank

Flybase's table def has a rank column, which is > 0 for about 2000 rows. The rank column is part of standard Chado, but not listed on the web site.

There is a unique constraint on (phenotype_id, cvterm_id, rank). Why?

cvterm_id

Links to terms in 5 different CVs

  • FlyBase miscellaneous CV (75%)
  • FlyBase anatomy CV
  • SO
  • FlyBase development CV (25%)
  • biological_process

Semantics?

Find rows that use rank in the unique constraint.

SELECT * FROM phenotype_cvterm pcv1
 WHERE EXISTS (
    SELECT * FROM phenotype_cvterm pcv2
     WHERE pcv1.phenotype_cvterm_id <> pcv2.phenotype_cvterm_id
       AND pcv1.phenotype_id = pcv2.phenotype_id
       AND pcv1.cvterm_id = pcv2.cvterm_id) LIMIT 10;
phenotype_cvterm_id | phenotype_id | cvterm_id | rank
--------------------+--------------+-----------+------
               8763 |        15842 |     60843 |    0
               8764 |        15842 |     60843 |    1
              12006 |        19433 |     60843 |    1
              12007 |        19433 |     60843 |    2
(4 rows)

So, what does that mean?

Rank is used for some sort of code that is not stored in the database?

phenstatement

<protect>
Table: phenstatement
Module: Genetic

Phenotypes are things like "larval lethal". Phenstatements are things like "dpp-1 is recessive larval lethal". So essentially phenstatement is a linking table expressing the relationship between genotype, environment, and phenotype.

phenstatement columns
FK Name Type Description
phenstatement_id serial PRIMARY KEY
genotype genotype_id integer UNIQUE#1

NOT NULL

environment environment_id integer UNIQUE#1

NOT NULL

phenotype phenotype_id integer UNIQUE#1

NOT NULL

cvterm type_id integer UNIQUE#1

NOT NULL

pub pub_id integer UNIQUE#1

NOT NULL


Tables referencing phenstatement via foreign key constraints:

  • None.</protect>

phenstatement @ Flybase

Counts
Table Rows 203673

All columns are not null; all except PK are FKs. Therefore phenstatements require

  • genotype
  • environment
  • phenotype
  • cvterm (type_id ?)
  • pub

Well, sort of.

  • type_id is "unspecified" for every row.
  • 7% of pubs link to one with a blank title
  • 97% of environments point to "unspecified"

There are 72K linked genotypes. They use a custom markup language in the uniquename.

phendesc

<protect>
Table: phendesc
Module: Genetic

A summary of a _set_ of phenotypic statements for any one gcontext made in any one publication.

phendesc columns
FK Name Type Description
phendesc_id serial PRIMARY KEY
genotype genotype_id integer UNIQUE#1

NOT NULL

environment environment_id integer UNIQUE#1

NOT NULL

description text NOT NULL
cvterm type_id integer UNIQUE#1

NOT NULL

pub pub_id integer UNIQUE#1

NOT NULL


Tables referencing phendesc via foreign key constraints:

  • None.</protect>

You can't link directly from a phenotype to phendesc. You can link directly from a phenstatement to a {[ChadoTableName|phendesc}} using (genotype_id, environment_id, and pub_id). Note that this does not take full advantage of the unique index on those 3 columns plus type_id.

This table establishes two things:

  • A general comment about a set of phenstatement's.
  • A general type for the overall phenotype. See below for options at FlyBase.

This table means there can only be one phendesc for each environment/genotype/pub publication. I think that makes sense.

It does mean we wont link to phenstatements with the wrong genotype/environment, but it also means we can orphan phendesc rows that don't join with any phenstatements.

phendesc @ Flybase

Counts
Table Rows 112,390
description is null 0

No columns are nullable. There is description, 4 FKs (genotype_id, environment_id, type_id, pub_id), and the PK.

Description is not unique with "Homozygous lethal." and "homozygous lethal" making up ~1% of records.

type_id

SELECT COUNT(*), t.name, c.name
  FROM phendesc p, cvterm t, cv c
 WHERE p.type_id = t.cvterm_id
   AND t.cv_id = c.cv_id
 GROUP BY 2,3
 ORDER BY 1 DESC
 LIMIT 50;
count |           name            |     name
------+---------------------------+---------------
71688 | single_mutant_pheno       | phendesc type
18648 | genetic_interaction_pheno | phendesc type
12809 | aberr_pheno               | phendesc type
 7662 | interallele_comp          | phendesc type
 1583 | xeno_interaction_pheno    | phendesc type
(5 rows)

pub_id

11,930 rows (about 10% of rows) link to the pub with no title.

environment_id

101,746 rows (over 90%) link to "unspecified"

genotype_id

There is wide distribution of genotypes. The top genotype has 122 phendesc records.

Description

Free form description of phenotype. Almost all entries are informative, and none are null.

environment

Definition:

<protect>
Table: environment
Module: Genetic

The environmental component of a phenotype description.

environment columns
FK Name Type Description
environment_id serial PRIMARY KEY
uniquename text UNIQUE

NOT NULL

description text

Tables referencing environment via foreign key constraints:

</protect>

environment @ Flybase

Counts
Table Rows 15

description is null in all rows.

Uniquename choice are:

unspecified
conditional qualifier
temperature conditional
drug conditional
conditional tetracycline
conditional RU486
Drosophila cell culture
In transgenic Drosophila (intraspecific)
In transgenic Drosophila (allele of one drosophilid species in genome of another drosophilid)
In transgenic Drosophila (allele of foreign species in genome of drosophilid)
Whole-organism transient assay (intraspecific)
Whole-organism transient assay (allele from one drosophilid species assayed in another drosophilid)
Whole-organism transient assay (allele of foreign species assayed in drosophilid)
conditional - heat sensitive
conditional - cold sensitive

environment_cvterm

Definition:

<protect>
Table: environment_cvterm
Module: Genetic


environment_cvterm columns
FK Name Type Description
environment_cvterm_id serial PRIMARY KEY
environment environment_id integer UNIQUE#1

NOT NULL

cvterm cvterm_id integer UNIQUE#1

NOT NULL


Tables referencing environment_cvterm via foreign key constraints:

  • None.</protect>

No semantics here. My guess is associates a list of CV terms with an environment.

environment_cvterm @ Flybase

Counts
Table Rows  ?

genotype

Definition:

<protect>
Table: genotype
Module: Genetic

Genetic context. A genotype is defined by a collection of features, mutations, balancers, deficiencies, haplotype blocks, or engineered constructs.

genotype columns
FK Name Type Description
genotype_id serial PRIMARY KEY
name text Optional alternative name for a genotype,

for display purposes.

uniquename text UNIQUE

NOT NULL
The unique name for a genotype; typically derived from the features making up the genotype.

description character varying(255)

Tables referencing genotype via foreign key constraints:

</protect>

genotype @ Flybase

Counts
Table Rows 263,153
description is null 263,153
name is null 165,210
description and name are null 165,210

There is no genotype_cvterm table.

uniquename

Uniquename is the only informative column is 2/3 of the rows. It uses a custom FlyBase encoding of information.

name

In the cases where this is not null it uses a custom FlyBase nomenclature that often includes one or more FlyBase IDs.

feature_genotype

Definition:

<protect>
Table: feature_genotype
Module: Genetic


feature_genotype columns
FK Name Type Description
feature_genotype_id serial PRIMARY KEY
feature feature_id integer UNIQUE#1

NOT NULL

genotype genotype_id integer UNIQUE#1

NOT NULL

feature chromosome_id integer UNIQUE#1

A feature of SO type "chromosome".

rank integer UNIQUE#1

NOT NULL
rank can be used for n-ploid organisms or to preserve order.

cgroup integer UNIQUE#1

NOT NULL
Spatially distinguishable group. group can be used for distinguishing the chromosomal groups, for example (RNAi products and so on can be treated as different groups, as they do not fall on a particular chromosome).

cvterm cvterm_id integer UNIQUE#1

NOT NULL


Tables referencing feature_genotype via foreign key constraints:

  • None.</protect>

This is the table that connects data to features, finally. The rank and group have some semantics.

feature_genotype @ Flybase

Counts
Table Rows 551,551
chromosome_id is null 0

cvterm_id

All cvterm_ids are unspecified. Every one of them. No doc on this.

cgroup

Spatially distinguishable group. group can be used for distinguishing the chromosomal groups, for example (RNAi products and so on can be treated as different groups, as they do not fall on a particular chromosome).

455K are 0; After that it diminishes up to 6, which only has 5 records.

What does this mean? It's not an FK to anywhere, just an integer that must be unique in combination with feature_id, genotype_id, chromosome, rank, cgroup, and cvterm_id (every non-PK column).

chromosome_id

Every record points to the same feature_id, which has the name "unspecified". Is this supposed to be a denormalization optimization that they don't use at FlyBase?

rank

rank can be used for n-ploid organisms or to preserve order.