Difference between revisions of "Chado Sequence Module"

From GMOD

Revision as of 17:39, 14 February 2007

Introduction

The central module in Chado is the sequence module. The fundamental table within this module is the feature table, for describing biological sequence features. Chado defines a feature to be a region of a biological polymer (typically a DNA, RNA or polypeptide molecule) or an aggregate of regions on this polymer. As the term is used here, a region can be the entire extent of the molecule, or a junction between two bases. Features can be typed according to a classification scheme[6]; they can be localized relative to other features; and they can form part-whole and other relationships with other features.

There are many different types of features. Examples include gene, exon, transcript, regulatory region, chromosome, sequence variation, polypeptide, protein domain and cross-genome match regions. Chado does not have a different table for each kind of feature; all features are stored in the feature table. Types of feature are differentiated using a type_id column, which is a foreign key to the cvterm table in the cv (ontology) module, described later. This allows us to type features according to the Sequence Ontology (SO). The use of ontologies to type rows gives Chado a subtyping mechanism, which is absent from the standard relational model. For example, SO tells us that mRNA and snRNA are different kinds of transcript. This is discussed in more detail in the next section. For the purposes of discussion in this document, it can be assumed that any reference to genes, exons, polypeptides, SNPs, chromosomes, transcripts, various kinds of RNAs and so on refers to features of that Sequence Ontology type.

The Chado feature table has a text-valued column named residues for storing the sequence of the feature. The value of this column is a string of IUPAC[REF] symbols corresponding to the sequence of biochemical residues encoded by the feature. This column is optional, because the sequence of the feature may not be known. Even if the sequence of a feature is known, it may not be desirable to store it in the feature table, as it may be possible to infer the sequence from the sequence of other features in the database. For example, exon sequences are generally not stored, as these can trivially be inferred from the sequence of the genomic feature on which the exon is located. In contrast, mRNA and other processed transcript sequences are stored, as it is less trivial and more computationally expensive to splice together the mRNA sequence dynamically.

It is important to realize that the existence of a row in the feature table does not necessarily imply that the feature has been characterized as a result of genome annotation. It is possible to have features of SO type gene for genes that have only been characterized through genetic studies [REF], and for which neither sequence nor sequence location is known. This is in contrast to other feature schemas (such as GFF) in which it is not possible to represent features without representing a location in sequence coordinates. This design decision is crucial for the use of Chado as a database for integrating information about the same entity from multiple perspectives.

Because the sequence is stored as a column in the feature table rather than as an independent table, sequences cannot exist in the absence of a row in the feature table; sequences are dependent upon features. This is in contrast with almost all other genomics schemas, which allow independent treatment of sequences and features. This design decision was made for both philosophical and pragmatic reasons. The feature table also contains the columns seqlen and md5checksum, for storing the length of the sequence and the 32-character checksum computed using the MD5 [RL Rivest. RFC 1321: The md5 message-digest algorithm. Technical report, Internet Activities Board, April 1992.] algorithm. The length and checksum can be stored even when the residues column is null valued. The checksum is useful for checking whether two or more features share the same sequence, without comparing the entire sequence string.

The existence of these columns means that this table is no longer in third normal form (3NF)[REF], which is usually a desirable formal property of a relational database. On balance, the utility of these columns outweighs the disadvantages of violating 3NF [updates]. In practical terms, it means that the values of the residues, seqlen and md5checksum columns are interdependent and cannot be updated independently of one another.
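As a rough illustration of this interdependence, the following Python sketch derives seqlen and md5checksum from a residues string. The helper name sequence_columns is hypothetical and not part of Chado; the sketch only assumes that the checksum is the hexadecimal MD5 digest of the residues as stored.

```python
import hashlib


def sequence_columns(residues):
    """Return the (seqlen, md5checksum) pair implied by a residues string.

    Hypothetical helper: it assumes the checksum is the hex MD5 digest
    of the residues exactly as they are stored.
    """
    digest = hashlib.md5(residues.encode("ascii")).hexdigest()
    return len(residues), digest


seqlen, checksum = sequence_columns("ATGGCC")

# Two features share a sequence exactly when their checksums agree,
# so the (possibly very long) strings need not be compared directly.
same_sequence = sequence_columns("ATGGCC")[1] == checksum
different_sequence = sequence_columns("ATGGCA")[1] == checksum
```

Whenever residues changes, both derived values have to be recomputed in the same update, which is the practical cost of leaving 3NF.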

The feature table has a Boolean-valued column, is_analysis, indicating whether this is an annotation or a computed feature from a computational analysis. Annotations are features that are generated or blessed by a human curator, or in some cases by an integrated genome pipeline[7-9] capable of synthesising gene models and other annotations from in-silico analyses. They constitute the definitive version of a particular feature, in contrast to the features generated by gene prediction programs and sequence similarity searches such as BLAST.

The feature table has a dbxref_id column that refers to a global, stable public identifier for the feature. This column is optional, because not all classes of features have such identifiers; for example, features resulting from gene predictions and BLAST HSP features may be less stable and thus lack public identifiers. It is recommended that most annotated features have dbxref_ids. The organism_id column refers to a row in the organism table (defined in the organism module). This column is mandatory: all features derive from a single organism.

The name and uniquename columns allow features to be labelled. The name column is optional, but it is recommended that all annotated features (as opposed to those that arise from purely computational methods) have names. The name should be a simple, concise, human-friendly display label (such as a gene or gene product symbol, as defined by the nomenclature rules governing the organism). User interface software (such as GBrowse[10] and Apollo[11]) can use the name column for labelling feature glyphs in user displays. Uniqueness of name within any particular organism or genome project is a desirable characteristic, but is not enforced in the schema, since there are occasions where name clashes are unavoidable. In contrast, the uniquename column is required, and is guaranteed to be unique when taken in combination with organism_id and type_id; this is enforced by a constraint in the relational schema. The uniquename may be human-friendly (for example, it can be the same as the name); however, it is not guaranteed to be so, and in general should not be displayed to the end user. Its use is mainly as an alternate unique key on the table.

The uniquename normally conforms to some naming rule; these rules may vary across Chado instances, but they should all guarantee the uniqueness of the (uniquename, organism_id, type_id) triple.
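The difference between name and uniquename can be sketched with a cut-down feature table. This is an illustration only: it uses SQLite through Python's sqlite3 module, omits most of the real columns, and the identifiers inserted (such as "G0001") are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for the feature table, not the real Chado DDL.
conn.execute(
    """
    CREATE TABLE feature (
        feature_id  INTEGER PRIMARY KEY,
        organism_id INTEGER NOT NULL,
        type_id     INTEGER NOT NULL,
        name        TEXT,
        uniquename  TEXT NOT NULL,
        UNIQUE (organism_id, uniquename, type_id)
    )
    """
)

insert = (
    "INSERT INTO feature (organism_id, type_id, name, uniquename) "
    "VALUES (?, ?, ?, ?)"
)
conn.execute(insert, (1, 10, "Adh", "G0001"))
# A clashing display name is tolerated, as long as the uniquename differs.
conn.execute(insert, (1, 10, "Adh", "G0002"))

try:
    # Same organism, type and uniquename: the constraint rejects this row.
    conn.execute(insert, (1, 10, "Adh-dup", "G0001"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM feature").fetchone()[0]
```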


Feature synonyms

In addition to having a name or symbol, it is common for features such as genes to have multiple synonyms or aliases. These synonyms may exist due to different publications referring to the same gene with different symbols, or because one gene was once believed to be two or more separate genes. A common curation operation on genes[REF] is splitting and merging, which results in the creation of synonyms.

This is modelled in Chado with a synonym table and a feature_synonym linking table; thus multiple features can potentially share the same synonym, and a single feature can have multiple synonyms. Use of a synonym in the literature is indicated with a pub_id foreign key referencing the pub table (described later in the section on the publications module), indicating historical provenance for the use of a synonym.


Feature locations

Features can potentially be localized using a sequence coordinate system. A relative localization model is used, so all feature localizations must be relative to another feature. Some features such as those of type chromosome are not localized in sequence coordinates. Locations are stored in the featureloc table, also part of the sequence module. Other non-sequence oriented kinds of localization (such as physical localization from in situ experiments, or genetic localizations from linkage studies) are modelled outside the sequence module (for example, in the expression or map module).

A feature can have zero or more featurelocs, although it will typically have either one (for localized features for which the location is known) or zero (for unlocalized features such as chromosomes, or for features for which the location is not yet known, such as a gene discovered using classical genetics techniques). Features with multiple featurelocs will be explained later.

A featureloc is an interval in interbase sequence coordinates (see figure), bounded by the fmin and fmax columns, which represent respectively the lower and upper linear position of the boundary between bases or base pairs, with directionality indicated by the strand column. Interbase coordinates were chosen over the more commonly used base-oriented coordinate system because they are more naturally amenable to the standard arithmetic operations that are typically performed upon sequence coordinates. This leads to cleaner and more efficient database coding logic that is arguably less prone to errors. Of course, interbase coordinates are typically transformed into the more common base-oriented system used by BLAST reports and so forth prior to presentation to the end-user.

The relational schema includes a constraint which ensures that fmin <= fmax is always true; any attempt to put the database into a state which violates this will raise an error.
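The arithmetic that interbase coordinates simplify can be sketched in a few lines of Python. The Featureloc class and the helper functions below are hypothetical names used for illustration; only fmin, fmax, strand and the fmin <= fmax rule come from the schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Featureloc:
    """Interbase interval; mirrors the fmin, fmax and strand columns."""
    fmin: int
    fmax: int
    strand: int = 1

    def __post_init__(self):
        # Same rule as the schema constraint: fmin <= fmax.
        if self.fmin > self.fmax:
            raise ValueError("fmin must be <= fmax")

    def length(self):
        # No "+ 1" correction is needed in interbase coordinates.
        return self.fmax - self.fmin

    def to_base_oriented(self):
        """1-based inclusive (start, end), as shown to end users.

        Not meaningful for zero-length features (fmin == fmax), which
        sit between two bases.
        """
        return self.fmin + 1, self.fmax


def overlaps(a, b):
    """True if two intervals on the same source share at least one base."""
    return a.fmin < b.fmax and b.fmin < a.fmax


def subsequence(residues, loc):
    """Forward-strand residues covered by loc (a plain string slice)."""
    return residues[loc.fmin:loc.fmax]


exon = Featureloc(2, 5)
neighbour = Featureloc(5, 9)          # abuts exon without overlapping it
exon_residues = subsequence("ACGTACGT", exon)
```

Here exon has length 3, corresponds to bases 3 to 5 in base-oriented coordinates, and covers the residues "GTA". Two abutting features share a boundary position but no bases, so overlaps returns False for them.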

As mentioned previously, a featureloc must be localized relative to another feature, indicated using the srcfeature_id foreign key column, referencing the feature table. There is nothing in the schema prohibiting localization chains; for example, locating an exon relative to a contig that is itself localized relative to a chromosome (see figure). The majority of Chado database instances will not require this flexibility; features are typically located relative to chromosomes or chromosome arms. Nevertheless, the ability to store such localization networks or location graphs can be useful for unfinished genomes or parts of genomes such as heterochromatin [REF], in which it is desirable to locate features relative to stable contigs or scaffolds, which are themselves localized in an unstable assembly to chromosomes or chromosome arms. Localization chains do not necessarily only span assemblies; protein domains may be localized relative to polypeptide features, themselves localized to a transcript (or to the genome, as is more common). Chains may also span sequence alignments.

The Feature Location Graph

We will now present a short formal treatment of the properties of these hierarchies of localization using graph theory. This treatment can be ignored for the purposes of understanding the basics of the Chado schema; the end-user of the database will be entirely unaware of such technicalities. However, for the purposes of software engineering and ensuring interoperability between different Chado database instances and different applications, formal treatments such as these are an essential requirement for software specifications.

We can define a featureloc graph (LG) as being a set of vertices and edges, with each feature constituting a vertex, and each featureloc constituting an edge going from the parent feature_id vertex to the srcfeature_id vertex. Each vertex is labelled with column values from the feature table, and each edge is labelled with column values from the featureloc table. The LG is not allowed to contain cycles; it is a directed acyclic graph (DAG). This includes self-cycles: no feature may be localized relative to itself.

The roots of the LG are the features that do not have a featureloc row, typically chromosomes or chromosome arms, although LG roots may also be unassembled contigs, scaffolds or features for which sequence localization is not yet known (such as genes discovered through classical genetics techniques). The leaves of the LG are any features that are not present as a srcfeature_id in any featureloc row, typically the bulk of features, such as genes, exons, matches and so on. The depth of a particular LG g, denoted D(g), is the maximum number of edges between any leaf-root pair. As has been previously noted, many Chado instances will have LGs with a uniform depth of 1. Such LGs are said to be simple and the features within them are said to be singletons. The maximum depth of all LGs in a particular database instance i is denoted LGDmax(i).
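The definitions above can be made concrete with a small Python sketch that computes D(g) from a list of featureloc edges and rejects cyclic input. The function name lg_depth and the feature identifiers are hypothetical; each edge is assumed to be a (feature_id, srcfeature_id) pair.

```python
def lg_depth(featurelocs):
    """Return D(g) for a location graph given as (feature_id, srcfeature_id) edges.

    Raises ValueError if the edges contain a cycle, since the LG must be a DAG.
    """
    sources = {}
    for feature_id, srcfeature_id in featurelocs:
        sources.setdefault(feature_id, set()).add(srcfeature_id)

    memo = {}
    visiting = set()

    def depth(feature_id):
        # Number of edges on the longest path from this feature to a root.
        if feature_id in memo:
            return memo[feature_id]
        if feature_id in visiting:
            raise ValueError("location graph contains a cycle")
        visiting.add(feature_id)
        result = max(
            (1 + depth(src) for src in sources.get(feature_id, ())),
            default=0,
        )
        visiting.discard(feature_id)
        memo[feature_id] = result
        return result

    nodes = set(sources)
    for srcs in sources.values():
        nodes |= srcs
    return max((depth(node) for node in nodes), default=0)


simple = [("gene1", "chr2L"), ("exon1", "chr2L")]
chained = [("exon1", "contig1"), ("contig1", "chr2L"), ("gene2", "chr2L")]
simple_depth = lg_depth(simple)
chained_depth = lg_depth(chained)
```

The simple graph has depth 1, while the graph with an exon located on a contig that is in turn located on a chromosome arm has depth 2.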

The schema does not constrain the maximum depth of the LG. This flexibility proves useful when applying Chado to the highly variable needs of multiple different genome projects; however, it can lead to efficiency problems when querying the database. It can also make it more difficult to write software to interoperate with the database, as the software must take into account different contingencies. We can solve this problem by collapsing the LG, in which a graph of arbitrary depth is flattened to a depth of 1, transforming or projecting featurelocs onto the root features (typically chromosomes or chromosome arms). The original featurelocs are left unaltered in the database, and additional redundant featurelocs between leaf and root features are added to the database. These new featurelocs are known as inferred featurelocs. In the schema, inferred featurelocs are differentiated from direct featurelocs using the locgroup column. Direct (non-inferred) localizations are indicated by the locgroup column taking the value 0, and inferred (transitive) localizations are indicated by this column having a value > 0.
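One step of this projection can be sketched as follows. The function project is a hypothetical helper, not part of Chado; it assumes each location is an (fmin, fmax, strand) tuple in interbase coordinates and that the source feature is itself located on the root by a single featureloc.

```python
def project(loc, src_loc):
    """Project a location on a source feature onto that source's own source.

    loc     -- (fmin, fmax, strand) of the feature, relative to the source
    src_loc -- (fmin, fmax, strand) of the source, relative to the root
    Returns the inferred (fmin, fmax, strand) relative to the root.
    """
    fmin, fmax, strand = loc
    src_fmin, src_fmax, src_strand = src_loc
    if src_strand >= 0:
        # Source lies on the forward strand: a simple offset.
        return src_fmin + fmin, src_fmin + fmax, strand
    # Source lies on the reverse strand: coordinates are mirrored
    # around the source's upper boundary and the strand flips.
    return src_fmax - fmax, src_fmax - fmin, -strand


exon_on_contig = (100, 200, 1)
forward = project(exon_on_contig, (5000, 6000, 1))
reverse = project(exon_on_contig, (5000, 6000, -1))
```

With the contig on the forward strand the exon projects to (5100, 5200, 1); with the contig on the reverse strand it projects to (5800, 5900, -1). The result would be stored as an additional featureloc row with locgroup > 0, leaving the direct row (locgroup 0) untouched. Deeper graphs can be collapsed by applying the same step repeatedly until a root is reached.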

The terminology used above can be used to define specifications for applications intended to interoperate with the database.

Feature location pairs

Certain kinds of features have paired locations. These include hits and high-scoring pairs (HSPs) coming from sequence search programs such as BLAST, and syntenic chromosomal regions. These kinds of features have two featurelocs (in contrast to the usual one): one on the query feature and one on the subject (hit) feature. We differentiate the two featurelocs with the rank column. A rank of 0 indicates a location relative to the query (as is the default for most features), and a rank of 1 indicates a location relative to the subject (hit) feature.

For multiple alignments (e.g. CLUSTALW [REF] results), this scheme is extended to unbounded ranks [0..n], with arbitrary ordering. Alignments are stored in the residue_info column. CIGAR format[REF] is used for pairwise alignments.

Multiple featurelocs may also be required for features of type sequence variant (SO:0000109), indicating points or extents which vary between reference and non-reference sequences. From a modelling standpoint, variants are conceptually similar to alignments; with variants we are noting a difference as opposed to a similarity. Here a rank of zero indicates the wild-type (or reference) feature and a rank of one or more indicates the variant (or non-reference) feature, with the residue_info column representing the sequence on wild-type and variant. [?figure]

A featureloc is uniquely identified by the [feature_id, rank, locgroup] triple. This means that no feature can have more than one featureloc with the same rank and locgroup. In other words, rank and locgroup uniquely identify a featureloc for any particular feature.
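The paired locations of an HSP and the uniqueness rule on the triple can be sketched with a simplified featureloc table. This is an illustration under stated assumptions: it uses SQLite via Python's sqlite3 module rather than the real Chado DDL, and the feature_id values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for featureloc; "rank" is quoted defensively.
conn.execute(
    """
    CREATE TABLE featureloc (
        featureloc_id INTEGER PRIMARY KEY,
        feature_id    INTEGER NOT NULL,
        srcfeature_id INTEGER,
        fmin          INTEGER,
        fmax          INTEGER,
        strand        INTEGER,
        residue_info  TEXT,
        locgroup      INTEGER NOT NULL DEFAULT 0,
        "rank"        INTEGER NOT NULL DEFAULT 0,
        CHECK (fmin <= fmax),
        UNIQUE (feature_id, locgroup, "rank")
    )
    """
)

HSP, QUERY, SUBJECT = 1, 100, 200  # hypothetical feature_id values
add = (
    "INSERT INTO featureloc "
    '(feature_id, srcfeature_id, fmin, fmax, strand, "rank") '
    "VALUES (?, ?, ?, ?, ?, ?)"
)
conn.execute(add, (HSP, QUERY, 10, 60, 1, 0))      # rank 0: on the query
conn.execute(add, (HSP, SUBJECT, 500, 550, 1, 1))  # rank 1: on the subject

try:
    # A second rank-0, locgroup-0 location for the same feature is refused.
    conn.execute(add, (HSP, QUERY, 20, 70, 1, 0))
    second_rank0_rejected = False
except sqlite3.IntegrityError:
    second_rank0_rejected = True

subject_loc = conn.execute(
    'SELECT srcfeature_id, fmin, fmax FROM featureloc '
    'WHERE feature_id = ? AND "rank" = 1',
    (HSP,),
).fetchone()
```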


Difference between the chado location model and other schemas

There is a crucial difference between the Chado location model and the sequence location model used in other schemas, such as GFF, GenBank, BioSQL, BioPerl, etc.

First, Chado is the only model to use the concept of rank and locgroup. Second, and perhaps more important, all these other models allow discontiguous locations (also known as split locations). These will be familiar to anyone who has inspected GenBank annotated DNA records for an organism that has introns within the transcripts; the transcript location is modelled as a sequence of non-contiguous intervals on the genome. Each interval represents the location of an exon.

Although Chado allows a feature to have multiple locations, this is only with variable rank and locgroup; this is enforced by a uniqueness constraint in the relational schema. We made a conscious decision to avoid discontiguous locations, because the extra degree of freedom this affords results in either redundancies or ambiguities. Redundancies arise when exons are stored in addition to a discontiguous transcript, and ambiguities arise by virtue of the fact that explicit representation of the exons may be seen as optional. Ambiguities are undesirable as they make it harder for databases to interoperate. The omission of discontiguous locations does not restrict the expressive capacity of Chado in any way, because any discontiguous location can be modelled as a collection of features with contiguous locations. For example, a transcript with a discontiguous location can be modelled as a collection of exons with contiguous featurelocs, and a transcript with a single contiguous featureloc representing the outer boundaries defined by the outermost exons.
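The relationship between exon locations and the single contiguous transcript location can be sketched as below. The function transcript_bounds is a hypothetical helper; it assumes all exons are given as (fmin, fmax) pairs relative to the same source feature.

```python
def transcript_bounds(exon_locs):
    """Return the (fmin, fmax) of a transcript from its exons' (fmin, fmax) pairs.

    The transcript gets one contiguous interval spanning the outermost
    exon boundaries; the exon structure stays in the exon features.
    """
    exon_locs = list(exon_locs)
    if not exon_locs:
        raise ValueError("a transcript needs at least one exon")
    lowest = min(fmin for fmin, _ in exon_locs)
    highest = max(fmax for _, fmax in exon_locs)
    return lowest, highest


exons = [(300, 450), (100, 200), (600, 700)]
bounds = transcript_bounds(exons)
```

For the three exons above the transcript is stored with the single location (100, 700); what a split-location model would record as "100..200, 300..450, 600..700" on the transcript is carried by the three exon rows instead.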


Extensible feature properties

The feature table has a fairly limited set of columns for recording feature data. For example, there is no anticodon column for recording the RNA triplet for the adapter in a tRNA feature (all feature types, including tRNAs, are recorded as rows in the feature table). If we were to add columns such as anticodon, then the number of columns in the table would become very large and difficult to manage, and most would end up being nullable (for example, anticodon does not apply to non-tRNA features). This is because different organisms, different types of feature and different projects have differing needs regarding what extra data should be attached to any one feature. How then are we to attach both biologically relevant and project-specific data to features? Chado solves this by using an extensible mechanism for attaching attribute-value pairs to features via the featureprop table. The featureprop.type_id foreign key column references a property in the Sequence Feature Property Ontology (SFPO)[url], distributed as part of Chado. The text-valued value column stores the filler for that property. Sets or lists of values for any property can be stored in the featureprop table, differentiated by the value of the rank column. Provenance for the featureprop assignment is stored using the featureprop_pub table in the publications module, described later, allowing multiple publications to be associated with any one assignment.

Because featureprop values can be of an arbitrary size, they are modelled using a SQL TEXT type. This has some disadvantages from a query efficiency perspective. Numeric values cannot be indexed correctly, and sorting the results of a query can only be done via a SQL casting operation, or in software outside of the database management system, either of which may result in poorer performance. This is one of several areas in Chado where performance has been traded in favour of a simpler, more abstract and generic model. Later on we will look at strategies for offsetting some of these performance penalties.
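The sorting problem with text-valued properties can be demonstrated with a cut-down featureprop table. This sketch uses SQLite through Python's sqlite3 module rather than a production Chado instance, and the type_id value 7 is a made-up stand-in for some numeric property.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for featureprop, not the real Chado DDL.
conn.execute(
    """
    CREATE TABLE featureprop (
        featureprop_id INTEGER PRIMARY KEY,
        feature_id     INTEGER NOT NULL,
        type_id        INTEGER NOT NULL,
        value          TEXT,
        "rank"         INTEGER NOT NULL DEFAULT 0,
        UNIQUE (feature_id, type_id, "rank")
    )
    """
)

# Numeric property values are stored as text.
conn.executemany(
    "INSERT INTO featureprop (feature_id, type_id, value) VALUES (?, 7, ?)",
    [(1, "9"), (2, "10"), (3, "100")],
)

# Plain ORDER BY compares the strings character by character.
as_text = [
    v for (v,) in conn.execute(
        "SELECT value FROM featureprop WHERE type_id = 7 ORDER BY value"
    )
]

# An explicit cast restores numeric order, at the cost of converting
# every row at query time.
as_number = [
    v for (v,) in conn.execute(
        "SELECT value FROM featureprop WHERE type_id = 7 "
        "ORDER BY CAST(value AS INTEGER)"
    )
]
```

The text ordering places "10" and "100" before "9", whereas the cast ordering gives 9, 10, 100.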

[example table]


Feature annotations

Detailed annotations, such as associations to Gene Ontology[5] (GO) terms or Cell Ontology[12] terms, can be attached to features using the feature_cvterm linking table. This allows multiple ontology terms to be associated with each feature.

Provenance data can be attached with the feature_cvtermprop and feature_cvterm_dbxref higher-order linking tables. It is up to the curation policy of each individual Chado database instance to decide which kinds of features will be linked using feature_cvterm. Some may link terms to gene features, others to the distinct gene products (processed RNAs and polypeptides) linked to the gene features (see next section).


Relationships between features

Biological features are inter-related; exons are part of transcripts, transcripts are part of genes, and polypeptides are derived from messenger RNAs. Relationships between individual features are stored in the feature_relationship table, which connects two features via the subject_id and object_id columns (foreign keys referring to the feature table) and a type_id (a foreign key referring to a relationship type in an ontology, either SO[6], or the OBO relationship ontology, OBO-REL[13]) indicating the nature of the relationship between subject and object features. The core relationships between features are part-whole (part_of) or temporal (derives_from). "Subject" and "Object" describe the linguistic roles the two features play in a sentence describing the feature relationship. In English, many sentences follow a subject, predicate, object word order. To say that "exons are part of transcripts" is the correct way to describe a typical biological relationship. To say "transcripts are part of exons" is either grammatically or biologically incorrect.

We use this same terminology (which comes from RDF[REF]) again in the cv module. The collection of features and feature relationships can be considered as vertices and edges in a graph, known as the Feature Graph (FG). Some example feature graphs are shown [figure FEATURE-GRAPH]. The FG is independent of the LG. In general the FG and the LG should have no edges in common: if there is a featureloc connecting two features, then the addition of a feature_relationship between these same two features is redundant.

The FG is required in order to query the database for such things as alternately spliced genes, exons shared between transcripts, etc.
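Both kinds of query can be sketched over an in-memory copy of the feature graph. The data and function names below are hypothetical; each relationship is assumed to be a (subject, relationship type, object) triple, and feature_type maps each feature to its SO type name.

```python
from collections import defaultdict

feature_type = {
    "gene-1": "gene", "gene-2": "gene",
    "mRNA-A": "mRNA", "mRNA-B": "mRNA", "mRNA-C": "mRNA",
    "exon-1": "exon", "exon-2": "exon", "exon-3": "exon", "exon-4": "exon",
}

# (subject, relationship type, object), as in feature_relationship.
relationships = [
    ("mRNA-A", "part_of", "gene-1"),
    ("mRNA-B", "part_of", "gene-1"),
    ("mRNA-C", "part_of", "gene-2"),
    ("exon-1", "part_of", "mRNA-A"),
    ("exon-1", "part_of", "mRNA-B"),
    ("exon-2", "part_of", "mRNA-A"),
    ("exon-3", "part_of", "mRNA-B"),
    ("exon-4", "part_of", "mRNA-C"),
]


def wholes_by_part(relationships, feature_type, part_type):
    """Map each part of the given type to the set of wholes it is part_of."""
    wholes = defaultdict(set)
    for subject, rel_type, obj in relationships:
        if rel_type == "part_of" and feature_type[subject] == part_type:
            wholes[subject].add(obj)
    return wholes


def alternately_spliced_genes(relationships, feature_type):
    """Genes that have more than one mRNA attached via part_of."""
    mrnas_per_gene = defaultdict(set)
    for mrna, genes in wholes_by_part(relationships, feature_type, "mRNA").items():
        for gene in genes:
            mrnas_per_gene[gene].add(mrna)
    return sorted(g for g, mrnas in mrnas_per_gene.items() if len(mrnas) > 1)


def shared_exons(relationships, feature_type):
    """Exons that are part_of more than one transcript."""
    exon_wholes = wholes_by_part(relationships, feature_type, "exon")
    return sorted(e for e, wholes in exon_wholes.items() if len(wholes) > 1)


spliced = alternately_spliced_genes(relationships, feature_type)
shared = shared_exons(relationships, feature_type)
```

In this toy graph only gene-1 has two transcripts, and only exon-1 is shared between them. In a real instance the same logic would normally be expressed as SQL joins over feature and feature_relationship.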

Although the Chado schema admits any FG, certain configurations are biologically meaningless, and should not be used. The FG can be constrained by the Sequence Ontology. Standardized FG structures are required for complex applications to be interoperable; this is discussed later on.

Unlike the LG, the FG may be cyclic, although cycles in the FG are not common. The subset of the FG corresponding to certain kinds of relationship may be acyclic; for example, the subset of the FG connecting parts with wholes via part_of must be acyclic.


Canonical gene models


Regulatory regions


Sequence variants


Feature example


[Diagram showing an example that puts this all together]


  canonical-gene-model
  The "central dogma" gene model - gene makes mRNA makes polypeptide
  For many people this may be the only data they store in Chado. The
  typical protein coding gene model consists of a gene, one or more
  mRNAs, one or more exons, and at least one polypeptide.
  Alternately spliced genes have a 1 to many relation between gene and
  mRNA. Exons can be part_of more than one mRNA. No two distinct exon
  rows should have exactly the same featureloc coordinates (this would
  indicate that they are the same exon).
  Every [1]feature must have a [2]featureloc with rank=0 and locgroup=0.
  The value of the srcfeature_id column should be identical across the
  model (i.e. all features are located relative to the same feature),
  except in rare circumstances such as when a feature crosses two
  contigs. Software is not guaranteed to support this. The
  srcfeature_id can point to a [3]contig, a [4]chromosome, a
  [5]chromosome_arm or other appropriate assembly unit.

This scenario involves rows in the following tables:

  table     type_id       number   comments
  feature   SO:gene       1
  feature   SO:mRNA
  feature   exon
  feature   polypeptide
  Tool: apollo
  Status: supported
  Tool: gbrowse
  Status: supported

  noncoding-gene
  Similar to [6]canonical-gene-model, except with noncoding-RNA
  Not all genes are protein-coding. Genes can code for tRNA, miRNA,
  snoRNA, etc. A noncoding gene model is identical to a
  [7]canonical-gene-model, with the following exceptions:
    * There is no polypeptide feature
    * Instead of an mRNA feature, there is a feature that is some other
      sub-type of [8]RNA
  Tool: apollo
  Status: supported
  Tool: gbrowse
  Status: supported
  pseudogene
  A pseudogene is a non-functional relic of a gene.
  See [9]pseudogene. A pseudogene may look like an ordinary gene, and
  may even have discernible parts such as exons. It may sometimes be
  desirable to annotate the exon structure of a pseudogene - this can in
  principle be done using SO types such as [10]decayed_exon. In practice
  no-one is using Chado to do this. There are currently two practices:
    * Pseudogenes are treated analogously to [11]noncoding-genes. That
      is, there are normal "gene" and "exon" features. However, in place
      of a subtype of RNA, there is a feature of type pseudogene. This
      practice is STRONGLY DISCOURAGED (it is not compliant with the
      relations in SO, and it gives false counts of the number of real
      genes in the database). Note that this is the current default for
      FlyBase.
    * Pseudogenes are normal [12]singleton-features. There is no
      annotation of exon structure. This practice is encouraged. If at a
      later date it becomes desirable to annotate the exon structure of
      a pseudogene, it will be compatible with this.
  Tool: apollo
  Status: unclear
  Apollo by default treats pseudogenes using the first method, above. It
  may also be possible to configure it to the second, singleton, method.
  Annotating the exon structure of pseudogenes the correct way has not
  yet been attempted to our knowledge.
  singleton-feature
  Many types of features are singletons - that is, they are not related
  to other features through feature_relationships. Storage of these is
  basic and as one may expect.
  Singleton features present no major problems. Unlike genes, which
  typically have parts (with the parts having subparts), singletons do
  not form feature graphs (or rather, they form feature graphs
  consisting of single nodes). Singleton features are located relative
  to other features (usually the genome, but one can have singletons
  that are located relative to other features - this may not be
  supported by all applications).
  Tool: gbrowse
  Status: supported
  Tool: apollo
  Status: supported
  Apollo supports singletons provided they are located relative to the
  genome (singletons located relative to other features will be
  ignored). It may be necessary to configure apollo to make the feature
  type "1-level".
  dicistronic-gene
  A dicistronic gene is a gene with an mRNA that codes for two distinct
  non-overlapping CDSs.
  Dicistronic genes (see for example, the dmel Adh and Adhr genes) have
  totally distinct gene products deriving from the same transcript. To
  confuse matters, the two polypeptides are commonly referred to as being
  derived from two distinct genes (e.g. Adh and Adhr). The entire
  genomic region comprising the transcript (e.g. Adh+Adhr) that includes
  both CDSs is referred to as the [13]gene_cassette. In a database such
  as FlyBase, there are 3 gene IDs stored in the database - one for each
  of the two non-overlapping genes, and one for the gene cassette.
  Dicistronic genes make it difficult to have a formal definition of
  gene that corresponds nicely with how biologists use the term.
  There are currently two proposals for handling dicistronic genes. The
  first is a hack and introduces redundancy, but works well with
  existing software and tools. The second is preferred from a modeling
  standpoint, but introduces a lot of complexity to software.
  operon
  Bacterial genes are often transcribed in groups, e.g. LacZ.
  There are many similarities with [14]dicistronic-genes here.
  trans-spliced-gene
  A trans-spliced gene has one or more transcripts that are spliced
  together from different parts of the genome.
  A trans-spliced transcript is spliced from exons coming from different
  parts of the genome. The distance between each trans-spliced part may
  be large, or it may be in the same location on the opposite strand.
  Most C elegans genes have a trans-spliced leader sequence. This is
  different from the trans-splicing observed in dmel, where we observe
  what appear to be two transcripts on separate strands (both
  containing coding sequence) joining together in a single functional
  transcript.
  There are two proposals for dealing with this. One treats the
  trans-spliced transcript as a single transcript, with exons coming
  from different locations. The other treats the trans-spliced
  transcript as a mature transcript created from two distinct primary
  transcripts. Note that these proposals focus on the dmel example. A
  solution for the C elegans example is not proposed (not sure if we
  even need one?)
  We treat this as an ordinary gene model, but relax our rules for exon
  locations in a transcript
  For example, for the canonical Dmel trans spliced gene, we would allow
  transcripts to have exons on different strands. Note that in Chado,
  exon ordering comes from [15]feature_relationship.rank (between exon
  and transcript), NOT from the featureloc of the exon. Chado has no
  problem with this. However, some software may make assumptions that
  all exons are on the same strand, or may try to order exons by their
  location to get a transcript sequence. This software will have
  unintended consequences with trans spliced genes modeled using this
  proposal
  Tool: apollo
  Status: unclear
  apollo may accidentally scramble the order of exons. Need to check
  Tool: gbrowse
  Status: unclear
  Not sure.
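  The point about exon ordering can be sketched in Python. The function
  spliced_sequence and the toy data are hypothetical; the sketch assumes
  each exon is given as (rank, fmin, fmax, strand), where rank is the
  feature_relationship.rank between exon and transcript, and that all
  exons are located on the same source sequence.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(residues):
    """Reverse complement of an unambiguous DNA string."""
    return residues.translate(COMPLEMENT)[::-1]


def spliced_sequence(source_residues, exons):
    """Join exon sequences in feature_relationship.rank order.

    exons -- iterable of (rank, fmin, fmax, strand) tuples in interbase
             coordinates on source_residues. Ordering by rank, rather
             than by fmin, is what lets exons sit on different strands.
    """
    parts = []
    for _, fmin, fmax, strand in sorted(exons):
        residues = source_residues[fmin:fmax]
        if strand < 0:
            residues = reverse_complement(residues)
        parts.append(residues)
    return "".join(parts)


genome = "ACGTTTGGCA"
exons = [
    (1, 0, 3, -1),  # second exon of the transcript, reverse strand
    (0, 4, 7, 1),   # first exon of the transcript, forward strand
]
by_rank = spliced_sequence(genome, exons)

# Software that instead orders exons by location gets a different answer.
by_location = "".join(
    reverse_complement(genome[fmin:fmax]) if strand < 0 else genome[fmin:fmax]
    for _, fmin, fmax, strand in sorted(exons, key=lambda e: e[1])
)
```

  Ordering by rank yields "TTGCGT", whereas ordering by location yields
  "CGTTTG", which is the kind of scrambling that location-based tools
  risk with this proposal.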
  We would introduce extra transcripts, and have relations between the
  transcripts. Only the mature, spliced, transcript would have a
  relation to the polypeptide
  This may model the biology better. However, it introduces a major
  departure from the [16]canonical-gene-model. For this reason this
  proposal is unlikely to be adopted
  gene-with-regulatory-elements
  regulatory elements may be implicitly or explicitly associated with a
  gene
  transposons
  transposons can be annotated as [17]singleton-features or as complex
  annotations
  A transposon may consist of various parts such as
  [18]long_terminal_repeats and gene models coding for genes like gag,
  pol, env. These parts may have all decayed over time. Transposon
  annotation typically ignores these subtleties as all that is usually
  required is a [19]singleton-feature of type
  [20]transposable_element_feature. In this case, there is no difficulty.
  If one requires detailed transposon annotation then one is entering
  uncharted waters as far as both Chado and annotation tools are
  concerned (which is why this scenario is marked as being under
  discussion). One option would be to treat each transposon part as a
  distinct singleton, but this may be unsatisfactory as one may desire
  to have the appropriate part_of relations between the parts.
  P-element-insertions
  SNPs
  gene-with-implicit-features-manifested
  Some feature types such as introns are not normally manifested as rows
  in chado. They are normally derived on-the-fly from the gaps between
  consecutive exons. See for an example. Occasionally it may be
  desirable to store the introns as actual rows in the feature table -
  for example, in a report database.
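  Deriving introns on the fly is straightforward in interbase
  coordinates. The function introns below is a hypothetical helper; it
  assumes the exons all belong to one transcript, are given as
  (fmin, fmax) pairs relative to the same source feature, and lie on
  the same strand.

```python
def introns(exon_locs):
    """Return intron (fmin, fmax) intervals as the gaps between exons.

    In interbase coordinates an intron simply runs from the fmax of one
    exon to the fmin of the next; no off-by-one adjustment is needed.
    """
    ordered = sorted(exon_locs)
    gaps = []
    for (_, prev_fmax), (next_fmin, _) in zip(ordered, ordered[1:]):
        if next_fmin > prev_fmax:  # abutting or overlapping exons leave no gap
            gaps.append((prev_fmax, next_fmin))
    return gaps


exons = [(300, 450), (100, 200), (600, 700)]
derived = introns(exons)
```

  For the three exons above this gives the introns (200, 300) and
  (450, 600). A report database could materialize exactly these
  intervals as feature and featureloc rows.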
  feature-localization
  All features with sequence annotation should be localized using
  featureloc
  localized features must have a [21]featureloc with rank=0 and
  locgroup=0. This is the primary location of the feature. The location
  always indicates the boundaries of the feature. If the feature is
  composed of distinct subfeatures (e.g. a transcript composed of
  exons), then it is NOT permitted to use multiple featurelocs to
  indicate this. Instead, there must be rows for the subfeatures, each
  with their own featureloc
  In a feature graph (i.e. a group of features connected via
  [22]feature_relationship rows), all features will typically be
  localized relative to the same source feature (i.e. they will all have
  the same value for featureloc.srcfeature_id)
  features are typically localized to some kind of genomic or assembly
  feature, but chado does not constrain you to using only this. For
  example, localizing features relative to a transcript or polypeptide
  or even exon is permitted, but unusual practices will most likely not
  be recognized by most software
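The rule that every localized feature needs a primary featureloc (rank=0, locgroup=0) could be checked along the following lines. This is only a sketch: it uses an in-memory SQLite database with cut-down stand-ins for the feature and featureloc tables, and the rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE feature (feature_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE featureloc (
    feature_id INTEGER, srcfeature_id INTEGER,
    fmin INTEGER, fmax INTEGER, strand INTEGER,
    rank INTEGER, locgroup INTEGER);
INSERT INTO feature VALUES (1, 'arm_2L'), (2, 'gene_a'), (3, 'gene_b');
-- gene_a has a primary location; gene_b only has a locgroup=1 location.
INSERT INTO featureloc VALUES (2, 1, 100, 900, 1, 0, 0);
INSERT INTO featureloc VALUES (3, 1, 2000, 2500, 1, 0, 1);
""")

# Features that have at least one featureloc but no rank=0/locgroup=0 row.
MISSING_PRIMARY = """
SELECT DISTINCT f.name
FROM feature AS f
JOIN featureloc AS anyloc ON anyloc.feature_id = f.feature_id
WHERE NOT EXISTS (
    SELECT 1 FROM featureloc AS p
    WHERE p.feature_id = f.feature_id AND p.rank = 0 AND p.locgroup = 0)
ORDER BY f.name
"""
missing = [row[0] for row in conn.execute(MISSING_PRIMARY)]
```

Unlocalized features such as arm_2L are deliberately not reported, since having no featureloc at all is a valid state.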
  feature-localization-to-contigs-in-assembly
  In an assembled genome, it is common to locate relative to the
  top-level assembly units (e.g. chromosomes). However, it is also
  permissible to locate to smaller units such as [23]contigs or
  [24]golden_path_units
  If a genome assembly is not stable, it is common to locate relative to
  assembly units such as [25]contigs. These contigs may then be
  localized relative to the top-level assembly units. This is known in
  chado terms as a location graph.
  We discuss here location graphs of depth 2. See also
  [26]n-level-assemblies. This scenario is often invisible to software
  interoperating with Chado. The software is free to only look at the
  main features and the contig-level feature and ignore the top-level
  assembly feature. It may sometimes be desirable to have software that
  can perform location transformations, mapping features from contigs to
  top-level units and back
  Tool: apollo
  Status: unclear
  apollo should be happy to treat contigs just as if they were top-level
  units such as chromosome arms. However, the user may have to explicitly
  provide contigs if location queries are desired. For example, apollo
  may retrieve nothing if the user asks for a certain range on
  chromosome 4, and the features are located relative to contigs which
  are themselves on chromosome 4.
  Tool: gbrowse
  Status: unclear
  Gbrowse may expect features to be located relative to top-level units
  such as chromosomes.
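A location transformation from a contig to the top-level unit might look like the sketch below. It assumes interbase coordinates throughout and that the contig's own featureloc spans the whole contig; Loc and project_up are hypothetical names, not part of any Chado tool.

```python
from typing import NamedTuple


class Loc(NamedTuple):
    src: str     # name of the source feature
    fmin: int    # interbase start
    fmax: int    # interbase end
    strand: int  # 1, -1 or 0


def project_up(feature_loc: Loc, contig_loc: Loc) -> Loc:
    """Map a location on a contig onto the unit the contig is located on."""
    if contig_loc.strand < 0:
        # Contig is reversed on the parent: flip coordinates and strand.
        fmin = contig_loc.fmax - feature_loc.fmax
        fmax = contig_loc.fmax - feature_loc.fmin
        strand = -feature_loc.strand
    else:
        fmin = contig_loc.fmin + feature_loc.fmin
        fmax = contig_loc.fmin + feature_loc.fmax
        strand = feature_loc.strand
    return Loc(contig_loc.src, fmin, fmax, strand)


gene_on_contig = Loc("contig1", 10, 60, -1)
forward = project_up(gene_on_contig, Loc("chr4", 1000, 1500, 1))
flipped = project_up(Loc("contig1", 10, 60, 1), Loc("chr4", 1000, 1500, -1))
```

With something like this in place, a range query on chromosome 4 can be answered even when the features themselves are only located relative to contigs.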
  redundant-localizations-to-different-assembly-levels
  Features can be located relative to both contigs and top-level
  assembly units
  Chado allows redundant feature localization using
  [27]featureloc.locgroup>0. This allows a database to have primary
  locations for features relative to contigs, and secondary locations
  relative to top-level units such as chromosomes. The converse is also
  allowed.
  This scenario is discouraged unless the chado db admin knows what they
  are doing. They must implement solutions to ensure that featurelocs
  with varying locgroup do not get out of sync. These solutions are not
  part of the standard Chado software suite. Nevertheless, this scenario
  may be useful for advanced users in certain circumstances
  Tool: gbrowse
  Status: unclear
  Not clear if gbrowse uses locgroup in querying. If it constrains by
  locgroup, then this is essentially the same as
  [28]feature-localization-to-contigs-in-assembly
  Tool: apollo
  Status: partial
  Not clear if apollo uses locgroup in querying. If it constrains by
  locgroup, then this is essentially the same as
  [29]feature-localization-to-contigs-in-assembly. Apollo will not
  preserve redundant featurelocs when writing back to db. This could
  lead to db getting out of sync.
  n-level-assemblies
  In theory it is possible (but rare) to have assemblies with variable
  depths, or with depths>2
  This scenario is rare. If required, then Chado can deal with this -
  there is no theoretical limit to the depth of a location graph. One
  can have annotated features located relative to minicontigs which are
  located relative to supercontigs which are located relative to
  chromosomes. Most software that interoperates with Chado will not be
  able to deal with this, so this scenario is discouraged except by
  advanced users who have no other option
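Since most software assumes a location graph of depth 1 or 2, it can be useful to find out how deep a given database actually goes. The sketch below walks primary srcfeature links held in a plain dictionary; location_depth and the feature names are invented for illustration.

```python
def location_depth(feature, src_of):
    """Count featureloc hops from a feature up to its top-level unit."""
    depth = 0
    seen = {feature}
    while feature in src_of:
        feature = src_of[feature]
        if feature in seen:
            raise ValueError("cycle in location graph")
        seen.add(feature)
        depth += 1
    return depth


# Maps each feature to the srcfeature of its primary featureloc.
src_of = {
    "exon1": "minicontig7",
    "minicontig7": "supercontig2",
    "supercontig2": "chr4",
    "gene_x": "chr4",
}
# Features that most Chado-aware software is unlikely to handle.
deep = sorted(f for f in src_of if location_depth(f, src_of) > 2)
```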
  unlocalized-gene
  A gene without sequence based localization
  Many chado instances are purely concerned with genome annotation - in
  these cases it would be strange to have genes or other features such
  as transcripts with no localization (i.e. no featurelocs). However,
  this scenario is actually common when Chado is used in a wider
  context. We may know of the existence of genes through non-sequence
  evidence such as genetics. When we have no sequence-based localization
  it is perfectly valid to have gene features with no featurelocs. When
  the time comes to create genome annotations for these, we just 'fill
  out' the gene feature by adding transcript and exon features.
  Tool: gbrowse
  Status: supported
  Gbrowse supports this scenario in that unlocalized features will be
  ignored from the genome viewer, which is appropriate
  Tool: apollo
  Status: supported
  Apollo supports this scenario in that unlocalized features will be
  ignored, which is appropriate behaviour for a genome annotation tool

References

  1. http://gmod.sourceforge.net/schema/doc/default_schema.html#feature
  2. http://gmod.sourceforge.net/schema/doc/default_schema.html#featureloc
  3. http://song.sourceforge.net/#contig
  4. http://song.sourceforge.net/#chromosome
  5. http://song.sourceforge.net/#chromosome_arm
  6. file://localhost/Users/bosborne/schema/chado/modules/sequence/doc/sequence-best-practices.html#canonical-gene-model
  7. file://localhost/Users/bosborne/schema/chado/modules/sequence/doc/sequence-best-practices.html#canonical-gene-model
  8. http://song.sourceforge.net/#RNA
  9. http://song.sourceforge.net/#pseudogene
 10. http://song.sourceforge.net/#decayed_exon
 11. file://localhost/Users/bosborne/schema/chado/modules/sequence/doc/sequence-best-practices.html#noncoding-gene
 12. file://localhost/Users/bosborne/schema/chado/modules/sequence/doc/sequence-best-practices.html#singleton-feature
 13. http://song.sourceforge.net/#gene_cassette
 14. file://localhost/Users/bosborne/schema/chado/modules/sequence/doc/sequence-best-practices.html#dicistronic-gene
 15. http://gmod.sourceforge.net/schema/doc/default_schema.html#feature_relationship
 16. file://localhost/Users/bosborne/schema/chado/modules/sequence/doc/sequence-best-practices.html#canonical-gene-model
 17. file://localhost/Users/bosborne/schema/chado/modules/sequence/doc/sequence-best-practices.html#singleton-feature
 18. http://song.sourceforge.net/#long_terminal_repeat
 19. file://localhost/Users/bosborne/schema/chado/modules/sequence/doc/sequence-best-practices.html#singleton-feature
 20. http://song.sourceforge.net/#transposable_element_feature
 21. http://gmod.sourceforge.net/schema/doc/default_schema.html#featureloc
 22. http://gmod.sourceforge.net/schema/doc/default_schema.html#feature_relationship
 23. http://song.sourceforge.net/#contig
 24. http://song.sourceforge.net/#golden_path_unit
 25. http://song.sourceforge.net/#contig
 26. file://localhost/Users/bosborne/schema/chado/modules/sequence/doc/sequence-best-practices.html#n-level-assemblies
 27. http://gmod.sourceforge.net/schema/doc/default_schema.html#featureloc
 28. file://localhost/Users/bosborne/schema/chado/modules/sequence/doc/sequence-best-practices.html#feature-localization-to-contigs-in-assembly
 29. file://localhost/Users/bosborne/schema/chado/modules/sequence/doc/sequence-best-practices.html#feature-localization-to-contigs-in-assembly


4.2 Best Practices


Chado is a generic schema, which means anyone writing software to query or write to chado (either middleware or applications) should be aware of the different ways in which data can be stored. We want to strike a nice balance between flexibility and extensibility on the one hand, and strong typing and rigor on the other. We want to avoid the situation we have with GenBank entries where there are a dozen ways of representing a gene model, but we need to be able to cope with the constant surprises biology throws at us in an attempt to confound our nice computable models.

Chado uses a layered model - this is tried and tested in software engineering. Some generic software can be targeted at the lower layers and be guaranteed to work no matter what. Other more specific software needs a more tightly defined rigorous model and should be targeted at the upper layers.

We require validation software and more formal/computable descriptions of these layers and policies - for now natural language descriptions will have to suffice.


4.2.1 Chado Compliance Layers


Layer 0: Relational Schema


Level 0 conformance basically means the schema is adhered to. Obviously, this is enforced by the DBMS.


Layer 1: Ontologies


Level 1 conformance is minimal conformance to SO - all feature.types must be SO terms, and all feature_relationship.types must be SO relationship types.


Layer 2: Graph


Level 2 conformance is graph conformance to SO - every feature_relationship between a feature of type X and a feature of type Y must correspond to a relationship of that type between X and Y in SO; for example, mRNA can be part of gene, but mRNA cannot be part of golden path region. [more detailed/formal explanation to come]. In practice Level 2 conformance may be undesirable, and we may need to make modifications to SO.
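Until validation software exists, the difference between levels 1 and 2 can be sketched as follows. The term set, relation set and allowed triples are a tiny hand-picked subset chosen for illustration, not the real content of SO, and conformance_level is a hypothetical function.

```python
# Toy stand-ins for the ontology; a real check would load these from SO.
SO_TERMS = {"gene", "mRNA", "exon", "polypeptide", "golden_path_region"}
SO_RELATIONS = {"part_of", "derived_from"}
ALLOWED_TRIPLES = {
    ("mRNA", "part_of", "gene"),
    ("exon", "part_of", "mRNA"),
    ("polypeptide", "derived_from", "mRNA"),
}


def conformance_level(feature_types, relationships):
    """Return the highest layer (0, 1 or 2) that the data conforms to.

    relationships is a list of (subject type, relation, object type) tuples.
    """
    types_ok = all(t in SO_TERMS for t in feature_types)
    rels_ok = all(
        rel in SO_RELATIONS and subj in SO_TERMS and obj in SO_TERMS
        for subj, rel, obj in relationships
    )
    if not (types_ok and rels_ok):
        return 0  # only the relational schema is guaranteed
    if all(triple in ALLOWED_TRIPLES for triple in relationships):
        return 2  # every edge is also sanctioned by the ontology graph
    return 1      # all terms are from the ontology, but some edge is not


level_ok = conformance_level(["gene", "mRNA"], [("mRNA", "part_of", "gene")])
level_bad_edge = conformance_level(
    ["mRNA", "golden_path_region"],
    [("mRNA", "part_of", "golden_path_region")],
)
level_bad_term = conformance_level(["gene", "protein"], [])
```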

Orthogonal to these layers are various additional policy decisions. Some of these are more tolerant of non-conformance than others. (There is also some overlap with levels 1/2.)


4.2.2 Examples: Current implementations


I have listed how FB implements each policy choice - other chado instances feel free to add....

TIGR: Currently at level 0 conformance, though most (if not all) of the terms being used have an obvious counterpart in SO. Therefore these "TIGR Ontology" terms are used in the answers to the SO-related questions that appear below. We plan on updating our terms with SO terms very soon.

SO terms used for standard central-dogma gene model

FB: gene mRNA exon protein [other types are derivable]

TIGR: gene transcript CDS exon protein [though the strict answer to any of these SO questions is "none" since we do not yet meet level 1 conformance]

NOTE: we should be using 'polypeptide' instead of 'protein'. For now, software should be tolerant of both these uses.

SO terms used for storing alignments

FB: match

TIGR: match

NOTE: we want to use the new more specific SO types match_set and match_part, for hits and HSPs respectively. For now, software should be tolerant of either usage.

TIGR: We've also extended the model for storing pairwise alignments to store multiple alignments. Each member of the alignment is featureloced to the 'match' feature. We've used this representation to store paralogous/orthologous gene families.




feature_relationship.types


FB: partof (for mRNA to gene and exon to mRNA); producedby (for protein to mRNA)

TIGR: part of (gene-assembly, exon-transcript, assembly-supercontig); produced by (protein-CDS, CDS-transcript, transcript-gene)

NOTE: this should be "part_of" and "derived_from" to conform to SO. Most read-only software should be able to safely ignore feature_relationship.type anyway. Protein should be polypeptide - see note above

NOTE: the main difference between FB and TIGR here is that TIGR introduces an intermediate CDS feature between mRNA and protein


featureloc policy


FB: all constituent parts of a central dogma gene model are located relative to the same srcfeature (the chromosome arm). No redundant locations (ie featureloc.locgroup > 0) are used

TIGR: Redundant locations are used and indicated with featureloc.locgroup > 0.


NOTE: we want to allow some flexibility with this policy. I believe that the rule that linked constituent parts are located relative to the same feature should always be followed. This can be stated more formally as:


 IF  X is linked to Y via feature_relationship
 AND X is located relative to Z via featureloc.srcfeature_id
 THEN Y must also be located relative to Z via featureloc.srcfeature_id
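The rule above lends itself to a simple query. The sketch below runs one against an in-memory SQLite database holding cut-down stand-ins for feature_relationship and featureloc; the IDs and rows are invented, and a real check would also want to take rank and locgroup into account.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE feature_relationship (subject_id INTEGER, object_id INTEGER);
CREATE TABLE featureloc (feature_id INTEGER, srcfeature_id INTEGER);
-- 10 = mRNA, 11 = gene, 12 = exon; 1 = chromosome arm, 2 = contig
INSERT INTO feature_relationship VALUES (10, 11), (12, 10);
INSERT INTO featureloc VALUES (10, 1), (11, 1), (12, 2);
""")

# X (subject) is located on Z, but Y (object) has no location on that same Z.
VIOLATIONS = """
SELECT fr.subject_id, fr.object_id, xloc.srcfeature_id
FROM feature_relationship AS fr
JOIN featureloc AS xloc ON xloc.feature_id = fr.subject_id
WHERE NOT EXISTS (
    SELECT 1 FROM featureloc AS yloc
    WHERE yloc.feature_id = fr.object_id
      AND yloc.srcfeature_id = xloc.srcfeature_id)
ORDER BY fr.subject_id
"""
violations = conn.execute(VIOLATIONS).fetchall()
```

Here the exon (12) is located on the contig while its mRNA (10) is only located on the arm, so that one relationship is reported.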


TIGR: We've followed this policy in adding a featureloc between the protein and genomic contig in our databases (such a featureloc does not appear in the Chado usage documents). This additional featureloc simplifies many queries, especially when looking at the genomic context of 'match' features associated with proteins.

We should also expect that the fmin/fmax boundaries of a feature be defined by the outermost boundaries of the outermost constituent part features (this rule may require refinement when we have promoters, enhancers and so on - but for now we don't).
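As a small illustration of the boundary rule, the expected fmin/fmax of a parent feature can be computed from its parts and compared with what is stored. The function outer_bounds and the coordinates are made up for this sketch, and all parts are assumed to share one srcfeature.

```python
def outer_bounds(part_locs):
    """Return the (fmin, fmax) spanning every constituent part location."""
    if not part_locs:
        raise ValueError("a feature with no parts has no derived bounds")
    fmin = min(start for start, _ in part_locs)
    fmax = max(end for _, end in part_locs)
    return (fmin, fmax)


exon_locs = [(300, 450), (100, 200), (600, 700)]
stored_gene_loc = (100, 700)
gene_bounds = outer_bounds(exon_locs)
is_consistent = gene_bounds == stored_gene_loc
```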

As to what the srcfeature should be, it could be a contig, an assembly or a top-level locatable feature such as chromosome or chromosome arm. Software should be tolerant of different choices here. Whilst it is generally best to locate relative to the topmost feature (ie the arm/chromosome), sometimes this is not possible or desirable (eg low coverage, heterochromatin).


non-central dogma gene models


FB: we store a lot of non-central dogma gene models; noncoding gene models and pseudogenes [need to fill in more details here]

TIGR: not many of these stored yet, save for a few pseudogenes and the occasional non-coding ORF


other features


FB: the FlyBase implementation includes many other feature types, including polyA site and sequence variant [need to fill in details]

TIGR: using 'SNP' in some databases


derivable feature types


FB: derivable features (introns, UTRs, intergenic region) are not included. Feature typing is always done to the most specific, non-derivable level. For example, we never use types "5 prime exon", "dicistronic gene", "coding exon" as these are always inferrable. We always use type "gene" - the specific type of gene is inferred from the child type (mRNA, tRNA, snRNA, etc).

TIGR: derivable features are not included. Currently not storing any tRNAs or snRNAs.


NOTE: whilst it is perfectly permissible to include redundant derivable features (useful for warehouse-style querying), you should not write software that expects to find these if you want the software to work on different chado db instances.


sequence variants


FB: these are included in chado, but they are lacking full detail

TIGR: only SNPs so far. The SNPs currently being stored are computed from pairwise alignments of sequences already loaded into Chado, so each SNP feature is featureloc'ed to the appropriate place on each of the two sequences (rather than having one of the featurelocs "dangling", as indicated in some of the Chado usage documents). featureloc.residue_info is used to redundantly store the base referenced in each of the two sequences.

NOTE: variation features should specify the edit that makes one feature (such as the reference/wild-type) from another (the variant/mutant/non-reference). There were perhaps 2 proposals for this [more details required...]


Chado usage scenarios version:

Index

 scenario | status | description
 canonical-gene-model | final | The "central dogma" gene model - gene makes mRNA makes polypeptide
 noncoding-gene | final | Similar to canonical-gene-model, except with noncoding-RNA
 pseudogene | discussion | A pseudogene is a non-functional relic of a gene
 singleton-feature | discussion | Many types of features are singletons - that is they are not related to other features through feature_relationships. Storage of these is basic and as one may expect
 dicistronic-gene | discussion | A dicistronic gene is a gene with a mRNA that codes for two distinct non-overlapping CDSs
 operon | discussion | Bacterial genes are often transcribed in groups; eg LacZ
 trans-spliced-gene | discussion | A trans-spliced gene has one or more transcripts in which that transcript may be spliced together from different parts of the genome
 gene-with-regulatory-elements | discussion | regulatory elements may be implicitly or explicitly associated with a gene
 transposons | discussion | transposons can be annotated as singleton-features or as complex annotations
 P-element-insertions | final |
 SNPs | final |
 gene-with-implicit-features-manifested | discussion | Some feature types such as introns are not normally manifested as rows in chado. They are normally derived on-the-fly from the gaps between consecutive exons. See for an example. Occasionally it may be desirable to store the introns as actual rows in the feature table - for example, in a report database
 feature-localization | final | All features with sequence annotation should be localized using featureloc
 feature-localization-to-contigs-in-assembly | final | In an assembled genome, it is common to locate relative to the top-level assembly units (e.g. chromosomes). However, it is also permissible to locate to smaller units such as contigs or golden_path_units
 redundant-localizations-to-different-assembly-levels | final | Features can be located relative to both contigs and top-level assembly units
 n-level-assemblies | final | In theory it is possible (but rare) to have assemblies with variable depths, or with depths>2
 unlocalized-gene | final | A gene without sequence based localization

Abstract

This page contains a selection of Chado best-practices for different usage scenarios. It is designed to complement the Chado SQL DDL (you should familiarize yourself with this first) and the Sequence Ontology. This document status is ALPHA - in progress

Scenarios

canonical-gene-model

The "central dogma" gene model - gene makes mRNA makes polypeptide

For many people this may be the only data they store in Chado. The typical protein coding gene model consists of a gene, one or more mRNAs, one or more exons, and at least one polypeptide.

Alternately spliced genes have a 1 to many relation between gene and mRNA. Exons can be part_of more than one mRNA. No two distinct exon rows should have exactly the same featureloc coordinates (this would indicate they are the same exon).

Every feature must have a featureloc with rank=0 and locgroup=0. The value of the srcfeature_id column should be identical (i.e. all features are located relative to the same feature), except in rare circumstances such as when a feature crosses two contigs. Software is not guaranteed to support this. The srcfeature_id can point to a contig, a chromosome/chromosome_arm or other appropriate assembly unit.

This scenario involves rows in the following tables:

 table | type_id | number | comments
 feature | SO:gene | 1 | The gene must always be provided
 feature | SO:mRNA | 1..n | One or more transcripts are required, and these are always of type mRNA for protein-coding genes.
 feature_relationship | OBO_REL:part_of | SO:mRNA[1..n]---->[1]SO:gene | Transcripts are always linked to genes by a part_of relation. (Note that SO uses member_of here). One gene can have many transcripts (multiple splicing). A transcript must always belong to exactly one gene (for an exception, see ).
 feature | SO:exon | 1..n | Exons are always required, even if the genome under consideration has no introns
 feature_relationship | OBO_REL:part_of | SO:exon[1..n]---->[1..n]SO:mRNA | Exons are always linked to their container transcript (in this case, an mRNA) via the part_of relation. If a transcript is alternately spliced, then an exon can be part_of multiple transcripts
 feature | SO:polypeptide | 1..n | A protein-coding gene always produces a polypeptide, by definition. The polypeptide is located relative to the same genomic feature as the exons, mRNAs and gene. A single featureloc is used, with fmin and fmax indicating the start and stop codon positions (location is inclusive of stop codon). The polypeptide sequence should be specified as an amino acid sequence.
 feature_relationship | OBO_REL:derived_from | SO:polypeptide[1]---->[1..n]SO:mRNA | The polypeptide is always derived_from the mRNA. If two alternate spliceforms produce the same polypeptide (i.e. their sequence is the same) then the same polypeptide feature should be used. An mRNA can only derive one polypeptide. For exceptions, see dicistronic-gene
 featureloc | | 1..n | Every feature above must have a featureloc

Tool: apollo
Status: supported

Tool: gbrowse
Status: supported

Example

A Drosophila gene with 5 exons and a single spliceform

Download: [game] [chado] [chaos]

noncoding-gene

Similar to canonical-gene-model, except with noncoding-RNA

Not all genes are protein-coding. Genes can code for tRNA, miRNA, snoRNA, etc. A noncoding gene model is identical to a canonical-gene-model, with the following exceptions:

There is no polypeptide feature

Instead of an mRNA feature, there is a feature that is some other sub-type of RNA

This scenario involves rows in the following tables:

 table | type_id | number | comments
 feature | SO:gene | 1 | The gene must always be provided
 feature | SO:RNA | 1..n | Type can be SO:RNA or any subtype of this type
 feature_relationship | OBO_REL:part_of | SO:RNA[1..n]---->[1]SO:gene | noncoding transcripts can also be alternately spliced
 feature | SO:exon | 1..n | Exons are always required, even if the genome under consideration has no introns.
 feature_relationship | OBO_REL:part_of | SO:exon[1..n]---->[1..n]SO:RNA | Exons are always linked to their container transcript (in this case, a non-mRNA subtype of SO:RNA) via the part_of relation. If a transcript is alternately spliced, then an exon can be part_of multiple transcripts
 featureloc | | 1..n | Every feature above must have a featureloc

Tool: apollo
Status: supported

Tool: gbrowse
Status: supported

pseudogene

A pseudogene is a non-functional relic of a gene

See pseudogene. A pseudogene may look like an ordinary gene, and may even have discernible parts such as exons. It may sometimes be desirable to annotate the exon structure of a pseudogene - this can in principle be done using SO types such as decayed_exon. In practice no-one is using Chado to do this. There are currently two practices:

Pseudogenes are treated analogously to noncoding-genes. That is, there are normal "gene" and "exon" features. However, in place of a subtype of RNA, there is a feature of type pseudogene. This practice is STRONGLY DISCOURAGED (it is not compliant with the relations in SO, and it gives false counts of the number of real genes in the database). Note that this is the current default for FlyBase.

Pseudogenes are normal singleton-features. There is no annotation of exon structure. This practice is encouraged. If at a later date it becomes desirable to annotate the exon structure of a pseudogene, it will be compatible with this.

Tool: apollo
Status: unclear
Apollo by default treats pseudogenes using the first method, above. It may also be possible to configure it to the second, singleton, method. Annotating the exon structure of pseudogenes the correct way has not yet been attempted to our knowledge.

singleton-feature

Many types of features are singletons - that is they are not related to other features through feature_relationships. Storage of these is basic and as one may expect

Singleton features present no major problems. Unlike genes, which typically have parts (with the parts having subparts), singletons do not form feature graphs (or rather, they form feature graphs consisting of single nodes). Singleton features are located relative to other features (usually the genome, but one can have singletons that are located relative to other features - this may not be supported by all applications)

Tool: gbrowse
Status: supported

Tool: apollo
Status: supported
Apollo supports singletons provided they are located relative to the genome (singletons located relative to other features will be ignored). It may be necessary to configure apollo to make the feature type "1-level"

dicistronic-gene

A dicistronic gene is a gene with a mRNA that codes for two distinct non-overlapping CDSs

Dicistronic genes (see for example, the dmel Adh and Adhr genes) have totally distinct gene products deriving from the same transcript. To confuse matters, the two polypeptides are commonly referred to as being derived from two distinct genes (e.g. Adh and Adhr). The entire genomic region comprising the transcript (e.g. Adh+Adhr) that includes both CDSs is referred to as the gene_cassette. In a database such as FlyBase, there are 3 gene IDs stored in the database - one for each of the two non-overlapping genes, and one for the gene cassette

Dicistronic genes make it difficult to have a formal definition of gene that corresponds nicely with how biologists use the term.

There are currently two proposals for handling dicistronic genes. The first is a hack and introduces redundancy, but works well with existing software and tools. The second is preferred from a modeling standpoint, but introduces a lot of complexity to software

operon

Bacterial genes are often transcribed in groups; eg LacZ

There are many similarities with dicistronic-genes here.

trans-spliced-gene

A trans-spliced gene has one or more transcripts in which that transcript may be spliced together from different parts of the genome

A trans spliced transcript is spliced from exons coming from different parts of the genome. The distance between each trans spliced part may be large, or it may be in the same location on the opposite strand.

Most C elegans genes have a trans spliced leader sequence. This is different from the trans splicing involved in dmel , where we observe what appears to be two transcripts on separate strands (both containing coding sequence) joining together in a single functional transcript

There are two proposals for dealing with this. One treats the trans spliced transcript as a single transcript, with exons coming from different locations. The other treats the trans spliced transcript as a mature transcript created from two distinct primary transcripts. Note that these proposals focus on the dmel example. A solution for the C elegans example is not proposed (not sure if we even need one?)

We treat this as an ordinary gene model, but relax our rules for exon locations in a transcript

For example, for the canonical Dmel trans spliced gene, we would allow transcripts to have exons on different strands. Note that in Chado, exon ordering comes from feature_relationship.rank (between exon and transcript), NOT from the featureloc of the exon. Chado has no problem with this. However, some software may make assumptions that all exons are on the same strand, or may try to order exons by their location to get a transcript sequence. This software will have unintended consequences with trans spliced genes modeled using this proposal

Tool: apollo
Status: unclear
apollo may accidentally scramble the order of exons. Need to check

Tool: gbrowse
Status: unclear
Not sure.

We would introduce extra transcripts, and have relations between the transcripts. Only the mature, spliced, transcript would have a relation to the polypeptide

This may model the biology better. However, it introduces a major departure from the canonical-gene-model. For this reason this proposal is unlikely to be adopted

gene-with-regulatory-elements

regulatory elements may be implicitly or explicitly associated with a gene

transposons

transposons can be annotated as singleton-features or as complex annotations

A transposon may consist of various parts such as long_terminal_repeats and gene models coding for genes like gag, pol, env. These parts may have all decayed over time. Transposon annotation typically ignores these subtleties as all that is usually required is a singleton-feature of type transposable_element_feature. In this case, there is no difficulty

If one requires detailed transposon annotation then one is entering uncharted water as far as both Chado and annotation tools are concerned (which is why this scenario is marked as being under discussion). One option would be to treat each transposon part as distinct singletons, but this may be unsatisfactory as one may desire to have the appropriate part_of relations between the parts.

P-element-insertions

SNPs


This outlines one way of modeling SNPs in chado. It also illustrates use of the featureloc table.

Most of this applies to other variation features, but I'll illustrate using SNPs for now to keep it simple.

A SNP is represented as a single feature in chado.

Let's take a basic example - a SNP that flips an A to a G on the genome.

Here we would have one feature and two featurelocs.

(feature
 (name "SNP_01")
 (featureloc
   (srcfeature "Chromosome_arm_2L") ;;; dna feature identifier
   (nbeg 1000000)
   (nend 1000001)
   (strand 1)
   (residue_info "A")
   (rank 0)
   (locgroup 0))
 (featureloc
   (residue_info "G")
   (rank 1)
   (locgroup 0)))

The first location is on the chromosome arm (presumably wildtype). The second location has no srcfeature (ie it is set to null). However, it is effectively paired with the first location. If we later wished to instantiate the mutant chromosome arm feature, we would fill in the second featureloc's srcfeature.

Let's take another example - a SNP that has only been characterised at the protein level. This SNP flips an I to a V

(feature
 (name "SNP_02")
 (featureloc
   (srcfeature "dpp-P1")    ;;; protein feature identifier
   (nbeg 23)
   (nend 24)
   (strand 1)
   (residue_info "I")
   (rank 0)
   (locgroup 0))
 (featureloc
   (residue_info "V")
   (rank 1)
   (locgroup 0)))

Again, the second featureloc has no srcfeature. The mutant protein is implicit. The mutant protein sequence can be inferred by taking the sequence of "dpp-P1" and substituting the 24th residue with a V.
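To make the interbase arithmetic concrete, here is a small sketch of that substitution. The function apply_variant is hypothetical, and the wild-type sequence is an invented 30-residue string, not the real dpp-P1 sequence.

```python
def apply_variant(wild_residues, nbeg, nend, mutant_residue_info):
    """Replace the interbase span nbeg..nend with the mutant residues."""
    # Interbase coordinates map directly onto Python slices: nbeg=23, nend=24
    # selects the single residue at 0-based index 23, i.e. the 24th residue.
    return wild_residues[:nbeg] + mutant_residue_info + wild_residues[nend:]


# Invented wild-type protein with an I as its 24th residue.
wild = "M" + "A" * 22 + "I" + "K" * 6
mutant = apply_variant(wild, 23, 24, "V")
```

The same function covers insertions (nbeg equal to nend) and deletions (empty mutant residues), which is part of the appeal of interbase coordinates.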

To do a query for all SNPs that switch I to V or vice versa:

SELECT snp.* FROM
 featureloc AS wildloc,
 featureloc AS mutloc,
 feature AS snp,
 cvterm AS ftype
WHERE
 snp.type_id = ftype.cvterm_id        AND
 ftype.termname = 'snp'               AND
 wildloc.feature_id = snp.feature_id  AND
 mutloc.feature_id = snp.feature_id   AND
 wildloc.locgroup = mutloc.locgroup   AND
 -- rank is not constrained, so both I to V and V to I are matched
 wildloc.residue_info = 'I'           AND
 mutloc.residue_info = 'V';


Note that this query remains the same even if mutant protein features are instantiated as opposed to left implicit.


Let's look at a more complex example. Suppose we have a SNP that has been localised to the genome, the SNP has an effect on a protein (Isoleucine to Threonine), and we want to redundantly store the SNP effect on the genome, transcript and translation.

[note that in this example, the transcript is on the reverse strand, so the residue is reverse complemented]

(feature
 (name "SNP_03")
 ;; position on genome
 (featureloc
   (srcfeature "chrom_arm_3R")
   (nbeg 2000000)
   (nend 2000001)
   (strand 1)
   (residue_info "A")
   (rank 0)                       ;; wild
   (locgroup 0))
 (featureloc
   (residue_info "G")
   (rank 1)                       ;; mutant
   (locgroup 0))
 ;; position on transcript
 (featureloc
   (srcfeature "blah-transcript001")     ;; processed transcript ID
   (nbeg 1000)
   (nend 1001)
   (strand 1)
   (residue_info "T")
   (rank 0)                       ;; wild
   (locgroup 1))
 (featureloc
   (residue_info "C")
   (rank 1)                       ;; mutant
   (locgroup 1))
 ;; position on protein
 (featureloc
   (srcfeature "blah-protein001")    ;;; protein feature identifier
   (nbeg 23)
   (nend 24)
   (strand 1)
   (residue_info "I")
   (rank 0)                       ;; wild
   (locgroup 2))
 (featureloc
   (residue_info "T")
   (rank 1)                       ;; mutant
   (locgroup 2)))

Here we have 6 locations for one SNP. The 6 locations can be imagined to be in a 2D matrix. The purpose of rank and locgroup is to specify the row and column in the matrix, respectively.

        | genome    transcript   protein
 -------+-------------------------------
 wild   | A         T            I
        |
 mutant | G         C            T

rank is used to group the strain and locgroup is used for the grouping within that strain. rank=0 should be used for the wildtype, but this is not always possible; locgroup=0 should be used for the primary (as opposed to derived) location, but this is not always possible either. The important thing is consistency within a SNP to preserve the matrix.
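The matrix view can be rebuilt mechanically from the featureloc rows of a single SNP. The sketch below does this for the residue_info values of SNP_03 above; the function location_matrix and the dictionary layout of the rows are assumptions made for illustration.

```python
def location_matrix(featurelocs):
    """Arrange residue_info values by rank (rows) and locgroup (columns)."""
    ranks = sorted({loc["rank"] for loc in featurelocs})
    locgroups = sorted({loc["locgroup"] for loc in featurelocs})
    cell = {
        (loc["rank"], loc["locgroup"]): loc["residue_info"]
        for loc in featurelocs
    }
    if len(cell) != len(featurelocs):
        raise ValueError("two featurelocs share the same (rank, locgroup)")
    # Missing cells come back as None, which flags an inconsistent matrix.
    return [[cell.get((r, g)) for g in locgroups] for r in ranks]


snp_03 = [
    {"rank": 0, "locgroup": 0, "residue_info": "A"},  # wild, genome
    {"rank": 1, "locgroup": 0, "residue_info": "G"},  # mutant, genome
    {"rank": 0, "locgroup": 1, "residue_info": "T"},  # wild, transcript
    {"rank": 1, "locgroup": 1, "residue_info": "C"},  # mutant, transcript
    {"rank": 0, "locgroup": 2, "residue_info": "I"},  # wild, protein
    {"rank": 1, "locgroup": 2, "residue_info": "T"},  # mutant, protein
]
matrix = location_matrix(snp_03)
```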

One can imagine rare (but entirely possible) cases whereby a single SNP causes different protein level changes in two proteins (for instance, HIV carries a doubly encoded gene - ie the ORFs overlap but have different frames).

Here we would want to add another locgroup, for the second protein:

        | genome  transcript  protein1  protein2
 -------+---------------------------------------
 wild   | A       T           I         Y
 mutant | G       C           T         H

Again, we don't need to instantiate the two mutant proteins, since their sequences can be reconstructed from the wild proteins plus the corresponding mutation.

[Remember that chado is interbase, and that the PostgreSQL substr function counts from 1.]

The following query dynamically constructs mutant feature residues based on the wildtype feature and the mutant residue changes. This should work for a variety of variation features, not just SNPs. Note that we need to use locgroup to properly group wild/mutant pairs of locations; otherwise this query will give bad data.

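SELECT

 snp.name,
 wildfeat.name,
 -- fmin/fmax are the featureloc boundary columns
 -- (written nbeg/nend in the example above)
 -- residues to the left of the site
 substr(wildfeat.residues,
        1,
        wildloc.fmin) ||
 mutloc.residue_info  ||
 -- residues to the right of the site
 substr(wildfeat.residues,
        wildloc.fmax+1)

FROM

 featureloc AS wildloc,
 feature AS wildfeat,
 featureloc AS mutloc,
 feature AS snp,
 cvterm AS ftype

WHERE

 snp.type_id = ftype.cvterm_id              AND
 ftype.name = 'SNP'                         AND
 wildloc.feature_id = snp.feature_id        AND
 mutloc.feature_id = snp.feature_id         AND
 wildloc.locgroup = mutloc.locgroup         AND
 wildloc.rank = 0                           AND  -- wild
 mutloc.rank > 0                            AND  -- mutant(s)
 wildloc.srcfeature_id = wildfeat.feature_id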
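The string arithmetic in the query can be checked outside the database. The following is a minimal Python sketch, assuming the interbase boundaries are available as plain integers; the function name mutant_residues and the test sequence are hypothetical. Python slices are zero-based and half-open, so they take interbase coordinates directly, whereas the SQL substr call needs the +1 adjustment.

```python
def mutant_residues(wild_residues, fmin, fmax, alt_residues):
    """Replace the interbase range [fmin, fmax) of wild_residues with alt_residues."""
    if not 0 <= fmin <= fmax <= len(wild_residues):
        raise ValueError("location falls outside the wild-type sequence")
    # Left flank + mutant residues + right flank, as in the SQL above.
    return wild_residues[:fmin] + alt_residues + wild_residues[fmax:]


wild = "ACGTACGT"
print(mutant_residues(wild, 3, 4, "C"))   # substitution -> ACGCACGT
print(mutant_residues(wild, 3, 3, "GG"))  # zero-length site, insertion -> ACGGGTACGT
print(mutant_residues(wild, 3, 5, ""))    # deletion -> ACGCGT
```

Because a zero-length range is a valid interbase location, the same expression covers substitutions, insertions and deletions.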


EXTENSIONS

The above will also work if we have a polymorphic site with a number of different possibilities across multiple strains. We just extend the number of rows in the location matrix (ie we have rank > 1).

We could also instantiate multiple SNPs, one per strain, and keep the locations pairwise.

SIMILARITIES TO ALIGNMENTS

You should hopefully notice the parallels between modeling SNPs and modeling pairwise (eg BLAST) and multiple alignments. The difference is, alignments would always have locgroup=0, with the rank distinguishing query from subject. Also, with an HSP feature, the residue_info is used to store the alignment string.

REDUNDANT STORAGE OF COORDINATES ON DIFFERENT ASSEMBLY LEVELS

Some groups may find it advantageous to redundantly store features relative to both BACs and chromosomes (or to mini-contigs and scaffolds... choose your favourite assembly units). The approach outlined above works perfectly well with this; we would simply add another column in the location matrix (i.e. another wild/mutant pair with a distinct locgroup). All queries should work the same.


gene-with-implicit-features-manifested

Some feature types such as introns are not normally manifested as rows in chado. They are normally derived on-the-fly from the gaps between consecutive exons. See for an example. Occasionally it may be desirable to store the introns as actual rows in the feature table, for example in a report database.

feature-localization

All features with sequence annotation should be localized using featureloc. Localized features must have a featureloc with rank=0 and locgroup=0. This is the primary location of the feature. The location always indicates the boundaries of the feature. If the feature is composed of distinct subfeatures (e.g. a transcript composed of exons), then it is NOT permitted to use multiple featurelocs to indicate this. Instead, there must be rows for the subfeatures, each with their own featureloc.

In a feature graph (i.e. a group of features connected via feature_relationship rows), all features will typically be localized relative to the same source feature (i.e. they will all have the same value for featureloc.srcfeature_id).

Features are typically localized to some kind of genomic or assembly feature, but chado does not constrain you to using only this. For example, localizing features relative to a transcript or polypeptide or even an exon is permitted, but unusual practices will most likely not be recognized by most software.
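The on-the-fly intron derivation mentioned under gene-with-implicit-features-manifested can be sketched as follows. This is a hedged Python illustration, not code from any Chado tool: the function name implicit_introns is hypothetical, and it assumes that all exons belong to one transcript and are localized to the same source feature with interbase (fmin, fmax) pairs.

```python
def implicit_introns(exons):
    """Derive intron (fmin, fmax) ranges from the gaps between consecutive exons.

    exons is a list of interbase (fmin, fmax) pairs on one source feature.
    """
    ordered = sorted(exons)
    introns = []
    for (_, prev_fmax), (next_fmin, _) in zip(ordered, ordered[1:]):
        if next_fmin > prev_fmax:  # abutting or overlapping exons leave no gap
            introns.append((prev_fmax, next_fmin))
    return introns


print(implicit_introns([(100, 200), (300, 450), (600, 700)]))
# [(200, 300), (450, 600)]
```

With interbase coordinates the intron boundaries are simply the neighbouring exon boundaries, with no +1/-1 corrections. Ordering by coordinate is an assumption that fails for trans-spliced genes.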

feature-localization-to-contigs-in-assembly

In an assembled genome, it is common to locate features relative to the top-level assembly units (e.g. chromosomes). However, it is also permissible to locate them relative to smaller units such as contigs or golden_path_units. If a genome assembly is not stable, it is common to locate relative to assembly units such as contigs. These contigs may then be localized relative to the top-level assembly units. This is known in chado terms as a location graph.

We discuss here location graphs of depth 2. See also n-level-assemblies. This scenario is often invisible to software interoperating with Chado. The software is free to only look at the main features and the contig-level feature and ignore the top-level assembly feature. It may sometimes be desirable to have software that can perform location transformations, mapping features from contigs to top-level units and back
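A location transformation of the kind mentioned above can be sketched for a location graph of depth 2. The Python below is only an illustration under stated assumptions: the Loc type and the function names to_top_level and to_contig are hypothetical, and the contig is assumed to map onto the top-level unit as one ungapped interbase range.

```python
from typing import NamedTuple


class Loc(NamedTuple):
    fmin: int
    fmax: int
    strand: int  # +1, -1 or 0


def to_top_level(feature_on_contig, contig_on_chrom):
    """Map a featureloc given on a contig onto the contig's source feature."""
    f, c = feature_on_contig, contig_on_chrom
    if c.strand >= 0:
        return Loc(c.fmin + f.fmin, c.fmin + f.fmax, f.strand)
    # Contig lies on the reverse strand: mirror the range and flip the strand.
    return Loc(c.fmax - f.fmax, c.fmax - f.fmin, -f.strand)


def to_contig(feature_on_chrom, contig_on_chrom):
    """Inverse mapping: from top-level coordinates back to contig coordinates."""
    f, c = feature_on_chrom, contig_on_chrom
    if c.strand >= 0:
        return Loc(f.fmin - c.fmin, f.fmax - c.fmin, f.strand)
    return Loc(c.fmax - f.fmax, c.fmax - f.fmin, -f.strand)


contig = Loc(1000, 1500, -1)  # contig placed on the chromosome, reverse strand
exon = Loc(100, 150, 1)       # exon placed on the contig
on_chrom = to_top_level(exon, contig)
print(on_chrom)                     # Loc(fmin=1350, fmax=1400, strand=-1)
print(to_contig(on_chrom, contig))  # Loc(fmin=100, fmax=150, strand=1)
```

Applying to_top_level repeatedly along a chain of source features would extend the same idea to deeper location graphs.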

Tool: apollo Status: unclear

Apollo should be happy to treat contigs just as if they were top-level units such as chromosome arms. However, the user may have to explicitly provide contigs if location queries are desired. For example, apollo may retrieve nothing if the user asks for a certain range on chromosome 4, and the features are located relative to contigs which are themselves on chromosome 4.

Tool: gbrowse Status: unclear

Gbrowse may expect features to be located relative to top-level units such as chromosomes.

redundant-localizations-to-different-assembly-levels

Features can be located relative to both contigs and top-level assembly units. Chado allows redundant feature localization using featureloc.locgroup>0. This allows a database to have primary locations for features relative to contigs, and secondary locations relative to top-level units such as chromosomes. The converse is also allowed.

This scenario is discouraged unless the chado db admin knows what they are doing. They must implement solutions to ensure that featurelocs with varying locgroup do not get out of sync. These solutions are not part of the standard Chado software suite. Nevertheless, this scenario may be useful for advanced users in certain circumstances

Tool: gbrowse Status: unclear

Not clear if gbrowse uses locgroup in querying. If it constrains by locgroup, then this is essentially the same as feature-localization-to-contigs-in-assembly.

Tool: apollo Status: partial

Not clear if apollo uses locgroup in querying. If it constrains by locgroup, then this is essentially the same as feature-localization-to-contigs-in-assembly. Apollo will not preserve redundant featurelocs when writing back to the db. This could lead to the db getting out of sync.

n-level-assemblies

In theory it is possible to have assemblies with variable depths, or with depths>2. This scenario is rare. If required, then Chado can deal with this: there is no theoretical limit to the depth of a location graph. One can have annotated features located relative to minicontigs which are located relative to supercontigs which are located relative to chromosomes. Most software that interoperates with Chado will not be able to deal with this, so this scenario is discouraged except for advanced users who have no other option.

unlocalized-gene

A gene without sequence-based localization. Many chado instances are purely concerned with genome annotation; in these cases it would be strange to have genes or other features such as transcripts with no localization (i.e. no featurelocs). However, this scenario is actually common when Chado is used in a wider context. We may know of the existence of genes through non-sequence evidence such as genetics. When we have no sequence-based localization it is perfectly valid to have gene features with no featurelocs. When the time comes to create genome annotations for these, we just 'fill out' the gene feature by adding transcript and exon features.

Tool: gbrowse Status: supported

Gbrowse supports this scenario in that unlocalized features will be ignored by the genome viewer, which is appropriate.

Tool: apollo Status: supported

Apollo supports this scenario in that unlocalized features will be ignored, which is appropriate behaviour for a genome annotation tool.


Table definitions

feature


A feature is a biological sequence or a section of a biological sequence, or a collection of such sections. Examples include genes, exons, transcripts, regulatory regions, polypeptides, protein domains, chromosome sequences, sequence variations, cross-genome match regions such as hits and HSPs and so on; see the Sequence Ontology for more

feature

 Column            Datatype   Description
 feature_id        integer
 dbxref_id         integer    An optional primary public stable identifier for this feature. Secondary identifiers and external dbxrefs go in table:feature_dbxref
 organism_id       integer    The organism to which this feature belongs. This column is mandatory
 name              varchar    The optional human-readable common name for a feature, for display purposes
 uniquename        text       The unique name for a feature; may not necessarily be particularly human-readable, although this is preferred. This name must be unique for this type of feature within this organism
 residues          text       A sequence of alphabetic characters representing biological residues (nucleic acids, amino acids). This column does not need to be manifested for all features; it is optional for features such as exons where the residues can be derived from the featureloc. It is recommended that the value for this column be manifested for features which may have non-contiguous sublocations (e.g. transcripts), since derivation at query time is non-trivial. For expressed sequence, the DNA sequence should be used rather than the RNA sequence
 seqlen            integer    The length of the residue feature. See column:residues. This column is partially redundant with the residues column, and also with featureloc. This column is required because the location may be unknown and the residue sequence may not be manifested, yet it may be desirable to store and query the length of the feature. The seqlen should always be manifested where the length of the sequence is known
 md5checksum       char       The 32-character checksum of the sequence, calculated using the MD5 algorithm. This is practically guaranteed to be unique for any sequence. This column thus acts as a unique identifier on the mathematical sequence
 type_id           integer    A required reference to a table:cvterm giving the feature type. This will typically be a Sequence Ontology identifier. This column is thus used to subclass the feature table
 is_analysis       boolean    Boolean indicating whether this feature is annotated or the result of an automated analysis. Analysis results also use the companalysis module. Note that the dividing line between analysis/annotation may be fuzzy; this should be determined on a per-project basis in a consistent manner. One requirement is that there should only be one non-analysis version of each wild-type gene feature in a genome, whereas the same gene feature can be predicted multiple times in different analyses
 is_obsolete       boolean    Boolean indicating whether this feature has been obsoleted. Some chado instances may choose to simply remove the feature altogether, others may choose to keep an obsolete row in the table
 timeaccessioned   timestamp  For handling object accession/modification timestamps (as opposed to db auditing info, handled elsewhere). The expectation is that these fields would be available to software interacting with chado
 timelastmodified  timestamp  For handling object accession/modification timestamps (as opposed to db auditing info, handled elsewhere). The expectation is that these fields would be available to software interacting with chado
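The two derived columns, seqlen and md5checksum, can be computed from a residues string as in the sketch below. It is a minimal Python illustration using the standard hashlib module; the function name sequence_columns is hypothetical, and hashing the residues exactly as stored (without case normalisation) is an assumption, not a documented Chado convention.

```python
import hashlib


def sequence_columns(residues):
    """Return (seqlen, md5checksum) for a residues string.

    Assumption: the checksum is taken over the residues exactly as stored.
    """
    checksum = hashlib.md5(residues.encode("ascii")).hexdigest()
    return len(residues), checksum


seqlen, md5checksum = sequence_columns("ACGTACGT")
print(seqlen)            # 8
print(len(md5checksum))  # 32 hexadecimal characters
```

Two features with identical residues receive the same checksum, which is why the column identifies the sequence rather than the feature row.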


featureloc

The location of a feature relative to another feature. IMPORTANT: INTERBASE COORDINATES ARE USED. (This is vital as it allows us to represent zero-length features, e.g. splice sites and insertion points, without an awkward fuzzy system.) Features typically have exactly ONE location, but this need not be the case. Some features may not be localized (e.g. a gene that has been characterized genetically but for which no sequence/molecular info is available).

NOTE ON MULTIPLE LOCATIONS: Each feature can have 0 or more locations. Multiple locations do NOT indicate non-contiguous locations (if a feature such as a transcript has a non-contiguous location, then the subfeatures such as exons should always be manifested). Instead, multiple featurelocs for a feature designate alternate locations or grouped locations; for instance, a feature designating a blast hit or HSP will have two locations, one on the query feature and one on the subject feature. Features representing sequence variation could have alternate locations instantiated on a feature on the mutant strain. The column:rank is used to differentiate these different locations. Reflexive locations should never be stored; this table is for -proper- (i.e. non-self) locations only, so nothing should be located relative to itself.


 Column           Datatype  Description
 featureloc_id    integer
 feature_id       integer   The feature that is being located. Any feature can have zero or more featurelocs
 srcfeature_id    integer   The source feature which this location is relative to. Every location is relative to another feature (however, this column is nullable, because the srcfeature may not be known). All locations are -proper-, that is, nothing should be located relative to itself. No cycles are allowed in the featureloc graph
 fmin             integer   The leftmost/minimal boundary in the linear range represented by the featureloc. Sometimes (e.g. in bioperl) this is called -start-, although this is confusing because it does not necessarily represent the 5-prime coordinate. IMPORTANT: These are space-based (INTERBASE) coordinates, counting from zero. To convert this to the leftmost position in a base-oriented system (e.g. GFF, bioperl), add 1 to fmin
 is_fmin_partial  boolean   This is typically false, but may be true if the value for column:fmin is inaccurate or the leftmost part of the range is unknown/unbounded
 fmax             integer   The rightmost/maximal boundary in the linear range represented by the featureloc. Sometimes (e.g. in bioperl) this is called -end-, although this is confusing because it does not necessarily represent the 3-prime coordinate. IMPORTANT: These are space-based (INTERBASE) coordinates, counting from zero. No conversion is required to go from fmax to the rightmost coordinate in a base-oriented system that counts from 1 (e.g. GFF, bioperl)
 is_fmax_partial  boolean   This is typically false, but may be true if the value for column:fmax is inaccurate or the rightmost part of the range is unknown/unbounded
 strand           integer   The orientation/directionality of the location. Should be 0, -1 or +1
 phase            integer   Phase of translation with respect to srcfeature_id. Values are 0, 1, 2. It may not be possible to manifest this column for some features such as exons, because the phase is dependent on the spliceform (the same exon can appear in multiple spliceforms). This column is mostly useful for predicted exons and CDSs
 residue_info     text      Alternative residues, when these differ from feature.residues. For instance, a SNP feature located on a wild and mutant protein would have different residue_info values. For alignment/similarity features, residue_info is used to represent the alignment string (CIGAR format). Note on variation features: even if we don't want to instantiate a mutant chromosome/contig feature, we can still represent a SNP etc. with 2 locations, one (rank 0) on the genome, while the other (rank 1) would have most fields null, except for residue_info
 locgroup         integer   This is used to manifest redundant, derivable extra locations for a feature. The default locgroup=0 is used for the DIRECT location of a feature. !! MOST CHADO USERS MAY NEVER USE featurelocs WITH locgroup>0 !! Transitively derived locations are indicated with locgroup>0. For example, the position of an exon on a BAC and in global chromosome coordinates. This column is used to differentiate these groupings of locations. The default locgroup 0 is used for the main/primary location, from which the others can be derived via coordinate transformations. Another example of redundant locations is storing ORF coordinates relative to both transcript and genome. Redundant locations open the possibility of the database getting into inconsistent states; this schema gives us the flexibility of both warehouse instantiations with redundant locations (easier for querying) and management instantiations with no redundant locations. An example of using both locgroup and rank: imagine a feature indicating a conserved region between the chromosomes of two different species. We may want to keep redundant locations on both contigs and chromosomes. We would thus have 4 locations for the single conserved region feature: two distinct locgroups (contig level and chromosome level) and two distinct ranks (for the two species)
 rank             integer   Used when a feature has >1 location, otherwise the default rank 0 is used. Some features (e.g. blast hits and HSPs) have two locations, one on the query and one on the subject. Rank is used to differentiate these. Rank=0 is always used for the query, Rank=1 for the subject. For multiple alignments, assignment of rank is arbitrary. Rank is also used for sequence variant features, such as SNPs. Rank=0 indicates the wildtype (or baseline) feature, Rank=1 indicates the mutant (or compared) feature
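The conversion rules given for fmin and fmax can be written down directly. The following Python sketch is illustrative; the function names are hypothetical, and raising an error for zero-length features is a design choice of this example, since a 1-based inclusive system such as GFF has no direct equivalent for them.

```python
def interbase_to_base(fmin, fmax):
    """Interbase (fmin, fmax) -> 1-based inclusive (start, end), as in GFF or bioperl."""
    if fmin == fmax:
        raise ValueError("zero-length feature has no 1-based inclusive range")
    return fmin + 1, fmax  # add 1 to fmin; fmax needs no conversion


def base_to_interbase(start, end):
    """1-based inclusive (start, end) -> interbase (fmin, fmax)."""
    return start - 1, end


print(interbase_to_base(2000000, 2000001))  # (2000001, 2000001): a single base
print(base_to_interbase(1, 10))             # (0, 10): the first ten residues
```

The length of a feature in interbase coordinates is simply fmax - fmin, so the SNP at (2000000, 2000001) has length 1 and a splice site at (n, n) has length 0.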


featureloc_pub

COMMENT ON INDEX featureloc c1 IS ’locgroup and rank serve to uniquely


 Table 4.3: featureloc_pub
 Column             Datatype  Description
 featureloc_pub_id  integer
 featureloc_id      integer
 pub_id             integer


feature_pub

Provenance. Linking table between features and publications that mention them


 Table 4.4: feature_pub
 Column          Datatype  Description
 feature_pub_id  integer
 feature_id      integer
 pub_id          integer


featureprop

A feature can have any number of slot-value property tags attached to it. This is an alternative to hardcoding a list of columns in the relational schema, and is completely extensible


 Table 4.5: featureprop
 Column          Datatype  Description
 featureprop_id  integer
 feature_id      integer
 type_id         integer   The name of the property/slot is a cvterm. The meaning of the property is defined in that cvterm. Certain property types will only apply to certain feature types (e.g. the anticodon property will only apply to tRNA features); the types here come from the sequence feature property ontology
 value           text      The value of the property, represented as text. Numeric values are converted to their text representation. This is less efficient than using native database types, but is easier to query.
 rank            integer   Property-Value ordering. Any feature can have multiple values for any particular property type; these are ordered in a list using rank, counting from zero. For properties that are single-valued rather than multi-valued, the default 0 value should be used
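The rank-based ordering of multi-valued properties can be tried out with a small stand-in table. This sketch uses Python's sqlite3 module and a simplified copy of featureprop; the helper names make_db, add_property and property_values are hypothetical, the integer ids are made up, and the uniqueness constraint on (feature_id, type_id, rank) is included here as an assumption about how rank is meant to be used.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a simplified featureprop table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE featureprop (
            featureprop_id INTEGER PRIMARY KEY,
            feature_id     INTEGER NOT NULL,
            type_id        INTEGER NOT NULL,
            value          TEXT,
            rank           INTEGER NOT NULL DEFAULT 0,
            UNIQUE (feature_id, type_id, rank)
        )
        """
    )
    return conn


def add_property(conn, feature_id, type_id, value):
    """Append a value for (feature, property type) at the next free rank."""
    (next_rank,) = conn.execute(
        "SELECT COALESCE(MAX(rank) + 1, 0) FROM featureprop "
        "WHERE feature_id = ? AND type_id = ?",
        (feature_id, type_id),
    ).fetchone()
    conn.execute(
        "INSERT INTO featureprop (feature_id, type_id, value, rank) "
        "VALUES (?, ?, ?, ?)",
        (feature_id, type_id, str(value), next_rank),  # numbers stored as text
    )
    return next_rank


def property_values(conn, feature_id, type_id):
    """Return the values of one property type for one feature, ordered by rank."""
    rows = conn.execute(
        "SELECT value FROM featureprop "
        "WHERE feature_id = ? AND type_id = ? ORDER BY rank",
        (feature_id, type_id),
    )
    return [value for (value,) in rows]


conn = make_db()
add_property(conn, 1, 10, "first comment")
add_property(conn, 1, 10, 42)
print(property_values(conn, 1, 10))  # ['first comment', '42']
```

A single-valued property simply occupies rank 0 and never receives a second row.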


featureprop_pub

for any one feature, multivalued property-value pairs must be differentiated by rank


 Table 4.6: featureprop_pub
 Column              Datatype  Description
 featureprop_pub_id  integer
 featureprop_id      integer
 pub_id              integer


feature_dbxref

Links a feature to dbxrefs. This is for secondary identifiers; primary identifiers should use feature.dbxref_id


 Table 4.7: feature_dbxref
 Column             Datatype  Description
 feature_dbxref_id  integer
 feature_id         integer
 dbxref_id          integer
 is_current         boolean   The is_current boolean indicates whether the linked dbxref is the current -official- dbxref for the linked feature


feature_relationship

Features can be arranged in graphs, e.g. exon part_of transcript part_of gene; translation madeby transcript. If type is thought of as a verb, each arc makes a statement [SUBJECT VERB OBJECT]. The object can also be thought of as the parent (containing feature), and the subject as the child (contained feature or subfeature). We include the relationship rank/order because, even though most of the time we can order things implicitly by sequence coordinates, we can't always do this, e.g. for trans-spliced genes. It is also useful for quickly getting implicit introns.


 Table 4.8: feature_relationship
 Column                   Datatype  Description
 feature_relationship_id  integer
 subject_id               integer   The subject of the subj-predicate-obj sentence. This is typically the subfeature
 object_id                integer   The object of the subj-predicate-obj sentence. This is typically the container feature
 type_id                  integer   Relationship type between subject and object. This is a cvterm, typically from the OBO relationship ontology, although other relationship types are allowed. The most common relationship type is OBO_REL:part_of. Valid relationship types are constrained by the Sequence Ontology
 value                    text      Additional notes/comments
 rank                     integer   The ordering of subject features with respect to the object feature may be important (for example, exon ordering on a transcript, which is not always derivable if you take trans-spliced genes into consideration). rank is used to order these; starts from zero
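The [SUBJECT VERB OBJECT] reading and the role of rank can be illustrated with a few in-memory rows. This Python sketch is hypothetical throughout: the tuple layout, the feature names and the function name subfeatures are made up for the example, and relationship types are shown as plain strings rather than cvterm ids.

```python
# Hypothetical feature_relationship rows as (subject, type, object, rank).
RELATIONSHIPS = [
    ("exon-3", "part_of", "transcript-1", 2),
    ("exon-1", "part_of", "transcript-1", 0),
    ("exon-2", "part_of", "transcript-1", 1),
    ("transcript-1", "part_of", "gene-1", 0),
]


def subfeatures(relationships, object_id, rel_type="part_of"):
    """Return the subjects related to object_id by rel_type, ordered by rank."""
    matches = [
        (rank, subject)
        for subject, rtype, obj, rank in relationships
        if obj == object_id and rtype == rel_type
    ]
    return [subject for _rank, subject in sorted(matches)]


print(subfeatures(RELATIONSHIPS, "transcript-1"))  # ['exon-1', 'exon-2', 'exon-3']
print(subfeatures(RELATIONSHIPS, "gene-1"))        # ['transcript-1']
```

Ordering by rank rather than by coordinates keeps the exon order well defined even when the exons are localized to different source features.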


feature_relationship_pub

Provenance. Attach optional evidence to a feature relationship in the form of a publication


 Table 4.9: feature_relationship_pub
 Column                       Datatype  Description
 feature_relationship_pub_id  integer
 feature_relationship_id      integer
 pub_id                       integer


feature_relationshipprop

Extensible properties for feature relationships. Analogous structure to featureprop. This table is largely optional and not used with a high frequency. A typical scenario is when one wishes to attach additional data to a feature relationship, for example to say that the feature relationship is only true in certain contexts


 Table 4.10: feature_relationshipprop
 Column                       Datatype  Description
 feature_relationshipprop_id  integer
 feature_relationship_id      integer
 type_id                      integer   The name of the property/slot is a cvterm. The meaning of the property is defined in that cvterm. Currently there is no standard ontology for feature relationship property types
 value                        text      The value of the property, represented as text. Numeric values are converted to their text representation. This is less efficient than using native database types, but is easier to query.
 rank                         integer   Property-Value ordering. Any feature relationship can have multiple values for any particular property type; these are ordered in a list using rank, counting from zero. For properties that are single-valued rather than multi-valued, the default 0 value should be used


feature_relationshipprop_pub

Provenance for feature_relationshipprop


 Table 4.11: feature_relationshipprop_pub
 Column                           Datatype  Description
 feature_relationshipprop_pub_id  integer
 feature_relationshipprop_id      integer
 pub_id                           integer


feature_cvterm

Associate a term from a cv with a feature, for example, GO annotation


 Table 4.12: feature_cvterm
 Column             Datatype  Description
 feature_cvterm_id  integer
 feature_id         integer
 cvterm_id          integer
 pub_id             integer   Provenance for the annotation. Each annotation should have a single primary publication (which may be of the appropriate type for computational analyses) where more details can be found. Additional provenance dbxrefs can be attached using feature_cvterm_dbxref
 is_not             boolean   If this is set to true, then this annotation is interpreted as a NEGATIVE annotation, i.e. the feature does NOT have the specified function, process, component, part, etc. See GO docs for more details


feature_cvtermprop

Extensible properties for feature to cvterm associations. Examples: GO evidence codes; qualifiers; metadata such as the date on which the entry was curated and the source of the association. See the featureprop table for meanings of type_id, value and rank


 Table 4.13: feature_cvtermprop
 Column                 Datatype  Description
 feature_cvtermprop_id  integer
 feature_cvterm_id      integer
 type_id                integer   The name of the property/slot is a cvterm. The meaning of the property is defined in that cvterm. cvterms may come from the OBO evidence code cv
 value                  text      The value of the property, represented as text. Numeric values are converted to their text representation. This is less efficient than using native database types, but is easier to query.
 rank                   integer   Property-Value ordering. Any feature_cvterm can have multiple values for any particular property type; these are ordered in a list using rank, counting from zero. For properties that are single-valued rather than multi-valued, the default 0 value should be used


feature_cvterm_dbxref

Additional dbxrefs for an association. Rows in the feature_cvterm table may be backed up by dbxrefs. For example, a feature_cvterm association that was inferred via a protein-protein interaction may be backed up by referring to the dbxref for the alternate protein. Corresponds to the WITH column in a GO gene association file (but can also be used for other analogous associations). See http://www.geneontology.org/doc/GO.annotation.shtml#file for more details


 Table 4.14: feature_cvterm_dbxref
 Column                    Datatype  Description
 feature_cvterm_dbxref_id  integer
 feature_cvterm_id         integer
 dbxref_id                 integer


feature_cvterm_pub

Secondary pubs for an association. Each feature_cvterm association is supported by a single primary publication. Additional secondary pubs can be added using this linking table (in a GO gene association file, these correspond to any IDs after the pipe symbol in the publications column)


 Table 4.15: feature_cvterm_pub
 Column                 Datatype  Description
 feature_cvterm_pub_id  integer
 feature_cvterm_id      integer
 pub_id                 integer


synonym

A synonym for a feature. One feature can have multiple synonyms, and the same synonym can apply to multiple features


 Table 4.16: synonym
 Column        Datatype  Description
 synonym_id    integer
 name          varchar   The synonym itself. Should be human-readable machine-searchable ascii text
 type_id       integer   Types would be symbol and fullname for now
 synonym_sgml  varchar   The fully specified synonym, with any non-ascii characters encoded in SGML


feature_synonym

Linking table between feature and synonym


 Table 4.17: feature_synonym
 Column              Datatype  Description
 feature_synonym_id  integer
 synonym_id          integer
 feature_id          integer
 pub_id              integer   The pub_id link is for relating the usage of a given synonym to the publication in which it was used
 is_current          boolean   The is_current boolean indicates whether the linked synonym is the current -official- symbol for the linked feature
 is_internal         boolean   Typically a synonym exists so that somebody querying the db with an obsolete name can find the object they're looking for (under its current name). If the synonym has been used publicly and deliberately (e.g. in a paper), it may also be listed in reports as a synonym. If the synonym was not used deliberately (e.g. there was a typo which went public), then the is_internal boolean may be set to -true- so that it is known that the synonym is -internal- and should be queryable but should not be listed in reports as a valid synonym


genotype

 Table 4.35: genotype
 Column       Datatype  Description
 genotype_id  integer
 uniquename   text
 description  varchar


feature_genotype

 Table 4.36: feature_genotype
 Column               Datatype  Description
 feature_genotype_id  integer
 feature_id           integer
 genotype_id          integer
 chromosome_id        integer
 rank                 integer
 cgroup               integer
 cvterm_id            integer


environment

 Table 4.37: environment
 Column          Datatype  Description
 environment_id  integer
 uniquename      text
 description     text


environment_cvterm

 Table 4.38: environment_cvterm
 Column                 Datatype  Description
 environment_cvterm_id  integer
 environment_id         integer
 cvterm_id              integer


phenstatement

Phenotypes are things like "larval lethal". Phenstatements are things like "dpp[1] is recessive larval lethal". So essentially phenstatement is a linking table expressing the relationship between genotype, environment, and phenotype.


 Table 4.39: phenstatement
 Column            Datatype  Description
 phenstatement_id  integer
 genotype_id       integer
 environment_id    integer
 phenotype_id      integer
 type_id           integer
 pub_id            integer


phendesc

A summary of a set of phenotypic statements for any one gcontext made in any one publication


 Table 4.40: phendesc
 Column          Datatype  Description
 phendesc_id     integer
 genotype_id     integer
 environment_id  integer
 description     text
 pub_id          integer


phenotype_comparison

Comparison of phenotypes, e.g. genotype1/environment1/phenotype1 "non-suppressible" wrt genotype2/environment2/phenotype2


 Table 4.41: phenotype_comparison
 Column                   Datatype  Description
 phenotype_comparison_id  integer
 genotype1_id             integer
 environment1_id          integer
 genotype2_id             integer
 environment2_id          integer
 phenotype1_id            integer
 phenotype2_id            integer
 type_id                  integer
 pub_id                   integer


phenotype

A phenotypic statement, or a single atomic phenotypic observation: a controlled sentence describing the observable effect of non-wt function, e.g. Obs=eye, attribute=color, cvalue=red


 Table 4.42: phenotype
 Column         Datatype  Description
 phenotype_id   integer
 uniquename     text
 observable_id  integer   The entity: e.g. anatomy part, biological process
 attr_id        integer   Phenotypic attribute (quality, property, attribute, character), drawn from PATO
 value          text      Value of attribute: unconstrained free text. Used only if cvalue_id is not appropriate
 cvalue_id      integer   Phenotype attribute value (state)
 assay_id       integer   Evidence type


phenotype_cvterm



 Table 4.43: phenotype_cvterm
 Column               Datatype  Description
 phenotype_cvterm_id  integer
 phenotype_id         integer
 cvterm_id            integer


feature_phenotype



 Table 4.44: feature_phenotype
 Column                Datatype  Description
 feature_phenotype_id  integer
 feature_id            integer
 phenotype_id          integer

featuremap

NOTE: this module is all due for revision...


 Table 4.45: featuremap
 Column         Datatype  Description
 featuremap_id  integer
 name           varchar
 description    text
 unittype_id    integer


featurerange

 Table 4.46: featurerange
 Column           Datatype  Description
 featurerange_id  integer
 featuremap_id    integer
 feature_id       integer
 leftstartf_id    integer
 leftendf_id      integer
 rightstartf_id   integer
 rightendf_id     integer
 rangestr         varchar


featurepos

 Table 4.47: featurepos
 Column          Datatype  Description
 featurepos_id   integer
 featuremap_id   integer
 feature_id      integer
 map_feature_id  integer
 mappos          float


featuremap_pub

map feature id links to the feature (map) upon which the feature is


 Table 4.48: featuremap_pub
 Column             Datatype  Description
 featuremap_pub_id  integer
 featuremap_id      integer
 pub_id             integer