Difference between revisions of "Sample Chado SQL"

From GMOD
Jump to: navigation, search
(started sample chado SQL snippets)
 
(Abstract)
Line 4: Line 4:
 
view basic  genome data from a Chado database.
 
view basic  genome data from a Chado database.
 
Some of these are drawn from the [[GMODTools]]
 
Some of these are drawn from the [[GMODTools]]
configuration file
+
configuration file GMODTools/conf/bulkfiles/chadofeatsql.xml
GMODTools/conf/bulkfiles/chadofeatsql.xml
+
Example output of some of these is shown in the
 
+
tables at http://insects.eugenes.org/genome/Drosophila_melanogaster/current/tables/
  
 
=organism_summary=
 
=organism_summary=

Revision as of 20:09, 16 April 2007

Abstract

This HOWTO provides several sample SQL queries to view basic genome data from a Chado database. Some of these are drawn from the GMODTools configuration file GMODTools/conf/bulkfiles/chadofeatsql.xml Example output of some of these is shown in the tables at http://insects.eugenes.org/genome/Drosophila_melanogaster/current/tables/

organism_summary

This lists organisms and number of features per organism.

   SELECT o.organism_id,o.abbreviation,o.genus,o.species,o.common_name, 
          count(f.feature_id) as n_features, o.comment
   FROM organism o LEFT JOIN feature f USING (organism_id)
   GROUP by o.organism_id,o.abbreviation,o.genus,o.species,o.common_name,o.comment
   ORDER BY o.genus,o.species
   ;

feature_summary

This lists number of features and sequences by species and type.

 SELECT 
   f.type_id,
   t.name as Feature_type,
   count(f.feature_id) as N_features,
   sum(length(f.residues)) as N_residues,
   sum(f.seqlen) as Tot_len,
   ROUND( AVG(f.seqlen), 0 ) as Ave_len,
   MIN(f.seqlen) as Min_len,
   MAX(f.seqlen) as Max_len,
   (select genus || '_' || species from organism where organism_id = f.organism_id) as Species
 FROM feature f, cvterm t
 WHERE f.type_id = t.cvterm_id
 GROUP BY f.organism_id, f.type_id, t.name
 ORDER BY species, feature_type
 ;

analysis_summary

This lists analyses and number of features per analysis.

 SELECT 
   an.analysis_id,
   CASE WHEN (an.sourcename IS NULL OR an.sourcename = 'dummy') THEN 'match:' || an.program
     ELSE 'match:' || an.program || ':' || an.sourcename
   END AS Analysis_type, 
   count(f.feature_id) as N_features,
   ROUND( (AVG(af.rawscore)::numeric), 2 ) as Ave_score,
   ROUND( (AVG(af.significance)::numeric),  2 ) as Ave_sig,
   (select genus || '_' || species from organism where organism_id = f.organism_id) as Species
 FROM feature f, analysisfeature af, analysis an
 WHERE an.analysis_id = af.analysis_id and af.feature_id = f.feature_id
 GROUP BY f.organism_id, an.analysis_id, Analysis_type  
 ORDER BY species, Analysis_type
 ;

property_summary

This lists properties and number of features per analysis.

 SELECT 
   fp.type_id,
   t.name as Property_type,
   count(fp.featureprop_id) as N_properties,
   count(distinct f.feature_id) as N_features,
   count(distinct fp.value) as N_values,
   (select genus || '_' || species from organism where organism_id = f.organism_id) as Species
 FROM feature f, featureprop fp, cvterm t
 WHERE fp.type_id = t.cvterm_id and fp.feature_id = f.feature_id
 GROUP BY f.organism_id, fp.type_id, t.name
 ORDER BY Species, Property_type
 ;

gene_page

This is a sample gene page view to list most attributes for a gene feature. NOTE: this kind of multi-table join view can be very slow to execute on a large genome database.

Usage: dev_chado_01c=# select v.* from v_genepage2 v join feature as f using (feature_id) where f.name = 'PAU1';


 CREATE OR REPLACE VIEW v_genepage2 
     (feature_id, field, value)
   AS
     SELECT feature_id AS feature_id, 'Name' as field, name as value FROM feature
   UNION ALL
     SELECT feature_id AS feature_id, 'uniquename' as field, uniquename as value FROM feature 
   UNION ALL
     SELECT feature_id AS feature_id, 'seqlen' as field, cast(seqlen as text) as value FROM feature  
   UNION ALL
     SELECT f.feature_id AS feature_id, 'type' as field, c.name as value 
     FROM feature f, cvterm c  WHERE f.type_id = c.cvterm_id 
   UNION ALL
     SELECT f.feature_id AS feature_id, 'organism' as field, o.abbreviation as value  
     FROM feature f, organism o WHERE  f.organism_id = o.organism_id 
   
   UNION ALL
     SELECT fs.feature_id AS feature_id,  
       CASE WHEN fs.is_current IS FALSE THEN 'Synonym_2nd' ELSE 'Synonym' END AS field,
       s.name as value 
     FROM  feature_synonym fs, synonym s 
     WHERE fs.synonym_id = s.synonym_id  
   
   UNION ALL
     SELECT f.feature_id AS feature_id, 'Dbxref' as field, gd.name||':'||gx.accession as value
     FROM   feature f, db gd, dbxref gx
     WHERE  f.dbxref_id = gx.dbxref_id and gx.db_id = gd.db_id  
   
   UNION ALL
     SELECT fs.feature_id AS feature_id,  
       CASE WHEN fs.is_current IS FALSE THEN 'Dbxref obsolete' ELSE 'Dbxref 2' END AS field, 
       (d.name || ':' || s.accession)::text AS value
     FROM  feature_dbxref fs, dbxref s, db d
     WHERE fs.dbxref_id = s.dbxref_id and s.db_id = d.db_id
   
   UNION ALL
     SELECT fc.feature_id AS feature_id, c.name AS field, 
           substr(cv.name,1,40) || '; '|| dx.accession AS value
     FROM  feature_cvterm fc, cvterm cv, cv c, dbxref dx
     WHERE fc.cvterm_id = cv.cvterm_id and cv.cv_id = c.cv_id  
           and cv.dbxref_id = dx.dbxref_id
   
   UNION ALL
     SELECT fp.feature_id AS feature_id, cv.name AS field, fp.value AS value
     FROM  featureprop fp, cvterm cv
     WHERE fp.type_id = cv.cvterm_id  
   
   UNION ALL
     SELECT fl.feature_id AS feature_id, 'location' as field, 
         chr.uniquename ||':'|| cast( fl.fmin+1 as text) ||'..'|| cast( fl.fmax as text)
         || CASE 
           WHEN fl.strand IS NULL THEN ' '
           WHEN fl.strand < 0 THEN ' [-]'
           ELSE ' [+]'
           END AS value
     FROM   featureloc fl, feature chr
     WHERE  fl.srcfeature_id = chr.feature_id 
   
   UNION ALL
     SELECT af.feature_id AS feature_id,   
      'an:' ||  
      CASE 
         WHEN a.name IS NOT NULL THEN a.name
         WHEN a.sourcename IS NOT NULL THEN (a.program || '.' || a.sourcename)::text
         ELSE a.program
       END  AS field,
       CASE  
         WHEN af.rawscore IS NOT NULL THEN cast(af.rawscore as text)
         WHEN af.normscore  IS NOT NULL  THEN cast(af.normscore  as text)
         WHEN af.significance  IS NOT NULL THEN cast(af.significance as text)
         ELSE cast(af.identity  as text)
       END  AS value 
     FROM   analysisfeature af, analysis a
     WHERE  af.analysis_id = a.analysis_id 
 ;
 

simple gene_page output

 dev_chado_01c=# select v.* from v_genepage2 v join feature as f using (feature_id) where f.name = 'PAU1';
  feature_id |   field    |          value           
 ------------+------------+--------------------------
          23 | Name       | PAU1
          23 | uniquename | PAU1
          23 | seqlen     | 
          23 | type       | gene
          23 | organism   | S.cerevisiae
          23 | Synonym    | PAU1
          23 | Dbxref     | GeneID:853232
          23 | Dbxref 2   | GFF_source:GenBank
          23 | Dbxref 2   | GeneID:853232
          23 | gene       | PAU1
          23 | locus_tag  | YJL223C
          23 | location   | NC_001142:8776..9138 [-]

... above data was loaded from Yeast GenBank Genome (i.e. not very complex)

More Information

Please send questions to the GMOD developers list:

gmod-devel@lists.sourceforge.net

Authors