Difference between revisions of "Sample Chado SQL"
Dongilbert (Talk | contribs) (started sample chado SQL snippets) |
Dongilbert (Talk | contribs) (→Abstract) |
||
Line 4: | Line 4: | ||
view basic genome data from a Chado database. | view basic genome data from a Chado database. | ||
Some of these are drawn from the [[GMODTools]] | Some of these are drawn from the [[GMODTools]] | ||
− | configuration file | + | configuration file GMODTools/conf/bulkfiles/chadofeatsql.xml |
− | GMODTools/conf/bulkfiles/chadofeatsql.xml | + | Example output of some of these is shown in the |
− | + | tables at http://insects.eugenes.org/genome/Drosophila_melanogaster/current/tables/ | |
=organism_summary= | =organism_summary= |
Revision as of 20:09, 16 April 2007
Contents
Abstract
This HOWTO provides several sample SQL queries to view basic genome data from a Chado database. Some of these are drawn from the GMODTools configuration file GMODTools/conf/bulkfiles/chadofeatsql.xml Example output of some of these is shown in the tables at http://insects.eugenes.org/genome/Drosophila_melanogaster/current/tables/
organism_summary
This lists organisms and number of features per organism.
SELECT o.organism_id,o.abbreviation,o.genus,o.species,o.common_name, count(f.feature_id) as n_features, o.comment FROM organism o LEFT JOIN feature f USING (organism_id) GROUP by o.organism_id,o.abbreviation,o.genus,o.species,o.common_name,o.comment ORDER BY o.genus,o.species ;
feature_summary
This lists number of features and sequences by species and type.
SELECT f.type_id, t.name as Feature_type, count(f.feature_id) as N_features, sum(length(f.residues)) as N_residues, sum(f.seqlen) as Tot_len, ROUND( AVG(f.seqlen), 0 ) as Ave_len, MIN(f.seqlen) as Min_len, MAX(f.seqlen) as Max_len, (select genus || '_' || species from organism where organism_id = f.organism_id) as Species FROM feature f, cvterm t WHERE f.type_id = t.cvterm_id GROUP BY f.organism_id, f.type_id, t.name ORDER BY species, feature_type ;
analysis_summary
This lists analyses and number of features per analysis.
SELECT an.analysis_id, CASE WHEN (an.sourcename IS NULL OR an.sourcename = 'dummy') THEN 'match:' || an.program ELSE 'match:' || an.program || ':' || an.sourcename END AS Analysis_type, count(f.feature_id) as N_features, ROUND( (AVG(af.rawscore)::numeric), 2 ) as Ave_score, ROUND( (AVG(af.significance)::numeric), 2 ) as Ave_sig, (select genus || '_' || species from organism where organism_id = f.organism_id) as Species FROM feature f, analysisfeature af, analysis an WHERE an.analysis_id = af.analysis_id and af.feature_id = f.feature_id GROUP BY f.organism_id, an.analysis_id, Analysis_type ORDER BY species, Analysis_type ;
property_summary
This lists properties and number of features per analysis.
SELECT fp.type_id, t.name as Property_type, count(fp.featureprop_id) as N_properties, count(distinct f.feature_id) as N_features, count(distinct fp.value) as N_values, (select genus || '_' || species from organism where organism_id = f.organism_id) as Species FROM feature f, featureprop fp, cvterm t WHERE fp.type_id = t.cvterm_id and fp.feature_id = f.feature_id GROUP BY f.organism_id, fp.type_id, t.name ORDER BY Species, Property_type ;
gene_page
This is a sample gene page view to list most attributes for a gene feature. NOTE: this kind of multi-table join view can be very slow to execute on a large genome database.
Usage: dev_chado_01c=# select v.* from v_genepage2 v join feature as f using (feature_id) where f.name = 'PAU1';
CREATE OR REPLACE VIEW v_genepage2 (feature_id, field, value) AS SELECT feature_id AS feature_id, 'Name' as field, name as value FROM feature UNION ALL SELECT feature_id AS feature_id, 'uniquename' as field, uniquename as value FROM feature UNION ALL SELECT feature_id AS feature_id, 'seqlen' as field, cast(seqlen as text) as value FROM feature UNION ALL SELECT f.feature_id AS feature_id, 'type' as field, c.name as value FROM feature f, cvterm c WHERE f.type_id = c.cvterm_id UNION ALL SELECT f.feature_id AS feature_id, 'organism' as field, o.abbreviation as value FROM feature f, organism o WHERE f.organism_id = o.organism_id UNION ALL SELECT fs.feature_id AS feature_id, CASE WHEN fs.is_current IS FALSE THEN 'Synonym_2nd' ELSE 'Synonym' END AS field, s.name as value FROM feature_synonym fs, synonym s WHERE fs.synonym_id = s.synonym_id UNION ALL SELECT f.feature_id AS feature_id, 'Dbxref' as field, gd.name||':'||gx.accession as value FROM feature f, db gd, dbxref gx WHERE f.dbxref_id = gx.dbxref_id and gx.db_id = gd.db_id UNION ALL SELECT fs.feature_id AS feature_id, CASE WHEN fs.is_current IS FALSE THEN 'Dbxref obsolete' ELSE 'Dbxref 2' END AS field, (d.name || ':' || s.accession)::text AS value FROM feature_dbxref fs, dbxref s, db d WHERE fs.dbxref_id = s.dbxref_id and s.db_id = d.db_id UNION ALL SELECT fc.feature_id AS feature_id, c.name AS field, substr(cv.name,1,40) || '; '|| dx.accession AS value FROM feature_cvterm fc, cvterm cv, cv c, dbxref dx WHERE fc.cvterm_id = cv.cvterm_id and cv.cv_id = c.cv_id and cv.dbxref_id = dx.dbxref_id UNION ALL SELECT fp.feature_id AS feature_id, cv.name AS field, fp.value AS value FROM featureprop fp, cvterm cv WHERE fp.type_id = cv.cvterm_id UNION ALL SELECT fl.feature_id AS feature_id, 'location' as field, chr.uniquename ||':'|| cast( fl.fmin+1 as text) ||'..'|| cast( fl.fmax as text) || CASE WHEN fl.strand IS NULL THEN ' ' WHEN fl.strand < 0 THEN ' [-]' ELSE ' [+]' END AS value FROM featureloc fl, feature chr WHERE fl.srcfeature_id = chr.feature_id UNION ALL SELECT af.feature_id AS feature_id, 'an:' || CASE WHEN a.name IS NOT NULL THEN a.name WHEN a.sourcename IS NOT NULL THEN (a.program || '.' || a.sourcename)::text ELSE a.program END AS field, CASE WHEN af.rawscore IS NOT NULL THEN cast(af.rawscore as text) WHEN af.normscore IS NOT NULL THEN cast(af.normscore as text) WHEN af.significance IS NOT NULL THEN cast(af.significance as text) ELSE cast(af.identity as text) END AS value FROM analysisfeature af, analysis a WHERE af.analysis_id = a.analysis_id ;
simple gene_page output
dev_chado_01c=# select v.* from v_genepage2 v join feature as f using (feature_id) where f.name = 'PAU1'; feature_id | field | value ------------+------------+-------------------------- 23 | Name | PAU1 23 | uniquename | PAU1 23 | seqlen | 23 | type | gene 23 | organism | S.cerevisiae 23 | Synonym | PAU1 23 | Dbxref | GeneID:853232 23 | Dbxref 2 | GFF_source:GenBank 23 | Dbxref 2 | GeneID:853232 23 | gene | PAU1 23 | locus_tag | YJL223C 23 | location | NC_001142:8776..9138 [-]
... above data was loaded from Yeast GenBank Genome (i.e. not very complex)
More Information
Please send questions to the GMOD developers list:
gmod-devel@lists.sourceforge.net
Authors
- Dongilbert 16:05, 16 April 2007 (EDT)