Chado FAQ

From GMOD
Revision as of 18:14, 29 August 2008 by Scott (Talk | contribs)

About this FAQ

What is this FAQ?

It is the list of Frequently Asked Questions about Chado.

How is it maintained?

It is now maintained as a Wiki on this site. You can help maintain it by adding questions and answers.

Chado Questions

How do you pronounce Chado?

Chado is usually pronounced like this.

How does one represent BLAST results in Chado? or alignments? or...

Questions about the best ways to represent a variety of observations are answered at the Chado Best Practices page. There is also a worked example for this at Load_BLAST_Into_Chado.

Where do I find a list of tables in Chado?

See the Chado Tables page.

What are the modules in Chado?

They are listed in the Chado Manual page.

Is there a Chado for Beginners?

The best place to start would be the Chado Manual or GMOD for the Biologist.

Loading data into Chado

When I try to load data into Chado using the GFF bulk loader (gmod_bulk_load_gff3.pl), I get this error:
 DBD::Pg::db pg_endcopy failed: ERROR:  invalid input syntax for integer: ""
 CONTEXT:  COPY feature_relationship, line 1, column type_id: "" at /usr/lib/perl5/site_perl/5.8.8/Bio/GMOD/DB/Adapter.pm line 2723, <$fh> line 64298.
Why is that and what do I do?
Unfortunately, there is a bug in one of the prerequisites for the Chado loader, a Perl module called DBIx::DBStag, which does the actual writing of ontology data to the database. When it loads the Gene Ontology (and possibly other ontologies), it destroys the 'part_of' cvterm that belongs to the relationship ontology and makes it part of GO instead. The loader then appears unable to find a type_id for part_of relationships, which would explain the empty value reported in the error. This is the wrong behavior, but at the moment there is nothing we can do about it.
Instead, you must run a SQL command to repair the database:

<sql>

update cvterm set cv_id = (select cv_id from cv where name = 'relationship')
 where name = 'part_of'
  and cv_id in (select cv_id from cv where name='gene_ontology');

</sql>

Then, rerunning the loader with the --recreate_cache option should allow the database to load. Sorry for the hassle.

Why is loading GFF3 data so slow and what can I do about it?

The gmod_bulk_load_gff3.pl script has to do quite a bit of work that the similarly named bp_bulk_load_gff.pl does not have to do. Since Chado makes extensive use of constraints and foreign key relationships, the bulk loader has to keep track of all of those constraints while parsing the GFF3 file. Also, it loads the data in a single database transaction, which can take quite a while if there is a lot of data.
So, what can you do about it? First, I would suggest breaking the load up into several smaller chunks and loading them sequentially. The script gmod_gff3_preprocessor.pl has options for splitting GFF3 files in several ways, such as by chromosome or by "source" (the value in the second column of the GFF3 file). Typically, when I do this, I create a simple bash script that loads the files one by one, then run it and check back periodically to make sure it is doing OK. Breaking the load up into several smaller files makes the load process easier to follow, and it typically goes faster. In particular, if the load fails for some reason, the database will roll back to the last known good state and you only have to continue the load from where things went bad.
There are also command line options for trying to increase speed, but I haven't spent much time benchmarking them. In particular, there is an option to drop indexes and recreate them after the load, as well as an option to not load the database in a single transaction.
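
The "simple bash script" approach described above could look something like the following. This is only a sketch under stated assumptions, not the script the author uses: the function name load_chunks, the loaded.log bookkeeping file, and the LOADER variable are all hypothetical, and the --gfffile flag should be checked against the documentation for your installed version of gmod_bulk_load_gff3.pl. It loads each split file in turn, stops at the first failure, and skips files that already loaded successfully when it is rerun.

```shell
#!/usr/bin/env bash
# Sketch: load split GFF3 files one at a time, stopping at the first failure.
# LOADER and the --gfffile flag are assumptions; adjust them to your setup.
LOADER="${LOADER:-gmod_bulk_load_gff3.pl}"

load_chunks() {
    local dir="$1"
    local done_log="$dir/loaded.log"   # hypothetical record of finished files
    local f
    touch "$done_log"
    for f in "$dir"/*.gff3; do
        [ -e "$f" ] || continue        # directory had no matching files
        if grep -Fxq "$f" "$done_log"; then
            echo "skipping $f (already loaded)"
            continue
        fi
        echo "loading $f"
        # LOADER is left unquoted so it may carry extra options.
        if $LOADER --gfffile "$f"; then
            echo "$f" >> "$done_log"
        else
            echo "load failed on $f; fix the problem and rerun" >&2
            return 1
        fi
    done
}
```

After sourcing the script, calling load_chunks with the directory that holds the split files starts the load. Because finished files are recorded in the log, rerunning the same command after fixing a bad file continues from where things went wrong rather than starting over.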