Apollo-Chado Integration at BovineBase: Bugs and Suggestions
This was written by Justin Reese in preparation for Hackathon 2007.
In preparation for the Bovine Annotation effort, we set up a Chado database containing annotation evidence and allowed annotators to connect via Apollo and do their annotations (we haven't gotten Apollo->Chado writebacks working yet, but we'd like to eventually).
We thought it might help GMOD developers improve Apollo/Chado interoperability to get some feedback from the Apollo users (annotators) and developers (the ones who set up our Chado db). So, below are some bug reports and suggestions compiled from annotators and developers involved in the Bovine Genome annotation effort. I will be fleshing this out in the next 12-24 hours, hopefully before the hackathon starts hacking, but feel free to contact me if something isn't clear.
- Apollo crashes when opening some genes from the opening Chado dialog box. These genes tend to be near scaffold edges, but not all genes near edges cause this error. Example: connect to genomes.tamu.edu:5432 (id: nobody, no pwd) and open GLEAN_00599 (by type of region: gene). Apollo outputs this on stderr: "java.lang.StringIndexOutOfBoundsException: String index out of range: -2858" and throws up either an empty dialog box or a dialog box saying "Can't load region". After that, Apollo dies on some machines, while on others (e.g. my MacBook) it keeps running but in a very strange, unusable error state.
- Pulling down a piece of evidence into the annotation tier to start an annotation worked fine for most types of evidence, but not for some (ESTs, I think, were one class that did not work).
- Pulling down a single exon and trying to add it to an existing annotation never worked. This is always a bit touchy, because the sweet spot (where you must position the exon over the existing annotation before dropping it) is a bit small. But I could not get it to work for the life of me, and I tried multiple times with various genes, working with various people.
- When Chado analysis data is opened (via JDBC) and BLAST results are layered on top, the BLAST results are occasionally, but not always, on the wrong strand, similar to this bug: http://sourceforge.net/tracker/index.php?func=detail&aid=1713046&group_id=27707&atid=462763
- Long timeout (a few minutes) during startup when the network is unavailable. When Apollo starts, JDBC seems to automatically send a query out to the last database used (I think to retrieve the chromosome names for the pull-down menu). This can result in a very long startup if the network is down or if the database in question is unavailable for some reason. Is there a way we could delay this db query until the user asks for it?
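The idea in the last item (put off the database query until the user actually asks for it, and give up after a bounded wait) could be sketched roughly as follows. This is only an illustration, not Apollo's code: the class name LazyChromosomeList and the loader callback are hypothetical, and in Apollo the loader would presumably wrap whatever JDBC query fills the pull-down menu.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper: loads chromosome names on first use, with a timeout. */
public class LazyChromosomeList {
    private final Callable<List<String>> loader; // assumed to wrap the real database query
    private final long timeoutSeconds;
    private List<String> cached;                 // null until the first successful load

    public LazyChromosomeList(Callable<List<String>> loader, long timeoutSeconds) {
        this.loader = loader;
        this.timeoutSeconds = timeoutSeconds;
    }

    /** Runs the loader on the first call only; later calls return the cached list. */
    public synchronized List<String> get() {
        if (cached != null) {
            return cached;
        }
        // Daemon thread, so a query stuck on a dead network cannot keep the JVM alive.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "chromosome-loader");
            t.setDaemon(true);
            return t;
        });
        try {
            cached = executor.submit(loader).get(timeoutSeconds, TimeUnit.SECONDS);
            return cached;
        } catch (TimeoutException e) {
            throw new IllegalStateException(
                "Database did not answer within " + timeoutSeconds + " seconds", e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while loading chromosome names", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Loading chromosome names failed", e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Stand-in loader; a real one would query the Chado database.
        LazyChromosomeList names =
            new LazyChromosomeList(() -> Arrays.asList("Chr1", "Chr2"), 5);
        // Nothing has been queried so far; the loader runs on this first call.
        System.out.println(names.get());
    }
}
```

With something like this, startup would not touch the network at all, and a dead connection would cost the user a few seconds rather than a few minutes. The standard JDBC call DriverManager.setLoginTimeout may also shorten the wait, although whether a given driver honors it would need checking.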
A few ideas for future improvements
- Move as much Apollo configuration stuff as possible out of conf files like chado-adapter.xml, and instead query the user or the database, e.g.:
- Allow users to enter URLs, IDs, and passwords for Chado databases like they would in a web browser, rather than having them specified in chado-adapter.xml
- Have Apollo retrieve "track" information from Chado's 'analysis' table, rather than having tracks specified in chado-adapter.xml (searchHitPrograms, genePredictionPrograms, etc.).
- Our annotators aren't particularly good at installing conf files* and are spread out all over the world, so we can't really do it for them. Having things like track names and URLs hard-coded in conf files forces us to distribute new conf files to our annotators when we change something and hope they install them correctly. This hasn't always gone smoothly. Ideally, whenever possible, we would just change our Chado database (add a track or change our URL, for example), and Apollo would automatically get hip by querying the Chado database or the user.
- *no offense, if any of you annotators are reading this
- Simplify track naming schemes in Apollo conf files. The names of the tracks are a little complex and hard to understand for the uninitiated developer, and it's not always clear which one to use. For example, during my first foray, I naively tried loading RepeatMasker results under searchHitPrograms, not realizing that searchHitPrograms are always alignments between the reference sequence and a second sequence. Not sure if I can suggest an intelligent improvement, but would it be possible to construct tracks like you do in GBrowse (using aggregators and the names of the things I would like to aggregate, like gene/transcript/CDS), or have Apollo construct them automatically using some SQL magic (query for a parent, query for its children, query for the children's children, etc.)? Just a thought, this is probably asking a lot.
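To make the "SQL magic" idea a bit more concrete, here is a rough sketch of the parent/children/children's-children walk, done in memory over rows shaped like Chado's feature_relationship table (where subject_id is the child and object_id is the parent). The class and method names are made up for illustration and the feature ids are invented. A real version would first fetch the rows from the database and would probably also filter by relationship type.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: collect everything below a parent feature. */
public class FeatureTreeSketch {

    /**
     * Returns the ids of all descendants of rootId, breadth-first
     * (children first, then grandchildren, and so on).
     * Each row is {subject_id, object_id}, i.e. {child, parent}.
     */
    public static List<Long> descendants(long rootId, long[][] relationships) {
        // Index the rows by parent so each lookup of children is cheap.
        Map<Long, List<Long>> childrenByParent = new HashMap<>();
        for (long[] row : relationships) {
            long child = row[0];
            long parent = row[1];
            childrenByParent.computeIfAbsent(parent, k -> new ArrayList<>()).add(child);
        }

        List<Long> result = new ArrayList<>();
        Set<Long> seen = new HashSet<>(); // guards against cycles in bad data
        seen.add(rootId);
        Deque<Long> queue = new ArrayDeque<>();
        queue.add(rootId);
        while (!queue.isEmpty()) {
            long current = queue.remove();
            List<Long> children =
                childrenByParent.getOrDefault(current, Collections.<Long>emptyList());
            for (long child : children) {
                if (seen.add(child)) {
                    result.add(child);
                    queue.add(child);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Invented ids: gene 1 -> transcript 2 -> exons 3 and 4; 6 belongs to another gene, 5.
        long[][] rows = { {2, 1}, {3, 2}, {4, 2}, {6, 5} };
        System.out.println(descendants(1, rows)); // prints [2, 3, 4]
    }
}
```

The point of the sketch is only that a track could in principle be assembled from the top-level feature type alone, with the rest discovered from the relationships already stored in Chado, instead of being spelled out in a conf file.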