XORT Presentation

From GMOD

This Wiki section is an edited version of Josh Goodman and Pinglei Zhou's presentation.

Introduction
  • An XML-database mapping system for data exchange between a database and XML-driven applications
  • XORT can handle typical XML; it is not Chado-specific
  • Developed and supported by Pinglei Zhou at FlyBase Harvard; currently at version 0.007
  • Used at all FlyBase sites
    • Harvard has an extensive library of Perl modules for generating ChadoXML
  • Written in Perl
  • Required Perl modules:
Chado XML
  • Is Chado XML necessary? No, but it may help you.
  • ChadoXML assists with incremental updates, if you want to avoid flush-and-reload.
  • While updates can be achieved with other middleware (for example, Perl Class::DBI or Java Hibernate), ChadoXML provides an additional feature: a way to archive your transactions.
  • It provides bulk update and download, which other methods either lack or handle inefficiently.
Components
  • Database & Schema
  • ChadoXML Specification
  • DumpSpec
    • DumpSpec files are simple XML files that tell XORT what to do
    • DumpSpec files are language independent, being XML
    • It's fairly easy for those who know the schema to read these files and understand what the operation is
Highlights of Chado XML Specification
  • Unique representation of a specific database schema
  • Does away with internal primary key values
  • Static vs. Operational
  • Encoding for non-ASCII characters
  • Macro mechanism (object reference)
Putting it together: New FlyBase dataflow Part 1

There are three FlyBase sites, and most curation is done at Harvard and Cambridge. Proforma is the curation format at Cambridge and Harvard, but Harvard also curates with Apollo and ChadoXML.

Once the data is in Chado, a denormalization step moves it to a read-only reporting instance. XORT then dumps that read-only database to ChadoXML, and XSLT version 2 transforms the ChadoXML into HTML and GFF: HTML for human-readable reports, and GFF for GBrowse and various power users.

1.a. Proforma (FlyBase Cambridge) is converted to ChadoXML

1.b. ChadoXML is created by Apollo (Harvard)

1.c. ChadoXML is created by Java SEAN (Harvard)

2. All ChadoXML is loaded into Chado by XORT

Putting it together: New FlyBase dataflow Part 2

3. Chado (Harvard) is denormalized and loaded into Chado (Indiana)

4. ChadoXML is created from Chado using XORT

5.a. GFF and FASTA are created from ChadoXML

5.b. HTML is created from Chado XML

Data & Report Generation
  • Content of all output files is controlled by XML dumpspecs.
    • Dumpspecs are language independent.
    • Easily readable (with knowledge of Chado structure).
  • All XML transformation steps are done with XSLT v2.
    • Saxon XSLT (http://saxon.sourceforge.net/)
    • ChadoXML is split into individual chunks before XSLT processing to accommodate large file sizes.
    • Extremely fast. We can process all data for ~60,000 Drosophila genes in under 30 minutes.
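
The chunking step above can be sketched in a few lines. This is a minimal illustration, not the FlyBase implementation: it assumes the dump has a single root element (such as `<chado>`) whose direct children are independent records, and the function name `split_chadoxml` and the `chunk_size` parameter are hypothetical.

```python
import xml.etree.ElementTree as ET


def split_chadoxml(source, chunk_size):
    """Yield small XML documents, each wrapping at most chunk_size
    top-level records in a copy of the original root tag.

    Assumption: every direct child of the root is a self-contained record.
    """
    records = []
    root = None
    depth = 0
    # Stream the input so the whole file never has to sit in memory.
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
            if depth == 1:
                root = elem  # remember the root so it can be emptied later
        else:
            depth -= 1
            if depth == 1:  # a direct child of the root has just closed
                elem.tail = None  # drop inter-record whitespace
                records.append(ET.tostring(elem, encoding="unicode"))
                root.clear()  # release records that are already serialized
                if len(records) == chunk_size:
                    yield _wrap(root.tag, records)
                    records = []
    if records:
        yield _wrap(root.tag, records)


def _wrap(root_tag, records):
    """Re-wrap serialized records in the original root element."""
    return "<{0}>\n{1}\n</{0}>".format(root_tag, "\n".join(records))
```

Each yielded string is a well-formed document that could be written to its own file and handed to an XSLT processor separately. Attributes on the root element are not carried over in this sketch.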
Hibernate & XORT
  • Hibernate didn't scale well when dealing with 5,000+ features in bulk.
    • The test was simply calling print() statements
  • Performance tweaks for Hibernate can be quite complicated to set up for bulk operations.
  • XORT is currently handling ~6 million features in production with only minor performance problems.
  • XORT is much more language independent.
Support for complex transactions using XORT

For example:

  • Find all records linked to a record using a dumpspec
  • Merge gene x into y, each with thousands of records attached

Step 1. Dump all data using a simple dumpspec

 <chado>
  <feature dump="all">
   <uniquename test="eq">x</uniquename>
  </feature>
 </chado>

Step 2. Delete feature x from the database, using triggers to clean up orphan records if necessary

Step 3. Edit the output XML, changing uniquename x to y, then load the edited file back into the database
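
The edit in Step 3 can be scripted rather than done by hand. The following is a rough sketch only: it assumes the dumped file stores the identifier in `<uniquename>` elements, and the helper name `rename_uniquename` is hypothetical. A real dump may refer to x in other places too, so which elements need rewriting depends on the schema and the dumpspec used. Loading the result back with XORT is not shown.

```python
import xml.etree.ElementTree as ET


def rename_uniquename(xml_text, old, new):
    """Return (edited_xml, count) with every <uniquename> whose text
    equals `old` rewritten to `new`.

    Assumption: only <uniquename> elements carry the identifier;
    other elements (for example <name>) are left untouched.
    """
    root = ET.fromstring(xml_text)
    changed = 0
    for node in root.iter("uniquename"):
        if node.text is not None and node.text.strip() == old:
            node.text = new
            changed += 1
    return ET.tostring(root, encoding="unicode"), changed
```

Returning the number of rewritten elements gives a quick sanity check before the edited file is loaded: a count of zero suggests the dump did not contain the expected record.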

CHIA (Chado Interface Application)

A Java application that organizes SQL and XORT functionality for internal users, e.g.:

  • Dump chado-XML for gene regions for Apollo curation
  • Organize and execute “canned” SQL queries
  • Serve IDs for curators (in development)
  • Browse Chado dynamically without writing SQL statements

CHIA is being designed to be extensible for adding new functionality as needed.


Documentation
  • "Using Chado to Store Genome Annotation Data"
    • Current Protocols in Bioinformatics (Baxevanis, A.D., and Davison, D.B., eds) 2, 9.6.1-9.6.28.
  • XORT specification docs
  • XORT draft (unpublished)
  • GMOD case demo procedure
Acknowledgements
  • William Gelbart
  • Chris Mungall
  • David Emmert
  • Mark Gibson
  • Stan Letovsky
  • Nomi Harris
  • Frank Smutniak
  • Suzanna Lewis
  • Peili Zhang
  • Haiyan Zhang
  • Aubrey de Grey
  • Andy Schroeder
  • Don Gilbert
  • Susan Russo
  • Mark Zytkovicz
  • Scott Cain
  • Lincoln Stein
  • Victor Strelets
  • Robert Wilson
  • Paul Leyland