GMOD

XORT Presentation

This Wiki section is an edited version of Josh Goodman and Pinglei Zhou’s presentation.

1 Introduction
2 Chado XML
3 Components
4 Highlights of Chado XML Specification
5 Putting it together: New FlyBase dataflow Part 1
6 Putting it together: New FlyBase dataflow Part 2
7 Data & Report Generation
8 Hibernate & XORT
9 Support for complex transactions using XORT
10 CHIA (Chado Interface Application)
11 Documentation
12 Acknowledgements

Introduction

XORT is an XML-database mapping system for data exchange between a database and XML-driven applications
XORT can handle typical XML; it is not Chado-specific
Developed and supported by Pinglei Zhou at FlyBase Harvard; currently at version 0.007
Used at all FlyBase sites
- Harvard has an extensive library of Perl modules for generating ChadoXML
Written in Perl
Required Perl modules:

Chado XML

Is Chado XML necessary? No, but it may help you.
ChadoXML assists with incremental updates, if you want to avoid flush-and-reload.
While updates can be achieved with other middleware (for example, Perl Class::DBI or Java Hibernate), ChadoXML additionally provides a way to archive your transactions.
It provides bulk update and download, which other methods either lack or handle inefficiently.

Components

Database & Schema
ChadoXML Specification
DumpSpec
- DumpSpec files are simple XML files that tell XORT what to do
- DumpSpec files are language independent, being XML
- It’s fairly easy for those who know the schema to read these files and understand what the operation is

Highlights of Chado XML Specification

Unique representation of a specific database schema
Does away with internal primary key values
Static vs. Operational
Encoding for non-ASCII characters
Macro mechanism (object reference)
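
The macro mechanism can be illustrated with a small sketch. It assumes, for illustration only, that a record carrying an id attribute defines a macro and that a foreign-key element (here, any element whose tag ends in _id) containing that id as text refers back to the record. The sample document, the macro ids so_cv and gene_term, and the function name are all hypothetical; the XORT specification docs remain the authority on the real rules.

```python
import xml.etree.ElementTree as ET

# Hypothetical ChadoXML fragment: no primary key values appear anywhere;
# records point at each other through macro ids instead.
DOC = """<chado>
  <cv id="so_cv"><name>SO</name></cv>
  <cvterm id="gene_term"><cv_id>so_cv</cv_id><name>gene</name></cvterm>
  <feature>
    <uniquename>x</uniquename>
    <type_id>gene_term</type_id>
  </feature>
</chado>"""

def find_macro_references(xml_text):
    """List (column, macro id, referenced table) for each macro reference."""
    root = ET.fromstring(xml_text)
    # Every element with an id attribute is treated as a macro definition.
    macros = {el.get("id"): el for el in root.iter() if el.get("id")}
    refs = []
    for el in root.iter():
        text = (el.text or "").strip()
        # Assumption: only childless *_id elements can hold a reference.
        if el.tag.endswith("_id") and len(el) == 0 and text in macros:
            refs.append((el.tag, text, macros[text].tag))
    return refs

for column, macro_id, table in find_macro_references(DOC):
    print(f"{column} -> {macro_id} ({table})")
```

Restricting the lookup to *_id elements keeps ordinary values such as a name from being mistaken for a reference.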

Putting it together: New FlyBase dataflow Part 1

There are three FlyBase sites, and most curation is done at Harvard and Cambridge. Proforma is the curation format at Cambridge and Harvard, but Harvard also curates with Apollo and ChadoXML.

Once the data is in Chado, a denormalization step moves it into a read-only database that serves as the reporting instance. From the read-only database, XORT dumps ChadoXML for reporting purposes. XSLT 2.0 is then used to transform the ChadoXML into HTML and GFF: HTML for human-readable reports, GFF for GBrowse and for various power users.

1.a. Proforma (FlyBase Cambridge) is converted to ChadoXML

1.b. ChadoXML is created by Apollo (Harvard)

1.c. ChadoXML is created by Java SEAN (Harvard)

2. All ChadoXML is loaded into Chado by XORT

Putting it together: New FlyBase dataflow Part 2

3. Chado (Harvard) is denormalized and loaded into Chado (Indiana)

4. ChadoXML is created from Chado using XORT

5.a. GFF and FASTA are created from ChadoXML

5.b. HTML is created from ChadoXML

Data & Report Generation

Content of all output files is controlled by XML dumpspecs.
- Dumpspecs are language independent.
- Easily readable (with knowledge of Chado structure).
All XML transformation steps are done with XSLT v2.
- Saxon XSLT (http://saxon.sourceforge.net/)
- ChadoXML is split into individual chunks before XSLT processing to accommodate large file sizes.
- Extremely fast. We can process all data for ~60,000 Drosophila genes in under 30 minutes.
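
The chunking step can be sketched as follows. This is a minimal illustration, not the FlyBase implementation: it assumes the top-level records sit directly under a chado root element, and the function name split_chado_xml and the chunk size are hypothetical. Each chunk is a small standalone document that could be handed to an XSLT processor on its own.

```python
import io
import xml.etree.ElementTree as ET

def split_chado_xml(source, chunk_size):
    """Yield standalone <chado> documents with up to chunk_size records each."""
    batch = []
    depth = 0
    # Stream the input so the whole file never has to be held as one tree.
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
            continue
        depth -= 1
        if depth == 1:  # a direct child of the root has just closed
            batch.append(ET.tostring(elem, encoding="unicode").strip())
            elem.clear()  # release the record's subtree
            if len(batch) == chunk_size:
                yield "<chado>\n" + "\n".join(batch) + "\n</chado>"
                batch = []
    if batch:
        yield "<chado>\n" + "\n".join(batch) + "\n</chado>"

# Hypothetical three-gene dump, split two records at a time.
SAMPLE = b"""<chado>
  <feature><uniquename>g1</uniquename></feature>
  <feature><uniquename>g2</uniquename></feature>
  <feature><uniquename>g3</uniquename></feature>
</chado>"""

chunks = list(split_chado_xml(io.BytesIO(SAMPLE), 2))
print(len(chunks))
```

Because every chunk is well-formed by itself, the transformation of one chunk does not depend on any other, which is what makes this kind of splitting compatible with large inputs.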

Hibernate & XORT

Hibernate didn’t scale well when dealing with 5,000+ features in bulk.
- The test was simply calling print() statements
Performance tweaks for Hibernate can be quite complicated to set up for bulk operations.
XORT is currently handling ~6 million features in production with only minor performance problems.
XORT is much more language independent.

Support for complex transactions using XORT

For example:

Find all records linked to a record using a dumpspec
Merge gene x into y, each with thousands of records attached

Step 1. Dump all data using a simple dumpspec

 <chado>
  <feature dump="all">
   <uniquename test="eq">x</uniquename>
  </feature>
 </chado>

Step 2. Delete feature x from the DB, with triggers to clean up orphan records, if necessary

Step 3. Edit the output XML, changing uniquename x to y, then load the edited file back into the DB
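
The edit in Step 3 could be scripted rather than done by hand. The sketch below is only an illustration: the shape of the dumped file (DUMPED) and the function name rename_feature are assumptions, and loading the result back into the database is still left to XORT.

```python
import xml.etree.ElementTree as ET

# Hypothetical output of the Step 1 dump for feature x.
DUMPED = """<chado>
  <feature>
    <uniquename>x</uniquename>
    <name>gene-x</name>
    <featureprop>
      <value>some note about x</value>
    </featureprop>
  </feature>
</chado>"""

def rename_feature(xml_text, old, new):
    """Set every <uniquename> whose text equals `old` to `new`.

    Returns the edited XML and the number of elements changed.
    """
    root = ET.fromstring(xml_text)
    changed = 0
    for node in root.iter("uniquename"):
        if node.text is not None and node.text.strip() == old:
            node.text = new
            changed += 1
    return ET.tostring(root, encoding="unicode"), changed

edited, count = rename_feature(DUMPED, "x", "y")
print(count)
```

Only uniquename elements are touched, so free-text values that happen to mention x are left alone.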

CHIA (Chado Interface Application)

A Java application that organizes SQL and XORT functionality for internal users, e.g.:

Dump ChadoXML for gene regions for Apollo curation
Organize and execute “canned” SQL queries
Serve IDs for curators (in development)
Dynamically browse Chado without writing SQL statements

CHIA is being designed to be extensible for adding new functionality as needed.

Documentation

“Using Chado to Store Genome Annotation Data”
- Current Protocols in Bioinformatics (Baxevanis, A.D., and Davison, D.B., eds) 2, 9.6.1-9.6.28.
XORT specification docs
XORT draft (unpublished)
GMOD case demo procedure
- All of these are in the doc directory of the XORT package, http://www.gmod.org

Acknowledgements

William Gelbart
Chris Mungall
David Emmert
Mark Gibson
Stan Letovsky
Nomi Harris
Frank Smutniak
Suzanna Lewis
Peili Zhang
Haiyan Zhang
Aubrey de Grey
Andy Schroeder
Don Gilbert
Susan Russo
Mark Zytkovicz
Scott Cain
Lincoln Stein
Victor Strelets
Robert Wilson
Paul Leyland

Categories:

FlyBase
XORT

Last updated at 18:54 on 9 October 2012.
Content is available under a GNU Free Documentation License unless otherwise noted.
