README for LuceGene

LuceGene ('Lucy Jean') is a document/object search and retrieval system
for Genome and Bioinformatic Databases

Version: 1.4 (released)
Date   : 20 January 2005
Authors: D. Gilbert, gilbertd@indiana.edu, http://marmot.bio.indiana.edu/
         Paul Poole, pppoole@bio.indiana.edu, and others

This is an open-source document/object search and retrieval system
specially tuned for bioinformatics text databases and documents.
It is part of the GMOD (Generic Model Organism Database) project,
http://www.gmod.org/lucegene/, and also http://eugenes.org/gmod/lucegene/

LuceGene is similar in concept to the widely used, commercially
successful bioinformatics program SRS (Sequence Retrieval System).
LuceGene and Lucene will always be available in source form for
public database uses, due to their open-source license. Though
written in the Java language, LuceGene can be used from command-line
shells, and performs well that way (current uses include Perl CGIs
calling lucegene).

LuceGene is built on top of the open-source Lucene package,
http://jakarta.apache.org/lucene/
Lucene is used unchanged, with added methods for biology data.

See the accompanying INSTALL.txt for instructions for use of the
lucegene web application with sample data. The basic package is
available at http://eugenes.org/gmod/lucegene/dist/lucegene.war
with a Demo server at http://eugenes.org/demolucegene/

ABOUT LuceGene
---------------

Information Retrieval for Genomes
  * IR text search/retrieval tools tuned for data access, not management
  * Good for a wide range of semi-structured and complex structured data
  * Better functional match for textual data common in biology
    than numeric, table-oriented RDBMS
  * Easier to add new data (e.g. SRS parses 100s of existing bio-databanks)

Lucene and LuceGene
  * Lucene open-source project at jakarta.apache.org/lucene
  * Common text search features: booleans, phrases, word stemming,
    fuzzy and field range searches, relevance ranking
  * Comparable to Glimpse, Excite, WAIS, Isearch, ht/dig, Alta-vista,
    Google backends
  * Author Doug Cutting has written text search engines for Apple and Excite
  * LuceGene additions
      Data input adaptors for HTML; XML (e.g. MedLine); FlyBase flatfile;
        Biosequences (GenBank, EMBL, etc.)
      Basic output formats for XML, HTML via XSLT, Text, Spreadsheet
      Numeric Range search (** ADDED April 2004)
  * Tested with
      100,000s of FlyBase Genes, References, Game and Chado XML annotations
      euGenes gene summaries & Daphnia Medline, Sequences, HTML documents
  * LuceGene/Lucene needs
      Links/joins among databases
      Output adaptors ; more flexible configuration for webapp use
  * WebServices support as Genome Directory System
      Access to FlyBase data via GDS (WebServices) using a LuceGene
      backend has been added, with a simple server/client SOAP using
      the org.eugenes.services.Directory interface.

Distribution files
------------------

Currently these alpha distribution files are available -

 -- lucegene-1.4-src.jar : sources, documents, configuration for base
      lucegene software with indexing methods for biology data
 -- lucegene.war : binary distribution, for webapp (Tomcat) uses
 -- lucegene_fb.war : webapp customized for FlyBase use
 -- sample data for lucegene.war
      (extracts of BIND, BLAST, FlyBase, GEO, Medline, PDF, UniProt)
      lucegene_demo-data.zip      (4 MB)
      lucegene_demo-indices.zip   (5 MB)
      lucegene_demo-pdfpapers.zip (10 MB)

LuceGene is also available as part of the ARGOS genome database
replication system at
  rsync://eugenes.org/argos/common/java/lucegene/
  rsync://flybase.net/argos/common/java/lucegene/

Development
-----------

The ant java build program is used for building sources, and
should be configured to rebuild all the distribution files
(you will need to edit build.properties).

The '*src.jar' distribution files include the library files needed
to build all sources. The "*.war" files are packaged with all needed
library files.

Release v 1.4, January 2005

Data Indexing
-------------

For a worked example using command-line tools to index and search,
see docs/lucegene-index-example.txt.

The first step in using LuceGene or Lucene is to index a set of
files. Currently this shell script handles indexing:

  bin/lucegene-index.sh
  Usage: lucegene-index.sh -p dbs/lucegene/go.properties
  Options: (most options are in .properties)
    -data  DATA_ROOT   lucene data directory
    -index INDEX_ROOT  lucene indices directory
    -lib   LIB_NAME    lucene library
    -prop  PROP_FILE   index properties
    -test              list files to index
    -debug             debug output

Data configuration files specify data formats, fields, and structures,
and are located in
  conf/{library-name}.properties
These are key=value files (Java properties), with information on
where to locate data, its format (and the Java class to index it),
and field specifications on how to index each data field.

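As a rough illustration of the key=value layout described above, the
following Python sketch parses a small properties text into a dictionary.
It is only a sketch: the key names shown are hypothetical, invented for
this example (the real keys are those in the distributed conf files), and
it handles only a subset of Java properties syntax (no line continuations,
no escapes, no ':' separators).

```python
def parse_properties(text):
    """Parse a simple subset of Java .properties syntax.

    Handles key=value lines, blank lines, and comment lines that start
    with '#' or '!'. Lines without '=' are skipped.
    """
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#!":
            continue  # blank line or comment
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key=value line
        props[key.strip()] = value.strip()
    return props


# Hypothetical configuration text; these key names are illustrative only
# and are NOT taken from an actual LuceGene properties file.
SAMPLE = """\
# where to find the data, and how to read it
data.root = /data/go
data.format = xml
field.docid = keyword
"""

if __name__ == "__main__":
    for key, value in parse_properties(SAMPLE).items():
        print(key, "=", value)
```

A script that wraps the indexer could use a reader like this to check a
library's settings before running an index job.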
Besides configuration files, the indexer will take Java 'plugin'
classes that allow tuning of indexing for specific data formats and
fields. The src/LucegeneIndexers.java is an example of this. The shell
index.sh script recompiles these as needed.

This shell script will run command-line and interactive searches
of indexed data:

  bin/lucegene-search.sh
  Usage: lucegene-search.sh -l go -c 'lookup docid:1'
  Options: (other options are in .properties)
    -command 'search command'
    -index INDEX_ROOT  lucene indices directory
    -lib   LIB_NAME    lucene library
    -prop  PROP_FILE   search properties
    -debug             debug output

The Perl CGI chado2apollo.cgi is one example of how to search lucene
indices from programs.

Example command-line operation
------------------------------

  bin/lucegene-search.sh -l gamexml -p dbs/lucegene/gamexml.properties
    -c 'format native; find arm:X AND (start:[2000000 3000000])'
    -- commands from line

  bin/lucegene-search.sh -l fbgn -debug -p dbs/lucegene/fbgn.properties
    -- interactive operation

  Find ARM:X AND (BLOC.start:[100000 200000] OR BLOC.stop:[100000 200000])
  6 matches to 15589 documents
  16 ms search time

  docid GSYM SYM ID ARM BLOC.start BLOC.stop
  FBan0013374 pcl CG13374 FBgn0011822;FBan0013374 X 193624 193624
  FBan0003796 ac CG3796 FBgn0000022;FBan0003796 X 127469 127469
  FBan0003827 sc CG3827 FBgn0004170;FBan0003827 X 153498 153498
  FBan0003839 l(1)sc CG3839 FBgn0002561;FBan0003839 X 167161 167161
  FBan0003757 y CG3757 FBgn0004034;FBan0003757 X 113947 113947
  FBan0069523 3S18{}4;3S18{}4 FBti0019523;FBan0069523 X 185912 185912

Public service addition using LuceGene (June 2004)
-----------------------------------------

A use for genome directory system (WebServices) with FlyBase
  http://flybase.net/ws/services/Directory?wsdl

See also http://preview.flybase.net/lucegene/webservices/
and LuceGene distribution bin/gdsflybase.pl, gdsflybase.java
for sample clients.

Given LuceGene with indexed data, WebServices access is set up
using a Tomcat server and Axis, with the interface
org.eugenes.services.Directory

Public services using LuceGene (Jan 2005)
-----------------------------------------

  FlyBase Search system preview
    http://preview.flybase.net/lucegene/

  euGenes multi-organism gene search/retrieval
    http://eugenes.org/lucegene/

  Daphnia/wFleaBase search for sequences, Medline abstracts, Web documents
    http://wfleabase.org/search/

  FlyBase Annotated sequence bulk-retrieval service using LuceGene
    http://flybase.net/cgi-bin/gnoseqbatch

  FlyBase Apollo annotation data web service using LuceGene
    http://flybase.net/apollo/
    http://flybase.net/apollo-cgi/chado2apollo.cgi

Apollo Service notes:
  Game XML object retrieval using Lucene is 10x to 20x faster
  than generating the objects from the Postgres Chado db (Pg slows
  down more the larger the object set/region). You will get a gene
  query result in 10 to 15 seconds (in my tests from IU to my home
  computer via cable). A full cytoband of 20 MB of XML took
  66 seconds using Lucene (most of that in data transfer time),
  but took 20 minutes calling Postgres (and it died with an error
  after that time). That is about as speedy as one can expect,
  though some tweaking (doing away with Postgres queries entirely)
  could speed it up a bit.

Why use LuceGene ?
------------------

Lucene and LuceGene support fast search and retrieval from
document/object databanks, databases, or as many would call them,
flat-files. These are commonly used in bioinformatics to represent
and exchange the complex, hierarchical data that rapidly builds up
from biosciences.

Lucene is fast and capable with high-volume data sets (millions of
documents, multi-gigabytes of data). Lucene handles full,
semi-structured text searching well, and much biology data is in
these formats.

Lucene and LuceGene use common fielded document parsing, and can
index for retrieval a very large variety of document formats, with
minimal work.
Current supported formats include
  XML (including biology examples Medline abstracts and
    Game sequence annotation), HTML, Text, PDF
  Tabular data (about any single-line row/column table format)
  Bio-sequences (including Fasta, GenBank, EMBL and others)
  Gene object data used in FlyBase, euGenes

A major benefit of a document/object approach to search/retrieval
is that the complex data objects ('genes', 'proteins', etc.) known
by biologists are often represented in RDBMS by complex, normalized,
multiple tables which, while very good for data management, are
slow at extracting all the parts needed in a full data object.

Document/object databases are often text stores that are indexed
by means such as lucene, SRS, and other text-retrieval methods.
These doc/obj dbs are very non-normalized, but are a close match in
structure to the knowledge structures representing 'genes',
'proteins', etc.

For a simple tabular data set, RDBMS and Lucene/text retrieval
systems will perform similarly (biodata tests suggest lucene is a
bit faster). As the complexity of data objects increases, the RDBMS
structure needed to represent them in normal form requires more
complex searches, while the search methods of Lucene tend to cost
about the same with increasing data complexity.

For example, the Chado Postgres database used by FlyBase currently
takes minutes to generate a single gene object, from a fairly small
subset of FlyBase's total gene object data (sequence annotations
only). Search and retrieval of the same data objects using Lucene
indexed XML data pregenerated from Chado Postgres takes just a few
seconds.

(See also docs/BiodevelopersLucene-BIND.html for a note on uses
of lucene at the BIND project.)

-----------------
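
The contrast drawn above between normalized tables and document/object
storage can be sketched with a toy example in Python. This is an
illustration only: the three "tables" and their column names are
hypothetical and are not the Chado schema, and plain dictionaries stand in
for both the RDBMS and the Lucene index. The values for gene 'y' are taken
from the sample search output earlier in this README.

```python
# Normalized form: the parts of one gene are spread over several tables.
# Table and column names here are hypothetical, for illustration only.
genes = {1: {"symbol": "y", "arm": "X"}}
locations = [{"gene_id": 1, "start": 113947, "stop": 113947}]
synonyms = [
    {"gene_id": 1, "name": "CG3757"},
    {"gene_id": 1, "name": "FBgn0004034"},
]


def assemble_gene(gene_id):
    """Rebuild a full gene object by scanning every related table.

    Each additional kind of related data adds another table to visit,
    so the work grows with the complexity of the object.
    """
    gene = dict(genes[gene_id])
    gene["locations"] = [
        (row["start"], row["stop"])
        for row in locations
        if row["gene_id"] == gene_id
    ]
    gene["synonyms"] = [
        row["name"] for row in synonyms if row["gene_id"] == gene_id
    ]
    return gene


# Document/object form: the whole object is stored under one key,
# pregenerated ahead of time, so retrieval is a single lookup.
documents = {
    "FBan0003757": {
        "symbol": "y",
        "arm": "X",
        "locations": [(113947, 113947)],
        "synonyms": ["CG3757", "FBgn0004034"],
    }
}


def fetch_document(docid):
    """Return the stored object as-is; no assembly step is needed."""
    return documents[docid]


if __name__ == "__main__":
    print(assemble_gene(1))
    print(fetch_document("FBan0003757"))
```

Both calls return the same object; the difference is that the first has to
visit every related table, while the second costs one lookup however many
parts the object has.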