About the LuceGene data parsers
================================

Over the years of tuning SRS parsers to FlyBase needs, we found it common to
repeatedly tweak/update/revise the parsers to fit new data details or correct
other errors.  One of the reasons I set up the dbs/lucegene/ folder with a
collection of property and Java source files for parsing was to mimic the ease
of editing SRS icarus parsers -- have all data parsing logic in a few related
source files, and be able to quickly test and change the logic as needed.

The LuceGene LuceneXxxIndexer document readers are the main indexing classes;
they know about the overall data file structure and how to separate it into
records (documents) and fields, based on dataset properties.  They also do
some "pre-Lucene" field manipulations (see esp. the
LuceneBaseIndexer.FieldRecoder interface).  Then BiodataAnalyzer is called by
the Lucene document indexer to parse out the field-specific parts.

Current LuceneXxxIndexers include (package org.eugenes.index):

  LuceneBaseIndexer    -- handles a key=value type of data structure, with many
                          records per file separated by some end-of-record
                          string.  The others below subclass this base class.
  LuceneAcodeIndexer   -- FlyBase's hierarchical key=value structure (acode).
  LuceneHtmlIndexer    -- uses org.apache.lucene.demo.html.HTMLParser for html,
                          text files.
  LucenePdfIndexer     -- uses org.pdfbox.pdfparser.PDFParser to parse PDF texts.
  LuceneFastaIndexer   -- for Fasta sequence; parses various header styles using
                          regex properties.
  LuceneReadseqIndexer -- uses Readseq to parse other sequence formats.
  LuceneTableIndexer   -- parses tabular data files, with regex to specify column
                          values (a document == 1 table row).
  LuceneXmlIndexer     -- parses XML data.

The data-library-specific needs are configured with a '.properties' file
identifying data locations (file regex patterns), Lucene parameters, and any
number of field parameters.  The field parameters are split among three main
functions:

  tokenizer.FIELD=    { one or more Lucene Tokenizer classes }
  tokenfilter.FIELD=  { one or more Lucene TokenFilter classes }
  fieldrecoder.FIELD= { one or more Java classes for recoding data before Lucene indexes }

LuceneXxxIndexer does in essence this (a rough Java sketch of this loop is
given below):

  load databank properties
  locate files to index (see properties)
  get Analyzer (BiodataAnalyzer or other, with field-handling)
  get Lucene IndexWriter (with Analyzer)
  iterate thru input files:
    get file Parser (versions for Acode, XML, Tabular, Bio-sequence)
    parse fields and documents to IndexWriter

The Parser does field handling beyond the Analyzer functions:

  split input into data fields
  reprocess fields based on properties:
    rename, set standard fields (docid, file-location url, etc.)
    get FieldRecoder -- these can do a lot of data-specific input manipulation,
      including adding new fields and skipping or rewriting current ones, before
      passing data to the IndexWriter and its Analyzer processing (things that
      can't be done, or at least not easily, with Lucene's Analyzer/TokenFilter
      paradigm).  See the conf/flybase/LucegeneIndexers.java source for a set
      of these.
  add Lucene Fields to Lucene Document
  IndexWriter.addDocument()

The Lucene analyzer set doesn't do much that is useful for biology/science text
data.  Its examples are mostly for general documents with human-readable text
(as opposed to scientist-readable :), and its StandardAnalyzer does things to
text that we don't want for bio-data (word splits, lowercase, ...).
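For orientation, here is a minimal sketch of that indexing loop against the
Lucene 1.4-era API.  It is not LuceGene source: a stock WhitespaceAnalyzer
stands in for BiodataAnalyzer, and the locateFiles()/parseRecords() stubs are
placeholders for the property-driven file matching and the format-specific
LuceneXxxIndexer parsing.

  import java.io.File;
  import java.io.FileInputStream;
  import java.util.Properties;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.WhitespaceAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.index.IndexWriter;

  public class IndexLoopSketch {
    public static void main(String[] args) throws Exception {
      // load databank properties
      Properties props = new Properties();
      props.load(new FileInputStream("dbs/lucegene/fbrftest.properties"));

      // get Analyzer -- the real indexer builds a BiodataAnalyzer from the
      // tokenizer.* / tokenfilter.* entries; a stock analyzer stands in here
      Analyzer analyzer = new WhitespaceAnalyzer();

      // get Lucene IndexWriter (with Analyzer)
      IndexWriter writer =
        new IndexWriter(props.getProperty("INDEX_PATH"), analyzer, true);

      // iterate thru input files, parse records, hand Documents to the writer
      for (File f : locateFiles(props)) {             // regex_file / regex_folder matching
        for (Document doc : parseRecords(f, props)) { // record/field splitting per format
          writer.addDocument(doc);                    // Analyzer tokenizes per field here
        }
      }
      writer.optimize();
      writer.close();
    }

    // placeholder stubs for what the real LuceneXxxIndexer classes do
    static File[] locateFiles(Properties props) { return new File[0]; }
    static Document[] parseRecords(File f, Properties props) { return new Document[0]; }
  }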
The Analyzer interface classes TokenFilter, Tokenizer, and TokenStream are the
parts that work with the rest of Lucene (document indexing and also searching),
along with the overall Analyzer base class.  Analyzer doesn't do anything but
call the Tokenizer and TokenFilter classes to convert input to index fields:

  TokenStream tokenStream( String fieldName, Reader reader)

There is one Tokenizer to split input from the input Reader; then any number of
chained TokenFilters acting on that stream are possible (though maybe not as
efficient as one Tokenizer doing all the work).  Lucene v1.4's
PerFieldAnalyzerWrapper is an example of how to chain analyzers, but it isn't
otherwise integrated into Lucene, and since Lucene doesn't ship a set of
Analyzers useful here, what we have in BiodataAnalyzer does the work needed
(but see the note below about chaining token filters).

BiodataAnalyzer takes its field-specific configuration from the properties
file, via the 'tokenizer' and 'tokenfilter' entries.  It then processes the
input stream from the LuceneBaseIndexer class with field names, attaching the
appropriate tokenizer/filters as needed.

Here is a way to add a chain of tokenfilters in the properties file:

  tokenizer.SYM=fbacode$DataTokenizer
  tokenfilter.SYM=\
    fbacode$DataFilter,\
    fbacode$LowercaseFilter,\
    fbacode$SymbolFilter,\

Likewise fieldrecoders can be chained:

  fieldrecoder.SYM=fbacode$Sym_Recoder, fbacode$DBX_Recoder

A Java sketch of what such a filter chain amounts to follows below.

Aside from whatever base classes are needed for the primary indexing work, keep
most data-library / field-specific logic in dbs/lucegene/{datalib}.properties
and .java files.  The current lucegene-indexer.sh script will recompile the
.java files in that folder and add them to the class path as needed.
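Here, as a rough illustration only (not the BiodataAnalyzer source), is what
such a chain amounts to in plain Lucene 1.4 terms, with the stock
WhitespaceTokenizer and LowerCaseFilter standing in for the fbacode classes
named above:

  import java.io.Reader;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.LowerCaseFilter;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.WhitespaceTokenizer;

  public class ChainSketch extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
      TokenStream ts = new WhitespaceTokenizer(reader); // one Tokenizer splits the input
      if ("SYM".equals(fieldName)) {
        ts = new LowerCaseFilter(ts);  // each TokenFilter wraps the previous stream
        // further filters (a symbol filter, etc.) would wrap ts the same way
      }
      return ts;
    }
  }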
======================================
Library Indexing: Properties
======================================

# dbs/lucegene/fbrftest.properties
# uses dbs/lucegene/fbrftest.java classes

LIB_NAME=fbrftest
title = FlyBase References

DATA_ROOT=web/data2/fbobs/
INDEX_PATH=indices/lucene/fbrf/

# locate data with regex file, folder patterns
regex_folder=

# can we index multiple files w/ same docid ?
regex_file=^(FBrf.acode|FBrf.abstr.acode)$
#regex_file=^(FBrf.acode)$
regex_skipfile=
regex_skipfolder=(FBim|tmp|.*\.old)

INDEX_CLASS=org.eugenes.index.LuceneAcodeIndexer
INDEX_APPEND=false
INDEX_TAGS=false
INDEX_LEVEL=0
INDEX_BLANKS=false
MIME_TYPE=text/acode

## merge=10 is default; 4 == less mem usage; 2 minimum
merge_factor=4
max_field_length=1000000
MAX_FIELDS=10000

# to create "contents" field of all text
indexall=false

# default search field
searchfield=all

## field indexing parameters

# special summary fields -- replace w/ fieldalias.TAG=newtag
sumfields=docid,docclass,title,summary
# field.title=RETE
# field.docid=ID
# field.summary=SUMX

fieldalias.ID=docid
fieldalias.RETE=title
fieldalias.SUMX=summary

## default - Text // not UnStored = index but dont store text
fieldtype=UnStored
fieldtype.RETE=UnIndexed
fieldtype.ID=Text
fieldtype.SYM=Text
fieldtype.title=UnIndexed
fieldtype.summary=UnIndexed

analyzer=org.eugenes.index.BiodataAnalyzer

## field filters
tokenfilter.SYM=fbrftest$FbrfDataFilter
tokenfilter.YR=fbrftest$FbrfDateFilter
tokenfilter.ID=fbrftest$FbrfLowerDataFilter
tokenfilter.AU=fbrftest$FbrfLowerDataFilter
tokenfilter.TI=fbrftest$FbrfLowerDataFilter
tokenfilter.BLOC=fbrftest$FbrfNumberFilter

## field tokenizers -- replace with filters
tokenizer.SYM=fbrftest$FbrfDataTokenizer

fieldrecoder.ID=fbrftest$FBMainID_FieldRecoder
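As a side note, here is what the fieldtype.* settings above correspond to in
Lucene 1.4 terms.  This is illustrative only, not LuceGene source: the field
values are made up, and "docid"/"title" are the aliased names from
fieldalias.ID and fieldalias.RETE.

  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;

  public class FieldTypeSketch {
    public static Document example() {
      Document doc = new Document();
      doc.add(Field.Text("docid", "FBrf0000001"));        // Text: stored and indexed
      doc.add(Field.UnIndexed("title", "example title")); // UnIndexed: stored, not searchable
      doc.add(Field.UnStored("AU", "example author"));    // UnStored (the default): indexed, text not stored
      return doc;
    }
  }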
Library Indexing: Java methods
===============================

// sample fbrftest.java

/**
 * fbrftest
 * Data class and field specific parsers to handle flybase reference acode
 * data.  See also fbrf.properties which links these classes to general
 * acode parser.
 *
 * E.g., fbrftest.properties:
 *   tokenfilter.AU=fbrftest$FbrfLowerDataFilter
 *   fieldrecoder.ID=fbrftest$FBMainID_FieldRecoder
 *
 * Usage, given location in dbs/lucegene/ that index script finds:
 *   bin/lucegene-index.sh -debug -l fbrftest >& tmp/log.fbrft
 */

import java.util.regex.*;
import java.io.*;
import java.util.*;

import org.eugenes.index.LuceneBaseIndexer;
import org.eugenes.index.BiodataAnalyzer;
import org.eugenes.index.BiodataAnalyzer.DataTokenizer;
import org.eugenes.index.BiodataAnalyzer.DataFilter;
import org.eugenes.index.BiodataAnalyzer.DateFilter;
import org.eugenes.index.BiodataAnalyzer.NumberField;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharTokenizer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Document;

public class fbrftest {

  final static boolean nodups=true, withdups=false; // add >1 field value to doc?

  /** small field recoder to add new main ID field for acode with multiple
      nested ID fields - really need to have main acode indexer handle this. */
  public static class FBMainID_FieldRecoder
      implements org.eugenes.index.LuceneBaseIndexer.FieldRecoder {

    public int recodeField(LuceneBaseIndexer idx, Document doc,
                           String fieldName, String fieldPath, StringBuffer val) {
      if (fieldPath.indexOf(LuceneBaseIndexer.xpathDelim) > 1)
        return kNoChange; // skip subrec locations
      idx.addField("MAIN_ID", val.toString(), doc, nodups);
      return kFieldAdded;
    }
  }

  // some data-field tokenizers and tokenfilters

  public static class FbrfDataFilter extends DataFilter {
    public FbrfDataFilter() { super(); }
    public FbrfDataFilter(TokenStream input) { super(input); }
    public Token next() throws IOException { return super.next(); }
  }

  public static class FbrfLowerDataFilter extends DataFilter {
    public FbrfLowerDataFilter() { super(); }
    public FbrfLowerDataFilter(TokenStream input) { super(input); }
    public Token next() throws IOException {
      Token t = input.next();
      if (t == null) return null;
      return new Token(t.termText().toLowerCase(),
                       t.startOffset(), t.endOffset(), t.type());
    }
  }

  public static class FbrfNumberFilter extends DataFilter {
    public Token next() throws IOException {
      Token t = input.next();
      if (t == null) return null;
      String text = t.termText();
      String type = t.type();
      try {
        String nums = NumberField.numToString( Integer.parseInt(text) ); // can except
        t = new Token( nums, t.startOffset(), t.endOffset(), type);
      } catch (Exception e) { } //? eat it
      return t;
    }
  }

  public static class FbrfDateFilter extends DateFilter {
    public FbrfDateFilter() { super(); }
    public Token next() throws IOException { return super.next(); }
  }

  public static class FbrfDataTokenizer extends DataTokenizer {
    public FbrfDataTokenizer(Reader in) { super(in); }
    public FbrfDataTokenizer() { super(); }
    protected char normalize(char c) { return c; }
  }
}
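Since the same Analyzer classes are used at search time as at index time, a
query against this index should be parsed with the same analysis.  Here is a
hedged search-time sketch against the Lucene 1.4 API; it is not LuceGene
source, a stock WhitespaceAnalyzer stands in for BiodataAnalyzer, and the
query value is just an example.

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.WhitespaceAnalyzer;
  import org.apache.lucene.queryParser.QueryParser;
  import org.apache.lucene.search.Hits;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.Query;

  public class SearchSketch {
    public static void main(String[] args) throws Exception {
      Analyzer analyzer = new WhitespaceAnalyzer();   // stand-in for BiodataAnalyzer
      IndexSearcher searcher = new IndexSearcher("indices/lucene/fbrf/"); // INDEX_PATH above

      // query the AU field (lowercased at index time by FbrfLowerDataFilter);
      // "all" is the default search field from searchfield=all
      Query q = QueryParser.parse("AU:ashburner", "all", analyzer);
      Hits hits = searcher.search(q);
      System.out.println(hits.length() + " hits");
      searcher.close();
    }
  }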