Difference between revisions of "Bioscape Indexing"

From irefindex
(Updated Lucene format link, fixed wording.)
(Qualified the B-tree remarks.)
Line 1: Line 1:
 
Bioscape places a number of requirements on text indexing solutions (a reasonable number of such solutions being available as described in the [[Text Indexing Resources|"Text Indexing Resources"]] document). This document discusses these requirements.
 
Bioscape places a number of requirements on text indexing solutions (a reasonable number of such solutions being available as described in the [[Text Indexing Resources|"Text Indexing Resources"]] document). This document discusses these requirements.
  
Two indexing solutions used by Bioscape are Lucene and Xapian whose index file formats are described in [http://lucene.apache.org/java/2_9_0/fileformats.html] and [http://trac.xapian.org/wiki/FlintBackend/Structure] respectively. Both solutions employ variations on [http://en.wikipedia.org/wiki/B-tree B-trees] or, more accurately, [http://en.wikipedia.org/wiki/B%2B_tree B+ trees], apparently promoting efficient sequential reads from a compacted or optimised index for a given search term.
+
Two indexing solutions used by Bioscape are Lucene and Xapian whose index file formats are described in [http://lucene.apache.org/java/2_9_0/fileformats.html] and [http://trac.xapian.org/wiki/FlintBackend/Structure] respectively. Xapian employs [http://en.wikipedia.org/wiki/B-tree B-trees] or variations thereof, whereas Lucene adopts techniques promoted by such data structures as [http://en.wikipedia.org/wiki/B%2B_tree B+ trees], such as an emphasis on efficient sequential reads from a compacted or optimised index when finding a given search term.
  
 
== Efficient storage of position information ==
 
== Efficient storage of position information ==

Revision as of 14:48, 11 November 2009

Bioscape places a number of requirements on text indexing solutions (a reasonable number of such solutions being available as described in the "Text Indexing Resources" document). This document discusses these requirements.

Two indexing solutions used by Bioscape are Lucene and Xapian whose index file formats are described in [1] and [2] respectively. Xapian employs B-trees or variations thereof, whereas Lucene adopts techniques promoted by such data structures as B+ trees, such as an emphasis on efficient sequential reads from a compacted or optimised index when finding a given search term.

Efficient storage of position information

Since many tokens are produced and none discarded by the default tokenisation policy, positions must not occupy excessive amounts of space when stored in an index.

Efficient access to index storage

For large datasets, the solution must be able to deal with the corresponding large files and not rely on reading them into RAM in their entirety. Moreover, access to complete result sets for particular terms must also be efficient: unlike conventional search solutions which may present pages of results and not attempt to provide a global overview of the result set, Bioscape attempts to provide accurate statistics for such information as the number of mentions of a gene where such mentions occur in sentences containing a certain keyword or phrase.

Convenient access to document identifiers

Many indexing solutions assign arbitrary identifiers to indexed documents, mandating the storage of genuine identifiers in separate field storage, but this can make access to the required information less efficient. For example, consider the retrieval of the genuine document identifiers given the following term dictionary and stored fields:

Term dictionary
Term Documents
gene 1, 10, 12, 17
protein 9, 10, 12
Stored fields
Document Genuine document identifier
1 1230000
9 1234000
10 1250000
12 1300000
17 1357000

Despite discovering that gene is present in documents 1, 10, 12 and 17, this information is not necessarily usable by the rest of the system until the genuine identifiers have been retrieved from each document's stored field containing this information. Although this need not be a problem for small volumes of results, this can make the retrieval of large volumes of results more time-consuming as field information needs to be accessed for each result document. If one could obtain such information directly from the term dictionary, such extra retrieval work would then be avoided.

Additionally, when preparing indexes which only contain a selection of documents, the inability to associate genuine identifiers with document-related data can lead to further complication in the process of mapping internal index identifiers to genuine global identifiers, since potentially many different mappings would need to be retained for these "filtered" indexes and for other indexes, all in order to obtain genuine identifiers for any given search result.

Forcing Correspondence Between Identifiers

An untidy solution can involve the preparation of indexes which attempt to match the internal identifier used by the index with the genuine identifier, by submitting empty documents for all identifiers for which no document exists in the genuine identifier scheme. From the example above, the following correspondence between identifiers and documents would apply:

Mapping of identifiers to documents (assuming a 1-based identifier scheme)
Identifier Document information
1 to 1229999 inclusive empty documents
1230000 document
1230001 to 1233999 empty documents
1234000 document
1234001 to 1249999 empty documents
1250000 document
1250001 to 1299999 empty documents
1300000 document
1300001 to 1356999 empty documents
1357000 document

Although this gives a one-to-one correspondence between the different identifier types, it is highly likely to "bloat" the term dictionary and bring with it undesirable resource and performance characteristics.

Conversion Between Identifiers

Although document identifiers can be stored in fields in an index, such identifiers may not be globally relevant. Consequently, an additional facility known as an offsets file is provided to make a mapping from document identifiers stored in the index to genuine, globally relevant identifiers. These offset files record global identifier offsets for the document, heading and sentence identifiers used in an index, and such offsets are then added to each kind of identifier in order to produce the genuine identifier which can identify each unit of information in a system-wide context. For example:

Document and sentence identifiers in the index
Document Sentence
10 20
Document and sentence offsets for the index
Document Sentence
1000 5000
Resulting "genuine" document and sentence identifiers
Document Sentence
1000 + 10 = 1010 5000 + 20 = 5020

Offset files are produced, as opposed to attempting to insert genuine identifiers directly into indexed information, because it is often desirable to process data without having to consider allocating ranges of usable identifiers within the system. Thus, it becomes possible to process a number of sections of a literature collection independently, deciding upon the appropriate ranges of identifiers only when each index is ready to be imported or associated with the system.