Bioscape Indexing

From irefindex
Revision as of 12:14, 21 October 2009 by PaulBoddie (Convenient access to document identifiers: Added an example of the genuine identifier retrieval problem.)

Bioscape places a number of requirements on text indexing solutions. A reasonable number of such solutions are available, as described in the "Text Indexing Resources" document. This document discusses those requirements.

Efficient storage of position information

Since many tokens are produced and none discarded by the default tokenisation policy, positions must not occupy excessive amounts of space when stored in an index.
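One common way of keeping position data compact is to store the gaps between successive positions and to write each gap as a variable-length byte sequence. The following is a minimal sketch of that idea, not a description of how any particular indexing solution stores positions; the function names are hypothetical and the positions are assumed to be non-negative integers in ascending order.

```python
def encode_positions(positions):
    """Delta-encode ascending positions and pack each gap in 7-bit groups."""
    out = bytearray()
    previous = 0
    for position in positions:
        gap = position - previous
        previous = position
        # Emit the low 7 bits first; the high bit marks "more bytes follow".
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)


def decode_positions(data):
    """Reverse encode_positions, returning the original list of positions."""
    positions = []
    current = 0
    gap = 0
    shift = 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            current += gap
            positions.append(current)
            gap = 0
            shift = 0
    return positions


positions = [3, 17, 130, 100000, 100004]
encoded = encode_positions(positions)
print(len(encoded))  # 7 bytes for five positions
```

Small gaps occupy a single byte each, so densely occurring tokens cost little; only unusually large gaps need two or more bytes.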

Efficient access to index storage

For large datasets, the solution must be able to deal with the corresponding large files and not rely on reading them into RAM in their entirety.
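As an illustration of accessing a large file without reading it into RAM, the sketch below memory-maps a file and reads a single record from it, leaving the operating system to page in only the parts that are touched. The on-disk layout (one little-endian unsigned 64-bit integer per record) and the name read_record are assumptions made for this example only.

```python
import mmap
import struct

# Hypothetical layout: one little-endian unsigned 64-bit integer per record.
RECORD = struct.Struct("<Q")


def read_record(path, n):
    """Return record number n (0-based) without loading the whole file."""
    with open(path, "rb") as f:
        # Length 0 maps the entire file; pages are only read when accessed.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            return RECORD.unpack_from(view, n * RECORD.size)[0]
```

A real index would keep the mapping open across many lookups rather than reopening the file for each one; it is reopened here only to keep the example short.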

Convenient access to document identifiers

Many indexing solutions assign arbitrary internal identifiers to indexed documents, so that genuine identifiers must be kept in separate stored fields, and this can make access to the required information less efficient. For example, consider the retrieval of the genuine document identifiers given the following term dictionary and stored fields:

Term dictionary
Term       Documents
gene       1, 10, 12, 17
protein    9, 10, 12

Stored fields
Document   Genuine document identifier
1          1230000
9          1234000
10         1250000
12         1300000
17         1357000

Although a search reveals that gene is present in documents 1, 10, 12 and 17, this information is not necessarily usable by the rest of the system until the genuine identifiers (1230000, 1250000, 1300000 and 1357000) have been retrieved from each document's stored fields. This need not be a problem for small volumes of results, but it makes the retrieval of large volumes of results more time-consuming, since field information must be accessed for each result document. If such information could be obtained directly from the term dictionary, this extra retrieval work would be avoided.
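The extra retrieval step can be sketched using the example data above. Plain dictionaries stand in for the term dictionary and the stored fields here; a real index would keep both on disk, making each stored field access considerably more expensive than a dictionary lookup. The names used are illustrative only.

```python
# Example data from the tables above.
term_dictionary = {
    "gene": [1, 10, 12, 17],
    "protein": [9, 10, 12],
}
stored_fields = {
    1: 1230000,
    9: 1234000,
    10: 1250000,
    12: 1300000,
    17: 1357000,
}


def genuine_identifiers(term):
    """Look up a term, then fetch one stored field per matching document."""
    internal_ids = term_dictionary.get(term, [])
    return [stored_fields[doc] for doc in internal_ids]


# The alternative: a term dictionary that records genuine identifiers
# directly, so that no per-document field access is needed.
direct_term_dictionary = {
    term: [stored_fields[doc] for doc in docs]
    for term, docs in term_dictionary.items()
}

print(genuine_identifiers("gene"))     # [1230000, 1250000, 1300000, 1357000]
print(direct_term_dictionary["gene"])  # [1230000, 1250000, 1300000, 1357000]
```

Both approaches return the same identifiers, but the first performs one additional lookup for every result document, which is the cost that grows with large result sets.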

Additionally, when preparing indexes which contain only a selection of documents, the inability to associate genuine identifiers with document-related data complicates the mapping of internal index identifiers to genuine global identifiers. A separate mapping may need to be retained for each such "filtered" index, as well as for the other indexes, simply in order to obtain genuine identifiers for any given search result.
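The bookkeeping involved can be sketched as follows. The internal numbering shown is hypothetical: it assumes that each index numbers its own documents sequentially from 1, which is not necessarily how any given indexing solution behaves, and the index names are invented for the example.

```python
# Hypothetical internal numbering: each index counts its documents from 1.
mappings = {
    "full": {1: 1230000, 2: 1234000, 3: 1250000, 4: 1300000, 5: 1357000},
    "filtered": {1: 1234000, 2: 1300000},
}


def resolve(index_name, internal_id):
    """Translate an internal identifier using the mapping for that index."""
    return mappings[index_name][internal_id]


print(resolve("full", 1))      # 1230000
print(resolve("filtered", 1))  # 1234000
```

The same internal identifier refers to different documents depending on the index consulted, so every index needs its own mapping to be kept alongside it.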

An untidy solution is to prepare indexes in which the internal identifier used by the index matches the genuine identifier, by submitting empty documents for every identifier in the genuine identifier scheme for which no document exists. For the example above, the following correspondence between identifiers and documents would apply:

Mapping of identifiers to documents (assuming a 1-based identifier scheme)
Identifier                     Document information
1 to 1229999 inclusive         empty documents
1230000                        document
1230001 to 1233999 inclusive   empty documents
1234000                        document
1234001 to 1249999 inclusive   empty documents
1250000                        document
1250001 to 1299999 inclusive   empty documents
1300000                        document
1300001 to 1356999 inclusive   empty documents
1357000                        document

Although this gives a one-to-one correspondence between the different identifier types, it is highly likely to "bloat" the term dictionary and bring with it undesirable resource and performance characteristics.
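The scale of the padding for the example above can be worked out directly. This is a small sketch with an illustrative function name, assuming the 1-based identifier scheme used in the mapping.

```python
genuine_ids = [1230000, 1234000, 1250000, 1300000, 1357000]


def empty_documents_required(ids, first_id=1):
    """Count the placeholder documents needed to align the two numberings."""
    total_slots = max(ids) - first_id + 1
    return total_slots - len(set(ids))


print(empty_documents_required(genuine_ids))  # 1356995
```

Five genuine documents therefore require 1356995 empty documents, which is the sum of the five empty ranges listed in the mapping.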