Bioscape Indexing
Bioscape places a number of requirements on text indexing solutions. A reasonable number of such solutions are available, as described in the "Text Indexing Resources" document. This document discusses those requirements.
Efficient storage of position information
Since many tokens are produced and none discarded by the default tokenisation policy, positions must not occupy excessive amounts of space when stored in an index.
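One common way of keeping position lists compact is to store the gaps between successive positions and to write each gap using a variable number of bytes. The following is a minimal sketch of that general technique, not a description of how any particular indexing solution stores positions; the function names are chosen for illustration only.

```python
def encode_positions(positions):
    """Encode ascending, non-negative positions as variable-length gaps."""
    out = bytearray()
    previous = 0
    for position in positions:
        gap = position - previous
        previous = position
        # Write 7 bits per byte, low-order bits first; a set high bit
        # indicates that more bytes follow for the same gap.
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)


def decode_positions(data):
    """Reverse encode_positions, yielding the original position list."""
    positions = []
    current = 0
    gap = 0
    shift = 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            current += gap
            positions.append(current)
            gap = 0
            shift = 0
    return positions
```

Since tokens in running text tend to recur at short distances, most gaps fit in a single byte, whereas a fixed-width representation would spend the same number of bytes on every position.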
Efficient access to index storage
For large datasets, the solution must be able to deal with the corresponding large files and not rely on reading them into RAM in their entirety.
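One way of meeting this requirement is to memory-map the index file so that the operating system only loads the parts that are actually touched. The sketch below assumes a hypothetical file of fixed-size records (a term number followed by a document number); the record layout and function names are illustrative assumptions, not part of any existing index format.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical fixed-size record: (term number, document number).
RECORD = struct.Struct("<II")


def write_records(path, records):
    """Write the records sequentially to a binary file."""
    with open(path, "wb") as f:
        for record in records:
            f.write(RECORD.pack(*record))


def read_record(path, index):
    """Fetch one record by its ordinal without reading the whole file."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return RECORD.unpack_from(m, index * RECORD.size)


def demo():
    """Write three records to a temporary file and read back the last."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        write_records(path, [(1, 10), (1, 12), (2, 9)])
        return read_record(path, 2)
    finally:
        os.remove(path)
```

Seeking within an ordinary file object and reading a fixed number of bytes would serve the same purpose; the essential point is that access is by offset rather than by loading the complete file.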
Convenient access to document identifiers
Many indexing solutions assign arbitrary identifiers to indexed documents, mandating the storage of genuine identifiers in separate field storage, but this can make access to the required information less efficient. For example, consider the retrieval of the genuine document identifiers given the following term dictionary and stored fields:
Term | Documents |
---|---|
gene | 1, 10, 12, 17 |
protein | 9, 10, 12 |
Document | Genuine document identifier |
---|---|
1 | 1230000 |
9 | 1234000 |
10 | 1250000 |
12 | 1300000 |
17 | 1357000 |
Although the term dictionary reveals that gene is present in documents 1, 10, 12 and 17, this information is not necessarily usable by the rest of the system until the genuine identifiers have been retrieved from the stored field of each document. This need not be a problem for small volumes of results, but it can make the retrieval of large volumes more time-consuming, since field information must be accessed for every result document. If such information could be obtained directly from the term dictionary, this extra retrieval work would be avoided.
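The two-step retrieval can be illustrated with plain dictionaries holding the data from the tables above. This is only a sketch: real indexing solutions keep these structures on disk, and the `direct_dictionary` shown at the end is a hypothetical alternative rather than a feature of any particular solution.

```python
# Data copied from the example tables.
term_dictionary = {
    "gene": [1, 10, 12, 17],
    "protein": [9, 10, 12],
}
stored_fields = {
    1: 1230000,
    9: 1234000,
    10: 1250000,
    12: 1300000,
    17: 1357000,
}


def genuine_identifiers(term):
    """Look up a term, then access a stored field for each result document."""
    return [stored_fields[doc] for doc in term_dictionary.get(term, [])]


# Hypothetical alternative: a term dictionary recording genuine identifiers
# directly, so that no per-document field access is needed at search time.
direct_dictionary = {
    term: [stored_fields[doc] for doc in docs]
    for term, docs in term_dictionary.items()
}
```

Here `genuine_identifiers("gene")` performs four stored field accesses, one per result, whereas `direct_dictionary["gene"]` yields the same identifiers in a single lookup.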
Additionally, when preparing indexes that contain only a selection of documents, the inability to associate genuine identifiers with document-related data complicates the mapping of internal index identifiers to genuine global identifiers. Potentially many different mappings would need to be retained, one for each "filtered" index and one for every other index, simply to obtain genuine identifiers for any given search result.
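The complication can be sketched as follows, assuming a hypothetical "filtered" index containing only two of the example documents and numbering them from 1. The index names and the `resolve` function are illustrative assumptions.

```python
# Hypothetical per-index mappings from internal to genuine identifiers.
mappings = {
    "full": {1: 1230000, 9: 1234000, 10: 1250000, 12: 1300000, 17: 1357000},
    # A selection of documents, renumbered internally by the index.
    "filtered": {1: 1250000, 2: 1300000},
}


def resolve(index_name, internal_ids):
    """Translate internal identifiers using the mapping for one index."""
    mapping = mappings[index_name]
    return [mapping[internal_id] for internal_id in internal_ids]
```

Internal identifier 1 refers to genuine document 1230000 in the full index but to 1250000 in the filtered one, so a result is meaningless unless the mapping for the index that produced it is also kept to hand.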
An untidy solution is to prepare indexes in which the internal identifier used by the index matches the genuine identifier, by submitting empty documents for all identifiers for which no document exists in the genuine identifier scheme. For the example above, the following correspondence between identifiers and documents would apply:
Identifier | Document information |
---|---|
1 to 1229999 inclusive | empty documents |
1230000 | document |
1230001 to 1233999 | empty documents |
1234000 | document |
1234001 to 1249999 | empty documents |
1250000 | document |
1250001 to 1299999 | empty documents |
1300000 | document |
1300001 to 1356999 | empty documents |
1357000 | document |
Although this gives a one-to-one correspondence between the different identifier types, it is highly likely to "bloat" the term dictionary and bring with it undesirable resource and performance characteristics.
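The scale of the padding can be estimated with a short calculation over the genuine identifiers from the example. The function name is illustrative, and the sketch assumes that internal identifiers start at 1 and are assigned consecutively.

```python
# Genuine identifiers from the example tables.
genuine_ids = [1230000, 1234000, 1250000, 1300000, 1357000]


def padding_required(ids):
    """Return (real, empty) document counts when every identifier from 1
    up to the largest genuine identifier must exist in the index."""
    real = len(set(ids))
    empty = max(ids) - real
    return real, empty
```

For the example, `padding_required(genuine_ids)` gives 5 real documents and 1356995 empty ones, which matches the sum of the empty ranges in the table above.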