Difference between revisions of "Bioscape Indexing"

From irefindex

Revision as of 17:10, 19 October 2009

Bioscape places a number of requirements on text indexing solutions (a reasonable number of such solutions being available as described in the "Text Indexing Resources" document). This document discusses these requirements.

Efficient storage of position information

Since many tokens are produced and none discarded by the default tokenisation policy, positions must not occupy excessive amounts of space when stored in an index.
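One common way to keep position lists compact is to store the gaps between successive positions rather than the positions themselves, writing each gap as a variable-length byte sequence. The following is a minimal sketch of that technique, not a description of how any particular indexing solution stores its data; the function names are hypothetical and positions are assumed to be non-negative integers in ascending order.

```python
def encode_positions(positions):
    """Delta-encode ascending positions and pack each gap into as few bytes as possible."""
    out = bytearray()
    previous = 0
    for position in positions:
        gap = position - previous
        if gap < 0:
            raise ValueError("positions must be in ascending order")
        previous = position
        # Emit 7 bits per byte, low-order bits first; a set high bit means "more bytes follow".
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)


def decode_positions(data):
    """Reverse encode_positions: rebuild each gap, then accumulate gaps into positions."""
    positions = []
    current = 0
    gap = 0
    shift = 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7  # continuation byte: the next byte holds higher-order bits
        else:
            current += gap
            positions.append(current)
            gap = 0
            shift = 0
    return positions
```

With this scheme, closely spaced positions cost one byte each regardless of how large the absolute position values become, which matters when no tokens are discarded.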

Efficient access to index storage

For large datasets, the solution must be able to deal with the corresponding large files and not rely on reading them into RAM in their entirety.
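As an illustration of accessing a large file without loading it whole, the sketch below memory-maps a file of fixed-width records and reads a single record by offset, leaving the operating system to page in only the parts that are touched. The record layout (one unsigned 64-bit value per record), the file name, and the function names are assumptions made for the example, not the format of any actual index.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical record layout: one little-endian unsigned 64-bit value per record.
RECORD = struct.Struct("<Q")


def write_records(path, values):
    """Write each value as one fixed-width record."""
    with open(path, "wb") as f:
        for value in values:
            f.write(RECORD.pack(value))


def read_record(path, index):
    """Read one record through a read-only memory map instead of reading the whole file."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            offset = index * RECORD.size
            if index < 0 or offset + RECORD.size > len(mapped):
                raise IndexError("record index out of range")
            return RECORD.unpack_from(mapped, offset)[0]


# Usage: write a small file, then fetch the third record (index 2) by offset.
with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "records.dat")
    write_records(path, [10, 20, 30, 40])
    third = read_record(path, 2)
```

The same access pattern applies to files far larger than available RAM, since only the pages containing the requested record need to be resident.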

Convenient access to document identifiers

Many indexing solutions assign arbitrary internal identifiers to indexed documents, so the genuine identifiers must be kept in separate field storage, which makes them less efficient to access. This need not be a problem for small volumes of results, but retrieving large volumes of results becomes more time-consuming because field information must be accessed for each result document. Preparing indexes that contain only a selection of documents complicates matters further: since genuine identifiers cannot be associated directly with document-related data, a separate mapping from internal index identifiers to genuine global identifiers must be retained for each such "filtered" index, as well as for every other index, merely to obtain genuine identifiers for any given search result.
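The mapping problem can be shown with a small sketch. It assumes a hypothetical indexer that numbers documents 0, 1, 2, ... in the order they are added; the identifiers and function names are invented for the example.

```python
def build_mapping(genuine_ids):
    """Simulate an indexer that assigns internal identifiers 0, 1, 2, ... in insertion order.

    The returned list is the lookup table from internal identifier to genuine identifier.
    """
    return list(genuine_ids)


def resolve(mapping, internal_ids):
    """Translate internal identifiers from search results into genuine identifiers."""
    return [mapping[internal_id] for internal_id in internal_ids]


# Hypothetical genuine identifiers for a full index and a "filtered" index.
full_mapping = build_mapping(["doc-A", "doc-B", "doc-C", "doc-D"])
filtered_mapping = build_mapping(["doc-B", "doc-D"])

# The same internal identifier refers to different documents in each index,
# so every index needs its own lookup table, consulted once per result.
from_full = resolve(full_mapping, [1])
from_filtered = resolve(filtered_mapping, [1])
```

Here internal identifier 1 resolves to doc-B in the full index but to doc-D in the filtered one, so each additional filtered index adds another table to maintain. A solution that accepts genuine identifiers directly as document identifiers avoids both the per-result lookup and the extra tables.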