Bioscape Indexing

Revision as of 15:46, 22 September 2009 by PaulBoddie (Initial notes.)

Bioscape places the following requirements on text indexing solutions:

Efficient storage of position information

Since many tokens are produced and none discarded by the default tokenisation policy, positions must not occupy excessive amounts of space when stored in an index.
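One common way to keep position lists compact is to store the gaps between successive positions rather than the positions themselves, and to write each gap as a variable-length integer. The following is only a minimal sketch of that general technique, not a description of how Bioscape stores positions; the function names `encode_positions` and `decode_positions` are hypothetical, and positions are assumed to be sorted, non-negative integers.

```python
def encode_positions(positions):
    """Encode sorted, non-negative token positions as delta-coded varints."""
    out = bytearray()
    previous = 0
    for position in positions:
        # Store the gap from the previous position; gaps are usually small.
        delta = position - previous
        previous = position
        # Emit 7 bits per byte, low bits first; the high bit marks
        # "more bytes follow" for this value.
        while delta >= 0x80:
            out.append((delta & 0x7F) | 0x80)
            delta >>= 7
        out.append(delta)
    return bytes(out)


def decode_positions(data):
    """Reverse encode_positions, returning the original list of positions."""
    positions = []
    current = 0   # running sum of gaps, i.e. the absolute position
    delta = 0     # gap currently being assembled
    shift = 0     # bit offset of the next 7-bit group
    for byte in data:
        delta |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            current += delta
            positions.append(current)
            delta = 0
            shift = 0
    return positions
```

With this scheme, a gap below 128 occupies a single byte regardless of how large the absolute position is, so densely occurring tokens cost roughly one byte per occurrence.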

Efficient access to index storage

For large datasets, the solution must be able to deal with the corresponding large files and not rely on reading them into RAM in their entirety.
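One way to meet this kind of requirement is to memory-map the index file, so that the operating system pages in only the regions actually touched. The sketch below assumes a hypothetical file of fixed-width records (an 8-byte offset and a 4-byte count); the layout and the names `RECORD`, `write_records` and `read_record` are illustrative assumptions, not part of Bioscape.

```python
import mmap
import struct

# Hypothetical record layout: little-endian 8-byte offset, 4-byte count.
RECORD = struct.Struct("<QI")


def write_records(path, records):
    """Write (offset, count) pairs to path as fixed-width binary records."""
    with open(path, "wb") as f:
        for offset, count in records:
            f.write(RECORD.pack(offset, count))


def read_record(path, index):
    """Return record number `index` without reading the whole file.

    The file is mapped read-only; only the pages covering the requested
    record need to be brought into memory by the operating system.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return RECORD.unpack_from(mapped, index * RECORD.size)
```

Because records have a fixed width, the position of record `index` can be computed directly, so a lookup touches a constant amount of data however large the file grows. Note that an empty file cannot be mapped, so a real implementation would need to handle that case.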

Convenient access to document identifiers

Many indexing solutions assign arbitrary identifiers to indexed documents, so the genuine identifiers must be kept in separate field storage, which can make access to the required information less efficient. This need not be a problem for small result sets, but retrieving large volumes of results becomes more time-consuming because field information must be accessed for each result document.
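The difference can be illustrated with two toy in-memory indexes. This is only a sketch under simplifying assumptions: the class names `ArbitraryIdIndex` and `DirectIdIndex` are hypothetical, the "field store" is a plain list rather than on-disk storage, and the `field_lookups` counter exists purely to make the per-result cost visible.

```python
from collections import defaultdict


class ArbitraryIdIndex:
    """Postings hold internal document numbers; genuine identifiers
    live in a separate field store and are fetched per result."""

    def __init__(self):
        self.postings = defaultdict(list)
        self.field_store = []    # internal number -> genuine identifier
        self.field_lookups = 0   # counts accesses to the field store

    def add(self, identifier, tokens):
        number = len(self.field_store)
        self.field_store.append(identifier)
        for token in set(tokens):
            self.postings[token].append(number)

    def search(self, token):
        results = []
        for number in self.postings.get(token, []):
            self.field_lookups += 1  # one extra access per result document
            results.append(self.field_store[number])
        return results


class DirectIdIndex:
    """Postings hold the genuine identifiers themselves."""

    def __init__(self):
        self.postings = defaultdict(list)

    def add(self, identifier, tokens):
        for token in set(tokens):
            self.postings[token].append(identifier)

    def search(self, token):
        # No second lookup: the postings already contain what is needed.
        return list(self.postings.get(token, []))
```

Both classes return the same identifiers for a query, but the first performs one field-store access per result. In a disk-backed index each such access may be a separate read, which is where the cost for large result sets arises.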