Difference between revisions of "Bioscape Indexing"
PaulBoddie (talk | contribs) (Initial notes.) |
PaulBoddie (talk | contribs) (Added filtered index notes.) |
||
Line 11: | Line 11: | ||
== Convenient access to document identifiers == | == Convenient access to document identifiers == | ||
− | Many indexing solutions assign arbitrary identifiers to indexed documents, mandating the storage of genuine identifiers in separate field storage, but this can make access to the required information less efficient. Although this need not be a problem for small volumes of results, this can make the retrieval of large volumes of results more time-consuming as field information needs to be accessed for each result document. | + | Many indexing solutions assign arbitrary identifiers to indexed documents, mandating the storage of genuine identifiers in separate field storage, but this can make access to the required information less efficient. Although this need not be a problem for small volumes of results, this can make the retrieval of large volumes of results more time-consuming as field information needs to be accessed for each result document. Additionally, when preparing indexes which only contain a selection of documents, the inability to associate genuine identifiers with document-related data can lead to further complication in the process of mapping internal index identifiers to genuine global identifiers, since potentially many different mappings would need to be retained for these "filtered" indexes and for other indexes, all in order to obtain genuine identifiers for any given search result. |
[[Category:Bioscape]] | [[Category:Bioscape]] |
Revision as of 11:45, 23 September 2009
Bioscape places the following requirements on text indexing solutions:
Efficient storage of position information
Since many tokens are produced and none discarded by the default tokenisation policy, positions must not occupy excessive amounts of space when stored in an index.
Efficient access to index storage
For large datasets, the solution must be able to deal with the corresponding large files and not rely on reading them into RAM in their entirety.
Convenient access to document identifiers
Many indexing solutions assign arbitrary identifiers to indexed documents, mandating the storage of genuine identifiers in separate field storage, but this can make access to the required information less efficient. Although this need not be a problem for small volumes of results, this can make the retrieval of large volumes of results more time-consuming as field information needs to be accessed for each result document. Additionally, when preparing indexes which only contain a selection of documents, the inability to associate genuine identifiers with document-related data can lead to further complication in the process of mapping internal index identifiers to genuine global identifiers, since potentially many different mappings would need to be retained for these "filtered" indexes and for other indexes, all in order to obtain genuine identifiers for any given search result.