Difference between revisions of "Bioscape Indexing"

Revision as of 13:07, 21 October 2009

Bioscape places a number of requirements on text indexing solutions. A reasonable number of such solutions are available, as described in the "Text Indexing Resources" document. This document discusses those requirements.

Efficient storage of position information

Since many tokens are produced and none discarded by the default tokenisation policy, positions must not occupy excessive amounts of space when stored in an index.
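One common way of keeping position lists small is to store the gaps between successive positions rather than the positions themselves, writing each gap in a variable number of bytes. The following is a minimal sketch of that idea only; the source does not state which encoding any particular indexing solution uses, and the function names encode_positions and decode_positions are hypothetical.

```python
def encode_positions(positions):
    """Encode an ascending list of token positions as gap-encoded variable-length bytes."""
    out = bytearray()
    previous = 0
    for position in positions:
        gap = position - previous
        previous = position
        # Emit 7 bits per byte, low-order bits first; the high bit marks continuation.
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)


def decode_positions(data):
    """Reverse encode_positions, returning the original list of positions."""
    positions = []
    current = 0
    gap = 0
    shift = 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            # Last byte of this gap: add it to the running position.
            current += gap
            positions.append(current)
            gap = 0
            shift = 0
    return positions


# Illustrative positions of one term within one document (made-up values).
example_positions = [3, 17, 18, 200, 70000]
encoded = encode_positions(example_positions)
```

Here the five positions occupy eight bytes, since small gaps need only a single byte each. Whether such a scheme suits Bioscape depends on the chosen indexing solution.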

Efficient access to index storage

For large datasets, the solution must be able to handle correspondingly large files without relying on reading them into RAM in their entirety. Access to complete result sets for particular terms must also be efficient. Conventional search solutions may present pages of results without attempting to provide a global overview of the result set, whereas Bioscape attempts to provide accurate statistics, such as the number of mentions of a gene that occur in sentences containing a certain keyword or phrase.
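To illustrate why complete result sets matter, the following sketch counts gene mentions occurring in sentences that also contain a keyword. It assumes a hypothetical postings layout, keyed by (document identifier, sentence number) and mapping to the number of occurrences; this layout and the name count_cooccurring_mentions are illustrative assumptions, not Bioscape's actual structures.

```python
def count_cooccurring_mentions(gene_postings, keyword_postings):
    """Count gene mentions in sentences that also contain the keyword.

    Both arguments map (document, sentence) pairs to occurrence counts
    (a hypothetical layout chosen for this sketch).
    """
    total = 0
    # Every posting for the gene is visited: an accurate statistic cannot be
    # derived from only the first page of results.
    for sentence_key, mentions in gene_postings.items():
        if sentence_key in keyword_postings:
            total += mentions
    return total


# Made-up postings for illustration.
gene_postings = {(1230000, 0): 2, (1230000, 3): 1, (1250000, 1): 1}
keyword_postings = {(1230000, 0): 1, (1250000, 1): 3, (1300000, 2): 1}
mention_count = count_cooccurring_mentions(gene_postings, keyword_postings)
```

With the data above, the gene is mentioned three times in sentences containing the keyword. In a real index the postings would be read from disk, which is why efficient traversal of whole posting lists is a requirement.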

Convenient access to document identifiers

Many indexing solutions assign arbitrary identifiers to indexed documents, mandating the storage of genuine identifiers in separate field storage, but this can make access to the required information less efficient. For example, consider the retrieval of the genuine document identifiers given the following term dictionary and stored fields:

Term dictionary
Term	Documents
gene	1, 10, 12, 17
protein	9, 10, 12

Stored fields
Document	Genuine document identifier
1	1230000
9	1234000
10	1250000
12	1300000
17	1357000

Although a lookup reveals that gene is present in documents 1, 10, 12 and 17, this information is not necessarily usable by the rest of the system until the genuine identifiers have been retrieved from each document's stored field. This need not be a problem for small volumes of results, but it can make the retrieval of large volumes more time-consuming, since field information must be accessed for each result document. If such information could be obtained directly from the term dictionary, this extra retrieval work would be avoided.
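The difference between the two approaches can be sketched using the example data above. The dictionaries and function names here are hypothetical stand-ins for an index's term dictionary and stored fields, not the API of any particular indexing solution.

```python
# The example term dictionary and stored fields from the tables above.
term_dictionary = {"gene": [1, 10, 12, 17], "protein": [9, 10, 12]}
stored_fields = {1: 1230000, 9: 1234000, 10: 1250000, 12: 1300000, 17: 1357000}


def genuine_ids_via_fields(term):
    """Two-step retrieval: one stored-field access per result document."""
    return [stored_fields[document] for document in term_dictionary.get(term, [])]


# Alternative (assumed) layout: the term dictionary holds genuine identifiers.
direct_term_dictionary = {
    term: [stored_fields[document] for document in documents]
    for term, documents in term_dictionary.items()
}


def genuine_ids_direct(term):
    """Single-step retrieval: no stored-field access is required."""
    return direct_term_dictionary.get(term, [])
```

Both functions return 1230000, 1250000, 1300000 and 1357000 for gene, but the first performs an extra lookup for every result document, and that cost grows with the size of the result set.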

Additionally, when preparing indexes which contain only a selection of documents, the inability to associate genuine identifiers with document-related data complicates the mapping of internal index identifiers to genuine global identifiers. Potentially many different mappings would need to be retained, for these "filtered" indexes and for other indexes, simply to obtain genuine identifiers for any given search result.

An untidy workaround is to prepare indexes in which the internal identifier used by the index matches the genuine identifier, by submitting empty documents for every identifier that has no document in the genuine identifier scheme. For the example above, the following correspondence between identifiers and documents would apply:

Mapping of identifiers to documents (assuming a 1-based identifier scheme)
Identifier	Document information
1 to 1229999 inclusive	empty documents
1230000	document
1230001 to 1233999	empty documents
1234000	document
1234001 to 1249999	empty documents
1250000	document
1250001 to 1299999	empty documents
1300000	document
1300001 to 1356999	empty documents
1357000	document

Although this gives a one-to-one correspondence between the different identifier types, it is highly likely to "bloat" the term dictionary and bring with it undesirable resource and performance characteristics.
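The scale of the padding in the example above can be worked out with a short calculation. This is only a sketch: the function name padding_overhead is hypothetical, and it assumes the 1-based identifier scheme stated in the table.

```python
# Genuine identifiers of the five documents in the example.
genuine_ids = [1230000, 1234000, 1250000, 1300000, 1357000]


def padding_overhead(identifiers):
    """Return (total slots, real documents, empty documents) for a padded index."""
    real_documents = len(set(identifiers))
    # With 1-based identifiers, every slot up to the largest identifier is filled.
    total_slots = max(identifiers)
    empty_documents = total_slots - real_documents
    return total_slots, real_documents, empty_documents


total_slots, real_documents, empty_documents = padding_overhead(genuine_ids)
```

For the five example documents, 1356995 empty documents would have to be submitted to fill 1357000 slots.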

