Difference between revisions of "Bioscape Indexing"
PaulBoddie (talk | contribs) (Initial notes.) |
PaulBoddie (talk | contribs) (Added status note. Clarified the status of the indexing solutions used in Bioscape.) |
||
(8 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
− | Bioscape places | + | {{:Bioscape Status}} |
+ | |||
+ | Bioscape places a number of requirements on text indexing solutions (a reasonable number of such solutions being available as described in the [[Text Indexing Resources|"Text Indexing Resources"]] document). This document discusses these requirements. | ||
+ | |||
+ | Two indexing solutions have been used by Bioscape during its development: Lucene and Xapian. The index file formats for these solutions are described in [http://lucene.apache.org/java/2_9_0/fileformats.html] and [http://trac.xapian.org/wiki/FlintBackend/Structure] respectively. Xapian employs [http://en.wikipedia.org/wiki/B-tree B-trees] or variations thereof, whereas Lucene adopts techniques promoted by such data structures as [http://en.wikipedia.org/wiki/B%2B_tree B+ trees], such as an emphasis on efficient sequential reads from a compacted or optimised index when finding a given search term. | ||
== Efficient storage of position information == | == Efficient storage of position information == | ||
Line 7: | Line 11: | ||
== Efficient access to index storage == | == Efficient access to index storage == | ||
− | For large datasets, the solution must be able to deal with the corresponding large files and not rely on reading them into RAM in their entirety. | + | For large datasets, the solution must be able to deal with the corresponding large files and not rely on reading them into RAM in their entirety. Moreover, access to complete result sets for particular terms must also be efficient: unlike conventional search solutions which may present pages of results and not attempt to provide a global overview of the result set, Bioscape attempts to provide accurate statistics for such information as the number of mentions of a gene where such mentions occur in sentences containing a certain keyword or phrase. |
== Convenient access to document identifiers == | == Convenient access to document identifiers == | ||
− | Many indexing solutions assign arbitrary identifiers to indexed documents, mandating the storage of genuine identifiers in separate field storage, but this can make access to the required information less efficient. Although this need not be a problem for small volumes of results, this can make the retrieval of large volumes of results more time-consuming as field information needs to be accessed for each result document. | + | Many indexing solutions assign arbitrary identifiers to indexed documents, mandating the storage of genuine identifiers in separate field storage, but this can make access to the required information less efficient. For example, consider the retrieval of the genuine document identifiers given the following term dictionary and stored fields: |
+ | |||
+ | {| border="1" cellspacing="0" cellpadding="5" style="margin: 1em" | ||
+ | |+ style="margin: 1em" | Term dictionary | ||
+ | !width="20%" | Term | ||
+ | !width="80%" | Documents | ||
+ | |- | ||
+ | | gene || 1, 10, 12, 17 | ||
+ | |- | ||
+ | | protein || 9, 10, 12 | ||
+ | |} | ||
+ | |||
+ | {| border="1" cellspacing="0" cellpadding="5" style="margin: 1em" | ||
+ | |+ style="margin: 1em" | Stored fields | ||
+ | !width="20%" | Document | ||
+ | !width="80%" | Genuine document identifier | ||
+ | |- | ||
+ | | 1 || 1230000 | ||
+ | |- | ||
+ | | 9 || 1234000 | ||
+ | |- | ||
+ | | 10 || 1250000 | ||
+ | |- | ||
+ | | 12 || 1300000 | ||
+ | |- | ||
+ | | 17 || 1357000 | ||
+ | |} | ||
+ | |||
+ | Despite discovering that <tt>gene</tt> is present in documents 1, 10, 12 and 17, this information is not necessarily usable by the rest of the system until the genuine identifiers have been retrieved from each document's stored field containing this information. Although this need not be a problem for small volumes of results, this can make the retrieval of large volumes of results more time-consuming as field information needs to be accessed for each result document. If one could obtain such information directly from the term dictionary, such extra retrieval work would then be avoided. | ||
+ | |||
+ | Additionally, when preparing indexes which only contain a selection of documents, the inability to associate genuine identifiers with document-related data can lead to further complication in the process of mapping internal index identifiers to genuine global identifiers, since potentially many different mappings would need to be retained for these "filtered" indexes and for other indexes, all in order to obtain genuine identifiers for any given search result. | ||
+ | |||
+ | === Forcing Correspondence Between Identifiers === | ||
+ | |||
+ | An untidy solution can involve the preparation of indexes which attempt to match the internal identifier used by the index with the genuine identifier, by submitting empty documents for all identifiers for which no document exists in the genuine identifier scheme. From the example above, the following correspondence between identifiers and documents would apply: | ||
+ | |||
+ | {| border="1" cellspacing="0" cellpadding="5" style="margin: 1em" | ||
+ | |+ style="margin: 1em" | Mapping of identifiers to documents (assuming a 1-based identifier scheme) | ||
+ | ! Identifier | ||
+ | ! Document information | ||
+ | |- | ||
+ | | 1 to 1229999 inclusive || empty documents | ||
+ | |- | ||
+ | | 1230000 || document | ||
+ | |- | ||
+ | | 1230001 to 1233999 || empty documents | ||
+ | |- | ||
+ | | 1234000 || document | ||
+ | |- | ||
+ | | 1234001 to 1249999 || empty documents | ||
+ | |- | ||
+ | | 1250000 || document | ||
+ | |- | ||
+ | | 1250001 to 1299999 || empty documents | ||
+ | |- | ||
+ | | 1300000 || document | ||
+ | |- | ||
+ | | 1300001 to 1356999 || empty documents | ||
+ | |- | ||
+ | | 1357000 || document | ||
+ | |} | ||
+ | |||
+ | Although this gives a one-to-one correspondence between the different identifier types, it is highly likely to "bloat" the term dictionary and bring with it undesirable resource and performance characteristics. | ||
+ | |||
+ | === Conversion Between Identifiers === | ||
+ | |||
+ | Although document identifiers can be stored in fields in an index, such identifiers may not be globally relevant. Consequently, an additional facility known as an ''offsets file'' is provided to make a mapping from document identifiers stored in the index to genuine, globally relevant identifiers. These offset files record global identifier offsets for the document, heading and sentence identifiers used in an index, and such offsets are then added to each kind of identifier in order to produce the genuine identifier which can identify each unit of information in a system-wide context. For example: | ||
+ | |||
+ | {| border="1" cellspacing="0" cellpadding="5" style="margin: 1em" | ||
+ | |+ style="margin: 1em" | Document and sentence identifiers in the index | ||
+ | ! Document | ||
+ | ! Sentence | ||
+ | |- | ||
+ | | 10 || 20 | ||
+ | |} | ||
+ | |||
+ | {| border="1" cellspacing="0" cellpadding="5" style="margin: 1em" | ||
+ | |+ style="margin: 1em" | Document and sentence offsets for the index | ||
+ | ! Document | ||
+ | ! Sentence | ||
+ | |- | ||
+ | | 1000 || 5000 | ||
+ | |} | ||
+ | |||
+ | {| border="1" cellspacing="0" cellpadding="5" style="margin: 1em" | ||
+ | |+ style="margin: 1em" | Resulting "genuine" document and sentence identifiers | ||
+ | ! Document | ||
+ | ! Sentence | ||
+ | |- | ||
+ | | 1000 + 10 = 1010 || 5000 + 20 = 5020 | ||
+ | |} | ||
+ | |||
+ | Offset files are produced, as opposed to attempting to insert genuine identifiers directly into indexed information, because it is often desirable to process data without having to consider allocating ranges of usable identifiers within the system. Thus, it becomes possible to process a number of sections of a literature collection independently, deciding upon the appropriate ranges of identifiers only when each index is ready to be imported or associated with the system. | ||
[[Category:Bioscape]] | [[Category:Bioscape]] |
Latest revision as of 13:45, 14 July 2010
Note | Please note that this documentation covers an unreleased product and is for internal use only. |
Bioscape places a number of requirements on text indexing solutions (a reasonable number of such solutions being available as described in the "Text Indexing Resources" document). This document discusses these requirements.
Two indexing solutions have been used by Bioscape during its development: Lucene and Xapian. The index file formats for these solutions are described in [1] and [2] respectively. Xapian employs B-trees or variations thereof, whereas Lucene adopts techniques promoted by such data structures as B+ trees, such as an emphasis on efficient sequential reads from a compacted or optimised index when finding a given search term.
Contents
Efficient storage of position information
Since many tokens are produced and none discarded by the default tokenisation policy, positions must not occupy excessive amounts of space when stored in an index.
Efficient access to index storage
For large datasets, the solution must be able to deal with the corresponding large files and not rely on reading them into RAM in their entirety. Moreover, access to complete result sets for particular terms must also be efficient: unlike conventional search solutions which may present pages of results and not attempt to provide a global overview of the result set, Bioscape attempts to provide accurate statistics for such information as the number of mentions of a gene where such mentions occur in sentences containing a certain keyword or phrase.
Convenient access to document identifiers
Many indexing solutions assign arbitrary identifiers to indexed documents, mandating the storage of genuine identifiers in separate field storage, but this can make access to the required information less efficient. For example, consider the retrieval of the genuine document identifiers given the following term dictionary and stored fields:
Term | Documents |
---|---|
gene | 1, 10, 12, 17 |
protein | 9, 10, 12 |
Document | Genuine document identifier |
---|---|
1 | 1230000 |
9 | 1234000 |
10 | 1250000 |
12 | 1300000 |
17 | 1357000 |
Despite discovering that gene is present in documents 1, 10, 12 and 17, this information is not necessarily usable by the rest of the system until the genuine identifiers have been retrieved from each document's stored field containing this information. Although this need not be a problem for small volumes of results, this can make the retrieval of large volumes of results more time-consuming as field information needs to be accessed for each result document. If one could obtain such information directly from the term dictionary, such extra retrieval work would then be avoided.
Additionally, when preparing indexes which only contain a selection of documents, the inability to associate genuine identifiers with document-related data can lead to further complication in the process of mapping internal index identifiers to genuine global identifiers, since potentially many different mappings would need to be retained for these "filtered" indexes and for other indexes, all in order to obtain genuine identifiers for any given search result.
Forcing Correspondence Between Identifiers
An untidy solution can involve the preparation of indexes which attempt to match the internal identifier used by the index with the genuine identifier, by submitting empty documents for all identifiers for which no document exists in the genuine identifier scheme. From the example above, the following correspondence between identifiers and documents would apply:
Identifier | Document information |
---|---|
1 to 1229999 inclusive | empty documents |
1230000 | document |
1230001 to 1233999 | empty documents |
1234000 | document |
1234001 to 1249999 | empty documents |
1250000 | document |
1250001 to 1299999 | empty documents |
1300000 | document |
1300001 to 1356999 | empty documents |
1357000 | document |
Although this gives a one-to-one correspondence between the different identifier types, it is highly likely to "bloat" the term dictionary and bring with it undesirable resource and performance characteristics.
Conversion Between Identifiers
Although document identifiers can be stored in fields in an index, such identifiers may not be globally relevant. Consequently, an additional facility known as an offsets file is provided to make a mapping from document identifiers stored in the index to genuine, globally relevant identifiers. These offset files record global identifier offsets for the document, heading and sentence identifiers used in an index, and such offsets are then added to each kind of identifier in order to produce the genuine identifier which can identify each unit of information in a system-wide context. For example:
Document | Sentence |
---|---|
10 | 20 |
Document | Sentence |
---|---|
1000 | 5000 |
Document | Sentence |
---|---|
1000 + 10 = 1010 | 5000 + 20 = 5020 |
Offset files are produced, as opposed to attempting to insert genuine identifiers directly into indexed information, because it is often desirable to process data without having to consider allocating ranges of usable identifiers within the system. Thus, it becomes possible to process a number of sections of a literature collection independently, deciding upon the appropriate ranges of identifiers only when each index is ready to be imported or associated with the system.