Difference between revisions of "Bioscape Tokenisation"

From irefindex
(Added gene keyword-related note.)
(Added more details of the different potential tokenisation approaches.)

Revision as of 15:30, 12 February 2010

The principal motivating criterion behind the tokenisation strategy employed by Bioscape has always been to permit the accurate retrieval of bioentity names from a text indexing solution. Consequently, a relatively complicated tokenisation scheme has been developed to expose features of bioentity names so that such features, even such typically uninformative characters as symbols and punctuation, can be retrieved from an index as part of a larger collection of tokens constituting a name.

With a fine-grained, lossless tokenisation policy, certain consequences arise for the Bioscape system:

  • Text index sizes are larger, with some indexing solutions (such as Xapian) producing considerably larger volumes of data than others (such as Lucene)
  • Searches for names involve longer "phrases" of tokens than would otherwise be the case
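
To make the second consequence concrete, the following minimal sketch compares a fine-grained tokeniser with plain whitespace splitting. The regular expression is an assumption made purely for illustration; it is not Bioscape's actual tokenisation policy, and it treats non-ASCII letters as symbols.

```python
import re

# Hypothetical fine-grained policy (not Bioscape's real one): runs of ASCII
# letters, runs of digits, and each remaining non-whitespace character
# (symbols, punctuation) become separate tokens.
FINE_TOKEN = re.compile(r"[A-Za-z]+|[0-9]+|[^A-Za-z0-9\s]")

def fine_tokens(text):
    """Return every non-whitespace region of the text as a token."""
    return FINE_TOKEN.findall(text)

def whitespace_tokens(text):
    """Return whitespace-delimited tokens, for comparison."""
    return text.split()

name = "anti-estrogen resistance 1 (BCAR1)"
print(fine_tokens(name))        # the "phrase" to search for under the fine policy
print(whitespace_tokens(name))  # the shorter phrase under whitespace splitting
```

Under these assumptions the fine-grained policy yields nine tokens for the example name, whereas whitespace splitting yields four, so both the positional data recorded in the index and the phrase used in a search grow accordingly.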

However, it should be possible to mitigate these problems by adopting a number of techniques; these are described below.

Adopting Coarse Tokenisation

By employing a coarse tokenisation policy, the volume of positional data in an index would be reduced, since tokens would be larger and thus appear less frequently. However, coarse tokens do not expose their internal structure, necessitating various workarounds to match search terms against indexed tokens and to verify the presence of such terms. One such workaround would involve applying a fine-grained tokenisation policy to any text thought to contain mentions of a search term, then validating such possible mentions using the more precise tokens.
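
The two-stage workaround might be sketched as follows. Everything here is hypothetical: whitespace splitting stands in for the coarse policy, a simple regular expression stands in for the fine-grained policy, and substring matching against coarse tokens stands in for whatever lookup an actual index would offer.

```python
import re

# Assumed fine-grained policy: letter runs, digit runs, single symbols.
FINE_TOKEN = re.compile(r"[A-Za-z]+|[0-9]+|[^A-Za-z0-9\s]")

def coarse_tokens(text):
    """Assumed coarse policy: lower-cased, whitespace-delimited tokens."""
    return text.lower().split()

def fine_tokens(text):
    return FINE_TOKEN.findall(text.lower())

def contains_phrase(tokens, phrase):
    """True if phrase occurs as a contiguous run within tokens."""
    n = len(phrase)
    return n > 0 and any(tokens[i:i + n] == phrase
                         for i in range(len(tokens) - n + 1))

def is_candidate(text, term):
    """Stage 1: every coarse token of the term occurs inside some coarse token of the text."""
    doc_tokens = coarse_tokens(text)
    return all(any(part in token for token in doc_tokens)
               for part in coarse_tokens(term))

def is_valid_mention(text, term):
    """Stage 2: re-tokenise the candidate text finely and require an exact token phrase."""
    return contains_phrase(fine_tokens(text), fine_tokens(term))

documents = {
    "a": "the crucial gene at the (BCAR1) locus",
    "b": "expression of the BCAR10 construct",  # made-up text for illustration
}
candidates = [key for key, text in documents.items() if is_candidate(text, "BCAR1")]
validated = [key for key in candidates if is_valid_mention(documents[key], "BCAR1")]
print(candidates, validated)
```

Both documents pass the coarse stage, because "bcar1" is a substring of "(bcar1)" and of "bcar10". Only the first survives validation, since the fine tokens of the second are "bcar" followed by "10" rather than "1".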

Adopting "Lossy" Tokenisation

By employing a "lossy" tokenisation policy - in other words, discarding certain classes of tokens or filtering tokens - the volume of positional data in an index would be reduced, since there would be fewer tokens recorded in the index. One concern with this strategy is the effect of discarding tokens from search terms: should a search term be rendered ambiguous or even eliminated by token filtering, it would then be either inconvenient to locate or practically unsearchable.
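
The ambiguity concern can be illustrated with a small sketch that groups terms by what remains of them after filtering. The filter (discard any token containing no letter or digit) and the example terms are assumptions chosen for illustration, not Bioscape's actual policy or lexicon.

```python
import re
from collections import defaultdict

# Assumed fine-grained policy: letter runs, digit runs, single symbols.
FINE_TOKEN = re.compile(r"[A-Za-z]+|[0-9]+|[^A-Za-z0-9\s]")

def lossy_tokens(term):
    """Hypothetical lossy policy: drop tokens that contain no letter or digit."""
    return tuple(token for token in FINE_TOKEN.findall(term.lower())
                 if any(ch.isalnum() for ch in token))

def ambiguous_groups(terms):
    """Group terms that become indistinguishable once filtered."""
    groups = defaultdict(list)
    for term in terms:
        groups[lossy_tokens(term)].append(term)
    # Only non-empty token sequences shared by several terms are ambiguous;
    # terms reduced to no tokens at all are a separate (depletion) problem.
    return {key: members for key, members in groups.items()
            if key and len(members) > 1}

terms = ["Rh+", "Rh-", "BCAR1"]  # made-up example terms
print(ambiguous_groups(terms))
```

Here "Rh+" and "Rh-" both reduce to the single token "rh", so a search on the filtered tokens alone cannot tell them apart.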

Assessing Token Filtering Effects

Although it is tempting to filter tokens from names and terms, any term from which such filtering removes all tokens becomes unsearchable. It could, however, be argued that such terms would only contain symbols or punctuation and are unlikely to appear in the literature. In practice, very few terms appear to suffer from "token depletion" through filtering. A script called bioscape_analyze_terms.py (in the bsadmin distribution) has been written to examine term lists, such as those prepared as lexicons for use within Bioscape's search activity, and to produce relevant statistics on this matter.
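
The kind of statistic involved might be gathered along the following lines. This is not the bioscape_analyze_terms.py script, whose interface and output are not described here; the tokeniser, the filter and the sample lexicon are all assumptions.

```python
import re

# Assumed fine-grained policy: letter runs, digit runs, single symbols.
FINE_TOKEN = re.compile(r"[A-Za-z]+|[0-9]+|[^A-Za-z0-9\s]")

def keep(token):
    """Hypothetical filter: keep only tokens containing a letter or digit."""
    return any(ch.isalnum() for ch in token)

def filtered_tokens(term):
    return [token for token in FINE_TOKEN.findall(term) if keep(token)]

def depletion_report(terms):
    """Count the terms left with no tokens at all after filtering."""
    depleted = [term for term in terms if not filtered_tokens(term)]
    return {
        "total": len(terms),
        "depleted": len(depleted),
        "depleted_terms": depleted,
    }

lexicon = ["BCAR1", "anti-estrogen", "+/-", "(-)"]  # made-up sample lexicon
report = depletion_report(lexicon)
print(report)
```

With this sample, only the two terms consisting entirely of symbols are depleted, which is consistent with the argument that depleted terms would contain nothing but symbols or punctuation.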

Practical Differences between Strategies

A fine-grained tokenisation policy may expose every significant region of text as a token. However, a "lossy" policy may discard regions of the indexed text and not expose tokens for such regions. Consequently, when investigating the validity of a possible search term mention and considering such issues as token adjacency, it is no longer sufficient to investigate only the tokens between those matching the search term concerned. Since "gaps" may exist between the tokens of interest in the textual region under investigation, it becomes necessary to consider the contents of such gaps and whether the text residing in them may acceptably appear between the tokens.
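
A gap check of this kind might look like the sketch below. It assumes that the lossy policy records character offsets for the tokens it keeps, and that only whitespace and hyphens are acceptable between the tokens of a mention; both assumptions are illustrative rather than taken from Bioscape.

```python
import re

WORD = re.compile(r"[A-Za-z]+|[0-9]+")    # assumed lossy policy: symbols are discarded
ACCEPTABLE_GAP = re.compile(r"[\s\-]*")   # assumed rule: whitespace and hyphens only

def lossy_spans(text):
    """Return (token, start, end) for each token the lossy policy keeps."""
    return [(m.group().lower(), m.start(), m.end()) for m in WORD.finditer(text)]

def find_mentions(text, term):
    """Return (start, end) offsets of mentions whose gaps are all acceptable."""
    wanted = [token for token, _, _ in lossy_spans(term)]
    if not wanted:
        return []  # a term depleted by filtering cannot be located
    spans = lossy_spans(text)
    mentions = []
    for i in range(len(spans) - len(wanted) + 1):
        window = spans[i:i + len(wanted)]
        if [token for token, _, _ in window] != wanted:
            continue
        # Inspect the raw text lying between each pair of adjacent tokens.
        gaps = [text[left[2]:right[1]] for left, right in zip(window, window[1:])]
        if all(ACCEPTABLE_GAP.fullmatch(gap) for gap in gaps):
            mentions.append((window[0][1], window[-1][2]))
    return mentions

term = "anti-estrogen resistance"
print(find_mentions("anti-estrogen resistance", term))
print(find_mentions("anti. Estrogen: resistance", term))
```

Both texts yield the same lossy tokens, so the index alone cannot distinguish them; only the inspection of the gaps rejects the second text, where sentence punctuation separates the tokens.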

Tokenisation Issues

PubMed #10639512

"In this study, we isolated and characterized the crucial gene at the breast cancer antiestrogen resistance 1 (BCAR1) locus."

Here, the gene/protein name to be searched is "breast cancer anti-estrogen resistance 1", whereas the text contains the unhyphenated form "antiestrogen". Tokenising the name and searching for the resulting tokens may therefore cause a mismatch between the tokens expected and those present in the tokenised, indexed text.
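
The mismatch can be reproduced with a small sketch. The tokeniser is a hypothetical fine-grained policy (letter runs, digit runs and individual symbols), not necessarily the one Bioscape uses.

```python
import re

# Assumed fine-grained policy: letter runs, digit runs, single symbols.
FINE_TOKEN = re.compile(r"[A-Za-z]+|[0-9]+|[^A-Za-z0-9\s]")

def fine_tokens(text):
    return FINE_TOKEN.findall(text.lower())

def contains_phrase(tokens, phrase):
    """True if phrase occurs as a contiguous run within tokens."""
    n = len(phrase)
    return n > 0 and any(tokens[i:i + n] == phrase
                         for i in range(len(tokens) - n + 1))

sentence = ("In this study, we isolated and characterized the crucial gene at the "
            "breast cancer antiestrogen resistance 1 (BCAR1) locus.")
name = "breast cancer anti-estrogen resistance 1"

name_tokens = fine_tokens(name)
text_tokens = fine_tokens(sentence)
print(name_tokens)
print(contains_phrase(text_tokens, name_tokens))
```

The name yields the tokens "anti", "-" and "estrogen", while the sentence yields the single token "antiestrogen", so the phrase search fails even though a human reader would regard the two forms as the same name.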

Alternative Tokenisation Schemes

Although many of the search activities involve the exact matching of terms to regions in text, there are activities which would be helped by strategies which do not rely on exact matching. For example, database entries for genes provide textual summaries which may contain keywords that are uncommon enough in general text that their presence in a document would support any ambiguous mentions for which such genes have been suggested. However, keywords are less likely to appear in a document in exactly the form found in a summary. For instance, "mammalian" may be a keyword found in a gene summary, yet a related term derived from the same root word (such as "mammal" or "mammals") might be sufficient in the text to provide confirmation of the correctness of a particular gene suggestion.

At the very least, such summaries could have a process known as stemming applied to them, in order to produce a list of prefix terms which can then be used to search the existing text index. However, it would be more efficient to have an index whose terms are themselves stemmed, so that searches in the index could be performed without having to look up all possible terms matching a given prefix.
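
The difference between the two approaches can be sketched as follows. The suffix-stripping stemmer is a deliberately naive stand-in for a real stemming algorithm, and the index is a made-up mapping from terms to document identifiers; neither reflects an existing Bioscape component.

```python
from bisect import bisect_left

SUFFIXES = ("ian", "s")  # toy suffix list, for illustration only

def stem(word):
    """Naive stemmer: strip the first matching suffix if at least 3 characters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def terms_with_prefix(sorted_terms, prefix):
    """Approach 1: enumerate every indexed term that starts with the prefix."""
    matches = []
    for term in sorted_terms[bisect_left(sorted_terms, prefix):]:
        if not term.startswith(prefix):
            break
        matches.append(term)
    return matches

def build_stemmed_index(index):
    """Approach 2: index documents under stemmed terms, allowing a single lookup."""
    stemmed = {}
    for term, docs in index.items():
        stemmed.setdefault(stem(term), set()).update(docs)
    return stemmed

# Hypothetical unstemmed index: term -> identifiers of documents containing it.
index = {"mammal": {1}, "mammals": {2}, "mammary": {3}, "membrane": {4}}

prefix = stem("mammalian")
expanded = terms_with_prefix(sorted(index), prefix)
docs_by_prefix = set().union(*(index[term] for term in expanded))
docs_by_stem = build_stemmed_index(index).get(prefix, set())
print(prefix, expanded, docs_by_prefix, docs_by_stem)
```

Both approaches find the same documents in this example, but the first has to expand the prefix into every matching term before consulting the index, whereas the second needs only one lookup per stemmed keyword.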