Difference between revisions of "Bioscape Tokenisation"

From irefindex
(Added example issue.)

Revision as of 17:50, 7 December 2009

The principal motivation behind the tokenisation strategy employed by Bioscape has always been to permit the accurate retrieval of bioentity names from a text indexing solution. Consequently, a relatively complicated tokenisation scheme has been developed to expose features of bioentity names so that such features, even typically uninformative characters such as symbols and punctuation, can be retrieved from an index as part of a larger collection of tokens constituting a name.

With a fine-grained, lossless tokenisation policy, certain consequences arise for the Bioscape system:

  • Text index sizes are larger, with some indexing solutions (such as Xapian) producing larger volumes of data than others (such as Lucene)
  • Searches for names involve longer "phrases" of tokens than would otherwise be the case

However, it should be possible to mitigate these problems by adopting the following techniques:

  • By employing a coarse, "lossy" tokenisation policy, the volume of positional data in an index would be reduced, thus reducing the overall index sizes somewhat
  • Searches for names would involve shorter "phrases" of "important" tokens - those which are not discarded through the application of "lossy" tokenisation
  • Applying the fine-grained tokenisation policy when validating candidate mention locations of search terms would filter out false positives
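
The two-stage approach described above can be sketched as follows. This is only an illustration under stated assumptions, not Bioscape's actual tokeniser: the regular expression, the lowercasing, the rule that "lossy" means "drop non-alphanumeric tokens", and the names fine_tokens and find_mentions are all hypothetical. A coarse phrase search proposes candidate locations, and the fine-grained token sequence is then compared at each candidate to discard false positives.

```python
import re

# Assumed fine-grained pattern: runs of letters, runs of digits,
# or any single symbol/punctuation character (nothing is discarded).
FINE = re.compile(r"[^\W\d_]+|\d+|[^\w\s]|_")


def fine_tokens(text):
    """Lossless (apart from whitespace and case) token stream."""
    return [t.lower() for t in FINE.findall(text)]


def find_mentions(name, text):
    """Return (candidates, confirmed) start positions in the fine token stream."""
    fine_text = fine_tokens(text)
    fine_name = fine_tokens(name)

    # Coarse ("lossy") view: keep only alphanumeric tokens, remembering
    # where each one sits in the fine-grained stream.
    coarse_text = [(i, t) for i, t in enumerate(fine_text) if t.isalnum()]
    coarse_name = [t for t in fine_name if t.isalnum()]
    if not coarse_name:
        return [], []  # a fully "depleted" name cannot be searched

    # Stage 1: shorter phrase search over the coarse tokens.
    n = len(coarse_name)
    candidates = [
        coarse_text[k][0]
        for k in range(len(coarse_text) - n + 1)
        if [t for _, t in coarse_text[k:k + n]] == coarse_name
    ]

    # Stage 2: validate each candidate against the fine-grained tokens.
    lead = next(i for i, t in enumerate(fine_name) if t.isalnum())
    confirmed = []
    for pos in candidates:
        start = pos - lead
        if start >= 0 and fine_text[start:start + len(fine_name)] == fine_name:
            confirmed.append(start)
    return candidates, confirmed


candidates, confirmed = find_mentions("IL-2", "IL 2 and IL-2 receptor")
print(candidates)  # [0, 3]
print(confirmed)   # [3]
```

In this sketch the coarse search matches both "IL 2" and "IL-2", and only the second survives validation because the hyphen is present in the fine-grained token sequence.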

Assessing Token Filtering Effects

Although it is tempting to filter tokens from names and terms, a term becomes unsearchable if such filtering removes all of its tokens. It could, however, be argued that such terms would contain only symbols or punctuation and are unlikely to appear in the literature. In practice, very few terms appear to suffer from "token depletion" through filtering. A script called bioscape_analyze_terms.py has been written to examine term lists, such as those prepared as lexicons for use within Bioscape's search activity, and to produce relevant statistics on this matter.
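
The kind of check involved can be illustrated with a short sketch. It does not reproduce bioscape_analyze_terms.py, whose interface and output are not described here; the function name depletion_report, the token pattern, and the filtering rule (discard every non-alphanumeric token) are assumptions made purely for illustration.

```python
import re

# Assumed token pattern: letter runs, digit runs, or single symbols.
TOKEN = re.compile(r"[^\W\d_]+|\d+|[^\w\s]|_")


def depletion_report(terms):
    """Count terms left with no tokens once symbol tokens are filtered out."""
    depleted = [
        term for term in terms
        if not any(tok.isalnum() for tok in TOKEN.findall(term))
    ]
    total = len(terms)
    return {
        "total": total,
        "depleted": len(depleted),
        "fraction": len(depleted) / total if total else 0.0,
        "examples": depleted,
    }


report = depletion_report(["BCAR1", "+/-", "IL-2", "--"])
print(report["depleted"], "of", report["total"], "terms depleted")
print(report["examples"])  # ['+/-', '--']
```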

Tokenisation Issues

PubMed #10639512

"In this study, we isolated and characterized the crucial gene at the breast cancer antiestrogen resistance 1 (BCAR1) locus."

Here, the gene/protein name to be searched is "breast cancer anti-estrogen resistance 1", whereas the text spells the word "antiestrogen" without a hyphen. Tokenising the name and searching for the resulting tokens may therefore cause a mismatch between the tokens expected and those present in the tokenised, indexed text.
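
The mismatch can be reproduced with a small sketch. The tokeniser below is an assumed stand-in (letter runs, digit runs and single symbols, lowercased), not Bioscape's own scheme, and the helper names are hypothetical. Under these assumptions the phrase search fails whether or not the hyphen token is filtered out, because "anti" followed by "estrogen" never lines up with the single token "antiestrogen".

```python
import re

# Assumed token pattern: letter runs, digit runs, or single symbols.
TOKEN = re.compile(r"[^\W\d_]+|\d+|[^\w\s]|_")


def tokens(text, lossy=False):
    """Tokenise text; with lossy=True, drop symbol and punctuation tokens."""
    toks = [t.lower() for t in TOKEN.findall(text)]
    return [t for t in toks if t.isalnum()] if lossy else toks


def contains_phrase(haystack, needle):
    """True if the needle token sequence occurs contiguously in haystack."""
    n = len(needle)
    return any(haystack[i:i + n] == needle for i in range(len(haystack) - n + 1))


sentence = ("In this study, we isolated and characterized the crucial gene at "
            "the breast cancer antiestrogen resistance 1 (BCAR1) locus.")
name = "breast cancer anti-estrogen resistance 1"

print(tokens(name))
# ['breast', 'cancer', 'anti', '-', 'estrogen', 'resistance', '1']
print(contains_phrase(tokens(sentence), tokens(name)))              # False
print(contains_phrase(tokens(sentence, True), tokens(name, True)))  # False
```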