Bioscape Tokenisation

The principal motivating criterion behind the tokenisation strategy employed by Bioscape has always been to permit the accurate retrieval of bioentity names from a text indexing solution. Consequently, a relatively complicated tokenisation scheme has been developed to expose the features of bioentity names, so that these features, including typically uninformative characters such as symbols and punctuation, can be retrieved from an index as part of the larger collection of tokens constituting a name.

With a fine-grained, lossless tokenisation policy, certain consequences arise for the Bioscape system:

Text index sizes are larger than they would be under a coarser policy, with some indexing solutions (such as Xapian) producing larger volumes of data than others (such as Lucene)
Searches for names involve longer "phrases" of tokens than would otherwise be the case

However, it should be possible to mitigate these problems by adopting the following techniques:

Employing a coarse, "lossy" tokenisation policy for indexing, which would reduce the volume of positional data in an index and thus reduce the overall index sizes somewhat
Searching for names using shorter "phrases" of "important" tokens - those which are not discarded through the application of "lossy" tokenisation
Applying the fine-grained tokenisation policy when validating the possible mention locations of search terms, which would filter out the false positives admitted by the coarse search
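
The two-stage approach described above can be sketched as follows. This is only an illustrative sketch and not Bioscape's actual tokeniser: the token classes, the function names (fine_tokens, coarse_tokens, find_mentions) and the use of an in-memory token list in place of a real text index are all assumptions made for the example. Whitespace is also dropped here for brevity, whereas a truly lossless policy would presumably retain it, and names are assumed to begin and end with an alphanumeric token.

```python
import re

# Hypothetical fine-grained policy: runs of letters, runs of digits and
# each individual symbol or punctuation character become separate tokens.
# (Simplification: whitespace is not kept as a token in this sketch.)
FINE = re.compile(r"[A-Za-z]+|[0-9]+|[^A-Za-z0-9\s]")


def fine_tokens(text):
    """Tokenise text, keeping symbols and punctuation as tokens."""
    return FINE.findall(text)


def coarse_tokens(text):
    """Hypothetical "lossy" policy: discard symbols and fold case."""
    return [t.lower() for t in fine_tokens(text) if t.isalnum()]


def find_mentions(name, text):
    """Return (candidates, confirmed) as lists of (first, last) positions
    in the fine-grained token sequence of the text.

    Candidates come from a phrase match over the coarse tokens, standing
    in for a search against a smaller index. Confirmed mentions are the
    candidates whose fine-grained tokens also match the name exactly.
    """
    doc_fine = fine_tokens(text)
    # Positions in doc_fine of the tokens that survive the coarse policy.
    kept = [i for i, t in enumerate(doc_fine) if t.isalnum()]
    doc_coarse = [doc_fine[i].lower() for i in kept]

    name_fine = fine_tokens(name)
    name_coarse = coarse_tokens(name)
    n = len(name_coarse)

    candidates, confirmed = [], []
    if n == 0:
        return candidates, confirmed

    for start in range(len(doc_coarse) - n + 1):
        if doc_coarse[start:start + n] == name_coarse:
            first, last = kept[start], kept[start + n - 1]
            candidates.append((first, last))
            # Validation step: re-check the span with the fine policy.
            if doc_fine[first:last + 1] == name_fine:
                confirmed.append((first, last))
    return candidates, confirmed


candidates, confirmed = find_mentions("IL-2", "IL2 and IL-2 and IL 2")
print(candidates)
print(confirmed)
```

In this example the coarse search proposes three locations for the name "IL-2", because the hyphen is discarded and only the tokens "il" and "2" are searched for. The fine-grained check then retains only the location where the hyphen is actually present.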
