Bioscape Searching
Bioscape combines a number of processes in order to produce search results from textual documents and other data. This document attempts to describe each of these processes in an approximate order of application.
Text Indexing
Before any searching can begin, the documents to be searched must be acquired and processed so that search operations on them can be performed relatively efficiently. To achieve this, the text of these documents is tokenised and presented to a text indexing solution such as Lucene or Xapian, which permits the retrieval of token (or term) location information via a data structure known as an inverted index.
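For illustration only, a minimal sketch of an inverted index might map each token to the sentences and positions at which it occurs; this merely stands in for a real indexing solution such as Lucene or Xapian rather than describing Bioscape's actual integration with them:

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each token to a list of (document id, position) pairs."""
    index = defaultdict(list)
    for doc_id, tokens in documents.items():
        for position, token in enumerate(tokens):
            index[token].append((doc_id, position))
    return index

# Example: two "documents" (sentences) that have already been tokenised.
documents = {
    1: ["p16", "(", "INK4a", ")", "is", "a", "tumour", "suppressor"],
    2: ["phosphoglycerate", "mutase", "catalyses", "a", "glycolytic", "step"],
}
index = build_inverted_index(documents)
print(index["INK4a"])   # [(1, 2)]
```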
Although a single document could be used as the minimal unit of stored information, there are advantages in considering smaller units of information such as the individual sentences in each document. By identifying sentences, it becomes easier to determine whether results occur together in the same sentence as well as in the same document. Bioscape employs a relatively simple regular expression approach, incorporating exceptions for abbreviations and other distractions, to split documents into sentences.
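The splitter itself is part of Bioscape; purely as a sketch of the general technique, a regular expression based splitter with a small exception list might look like the following (the abbreviation list here is an assumption, not the one actually used):

```python
import re

# Hypothetical abbreviation list; the exceptions actually used by
# Bioscape's splitter will differ.
ABBREVIATIONS = {"e.g.", "i.e.", "et al.", "Fig."}

def split_sentences(text):
    """Split on sentence-ending punctuation followed by whitespace and an
    upper-case letter, unless the text so far ends with a known abbreviation."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]\s+(?=[A-Z])", text):
        candidate = text[start:match.end()].strip()
        if any(candidate.endswith(abbrev) for abbrev in ABBREVIATIONS):
            continue    # keep accumulating; this is not a sentence boundary
        sentences.append(candidate)
        start = match.end()
    remainder = text[start:].strip()
    if remainder:
        sentences.append(remainder)
    return sentences

print(split_sentences("ORC1 was described by Smith et al. Binding occurs at origins. See Fig. 2."))
# ['ORC1 was described by Smith et al. Binding occurs at origins.', 'See Fig. 2.']
```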
Consequently, the notion of a document as presented to the text indexing solution is equivalent to a sentence as found in the original document text, with the membership of sentences (that is, indexing solution "documents") in original documents also recorded in the system.
Tokenisation
Within each sentence, the text is broken down into its constituent parts using a process known as tokenisation; this process determines which character sequences will be available in the inverted index for searching. Many search applications employ tokenisation practices which seek to break texts down into sequences of natural language words and names, often discarding symbols, punctuation and even variations in spelling and inflection, as well as potentially applying many other transformations to account for grammatical variations in the way "root words" may be written. Since the purpose of Bioscape is to identify named bioentities whose names may span multiple words and include punctuation and symbols, naive tokenisation approaches such as splitting on whitespace characters prove insufficient for breaking texts down into sequences of recognisable units, whereas linguistic techniques such as stemming have only limited applicability to typical bioentity names.
One approach, similar to the frequently referenced Treebank tokenisation method, is to define the following mutually exclusive classes of tokens: alphanumeric words, whitespace characters, and all other characters. However, whitespace characters act only as delimiters, establishing boundaries between tokens, and are not produced as concrete tokens in the output. Thus only alphanumeric words and other (non-whitespace) characters emerge as distinct token classes from the tokenisation process, with the consequences of this approach shown in the following examples:
| Input Text | Tokens | | | |
|---|---|---|---|---|
| ORC1 | ORC1 | | | |
| ORF C | ORF | C | | |
| P1- | P1 | - | | |
| p16(INK4a) | p16 | ( | INK4a | ) |
| P1.7 | P1 | . | 7 | |
| PAR-1 | PAR | - | 1 | |
| phosphoglycerate mutase | phosphoglycerate | mutase | | |
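A sketch of this simple two-class tokenisation, reproducing the behaviour shown in the table above, can be written as a single regular expression (this illustrates the general approach rather than Bioscape's own code):

```python
import re

# Alphanumeric runs become word tokens; every other non-whitespace
# character becomes a token of its own; whitespace only delimits.
SIMPLE_TOKEN = re.compile(r"[A-Za-z0-9]+|[^A-Za-z0-9\s]")

def tokenise(text):
    return SIMPLE_TOKEN.findall(text)

print(tokenise("p16(INK4a)"))               # ['p16', '(', 'INK4a', ')']
print(tokenise("P1.7"))                     # ['P1', '.', '7']
print(tokenise("phosphoglycerate mutase"))  # ['phosphoglycerate', 'mutase']
```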
By default, Bioscape uses a somewhat more complicated regular expression approach which attempts to identify a number of token classes:
- digits and numbers
- upper-case characters
- lower-case characters
- upper-case Roman numerals
- lower-case Roman numerals
- Greek letter suffixes (written using Latin characters)
- "h" prefixes
- "p" suffixes
- plus and minus signs
- symbols
The code which performs tokenisation can be found in the bioscape.regexp module in the bsadmin distribution.
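As an illustration of how class-based tokenisation can be expressed with a regular expression, the simplified sketch below covers only a subset of the classes listed above (the Greek letter subset is an assumption); the authoritative rules are those in bioscape.regexp itself:

```python
import re

# Simplified sketch only: the real rules in bioscape.regexp are more
# extensive (Roman numerals, "h" prefixes, "p" suffixes and so on).
TOKEN_CLASSES = re.compile(r"""
    (?P<number>\d+)                          # digits and numbers
  | (?P<greek>alpha|beta|gamma|delta|kappa)  # Greek letter suffixes written using Latin characters
  | (?P<upper>[A-Z]+)                        # runs of upper-case characters
  | (?P<lower>[a-z]+)                        # runs of lower-case characters
  | (?P<sign>[+-])                           # plus and minus signs
  | (?P<symbol>\S)                           # any remaining symbol
""", re.VERBOSE)

def classify(text):
    """Yield (token class, token) pairs for the non-whitespace parts of text."""
    for match in TOKEN_CLASSES.finditer(text):
        yield match.lastgroup, match.group()

print(list(classify("p16(INK4a)")))
# [('lower', 'p'), ('number', '16'), ('symbol', '('),
#  ('upper', 'INK'), ('number', '4'), ('lower', 'a'), ('symbol', ')')]
```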
Search Strategies
There are two supported forms of searching in Bioscape:
- Using terms from predefined lexicons and searching for these in the indexed documents.
- Speculatively searching for certain patterns in the indexed documents.
Although the latter form of searching is useful for gathering contextual information (such as mentions of phrases like "... chromosome" or "chromosome ..."), the former is more pertinent to any discussion of the approach taken in Bioscape to search for bioentities.
Predefined Searching Using Lexicons
Given source information about bioentities, obtained from a database such as Entrez Gene, the names employed by authors and researchers when identifying such bioentities can be made available to Bioscape. From this raw data it is possible to prepare a lexicon of reasonable size whose contents can be used to search for mentions of bioentities in the literature.
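As a sketch of how such a lexicon might be applied to the index, the following reuses the toy inverted index and tokeniser from the sketches above; the lexicon entries are made up for illustration rather than taken from Entrez Gene:

```python
def find_mentions(lexicon, index, tokenise):
    """Return, for each single-token lexicon name, the sentences mentioning it."""
    mentions = {}
    for name in lexicon:
        tokens = tokenise(name)
        # Only single-token names are handled in this sketch; the real
        # system must also match multi-token names as phrases.
        if len(tokens) == 1:
            mentions[name] = sorted({doc_id for doc_id, _ in index.get(tokens[0], [])})
    return mentions

print(find_mentions(["INK4a", "mutase"], index, tokenise))
# {'INK4a': [1], 'mutase': [2]}
```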