Bioscape Searching
Bioscape combines a number of processes to produce search results from textual documents and other data. This document describes each of the processes involved, in approximate order of application.
Text Indexing
Before any searching can begin, the documents to be searched must be acquired and processed so that search operations on them can be performed efficiently. To achieve this, the text of each document is tokenised and presented to a text indexing solution such as Lucene or Xapian, which permits the retrieval of token (or term) location information via a data structure known as an inverted index.
Although it might be useful to use a single document as the minimal unit of stored information, there are advantages in considering smaller units of information such as the individual sentences in each document. By identifying sentences, it then becomes easier to determine whether results occur together in the same sentence as well as in the same document. Bioscape employs a relatively simple regular expression approach which incorporates exceptions based on abbreviations and other distractions to split documents into sentences.
Consequently, the notion of a document as presented to the text indexing solution is equivalent to a sentence as found in the original document text, with the membership of sentences (that is, indexing solution "documents") in original documents also recorded in the system.
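As an illustration only, the following sketch shows one way a regular expression splitter with abbreviation exceptions might work, recording which original document each sentence belongs to. The abbreviation list, the boundary pattern and the function names are assumptions made for this example and are not taken from the Bioscape code.

```python
import re

# Hypothetical abbreviation list; the exceptions Bioscape really uses are not shown here.
ABBREVIATIONS = ("e.g.", "i.e.", "et al.", "Fig.", "vs.")

# Candidate boundary: sentence-ending punctuation, whitespace, then an
# upper-case letter or a digit starting the next sentence.
BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z0-9])")

def split_sentences(text):
    """Split text into sentences, ignoring boundaries that follow an abbreviation."""
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        candidate = text[start:match.start()]
        if candidate.endswith(ABBREVIATIONS):
            continue  # not a real boundary: keep extending the current sentence
        sentences.append(candidate)
        start = match.end()
    tail = text[start:]
    if tail:
        sentences.append(tail)
    return sentences

def sentence_records(document_id, text):
    """Return (document_id, sentence_number, sentence) tuples for indexing."""
    return [(document_id, number, sentence)
            for number, sentence in enumerate(split_sentences(text))]

example = "ORC1 binds DNA (Fig. 2). It is conserved, e.g. in yeast. See Smith et al. for details."
records = sentence_records("doc-1", example)
```

Each record is what would be presented to the indexing solution as a "document", while the first two fields preserve the link back to the original document.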
Tokenisation
Within each sentence, the text is broken down into its constituent parts using a process known as tokenisation; this process determines which character sequences will be available in the inverted index for searching. Many search applications employ tokenisation practices which seek to break texts down into sequences of natural language words and names, often discarding symbols, punctuation and even variations in spelling and inflection, as well as potentially applying many other transformations to account for grammatical variations in the way "root words" may be written. Since the purpose of Bioscape is to identify named bioentities whose names may include punctuation and symbols, possibly consisting of multiple words and symbols, naive tokenisation approaches such as splitting on whitespace characters prove rather insufficient in breaking texts down into sequences of recognisable units, whereas linguistic tokenisation approaches such as stemming have only a limited potential for application on typical bioentity names.
One approach similar to the frequently referenced Treebank tokenisation method is to define the following mutually exclusive classes of tokens: alphanumeric words, whitespace characters, all other characters. However, in this approach whitespace characters act only as delimiters, causing boundaries between tokens to be established, and are not produced as concrete tokens in the output. Thus only alphanumeric words and all other (non-whitespace) characters are produced as distinct token classes from a tokenisation process operating in this way. The following examples illustrate Treebank-style tokenisation:
Input Text | Tokens | | | |
---|---|---|---|---|
ORC1 | ORC1 | | | |
ORF C | ORF | C | | |
P1- | P1 | - | | |
p16(INK4a) | p16 | ( | INK4a | ) |
P1.7 | P1 | . | 7 | |
PAR-1 | PAR | - | 1 | |
phosphoglycerate mutase | phosphoglycerate | mutase | | |
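The three-class scheme described above can be sketched with a single regular expression. This is a minimal illustration of the Treebank-style approach, not the tokeniser Bioscape uses by default; the names `TOKEN` and `tokenise` are chosen for this example, and "alphanumeric" is assumed to mean ASCII letters and digits.

```python
import re

# Alphanumeric words, or any single character that is neither alphanumeric
# nor whitespace. Whitespace matches neither alternative, so it only
# separates tokens and never appears in the output.
TOKEN = re.compile(r"[A-Za-z0-9]+|[^A-Za-z0-9\s]")

def tokenise(text):
    """Return the Treebank-style tokens of text as a list of strings."""
    return TOKEN.findall(text)

for text in ("ORC1", "ORF C", "P1-", "p16(INK4a)", "P1.7", "PAR-1"):
    print(text, "->", tokenise(text))
```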
By default, Bioscape uses a somewhat more complicated regular expression approach which attempts to identify a number of token classes:
- digits and numbers
- upper-case characters
- lower-case characters
- upper-case Roman numerals
- lower-case Roman numerals
- Greek letter suffixes (written using Latin characters)
- "h" prefixes
- "p" suffixes
- plus and minus signs
- symbols
The code which performs tokenisation can be found in the bioscape.regexp module in the bsadmin distribution.
Case Sensitivity
For many search solutions, it may not be critical to insist that the case of characters in text match exactly the case employed in search terms. For example, it might be acceptable for the word "protein" to match the occurrences "protein", "Protein" and even "PROTEIN" with such solutions, and for various kinds of search terms, the same matching criteria still apply within Bioscape. However, it can be undesirable to permit such case-insensitive matches for certain kinds of terms, and there is therefore a need to record tokens which permit both case-insensitive and case-sensitive matching.
To support both kinds of matching, separate fields are used in the text indexing solution: one stores tokens with case information intact, for case-sensitive matching; one stores tokens with case information discarded, for case-insensitive matching. Thus, when a term is presented to the indexing solution for searching, the appropriate field can be chosen as the basis of the search in order to provide the appropriate matching behaviour.
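The two-field arrangement can be pictured with a small in-memory inverted index. This is only a sketch: the class, the field names `exact` and `folded`, and the use of (sentence, position) pairs are assumptions for illustration, and a real deployment would rely on the fields provided by the indexing solution.

```python
from collections import defaultdict

class TwoFieldIndex:
    """Toy inverted index with one case-preserving and one case-folded field."""

    def __init__(self):
        # Field names are hypothetical, not those used by Bioscape.
        self.fields = {"exact": defaultdict(list), "folded": defaultdict(list)}

    def add(self, sentence_id, tokens):
        """Record every token of a sentence in both fields."""
        for position, token in enumerate(tokens):
            self.fields["exact"][token].append((sentence_id, position))
            self.fields["folded"][token.lower()].append((sentence_id, position))

    def lookup(self, token, case_sensitive):
        """Return (sentence_id, position) pairs from the appropriate field."""
        if case_sensitive:
            return list(self.fields["exact"].get(token, []))
        return list(self.fields["folded"].get(token.lower(), []))

index = TwoFieldIndex()
index.add(0, ["Ca", "levels", "rose"])
index.add(1, ["the", "CA", "repeat"])
```

Looking up "Ca" with `case_sensitive=True` finds only the first sentence, whereas the case-insensitive lookup also finds "CA" in the second.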
Search Overview
The general search pipeline in Bioscape is as presented in the following diagram:
Predefined terms | Locator | ||||
Reader | Selector | Source | Phrase | Validator | Writer |
Predefined terms are read by the reader from a list of terms; a selector determines where to search for each term; a source consults the underlying index for position information; a phrase combines term position information; a validator reconciles position information with the actual indexed text; and a writer records the eventual results.
Search Strategies
There are two supported forms of searching in Bioscape:
- Using terms from predefined lexicons and searching for these in the indexed documents.
- Speculatively searching for certain patterns in the indexed documents.
Although the latter form of searching is useful for gathering contextual information (for example, mentions of phrases such as "... chromosome" or "chromosome ..."), the former is more pertinent in any discussion of the approach taken in Bioscape to search for bioentities.
Predefined Searching Using Lexicons
Given source information about bioentities, obtained from a database such as Entrez Gene, the names employed by authors and researchers when identifying such bioentities can be made available to Bioscape. Using such raw data, it is then possible to prepare a lexicon of reasonable size whose contents can then be used to search for mentions of bioentities in the literature.
In order to compile a lexicon for a particular kind of search term, such as a lexicon providing gene and protein names, a query is made against the Bioscape database tables that retain name information, with the query qualified using the appropriate term types (such as "protein" and "gene"), and the results are written to an export file using standard database system export mechanisms (the COPY command as provided by PostgreSQL and other database systems). It is files of this nature which are used as the basis for subsequent searching.
To search for predefined terms, each lexicon is accessed by a component known as a reader, and each term is then assessed and processed by other components (as described in later sections). Since each term is predefined and is associated with a particular identifier in the database, this identifier will be propagated to any search results.
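A reader of this kind might look like the following sketch, which assumes a tab-separated export with the identifier in the first column and the name in the second. The column layout, the function name and the sample identifiers are all invented for illustration, and backslash escapes of the kind PostgreSQL's text format can produce are not handled.

```python
import io

def read_lexicon(stream):
    """Yield (identifier, term) pairs from a tab-separated lexicon export.

    Assumes two columns per line: identifier, then name (hypothetical layout).
    """
    for line in stream:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        identifier, term = line.split("\t", 1)
        yield identifier, term

# Made-up sample data standing in for an export file.
sample = io.StringIO("101\tORC1\n102\tphosphoglycerate mutase\n")
terms = list(read_lexicon(sample))
```

Because each term travels together with its identifier, later components can attach that identifier to any result found for the term.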
Speculative Searching
Certain kinds of information differ from predefined information either by lacking a well-defined or fixed representation or by not being conveniently represented as enumerated entries in a database. For example, one might wish to identify parentheses (the regions enclosed by brackets) within a text. It would be unreasonable to predefine such regions: they are only known once they have been found, at which point there is no need to record them in a lexicon in order to search for them again.
Consequently, speculative searching is somewhat less rigid than searching for predefined patterns or representations, since the latter focuses on such representations in order to find matching regions of text, whereas the former is free to use a variety of techniques. In the case of parenthesis identification, since the bracket characters are the only "known" parts of the regions to be discovered, the searching strategy focuses on sentences providing an opening and a closing bracket in the appropriate order, with any consideration of the enclosed text being a secondary concern. Upon recording a matching region as a search result, since no predefined notion of the region exists, a special identifier needs to be used in association with the details of the result.
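The parenthesis case can be sketched over a tokenised sentence as follows. The stack-based pairing and the identifier value are assumptions made for this example; the text above only requires that an opening bracket be followed by a closing bracket in the same sentence.

```python
# Hypothetical special identifier attached to speculative results.
PARENTHESIS_ID = "PARENTHESIS"

def find_parentheses(tokens):
    """Return (identifier, start, end) for each bracketed region in a sentence.

    start and end are the token positions of the opening and closing
    brackets. Closing brackets without a preceding opening bracket, and
    opening brackets that are never closed, are ignored.
    """
    open_positions = []
    regions = []
    for position, token in enumerate(tokens):
        if token == "(":
            open_positions.append(position)
        elif token == ")" and open_positions:
            regions.append((PARENTHESIS_ID, open_positions.pop(), position))
    return sorted(regions)

sentence = ["p16", "(", "INK4a", ")", "binds", "CDK4"]
results = find_parentheses(sentence)
```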
Matching Strategies
As described above, both case-insensitive and case-sensitive matching are permitted for search terms, and an appropriate strategy is chosen depending on the nature of each term. In Bioscape, the selection of a strategy for matching a given term is performed by a component known as a selector (or field selector, to be precise). For keyword terms, it is acceptable to match in a case-insensitive fashion since such terms resemble standard English words which could be capitalised in various ways.
Keyword | Suggestions | Assessment |
---|---|---|
interacting | interacting | Match: normal use |
interacting | Interacting | Match: title or start of sentence |
interacting | INTERACTING | Match: title or emphasis |
For other kinds of terms, such as chemical or molecule names, the capitalisation employed by such terms is critical to their correct interpretation, and as a result such terms can be matched in a case-sensitive fashion.
Name | Suggestions | Assessment |
---|---|---|
Ca | Ca | Match: reference to calcium |
Ca | CA | Non-match: acronym |
Where gene and protein names are concerned, the matching strategy criteria can be made somewhat more complicated, potentially choosing a case-sensitive strategy where names are shorter than a specified length, and choosing a case-insensitive strategy otherwise.
Name | Suggestions | Assessment |
---|---|---|
Rat | Rat | Match: possible reference to Rat (Entrez Gene #5658213) |
Rat | rat | Non-match: probable reference to animal |
Rat | RAT | Non-match: probable reference to acronym |
LYSOPLA | LYSOPLA | Match: probable reference to named entity |
LYSOPLA | LysoPLA | Match: probable reference to named entity |
LYSOPLA | lysopla | Match: probable reference to named entity |
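The selection rules described in this section could be expressed along the following lines. The term type labels, the field names and the length threshold of five characters are assumptions chosen so that the examples in the tables above behave as shown; the actual selector may use different criteria.

```python
# Hypothetical threshold: gene and protein names shorter than this many
# characters are matched case-sensitively.
SHORT_NAME_LENGTH = 5

def select_field(term, term_type):
    """Return the (hypothetical) index field to search for a term.

    "exact" preserves case; "folded" discards it.
    """
    if term_type == "keyword":
        return "folded"
    if term_type in ("chemical", "molecule"):
        return "exact"
    if term_type in ("gene", "protein"):
        return "exact" if len(term) < SHORT_NAME_LENGTH else "folded"
    raise ValueError("unknown term type: %r" % term_type)

choices = {
    term: select_field(term, term_type)
    for term, term_type in [
        ("interacting", "keyword"),
        ("Ca", "chemical"),
        ("Rat", "gene"),
        ("LYSOPLA", "protein"),
    ]
}
```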
Variations within Search Terms
Although it would be desirable if every reference to an entity of interest were written in the same way, matching exactly the form of that entity in the various databases, such references typically appear with a degree of variation from one document to the next. Consider the following example:
Name | Variations | Examples |
---|---|---|
B7-2 | B7-2 | PubMed #12835481 |
B7-2 | beta 7-beta | PubMed #18387634 |
In effect, authors use abbreviations and alternative forms for parts of names, and to detect a mention of a name, such features must also be considered when searching.
Tokenising the Search Term
In order to find a given search term in a text index, the term must be broken up into tokens which are compatible with those employed during the indexing process. This should follow intuitively from the realisation that if a specific piece of text is broken up into tokens during indexing, and if such tokens (as terms in the index's term dictionary) are the only means of discovering such pieces of text, then the search term must be broken up in a way which exposes the same tokens and allows for the appropriate query (or queries) to be issued to the index.
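To make the requirement concrete, the following sketch applies one tokeniser to both the indexed sentences and the search term, then looks for the term's tokens at consecutive positions. The simple Treebank-style tokeniser and the in-memory index are stand-ins chosen for brevity, not the Bioscape implementation.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[A-Za-z0-9]+|[^A-Za-z0-9\s]")

def tokenise(text):
    """Tokenise text; the same function serves indexing and searching."""
    return TOKEN.findall(text)

def build_index(sentences):
    """Map each token to the set of (sentence_id, position) pairs where it occurs."""
    index = defaultdict(set)
    for sentence_id, sentence in enumerate(sentences):
        for position, token in enumerate(tokenise(sentence)):
            index[token].add((sentence_id, position))
    return index

def find_term(index, term):
    """Return (sentence_id, start) for each occurrence of the tokenised term."""
    tokens = tokenise(term)
    if not tokens:
        return []
    matches = []
    for sentence_id, start in sorted(index.get(tokens[0], ())):
        # Every later token must appear at the next consecutive position.
        if all((sentence_id, start + offset) in index.get(token, ())
               for offset, token in enumerate(tokens)):
            matches.append((sentence_id, start))
    return matches

index = build_index([
    "Expression of p16(INK4a) was reduced.",
    "PAR-1 and p16 were unaffected.",
])
```

Had the term "p16(INK4a)" been split on whitespace alone, it would have produced a single token that does not exist in this index, and the search would have found nothing.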
Deriving Terms from Other Terms
To be completed.
Locating Matches
Given a resource providing search terms and a selector suggesting the fields in which such terms should be found, a locator is used to coordinate the activities of term variation and derivation described above using a source object. For each term, the source performs the following tasks:
- Identification of the varying parts of the term.
- Substitution of alternative parts where variation is permitted.
- Retrieval of result data for each part and its alternatives.
Such result data is then encapsulated in a phrase object for more convenient navigation of the results.
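The three tasks listed above might be sketched as follows. The alternatives table, the function names and the shape of the returned data are invented for illustration; in particular, the mapping from "B" to "beta" is only an example of the kind of substitution involved.

```python
from itertools import product

def part_alternatives(tokens, alternatives):
    """For each part of a term, list the part itself followed by its alternatives."""
    return [[token] + [alt for alt in alternatives.get(token, []) if alt != token]
            for token in tokens]

def expand_variations(tokens, alternatives):
    """Return every token sequence obtained by substituting alternatives."""
    return [list(choice) for choice in product(*part_alternatives(tokens, alternatives))]

def retrieve_parts(tokens, alternatives, index):
    """For each part, map each alternative to its sorted positions in the index."""
    return [{alt: sorted(index.get(alt, ())) for alt in alts}
            for alts in part_alternatives(tokens, alternatives)]

# Hypothetical substitution table and toy index of (sentence_id, position) pairs.
ALTERNATIVES = {"B": ["beta"]}
INDEX = {"B": {(0, 0)}, "beta": {(1, 0)}, "7": {(0, 1), (1, 1)}}

variations = expand_variations(["B", "7"], ALTERNATIVES)
part_results = retrieve_parts(["B", "7"], ALTERNATIVES, INDEX)
```

The per-part result data in `part_results` is the kind of information that a phrase object would then combine to find positions where the parts occur in sequence.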
Writing Search Results
To be completed.