|Note||Please note that this documentation covers an unreleased product and is for internal use only.|
Bioscape combines a number of processes in order to produce search results from textual documents and other data. This document describes each of the processes involved in an approximate order of application.
- 1 Text Indexing
- 2 Search Overview
- 3 Search Strategies
- 4 Reading Search Terms
- 5 Selecting Search Resources
- 6 Tokenising the Search Term
- 7 Deriving Terms from Other Terms
- 8 Variations within Search Terms
- 9 Locating Matches
- 10 Phrases and Validation
- 11 Writing Search Results
Text Indexing
Before any searching can begin, the documents to be searched must be acquired and processed so that search operations can be performed on them efficiently. To achieve this, the text of these documents is tokenised and presented to a text indexing solution such as Lucene or Xapian, which permits the retrieval of token (or term) location information via a data structure known as an inverted index.
Although it might be useful to use a single document as the minimal unit of stored information, there are advantages in considering smaller units of information such as the individual sentences in each document. By identifying sentences, it then becomes easier to determine whether results occur together in the same sentence as well as in the same document. Bioscape employs a relatively simple regular expression approach which incorporates exceptions based on abbreviations and other distractions to split documents into sentences.
Consequently, the notion of a document as presented to the text indexing solution is equivalent to a sentence as found in the original document text, with the membership of sentences (that is, indexing solution "documents") in original documents also recorded in the system.
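The regular-expression approach to sentence splitting can be sketched as follows. This is a minimal illustration only: the abbreviation list here is invented for the example, and Bioscape's actual exception handling (in the bsadmin code) is more extensive.

```python
import re

# Illustrative abbreviations that should not end a sentence; Bioscape's
# actual exception list is more extensive.
ABBREVIATIONS = {"e.g.", "i.e.", "et al.", "Fig.", "cf."}

def split_sentences(text):
    """Split text on sentence-ending punctuation followed by whitespace
    and an upper-case letter, skipping known abbreviations."""
    sentences = []
    start = 0
    for match in re.finditer(r'[.!?](?=\s+[A-Z])', text):
        end = match.end()
        candidate = text[start:end].strip()
        # Do not split if the candidate ends with a known abbreviation.
        if any(candidate.endswith(abbr) for abbr in ABBREVIATIONS):
            continue
        sentences.append(candidate)
        start = end
    remainder = text[start:].strip()
    if remainder:
        sentences.append(remainder)
    return sentences
```

Each resulting sentence would then be indexed as a separate "document", with its membership in the original document recorded separately.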
Within each sentence, the text is broken down into its constituent parts using a process known as tokenisation; this process determines which character sequences will be available in the inverted index for searching. Many search applications employ tokenisation practices which seek to break texts down into sequences of natural language words and names, often discarding symbols, punctuation and even variations in spelling and inflection, and potentially applying many other transformations to account for grammatical variations in the way "root words" may be written. Since the purpose of Bioscape is to identify named bioentities, whose names may consist of multiple words and include punctuation and symbols, naive tokenisation approaches such as splitting on whitespace prove insufficient for breaking texts down into recognisable units, whereas linguistic techniques such as stemming have only limited applicability to typical bioentity names.
One approach, similar to the frequently referenced Treebank tokenisation method, is to define the following mutually exclusive classes of token: alphanumeric words, whitespace characters, and all other characters. However, in this approach whitespace characters act only as delimiters, establishing boundaries between tokens, and are not produced as concrete tokens in the output. Thus only alphanumeric words and other (non-whitespace) characters are produced as distinct token classes by a tokenisation process operating in this way: a name such as "IL-2" yields the tokens "IL", "-" and "2".
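A tokeniser of this kind can be sketched with a single regular expression (a minimal illustration of the scheme, not the Treebank tokeniser itself):

```python
import re

# Alphanumeric runs form one token class; every other non-whitespace
# character is emitted as a single-character token; whitespace only
# delimits tokens and is discarded.
TOKEN_PATTERN = re.compile(r'[A-Za-z0-9]+|[^\sA-Za-z0-9]')

def tokenise(text):
    return TOKEN_PATTERN.findall(text)
```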
By default, Bioscape uses a somewhat more complicated regular expression approach which attempts to identify a number of token classes:
- digits and numbers
- upper-case characters
- lower-case characters
- upper-case Roman numerals
- lower-case Roman numerals
- Greek letter suffixes (written using Latin characters)
- "h" prefixes
- "p" suffixes
- plus and minus signs
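A class-based tokenisation of this kind might be sketched as follows. The actual patterns live in the bioscape.regexp module; the class membership and ordering here are simplified assumptions covering only a few of the classes listed above.

```python
import re

# Simplified sketch of class-based tokenisation: alternatives are tried
# in order, so more specific classes (Greek letter names, Roman
# numerals) take precedence over plain letter runs.
TOKEN_CLASSES = re.compile(r'''
    (?P<number>[0-9]+)
  | (?P<greek>alpha|beta|gamma|delta)
  | (?P<roman_upper>[IVX]+(?![A-Za-z]))
  | (?P<roman_lower>[ivx]+(?![A-Za-z]))
  | (?P<upper>[A-Z]+)
  | (?P<lower>[a-z]+)
  | (?P<sign>[+-])
''', re.VERBOSE)

def classify(text):
    """Return (token class, token text) pairs for a piece of text."""
    return [(m.lastgroup, m.group()) for m in TOKEN_CLASSES.finditer(text)]
```

Note how the lookahead assertions prevent a name such as "ILK" from being misread as a Roman numeral followed by letters.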
The code which performs tokenisation can be found in the bioscape.regexp module in the bsadmin distribution. A discussion of tokenisation practices and their effects on indexing and searching can be found in the "Bioscape Tokenisation" document.
For many search solutions, it may not be critical to insist that the case of characters in text match exactly the case employed in search terms. For example, it might be acceptable for the word "protein" to match the occurrences "protein", "Protein" and even "PROTEIN" with such solutions, and for various kinds of search terms, the same matching criteria still apply within Bioscape. However, it can be undesirable to permit such case-insensitive matches for certain kinds of terms, and there is therefore a need to record tokens which permit both case-insensitive and case-sensitive matching.
To support both kinds of matching, separate fields are used in the text indexing solution: one stores tokens with case information intact, for case-sensitive matching; one stores tokens with case information discarded, for case-insensitive matching. Thus, when a term is presented to the indexing solution for searching, the appropriate field can be chosen as the basis of the search in order to provide the appropriate matching behaviour.
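The two-field arrangement can be illustrated with a toy inverted index (a sketch only; Bioscape delegates this work to Lucene or Xapian):

```python
from collections import defaultdict

class TwoFieldIndex:
    """Toy inverted index storing each token twice: once with case
    intact (the "cased" field) and once lower-cased (the "uncased"
    field), so either matching behaviour can be selected at search
    time."""

    def __init__(self):
        self.fields = {"cased": defaultdict(set), "uncased": defaultdict(set)}

    def add(self, doc_id, tokens):
        for position, token in enumerate(tokens):
            self.fields["cased"][token].add((doc_id, position))
            self.fields["uncased"][token.lower()].add((doc_id, position))

    def search(self, token, case_sensitive):
        if case_sensitive:
            return self.fields["cased"].get(token, set())
        return self.fields["uncased"].get(token.lower(), set())
```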
Search Overview
The general search pipeline in Bioscape proceeds as follows:
For predefined terms:
- a reader reads each term from a list of terms
- a selector determines where to search for the term
- a source consults the underlying index for position information
- a phrase combines term position information
- validation reconciles position information with the actual indexed text
- a writer records the eventual results
For speculative searches, the reader, selector and source are omitted, since these components cannot be defined for such searches: the "logic" defining the nature of the information to be found is instead defined in the phrase component.
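The chain of components for predefined searches can be sketched as a simple driver loop. The component names follow the description above, but the interfaces are illustrative assumptions, not Bioscape's actual APIs.

```python
def run_search(reader, selector, source, phrase_factory, validate, writer):
    """Drive the predefined-search pipeline: the reader yields terms,
    the selector picks a field, the source returns position
    information, a phrase combines positions into candidate matches,
    and validated matches are passed to the writer."""
    for term in reader:
        field = selector(term)
        positions = source(term, field)
        phrase = phrase_factory(term, positions)
        for match in phrase:
            if validate(match):
                writer(term, match)
```

For speculative searches, the reader, selector and source stages would be absent, with the phrase component supplying the matching logic directly.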
Search Strategies
There are two supported forms of searching in Bioscape:
- Using terms from predefined lexicons and searching for these in the indexed documents.
- Speculatively searching for certain patterns in the indexed documents.
Although the latter form of searching is useful for gathering contextual information (such as mentions of phrases such as "... chromosome" or "chromosome ..."), the former is more pertinent in any discussion of the approach taken in Bioscape to search for bioentities.
Predefined Searching Using Lexicons
Given source information about bioentities, obtained from a database such as Entrez Gene, the names employed by authors and researchers when identifying such bioentities can be made available to Bioscape. Using such raw data, it is then possible to prepare a lexicon of reasonable size whose contents can then be used to search for mentions of bioentities in the literature.
In order to compile a lexicon for a particular kind of search term, such as a lexicon providing gene and protein names, a query is made against the Bioscape database tables that retain name information, qualified using the appropriate term types (such as "protein" and "gene"). The results are written to an export file using standard database export mechanisms (such as the COPY command provided by PostgreSQL and other database systems). Files of this nature are used as the basis for subsequent searching.
To search for predefined terms, each lexicon is accessed by a component known as a reader, and each term is then assessed and processed by other components (as described in later sections). Since each term is predefined and is associated with a particular identifier in the database, this identifier will be propagated to any search results.
Speculative Searching
Certain kinds of information are distinguished from predefined information by lacking a well-defined or fixed representation, or by not being conveniently represented by specific enumerated entries in a database. For example, one might wish to identify parentheses (the regions enclosed by brackets) within a text: such regions cannot reasonably be predefined, since they are only known once they have been found, at which point there is no need to record them in a lexicon in order to search for them.
Consequently, speculative searching is somewhat less rigid than searching for predefined patterns or representations, since the latter focuses on such representations in order to find matching regions of text, whereas the former is free to use a variety of techniques. In the case of parenthesis identification, since the bracket characters are the only "known" parts of the regions to be discovered, the searching strategy focuses on sentences providing an opening and a closing bracket in the appropriate order, with any consideration of the enclosed text being a secondary concern. Upon recording a matching region as a search result, since no predefined notion of the region exists, a special identifier needs to be used in association with the details of the result.
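A speculative search for parenthesised regions can be sketched as follows (an illustration only; as noted above, the enclosed text is a secondary concern, and nested brackets are ignored in this sketch):

```python
import re

# Find regions enclosed by a matching pair of brackets within a
# sentence, in the appropriate opening-then-closing order.
PARENTHESIS = re.compile(r'\(([^()]*)\)')

def find_parentheses(sentence):
    """Return (start, end, enclosed_text) for each bracketed region."""
    return [(m.start(), m.end(), m.group(1))
            for m in PARENTHESIS.finditer(sentence)]
```

Each region found would be recorded as a search result under a special identifier, since no predefined database entry exists for it.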
Reading Search Terms
For each kind of search, a list of search terms (also known as a lexicon) is prepared from bioentity information derived from source data. This is done by executing queries against the Bioscape database, selecting specific term types (from the text_term table), possibly filtering terms using scoring information (from the text_method and text_termscore tables) and exporting the results to a file in tab-separated form. It is this export file which is then consulted by the reader component when providing terms for searching.
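A reader over such an export file might be sketched as follows. The two-column layout (identifier, then term) is an assumption made for illustration; the actual export may carry additional columns.

```python
import csv

def read_lexicon(lines):
    """Yield (identifier, term) pairs from tab-separated lines, as
    produced by a COPY-style database export. The identifier-then-term
    column order is assumed for illustration."""
    for row in csv.reader(lines, delimiter="\t"):
        if row:
            yield row[0], row[1]
```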
Selecting Search Resources
As described above, both case-insensitive and case-sensitive matching are supported for search terms, and an appropriate strategy is chosen depending on the nature of each term. In Bioscape, the selection of a strategy for matching a given term is performed by a component known as a selector (or, more precisely, a field selector). For keyword terms, it is acceptable to match in a case-insensitive fashion, since such terms resemble standard English words which could be capitalised in various ways.
|interacting||interacting||Match: normal use|
|Interacting||Match: title or start of sentence|
|INTERACTING||Match: title or emphasis|
For other kinds of terms, such as chemical or molecule names, the capitalisation employed by such terms is critical to their correct interpretation, and as a result such terms can be matched in a case-sensitive fashion.
|Ca||Ca||Match: reference to calcium|
Where gene and protein names are concerned, the matching strategy criteria can be made somewhat more complicated, potentially choosing a case-sensitive strategy where names are shorter than a specified length, and choosing a case-insensitive strategy otherwise.
|Rat||Rat||Match: possible reference to Rat (Entrez Gene #5658213)|
|rat||Non-match: probable reference to animal|
|RAT||Non-match: probable reference to acronym|
|LYSOPLA||LYSOPLA||Match: probable reference to named entity|
|LysoPLA||Match: probable reference to named entity|
|lysopla||Match: probable reference to named entity|
Where very common words (stop words) are combined with other common tokens, it can be advisable to forbid case-insensitive matching:
|ALL-1||ALL-1||Match: possible reference to ALL-1 (Entrez Gene #4297)|
|all 1||Non-match: region of text provides token matches (the text could read "when all 1 and 2 forms" or something equally plausible)|
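These heuristics can be sketched as a selector function. The term-type names, the length threshold and the stop-word list below are illustrative assumptions, not Bioscape's actual configuration.

```python
# Illustrative stop words and length threshold; Bioscape's actual
# configuration differs.
STOP_WORDS = {"all", "was", "can", "not"}
SHORT_NAME_LENGTH = 6

def select_field(term, term_type):
    """Return "cased" for case-sensitive matching or "uncased" for
    case-insensitive matching, following the heuristics described
    above."""
    if term_type == "keyword":
        return "uncased"
    if term_type in ("chemical", "molecule"):
        return "cased"
    # Gene/protein names: names built on stop words and short names
    # match case-sensitively; longer names may match insensitively.
    words = term.replace("-", " ").split()
    if any(w.lower() in STOP_WORDS for w in words):
        return "cased"
    if len(term) < SHORT_NAME_LENGTH:
        return "cased"
    return "uncased"
```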
The field selector thus maps each predefined term to the search fields and terms to be used when consulting the index.
Tokenising the Search Term
In order to find a given search term in a text index, the term must be broken up into tokens which are compatible with those employed during the indexing process. This should follow intuitively from the realisation that if a specific piece of text is broken up into tokens during indexing, and if such tokens (as terms in the index's term dictionary) are the only means of discovering such pieces of text, then the search term must be broken up in a way which exposes the same tokens and allows for the appropriate query (or queries) to be issued to the index.
Deriving Terms from Other Terms
Although variations of search terms can encompass many of the forms of such terms in the literature being searched, it can also be necessary to generate distinct terms potentially having a different structure to that of the original term. Consider the following example:
|polymerase (DNA directed), beta||polymerase beta||PubMed #14755728|
In the above example, it is necessary to discard a section in parentheses in order to recognise the usage of the term in the text. A number of forms of derivation exist in Bioscape:
- No change: the term's original form is retained
- Flattened terms: brackets are removed
- Parenthesis removal: parenthesised sections are removed completely
- Word relocation: some words (defined in bsindex.words, typically words like beta and symbols like ii) are relocated from the end of a term to the start
Generally, derived forms of a term suggest different ways of searching for that term which cannot easily be incorporated into the process of generating simple variations of the term (as described below), mostly because the variation generation process focuses on the relevance of individual tokens, whereas the process of generating derived terms may also consider the relevance of entire sections of tokens in a term. The resulting derived terms are subject to orthographic variation.
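The derivation forms above can be sketched as follows. The relocatable-word set stands in for the word list in bsindex.words and is illustrative only.

```python
import re

# Stand-in for the word list in bsindex.words: words and symbols that
# may be relocated from the end of a term to the start.
RELOCATABLE = {"beta", "ii"}

def derive_terms(term):
    """Return the term plus derived forms: flattened (brackets
    removed), parenthesis-removed, and word-relocated variants."""
    derived = [term]
    flattened = re.sub(r'[()]', '', term)
    if flattened != term:
        derived.append(flattened)
    removed = re.sub(r'\s*\([^()]*\)', '', term)
    if removed != term:
        derived.append(removed)
    words = term.split()
    if len(words) > 1 and words[-1].lower() in RELOCATABLE:
        derived.append(" ".join([words[-1]] + words[:-1]))
    return derived
```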
The term filter in the source component maps search term tokens to lists of tokens for derived names. For example, the term "polymerase (DNA directed), beta" yields both its original form and the flattened form "polymerase DNA directed, beta".
Variations within Search Terms
Although it would be desirable if every reference to an entity of interest were written in the same way, matching exactly the form of that entity in the various databases, such references typically appear with a degree of variation from one document to the next. Consider the following examples:
|beta 7-beta||PubMed #18387634|
In effect, authors use abbreviations and alternative forms for parts of names, and to detect a mention of a name, such features must also be considered when searching. Consequently, Bioscape recognises a number of different levels of variation:
- No variation at all: this means that all tokens of a given term must match literally
- Simple variation: where some kinds of tokens (generally symbols, as defined in the missable attribute of the bsindex.variations module) can be omitted when matching the tokens of a term to the text
- Tabular variation: where some tokens can be replaced by equivalent tokens (such as 2 being equivalent to beta), optionally acting in a "stateful" fashion where such replacements can be confined to certain parts of a term or word
Generally, the variations of a term incorporate alternative tokens for certain parts of a term or permit the absence of tokens - the ordering of tokens is typically left unchanged, since structural changes are dealt with by deriving new terms which may incorporate a different token order (as described above) - all with the purpose of permitting a flexible, yet relatively uncomplicated match against the text index.
A number of variation finder classes are defined in the bsindex.variations module (in the bsindex distribution). Instances of these classes work with the source component used in a searching activity to indicate how a particular term can be varied and to provide a selection of alternatives for various tokens or parts in that term. The output from a variation finder is a sequence with each element corresponding to a token or part in the search term, indicating whether the token must be found literally, whether it can be omitted, or which selection of alternatives must be considered as legitimate matches for the token. Consider the following example:
For example, the variation assessments for a term's tokens might be, in order: the alternatives b, B, beta, BETA; NO_VARIATION; MISSABLE; and the alternatives 2, b, B, beta, BETA, ii, II.
Upon requesting information from the variation finder, the source component will receive instructions on how it is to proceed in searching for each of the tokens, and whether it should search for other tokens in order to successfully detect variations of a particular term.
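A variation finder's output can be sketched as follows. The assessment markers echo those described above, but the equivalence table and missable-token set are illustrative, standing in for the definitions in the bsindex.variations module.

```python
# Markers echoing the assessments described above.
NO_VARIATION = "NO_VARIATION"
MISSABLE = "MISSABLE"

# Illustrative equivalence table; bsindex.variations defines the real one.
ALTERNATIVES = {
    "b": ["b", "B", "beta", "BETA"],
    "2": ["2", "b", "B", "beta", "BETA", "ii", "II"],
}
MISSABLE_TOKENS = {"-", "(", ")", ","}

def assess(tokens):
    """Return one assessment per token: a list of alternatives where
    tabular variation applies, MISSABLE for omittable symbols, and
    NO_VARIATION otherwise."""
    assessments = []
    for token in tokens:
        key = token.lower()
        if key in ALTERNATIVES:
            assessments.append(ALTERNATIVES[key])
        elif token in MISSABLE_TOKENS:
            assessments.append(MISSABLE)
        else:
            assessments.append(NO_VARIATION)
    return assessments
```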
The variation finder in the source component thus maps search term tokens to search term variations: for example, one token may yield the variations b, B, beta and BETA, and another the variations 2, b, B, beta, BETA, ii and II.
Locating Matches
Given a resource providing search terms and a selector suggesting the fields in which such terms should be found, a locator is used to coordinate the activities of term variation and derivation described above using a source object. For each term, the source performs the following tasks:
- Derivation of alternative search terms.
- Identification of the varying parts of each term.
- Substitution of alternative parts where variation is permitted.
- Retrieval of result data for each part and its alternatives.
Internally, the source object combines the derivation of new terms with the generation of variations, mapping search term tokens to search term variations for those tokens.
The search result data is then encapsulated in a phrase object for more convenient navigation of the results.
Phrases and Validation
A phrase incorporates information about a sequence of term parts, each part potentially offering a number of alternative tokens at a particular location within a term. Along with this information, actual information about the occurrence of such tokens in the text index is also retained. It is the phrase's job to find correspondences between the occurrence information for the phrase's tokens, discovering regions of the original text where such tokens appear together in a way which suggests that the original term appears there.
Consider a search for the term PAR-1, with position information for its tokens in a document: if the token 1 occurs at positions 3, 6, 15 and 19, only the occurrence adjacent to the term's other tokens yields a genuine candidate, namely PAR-1 (at position 18).
Although a simple example such as the one above with no orthographic variation involved may be able to convince us that the search term appears at a particular location, there may be situations where the token locations alone cannot confirm whether the term is genuinely present in the text, or whether a region of text resembling the term has been found. Thus, the process of validation is employed to inspect the original text and to test the suggested occurrence for certain properties.
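The correlation of token positions can be sketched as follows (a simplified view: consecutive positions signal a candidate occurrence, whereas real phrase handling must also account for missable tokens and alternatives):

```python
def find_phrase(token_positions):
    """Given, for each token of a term in order, the set of positions
    at which it occurs, return the start positions where the tokens
    appear consecutively."""
    starts = []
    for start in sorted(token_positions[0]):
        if all(start + offset in positions
               for offset, positions in enumerate(token_positions)):
            starts.append(start)
    return starts
```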
Consider the gene name HM3. A search for this name by finding token positions might match a region of text containing the characters HMG (since 3 is equivalent to G in the orthographic variation scheme). Should this matching text be followed by other characters which influence the interpretation of the text, for instance the wider region HMG2 (as in PubMed #10212205), then the validation process should consider such adjacent characters and disallow the potential match: we would effectively have found HM32, which would not be a likely way of mentioning the search term.
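The adjacency check described above can be sketched as a simple validation test. Treating alphanumeric neighbours as disqualifying is an assumption made for this illustration; the actual validation inspects further properties of the text.

```python
def validate_match(text, start, end):
    """Reject a candidate match when it is immediately preceded or
    followed by an alphanumeric character, since such characters
    suggest the match is part of a longer name (e.g. HMG within HMG2)."""
    if start > 0 and text[start - 1].isalnum():
        return False
    if end < len(text) and text[end].isalnum():
        return False
    return True
```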
Writing Search Results
Information about search results is written to a tab-separated format file by a writer component, where the exact format of the file depends on the kind of information being investigated. Positioned results incorporate information about character offsets, whereas unpositioned results omit this and other details.
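A writer for positioned results might be sketched as follows. The column layout used here (identifier, document, sentence, start, end) is an illustrative assumption; as noted above, the exact format depends on the kind of information being investigated.

```python
import csv
import io

def write_results(results, stream):
    """Write result tuples in tab-separated form; the column layout
    here (identifier, document, sentence, start, end) is illustrative."""
    writer = csv.writer(stream, delimiter="\t", lineterminator="\n")
    for row in results:
        writer.writerow(row)

# Usage: positioned results carry character offsets.
out = io.StringIO()
write_results([("4297", "10212205", 3, 18, 23)], out)
```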