Bioscape Methods

From irefindex
Revision as of 12:37, 22 July 2009 by PaulBoddie (talk | contribs) (→‎Term Scoring: Added some descriptive text.)

Please note that this documentation covers an unreleased product and is for internal use only.


This document describes the role of methods in Bioscape.

Processing, Methods and Scoring

The processing pipeline of Bioscape can be summarised as follows:

  1. Import information about biological entities (genes, proteins), also known as bioentities.
  2. Build a lexicon consisting of names associated with the imported entities as well as more general terms associated with other kinds of data.
  3. Search biomedical literature using the contents of the lexicon, subject to filtering.
  4. Assign bioentities to the text search results.

At each stage in the pipeline, Bioscape applies methods which assess the value or suitability of the information in use by scoring it against particular criteria. The following kinds of methods are applied:

  1. Term scoring: assessing whether a term (or name) should be used in text searches.
  2. Search scoring: assessing whether a bioentity should be assigned to a text search result.
  3. Sentence scoring: assessing whether a sentence has a particular importance.
  4. Result scoring: assessing whether a result (combining bioentity and textual information) is genuine.

Examples of methods are given below.

Term Scoring

Term scoring methods assess the suitability of various terms for searching purposes, and they are principally divided into two groups: positive and negative.

Positive scoring methods assess whether a term satisfies a number of desirable criteria, thus belonging to a group of terms with desirable search properties. Such groups are typically (or at least initially) those which contain terms of most interest. However, one may also want to search for non-members of such groups later.

Examples of positive scoring methods include "human name" and "WordNet", indicating respectively whether terms are used to refer to human bioentities and whether terms are mentioned in a common English word dictionary.

Negative scoring methods assess whether a term fails to satisfy a number of criteria. Such criteria are typically associated with undesirable search properties and are common to groups of uninteresting search terms (such as common English words or numbers). Terms excluded from such groups can be considered interesting, although one may later also want to search for the uninteresting terms which do satisfy one or more of the criteria.

Examples of negative scoring methods include "not WordNet" and "not number", indicating respectively whether terms are not mentioned in a common English word dictionary and whether terms are not numbers.

Despite the naming convention used above, positive and negative scoring methods share a single scoring convention: interesting terms carry a score of 1 and uninteresting terms carry a score of 0, where "interesting" and "uninteresting" must be considered within the context of subsequent processing.

Thus, a term scored using the "human name" method and being assigned a score of 1 would be considered interesting, as would a term scored using the "not number" method and being assigned a score of 1, even though the latter assignment is based on a negative observation.

Note that negative and positive scoring methods may be complementary: a term assigned a score of 1 with the "WordNet" method will be assigned a score of 0 with the "not WordNet" method. It is left to subsequent processing of the scores to choose the method which best reflects the level of desirability or interest that a term has for that processing and for any results produced.
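The scoring convention described above can be illustrated with a minimal Python sketch. It is not the Bioscape implementation: the function names, the tiny word list standing in for WordNet and the set of human names are all assumptions made for illustration.

```python
import re

# Hypothetical stand-in for a common English word dictionary such as WordNet.
COMMON_WORDS = {"cell", "protein", "was"}

# Hypothetical stand-in for names known to refer to human bioentities.
HUMAN_NAMES = {"cse1", "cas"}


def score_human_name(term):
    """Positive method: 1 if the term names a human bioentity, else 0."""
    return 1 if term.lower() in HUMAN_NAMES else 0


def score_wordnet(term):
    """Positive method: 1 if the term is in the word dictionary, else 0."""
    return 1 if term.lower() in COMMON_WORDS else 0


def score_not_wordnet(term):
    """Negative method: the complement of the "WordNet" score."""
    return 1 - score_wordnet(term)


def score_not_number(term):
    """Negative method: 1 if the term is not a plain number, else 0."""
    return 0 if re.fullmatch(r"[+-]?\d+(\.\d+)?", term) else 1


# Method names follow the examples in the text; the mapping itself is assumed.
METHODS = {
    "human name": score_human_name,
    "WordNet": score_wordnet,
    "not WordNet": score_not_wordnet,
    "not number": score_not_number,
}


def score_term(term):
    """Apply every method to a term and return a score per method name."""
    return {name: method(term) for name, method in METHODS.items()}
```

For example, `score_term("CAS")` gives a score of 1 for "human name", "not WordNet" and "not number", and 0 for "WordNet". How these scores are weighed against each other is left to later processing, as noted above.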

Search Scoring

Sentence Scoring

Result Scoring

The scoring of results involves inspecting the proposed bioentities and assessing their suitability in a particular document location. Such assessment methods employ the following approaches:

Effect of method:

  1. Confirm bioentity relevance (scoring supported bioentities positively).
  2. Disambiguate bioentities (identifying unsupported bioentities and scoring them negatively).

Means of assessment:

  1. Find supporting contextual information.
  2. Find more supporting contextual information for the "best" bioentities.
  3. Compare bioentities in order to identify the "best" bioentities.

Methods which confirm bioentity relevance may be combined to test whether a mention satisfies the criteria from all such methods. However, where a mention need only satisfy the criteria from a single method, it is more appropriate to combine disambiguation methods, since these should only exclude mentions on a conservative basis.
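One possible reading of the two ways of combining methods is sketched below in Python. The mention record, its field names and the three example methods are hypothetical. The sketch assumes that confirmation methods return true when a mention is supported, and that disambiguation methods return true only when a mention should be excluded.

```python
# Hypothetical mention record; the field names are illustrative only.
mention = {
    "name": "CAS",
    "context_hits": 2,        # supporting contextual information found
    "other_names": 1,         # other names supporting the same bioentity
    "outcompeted": False,     # a competing bioentity has better support
}


def confirm_by_context(m):
    """Hypothetical confirmation method: some supporting context exists."""
    return m["context_hits"] > 0


def confirm_by_other_names(m):
    """Hypothetical confirmation method: another name supports the bioentity."""
    return m["other_names"] > 0


def exclude_if_outcompeted(m):
    """Hypothetical disambiguation method: exclude only clear losers."""
    return m["outcompeted"]


CONFIRMERS = [confirm_by_context, confirm_by_other_names]
EXCLUDERS = [exclude_if_outcompeted]


def confirmed_by_all(m, confirmers):
    """Combined confirmation: the mention must satisfy every method."""
    return all(method(m) for method in confirmers)


def survives_disambiguation(m, excluders):
    """Combined disambiguation: keep the mention unless some method excludes it."""
    return not any(method(m) for method in excluders)
```

Under these assumptions, a mention with no contextual support fails `confirmed_by_all` but still passes `survives_disambiguation`, because nothing has positively ruled it out. This matches the conservative behaviour expected of disambiguation methods.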

Competing Names

PubMed #7479798: gene #1434 is referenced by the names CSE1 and CAS, but CAS is ambiguous because it also refers to other genes. Since those other genes are not supported by any other names, CAS is interpreted as also being a reference to gene #1434.
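The reasoning in this example can be sketched in Python as follows. This is an illustration, not the Bioscape algorithm: the function name, the lexicon structure and the placeholder identifiers "other-A" and "other-B" (standing in for the other genes referenced by CAS) are assumptions.

```python
def resolve_names(names_in_document, lexicon):
    """Resolve each name to the genes that are also supported by another name.

    lexicon maps a name to the set of gene identifiers it may refer to.
    If no candidate gene has support from another name, all candidates
    are kept, so that nothing is excluded without evidence.
    """
    # Collect, for each gene, the names in the document which reference it.
    support = {}
    for name in names_in_document:
        for gene in lexicon.get(name, ()):
            support.setdefault(gene, set()).add(name)

    resolved = {}
    for name in names_in_document:
        candidates = set(lexicon.get(name, ()))
        # Keep the genes referenced by at least one other name.
        corroborated = {g for g in candidates if support[g] - {name}}
        resolved[name] = corroborated if corroborated else candidates
    return resolved


# Hypothetical lexicon: "other-A" and "other-B" are placeholders.
lexicon = {
    "CSE1": {"1434"},
    "CAS": {"1434", "other-A", "other-B"},
}

result = resolve_names(["CSE1", "CAS"], lexicon)
```

Here `result["CAS"]` contains only gene 1434, because it is the only candidate that another name (CSE1) also references.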