Bioscape Methods

From irefindex
Revision as of 18:48, 25 February 2010 by PaulBoddie (talk | contribs) (Added filtering and scoring notes.)

Please note that this documentation covers an unreleased product and is for internal use only.


This document describes the role of methods in Bioscape.

Processing, Methods and Scoring

The processing pipeline of Bioscape can be summarised as follows:

  1. Import information about biological entities (genes, proteins), also known as bioentities.
  2. Build a lexicon consisting of names associated with the imported entities as well as more general terms associated with other kinds of data.
  3. Search biomedical literature using the contents of the lexicon, subject to filtering.
  4. Assign bioentities to the text search results.

At each stage in the pipeline, Bioscape applies methods that assess the value or suitability of the information involved by assigning it scores according to particular criteria. The following kinds of methods are applied:

  1. Term scoring: assessing whether a term (or name) should be used in text searches.
  2. Search scoring: assessing whether a bioentity should be assigned to a text search result.
  3. Sentence scoring: assessing whether a sentence has a particular importance.
  4. Result scoring: assessing whether a result (combining bioentity and textual information) is genuine.

A complete list of scoring methods and their default scores can be found here. Examples of methods are given below.

Filtering and Scoring

In addition to scoring results, Bioscape filters results at various points for use in subsequent activities. For example, terms may be scored in such a way that low-scoring terms are excluded from consideration as search term candidates. If the human_name method were applied for filtering purposes, all resulting search term candidates would produce results that are implicitly concerned with mentions of human names. Textual search results produced by such terms would therefore give a positive result for the human_name method, even though that method is in principle a term scoring method.

The mechanism by which a method's score is evaluated employs a table containing filtering information for any given result set:

  1. If a result has an associated score for a method, this score is taken.
  2. Otherwise, if a method has been used to filter results, a score of 1 is taken.
  3. Otherwise, the method's unscored default value is taken.
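The three rules above can be sketched as a small lookup function. This is only an illustration: the function name, the dictionary and set arguments, and the default values are assumptions made for the example and do not describe the actual filtering table.

```python
def resolve_score(
    result_scores: dict[str, float],
    filter_methods: set[str],
    unscored_defaults: dict[str, float],
    method: str,
) -> float:
    """Return the effective score of `method` for a single result.

    result_scores:     scores explicitly recorded for this result (hypothetical layout)
    filter_methods:    methods that were used to filter the result set
    unscored_defaults: each method's unscored default value
    """
    # 1. An explicit score recorded for the result takes priority.
    if method in result_scores:
        return result_scores[method]
    # 2. A method that was used to filter the results implies a score of 1.
    if method in filter_methods:
        return 1.0
    # 3. Otherwise the method's unscored default value is used.
    return unscored_defaults[method]


defaults = {"human_name": 0.0, "not_number": 0.0}  # hypothetical default values
explicit = resolve_score({"human_name": 0.0}, {"human_name"}, defaults, "human_name")
implied = resolve_score({}, {"human_name"}, defaults, "human_name")
fallback = resolve_score({}, set(), defaults, "not_number")
print(explicit, implied, fallback)  # prints: 0.0 1.0 0.0
```

Note that an explicit score wins even when the same method was also used for filtering, which is how the ordering of the rules reads.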

Term Scoring

Term scoring methods assess the suitability of various terms for searching purposes, and they are principally divided into two groups: positive and negative.

Positive scoring methods assess whether a term satisfies a number of desirable criteria, thus belonging to a group of terms with desirable search properties. Such groups are typically (or at least initially) those which contain terms of most interest. However, one may also want to search for non-members of such groups later.

Examples of positive scoring methods include human_name and wordnet, indicating respectively whether terms are used to refer to human bioentities and whether terms are mentioned in a common English word dictionary.

Negative scoring methods assess whether a term does not satisfy a number of criteria. Such criteria may be associated with undesirable search properties and be common to groups of uninteresting search terms (such as common English words or numbers). By their exclusion from such groups, terms can be considered to be interesting, although one may want to search for uninteresting terms which do satisfy one or more of the criteria at a later point, too.

Examples of negative scoring methods include not_wordnet and not_number, indicating respectively whether terms are not mentioned in a common English word dictionary and whether terms are not numbers.

Despite the naming convention used above, both positive and negative scoring methods employ the scoring convention whereby interesting terms carry a score of 1 and uninteresting terms carry a score of 0, where "interesting" and "uninteresting" must be considered within the context of subsequent processing.

Thus, a term assigned a score of 1 by the human_name method would be considered interesting, as would a term assigned a score of 1 by the not_number method, even though the latter assignment is based on a negative observation.

Note that negative and positive scoring methods may be complementary: a term assigned a score of 1 with the wordnet method will be assigned a score of 0 with the not_wordnet method. It is the task of subsequent processing of the score information to employ a suitable method which reflects the level of desirability or interest a particular term may have in that processing and in any results produced.
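As a rough illustration of the scoring convention and of complementary methods, the following sketch scores terms with 1 or 0. The tiny word set stands in for a real dictionary, and the definition of a "number" is an assumption made for the example; the actual methods are applied through SQL templates and may differ.

```python
import re

# Hypothetical stand-in for a common English word dictionary.
COMMON_WORDS = {"cat", "was", "for"}

# Assumed notion of a number: optional sign, digits, optional decimal part.
_NUMBER = re.compile(r"[+-]?\d+(\.\d+)?")


def wordnet(term: str) -> int:
    """Positive method: 1 if the term is a dictionary word, else 0."""
    return 1 if term in COMMON_WORDS else 0


def not_wordnet(term: str) -> int:
    """Negative method: the complement of wordnet."""
    return 1 - wordnet(term)


def not_number(term: str) -> int:
    """Negative method: 1 if the term is not a number, else 0."""
    return 0 if _NUMBER.fullmatch(term) else 1


for term in ["cat", "BRCA1", "42"]:
    print(term, wordnet(term), not_wordnet(term), not_number(term))
```

In every case a score of 1 marks the term as interesting for that method, whether the method is named positively or negatively.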

Currently used methods are the following:

Method             Description
not_moby           term is not in the Moby lexicon
not_wordnet        term is not in the WordNet lexicon
not_number         term is not a number
human_name         term refers to a human bioentity
not_uninformative  term is informative (not uninformative) according to the presence of certain patterns (identifying particular "uninformative" or "imprecise" styles of bioentity name)
not_systematic     term is not systematic according to the presence of certain patterns (identifying particular "systematic" styles of bioentity name)
not_short          term is not considered short (less than 2 characters)
not_stop_word      term is not a stop word
symbolic           term is considered symbolic (where a symbol is a bioentity name belonging to specific categories not containing a space)

Term scoring methods are applied using the templates found in the bioscape/sql/termscore directory.

Filtering

Term scores can be used as the basis for eliminating search candidates (negatively scored terms need not be searched for in the literature) and for filtering the bioentities suggested for each search result.
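A minimal sketch of eliminating search candidates from term scores follows. The data layout and function name are hypothetical, and treating a missing score as 0 is an assumption made to keep the example short.

```python
def select_search_terms(
    term_scores: dict[str, dict[str, int]],
    required_methods: list[str],
) -> list[str]:
    """Keep only the terms that score 1 on every required method."""
    return [
        term
        for term, scores in term_scores.items()
        # Assumption: a term with no recorded score for a method counts as 0.
        if all(scores.get(method, 0) == 1 for method in required_methods)
    ]


# Hypothetical scores for three candidate terms.
term_scores = {
    "BRCA1": {"not_number": 1, "not_wordnet": 1},
    "was": {"not_number": 1, "not_wordnet": 0},
    "42": {"not_number": 0, "not_wordnet": 1},
}
print(select_search_terms(term_scores, ["not_number", "not_wordnet"]))
# prints: ['BRCA1']
```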

Search Scoring

Search scoring assesses the suitability of particular bioentities as suggestions for textual search results. For example, where only human genes are of interest, it is necessary first to identify such genes and assign them a positive score, assigning a negative score (potentially implicitly) to all other bioentities. The human_gene method can then be used to select only such genes, potentially in combination with other methods that narrow the selection further. Where a particular textual search term could be associated with a number of different bioentities, those of interest are thus retained and all other candidates filtered out.
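The selection described above might be sketched as follows. The Bioentity record, its fields, and the identifiers are all invented for the illustration; only the method name human_gene comes from this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bioentity:
    identifier: str  # hypothetical identifier
    kind: str        # e.g. "gene" or "protein"
    species: str     # e.g. "human" or "mouse"


def human_gene(entity: Bioentity) -> int:
    """Score 1 for human genes, 0 for every other bioentity."""
    return 1 if entity.kind == "gene" and entity.species == "human" else 0


def filter_candidates(candidates, methods):
    """Retain candidates scoring 1 on all given search scoring methods."""
    return [c for c in candidates if all(m(c) == 1 for m in methods)]


# Three bioentities that one search term could refer to (made-up data).
candidates = [
    Bioentity("gene:1", "gene", "human"),
    Bioentity("gene:2", "gene", "mouse"),
    Bioentity("protein:1", "protein", "human"),
]
print([c.identifier for c in filter_candidates(candidates, [human_gene])])
# prints: ['gene:1']
```

Passing several methods narrows the selection further, since a candidate must score 1 on each of them.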

Currently used methods are the following:

Method      Description
human_gene  bioentities which are human genes

Filtering

Although search scoring methods are used to filter concrete, bioentity-specific search results, the information may be useful when filtering other kinds of results.

Sentence Scoring

The scoring of sentences typically involves assessing each sentence for some specific property and assigning a score to indicate the presence of that property. For example, the interaction_sentence method employs information about the presence of interaction keywords (words from a predefined lexicon which may indicate the description of a protein interaction process) and scores sentences positively if such keywords exist in those sentences.
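A simple keyword-based sentence scorer in the spirit of interaction_sentence could look like the sketch below. The keyword set and the tokenisation are assumptions made for the example and are not the predefined lexicon itself.

```python
import re

# Hypothetical interaction keywords; the real lexicon is predefined elsewhere.
INTERACTION_KEYWORDS = {"binds", "interacts", "phosphorylates"}


def interaction_sentence(sentence: str) -> int:
    """Score 1 if the sentence contains any interaction keyword, else 0."""
    # Assumed tokenisation: lower-cased runs of ASCII letters.
    words = re.findall(r"[a-z]+", sentence.lower())
    return 1 if any(word in INTERACTION_KEYWORDS for word in words) else 0


print(interaction_sentence("Protein A binds protein B."))       # prints: 1
print(interaction_sentence("Protein A was measured in mice."))  # prints: 0
```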

Currently used methods are the following:

Method                Description
interaction_sentence  sentence contains keywords which suggest an interaction

Filtering

Sentence scores can be useful when identifying sentences for further investigation in the production of suspected interaction occurrences, and the interaction_sentence method is used specifically to filter out results from sentences which do not contain interaction keywords.

Result Scoring

Result scoring involves the assessment of concrete search results which associate bioentities with specific regions of text.

Some result methods may seem unnecessary. For example, the not_chemical_name_mention method, whose nature involves identifying results which coincide with chemical/molecule name mentions, might seem better implemented as a term scoring method. However, whilst term scoring is effective when terms can be precisely matched against each other, result scoring is also effective when terms more loosely match similar regions of text. In other cases, such as with the not_uninformative_keyword_mention method and the not_uninformative_keyword term scoring method, the latter may obviate the need to apply the former since the list of uninformative keywords might match the original terms exactly (and this should be the case since the list was curated).

As well as using specific scoring methods, result scores can be propagated from sentence scores; performing this transfer of scoring information effectively specialises the sentence scoring method so that the interaction_sentence method, for example, would effectively end up scoring results as if it were a "result appears in an interaction sentence" method.
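The propagation of sentence scores to results can be pictured with the following sketch. The identifiers and the mapping from results to sentences are hypothetical; the point is only that each result inherits the score of the sentence in which it occurs.

```python
def propagate_sentence_scores(
    result_sentences: dict[str, str],
    sentence_scores: dict[str, int],
) -> dict[str, int]:
    """Copy each sentence's score onto the results found in that sentence.

    result_sentences: result id -> id of the sentence containing the result
    sentence_scores:  sentence id -> score from one sentence scoring method
    """
    return {
        result_id: sentence_scores[sentence_id]
        for result_id, sentence_id in result_sentences.items()
        # Results in unscored sentences receive no propagated score.
        if sentence_id in sentence_scores
    }


# Made-up data: two results in sentence s1, one in s2, one in unscored s3.
result_sentences = {"r1": "s1", "r2": "s1", "r3": "s2", "r4": "s3"}
interaction_sentence_scores = {"s1": 1, "s2": 0}
print(propagate_sentence_scores(result_sentences, interaction_sentence_scores))
# prints: {'r1': 1, 'r2': 1, 'r3': 0}
```

With interaction_sentence scores as input, the output behaves like a "result appears in an interaction sentence" score for each result.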

Currently used methods are the following:

Method                                        Description
not_chemical_name_mention                     result does not coincide with a chemical name mention
not_uninformative_keyword_mention             result does not coincide with an uninformative keyword mention
not_part_of_other_mentions                    result does not occur as part of other mentions
not_maplocation_mention                       result does not occur at the same location as a maplocation mention
disambiguated_by_competing_names              result is supported by the presence of more names than other candidates at the same precise location
unambiguous_gene_mention                      result suggests the only gene at a particular mention location
confirmed_by_competing_names                  result is supported by the presence of more names than other candidates at the same precise location
not_disqualified_by_keyword                   result gene is not disqualified by a keyword
confirmed_by_symbol_match                     result is confirmed by the presence of a symbol which matches the name involved with the result
unambiguous_gene_mention_at_precise_location  result suggests the only gene at a particular precise mention location
confirmed_by_multiple_names                   result is confirmed by the presence of multiple names for the suggested bioentity
not_gene_ontology_term_mention                result does not coincide with a Gene Ontology term mention

Some methods rely on additional data. Partitions of the table text_result_doc_bioentity_names are required for "competing name" methods, whereas partitions of the text_result_doc_genes table are required for disambiguation methods which employ "unambiguous gene" data.
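To make one of the descriptions above concrete, the sketch below scores results in the manner of unambiguous_gene_mention: a result scores 1 when its gene is the only gene suggested at its mention location. The tuple layout and identifiers are assumptions made for the example, and the tables named above are not modelled.

```python
from collections import defaultdict


def unambiguous_gene_mention(results: list[tuple[str, int, str]]) -> dict:
    """Score each (document, location, gene) result.

    A result scores 1 if no other gene is suggested at the same
    document location, and 0 otherwise.
    """
    genes_at = defaultdict(set)
    for doc_id, location, gene_id in results:
        genes_at[(doc_id, location)].add(gene_id)
    return {
        (doc_id, location, gene_id): 1 if len(genes_at[(doc_id, location)]) == 1 else 0
        for doc_id, location, gene_id in results
    }


# Made-up results: two genes compete at location 10, one gene stands alone at 57.
results = [
    ("doc1", 10, "geneA"),
    ("doc1", 10, "geneB"),
    ("doc1", 57, "geneC"),
]
for result, score in unambiguous_gene_mention(results).items():
    print(result, score)
```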

Result Scoring Techniques

The scoring of results involves inspecting the proposed bioentities and assessing their suitability in a particular document location. Such assessment methods employ the following approaches:

Effect of method:

  1. Confirm bioentity relevance (scoring supported bioentities positively)
  2. Disambiguate bioentities (identifying unsupported bioentities and scoring them negatively)

Means of assessment:

  1. Find supporting contextual information
  2. Find more supporting contextual information for the "best" bioentities
  3. Compare bioentities in order to identify the "best" bioentities

Methods which confirm bioentity relevance may be combined to test whether a mention satisfies the criteria from all such methods. However, where a mention need only satisfy the criteria from a single method, it is more appropriate to combine disambiguation methods, since these should only exclude mentions on a conservative basis.

A collection of scoring techniques is provided in the "Bioscape Scoring Techniques" document.