Bioscape Searching Techniques

From irefindex

Revision as of 17:59, 3 December 2009

A number of searching techniques are applied to find textual mentions of entities or concepts, particularly where the searching is speculative: no predefined list of search terms is used, and certain characteristic patterns are sought in the text instead.

Acronym Mentions

Acronym definition phrases are found by first detecting sentences containing brackets, ( and ), and then inspecting those sentences more closely with regular expressions that look for one of the following patterns:

  • An acronym-like term (upper-case letters, digits and hyphens) followed by a parenthesis phrase (a phrase in brackets)
  • An acronym-like term in brackets, with the preceding text then being considered as the definition or explanation of the acronym
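
The two patterns might be expressed roughly as follows. This is a minimal sketch in Python, not the expressions actually used by Bioscape: the names ACRONYM, ACRONYM_THEN_PHRASE, PHRASE_THEN_ACRONYM and find_candidates are hypothetical, and the definition of "acronym-like" used here (an upper-case letter followed by one or more upper-case letters, digits or hyphens) is an assumption.

```python
import re

# Assumed definition of an acronym-like term: an upper-case letter followed
# by one or more upper-case letters, digits or hyphens.
ACRONYM = r"[A-Z][A-Z0-9-]+"

# Pattern 1: an acronym-like term followed by a phrase in brackets.
ACRONYM_THEN_PHRASE = re.compile(r"\b(" + ACRONYM + r")\s*\(([^()]+)\)")

# Pattern 2: an acronym-like term in brackets; the preceding text is the
# candidate explanation.
PHRASE_THEN_ACRONYM = re.compile(r"([^()]+?)\s*\((" + ACRONYM + r")\)")


def find_candidates(sentence):
    """Return (acronym, candidate explanation text) pairs found in a sentence."""
    candidates = []
    for match in ACRONYM_THEN_PHRASE.finditer(sentence):
        candidates.append((match.group(1), match.group(2).strip()))
    for match in PHRASE_THEN_ACRONYM.finditer(sentence):
        candidates.append((match.group(2), match.group(1).strip()))
    return candidates
```

For the second pattern, this sketch returns all of the text preceding the bracket, so the explanation still has to be narrowed down to the relevant words.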

Once a possible acronym and explanation have been identified, a test attempts to match each character of the acronym (letter or digit) with a word from the explanatory text. Although it is tempting to take only the first letter (or digit) of each word, other approaches involving more sophisticated tokenisation may be necessary. Consider the following examples:

PubMed #10639512

"In this study, we isolated and characterized the crucial gene at the breast cancer antiestrogen resistance 1 (BCAR1) locus."

Explanation: breast cancer antiestrogen resistance 1
Acronym: BCAR1
Explanation initials: b, c, a, r, 1
  • explanation initials correspond to acronym
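
The simple first-letter test that succeeds in this example could be sketched as follows. The function names are hypothetical, and taking the last n whitespace-separated words before the bracket as the explanation (where n is the number of characters in the acronym) is an assumption made for illustration only.

```python
def acronym_characters(acronym):
    """Letters and digits of the acronym, lower-cased; hyphens are dropped."""
    return [c.lower() for c in acronym if c.isalnum()]


def presumed_explanation(preceding_text, acronym):
    """Assumption: the explanation is the last n words before the bracket."""
    n = len(acronym_characters(acronym))
    words = preceding_text.split()
    return words[-n:] if n else []


def initials_match_in_order(acronym, preceding_text):
    """True if the word initials spell out the acronym, in order."""
    words = presumed_explanation(preceding_text, acronym)
    initials = [word[0].lower() for word in words]
    return initials == acronym_characters(acronym)
```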

PubMed #10484778

"Anhidrotic ectodermal dysplasia (EDA) is a human genetic disorder of impaired ectodermal appendage development."

Explanation: Anhidrotic ectodermal dysplasia
Acronym: EDA
Explanation initials: a, e, d (not detectable in order)
  • presumed explanation initials only correspond to acronym if reordered
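
One way to tolerate reordering is to compare the initials and the acronym characters as multisets rather than as sequences. The sketch below assumes this approach; the function name is hypothetical, and such a comparison is more permissive than an in-order test, so it may accept more false matches.

```python
from collections import Counter


def initials_match_if_reordered(acronym, explanation):
    """Compare acronym characters and word initials while ignoring their order."""
    acronym_chars = Counter(c.lower() for c in acronym if c.isalnum())
    initials = Counter(word[0].lower() for word in explanation.split())
    return acronym_chars == initials
```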

PubMed #10484776

"We identified a glyoxylate reductase/hydroxypyruvate reductase (GRHPR) cDNA clone from a human liver expressed sequence tag (EST) library."

Explanation: glyoxylate reductase/hydroxypyruvate reductase
Acronym: GRHPR
Explanation initials: g, r, h, p (within a word), r (requiring word analysis)
  • presumed explanation initials only correspond if words are inspected more closely
  • words must also be isolated using a tokeniser more sophisticated than one that splits on whitespace characters
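
Both points can be illustrated with a short sketch. It assumes a tokeniser that splits on any run of non-alphanumeric characters, and a greedy test in which each acronym character must either be the initial of the next word or occur later within the current word. The function names are hypothetical, and a greedy strategy will not find every valid alignment.

```python
import re


def tokenise(text):
    """Split on runs of non-alphanumeric characters, so "a/b" yields two words."""
    return [token for token in re.split(r"[^A-Za-z0-9]+", text) if token]


def matches_with_inner_letters(acronym, explanation):
    """Greedy test allowing acronym characters to come from inside a word."""
    words = [word.lower() for word in tokenise(explanation)]
    chars = [c.lower() for c in acronym if c.isalnum()]
    word_index, offset = -1, 0  # position of the most recently matched character
    for c in chars:
        next_index = word_index + 1
        if next_index < len(words) and words[next_index][0] == c:
            # Preferred case: the character is the initial of the next word.
            word_index, offset = next_index, 0
        elif word_index >= 0:
            # Otherwise look further along inside the current word.
            position = words[word_index].find(c, offset + 1)
            if position == -1:
                return False
            offset = position
        else:
            return False
    # Every word of the explanation must have contributed its initial.
    return word_index == len(words) - 1
```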

PubMed #10226785

"The insulin receptor related receptor (IRR) is a heterotetrameric transmembrane receptor with intrinsic tyrosine kinase activity."

Explanation: insulin receptor related receptor
Acronym: IRR
Explanation initials: i, r, r (should be ignored), r (requiring stop-word detection)
  • presumed explanation initials only correspond if stop-words are discarded

"The IRR shares large homology with the insulin and the insulin-like growth factor-1 (IGF-I) receptor with regard to amino acid sequence and protein structure."

Explanation: insulin-like growth factor-1
Acronym: IGF-I
Explanation initials: i, l (should be ignored), g, f, 1 (requiring stop-word detection, numeral conversion)
  • presumed explanation initials only correspond if stop-words are discarded
  • numerals must also be converted so that 1 and I can be matched
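
Stop-word removal and numeral conversion for the two examples above might look like the following sketch. The stop-word list and the digit-to-Roman table are hypothetical and cover only these examples (single digits from 1 to 5); the function names are likewise assumptions.

```python
import re

# Hypothetical stop-word list, chosen only to cover the two examples above.
STOP_WORDS = {"related", "like"}

# Assumed conversion table for small numerals; both sides are normalised
# to lower-case Roman numerals before comparison.
ROMAN = {"1": "i", "2": "ii", "3": "iii", "4": "iv", "5": "v"}


def normalised_acronym(acronym):
    """Lower-case the acronym, drop hyphens and convert digits to Roman form."""
    chars = []
    for c in acronym.lower():
        if c in ROMAN:
            chars.extend(ROMAN[c])
        elif c.isalnum():
            chars.append(c)
    return chars


def normalised_initials(explanation):
    """Word initials with stop-words discarded and numerals converted."""
    initials = []
    for word in re.split(r"[^A-Za-z0-9]+", explanation.lower()):
        if not word or word in STOP_WORDS:
            continue
        if word in ROMAN:
            initials.extend(ROMAN[word])  # a numeral contributes all its characters
        else:
            initials.append(word[0])
    return initials


def matches_after_normalisation(acronym, explanation):
    return normalised_acronym(acronym) == normalised_initials(explanation)
```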

Chromosome and Maplocation Mentions

To be expanded...

PubMed #10684944

"mouse chromosome 17 and to human chromosome 16p13.3"

PubMed #10639512

"chromosome 16q23.1"

PubMed #10484772

"chromosome Xp11.2"