Bioscape Searching Techniques

From irefindex

Latest revision as of 13:46, 14 July 2010

Note: Please note that this documentation covers an unreleased product and is for internal use only.

A number of searching techniques are applied to find textual mentions of entities or concepts, particularly in cases where the searching is speculative, meaning that no predefined list of search terms is used and certain characteristic patterns are sought in the text instead.

Acronym Mentions

Acronym definitions are found by first detecting sentences that contain brackets, ( and ), and then inspecting those sentences more closely using regular expressions that look for one of the following patterns:

  • An acronym-like term (upper-case letters, digits and hyphens) followed by a parenthetical phrase (a phrase in brackets)
  • An acronym-like term in brackets, with the preceding text then being considered as the definition or explanation of the acronym
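The two patterns might be approximated along the following lines. This is only a sketch and not the expressions actually used: the names ACRONYM, ACRONYM_THEN_BRACKETS, BRACKETED_ACRONYM and both functions are hypothetical, and requiring an acronym to start with a letter and have at least two characters is an added assumption.

```python
import re

# Acronym-like term: upper-case letters, digits and hyphens.
# Assumption: it starts with a letter and has at least two characters.
ACRONYM = r"[A-Z][A-Z0-9-]*[A-Z0-9]"

# First pattern: an acronym-like term followed by a phrase in brackets.
ACRONYM_THEN_BRACKETS = re.compile(
    r"\b(?P<acronym>" + ACRONYM + r")\s*\((?P<phrase>[^()]+)\)"
)

# Second pattern: an acronym-like term in brackets. The text before the
# bracket is kept whole as the candidate explanation, to be trimmed later.
BRACKETED_ACRONYM = re.compile(
    r"(?P<before>[^()]*?)\s*\(\s*(?P<acronym>" + ACRONYM + r")\s*\)"
)


def acronyms_before_brackets(sentence):
    """Return (acronym, bracketed phrase) pairs for the first pattern."""
    return [(m.group("acronym"), m.group("phrase").strip())
            for m in ACRONYM_THEN_BRACKETS.finditer(sentence)]


def acronyms_in_brackets(sentence):
    """Return (preceding text, acronym) pairs for the second pattern."""
    return [(m.group("before").strip(), m.group("acronym"))
            for m in BRACKETED_ACRONYM.finditer(sentence)]
```

Note that the second pattern returns all of the text before the bracket, so a further step is still needed to decide how much of that text is the explanation.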

Once a possible acronym and explanation have been identified, a test attempts to match each character of the acronym (letter or digit) with a word from the explanatory text. Although it is tempting to take only the first letter (or digit) of each word, more sophisticated tokenisation and matching may be necessary. Consider the following examples:

PubMed #10639512

"In this study, we isolated and characterized the crucial gene at the breast cancer antiestrogen resistance 1 (BCAR1) locus."

breast cancer antiestrogen resistance 1  (  BCAR1    )
explanation                                 acronym
b, c, a, r, 1                               BCAR1
  • explanation initials correspond to acronym
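The simplest test, taking the first character of each whitespace-separated word, could be sketched as follows. The function name naive_explanation is hypothetical, and the choice to look only at the last N words before the bracket (N being the acronym length) is an assumption made for illustration.

```python
def naive_explanation(preceding_text, acronym):
    """Return the words directly before the bracket whose first characters
    spell the acronym in order, or None if they do not."""
    words = preceding_text.split()
    # Assumption: one word per acronym character, taken from the end.
    candidate = words[-len(acronym):]
    if len(candidate) != len(acronym):
        return None
    if all(word[0].lower() == char.lower()
           for word, char in zip(candidate, acronym)):
        return " ".join(candidate)
    return None
```

This is sufficient for BCAR1, but it returns None for the remaining examples in this section.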

PubMed #10484778

"Anhidrotic ectodermal dysplasia (EDA) is a human genetic disorder of impaired ectodermal appendage development."

Anhidrotic ectodermal dysplasia  (  EDA      )
explanation                         acronym
a, e, d                             EDA
(not detectable in order)
  • presumed explanation initials only correspond to acronym if reordered
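One possible way to accept reordered initials is to compare them as a multiset rather than as a sequence. The sketch below assumes that approach; the function name same_initials_any_order is hypothetical, and ignoring order entirely is likely to admit more false matches than an in-order test.

```python
from collections import Counter


def same_initials_any_order(explanation, acronym):
    """True if the word initials and the acronym characters are the same
    collection of characters, regardless of order."""
    word_initials = Counter(word[0].lower() for word in explanation.split())
    return word_initials == Counter(acronym.lower())
```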

PubMed #10484776

"We identified a glyoxylate reductase/hydroxypyruvate reductase (GRHPR) cDNA clone from a human liver expressed sequence tag (EST) library."

glyoxylate reductase/hydroxypyruvate reductase  (  GRHPR    )
explanation                                        acronym
g, r, h, p (within a word), r                      GRHPR
(requiring word analysis)
  • presumed explanation initials only correspond if words are inspected more closely
  • words must also be isolated using a more sophisticated tokeniser than one which splits words using whitespace characters
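Both requirements can be sketched together: a tokeniser that splits on any non-alphanumeric character, and a matcher that lets a word contribute further letters after its initial. The names tokenise and spells are hypothetical, and the rule that every token must supply its initial is an assumption; the actual matching rules are not specified here.

```python
import re


def tokenise(text):
    """Split on any run of non-alphanumeric characters, not just whitespace."""
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", text) if t]


def spells(tokens, acronym):
    """True if the tokens spell the acronym when each token supplies its
    initial and, optionally, later characters from inside the token."""
    acronym = acronym.lower()

    def from_token(ti, ai):
        # All tokens used: succeed only if the acronym is used up as well.
        if ti == len(tokens):
            return ai == len(acronym)
        # Each token must supply its initial as the next acronym character.
        if ai == len(acronym) or tokens[ti][0] != acronym[ai]:
            return False
        return within_token(ti, 1, ai + 1)

    def within_token(ti, pos, ai):
        # Either move on to the next token...
        if from_token(ti + 1, ai):
            return True
        # ...or take the next acronym character from later in this token.
        if ai < len(acronym):
            nxt = tokens[ti].find(acronym[ai], pos)
            while nxt != -1:
                if within_token(ti, nxt + 1, ai + 1):
                    return True
                nxt = tokens[ti].find(acronym[ai], nxt + 1)
        return False

    return from_token(0, 0)
```

With this tokeniser the GRHPR explanation yields four tokens, and the "p" is found inside "hydroxypyruvate".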

PubMed #10226785

"The insulin receptor related receptor (IRR) is a heterotetrameric transmembrane receptor with intrinsic tyrosine kinase activity."

insulin receptor related receptor  (  IRR      )
explanation                           acronym
i, r, r (should be ignored), r        IRR
(requiring stop-word detection)
  • presumed explanation initials only correspond if stop-words are discarded
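Discarding stop-words before taking initials might look like the sketch below. The stop-word list is purely hypothetical and chosen only to cover the examples in this section; the name significant_initials is likewise an assumption.

```python
import re

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = frozenset({"related", "like", "of", "the", "and", "for"})


def significant_initials(explanation, stop_words=STOP_WORDS):
    """Initials of the tokens that are not stop-words, in order."""
    tokens = re.split(r"[^A-Za-z0-9]+", explanation.lower())
    return [t[0] for t in tokens if t and t not in stop_words]
```

For "insulin receptor related receptor" this leaves the initials i, r, r, which correspond to IRR.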

"The IRR shares large homology with the insulin and the insulin-like growth factor-1 (IGF-I) receptor with regard to amino acid sequence and protein structure."

insulin-like growth factor-1       (  IGF-I    )
explanation                           acronym
i, l (should be ignored), g, f, 1     IGF-I
(requiring stop-word detection, numeral conversion)
  • presumed explanation initials only correspond if stop-words are discarded
  • numerals must also be converted so that 1 and I can be matched
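Numeral conversion could be handled by normalising the acronym before comparison. The sketch below assumes a simple heuristic, namely that a final hyphen-separated part consisting of a Roman numeral is a number; the table ROMAN_TO_ARABIC and the function normalise_acronym are hypothetical.

```python
# Small hypothetical conversion table; extend as needed.
ROMAN_TO_ARABIC = {"I": "1", "II": "2", "III": "3", "IV": "4", "V": "5"}


def normalise_acronym(acronym):
    """Drop hyphens and convert a trailing Roman numeral part to digits."""
    parts = acronym.split("-")
    # Assumption: only a part after a hyphen is treated as a numeral,
    # so the leading "I" of "IGF" is left alone.
    if len(parts) > 1 and parts[-1] in ROMAN_TO_ARABIC:
        parts[-1] = ROMAN_TO_ARABIC[parts[-1]]
    return "".join(parts).lower()
```

With this normalisation, IGF-I becomes "igf1", which can then be compared with the initials i, g, f, 1 obtained from the explanation.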

Chromosome and Maplocation Mentions

To be expanded...

PubMed #10684944

"mouse chromosome 17 and to human chromosome 16p13.3"

PubMed #10639512

"chromosome 16q23.1"

PubMed #10484772

"chromosome Xp11.2"

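Pending expansion of this section, the three quoted mentions suggest a pattern of the word "chromosome" followed by a chromosome number or letter and an optional band. The sketch below is an assumption based only on those examples; the names CHROMOSOME and chromosome_mentions are hypothetical and the pattern is not claimed to cover all map location notations.

```python
import re

# "chromosome", then a number or X/Y, then an optional band such as p13.3.
CHROMOSOME = re.compile(
    r"\bchromosome\s+(?P<chromosome>[0-9]{1,2}|[XY])"
    r"(?P<band>[pq][0-9]+(?:\.[0-9]+)?)?"
)


def chromosome_mentions(text):
    """Return (chromosome, band) pairs; band is None when absent."""
    return [(m.group("chromosome"), m.group("band"))
            for m in CHROMOSOME.finditer(text)]
```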
Synonym Definitions

PubMed #10880513

"Our previous studies have shown that activation of a related adhesion focal tyrosine kinase (RAFTK) (also known as Pyk2) is required for dexamethasone (Dex)-induced apoptosis in multiple myeloma (MM) cells and that human interleukin-6 (IL-6), a known growth and survival factor for MM cells, blocks both RAFTK activation and apoptosis induced by Dex."

related adhesion focal tyrosine kinase  (  RAFTK       ) (  also known as           Pyk2        )
suggestion                                 suggestion       synonym correspondence  suggestion

PubMed #10910894

"Liver-expressed chemokine (LEC) is an unusually large CC chemokine, which is also known as LMC, HCC-4, NCC-4, and CCL16."

(  LEC            ) is an unusually large CC chemokine, which is  also known as           LMC            ,  HCC-4          ,  NCC-4          , and  CCL16
   suggestion                                                     synonym correspondence  suggestion        suggestion        suggestion            suggestion
   CCL16 (#6360)                                                                          CCL16 (#6360)     CCL16 (#6360)     CCL16 (#6360)         CCL16 (#6360)
                                                                                                            RBMS1 (#5937)

PubMed #11137999

"Here we report the identification of a new transmembrane serine protease (TMPRSS3; also known as ECHOS1) expressed in many tissues, including fetal cochlea, which is mutated in the families used to describe both the DFNB10 and DFNB8 loci."

TMPRSS3           ;  also known as           ECHOS1
suggestion           synonym correspondence  suggestion
TMPRSS3 (#64699)                             TMPRSS3 (#64699)
TMPRSS4 (#56649)
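The synonym correspondences in these three examples all hinge on the phrase "also known as". A minimal sketch of extracting the names that follow it is given below; the names ALSO_KNOWN_AS and synonyms are hypothetical, and ending the list at the first bracket, full stop or semicolon is an assumption that would fail for names containing those characters.

```python
import re

# Everything after "also known as" up to a bracket, full stop or semicolon.
ALSO_KNOWN_AS = re.compile(r"also known as\s+(?P<names>[^().;]+)")


def synonyms(sentence):
    """Return the names listed after each 'also known as' phrase."""
    found = []
    for m in ALSO_KNOWN_AS.finditer(sentence):
        # Split on commas (optionally followed by "and") or on a bare "and".
        names = re.split(r"\s*,\s*(?:and\s+)?|\s+and\s+", m.group("names").strip())
        found.extend(name for name in names if name)
    return found
```

Linking each extracted name back to the term it is a synonym of (LEC, RAFTK or TMPRSS3 above) is a separate step not covered by this sketch.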

Bioentity Inference

Sometimes the surrounding text makes it possible to infer which kind of bioentity is being written about.

PubMed #9624006

"These mice also show increased susceptibility to tumorigenesis either following carcinogen treatment or when also deficient in Ink4a."

deficient in          Ink4a
type-specific term    suggestion
reference to protein  CDKN2A (#1029)
  • "deficient in" should force the interpretation of the bioentity as a protein

PubMed #8062391

"A novel positive cofactor (PC4) purified from the human USA fraction effected a marked enhancement (up to 85-fold) of GAL4-AH-dependent transcription in conjunction with TFIID and other general factors."

in conjunction with  TFIID        and other general  factors
                     suggestion                      type-specific term
                     TBP (#6908)                     reference to proteins
  • syntactic analysis connecting the suggestion with "factors" should force the interpretation of the bioentity as a protein

PubMed #7663508

"The fat mutation maps to mouse chromosome 8, very close to the gene for carboxypeptidase E (Cpe), which encodes an enzyme (CPE) that processes prohormone intermediates such as proinsulin."

very close to  the gene for                   carboxypeptidase E  (  Cpe          ), which encodes an  enzyme                (  CPE          )
               type-specific term             suggestion             suggestion                        type-specific term       suggestion
               reference to gene for protein  CPE (#1363)            CPE (#1363)                       reference to protein     CPE (#1363)
  • type-specific terms should cause suggestions to be interpreted as proteins, albeit with an implied gene reference
  • note that "mouse chromosome 8" should affect any chromosome-related techniques

However, elsewhere in the same PubMed document, the text surrounding a suggestion may lead to a different interpretation:

"Hyperproinsulinaemia in obese fat/fat mice associated with a carboxypeptidase E mutation which reduces enzyme activity."

associated with a  carboxypeptidase E  mutation
                   suggestion          type-specific term
                   CPE (#1363)         implied reference to gene
  • "mutation" should cause the suggestion to be interpreted as a gene

PubMed #2674130

"A new family of ras-related proteins, designated rac (ras-related C3 botulinum toxin substrate) has been identified. rac1 and rac2 cDNA clones were isolated from a differentiated HL-60 library and encode proteins that are 92% homologous and share 58% and 26-30% amino acid homology with human rhos and ras, respectively."

A new  family                       of ras-related  proteins               , designated  rac          (  ras-related C3 botulinum toxin substrate  )
       type-specific term                           type-specific term                   suggestion      acronym explanation (not detected)
       reference to protein family                  reference to proteins                AKT1 (#207)
  • syntactic analysis should connect the type-specific term to the suggestion and cause the suggestion to be interpreted as a protein family
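The type-specific terms in the examples of this section can be collected into a simple lookup, sketched below. The cue phrases and interpretations are taken from the examples above, but the table TYPE_CUES and the function type_cues are hypothetical. The sketch only finds cues; it does not perform the syntactic analysis needed to connect a cue to a particular suggestion.

```python
import re

# Cue phrases and interpretations taken from the examples in this section.
TYPE_CUES = {
    "deficient in": "protein",
    "the gene for": "gene for protein",
    "enzyme": "protein",
    "factors": "proteins",
    "proteins": "proteins",
    "family": "protein family",
    "mutation": "gene",
}


def type_cues(text):
    """Return (cue, interpretation) pairs in order of appearance."""
    lowered = text.lower()
    found = []
    for cue, interpretation in TYPE_CUES.items():
        for m in re.finditer(r"\b" + re.escape(cue) + r"\b", lowered):
            found.append((m.start(), cue, interpretation))
    return [(cue, interpretation) for _, cue, interpretation in sorted(found)]
```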

Collections or Sequences of Terms

PubMed #7649249

"We report the characterization of three novel members of the KRAB-domain containing C2-H2 zinc finger family (ZNF133, 136 and 140)."

zinc finger  family                       (  ZNF133          ,  136                 and  140                 ).
             type-specific term              suggestion         part of suggestion       part of suggestion
             reference to protein family     ZNF133 (#7692)     ZNF136 (#7695)           ZNF140 (#7699)
  • "family" could be used to prompt further inspection of the term and following numbers
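Expanding such a sequence could be sketched as follows: the alphabetic prefix of the first term is reused for the bare numbers that follow it. The function name expand_series is hypothetical, and the pattern only handles lists joined by commas and "and", as in the example above.

```python
import re

# A term made of letters then digits, followed by one or more bare numbers
# separated by commas or "and", e.g. "ZNF133, 136 and 140".
SERIES = re.compile(r"\b([A-Za-z]+)(\d+)((?:\s*(?:,|and)\s*\d+)+)")


def expand_series(text):
    """Return the full terms for the first abbreviated series in the text."""
    m = SERIES.search(text)
    if not m:
        return []
    prefix, first, rest = m.groups()
    return [prefix + number for number in [first] + re.findall(r"\d+", rest)]
```

Each expanded term can then be looked up as a suggestion in its own right.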