Difference between revisions of "Bioscape Scoring Techniques"
PaulBoddie (talk | contribs) (→Acronym Mentions and Techniques: Added awkward acronym detection case.)
PaulBoddie (talk | contribs) m (Added status note.)
(One intermediate revision by the same user not shown)
Latest revision as of 13:44, 14 July 2010
Note: Please note that this documentation covers an unreleased product and is for internal use only.
This document provides examples of some of the adopted and proposed techniques for scoring source and result data in order to more accurately identify bioentities in the literature.
Contents
- 1 Acronym Mentions and Techniques
- 2 Confirmation of Mention Suggestions
- 3 Finding Unambiguous Gene Mentions
- 4 Disambiguating using Unambiguous Mentions
- 5 Gene Ontology Correspondence
- 6 Chromosome Information Correspondence
- 7 Using Parentheses as Result Regions
- 8 Applying Disqualifying Keywords to Mentions
- 9 Assuming the Presence of Gene Symbols
Acronym Mentions and Techniques
Once acronym mentions and accompanying definitions or explanations of those acronyms have been found (as described in the "Bioscape Searching Techniques" document), further analysis of such mentions and their application to gene or protein mentions can occur.
PubMed #18370233
"The role of inactivating and activating calcium-sensing receptor (CASR) mutations is discussed"
| calcium-sensing receptor | ( | CASR | ) |
|---|---|---|---|
| explanation | | acronym | |
| synonym for CASR (#846) | | match for CASR (#846) | |
- agreement (acronym and explanation support each other)
"with respect to familial hypocalciuric hypercalcemia (FHH) and autosomal dominant hypocalcemia (ADH)."
| familial hypocalciuric hypercalcemia | ( | FHH | ) |
|---|---|---|---|
| explanation (not detected) | | acronym | |
| | | match for CASR (#846) | |
- no agreement (if explanation detected speculatively)
- FHH would not represent CASR despite matching an established gene
- See also PubMed #18393128: "papillary thyroid carcinomas (PTCs)"
PubMed #19364970
"PATIENTS AND METHODS: In addition to UGT1A1*28, UGT1A1*60, UGT1A1*93, UGT1A7*3, and UGT1A9*22 were genotyped in 250 metastatic colorectal cancer patients, and associations with severe hematologic and nonhematologic toxicity, objective response, time to progression (TTP), and overall survival were evaluated."
| time to progression | ( | TTP | ) |
|---|---|---|---|
| explanation (not detected) | | acronym | |
| | | ZFP36 (#7538), ADAMTS13 (#11093), SH3BP4 (#23677) | |
- no agreement (if explanation detected speculatively)
- TTP would not represent ZFP36, ADAMTS13 or SH3BP4
PubMed #18281285
"We have previously shown that ASK1-interacting protein 1 (AIP1) transduces tumor necrosis factor-induced ASK1-JNK signaling."
| ASK1-interacting protein 1 | ( | AIP1 | ) |
|---|---|---|---|
| explanation | | acronym | |
| synonym for DAB2IP (#153090) | | matches for numerous genes including DAB2IP (#153090) | |
- agreement for DAB2IP
- disagreement for other genes associated with AIP1
"Because endoplasmic reticulum (ER) stress activates ASK1-JNK signaling cascade, we investigated the role of AIP1 in ER stress-induced signaling."
| endoplasmic reticulum | ( | ER | ) |
|---|---|---|---|
| explanation | | acronym (not detected) | |
| Gene Ontology term | | | |
- no agreement (but the acronym does not require definition)
PubMed #16566752
"In the present study, we subjected CD1 male mice to intraperitoneal injection with TNFalpha (10 ng/mouse) and then examined the expression and localization of DMT1 (divalent metal transporter 1), IREG1 (iron-regulated protein 1) and ferritin in duodenum."
| localization of | DMT1 | ( | divalent metal transporter 1 | ) |
|---|---|---|---|---|
| | suggestion | | description (not detected) | |
| | DMRT1 (#1761), SLC11A2 (#4891), CHMP2B (#25978) | | summary: "The SLC11A2 gene encodes a divalent metal transporter (DMT1)" | |
- agreement between summary of #4891 and description
- see also this document in the disambiguation techniques documentation
PubMed #10036216
"Phenobarbital (PB) and many structurally unrelated chemicals induce the protein and mRNA of P450 cytochromes CYP2B1, CYP2B2, CYP3A1, and specific phase II enzymes to a greater extent in Fischer 344 (F344) than in Wistar Furth (WF) female rats."
| Phenobarbital | ( | PB | ) |
|---|---|---|---|
| explanation | | suggestion | |
| chemical name | | SMR3B (#10879) | |
- disagreement between suggestion and chemical name
PubMed #2674130
"A new family of ras-related proteins, designated rac (ras-related C3 botulinum toxin substrate) has been identified. rac1 and rac2 cDNA clones were isolated from a differentiated HL-60 library and encode proteins that are 92% homologous and share 58% and 26-30% amino acid homology with human rhos and ras, respectively."
| A new family of ras-related proteins, designated | rac | ( | ras-related C3 botulinum toxin substrate | ) |
|---|---|---|---|---|
| | suggestion | | explanation (not detected) | |
| | AKT1 (#207) | | | |
- suggestion not confirmed by explanation
PubMed #9569054
"The primary end point of the present multicentre, randomized, parallel-group phase II study was to determine the activity of the simplified 2-day EAP schedule in patients with locally advanced or metastatic gastric cancer, and to verify whether the addition of low doses of granulocyte-macrophage colony-stimulating factor (GM-CSF) made it possible to increase dose intensity."
| granulocyte-macrophage colony-stimulating factor | (GM- | CSF | ) |
|---|---|---|---|
| suggestion/explanation | | suggestion/acronym | |
| CSF2 (#1437) | | LAMC2 (#3918) | |
- incomplete match for the acronym causes an inappropriate suggestion
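One way to guard against incomplete matches of this kind is to test the characters on either side of a candidate match. The following is only a sketch under assumptions: the function name `is_complete_match` and the rule that a hyphen joins a match to its neighbour are hypothetical, not part of Bioscape.

```python
def is_complete_match(text, start, end):
    """Return True if text[start:end] is not glued to a neighbouring
    letter, digit or hyphen (an assumed, deliberately strict rule)."""
    def joins(ch):
        return ch.isalnum() or ch == "-"

    before = text[start - 1] if start > 0 else " "
    after = text[end] if end < len(text) else " "
    return not joins(before) and not joins(after)


text = "colony-stimulating factor (GM-CSF) made it possible"

# "CSF" on its own is only a fragment of "GM-CSF".
fragment_start = text.index("CSF")
fragment_ok = is_complete_match(text, fragment_start, fragment_start + 3)

# The whole acronym is delimited by the parentheses.
whole_start = text.index("GM-CSF")
whole_ok = is_complete_match(text, whole_start, whole_start + 6)
```

Treating every hyphen as a joining character is strict: it would also reject "ASK1" in "ASK1-JNK", so a real rule would need to be more selective.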
PubMed #11916966
"We observed a novel endogenous association of BRCA1 with Nmi (N-Myc-interacting protein) in breast cancer cells."
| of BRCA1 with | Nmi | ( | N-Myc-interacting protein | ) |
|---|---|---|---|---|
| | suggestion | | suggestion/explanation (not detected) | |
| | NMI (#9111), MYO1C (#4641) | | | |
- inappropriate suggestion should have been redefined
- later mentions of "Nmi" are thus also inappropriate
PubMed #9092545
Sometimes, recognition of acronym explanations can be quite involved:
"The 130-kDa cytosolic enzyme was purified to homogeneity and shown by tryptic peptide and reverse transcriptase-polymerase chain reaction (RT-PCR)-amplified rat cDNA sequence analyses to be structurally related to the 116-kDa rat hepatic PAK-1/protein kinase N (PKN) and, even more closely (95% sequence identity) to the 130-kDa human PKC-related kinase, PRK2."
| rat hepatic | PAK-1 | / | protein kinase N | ( | PKN | ) |
|---|---|---|---|---|---|---|
| | suggestions | | acronym explanation | | acronym/suggestion | |
| | PAK1 (#5058), PKN1 (#5585) | | | | PKN1 (#5585) | |
- the detected acronym explanation does not support the acronym suggestion
- the adjacent text "PAK-1" does support the acronym suggestion
PubMed #10878382
Method: disambiguated_by_acronym
Principle: the transfer of acronym information to other mentions
"The G protein-coupled CXC-chemokine receptor CXCR-2 mediates activation of neutrophil effector functions in response to multiple ligands, including IL-8 and neutrophil-activating peptide 2 (NAP-2)."
| neutrophil-activating peptide 2 | ( | NAP-2 | ) |
|---|---|---|---|
| explanation | | acronym | |
| PPBP (#5473) | | PPBP (#5473), NAP1L4 (#4676), NAPSB (#256236) | |
- only PPBP (#5473) appears in both lists, thus defining the acronym
- wherever PPBP (#5473) competes elsewhere with genes that were rejected above, those genes can be rejected there as well
- later mentions of "NAP-2" will again suggest PPBP (#5473) together with its competitors; PPBP (#5473) should be chosen in each such case
For example:
"Immunoprecipitation and Western blot analyses of surface-expressed receptors covalently linked to IL-8 or NAP-2 as well as in their unloaded state revealed the occurrence of a single CXCR-2 variant with an apparent size of 56 kDa."
- PPBP (#5473) is chosen in preference to NAP1L4 (#4676) and NAPSB (#256236)
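The principle behind disambiguated_by_acronym can be sketched as a set intersection followed by a preference rule. This is a minimal illustration, assuming suggestions are held as lists of gene identifiers; the function names `define_acronym` and `resolve_mention` are hypothetical.

```python
def define_acronym(explanation_suggestions, acronym_suggestions):
    """Return the single gene present in both lists, or None when the
    lists share no gene or share more than one."""
    shared = set(explanation_suggestions) & set(acronym_suggestions)
    return next(iter(shared)) if len(shared) == 1 else None


def resolve_mention(suggestions, acronym_gene):
    """Prefer the gene that defines the acronym when it is suggested;
    otherwise leave the suggestions unchanged."""
    if acronym_gene is not None and acronym_gene in suggestions:
        return [acronym_gene]
    return list(suggestions)


# Gene identifiers from the PubMed #10878382 example.
explanation = [5473]               # "neutrophil-activating peptide 2"
acronym = [5473, 4676, 256236]     # "NAP-2"

nap2_gene = define_acronym(explanation, acronym)

# A later mention of "NAP-2" produces the same competing suggestions.
later_mention = resolve_mention([5473, 4676, 256236], nap2_gene)
```

When the two lists have nothing in common, `define_acronym` returns `None` and later mentions are left as they are.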
PubMed #7512734
Method: disambiguated_by_acronym
"Here another guanine nucleotide-releasing protein (GNRP), C3G, has been identified as a CRK SH3-binding protein."
| Here another | guanine nucleotide-releasing protein | ( | GNRP | ) |
|---|---|---|---|---|
| | suggestion/explanation | | suggestion | |
| | RCC1 (#1104) | | RASGRF1 (#5923) | |
- suggestions contradict each other
- the suggestion for the acronym should be disregarded
Confirmation of Mention Suggestions
PubMed #11137999
Method: to be implemented
"Here we report the identification of a new transmembrane serine protease (TMPRSS3; also known as ECHOS1) expressed in many tissues, including fetal cochlea, which is mutated in the families used to describe both the DFNB10 and DFNB8 loci."
| ( | TMPRSS3 | ; also known as | ECHOS1 | ) |
|---|---|---|---|---|
| | suggestions | | suggestion | |
| | TMPRSS3 (#64699), TMPRSS4 (#56649) | | TMPRSS3 (#64699) | |
- TMPRSS3 (#64699) occurs for both mentions
- "also known as" could be used as a key to such situations
- close proximity of mentions (as seen with acronyms) could be sufficient
Note that the above also involves an acronym.
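Since this method is still to be implemented, the following is only a sketch of how "also known as" might serve as a key. The regular expression, the function names and the shape of the `suggestions` dictionary are all assumptions made for illustration.

```python
import re

# Assumed pattern: "(NAME; also known as OTHER)" inside one pair of parentheses.
ALIAS_PATTERN = re.compile(
    r"\(\s*([^();]+?)\s*;\s*also known as\s+([^()]+?)\s*\)"
)


def find_alias_pairs(text):
    """Return (name, alias) pairs introduced by "also known as"."""
    return [(m.group(1), m.group(2)) for m in ALIAS_PATTERN.finditer(text)]


def shared_suggestions(first, second):
    """Genes suggested for both names of an alias pair."""
    return sorted(set(first) & set(second))


text = ("a new transmembrane serine protease (TMPRSS3; also known as ECHOS1) "
        "expressed in many tissues")
pairs = find_alias_pairs(text)

# Suggestions as listed for PubMed #11137999.
suggestions = {"TMPRSS3": [64699, 56649], "ECHOS1": [64699]}
confirmed = [shared_suggestions(suggestions[a], suggestions[b]) for a, b in pairs]
```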
PubMed #7479798
Method: confirmed_by_competing_names
"Cloning and analysis of the full-length cDNA of the human CSE1 homologue, which we name CAS for cellular apoptosis susceptibility gene, reveals a protein coding region with similar length (971 amino acids for CAS, 960 amino acids for CSE1) and 59% overall protein homology to the yeast CSE1 protein."
| human | CSE1 | homologue, which we name | CAS |
|---|---|---|---|
| | suggestion | | suggestions |
| | CSE1L (#1434) | | CSE1L (#1434), CTNND1 (#1500), BCAR1 (#9564) |
- CSE1L (#1434) is supported by two different names
- CTNND1 (#1500) and BCAR1 (#9564) are not supported by any other names
- the "which we name" text also confirms the equivalence of the two entities
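The idea of support from competing names can be sketched by counting how many distinct names suggest each gene. The function below is a hypothetical illustration, assuming a mapping from mention text to suggested gene identifiers; the threshold of two names is also an assumption.

```python
from collections import defaultdict


def confirmed_by_competing_names(mentions, minimum_names=2):
    """Return the genes suggested under at least `minimum_names`
    different names. `mentions` maps a name to its suggested gene ids."""
    names_for_gene = defaultdict(set)
    for name, genes in mentions.items():
        for gene in genes:
            names_for_gene[gene].add(name)
    return {gene for gene, names in names_for_gene.items()
            if len(names) >= minimum_names}


# Suggestions from the PubMed #7479798 example.
mentions = {
    "CSE1": [1434],
    "CAS": [1434, 1500, 9564],
}
confirmed = confirmed_by_competing_names(mentions)
```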
Finding Unambiguous Gene Mentions
In various disambiguation techniques, an unambiguous gene mention may be needed in order to disambiguate between competing gene suggestions. Consequently, a reliable method is needed to find "high quality" suggestions which can be considered as unambiguous gene mentions. Consider the following document excerpt:
PubMed #9788873
"In mouse embryo fibroblasts, TCDD activates expression of multiple genes, including CYP1B1, the predominant cytochrome P450 expressed in these cells."
| Mentions | Suggestions | Method: unambiguous_at_exact_location | Method: not_part_of_other_mentions |
|---|---|---|---|
| CYP1B1 | CYP1B1 (#1545) | X | X |
| CYP1 | CYP1A1 (#1543), CYP2A (#1546), CYP27B1 (#1594) | | |
| CYP | PPIG (#9360) | X | |
- Here, CYP1B1 (#1545) can be considered at this location as an unambiguous gene mention.
Taken together, the two methods above can be treated as a single "unambiguous gene mention" method, and the fundamental technique can be defined in terms of it. However, to eliminate obviously bad gene suggestions, other methods must also be applied so that good suggestions are identified more reliably.
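The combination of the two methods can be sketched as follows. The data layout (character spans within the text "CYP1B1" plus a list of suggested gene identifiers) is an assumption for illustration; the two function names simply reuse the method names from the table.

```python
def unambiguous_at_exact_location(suggestions):
    """Exactly one gene is suggested for the mention."""
    return len(suggestions) == 1


def not_part_of_other_mentions(span, all_spans):
    """The mention is not contained within a different, longer mention."""
    start, end = span
    return not any(
        s <= start and end <= e and (s, e) != (start, end)
        for s, e in all_spans
    )


def unambiguous_gene_mentions(mentions):
    """Mentions satisfying both methods."""
    spans = [m["span"] for m in mentions]
    return [
        m for m in mentions
        if unambiguous_at_exact_location(m["suggestions"])
        and not_part_of_other_mentions(m["span"], spans)
    ]


# Mentions found at the text "CYP1B1" in PubMed #9788873.
mentions = [
    {"text": "CYP1B1", "span": (0, 6), "suggestions": [1545]},
    {"text": "CYP1", "span": (0, 4), "suggestions": [1543, 1546, 1594]},
    {"text": "CYP", "span": (0, 3), "suggestions": [9360]},
]
selected = [m["text"] for m in unambiguous_gene_mentions(mentions)]
```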
Disambiguating using Unambiguous Mentions
Within the same document, the presence of unambiguous gene mentions can be used to help disambiguate at other mention locations where an unambiguously identified gene may be "competing" with other genes, typically using a name which is ambiguous. For example:
PubMed #10484773
Method: disambiguated_by_unambiguous_gene_mention
"A common genetic variant (V) of the human luteinizing hormone (LH) beta-subunit gene was recently discovered."
| of the | human luteinizing hormone (LH) beta | -subunit gene |
|---|---|---|
| | suggestion | |
| | LHB (#3972) | |
With this unambiguous mention identified and the presence of the suggested gene confirmed, this knowledge can be applied to other mention locations. For example:
"We have now studied whether additional mutations in the V-LHbeta promoter sequence could contribute to the altered physiology of the LH variant molecules."
| in the | V-LHbeta | promoter sequence |
|---|---|---|
| | suggestions | |
| | LHB (#3972), PLOD2 (#5352), LHX2 (#9355) | |
Since the latter two genes are not unambiguously identified in the document, yet the first gene has been identified (see above), the latter two genes are scored negatively and are regarded as not being referenced.
Note that the confirmed_by_competing_names method also resolves these mentions.
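The scoring step of disambiguated_by_unambiguous_gene_mention might look like the sketch below. The score values (+1, -1, 0) and the function name are assumptions; the document only states that the unidentified competitors are scored negatively.

```python
def disambiguate_by_unambiguous_mention(suggestions, unambiguous_genes):
    """Score the suggestions at an ambiguous location.

    If any suggested gene was identified unambiguously elsewhere in the
    document, it scores +1 and its competitors score -1. Otherwise all
    suggestions stay neutral (0)."""
    if not any(gene in unambiguous_genes for gene in suggestions):
        return {gene: 0 for gene in suggestions}
    return {gene: (1 if gene in unambiguous_genes else -1)
            for gene in suggestions}


# LHB (#3972) was identified unambiguously earlier in PubMed #10484773.
unambiguous_genes = {3972}
scores = disambiguate_by_unambiguous_mention([3972, 5352, 9355],
                                             unambiguous_genes)
```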
Gene Ontology Correspondence
PubMed #16566752
"In the present study, we subjected CD1 male mice to intraperitoneal injection with TNFalpha (10 ng/mouse) and then examined the expression and localization of DMT1 (divalent metal transporter 1), IREG1 (iron-regulated protein 1) and ferritin in duodenum."
| of | DMT1 | (divalent metal transporter 1), | IREG1 | (iron-regulated protein 1) |
|---|---|---|---|---|
| | suggestions | | suggestion | |
| | DMRT1 (#1761), SLC11A2 (#4891), CHMP2B (#25978) | | SLC40A1 (#30061) | |
Gene Ontology (Function) correspondence could be used:
Gene Ontology term | DMRT1 (#1761) | SLC11A2 (#4891) | CHMP2B (#25978) | IREG1 (#30061) |
ferrous iron transmembrane transporter activity | X | |||
iron ion binding | >2 | X | X | |
iron ion transmembrane transporter activity | <1 | X | ||
metal ion binding | X | |||
zinc ion binding | X | X |
- X: exact match
- <1: more specific match: ferrous iron transmembrane transporter activity => iron ion transmembrane transporter activity
- >2: more general match: metal ion binding <= transition metal ion binding <= iron ion binding
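The three kinds of match in the legend can be sketched with a small parent map. This is a simplification under stated assumptions: the `PARENT` dictionary contains only the relationships quoted in the legend and gives each term a single parent, whereas the real Gene Ontology is a graph in which a term may have several parents.

```python
# Only the is-a relationships quoted in the legend above (assumed single-parent).
PARENT = {
    "ferrous iron transmembrane transporter activity":
        "iron ion transmembrane transporter activity",
    "iron ion binding": "transition metal ion binding",
    "transition metal ion binding": "metal ion binding",
}


def steps_up(term, ancestor, parent=PARENT):
    """Number of steps from `term` up to `ancestor`, or None if unreachable."""
    steps = 0
    while term is not None:
        if term == ancestor:
            return steps
        term = parent.get(term)
        steps += 1
    return None


def match_label(gene_term, row_term):
    """Label a gene's term against a row term: "X" for an exact match,
    "<n" if the gene's term is n steps more specific, ">n" if it is
    n steps more general, or None if the terms are unrelated."""
    up = steps_up(gene_term, row_term)
    if up == 0:
        return "X"
    if up is not None:
        return f"<{up}"
    down = steps_up(row_term, gene_term)
    if down is not None:
        return f">{down}"
    return None


specific = match_label("ferrous iron transmembrane transporter activity",
                       "iron ion transmembrane transporter activity")
general = match_label("metal ion binding", "iron ion binding")
```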
Chromosome Information Correspondence
Method: confirmed_by_chromosome_mention
PubMed #10072425
"The t(X;18)(p11.2;q11.2) chromosomal translocation commonly found in synovial sarcomas fuses the SYT gene on chromosome 18 to either of two similar genes, SSX1 or SSX2, on the X chromosome."
| the | SYT | gene on | chromosome 18 |
|---|---|---|---|
| | suggestion | | information |
| | SS18 (#6760) (chromosome 18), SYT1 (#6857) (chromosome 12) | | chromosome 18 |
- SS18 chromosome matches the accompanying information.
- Note that "X chromosome" is also mentioned in the sentence.
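A sentence-level version of confirmed_by_chromosome_mention can be sketched as below. The regular expressions and the mapping from gene identifier to chromosome are assumptions for illustration; in practice the chromosome of each gene would come from the gene database.

```python
import re


def chromosomes_mentioned(sentence):
    """Collect chromosomes written as "chromosome 18" or "X chromosome"."""
    found = set(re.findall(r"\bchromosome\s+(\d{1,2}|[XY])\b", sentence))
    found |= set(re.findall(r"\b(\d{1,2}|[XY])\s+chromosome\b", sentence))
    return found


def confirmed_by_chromosome_mention(gene_chromosomes, sentence):
    """Genes whose chromosome is mentioned in the same sentence."""
    mentioned = chromosomes_mentioned(sentence)
    return [gene for gene, chromosome in gene_chromosomes.items()
            if chromosome in mentioned]


sentence = ("The t(X;18)(p11.2;q11.2) chromosomal translocation commonly found "
            "in synovial sarcomas fuses the SYT gene on chromosome 18 to either "
            "of two similar genes, SSX1 or SSX2, on the X chromosome.")

# Chromosomes of the two genes suggested for "SYT".
gene_chromosomes = {6760: "18", 6857: "12"}
mentioned = chromosomes_mentioned(sentence)
confirmed = confirmed_by_chromosome_mention(gene_chromosomes, sentence)
```

Because the whole sentence is searched, the "X chromosome" is also collected; restricting the search to text near the mention would be one way to avoid confirming a gene on the strength of an unrelated chromosome reference.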
Using Parentheses as Result Regions
Text enclosed within brackets could be regarded as a region corresponding to a particular result, and existing techniques could then be applied to such broader regions.
PubMed #2365818
"Nucleotide sequence analysis revealed that the isolated cDNA clone (lambda hBE1 beta-1) contained a 5'-untranslated sequence of four nucleotides, the translated sequence of 1,176 nucleotides and the 3'-untranslated sequence of 169 nucleotides."
| cDNA clone | ( | lambda | hBE1 | beta-1) |
|---|---|---|---|---|
| type-specific term | | | suggestion | |
| reference to cDNA | | | HBE1 (#3046) | |
- the type-specific term preceding the parentheses could force a specific interpretation of the contents of the parentheses
- a simpler approach for gene identification could involve incorporating "lambda" and "beta" as keywords which disqualify a gene mention if they appear next to a mention
Applying Disqualifying Keywords to Mentions
PubMed #1731328
"We now present cloning of a cDNA coding for CHED (cholinesterase-related cell division controller), a human homolog of the Schizosaccharomyces pombe cell division cycle 2 (cdc2)-like kinases, universal controllers of the mitotic cell cycle."
| a human homolog of the Schizosaccharomyces pombe | cell division cycle 2 | ( | cdc2 | )- | like | kinases |
|---|---|---|---|---|---|---|
| | suggestion | | suggestion | | disqualifier keyword | disqualifier keyword |
| | CDC2 (#983) | | CDC2 (#983), POLD1 (#5424) | | indicates something other than the suggested bioentity | reference to kinases |
- the disqualification applying to the bracketed term should also apply to the equivalent preceding term
- the reference to "a human homolog" could be used to disambiguate between species-specific bioentities
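Disqualifying keywords, and the transfer of a disqualification from a bracketed term to the preceding equivalent term, can be sketched as follows. The keyword set, the one-word adjacency rule and the function names are assumptions drawn from the two examples in this document, not an established list.

```python
import re

# Assumed keyword list, taken from the hBE1 and cdc2 examples.
DISQUALIFIERS = {"like", "kinases", "lambda", "beta"}


def neighbours(text, start, end):
    """The word immediately before and immediately after text[start:end]."""
    before = re.findall(r"[A-Za-z0-9]+", text[:start])[-1:]
    after = re.findall(r"[A-Za-z0-9]+", text[end:])[:1]
    return [word.lower() for word in before + after]


def is_disqualified(text, start, end):
    """True if a disqualifying keyword appears next to the mention."""
    return any(word in DISQUALIFIERS for word in neighbours(text, start, end))


def span_of(text, phrase):
    start = text.index(phrase)
    return start, start + len(phrase)


text = "cell division cycle 2 (cdc2)-like kinases"
acronym_out = is_disqualified(text, *span_of(text, "cdc2"))
explanation_alone = is_disqualified(text, *span_of(text, "cell division cycle 2"))
# The disqualification of the bracketed term is carried over to the
# equivalent preceding term.
explanation_out = explanation_alone or acronym_out

clone = "cDNA clone (lambda hBE1 beta-1) contained"
clone_out = is_disqualified(clone, *span_of(clone, "hBE1"))
```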
Assuming the Presence of Gene Symbols
One technique, among many that make different claims about the kinds of names authors use for genes, holds that where a gene symbol is used, it is more likely to refer to the gene actually being discussed than a name which is not a symbol. However, this is a highly speculative method, as the following examples show.
PubMed #8248256
"Cloning of Nrf1, an NF-E2-related transcription factor, by genetic selection in yeast."
| Nrf1 | , an NF-E2-related transcription factor |
|---|---|
| suggestions | |
| NRF1 (#4899) symbol, NFE2L1 (#4779) not symbol | |
- although NRF1 is the symbol for gene #4899, this alone does not make it the correct suggestion
- in fact, NFE2L1 is the correct suggestion
PubMed #9931490
"XCE, a new member of the endothelin-converting enzyme and neutral endopeptidase family, is preferentially expressed in the CNS."
| XCE | , a new member of the | endothelin | -converting enzyme and neutral | endopeptidase | family |
|---|---|---|---|---|---|
| suggestions | | keyword | | keyword | |
| XCE (#7497) symbol, XIC (#7502), XIST (#7503), ECEL1 (#9427) | | | | referenced in summary for ECEL1 (#9427) | |
- although XCE is the symbol for gene #7497, it is not the correct suggestion
- instead, ECEL1 (#9427) is the correct suggestion, as confirmed by the presence of a gene summary keyword
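One way to keep the symbol assumption from dominating is to treat it as weak evidence that stronger evidence, such as a gene summary keyword, can outweigh. The sketch below assumes arbitrary weights (1 for a symbol, 2 per summary keyword) and a hypothetical `summary_keywords` mapping; neither is taken from Bioscape.

```python
def score_suggestions(candidates, context_keywords, summary_keywords,
                      symbol_bonus=1, keyword_bonus=2):
    """Score each candidate gene.

    `candidates` maps gene id -> True if the mention is the gene's symbol.
    `summary_keywords` maps gene id -> keywords found in its summary."""
    scores = {}
    for gene, is_symbol in candidates.items():
        score = symbol_bonus if is_symbol else 0
        shared = context_keywords & summary_keywords.get(gene, set())
        score += keyword_bonus * len(shared)
        scores[gene] = score
    return scores


# Suggestions for "XCE" in PubMed #9931490; only #7497 has XCE as its symbol.
candidates = {7497: True, 7502: False, 7503: False, 9427: False}
context_keywords = {"endothelin", "endopeptidase"}
# Per the table above, "endopeptidase" is referenced in the ECEL1 summary.
summary_keywords = {9427: {"endopeptidase"}}

scores = score_suggestions(candidates, context_keywords, summary_keywords)
best = max(scores, key=scores.get)
```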