Difference between revisions of "Bioscape Issues and Tasks"

Latest revision as of 13:49, 14 July 2010

Note

Please note that this documentation covers an unreleased product and is for internal use only.

Tasks

Make use of genes in MeSH terms: PubMed #12732733, PubMed #8197175
Introduce work discovery into the quick-start script
Unexplained acronyms: PubMed #9022669
Index narrowing/reduction:
- Use of part-of-speech tags to narrow the index
- Use of stop words to narrow the index
- Punctuation removal from the index
Use of stemming to match keywords
Hierarchical scoring/rescoring (negation) in the quick-start script
Extra data in Bioscape: iRefIndex, diseases
Ad-hoc term highlighting
Keyword disambiguation
Author-based disambiguation
- Inspect author metadata
- Use unambiguous mentions in one document to disambiguate mentions in other documents by the same author
Article dates, retractions, modifications
Packaging and dependency documentation
Test installation in virtual machines
Overview of documentation and work on publications
Assess overlap with NLTK
Review of methods
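
Two of the index-narrowing tasks above (stop words and punctuation removal) can be sketched as follows. This is a minimal illustration only: the function name `narrow_tokens` and the tiny stop-word list are hypothetical, and nothing here reflects how Bioscape actually builds its index.

```python
import string

# Illustrative stop-word list (an assumption); a real list would be much longer.
STOP_WORDS = {"the", "of", "and", "in", "to", "a", "is"}


def narrow_tokens(tokens):
    """Drop stop words and tokens made up only of punctuation."""
    kept = []
    for token in tokens:
        if token.lower() in STOP_WORDS:
            continue  # common English word: not worth indexing
        if all(ch in string.punctuation for ch in token):
            continue  # pure punctuation (or empty): remove from the index
        kept.append(token)
    return kept
```

Part-of-speech filtering and stemming would need a tagger and a stemmer (for example from NLTK) and are left out of this sketch.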

Previously Discussed Ideas

As described in a document originally produced by Ian Donaldson, here are the current issues, wishes and related notes.

Disambiguation and Elimination

Use synonyms to disambiguate from other bioentities (from the same or different organisms)

Some papers (Peregrine, P132, BioCreative #2) suggest only assigning gene identities when mentions are supported by synonyms or (on P133) by uncommon, individual words from the "long-form names" of a gene
Some papers (Hakenberg et al, P141) suggest other forms of synonym, typically originating from other data sources (GeneOntology) as well as chromosome information
- The "disambiguated by competing names" method (which counts the number of names used in a document for a bioentity) consistently raises precision by 4-5%, which suggests that it helps disambiguation
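
A rough sketch of the counting idea behind "disambiguated by competing names" is shown below. It assumes that mentions have already been matched to candidate bioentities; the function name, the `(entity_id, matched_name)` input shape and the tie-breaking rule are illustrative assumptions, not the method as published.

```python
from collections import defaultdict


def rank_by_competing_names(mentions):
    """Rank candidate entities by how many distinct names support them.

    mentions: iterable of (entity_id, matched_name) pairs found in ONE document.
    Returns entity ids, best supported first (ties broken by id, an assumption).
    """
    names = defaultdict(set)
    for entity_id, name in mentions:
        names[entity_id].add(name.lower())
    return sorted(names, key=lambda e: (-len(names[e]), e))
```

For example, if "IL1" could refer to two genes but only one of them is also mentioned by a long-form name in the same document, that gene is ranked first.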

Use synonyms and capitalisation to identify genuine mentions which resemble English words

Search for ambiguous and unambiguous names separately

Handle ambiguous names and English words by searching for unambiguous names in the same abstract

UMLS term disambiguation (http://www.nlm.nih.gov/research/umls/)

Score according to length in order to decide between overlapping matches ("IL1 receptor" is preferred to "IL1")
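
One way to realise this preference is a greedy pass that keeps the longest spans first. The sketch below assumes matches are given as character offsets; the function name and the tuple layout are hypothetical.

```python
def resolve_overlaps(matches):
    """Keep the longest matches and drop any match overlapping a kept one.

    matches: list of (start, end, text) spans, with end exclusive.
    """
    kept = []
    # Longest spans first; earlier start wins among equal lengths.
    for start, end, text in sorted(matches, key=lambda m: (-(m[1] - m[0]), m[0])):
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, text))
    return sorted(kept)
```

With the text "IL1 receptor binds IL1", the match "IL1" at offset 0 is discarded in favour of "IL1 receptor", while the separate "IL1" at the end is retained.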

Disqualifiers (surrounding words which indicate false positives)

Following words: gene, cell, cells, cell type, domain, DNA binding site, mediated, interactor, protooncogene, costimulates, heterodimer, transcripts, corepressor, exerts, suppresses, encodes
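
A minimal sketch of such a check, using the following-word list above, might look like this. The function name, the use of character offsets and the decision to accept a hyphen as well as whitespace between the mention and the disqualifier are assumptions made for illustration.

```python
import re

DISQUALIFIERS = [
    "gene", "cell", "cells", "cell type", "domain", "DNA binding site",
    "mediated", "interactor", "protooncogene", "costimulates", "heterodimer",
    "transcripts", "corepressor", "exerts", "suppresses", "encodes",
]

# Longest phrases first so that "cell type" is tried before "cell".
_FOLLOWING = re.compile(
    r"[\s-]+(?:"
    + "|".join(re.escape(p) for p in sorted(DISQUALIFIERS, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def is_disqualified(text, mention_end):
    """True if the text right after the mention starts with a disqualifier."""
    return _FOLLOWING.match(text, mention_end) is not None
```

Here `mention_end` is the offset just past the matched name, so `is_disqualified("HAP2 gene expression", 4)` inspects the text starting at " gene".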

Unspecific synonyms

Added:

purely numeric names
pN (N being a number)
N kDa
N kD
N k
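
The added patterns above could be recognised with a single regular expression, as in the sketch below. The function name is hypothetical, and accepting decimal weights and ignoring case are assumptions that go beyond the list.

```python
import re

# purely numeric | pN | N kDa / N kD / N k  (N is a number)
_UNSPECIFIC = re.compile(
    r"(?:\d+|p\d+|\d+(?:\.\d+)?\s*k(?:Da|D)?)",
    re.IGNORECASE,
)


def is_unspecific(name):
    """True if the whole name matches one of the unspecific patterns."""
    return _UNSPECIFIC.fullmatch(name.strip()) is not None
```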

From BioThesaurus (http://pir.georgetown.edu/pirwww/iprolink/biothesaurus/supplement/BioThesaurus_Supplement.pdf):

N k (protein(s))
N aa long hypothetical protein(s)
N kaa long hypothetical protein(s)
hypothetical protein precursor(s)
unnamed protein product(s)
conserved (hypothetical)/expressed/hypothetical (conserved)/novel/predicted/putative (exported)/unknown (polyprotein(s)/protein(s)/orf(s))

Action:

Update the scoring for uninformative names in the following file:

bioscape/modules/text/sql/importdb-score-uninformative-pgsql.sql.in

Others:

tRNA, RNA, DNA, mRNA, snRNA

Action:

Check for the presence of such terms in the chemical/molecule name lexicon

Recognition

Conjunctions and enumerations:

HAP2, 3, 4
HAP2, 3 and 4
HAP2-4
HAP-2, -3, -4
HAP2/4
HAP2 to HAP4
freac1-freac7
M and B creatine kinase
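
Some of these forms can be expanded mechanically. The sketch below handles only comma/"and" lists and numeric ranges; it treats "HAP2-4" as a range (an assumption, since it could also be a single name), normalises "HAP-2" to "HAP2", and leaves forms such as "HAP2/4" and "M and B creatine kinase" untouched. The function name is hypothetical.

```python
import re

# "HAP2, 3, 4", "HAP2, 3 and 4", "HAP-2, -3, -4"
_LIST = re.compile(r"^([A-Za-z]+)-?(\d+)((?:\s*(?:,|and)\s*-?\d+)+)$")
# "HAP2-4", "HAP2 to HAP4", "freac1-freac7"
_RANGE = re.compile(r"^([A-Za-z]+)(\d+)\s*(?:-|to)\s*(?:\1)?(\d+)$")


def expand_enumeration(text):
    """Expand an enumerated mention into individual names where possible."""
    text = text.strip()
    m = _LIST.match(text)
    if m:
        stem, first, rest = m.groups()
        numbers = [first] + re.findall(r"\d+", rest)
        return [stem + n for n in numbers]
    m = _RANGE.match(text)
    if m:
        stem, low, high = m.group(1), int(m.group(2)), int(m.group(3))
        if low <= high:
            return [f"{stem}{n}" for n in range(low, high + 1)]
    return [text]  # unrecognised form: return unchanged
```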

Acronym recognition and expansion/equivalence:

AcroMed (http://medstract.med.tufts.edu/acro1.1/index.htm, http://www.medstract.org/)
ADAM (http://arrowsmith.psych.uic.edu/arrowsmith_uic/index.html)
ALICE (http://uvdb3.hgc.jp/ALICE/ALICE_index.html)
ARGH (http://invention.swmed.edu/argh/)
Biomedical Abbreviation Server (http://abbreviation.stanford.edu/)
BioText Abbreviation Definition Recognition Software (http://biotext.berkeley.edu/software.html)
SaRAD (http://www.hpl.hp.com/research/idl/projects/abbrev.html)
See also: http://www.hsls.pitt.edu/guides/genetics/obrc/others/literature_protocols

Techniques:

Compare words preceding acronyms in parentheses with acronym initials
- The most conservative acronym disambiguation approach involves comparing the list of candidates suggested by an acronym with those suggested by the accompanying "explanation"
- However, adopting a name/synonym disambiguation method, such as the "disambiguated by competing names" method, seems to overlap with such an acronym disambiguation technique
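
The first technique, comparing the words preceding a parenthesised acronym with its initials, can be sketched as below. This is a deliberately strict illustration: it requires exactly one word per acronym letter, which is an assumption and will miss many real definitions. The function name is hypothetical.

```python
import re


def acronym_definitions(text):
    """Return (acronym, long_form) pairs whose preceding words spell the acronym."""
    results = []
    for m in re.finditer(r"\(([A-Za-z]{2,10})\)", text):
        acronym = m.group(1)
        # Words before the opening parenthesis (hyphenated words are split).
        words = re.findall(r"[A-Za-z0-9]+", text[:m.start()])
        candidate = words[-len(acronym):]
        initials = "".join(w[0] for w in candidate)
        if len(candidate) == len(acronym) and initials.lower() == acronym.lower():
            results.append((acronym, " ".join(candidate)))
    return results
```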

Multi-word descriptions and authoritative names (involving commas, parentheses)

Orthographic variation/tokenisation

Skipping terms: "type" as in "IL type 1"
Greek letters converted to Latin equivalents
Hyphens removed, but inserted after every Greek letter
Hyphens added at alphabetic/numeric boundaries
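
The hyphen and Greek-letter rules above might combine as in the following sketch. Several points are assumptions: "Latin equivalents" is read as spelled-out names ("β" becomes "beta"), the rules are applied in the order listed, only a handful of Greek letters are mapped, and a trailing hyphen left at the end of a name is stripped. The "type" skipping rule is not covered.

```python
import re

# Small illustrative mapping (an assumption); extend as needed.
GREEK = {"α": "alpha", "β": "beta", "γ": "gamma", "δ": "delta", "κ": "kappa"}


def normalise(name):
    """Apply the hyphen and Greek-letter rules to one name."""
    # Remove existing hyphens.
    name = name.replace("-", "")
    # Convert Greek letters and insert a hyphen after each one.
    for greek, latin in GREEK.items():
        name = name.replace(greek, latin + "-")
    # Add hyphens at alphabetic/numeric boundaries.
    name = re.sub(r"(?<=[A-Za-z])(?=\d)|(?<=\d)(?=[A-Za-z])", "-", name)
    return name.strip("-")
```

Applying the same function to both lexicon names and text tokens means that "IL1" and "IL-1" both end up as "IL-1".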

Synonyms

Synonyms shorter than six characters are searched in all upper case or with initial upper case:

("Change this for yeast.")
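
A small sketch of generating the search forms for a synonym under this rule follows. The function name is hypothetical, and returning longer synonyms unchanged is an assumption about what happens outside the rule.

```python
def case_variants(synonym):
    """Return the forms to search for, following the short-synonym rule."""
    if len(synonym) < 6:
        all_upper = synonym.upper()
        initial_upper = synonym[:1].upper() + synonym[1:].lower()
        # A set removes duplicates such as "A1" / "A1".
        return sorted({all_upper, initial_upper})
    return [synonym]
```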

Organism-specific

Prefixes ("h") and suffixes ("p") in organism-specific rules

General

Usage of Biothesaurus, BioLexicon

Usage of euGenes (http://eugenes.org/)

Full-text searching

Manual curation lists for adding/removing names for specific bioentities, whole organisms

Case-sensitive searching of ambiguous names:

Case-insensitive searching only for numeric names or for names longer than 5 characters

Remove the subtype specifier if the organism has only one subtype ("aminocyclase 1" becomes "aminocyclase")
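
As an illustration, the subtype rule could be applied to the name list of a single organism as follows. The function name is hypothetical, and treating a trailing whitespace-separated number as the subtype specifier is an assumption.

```python
import re
from collections import defaultdict

_SUBTYPE = re.compile(r"^(.*\S)\s+(\d+)$")


def strip_lone_subtypes(names):
    """Strip a trailing subtype number when the organism has no other subtype.

    names: all names for ONE organism.
    """
    subtypes = defaultdict(set)
    for name in names:
        m = _SUBTYPE.match(name)
        if m:
            subtypes[m.group(1)].add(m.group(2))
    result = []
    for name in names:
        m = _SUBTYPE.match(name)
        if m and len(subtypes[m.group(1)]) == 1:
            result.append(m.group(1))  # only one subtype: drop the number
        else:
            result.append(name)
    return result
```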

Run a pre-search and score names according to the number of results each returns across all PubMed abstracts, filtering out names on that basis
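
One possible shape for such a filter is sketched below. It assumes that names matching a large share of abstracts are the ones to remove, and uses an inverse-frequency-style score; the scoring formula, the threshold and the function name are all illustrative assumptions rather than anything specified in these notes.

```python
import math


def score_names(hit_counts, total_abstracts, min_score=0.2):
    """Score names by rarity in a pre-search and drop low scorers.

    hit_counts: mapping of name -> number of abstracts matched.
    total_abstracts: size of the searched collection (must be positive).
    """
    scores = {}
    for name, hits in hit_counts.items():
        # 1.0 for names that match nothing, 0.0 for names matching everything.
        score = 1.0 - math.log1p(hits) / math.log1p(total_abstracts)
        if score >= min_score:
            scores[name] = score
    return scores
```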

Stop words

Common English words
Protein family terms
Non-protein molecules
Experimental words
