Bioscape Issues and Tasks

From irefindex
Revision as of 12:40, 21 May 2010 by PaulBoddie (talk | contribs) (Added a task list.)

Please note that this documentation covers an unreleased product and is for internal use only.


Tasks

  • Make use of genes in MeSH terms: PubMed #12732733, PubMed #8197175
  • Introduce work discovery into the quick-start script
  • Unexplained acronyms: PubMed #9022669
  • Index narrowing/reduction:
    • Use of part-of-speech tags to narrow the index
    • Use of stop words to narrow the index
    • Punctuation removal from the index
  • Use of stemming to match keywords
  • Hierarchical scoring/rescoring (negation) in the quick-start script
  • Extra data in Bioscape: iRefIndex, diseases
  • Ad-hoc term highlighting
  • Keyword disambiguation
  • Author-based disambiguation
    • Inspect author metadata
    • Use unambiguous mentions in one document to disambiguate mentions in other documents by the same author
  • Article dates, retractions, modifications
  • Packaging and dependency documentation
  • Test installation in virtual machines
  • Overview of documentation and work on publications
  • Assess overlap with NLTK
  • Review of methods

Previously Discussed Ideas

As described in a document originally produced by Ian Donaldson, here are the current issues, wishes and related notes.

Disambiguation and Elimination

Use synonyms to disambiguate from other bioentities (from the same or different organisms)

  • Some papers (Peregrine, P132, BioCreative #2) suggest only assigning gene identities when mentions are supported by synonyms or (on P133) by uncommon, individual words from the "long-form names" of a gene
  • Some papers (Hakenberg et al., P141) suggest other forms of synonym, typically originating from other data sources (Gene Ontology) as well as chromosome information
    • The "disambiguated by competing names" method (which counts the number of names used in a document for a bioentity) consistently raises precision by 4-5%, which suggests that it does help disambiguation
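
One possible reading of the "disambiguated by competing names" idea is sketched here. The candidate table, the identifiers and the function names are hypothetical, and the tie-handling rule is an assumption rather than something taken from the cited papers: the sketch counts how many distinct names of each candidate bioentity occur in a document and keeps the candidate with the most support.

```python
import re

def count_supporting_names(text, candidates):
    """Return {entity_id: number of distinct known names found in text}."""
    lowered = text.lower()
    counts = {}
    for entity_id, names in candidates.items():
        found = 0
        for name in set(n.lower() for n in names):
            # Whole-token match, so that "il1" is not found inside "il10".
            pattern = r"(?<![a-z0-9])" + re.escape(name) + r"(?![a-z0-9])"
            if re.search(pattern, lowered):
                found += 1
        counts[entity_id] = found
    return counts

def pick_candidate(text, candidates):
    """Return the best-supported entity, or None on a tie or with no support."""
    counts = count_supporting_names(text, candidates)
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    if not ranked or ranked[0][1] == 0:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # assumption: leave tied mentions unassigned
    return ranked[0][0]
```

A mention of "CAT" alone leaves two candidates tied, whereas an abstract that also spells out one of the long-form names tips the count towards that candidate.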

Use synonyms and capitalisation to identify genuine mentions which resemble English words

Search for ambiguous and unambiguous names separately

Handle ambiguous names and English words by searching for unambiguous names in the same abstract

UMLS term disambiguation (http://www.nlm.nih.gov/research/umls/)

Score according to length in order to decide between overlapping matches ("IL1 receptor" is preferred to "IL1")
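
A minimal sketch of length-based selection between overlapping matches follows. The (start, end, name) tuple layout and the function name are assumptions made for illustration; the only rule taken from the note above is that the longer match wins.

```python
def resolve_overlaps(matches):
    """Keep the longest match wherever matches overlap.

    Each match is a (start, end, name) tuple with an exclusive end offset.
    """
    chosen = []
    # Consider the longest matches first; ties are broken by earliest start.
    for match in sorted(matches, key=lambda m: (-(m[1] - m[0]), m[0])):
        start, end, _ = match
        if all(end <= s or start >= e for s, e, _ in chosen):
            chosen.append(match)
    return sorted(chosen)
```

For the text "the IL1 receptor binds", the candidate spans for "IL1" and "IL1 receptor" overlap, so only "IL1 receptor" is kept.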

Disqualifiers (surrounding words which indicate false positives)

Following words: gene, cell, cells, cell type, domain, DNA binding site, mediated, interactor, protooncogene, costimulates, heterodimer, transcripts, corepressor, exerts, suppresses, encodes
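
The following-word check could be sketched like this. The word list is the one given above; the function name, the offset-based interface and the decision to allow a hyphen between the mention and the disqualifier (as in "IL2-mediated") are assumptions.

```python
import re

# Following words taken from the list above; multi-word entries are phrases.
DISQUALIFIERS = [
    "gene", "cell", "cells", "cell type", "domain", "DNA binding site",
    "mediated", "interactor", "protooncogene", "costimulates", "heterodimer",
    "transcripts", "corepressor", "exerts", "suppresses", "encodes",
]

# Longest entries first, so that "cell type" is tried before "cell".
_ALTERNATIVES = "|".join(
    re.escape(word) for word in sorted(DISQUALIFIERS, key=len, reverse=True)
)
_FOLLOWING = re.compile(r"[\s-]*(?:" + _ALTERNATIVES + r")\b", re.IGNORECASE)

def is_disqualified(text, end):
    """Return True if the text right after offset `end` starts with a disqualifier."""
    return _FOLLOWING.match(text, end) is not None
```

Here `end` is the offset just past a candidate mention, so `is_disqualified("p53 gene expression", 3)` tests the words that follow "p53".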

Unspecific synonyms

Added:
  • purely numeric names
  • pN (N being a number)
  • N kDa
  • N kD
  • N k
From BioThesaurus (http://pir.georgetown.edu/pirwww/iprolink/biothesaurus/supplement/BioThesaurus_Supplement.pdf):
  • N k (protein(s))
  • N aa long hypothetical protein(s)
  • N kaa long hypothetical protein(s)
  • hypothetical protein precursor(s)
  • unnamed protein product(s)
  • conserved (hypothetical)/expressed/hypothetical (conserved)/novel/predicted/putative (exported)/unknown (polyprotein(s)/protein(s)/orf(s))
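
As an illustration only, the patterns listed above could be expressed as regular expressions along these lines. This is not the content of the SQL file mentioned below; the function name, the handling of decimal numbers and the way the combined qualifier/noun entry is interpreted are all assumptions.

```python
import re

_PATTERNS = [
    r"\d+",                                     # purely numeric names
    r"p\d+",                                    # pN
    r"\d+(?:\.\d+)?\s*k(?:da|d)?",              # N kDa, N kD, N k (decimals assumed)
    r"\d+\s*k\s+proteins?",                     # N k protein(s)
    r"\d+\s*k?aa long hypothetical proteins?",  # N aa / N kaa long hypothetical protein(s)
    r"hypothetical protein precursors?",
    r"unnamed protein products?",
    # Qualifier followed by a generic noun (one reading of the combined entry).
    r"(?:conserved(?: hypothetical)?|expressed|hypothetical(?: conserved)?|novel"
    r"|predicted|putative(?: exported)?|unknown)\s+(?:polyproteins?|proteins?|orfs?)",
]
UNINFORMATIVE = re.compile("|".join("(?:%s)" % p for p in _PATTERNS), re.IGNORECASE)

def is_uninformative(name):
    """Return True if the whole name matches one of the unspecific patterns."""
    return UNINFORMATIVE.fullmatch(name.strip()) is not None
```
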
Action:
Update the scoring for uninformative names in the following file:
bioscape/modules/text/sql/importdb-score-uninformative-pgsql.sql.in
Others:
tRNA, RNA, DNA, mRNA, snRNA
Action:
Check for the presence of such terms in the chemical/molecule name lexicon

Recognition

Conjunctions and enumerations:

  • HAP2, 3, 4
  • HAP2, 3 and 4
  • HAP2-4
  • HAP-2, -3, -4
  • HAP2/4
  • HAP2 to HAP4
  • freac1-freac7
  • M and B creatine kinase
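
Several of the numeric forms above can be expanded with two regular expressions, as in this rough sketch. The function name is hypothetical, "HAP2/4" is assumed to mean a pair rather than a range, and word-based enumerations such as "M and B creatine kinase" are deliberately left untouched.

```python
import re

# Ranges: "HAP2-4", "HAP2 to HAP4", "freac1-freac7"
_RANGE = re.compile(r"^([A-Za-z]+)(\d+)\s*(?:-|to)\s*(?:\1)?(\d+)$")
# Lists: "HAP2, 3, 4", "HAP2, 3 and 4", "HAP-2, -3, -4", "HAP2/4"
_LIST = re.compile(r"^([A-Za-z]+-?)(\d+)((?:\s*(?:,|/|and)\s*-?\d+)+)$")

def expand_enumeration(mention):
    """Expand a numeric enumeration into individual names where possible."""
    mention = mention.strip()
    m = _RANGE.match(mention)
    if m:
        stem, first, last = m.group(1), int(m.group(2)), int(m.group(3))
        return ["%s%d" % (stem, n) for n in range(first, last + 1)]
    m = _LIST.match(mention)
    if m:
        stem = m.group(1)
        numbers = [m.group(2)] + re.findall(r"\d+", m.group(3))
        return [stem + n for n in numbers]
    # Unrecognised forms are returned unchanged.
    return [mention]
```
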

Acronym recognition and expansion/equivalence:

Techniques:

  • Compare words preceding acronyms in parentheses with acronym initials
    • The most conservative acronym disambiguation approach involves comparing the list of candidates suggested by an acronym with those suggested by the accompanying "explanation"
    • However, adopting a name/synonym disambiguation method, such as the "disambiguated by competing names" method, seems to overlap with such an acronym disambiguation technique
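
The first technique (comparing the words before a parenthesised acronym with the acronym's initials) might look like the sketch below. It is deliberately naive: it assumes one word per acronym letter, and the function name is hypothetical.

```python
import re

def find_explained_acronyms(text):
    """Return (acronym, long_form) pairs where the initials of the preceding
    words spell out the acronym given in parentheses."""
    results = []
    for m in re.finditer(r"\(([A-Za-z][A-Za-z0-9]{1,9})\)", text):
        acronym = m.group(1)
        words = re.findall(r"[A-Za-z0-9]+", text[:m.start()])
        # Take as many preceding words as the acronym has characters.
        candidate = words[-len(acronym):]
        initials = "".join(word[0] for word in candidate)
        if len(candidate) == len(acronym) and initials.lower() == acronym.lower():
            results.append((acronym, " ".join(candidate)))
    return results
```

Long forms that do not map one word to one letter (for example "chloramphenicol acetyltransferase (CAT)") are not recognised by this sketch, which is one reason a candidate-comparison approach may still be needed.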

Multi-word descriptions and authoritative names (involving commas, parentheses)

Orthographic variation/tokenisation

  • Skipping terms: "type" as in "IL type 1"
  • Greek letters converted to Latin equivalents
  • Hyphens removed, but inserted after every Greek letter
  • Hyphens added at alphabetic/numeric boundaries
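
The rules above could be combined into a single normalisation step, sketched here under several assumptions: "Latin equivalents" is taken to mean spelled-out letter names (a single-letter mapping would work the same way), the Greek table is a small illustrative subset, and the order in which the rules are applied is a guess.

```python
import re

# Illustrative subset only; a real table would cover the whole alphabet.
GREEK = {"α": "alpha", "β": "beta", "γ": "gamma", "δ": "delta", "κ": "kappa"}

def normalise(name):
    """Apply the orthographic rules to a single name."""
    # Skip the term "type", as in "IL type 1".
    name = re.sub(r"\btype\s+", "", name, flags=re.IGNORECASE)
    # Remove existing hyphens.
    name = name.replace("-", "")
    # Convert Greek letters and insert a hyphen after each one.
    for greek, latin in GREEK.items():
        name = name.replace(greek, latin + "-")
    # Add hyphens at alphabetic/numeric boundaries.
    name = re.sub(r"(?<=[A-Za-z])(?=\d)|(?<=\d)(?=[A-Za-z])", "-", name)
    # A hyphen left at the very end serves no purpose.
    return name.strip("-")
```

Applying the same function to both lexicon entries and text tokens would let variants such as "IL-1β" and "IL1β" meet in one normalised form.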

Synonyms

Synonyms shorter than six characters are searched for in all upper case or with an initial upper-case letter:

  • ("Change this for yeast.")
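
A small sketch of the case forms generated for short synonyms is given here. The function name is hypothetical, and the threshold is exposed as a parameter on the assumption that an organism such as yeast would need a different setting.

```python
def short_synonym_forms(synonym, threshold=6):
    """Return the case forms to search for a given synonym.

    Synonyms shorter than `threshold` characters yield an all-upper-case form
    and an initial-upper-case form; longer synonyms are returned unchanged.
    """
    if len(synonym) >= threshold:
        return [synonym]
    forms = [synonym.upper(), synonym[:1].upper() + synonym[1:].lower()]
    # Remove duplicates (e.g. for one-letter synonyms) while keeping order.
    return list(dict.fromkeys(forms))
```
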

Organism-specific

Prefixes ("h") and suffixes ("p") in organism-specific rules

General

Usage of BioThesaurus, BioLexicon

Usage of euGenes (http://eugenes.org/)

Full-text searching

Manual curation lists for adding/removing names for specific bioentities, whole organisms

Case-sensitive searching of ambiguous names:

  • Case-insensitive searching only for numeric names or for names longer than 5 characters
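
The case rule can be sketched as a pattern builder. The function name and the whole-token boundary check are assumptions; the rule itself (case-insensitive only for numeric names or names longer than 5 characters) is the one stated above.

```python
import re

def compile_name_pattern(name):
    """Compile a search pattern for a name using the case rule above."""
    case_insensitive = name.isdigit() or len(name) > 5
    flags = re.IGNORECASE if case_insensitive else 0
    # Whole-token match: no letter or digit directly before or after the name.
    return re.compile(
        r"(?<![A-Za-z0-9])" + re.escape(name) + r"(?![A-Za-z0-9])", flags
    )
```

With this rule, the short name "CAT" does not match the English word "cat", while a longer name such as "catalase" still matches "Catalase" at the start of a sentence.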

Remove the subtype specifier if there is only one subtype for that organism ("aminocyclase 1" becomes "aminocyclase")
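
A possible sketch of the subtype rule, applied to the names known for a single organism, is shown below. The function name is hypothetical, and a subtype specifier is assumed to be a trailing whole number.

```python
import re
from collections import defaultdict

_SUBTYPE = re.compile(r"(.+?)\s+(\d+)")

def strip_lone_subtypes(names):
    """Map each name to a form without its subtype number when that number is
    the only subtype seen for the stem among the given names."""
    subtypes = defaultdict(set)
    for name in names:
        m = _SUBTYPE.fullmatch(name)
        if m:
            subtypes[m.group(1).lower()].add(m.group(2))
    simplified = {}
    for name in names:
        m = _SUBTYPE.fullmatch(name)
        if m and len(subtypes[m.group(1).lower()]) == 1:
            simplified[name] = m.group(1)
        else:
            simplified[name] = name
    return simplified
```
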

Pre-search each name and score it according to the number of results returned across all PubMed abstracts, filtering out names on that basis

Stop words

  • Common English words
  • Protein family terms
  • Non-protein molecules
  • Experimental words