Bioscape Issues and Tasks
Note: This documentation covers an unreleased product and is for internal use only.
Tasks
- Make use of genes in MeSH terms: PubMed #12732733, PubMed #8197175
- Introduce work discovery into the quick-start script
- Unexplained acronyms: PubMed #9022669
- Index narrowing/reduction:
- Use of part-of-speech tags to narrow the index
- Use of stop words to narrow the index
- Punctuation removal from the index
- Use of stemming to match keywords
- Hierarchical scoring/rescoring (negation) in the quick-start script
- Extra data in Bioscape: iRefIndex, diseases
- Ad-hoc term highlighting
- Keyword disambiguation
- Author-based disambiguation
- Inspect author metadata
- Use unambiguous mentions in one document to disambiguate mentions in other documents by the same author
- Article dates, retractions, modifications
- Packaging and dependency documentation
- Test installation in virtual machines
- Overview of documentation and work on publications
- Assess overlap with NLTK
- Review of methods
Previously Discussed Ideas
The current issues, wishes and related notes below are drawn from a document originally produced by Ian Donaldson.
Disambiguation and Elimination
Use synonyms to disambiguate from other bioentities (from the same or different organisms)
- Some papers (Peregrine, P132, BioCreative #2) suggest only assigning gene identities when mentions are supported by synonyms or (on P133) by uncommon, individual words from the "long-form names" of a gene
- Some papers (Hakenberg et al, P141) suggest other forms of synonym, typically originating from other data sources (GeneOntology) as well as chromosome information
- The "disambiguated by competing names" method (which counts the number of names used in a document for a bioentity) consistently raises precision by 4-5%, which indicates that counting competing names does help disambiguation
Use synonyms and capitalisation to identify genuine mentions which resemble English words
Search for ambiguous and unambiguous names separately
Handle ambiguous names and English words by searching for unambiguous names in the same abstract
UMLS term disambiguation (http://www.nlm.nih.gov/research/umls/)
Score according to length in order to decide between overlapping matches ("IL1 receptor" is preferred to "IL1")
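A minimal sketch of length-based scoring for overlapping matches follows. The function name `resolve_overlaps` and the `(start, end, name)` tuple layout are assumptions made for illustration and are not taken from the Bioscape code.

```python
def resolve_overlaps(matches):
    """Keep the longest match wherever matches overlap.

    Each match is a (start, end, name) tuple with an exclusive end offset.
    """
    # Consider the longest matches first; ties go to the earliest start.
    ordered = sorted(matches, key=lambda m: (-(m[1] - m[0]), m[0]))
    kept = []
    for start, end, name in ordered:
        # Accept a match only if it overlaps nothing already accepted.
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, name))
    return sorted(kept)


text = "binding of the IL1 receptor"
candidates = [(15, 18, "IL1"), (15, 27, "IL1 receptor")]
print(resolve_overlaps(candidates))
```

Here the score is simply the character length of the match; a real scoring scheme might combine length with other evidence.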
Disqualifiers (surrounding words which indicate false positives)
Following words: gene, cell, cells, cell type, domain, DNA binding site, mediated, interactor, protooncogene, costimulates, heterodimer, transcripts, corepressor, exerts, suppresses, encodes
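One way the following-word disqualifiers might be applied is sketched below. The word list is copied from the line above; the function name `is_disqualified`, the case-insensitive comparison, and the decision to accept a hyphen as well as whitespace between the mention and the following word are assumptions.

```python
import re

# Following words copied from the list above.
DISQUALIFYING_FOLLOWERS = [
    "gene", "cell", "cells", "cell type", "domain", "DNA binding site",
    "mediated", "interactor", "protooncogene", "costimulates",
    "heterodimer", "transcripts", "corepressor", "exerts", "suppresses",
    "encodes",
]

# Longest phrases first, so that "cell type" is tried before "cell".
_FOLLOWER_RE = re.compile(
    r"[\s\-]+(?:"
    + "|".join(
        re.escape(w)
        for w in sorted(DISQUALIFYING_FOLLOWERS, key=len, reverse=True)
    )
    + r")\b",
    re.IGNORECASE,
)


def is_disqualified(text, match_end):
    """Return True if a disqualifying word directly follows the mention
    that ends at offset match_end (exclusive)."""
    return _FOLLOWER_RE.match(text, match_end) is not None


sentence = "The p53 gene is mutated"
print(is_disqualified(sentence, 7))
```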
Unspecific synonyms
- Added:
- purely numeric names
- pN (N being a number)
- N kDa
- N kD
- N k
- From BioThesaurus (http://pir.georgetown.edu/pirwww/iprolink/biothesaurus/supplement/BioThesaurus_Supplement.pdf):
- N k (protein(s))
- N aa long hypothetical protein(s)
- N kaa long hypothetical protein(s)
- hypothetical protein precursor(s)
- unnamed protein product(s)
- conserved (hypothetical)/expressed/hypothetical (conserved)/novel/predicted/putative (exported)/unknown (polyprotein(s)/protein(s)/orf(s))
- Action:
- Update the scoring for uninformative names in the following file:
bioscape/modules/text/sql/importdb-score-uninformative-pgsql.sql.in
- Others:
- tRNA, RNA, DNA, mRNA, snRNA
- Action:
- Check for the presence of such terms in the chemical/molecule name lexicon
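The patterns in the "Added" list above (purely numeric names, pN, N kDa, N kD, N k) could be recognised with a single regular expression, as in the sketch below. This is only an illustration in Python: the actual scoring lives in the SQL file named above, and the function name `is_uninformative`, the restriction to whole numbers, and the case-sensitive matching are assumptions.

```python
import re

# Covers only the "Added" patterns; the BioThesaurus phrases are not handled.
UNINFORMATIVE = re.compile(
    r"""
    (?: \d+                  # purely numeric names
      | p\d+                 # pN, N being a number
      | \d+\s*k(?:Da|D)?     # N kDa, N kD, N k
    )
    """,
    re.VERBOSE,
)


def is_uninformative(name):
    """Return True if the whole name matches one of the patterns."""
    return UNINFORMATIVE.fullmatch(name.strip()) is not None


for name in ["42", "p53", "14 kDa", "30 k", "HAP2"]:
    print(name, is_uninformative(name))
```

Note that, taken literally, the pN pattern also flags familiar names such as "p53", which is why these names are scored rather than simply discarded.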
Recognition
Conjunctions and enumerations:
- HAP2, 3, 4
- HAP2, 3 and 4
- HAP2-4
- HAP-2, -3, -4
- HAP2/4
- HAP2 to HAP4
- freac1-freac7
- M and B creatine kinase
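Some of the enumeration forms above can be expanded mechanically. The sketch below handles numeric ranges and comma/"and" lists only; the function name `expand_enumeration` is hypothetical, and treating "HAP2-4" as a range (rather than as a name in its own right) is an assumption. Forms such as "HAP2/4" and "M and B creatine kinase" are returned unchanged.

```python
import re


def expand_enumeration(text):
    """Expand a few enumeration forms into a list of full names."""
    # Ranges: "HAP2-4", "HAP2 to HAP4", "freac1-freac7"
    m = re.fullmatch(r"([A-Za-z]+)(\d+)\s*(?:-|to)\s*(?:\1)?(\d+)", text)
    if m:
        stem, first, last = m.group(1), int(m.group(2)), int(m.group(3))
        return [f"{stem}{n}" for n in range(first, last + 1)]
    # Lists: "HAP2, 3, 4", "HAP2, 3 and 4", "HAP-2, -3, -4"
    m = re.fullmatch(r"([A-Za-z]+)-?(\d+)((?:\s*(?:,|and)\s*-?\d+)+)", text)
    if m:
        stem = m.group(1)
        numbers = [m.group(2)] + re.findall(r"\d+", m.group(3))
        # Hyphens between stem and number are dropped in the output.
        return [f"{stem}{n}" for n in numbers]
    # Anything else is left as a single name.
    return [text]


for form in ["HAP2, 3 and 4", "HAP2-4", "HAP-2, -3, -4", "HAP2 to HAP4"]:
    print(form, "->", expand_enumeration(form))
```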
Acronym recognition and expansion/equivalence:
- AcroMed (http://medstract.med.tufts.edu/acro1.1/index.htm, http://www.medstract.org/)
- ADAM (http://arrowsmith.psych.uic.edu/arrowsmith_uic/index.html)
- ALICE (http://uvdb3.hgc.jp/ALICE/ALICE_index.html)
- ARGH (http://invention.swmed.edu/argh/)
- Biomedical Abbreviation Server (http://abbreviation.stanford.edu/)
- BioText Abbreviation Definition Recognition Software (http://biotext.berkeley.edu/software.html)
- SaRAD (http://www.hpl.hp.com/research/idl/projects/abbrev.html)
- See also: http://www.hsls.pitt.edu/guides/genetics/obrc/others/literature_protocols
Techniques:
- Compare words preceding acronyms in parentheses with acronym initials
- The most conservative acronym disambiguation approach involves comparing the list of candidates suggested by an acronym with those suggested by the accompanying "explanation"
- However, adopting a name/synonym disambiguation method, such as the "disambiguated by competing names" method, seems to overlap with such an acronym disambiguation technique
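The first technique, comparing the words that precede a parenthesised acronym with the acronym's initials, might look like the following sketch. It is deliberately naive: it assumes one word per acronym letter, so hyphenated or multi-letter contributions are missed, and the function name `find_acronym_definitions` is hypothetical.

```python
import re


def find_acronym_definitions(text):
    """Return (acronym, long_form) pairs where the initials of the words
    preceding a parenthesised acronym spell out that acronym."""
    results = []
    for m in re.finditer(r"\(([A-Za-z]{2,10})\)", text):
        acronym = m.group(1)
        # Take as many preceding words as the acronym has letters.
        words = text[:m.start()].split()
        candidate = words[-len(acronym):]
        if len(candidate) == len(acronym) and all(
            word[0].lower() == letter.lower()
            for word, letter in zip(candidate, acronym)
        ):
            results.append((acronym, " ".join(candidate)))
    return results


sentence = "mutations in the tumor necrosis factor (TNF) gene"
print(find_acronym_definitions(sentence))
```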
Multi-word descriptions and authoritative names (involving commas, parentheses)
Orthographic variation/tokenisation
- Skipping terms: "type" as in "IL type 1"
- Greek letters converted to Latin equivalents
- Hyphens removed, but inserted after every Greek letter
- Hyphens added at alphabetic/numeric boundaries
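The notes above leave the order and exact form of these normalisations open, so the sketch below is one possible reading rather than the established behaviour. It assumes that "Latin equivalents" means spelled-out letter names, that existing hyphens are removed first, and that leading or trailing hyphens are stripped at the end; the `GREEK` table is partial and the function name `normalise` is hypothetical.

```python
import re

# Assumption: Greek letters become spelled-out names; partial table only.
GREEK = {"α": "alpha", "β": "beta", "γ": "gamma", "δ": "delta", "κ": "kappa"}


def normalise(name):
    """Apply the hyphen and Greek-letter rules in one assumed order."""
    # 1. Remove existing hyphens.
    name = name.replace("-", "")
    # 2. Convert Greek letters, inserting a hyphen after each one.
    for letter, latin in GREEK.items():
        name = name.replace(letter, latin + "-")
    # 3. Add hyphens at alphabetic/numeric boundaries.
    name = re.sub(r"(?<=[A-Za-z])(?=\d)|(?<=\d)(?=[A-Za-z])", "-", name)
    # Assumption: hyphens left at either end of the name are dropped.
    return name.strip("-")


for name in ["IL-1β", "TNF-α", "HAP2"]:
    print(name, "->", normalise(name))
```

Applying the same function to both lexicon entries and text tokens would let variants such as "IL1β" and "IL-1β" meet at a common form.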
Synonyms
Synonyms shorter than six characters are searched in all upper case or with an initial upper-case letter:
- ("Change this for yeast.")
Organism-specific
Prefixes ("h") and suffixes ("p") in organism-specific rules
General
Usage of Biothesaurus, BioLexicon
Usage of euGenes (http://eugenes.org/)
Full-text searching
Manual curation lists for adding/removing names for specific bioentities, whole organisms
Case-sensitive searching of ambiguous names:
- Case-insensitive searching only for numeric names or for names longer than 5 characters
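The rule above can be expressed as a small predicate. In this sketch, "numeric names" is read as names containing at least one digit (an assumption, since purely numeric names have no case to ignore), and both function names are hypothetical.

```python
def search_is_case_insensitive(name):
    """Names containing a digit, or longer than five characters, are
    matched case-insensitively; all other names keep their case."""
    return any(ch.isdigit() for ch in name) or len(name) > 5


def name_matches(name, token):
    """Compare a lexicon name with a text token under the rule above."""
    if search_is_case_insensitive(name):
        return name.lower() == token.lower()
    return name == token


# A short, word-like name must match exactly; a longer one need not.
print(name_matches("WAS", "was"))
print(name_matches("Insulin", "insulin"))
```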
Remove the subtype specifier if the organism has only one subtype ("aminocyclase 1" becomes "aminocyclase")
Pre-search each name against all PubMed abstracts and score it according to the number of results returned, filtering out names on that basis
Stop words
- Common English words
- Protein family terms
- Non-protein molecules
- Experimental words