Text Mining Resources
From irefindex
Revision as of 14:04, 17 February 2010 by PaulBoddie (talk | contribs) (Added NLTK-related resources, Wikipedia pronoun page link.)
Some notes on open source text mining resources:
- "The Text Mining Tool Evaluation project will describe the process of text mining, identify non-proprietary software that can search blocks of text to identify reports relevant to the cancer registry, and provide information to state cancer registries regarding different tools available and a comparison of the functionality provided by each tool." Evaluation of Open Source Text Mining Tools for Cancer Surveillance (HTML version from the Google cache)
- Java Open Source NLP and Text Mining tools
- "U-Compare is an integrated text mining/natural language processing system based on the UIMA Framework." U-Compare: share and compare tools with UIMA
- "The BioNLP Unstructured Information Management Architecture (UIMA) Component Repository provides UIMA wrappers for novel and well-known 3rd-party NLP tools used in biomedical text processing, such as tokenizers, parsers, named entity taggers, and tools for evaluation." BioNLP UIMA Component Repository
- "OpenNLP is an organizational center for open source projects related to natural language processing." OpenNLP
- OpenNLP projects
- See also OpenNLP links for other resources.
- FreeLing - written in C++ with features ranging from tokenisation through to part-of-speech tagging and word sense disambiguation
- NLTK - written in Python with a wide range of natural language processing features
- The English (Porter2) stemming algorithm - also providing links to stop word lists, vocabulary
- MedPost - a part-of-speech tagger
- "The SPECIALIST LEXICON is a large syntactic lexicon of biomedical and general English" The SPECIALIST LEXICON (with relatively permissive licensing terms)
- Medical/Health Lexicon from the Intelligent Systems Lab at Claremont Graduate University is based on the GATE and SPECIALIST LEXICON resources
- Consumer Health Vocabulary Initiative provides GPL-licensed resources (albeit behind a click-through acceptance page, which is superfluous under the GPL), referenced by Health Information Text Characteristics (PMC#1839344)
- Hunspell - a spell checker with morphological analysis and stemming features
- Pronoun - the Wikipedia page about pronouns, which are potential stop words
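The idea of treating pronouns as stop words can be sketched in a few lines of Python. This is only an illustration using the standard library: the pronoun list is a small, hypothetical sample (not taken from the Wikipedia page or from any of the tools above), and the tokeniser is a simple regular expression rather than anything NLTK or FreeLing would provide.

```python
import re

# Hypothetical, deliberately incomplete pronoun list used as stop words.
PRONOUN_STOP_WORDS = frozenset({
    "i", "me", "my", "we", "us", "our", "you", "your",
    "he", "him", "his", "she", "her", "it", "its",
    "they", "them", "their", "this", "that", "these", "those",
})

# Words made of letters/digits, keeping internal hyphens together
# (so "phosducin-like" stays a single token).
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def tokenise(text):
    """Split text into simple word tokens."""
    return TOKEN_PATTERN.findall(text)


def remove_stop_words(tokens, stop_words=PRONOUN_STOP_WORDS):
    """Drop tokens whose lower-cased form is in the stop word set."""
    return [token for token in tokens if token.lower() not in stop_words]


if __name__ == "__main__":
    sentence = "It binds to its receptor, and they interact."
    print(remove_stop_words(tokenise(sentence)))
```

A real stop word list would normally be larger and tuned to the corpus; the stop word lists linked from the Porter2 page are one possible starting point.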
Notes from the Text Mining Tutorial at EBI
Links:
- Text Mining in Biomedicine/Exploitation of biomedical semantic resources
- NaCTeM's Services: KLEIO, FACTA, MEDIE, TerMine, AcroMine
- Overview of resources, BioLexicon, bio-ontologies, text-mining infrastructure (U-Compare text-mining workflows)
Text Search Resources
- UK PubMed Central provides annotation of abstracts and covers (or will eventually cover) up to 1.5 million full-text articles
- Links to the official PubMed results with links back to UK PubMed Central results (presented similarly to official PubMed Central results).
- The PMC Open Access Subset is downloadable via the PMC FTP Service, specifically as XML for Data Mining via FTP.
- Using BioMed Central's open access full-text corpus for text mining research
- CiteXplore provides literature search including (but not limited to) PubMed, without domain-specific features
- Results show PubMed records with search keywords highlighted.
- GoPubMed provides PubMed searching with Gene Ontology categorisation/filtering of search results
- Results include annotated abstracts which seem keyword-oriented, not gene-oriented, and offer interesting statistics related to publication metadata.
- Domain-specific annotations can apparently be activated by selecting items from the "what" sidebar, such as protein PDC.
- MedEvi offers sentence-oriented, interaction-oriented querying with wildcards like [disease] supported for an interaction participant
- It seems debatable whether viewing sentences in isolation is very helpful, especially in the tabular form. I tried searching for phosducin AND "phosducin-like protein" in order to retrieve a document seen in Bioscape (PubMed #12060742), and this query found it, whereas PDC AND PDCL (which employs the symbol names) did not, suggesting that the service matches on text rather than on recognised entities.
- Annotated sentences do not appear to be available from this service: links to PubMed are provided.
- EBIMed permits the inspection of results according to the co-occurrence of search terms with other features, thus supporting GoPubMed-style categorisation/filtering as well as gene/protein-related segmentation of results
- Results are initially presented using a table of "facets" such as co-occurring gene/protein, Gene Ontology categories, drugs and species, with abstracts obtainable upon selection of a particular gene/protein or co-occurring concept.
- Abstracts are annotated with domain-specific concepts.
- Protein Corral produces results in a way similar to EBIMed but focusing on interaction verbs and confidence measures
- Results show a selection of "facets" mostly related to interaction context.
- Abstracts are annotated with domain-specific concepts.
- Whatizit is a service which exposes the EBI text-mining infrastructure
- Results can mimic other services such as EBIMed (by selecting the whatizitEBIMed pipeline and by issuing "A Lucene Query" using the input).
- Abstracts can therefore be annotated with domain-specific concepts if the pipeline supports this (whatizitEBIMed does, whatizitProteinInteraction does not).
- KLEIO supports searches for domain-specific keywords (PDC versus PROTEIN:PDC), and employs named entity recognition to generate terms for indexing with Lucene
- Results are accessed via a traditional list of document extracts with selectable facets (provided through the use of Solr) for filtering (such as ORGAN permitting results where values such as liver are mentioned).
- Abstracts are annotated with domain-specific concepts.
- The Lucene results are collated with BioLexicon data.
- FACTA provides a slightly different (more Google-like) search interface for PubMed, concentrating on co-occurrences of concepts
- Results appear in a traditional list of documents, with "relevant concepts" available to filter the list of results further (similar to KLEIO, EBIMed).
- Annotated sentences do not appear to be available from this service: links to PubMed are provided.
- MEDIE offers search facilities for relations (such as interactions) in PubMed (apparently employing "several hundred cores" on each new batch of PubMed documents to make them available within around 20 minutes), supporting domain-specific concepts (gene, protein, disease) and sentence/section types from abstracts (although only 9% of abstracts in PubMed expose such metadata)
- Results are processed using the Enju parser and show the annotated abstracts with the search terms highlighted (or words in the role of an unspecified subject, verb or object proposed and highlighted).
- Annotated documents are available with search terms (or found terms in the appropriate roles) highlighted.
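Several of the services above (EBIMed, KLEIO, FACTA) present "facets" of co-occurring concepts that can be used to filter a result list. The following Python sketch shows the general idea only; the record layout, the function names and the example records are all invented for illustration and do not reflect how any of these services (or Solr) actually store or compute facets.

```python
from collections import Counter


def facet_counts(documents, facet):
    """Count how many documents mention each value of one facet."""
    counts = Counter()
    for doc in documents:
        # Using a set means a value is counted once per document.
        counts.update(set(doc.get(facet, [])))
    return counts


def filter_by_facet(documents, facet, value):
    """Keep only the documents annotated with the given facet value."""
    return [doc for doc in documents if value in doc.get(facet, [])]


# Made-up annotated records, not real search results.
documents = [
    {"id": "doc1", "protein": ["PDC", "PDCL"], "organ": ["retina"]},
    {"id": "doc2", "protein": ["PDC"], "organ": ["liver"]},
    {"id": "doc3", "protein": ["PDCL"], "organ": ["liver", "retina"]},
]

if __name__ == "__main__":
    print(facet_counts(documents, "protein").most_common())
    print([doc["id"] for doc in filter_by_facet(documents, "organ", "liver")])
```

Selecting a facet value, such as ORGAN with the value liver in KLEIO, corresponds to the filtering step here; the counts correspond to the numbers typically shown next to each facet value.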
Text Processing and Database Resources
- TerMine provides part-of-speech tagging using GENIA, term normalisation and acronym extraction/clustering on explicitly submitted text, supporting variations such as "NF kappa B", "NKfB" and "nuclear factor kappa B"
- Results show the submitted text annotated with recognised terms and acronyms.
- AcroMine is a database of acronyms found in PubMed
- Techniques employed include word sense disambiguation classifiers based on features such as neighbouring words and context.
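As a rough illustration of what acronym extraction involves, the Python sketch below looks for a parenthesised short form and checks whether the initial letters of the preceding words spell it out. This is a deliberately naive heuristic chosen for brevity; it is not the method used by TerMine or AcroMine, and it would miss variants such as "NKfB" as well as any long form whose words do not map one-to-one onto the letters of the short form.

```python
import re

# A parenthesised candidate short form: 2 to 10 characters, starting with a letter.
PAREN_SHORT_FORM = re.compile(r"\(([A-Za-z][A-Za-z0-9-]{1,9})\)")


def find_acronym_definitions(text):
    """Return (short form, long form) pairs found by initial-letter matching."""
    results = []
    for match in PAREN_SHORT_FORM.finditer(text):
        short_form = match.group(1)
        letters = [c.lower() for c in short_form if c.isalpha()]
        # Words appearing before the opening parenthesis.
        words = re.findall(r"[A-Za-z0-9]+", text[:match.start()])
        # Take as many preceding words as the short form has letters.
        candidate = words[-len(letters):]
        if len(candidate) == len(letters) and all(
            word[0].lower() == letter
            for word, letter in zip(candidate, letters)
        ):
            results.append((short_form, " ".join(candidate)))
    return results


if __name__ == "__main__":
    text = "Activation of nuclear factor kappa B (NFKB) was observed."
    print(find_acronym_definitions(text))
```

Disambiguating a short form that has several possible long forms is a separate problem, which is where classifiers over neighbouring words and context come in.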