Text Mining Resources

Some notes on open source text mining resources:

"The Text Mining Tool Evaluation project will describe the process of text mining, identify non-proprietary software that can search blocks of text to identify reports relevant to the cancer registry, and provide information to state cancer registries regarding different tools available and a comparison of the functionality provided by each tool." Evaluation of Open Source Text Mining Tools for Cancer Surveillance (HTML version from the Google cache)
"U-Compare is an integrated text mining/natural language processing system based on the UIMA Framework." U-Compare: share and compare tools with UIMA
"The BioNLP Unstructured Information Management Architecture (UIMA) Component Repository provides UIMA wrappers for novel and well-known 3rd-party NLP tools used in biomedical text processing, such as tokenizers, parsers, named entity taggers, and tools for evaluation." BioNLP UIMA Component Repository
"OpenNLP is an organizational center for open source projects related to natural language processing." OpenNLP
- OpenNLP projects
- See also OpenNLP links for other resources.
FreeLing - written in C++ with features ranging from tokenisation through to part-of-speech tagging and word sense disambiguation
NLTK - written in Python with a wide range of natural language processing features

Notes from the Text Mining Tutorial at EBI

Links:

Text Mining in Biomedicine/Exploitation of biomedical semantic resources
- NaCTeM's Services: KLEIO, FACTA, MEDIE, TerMine, Acromine
- Overview of resources, biolexicon, bio-ontologies, text mining infrastructure (U-Compare text mining workflows)

Useful resources:

UK PubMed Central provides annotation of abstracts and covers (or will eventually cover) up to 1.5 million full-text articles
- Links to the official PubMed results with links back to UK PubMed Central results (presented similarly to official PubMed Central results).
CiteXplore provides literature search including (but not limited to) PubMed, without domain-specific features
- Results show PubMed records with search keywords highlighted.
GoPubMed provides PubMed searching with Gene Ontology categorisation/filtering of search results
- Results include annotated abstracts which seem keyword-oriented, not gene-oriented, and offer interesting statistics related to publication metadata.
- Domain-specific annotations can apparently be activated by selecting items from the "what" sidebar, such as protein PDC.
MedEvi offers sentence-oriented, interaction-oriented querying with wildcards like [disease] supported for an interaction participant
- It seems debatable whether viewing sentences in isolation is very helpful, especially in the tabular form. I tried searching for phosducin AND "phosducin-like protein" in order to retrieve a document seen in Bioscape (PubMed #12060742), and this query did find it. The query PDC AND PDCL, which uses the gene symbols instead, did not find it, suggesting that the service matches the literal text of sentences rather than resolving symbols to the entities they name.
- Annotated sentences do not appear to be available from this service: links to PubMed are provided.
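
The difference between the two MedEvi queries above can be illustrated with a minimal sketch of purely textual matching. This is not MedEvi's implementation: the function name `matches_all` and the example sentence are assumptions made for illustration, and the sentence is invented rather than quoted from PubMed #12060742.

```python
import re

def matches_all(text, terms):
    """Return True if every term occurs in text as a whole word or phrase.

    Matching is literal and case-insensitive: no synonym or gene-symbol
    expansion is attempted, which is the behaviour being illustrated.
    """
    return all(
        re.search(r"(?<!\w)" + re.escape(term) + r"(?!\w)", text, re.IGNORECASE)
        is not None
        for term in terms
    )

# Invented sentence for illustration only.
sentence = "Phosducin and phosducin-like protein both bind G protein beta-gamma subunits."

# Full names appear literally in the sentence, so the AND query succeeds.
print(matches_all(sentence, ["phosducin", "phosducin-like protein"]))  # True

# The symbols never appear in the text, so the AND query fails.
print(matches_all(sentence, ["PDC", "PDCL"]))  # False
```

A service that normalised mentions to gene or protein identifiers before matching would be expected to return the sentence for both queries.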
EBIMed permits the inspection of results according to the co-occurrence of search terms with other features, thus supporting GoPubMed-style categorisation/filtering as well as gene/protein-related segmentation of results
- Results are initially presented using a table of "facets" such as co-occurring gene/protein, Gene Ontology categories, drugs and species, with abstracts obtainable upon selection of a particular gene/protein or co-occurring concept.
- Abstracts are annotated with domain-specific concepts.
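
The "facets" idea can be sketched as a co-occurrence count: for each concept in a vocabulary, count the abstracts that mention both the query term and that concept. The sketch below is an assumption about the general technique, not EBIMed's actual method; the function name `cooccurrence_facet`, the toy abstracts and the vocabulary are all hypothetical, and naive substring matching stands in for real concept recognition.

```python
from collections import Counter

def cooccurrence_facet(abstracts, query, vocabulary):
    """Count, per concept, the abstracts mentioning both the query and the concept.

    Uses case-insensitive substring matching, which is a deliberate
    simplification of dictionary-based concept annotation.
    """
    query = query.lower()
    counts = Counter()
    for text in abstracts:
        lowered = text.lower()
        if query not in lowered:
            continue  # only abstracts that match the query contribute
        for concept in vocabulary:
            if concept.lower() in lowered:
                counts[concept] += 1
    return counts

# Toy abstracts written for this example; not real PubMed records.
abstracts = [
    "Phosducin regulates transducin signalling in the retina.",
    "Phosducin is phosphorylated by protein kinase A in the retina.",
    "Transducin activation follows photon capture by rhodopsin.",
]
vocabulary = ["transducin", "retina", "rhodopsin"]  # hypothetical concept list

facet = cooccurrence_facet(abstracts, "phosducin", vocabulary)
print(facet.most_common())  # [('retina', 2), ('transducin', 1)]
```

Selecting a row of such a table would then correspond to listing the abstracts that contributed to that count.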
Protein Corral produces results in a way similar to EBIMed but focusing on interaction verbs and confidence measures
- Results show a selection of "facets" mostly related to interaction context.
- Abstracts are annotated with domain-specific concepts.
Whatizit is a service which exposes the EBI text-mining infrastructure
- Results can mimic those of other services such as EBIMed (by selecting the whatizitEBIMed pipeline and supplying the input as a Lucene query).
- Abstracts can therefore be annotated with domain-specific concepts if the pipeline supports this (whatizitEBIMed does, whatizitProteinInteraction does not).
