Difference between revisions of "Text Mining Resources"

Revision as of 15:37, 15 October 2009

Some notes on open source text mining resources:

"The Text Mining Tool Evaluation project will describe the process of text mining, identify non-proprietary software that can search blocks of text to identify reports relevant to the cancer registry, and provide information to state cancer registries regarding different tools available and a comparison of the functionality provided by each tool." Evaluation of Open Source Text Mining Tools for Cancer Surveillance (HTML version from the Google cache)
"U-Compare is an integrated text mining/natural language processing system based on the UIMA Framework." U-Compare: share and compare tools with UIMA
"The BioNLP Unstructured Information Management Architecture (UIMA) Component Repository provides UIMA wrappers for novel and well-known 3rd-party NLP tools used in biomedical text processing, such as tokenizers, parsers, named entity taggers, and tools for evaluation." BioNLP UIMA Component Repository
"OpenNLP is an organizational center for open source projects related to natural language processing." OpenNLP
- OpenNLP projects
- See also OpenNLP links for other resources.
FreeLing - written in C++ with features from tokenisation through to part-of-speech tagging, word sense disambiguation
NLTK - written in Python with a wide range of natural language processing features

Notes from the Text Mining Tutorial at EBI

Links:

Text Mining in Biomedicine/Exploitation of biomedical semantic resources
- NaCTeM's Services: KLEIO, FACTA, MEDIE, TerMine, AcroMine
- Overview of resources, BioLexicon, bio-ontologies, text-mining infrastructure (U-Compare text-mining workflows)

Text Search Resources

UK PubMed Central provides annotation of abstracts and covers (or will eventually cover) up to 1.5 million full-text articles
- Links to the official PubMed results with links back to UK PubMed Central results (presented similarly to official PubMed Central results).
CiteXplore provides literature search including (but not limited to) PubMed, without domain-specific features
- Results show PubMed records with search keywords highlighted.
GoPubMed provides PubMed searching with Gene Ontology categorisation/filtering of search results
- Results include annotated abstracts which seem keyword-oriented, not gene-oriented, and offer interesting statistics related to publication metadata.
- Domain-specific annotations can apparently be activated by selecting items from the "what" sidebar, such as protein PDC.
MedEvi offers sentence-oriented, interaction-oriented querying with wildcards like [disease] supported for an interaction participant
- It seems debatable whether viewing sentences in isolation is very helpful, especially in tabular form. Searching for phosducin AND "phosducin-like protein" retrieved a document seen in Bioscape (PubMed #12060742), whereas PDC AND PDCL (which uses the symbol names) did not, suggesting that the service matches on the literal text rather than on the genes the symbols denote.
- Annotated sentences do not appear to be available from this service: links to PubMed are provided.
EBIMed permits the inspection of results according to the co-occurrence of search terms with other features, thus supporting GoPubMed-style categorisation/filtering as well as gene/protein-related segmentation of results
- Results are initially presented using a table of "facets" such as co-occurring gene/protein, Gene Ontology categories, drugs and species, with abstracts obtainable upon selection of a particular gene/protein or co-occurring concept.
- Abstracts are annotated with domain-specific concepts.
Protein Corral produces results in a way similar to EBIMed but focusing on interaction verbs and confidence measures
- Results show a selection of "facets" mostly related to interaction context.
- Abstracts are annotated with domain-specific concepts.
Whatizit is a service which exposes the EBI text-mining infrastructure
- Results can mimic other services such as EBIMed (by selecting the whatizitEBIMed pipeline and by issuing A Lucene Query using the input).
- Abstracts can therefore be annotated with domain-specific concepts if the pipeline supports this (whatizitEBIMed does, whatizitProteinInteraction does not).
KLEIO supports searches for domain-specific keywords (PDC versus PROTEIN:PDC), and employs named entity recognition to generate terms for indexing with Lucene
- Results are accessed via a traditional list of document extracts with selectable facets (provided through the use of Solr) for filtering (such as ORGAN permitting results where values such as liver are mentioned).
- Abstracts are annotated with domain-specific concepts.
- The Lucene results are collated with BioLexicon data.
FACTA provides a slightly different (more Google-like) search interface for PubMed, concentrating on co-occurrences of concepts
- Results appear in a traditional list of documents, with "relevant concepts" available to filter the list of results further (similar to KLEIO, EBIMed).
- Annotated sentences do not appear to be available from this service: links to PubMed are provided.
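
Several of the services above contrast plain keyword search with concept-qualified search (PDC versus PROTEIN:PDC in KLEIO) and then narrow the results by facet (ORGAN, co-occurring concepts). The following is a minimal sketch of that idea over an in-memory list, not a reproduction of how KLEIO, EBIMed or FACTA are implemented. The documents, their annotations and the function names (parse_query, search, facet_counts, filter_by_facet) are all hypothetical and exist only for illustration.

```python
import re
from collections import Counter

# Hypothetical annotated abstracts. "concepts" stands in for whatever a
# named entity recogniser would attach, stored as (facet, value) pairs.
DOCS = [
    {"id": "doc1",
     "text": "Phosducin (PDC) modulates G protein signalling in the retina.",
     "concepts": {("PROTEIN", "PDC"), ("ORGAN", "retina")}},
    {"id": "doc2",
     "text": "PDC expression was also examined in liver tissue.",
     "concepts": {("PROTEIN", "PDC"), ("ORGAN", "liver")}},
    {"id": "doc3",
     "text": "PDC can also abbreviate terms that have nothing to do with proteins.",
     "concepts": set()},
]

def parse_query(query):
    """Split 'FACET:value' into (facet, value); a bare keyword has facet None."""
    facet, sep, value = query.partition(":")
    return (facet, value) if sep else (None, query)

def search(query, docs=DOCS):
    """Keyword queries match tokens in the text; typed queries match annotations."""
    facet, value = parse_query(query)
    if facet is None:
        needle = value.lower()
        return [d for d in docs if needle in re.findall(r"\w+", d["text"].lower())]
    return [d for d in docs if (facet, value) in d["concepts"]]

def facet_counts(docs, facet):
    """Count the values of one facet across a result set."""
    return Counter(v for d in docs for f, v in d["concepts"] if f == facet)

def filter_by_facet(docs, facet, value):
    """Keep only the results annotated with the given facet value."""
    return [d for d in docs if (facet, value) in d["concepts"]]

if __name__ == "__main__":
    typed = search("PROTEIN:PDC")
    print([d["id"] for d in search("PDC")])   # keyword match on the text
    print([d["id"] for d in typed])           # match on the annotation
    print(dict(facet_counts(typed, "ORGAN")))
    print([d["id"] for d in filter_by_facet(typed, "ORGAN", "liver")])
```

In this toy data the bare keyword matches every document whose text contains the string, while the typed query returns only documents carrying the protein annotation, which is the distinction the notes draw between textual and domain-specific searching.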

Text Processing and Database Resources

TerMine provides part-of-speech tagging using GENIA, term normalisation and acronym extraction/clustering on explicitly submitted text, supporting variations such as "NF kappa B", "NFkB" and "nuclear factor kappa B"
- Results show the submitted text annotated with recognised terms and acronyms.
AcroMine is a database of acronyms found in PubMed
- Techniques employed include word sense disambiguation classifiers based on features such as neighbouring words and context.
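
To make the idea of disambiguating an acronym from its neighbouring words concrete, here is a minimal sketch. It is a simple cue-word overlap scorer, not a trained classifier and not AcroMine's actual method. The sense inventory, the cue words and the function names (context_words, disambiguate) are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical sense inventory: each long form of an acronym is paired with
# cue words that might appear near it. A real system would learn such
# features from annotated text rather than list them by hand.
SENSES = {
    "PDC": {
        "phosducin": {"retina", "photoreceptor", "transducin", "signalling"},
        "pyruvate dehydrogenase complex": {"mitochondria", "acetyl", "glycolysis", "metabolism"},
    },
}

def context_words(text, acronym, window=5):
    """Collect lower-cased tokens within `window` positions of each acronym occurrence."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    lowered = [t.lower() for t in tokens]
    words = set()
    for i, token in enumerate(tokens):
        if token == acronym:
            words.update(lowered[max(0, i - window):i])
            words.update(lowered[i + 1:i + 1 + window])
    return words

def disambiguate(text, acronym, window=5):
    """Return the sense whose cue words overlap most with the context, or None."""
    context = context_words(text, acronym, window)
    scores = {sense: len(cues & context) for sense, cues in SENSES[acronym].items()}
    best = max(scores, key=scores.get)  # ties go to the first sense listed
    return best if scores[best] > 0 else None

if __name__ == "__main__":
    print(disambiguate("PDC is expressed in the retina and binds transducin subunits.", "PDC"))
    print(disambiguate("Reduced PDC activity impairs metabolism in mitochondria.", "PDC"))
```

The window size and the decision to return None when no cue word is found are design choices of this sketch. A classifier of the kind the notes mention would weight such features instead of counting them equally.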

@@ Line 15: / Line 15: @@
 * [http://nactem.ac.uk/talk_slides/trainingEBI_final.pdf Text Mining in Biomedicine/Exploitation of biomedical semantic resources]
-** NaCTeM's Services: KLEIO, FACTA, MEDIE, TerMine, Acromine
+** NaCTeM's Services: KLEIO, FACTA, MEDIE, TerMine, AcroMine
-** Overview of resources, biolexicon, bio-ontologies, text mining infrastructure (U-Compare text mining workflows)
+** Overview of resources, BioLexicon, bio-ontologies, text-mining infrastructure (U-Compare text-mining workflows)
-Useful resources:
+=== Text Search Resources ===
 * [http://ukpmc.ac.uk/ UK PubMed Central] provides annotation of abstracts, covers (or will eventually cover) up to 1.5 million full-text articles
@@ Line 39: / Line 39: @@
 ** Results can mimic other services such as EBIMed (by selecting the <tt>whatizitEBIMed</tt> pipeline and by issuing <tt>A Lucene Query</tt> using the input).
 ** Abstracts can therefore be annotated with domain-specific concepts if the pipeline supports this (<tt>whatizitEBIMed</tt> does, <tt>whatizitProteinInteraction</tt> does not).
+* [http://www.nactem.ac.uk/software/kleio/ KLEIO] supports searches for domain-specific keywords (<tt>PDC</tt> versus <tt>PROTEIN:PDC</tt>), and employs named entity recognition to generate terms for indexing with Lucene
+** Results are accessed via a traditional list of document extracts with selectable facets (provided through the use of Solr) for filtering (such as <tt>ORGAN</tt> permitting results where values such as <tt>liver</tt> are mentioned).
+** Abstracts are annotated with domain-specific concepts.
+** The Lucene results are collated with BioLexicon data.
+* [http://text0.mib.man.ac.uk/software/facta/ FACTA] provides a slightly different (more Google-like) search interface for PubMed, concentrating on co-occurrences of concepts
+** Results appear in a traditional list of documents, with "relevant concepts" available to filter the list of results further (similar to KLEIO, EBIMed).
+** Annotated sentences do not appear to be available from this service: links to PubMed are provided.
+=== Text Processing and Database Resources ===
+* [http://www.nactem.ac.uk/software/termine/ TerMine] provides part-of-speech tagging using GENIA, term normalisation, acronym extraction/clustering, supporting variations such as ("NF kappa B", "NKfB", "nuclear factor kappa B") on explicitly submitted text
+** Results show the submitted text annotated with recognised terms and acronyms.
+* [http://www.nactem.ac.uk/software/acromine/ AcroMine] is a database of acronyms found in PubMed
+** Techniques employed include word sense disambiguation classifiers based on features such as neighbouring word and context.
