Difference between revisions of "Text Mining Resources"

From irefindex
(Added FreeLing and NLTK.)
(Initial notes.)
Line 9:
* [http://www.lsi.upc.es/~nlp/freeling/ FreeLing] - ''written in C++ with features from tokenisation through to part-of-speech tagging, word sense disambiguation''
* [http://www.nltk.org/ NLTK] - ''written in Python with a wide range of natural language processing features''

== Notes from the Text Mining Tutorial at EBI ==

Links:

* [http://nactem.ac.uk/talk_slides/trainingEBI_final.pdf Text Mining in Biomedicine/Exploitation of biomedical semantic resources]
** NaCTeM's Services: KLEIO, FACTA, MEDIE, TerMine, Acromine
** Overview of resources, biolexicon, bio-ontologies, text mining infrastructure (U-Compare text mining workflows)

Useful resources:

* [http://ukpmc.ac.uk/ UK PubMed Central] provides annotation of abstracts and covers (or will eventually cover) up to 1.5 million full-text articles
** Links to the official PubMed results with links back to UK PubMed Central results (presented similarly to official PubMed Central results).
* [http://www.ebi.ac.uk/citexplore/ CiteXplore] provides literature search including (but not limited to) PubMed, without domain-specific features
** Results show PubMed records with search keywords highlighted.
* [http://www.gopubmed.com/ GoPubMed] provides PubMed searching with Gene Ontology categorisation/filtering of search results
** Results include annotated abstracts which seem keyword-oriented, not gene-oriented, and offer interesting statistics related to publication metadata.
** Domain-specific annotations can apparently be activated by selecting items from the "what" sidebar, such as protein <tt>PDC</tt>.
* [http://www.ebi.ac.uk/Rebholz-srv/MedEvi/ MedEvi] offers sentence-oriented, interaction-oriented querying with wildcards like <tt>[disease]</tt> supported for an interaction participant
** It seems debatable whether viewing sentences in isolation is very helpful, especially in the tabular form. I tried searching for <tt>phosducin AND "phosducin-like protein"</tt> in order to retrieve a document seen in Bioscape (PubMed #12060742), and this query did find it, whereas <tt>PDC AND PDCL</tt> (which employs the symbol names) did not, suggesting that the service matches on the literal text rather than on gene identity.
** Annotated sentences do not appear to be available from this service: links to PubMed are provided.
* [http://www.ebi.ac.uk/Rebholz-srv/ebimed/ EBIMed] permits the inspection of results according to the co-occurrence of search terms with other features, thus supporting GoPubMed-style categorisation/filtering as well as gene/protein-related segmentation of results
** Results are initially presented using a table of "facets" such as co-occurring gene/protein, Gene Ontology categories, drugs and species, with abstracts obtainable upon selection of a particular gene/protein or co-occurring concept.
** Abstracts are annotated with domain-specific concepts.
* [http://www.ebi.ac.uk/Rebholz-srv/pcorral/ Protein Corral] produces results in a way similar to EBIMed but focusing on interaction verbs and confidence measures
** Results show a selection of "facets" mostly related to interaction context.
** Abstracts are annotated with domain-specific concepts.
* [http://www.ebi.ac.uk/webservices/whatizit/ Whatizit] is a service which exposes the EBI text-mining infrastructure
** Results can mimic other services such as EBIMed (by selecting the <tt>whatizitEBIMed</tt> pipeline and by issuing <tt>A Lucene Query</tt> using the input).
** Abstracts can therefore be annotated with domain-specific concepts if the pipeline supports this (<tt>whatizitEBIMed</tt> does, <tt>whatizitProteinInteraction</tt> does not).

Revision as of 12:51, 15 October 2009

Some notes on open source text mining resources:

  • "The Text Mining Tool Evaluation project will describe the process of text mining, identify non-proprietary software that can search blocks of text to identify reports relevant to the cancer registry, and provide information to state cancer registries regarding different tools available and a comparison of the functionality provided by each tool." Evaluation of Open Source Text Mining Tools for Cancer Surveillance (HTML version from the Google cache)
  • "U-Compare is an integrated text mining/natural language processing system based on the UIMA Framework." U-Compare: share and compare tools with UIMA
  • "The BioNLP Unstructured Information Management Architecture (UIMA) Component Repository provides UIMA wrappers for novel and well-known 3rd-party NLP tools used in biomedical text processing, such as tokenizers, parsers, named entity taggers, and tools for evaluation." BioNLP UIMA Component Repository
  • "OpenNLP is an organizational center for open source projects related to natural language processing." OpenNLP
  • FreeLing - written in C++ with features from tokenisation through to part-of-speech tagging, word sense disambiguation
  • NLTK - written in Python with a wide range of natural language processing features

Notes from the Text Mining Tutorial at EBI

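Links:

  • Text Mining in Biomedicine/Exploitation of biomedical semantic resources
    • NaCTeM's Services: KLEIO, FACTA, MEDIE, TerMine, Acromine
    • Overview of resources, biolexicon, bio-ontologies, text mining infrastructure (U-Compare text mining workflows)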

Useful resources:

  • UK PubMed Central provides annotation of abstracts and covers (or will eventually cover) up to 1.5 million full-text articles
    • Links to the official PubMed results with links back to UK PubMed Central results (presented similarly to official PubMed Central results).
  • CiteXplore provides literature search including (but not limited to) PubMed, without domain-specific features
    • Results show PubMed records with search keywords highlighted.
  • GoPubMed provides PubMed searching with Gene Ontology categorisation/filtering of search results
    • Results include annotated abstracts which seem keyword-oriented, not gene-oriented, and offer interesting statistics related to publication metadata.
    • Domain-specific annotations can apparently be activated by selecting items from the "what" sidebar, such as protein PDC.
  • MedEvi offers sentence-oriented, interaction-oriented querying with wildcards like [disease] supported for an interaction participant
    • It seems debatable whether viewing sentences in isolation is very helpful, especially in the tabular form. I tried searching for phosducin AND "phosducin-like protein" in order to retrieve a document seen in Bioscape (PubMed #12060742), and this query did find it, whereas PDC AND PDCL (which employs the symbol names) did not, suggesting that the service matches on the literal text rather than on gene identity.
    • Annotated sentences do not appear to be available from this service: links to PubMed are provided.
  • EBIMed permits the inspection of results according to the co-occurrence of search terms with other features, thus supporting GoPubMed-style categorisation/filtering as well as gene/protein-related segmentation of results
    • Results are initially presented using a table of "facets" such as co-occurring gene/protein, Gene Ontology categories, drugs and species, with abstracts obtainable upon selection of a particular gene/protein or co-occurring concept.
    • Abstracts are annotated with domain-specific concepts.
  • Protein Corral produces results in a way similar to EBIMed but focusing on interaction verbs and confidence measures
    • Results show a selection of "facets" mostly related to interaction context.
    • Abstracts are annotated with domain-specific concepts.
  • Whatizit is a service which exposes the EBI text-mining infrastructure
    • Results can mimic other services such as EBIMed (by selecting the whatizitEBIMed pipeline and by issuing A Lucene Query using the input).
    • Abstracts can therefore be annotated with domain-specific concepts if the pipeline supports this (whatizitEBIMed does, whatizitProteinInteraction does not).
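
The MedEvi observation above (names found, symbols not found) is what one would expect from purely literal matching. The following is a minimal sketch of that behaviour, not a description of how MedEvi is actually implemented: the example sentence, the SYNONYMS table and both function names are hypothetical and invented for illustration only.

```python
import re

def matches_all(text, terms):
    """Return True if every term occurs in text as a literal,
    case-insensitive, whole-word phrase (a plain boolean AND)."""
    lowered = text.lower()
    return all(
        re.search(r"(?<!\w)" + re.escape(term.lower()) + r"(?!\w)", lowered)
        is not None
        for term in terms
    )

# Hypothetical synonym table mapping gene symbols to names (an assumption,
# standing in for whatever lexicon a gene-aware service might use).
SYNONYMS = {
    "PDC": ["phosducin"],
    "PDCL": ["phosducin-like protein"],
}

def matches_with_synonyms(text, terms):
    """Like matches_all, but a term also counts as present if any of its
    known synonyms occurs in the text."""
    return all(
        any(matches_all(text, [variant])
            for variant in [term] + SYNONYMS.get(term, []))
        for term in terms
    )

# Invented sentence for illustration; not quoted from PubMed #12060742.
sentence = ("Phosducin and phosducin-like protein both bind "
            "G protein beta-gamma subunits.")

print(matches_all(sentence, ["phosducin", "phosducin-like protein"]))  # True
print(matches_all(sentence, ["PDC", "PDCL"]))                          # False
print(matches_with_synonyms(sentence, ["PDC", "PDCL"]))                # True
```

A service that expands symbols through a lexicon, as in the second function, would retrieve the sentence for either form of the query; the fact that the symbol query failed in practice is what points to text-level matching.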