Difference between revisions of "Text Mining Resources"
From irefindex
PaulBoddie (talk | contribs) (Initial notes.)
PaulBoddie (talk | contribs) (→Notes from the Text Mining Tutorial at EBI: Added more resources with separate lists for different resource types.)
Revision as of 15:37, 15 October 2009
Some notes on open source text mining resources:
- "The Text Mining Tool Evaluation project will describe the process of text mining, identify non-proprietary software that can search blocks of text to identify reports relevant to the cancer registry, and provide information to state cancer registries regarding different tools available and a comparison of the functionality provided by each tool." Evaluation of Open Source Text Mining Tools for Cancer Surveillance (HTML version from the Google cache)
- "U-Compare is an integrated text mining/natural language processing system based on the UIMA Framework." U-Compare: share and compare tools with UIMA
- "The BioNLP Unstructured Information Management Architecture (UIMA) Component Repository provides UIMA wrappers for novel and well-known 3rd-party NLP tools used in biomedical text processing, such as tokenizers, parsers, named entity taggers, and tools for evaluation." BioNLP UIMA Component Repository
- "OpenNLP is an organizational center for open source projects related to natural language processing." OpenNLP
- OpenNLP projects
- See also OpenNLP links for other resources.
- FreeLing - written in C++ with features ranging from tokenisation through to part-of-speech tagging and word sense disambiguation
- NLTK - written in Python with a wide range of natural language processing features
Notes from the Text Mining Tutorial at EBI
Links:
- Text Mining in Biomedicine/Exploitation of biomedical semantic resources (http://nactem.ac.uk/talk_slides/trainingEBI_final.pdf)
  - NaCTeM's Services: KLEIO, FACTA, MEDIE, TerMine, AcroMine
  - Overview of resources, BioLexicon, bio-ontologies, text-mining infrastructure (U-Compare text-mining workflows)
Text Search Resources
- UK PubMed Central (http://ukpmc.ac.uk/) provides annotation of abstracts and covers (or will eventually cover) up to 1.5 million full-text articles
  - Links to the official PubMed results with links back to UK PubMed Central results (presented similarly to official PubMed Central results).
- CiteXplore provides literature search including (but not limited to) PubMed, without domain-specific features
  - Results show PubMed records with search keywords highlighted.
- GoPubMed provides PubMed searching with Gene Ontology categorisation/filtering of search results
  - Results include annotated abstracts which seem keyword-oriented, not gene-oriented, and offer interesting statistics related to publication metadata.
  - Domain-specific annotations can apparently be activated by selecting items from the "what" sidebar, such as protein PDC.
- MedEvi offers sentence-oriented, interaction-oriented querying, with wildcards like [disease] supported for an interaction participant
  - It seems debatable whether viewing sentences in isolation is very helpful, especially in the tabular form. I tried searching for phosducin AND "phosducin-like protein" in order to retrieve a document seen in Bioscape (PubMed #12060742), and this query did find it, whereas PDC AND PDCL (which uses the symbol names) did not, suggesting that the service matches on the text itself rather than on recognised gene/protein symbols.
  - Annotated sentences do not appear to be available from this service: links to PubMed are provided.
- EBIMed permits the inspection of results according to the co-occurrence of search terms with other features, thus supporting GoPubMed-style categorisation/filtering as well as gene/protein-related segmentation of results
  - Results are initially presented using a table of "facets" such as co-occurring gene/protein, Gene Ontology categories, drugs and species, with abstracts obtainable upon selection of a particular gene/protein or co-occurring concept.
  - Abstracts are annotated with domain-specific concepts.
- Protein Corral produces results in a way similar to EBIMed but focusing on interaction verbs and confidence measures
  - Results show a selection of "facets" mostly related to interaction context.
  - Abstracts are annotated with domain-specific concepts.
- Whatizit is a service which exposes the EBI text-mining infrastructure
  - Results can mimic other services such as EBIMed (by selecting the whatizitEBIMed pipeline and by issuing "A Lucene Query" using the input).
  - Abstracts can therefore be annotated with domain-specific concepts if the pipeline supports this (whatizitEBIMed does, whatizitProteinInteraction does not).
- KLEIO (http://www.nactem.ac.uk/software/kleio/) supports searches for domain-specific keywords (PDC versus PROTEIN:PDC), and employs named entity recognition to generate terms for indexing with Lucene
  - Results are accessed via a traditional list of document extracts with selectable facets (provided through the use of Solr) for filtering (such as ORGAN permitting results where values such as liver are mentioned).
  - Abstracts are annotated with domain-specific concepts.
  - The Lucene results are collated with BioLexicon data.
- FACTA (http://text0.mib.man.ac.uk/software/facta/) provides a slightly different (more Google-like) search interface for PubMed, concentrating on co-occurrences of concepts
  - Results appear in a traditional list of documents, with "relevant concepts" available to filter the list of results further (similar to KLEIO, EBIMed).
  - Annotated sentences do not appear to be available from this service: links to PubMed are provided.
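Several of the services above (EBIMed, KLEIO, FACTA) present co-occurring concepts as facets that can be used to narrow a result list. The following is only a minimal sketch of that idea in plain Python, not a description of how any of these services is implemented. The toy corpus, the document identifiers and the function names are all hypothetical; only the PROTEIN:PDC and ORGAN/liver labels are taken from the notes above.

```python
from collections import Counter

# Hypothetical toy corpus: each document carries concepts that a named
# entity recogniser is assumed to have produced already.
DOCS = [
    {"id": "doc1", "concepts": {"PROTEIN:PDC", "PROTEIN:PDCL", "ORGAN:retina"}},
    {"id": "doc2", "concepts": {"PROTEIN:PDC", "ORGAN:liver"}},
    {"id": "doc3", "concepts": {"PROTEIN:PDCL", "ORGAN:liver"}},
]


def search(docs, concept):
    """Return the documents annotated with the given typed concept."""
    return [d for d in docs if concept in d["concepts"]]


def facets(results, query):
    """Count the concepts that co-occur with the query in the results."""
    counts = Counter()
    for doc in results:
        for concept in doc["concepts"]:
            if concept != query:
                kind, _, value = concept.partition(":")
                counts[(kind, value)] += 1
    return counts


def filter_by(results, concept):
    """Narrow an existing result list to documents mentioning the concept."""
    return [d for d in results if concept in d["concepts"]]


hits = search(DOCS, "PROTEIN:PDC")
facet_counts = facets(hits, "PROTEIN:PDC")
liver_hits = filter_by(hits, "ORGAN:liver")
```

Here facet_counts plays the role of the facet table, and filter_by corresponds to selecting one facet value (such as liver under ORGAN) to restrict the list.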
Text Processing and Database Resources
- TerMine (http://www.nactem.ac.uk/software/termine/) provides part-of-speech tagging using GENIA, term normalisation and acronym extraction/clustering on explicitly submitted text, supporting variations such as "NF kappa B", "NFkB" and "nuclear factor kappa B"
  - Results show the submitted text annotated with recognised terms and acronyms.
- AcroMine (http://www.nactem.ac.uk/software/acromine/) is a database of acronyms found in PubMed
  - Techniques employed include word sense disambiguation classifiers based on features such as neighbouring words and context.
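To illustrate what clustering term variations can involve, here is a rough sketch that groups spellings such as "NF kappa B" and "nuclear factor kappa B" under one key. It is an assumption-laden toy, not TerMine's or AcroMine's actual method: the Greek-letter table, the length threshold, the function names and the extra spelling "NF-kB" are all introduced here purely for illustration.

```python
import re
from collections import defaultdict

# Illustrative mapping only; a real system would need a far larger table.
GREEK = {"alpha": "a", "beta": "b", "kappa": "k"}


def tokens(term):
    """Lower-case the term, split on non-alphanumerics, shorten Greek names."""
    return [GREEK.get(t, t) for t in re.findall(r"[a-z0-9]+", term.lower())]


def variant_key(term, max_short_len=6):
    """Map a term to a crude clustering key.

    Short forms are joined as they are; multi-word long forms are reduced
    to their initials. The threshold is an arbitrary assumption.
    """
    toks = tokens(term)
    joined = "".join(toks)
    if len(toks) > 1 and len(joined) > max_short_len:
        return "".join(t[0] for t in toks)
    return joined


def cluster(terms):
    """Group terms that share the same clustering key."""
    clusters = defaultdict(list)
    for term in terms:
        clusters[variant_key(term)].append(term)
    return dict(clusters)


groups = cluster(["NF kappa B", "NF-kB", "nuclear factor kappa B", "PDC"])
```

A heuristic like this will readily merge unrelated terms that happen to share initials, which is one reason context-based disambiguation of the kind mentioned for AcroMine is needed in practice.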