Difference between revisions of "Text Mining Resources"

From irefindex
(Added NLTK-related resources, Wikipedia pronoun page link.)
 
 
* [http://www.lsi.upc.es/~nlp/freeling/ FreeLing] - ''written in C++ with features from tokenisation through to part-of-speech tagging, word sense disambiguation''
 
* [http://www.nltk.org/ NLTK] - ''written in Python with a wide range of natural language processing features''
 
** [http://nltk.googlecode.com/svn/trunk/doc/book/ch05.html 5 Categorizing and Tagging Words]
** [http://nltk.googlecode.com/svn/trunk/doc/howto/tag.html Taggers]
 
* [http://snowball.tartarus.org/algorithms/english/stemmer.html The English (Porter2) stemming algorithm] - ''also providing links to stop word lists, vocabulary''
 
* [ftp://ftp.ncbi.nlm.nih.gov/pub/lsmith/MedPost/ MedPost] - ''a part-of-speech tagger''
 
* "The SPECIALIST LEXICON is a large syntactic lexicon of biomedical and general English" [http://lexsrv3.nlm.nih.gov/LexSysGroup/Projects/lexicon/current/index.html The SPECIALIST LEXICON] (with relatively permissive [http://lexsrv3.nlm.nih.gov/LexSysGroup/Projects/lexicon/current/index.html licensing terms])
 
* [http://isl.cgu.edu/Resources/CGU_ISL_Health_Lexicon.txt Medical/Health Lexicon] from the [http://isl.cgu.edu/ Intelligent Systems Lab at Claremont Graduate University] is based on the [http://gate.ac.uk/ GATE] and SPECIALIST LEXICON resources
 
* [http://consumerhealthvocab.org/ Consumer Health Vocabulary Initiative] provides GPL-licensed [http://consumerhealthvocab.org/chvfiles.php resources] (albeit behind a click-through acceptance page which is superfluous with the GPL), referenced by [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1839344/ Health Information Text Characteristics (PMC#1839344)]
* [http://hunspell.sourceforge.net/ Hunspell] - ''a spell checker with morphological analysis features, stemming''
* [http://en.wikipedia.org/wiki/Pronoun Pronoun] - ''the Wikipedia page about pronouns which could be potential stop words''
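
The idea of treating pronouns as potential stop words can be illustrated with a minimal sketch. It uses only the Python standard library rather than NLTK, and the pronoun set and the <code>content_words</code> helper are hypothetical names chosen for illustration, not part of any resource listed above.

```python
import re

# Assumption: a small hand-picked pronoun list; a real list would be longer
# and could be compiled from the Wikipedia page linked above.
PRONOUN_STOP_WORDS = {
    "i", "you", "he", "she", "it", "we", "they",
    "me", "him", "her", "us", "them",
    "this", "that", "these", "those", "who", "which",
}


def content_words(text, stop_words=PRONOUN_STOP_WORDS):
    """Return the tokens of text that are not in the stop word set."""
    # Tokens start with a letter and may contain digits or hyphens,
    # which keeps symbols such as "PDC" or "phosducin-like" intact.
    tokens = re.findall(r"[A-Za-z][A-Za-z0-9-]*", text)
    return [t for t in tokens if t.lower() not in stop_words]


print(content_words("We found that it binds PDC"))
```

Matching is case-insensitive so that sentence-initial pronouns are removed too, while the original casing of the remaining tokens is preserved for later symbol lookup.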
  
 
Latest revision as of 14:04, 17 February 2010

Some notes on open source text mining resources:

Notes from the Text Mining Tutorial at EBI

Links:

Text Search Resources

  • UK PubMed Central provides annotation of abstracts and covers (or will eventually cover) up to 1.5 million full-text articles
  • Using BioMed Central's open access full-text corpus for text mining research
  • CiteXplore provides literature search including (but not limited to) PubMed, without domain-specific features
    • Results show PubMed records with search keywords highlighted.
  • GoPubMed provides PubMed searching with Gene Ontology categorisation/filtering of search results
    • Results include annotated abstracts which seem keyword-oriented, not gene-oriented, and offer interesting statistics related to publication metadata.
    • Domain-specific annotations can apparently be activated by selecting items from the "what" sidebar, such as protein PDC.
  • MedEvi offers sentence-oriented, interaction-oriented querying with wildcards like [disease] supported for an interaction participant
    • It seems debatable whether viewing sentences in isolation is very helpful, especially in tabular form. I tried searching for phosducin AND "phosducin-like protein" in order to retrieve a document seen in Bioscape (PubMed #12060742), and this query did find it, whereas PDC AND PDCL (which uses the symbol names) did not, suggesting that the service matches the literal text rather than resolving gene symbols.
    • Annotated sentences do not appear to be available from this service: links to PubMed are provided.
  • EBIMed permits the inspection of results according to the co-occurrence of search terms with other features, thus supporting GoPubMed-style categorisation/filtering as well as gene/protein-related segmentation of results
    • Results are initially presented using a table of "facets" such as co-occurring gene/protein, Gene Ontology categories, drugs and species, with abstracts obtainable upon selection of a particular gene/protein or co-occurring concept.
    • Abstracts are annotated with domain-specific concepts.
  • Protein Corral produces results in a way similar to EBIMed but focusing on interaction verbs and confidence measures
    • Results show a selection of "facets" mostly related to interaction context.
    • Abstracts are annotated with domain-specific concepts.
  • Whatizit is a service which exposes the EBI text-mining infrastructure
    • Results can mimic other services such as EBIMed (by selecting the whatizitEBIMed pipeline and by issuing A Lucene Query using the input).
    • Abstracts can therefore be annotated with domain-specific concepts if the pipeline supports this (whatizitEBIMed does, whatizitProteinInteraction does not).
  • KLEIO supports searches for domain-specific keywords (PDC versus PROTEIN:PDC), and employs named entity recognition to generate terms for indexing with Lucene
    • Results are accessed via a traditional list of document extracts with selectable facets (provided through the use of Solr) for filtering (such as ORGAN permitting results where values such as liver are mentioned).
    • Abstracts are annotated with domain-specific concepts.
    • The Lucene results are collated with BioLexicon data.
  • FACTA provides a slightly different (more Google-like) search interface for PubMed, concentrating on co-occurrences of concepts
    • Results appear in a traditional list of documents, with "relevant concepts" available to filter the list of results further (similar to KLEIO, EBIMed).
    • Annotated sentences do not appear to be available from this service: links to PubMed are provided.
  • MEDIE offers search facilities for relations (such as interactions) in PubMed (apparently employing "several hundred cores" on each new batch of PubMed documents to make them available within around 20 minutes), supporting domain-specific concepts (gene, protein, disease) and sentence/section types from abstracts (although only 9% of abstracts in PubMed expose such metadata)
    • Results are processed using the Enju parser and show the annotated abstracts with the search terms highlighted (or words in the role of an unspecified subject, verb or object proposed and highlighted).
    • Annotated documents are available with search terms (or found terms in the appropriate roles) highlighted.
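
Several of the services above (EBIMed, KLEIO, FACTA) present "facets" built from concepts that co-occur with the search terms. The following is a minimal sketch of that idea only, not a description of how any of these services is implemented. The facet dictionary, the toy abstracts and the `facet_counts` function are all hypothetical and invented for illustration.

```python
import re
from collections import Counter, defaultdict

# Assumption: a tiny hand-made concept dictionary grouped by facet.
# Real services use large curated vocabularies (for example Gene Ontology).
FACETS = {
    "protein": {"phosducin", "transducin"},
    "species": {"human", "mouse"},
}


def facet_counts(abstracts, query, facets=FACETS):
    """Count, per facet, the concepts co-occurring with query in an abstract."""
    query = query.lower()
    counts = defaultdict(Counter)
    for text in abstracts:
        words = set(re.findall(r"[a-z0-9-]+", text.lower()))
        if query not in words:
            continue  # only abstracts that mention the query contribute
        for facet, terms in facets.items():
            for term in terms & words:
                if term != query:
                    counts[facet][term] += 1
    return {facet: dict(counter) for facet, counter in counts.items()}


# Invented toy "abstracts", not real PubMed text.
docs = [
    "phosducin and transducin were studied in mouse",
    "phosducin levels in human samples",
    "transducin in mouse",
]
print(facet_counts(docs, "phosducin"))
```

Each abstract is counted at most once per concept, so the numbers correspond to documents rather than mentions. Selecting a facet value would then amount to filtering the abstracts to those that contain it.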

Text Processing and Database Resources

  • TerMine provides part-of-speech tagging using GENIA, term normalisation and acronym extraction/clustering on explicitly submitted text, supporting variations such as "NF kappa B", "NFkB" and "nuclear factor kappa B"
    • Results show the submitted text annotated with recognised terms and acronyms.
  • AcroMine is a database of acronyms found in PubMed
    • Techniques employed include word sense disambiguation classifiers based on features such as neighbouring words and context.
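
To make the idea of clustering term variations concrete, here is a rough sketch that maps spelling variants of one term onto a shared key. It is an assumption-laden illustration, not TerMine's or AcroMine's actual method: the `GREEK` table, `acronym_key` and `cluster_variants` are hypothetical names, and the long-form heuristic (any lower-case word longer than three letters that is not a Greek letter name) is deliberately crude.

```python
import re
from collections import defaultdict

# Assumption: spelled-out Greek letters are abbreviated to a single letter.
GREEK = {"alpha": "a", "beta": "b", "gamma": "g", "delta": "d", "kappa": "k"}


def acronym_key(term):
    """Return a normalised key shared by short and long forms of a term."""
    words = re.findall(r"[A-Za-z0-9]+", term)
    # Treat the term as a long form if it contains an ordinary lower-case word.
    is_long_form = any(
        w.islower() and len(w) > 3 and w not in GREEK for w in words
    )
    pieces = []
    for w in words:
        piece = GREEK.get(w.lower(), w.lower())
        # Long forms contribute initials; short forms are kept whole.
        pieces.append(piece[0] if is_long_form else piece)
    return "".join(pieces)


def cluster_variants(terms):
    """Group terms by their normalised key."""
    clusters = defaultdict(list)
    for term in terms:
        clusters[acronym_key(term)].append(term)
    return dict(clusters)


print(cluster_variants(["NF kappa B", "NFkB", "nuclear factor kappa B"]))
```

The heuristic fails for long forms whose acronym is not built from word initials (for example "interleukin 2" versus "IL-2"), which is one reason real systems rely on classifiers and context rather than string rules alone.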