Text Indexing Resources
From irefindex
Latest revision as of 13:45, 27 October 2009
A considerable number of text indexing solutions exist. This document lists some of the more widely known open source solutions.
- ht://Dig - a search engine solution for individual Web sites
- Hyper Estraier - a solution reasonably well utilised by other software systems and applications
- Lucene - arguably the most popular text indexing solution in current use, originally implemented in Java, with bindings for, and ports to, other languages
- Managing Gigabytes: Compressing and Indexing Documents and Images - provides software from the book of that name
- PostgreSQL full text search - incorporates the previously separate tsearch2 functionality into recent versions of PostgreSQL (from 8.3 upwards)
- Sphinx - used by numerous large-scale public Web sites and services in a traditional document search role
- Swish-e - a Web site indexer "ideally suited for collections of a million documents or smaller", derived from SWISH
- SWISH++ - an indexing and searching engine typically used for documents on Web sites, derived from Swish-e
- Terrier - supports various formats and large collections ("in a centralised architecture to at least 25 million documents, and using the Hadoop Map Reduce distributed indexing scheme for even larger collections"), implemented in Java
- Tokyo Dystopia - a full-text search system, implemented in C
- Whoosh - a pure Python search engine, apparently attracting interest from various other Python-based projects reluctant to use Lucene, Xapian and other technologies
- Wumpus - an information retrieval system being used to investigate desktop search solutions, amongst other things
- Xapian - a reasonably popular solution implemented in C++, with bindings for various languages and a heritage dating back to 1984 and earlier
- Zebra - supports "large databases (more than ten gigabytes of data, tens of millions of records)", implemented in C
- Zettair - previously known as Lucy (possibly the Lucene derivative of that name), reported to have "indexed the 426GB TREC terabyte track collection", implemented in C
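Most of the solutions listed above are built around some form of inverted index, which maps each term to the documents containing it. As a rough illustration of that idea only, here is a minimal sketch in Python. It is not taken from any of the listed projects: the function names (`tokenize`, `build_index`, `search`) and the sample documents are hypothetical, and real engines add ranking, stemming, compression and on-disk storage on top of this.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each term to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the sorted ids of documents containing every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    # Start from the postings of the first term, then intersect the rest.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return sorted(result)


# Hypothetical sample documents, invented for this illustration.
documents = {
    1: "Lucene is a text indexing library written in Java",
    2: "Xapian is implemented in C++ with bindings for various languages",
    3: "Whoosh is a pure Python search engine",
}
index = build_index(documents)
```

Calling `search(index, "is a")` returns the ids of the documents that contain both terms. Treating a query as a conjunction of terms is a simplification chosen for brevity; the listed engines typically support richer query languages and relevance ranking.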
Some links to comparisons and summaries:
- Full text search on Wikipedia
- Open Source Search Engines, Retrieval Tools and Libraries
- A Comparison of Open Source Search Engines - a controversial set of benchmarks applied to various solutions
- On open source IR - mischaracterises the rationale for copyleft and includes various now-unmaintained solutions, but also mentions solutions which are now widely used (such as Lucene)