Difference between revisions of "Text Indexing Resources"

Revision as of 17:28, 19 October 2009

A considerable number of text indexing solutions exist. This document lists some of the more widely known open source solutions.

ht://Dig - a search engine solution for individual Web sites
Hyper Estraier - a solution in reasonably wide use by other software systems and applications
Lucene - arguably the most popular text indexing solution in current use; originally implemented in Java, with bindings for, and ports to, other languages
Managing Gigabytes: Compressing and Indexing Documents and Images - provides software from the book of that name
PostgreSQL full text search - incorporates the previously separate tsearch2 functionality into recent versions of PostgreSQL (from 8.3 upwards)
Sphinx - used by numerous large-scale public Web sites and services in a traditional document search role
Swish-e - a Web site indexer "ideally suited for collections of a million documents or smaller", derived from SWISH
SWISH++ - an indexing and searching engine typically used for documents on Web sites, derived from Swish-e
Terrier - supports various formats and large collections ("in a centralised architecture to at least 25 million documents, and using the Hadoop Map Reduce distributed indexing scheme for even larger collections"), implemented in Java
Whoosh - a pure Python search engine, apparently attracting interest from various other Python-based projects reluctant to use Lucene, Xapian and other technologies
Wumpus - an information retrieval system being used to investigate desktop search solutions, amongst other things
Xapian - a reasonably popular solution implemented in C++ with bindings for various languages, with a heritage dating back to 1984 and earlier
Zebra - supports "large databases (more than ten gigabytes of data, tens of millions of records)", implemented in C
Zettair - previously known as Lucy (possibly the Lucene derivative of that name), which has "indexed the 426GB TREC terabyte track collection"; implemented in C

Some links to comparisons and summaries:

Full text search on Wikipedia
Open Source Search Engines, Retrieval Tools and Libraries
A Comparison of Open Source Search Engines - a controversial set of benchmarks applied to various solutions
On open source IR - mischaracterises the rationale for copyleft and includes various now-unmaintained solutions, but also mentions solutions which are now widely used (such as Lucene)

@@ Line 20: / Line 20: @@
 * [http://en.wikipedia.org/wiki/Full_text_search Full text search] ''on Wikipedia''
 * [http://www.searchenginecaffe.com/2007/03/open-source-search-engines-in-java-and.html Open Source Search Engines, Retrieval Tools and Libraries]
+* [http://zooie.wordpress.com/2009/07/06/a-comparison-of-open-source-search-engines-and-indexing-twitter/ A Comparison of Open Source Search Engines] - ''a controversial set of benchmarks applied to various solutions''
 * [http://www.emeraldinsight.com/Insight/ViewContentServlet?Filename=Published/EmeraldFullTextArticle/Articles/2760550403.html On open source IR] - ''mischaracterises the rationale for copyleft and includes various now-unmaintained solutions, but also mentions solutions which are now widely used (such as Lucene)''
+[[Category:Bioscape]]
