Text Indexing Resources

Many text indexing solutions exist. This document lists some of the more widely known open source ones.

ht://Dig - a search engine solution for individual Web sites
Hyper Estraier - a solution used reasonably widely by other software systems and applications
Lucene - arguably the most popular text indexing solution in current use, original implementation in Java with bindings for, and ports to, other languages
Managing Gigabytes: Compressing and Indexing Documents and Images - provides software from the book of that name
PostgreSQL full text search - the previously separate tsearch2 functionality, integrated into PostgreSQL itself from version 8.3 onwards
Sphinx - used by numerous large-scale public Web sites and services in a traditional document search role
Swish-e - a Web site indexer "ideally suited for collections of a million documents or smaller", derived from SWISH
SWISH++ - an indexing and searching engine typically used for documents on Web sites, derived from Swish-e
Terrier - supports various formats and large collections ("in a centralised architecture to at least 25 million documents, and using the Hadoop Map Reduce distributed indexing scheme for even larger collections"), implemented in Java
Whoosh - a pure Python search engine, apparently attracting interest from various other Python-based projects reluctant to use Lucene, Xapian and other technologies
Wumpus - an information retrieval system being used to investigate desktop search solutions, amongst other things
Xapian - a reasonably popular solution implemented in C++ with bindings for various languages, with a heritage dating back to 1984 and earlier
Zebra - supports "large databases (more than ten gigabytes of data, tens of millions of records)", implemented in C
Zettair - previously known as Lucy (possibly the Lucene derivative of that name), implemented in C, and reported to have "indexed the 426GB TREC terabyte track collection"

Some links to comparisons and summaries:

Full text search on Wikipedia
Open Source Search Engines, Retrieval Tools and Libraries
A Comparison of Open Source Search Engines - a controversial set of benchmarks applied to various solutions
On open source IR - mischaracterises the rationale for copyleft and includes various now-unmaintained solutions, but also mentions solutions which are now widely used (such as Lucene)
