Posted on 07-10-2008
Filed Under (inglês, search engine) by David William

I’m working on a new database-backed web project that needs a good search engine. Since there are already many records and the database will keep growing, search performance is clearly critical.

I have to admit that I have always been content to rely on views, stored procedures, scripts, and database settings to get better query performance. However, on a friend’s recommendation, I found Sphinx. Here I want to share a little about this search engine, which may be the perfect solution for anyone who needs high performance on complex searches over a huge amount of data.

Sphinx was created by Andrew Aksyonoff, but many other developers are contributing. The project started back in 2001, when Andrew was working on a database-backed web site and couldn’t find an acceptable search solution that met all his requirements. He was looking for:

  • Search quality (good relevance)
  • Search speed
  • Moderate disk and CPU requirements when indexing

So he started developing Sphinx. Despite the time that has passed and the numerous improvements made to other solutions, he has decided to continue developing Sphinx.

What is Sphinx?

Sphinx is a full-text search engine, distributed under GPL version 2. A commercial license is also available for embedded use. Generally, it’s a standalone search engine, meant to provide fast, size-efficient and relevant full-text search functions to other applications. Sphinx was specially designed to integrate well with SQL databases and scripting languages. Currently, the built-in data sources support fetching data either via a direct connection to MySQL or PostgreSQL, or through the XML pipe mechanism (a pipe to the indexer in a special XML-based format that Sphinx recognizes). As for the name, Sphinx is an acronym which is officially decoded as SQL Phrase Index.

Full-text?

A full-text search uses a full-text index to make queries faster and more relevant. With a full-text search engine, it is possible to search using logical and proximity operators, functions, and relevance rules. In a SQL database, a full-text index can be created from CHAR, VARCHAR or TEXT columns. Each database implements full-text search in its own way, but the concept remains the same.
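
To make the idea of a full-text index more concrete, here is a minimal sketch of an inverted index in Python. It is only an illustration of the general concept, not how Sphinx or any particular database stores its index; the function names and the sample documents are made up for this example.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search_all(index, query):
    """Return the ids of documents containing every query word (boolean AND)."""
    words = query.lower().split()
    if not words:
        return set()
    # Copy the first posting set so the index itself is never modified.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample data, not taken from any real project.
docs = {
    1: "fast full-text search engine",
    2: "relational database engine",
    3: "fast database search",
}
index = build_index(docs)
print(sorted(search_all(index, "fast search")))  # [1, 3]
```

The lookup only touches the word lists for the query terms instead of scanning every record, which is why an index like this tends to beat a plain LIKE '%word%' scan as the data grows.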

Distribution

Currently, the Sphinx distribution tarball includes the following software:

  • indexer: a utility which creates full-text indexes;
  • search: a simple command-line (CLI) test utility which searches through fulltext indexes;
  • searchd: a daemon which enables external software (e.g. web applications) to search through fulltext indexes;
  • sphinxapi: a set of searchd client API libraries for popular Web scripting languages (PHP, Python, Perl, Ruby).

Sphinx features

  • high indexing speed (up to 10 MB/sec on modern CPUs);
  • high search speed (the average query takes under 0.1 sec on 2-4 GB text collections);
  • high scalability (up to 100 GB of text, up to 100 M documents on a single CPU);
  • provides good relevance ranking through a combination of phrase proximity ranking and statistical (BM25) ranking;
  • provides distributed searching capabilities;
  • provides document excerpt generation;
  • provides searching from within MySQL through a pluggable storage engine;
  • supports boolean, phrase, and word proximity queries;
  • supports multiple full-text fields per document (up to 32 by default);
  • supports multiple additional attributes per document (e.g. groups, timestamps, etc.);
  • supports stopwords;
  • supports both single-byte encodings and UTF-8;
  • supports English stemming, Russian stemming, and Soundex for morphology;
  • supports MySQL natively (MyISAM and InnoDB tables are both supported).
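
The statistical (BM25) part of the ranking mentioned in the list can be sketched in a few lines of Python. This is the textbook BM25 formula with commonly used default parameters (k1 = 1.2, b = 0.75), which I am assuming here for illustration; it is not Sphinx's own implementation, which also mixes in phrase proximity. The sample documents are invented.

```python
import math
from collections import Counter


def bm25_scores(docs, query, k1=1.2, b=0.75):
    """Score each tokenized document against the query terms with textbook BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue  # a term missing from the document adds nothing
            # Rare terms get a higher weight than common ones.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term frequency saturates, and long documents are penalized.
            norm = tf[term] + k1 * (1.0 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores


# Hypothetical sample documents, already split into words.
docs = [
    "sphinx is a fast search engine".split(),
    "mysql is a relational database".split(),
    "sphinx integrates with mysql and postgresql".split(),
]
scores = bm25_scores(docs, ["sphinx", "search"])
best = max(range(len(docs)), key=lambda i: scores[i])
print(best)  # 0: the only document containing both query terms
```

The second document scores zero because it contains neither query term, and the first one wins because it matches both.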

Stopwords?

In English, these are words such as “a, are, an, at, be, do, in, of, the” and “to”. Because such words are repeated so often in a language, search sites do not index them, which makes a results page perform much better. As a result, including these words in a text does not make much difference to the search. (Source: www.marketingdebusca.com.br)
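
As a small illustration, this is roughly what filtering stopwords before indexing looks like in Python. The word list is the one quoted above; the function name and the sample sentence are my own, and this is only a sketch of the idea, not Sphinx's code.

```python
# The English stopwords listed in the paragraph above.
STOPWORDS = {"a", "are", "an", "at", "be", "do", "in", "of", "the", "to"}


def remove_stopwords(text):
    """Lowercase the text, split it on whitespace, and drop the stopwords."""
    return [word for word in text.lower().split() if word not in STOPWORDS]


print(remove_stopwords("The performance of a search engine"))
# ['performance', 'search', 'engine']
```

Only the three meaningful words are left to be indexed, so the index stays smaller and queries have less to look through.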

Stemming?

Linguistically, words follow morphological rules that allow a speaker to derive variants of the same idea to evoke an action (verb), an object or concept (noun) or the property of something (adjective). For instance, words derived from the same stem can share an abstract meaning of action and movement. Stemming does the reverse process: it deduces the stem from a fully suffixed word according to its morphological rules. These rules concern morphological and inflectional suffixes.
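
To show the reverse process in practice, here is a deliberately naive suffix-stripping stemmer in Python. Real stemmers (such as the English and Russian ones Sphinx ships with) apply far more careful rules; the suffix list, the minimum stem length, and the example words below are all assumptions made for this sketch.

```python
# A tiny, hypothetical suffix list; longer suffixes are tried first.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


print([naive_stem(w) for w in ["walk", "walks", "walked", "walking"]])
# ['walk', 'walk', 'walk', 'walk']
```

Because all four forms reduce to the same stem, a search for any one of them can match documents containing the others.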

Where to get Sphinx?

http://www.sphinxsearch.com/dowloads.html

This post is just an overview of Sphinx. Right now I’m running my very first tests with it. My first impressions and the first solid results will be reported in another post.

See you.


Comments

[...] wrote another post about the basic things you should know about Sphinx. If Sphinx is something real new for you, I [...]


Ragahvendra on 12 November, 2009 at 11:35 pm #

Good article,

I am trying to understand, how can sphinx help me search contents of documents like (doc, docx, pdf).
Please let me know if there is anything available on these lines

