I’m working on a new database-backed web project that needs a good search engine. Since there are already many records and the database will keep growing, search performance is clearly critical.
I have to admit that I had always been content to rely on views, stored procedures, scripts, and database settings to improve query performance. However, on a friend’s recommendation, I found Sphinx. Here I want to share a little about this search engine, which may be the perfect solution for anyone who needs high performance on complex searches over a huge amount of data.
Sphinx was created by Andrew Aksyonoff, but many other developers contribute to it. The project started back in 2001, when Andrew was working on a database-driven web site. He just couldn’t find an acceptable search solution that met all his requirements. He was looking for:
So he started to develop Sphinx. Despite the time that has passed and the numerous improvements made to other solutions, he has decided to continue developing Sphinx.
What is Sphinx?
Sphinx is a full-text search engine distributed under GPL version 2. A commercial license is also available for embedded use. Generally, it’s a standalone search engine, meant to provide fast, size-efficient and relevant full-text search functions to other applications. Sphinx was specially designed to integrate well with SQL databases and scripting languages. Currently, the built-in data sources support fetching data either via a direct connection to MySQL or PostgreSQL, or through the XML pipe mechanism (a pipe to the indexer in a special XML-based format that Sphinx recognizes). As for the name, Sphinx is an acronym officially decoded as SQL Phrase Index.
Full-text?
A full-text search uses a full-text index to make queries faster and more relevant. With a full-text search engine it is possible to search using logical and proximity operators, functions, and relevance rules. A full-text index can be created from CHAR, VARCHAR or TEXT columns. Each database implements full-text search in its own way, but the concept remains the same.
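To make the idea of a full-text index a bit more concrete, here is a minimal sketch of an inverted index in Python. It is only an illustration of the general concept, not how Sphinx or any particular database implements it; the function names and the sample documents are made up for this example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search_all(index, query):
    """Return ids of documents containing every query word (logical AND)."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample data, standing in for rows of a TEXT column.
documents = {
    1: "Sphinx is a full-text search engine",
    2: "MySQL stores the records of the web project",
    3: "A search engine needs a full-text index",
}
index = build_index(documents)
print(sorted(search_all(index, "search engine")))    # [1, 3]
print(sorted(search_all(index, "full-text index")))  # [3]
```

The point of the index is that a query only looks up the words it contains instead of scanning every record, which is why it stays fast as the table grows.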
Distribution
Currently, the Sphinx distribution tarball includes the following software:
Sphinx features
Stopwords?
In English, stop words are words such as “a, are, an, at, be, do, in, of, the and to”. Because these words are repeated so often in a language, search sites do not index them, so results pages are returned much faster. For the same reason, including these words in the text does not make much difference. (Source: www.marketingdebusca.com.br)
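As a rough illustration, stop word filtering can be sketched in a few lines of Python. The stop word set below is just the short list quoted above, and the function name is hypothetical; real engines use longer, configurable lists.

```python
# Stop words taken from the short English list quoted above.
STOPWORDS = {"a", "are", "an", "at", "be", "do", "in", "of", "the", "to"}


def remove_stopwords(text):
    """Split text on whitespace and drop any word found in STOPWORDS."""
    return [word for word in text.lower().split() if word not in STOPWORDS]


print(remove_stopwords("The performance of a search engine"))
# ['performance', 'search', 'engine']
```

Only the remaining words would be written to the index, which keeps it smaller and the lookups cheaper.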
Stemming?
Linguistically, words follow morphological rules that allow a speaker to derive variants of the same idea to evoke an action (verb), an object or concept (noun) or the property of something (adjective). For instance, the following words are derived from the same stem and share an abstract meaning of action and movement. Stemming does the reverse process: it deduces the stem from a fully suffixed word according to its morphological rules. These rules concern morphological and inflectional suffixes.
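Here is a deliberately naive sketch of the idea in Python: it strips a few common English suffixes so that related word forms collapse to one stem. This is not the algorithm Sphinx uses, and real stemmers apply far more careful rules; the suffix list, the minimum stem length and the sample words are all assumptions made for this example.

```python
# A tiny, hypothetical suffix list; longer suffixes are tried first.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ["searching", "searched", "searches", "search"]:
    print(w, "->", naive_stem(w))  # every form maps to "search"
```

With stemming applied at both index and query time, a search for “searching” can also match documents that only contain “searched” or “searches”.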
Where to get Sphinx?
http://www.sphinxsearch.com/dowloads.html
This post is just an overview of Sphinx. Right now I’m running my very first tests with it. My first impressions and the first solid results will be reported in another post.
See you.
Good article.
I am trying to understand how Sphinx can help me search the contents of documents (doc, docx, pdf).
Please let me know if there is anything available along these lines.