Searching MySQL databases with fulltext indexes

Posted by Steve on Tue 16 Aug 2005 at 21:31

When you have a dynamic website with all the content loaded from a database backend you're most likely going to wish visitors to be able to search it. Using the builtin MySQL FULLTEXT indexes makes this very simple.

Many websites, including this one, are implemented by having their contents stored in a database and rendered upon demand by visiting clients. There are several benefits of this approach:

  • Simple backup and mirroring via database replication .
  • Increased flexability when it comes to presentation; simply update your templates to change the complete site.
  • Greater flexibility in general.

(Of course there are drawbacks with dynamic websites; if you're not careful a sudden inrush of visitors to your site can cause it to become overloaded and fail - a so-called Slashdotting. We've survived two ;)

When it comes to searching such a site it's tempting to construct your own SQL statements, in a fashion similar to this:

select * from articles WHERE body LIKE '%blah%' ;

However if you start down this path you quickly begin to realise that you need to handle more complex cases, such as logical operators, finding items which contain some, or all, of the given terms for example.

Instead of implement the logical searches yourself it is better to leaverage the power of the database and allow it to do the work for you.

This is where the MySQL FULLTEXT search comes into play. Rather than constructing your SQL search string and executing it by hand you can allow the database to do the hard work.

To do this you must first add index columns upon your tables, either when you create them - or by adding them later.

Given the table:

CREATE TABLE `articles` 
(
  `id` int(11) NOT NULL default '0',
  `title` varchar(65) default NULL,
  `topic` varchar(25) NOT NULL default '',
  `author` varchar(25) NOT NULL default '',
  `ondate` datetime NOT NULL default '0000-00-00 00:00:00',
  `body` text NOT NULL,
);

You may add an index upon it with the following query:

ALTER TABLE articles ADD FULLTEXT(title,body);

This will create an index upon the two columns "title" and "body", which can later be used to perform searches.

By default the indexes you create will only include words of between four and twenty characters in length, this is something that you can adjust if you need to. You should try to think of the search terms your visitors are liable to use and adjust your limitations accordingly. A mimimum search term of four-letters would not allow visitors to search for such terms as "SSH", or "CVS" which would be a problem upon sites like this!

To alter the minimum word length you should adjust the file /etc/mysql/my.cnf. In the "[mysqld]" section add the following:

# Full Text Minimum Word Length
ft_min_word_len = 3

Once you've restarted your database server with "/etc/init.d/mysql restart" you should be able to index and search against smaller words.

Once you have the fulltext indexes using them to search against is pretty simple, although it is important to note that you must use the indexes you've created exactly.

For example if you have an index upon both the title and body fields of a table then you must search against both of those - searching on one alone is not possible.

A basic search, using the table structure we used earlier, would look like this:

SELECT id,title FROM articles WHERE MATCH (title,body) AGAINST('cvs');

This gives :

+-----+--------------------------------------------------------------+
| id  | title                                                        |
+-----+--------------------------------------------------------------+
| 186 | Permitting anonymous read-only access to your CVS repository |
...
| 131 | Debian Admininistration Site Update                          |
+-----+--------------------------------------------------------------+

As you can see we've successfully managed to perform the search; the two results both contain the term CVS - one in the title, and one presumably in the body.

Using the BOOLEAN search mode you can also allow visitors to construct more advanced queries for example:

-termMake sure that the term mentioned doesn't appear.
+termMake sure that the term mentioned does appears.

As an example of this we'll search for text that contains the word "CVS", but doesn't contain "ssh":

 select id,title FROM articles WHERE 
    MATCH(title,body) AGAINST ("+cvs -ssh" IN BOOLEAN MODE);

The results again look as expected:

+-----+--------------------------------------------------------------+
| id  | title                                                        |
+-----+--------------------------------------------------------------+
|  74 | This site's code is now available                            |
...
| 186 | Permitting anonymous read-only access to your CVS repository |
| 194 | Generating a ChangeLog from CVS history                      |
+-----+--------------------------------------------------------------+

Finally one last thing that you might wish to do is view a "score" of the matching results, which you can use to order your results by relevence.

Using the following query allows you to view the score as well as the results, note that the search terms are included twice:

SELECT id,title,MATCH (title,body) AGAINST ('cvs') AS score
   FROM articles WHERE MATCH (title,body) AGAINST ('cvs') 
      ORDER by score DESC

The resulting score can be used to display to clients :

+-----+----------------------------------------------------+-----------------+
| id  | title                                              | score           |
+-----+----------------------------------------------------+-----------------+
|  72 | Setting up a secure CVS server with OpenSSH        | 4.2276616096497 |
| 194 | Generating a ChangeLog from CVS history            | 3.7725563049316 |
+-----+----------------------------------------------------+-----------------+

 

 


Posted by Anonymous (80.135.xx.xx) on Tue 16 Aug 2005 at 22:42
Nice one! I've use the SELECT WHERE LIKE method for quite some time, but this seems to be much more powerful. Will try it definetely on my site. Thanks.

[ Parent | Reply to this comment ]

Posted by Anonymous (84.92.xx.xx) on Wed 17 Aug 2005 at 13:56
Fantatic!

It's like aspseek without the bloat!

[ Parent | Reply to this comment ]

Posted by Anonymous (82.119.xx.xx) on Wed 17 Aug 2005 at 12:13
One drawback is that you can't search for common words - words found on more than xx% rows (50% by default I think) aren't included in fulltext index.
I was searching here something before few days - and I noticed you are using fulltext indexes, because it gave me no results :) So I had to list through article index to find that article.
So it would be great for this site to allow also "standard" search (in title only) in case fulltext search fails...

[ Parent | Reply to this comment ]

Posted by Steve (82.41.xx.xx) on Wed 17 Aug 2005 at 13:06
[ View Steve's Scratchpad | View Weblogs ]

This seems reasonable, and I will add it to the TODO list.

Previously I think the search was worse, it was matching too many times. So a search for simple, and common, words would pull out 100+ articles.

I think there's certainly a case to be made for "less is more" in search results..

Steve
-- Steve.org.uk

[ Parent | Reply to this comment ]

Posted by Anonymous (70.21.xx.xx) on Wed 17 Aug 2005 at 15:17
We use a hybrid search in Syntax CMS, seen here:

http://www.cgdev.org/content/search/?search_query=birdsall&x=0&y= 0

The best matches uses a simple LIKE against the title field

The other results use a Fulltext index.

[ Parent | Reply to this comment ]

Posted by Steve (82.41.xx.xx) on Wed 17 Aug 2005 at 15:35
[ View Steve's Scratchpad | View Weblogs ]

"Title only" searches are now available.

I hope it is useful...

Steve
-- Steve.org.uk

[ Parent | Reply to this comment ]

Posted by Anonymous (66.245.xx.xx) on Wed 28 Sep 2005 at 04:10
just wanted to express my appreciation for your article showing me how to simply create a search function for multiple database fields. Worked like a charm

[ Parent | Reply to this comment ]

Posted by Anonymous (38.117.xx.xx) on Mon 27 Feb 2006 at 23:25
Works like charm, I changed my acronyms and abbreviations website accordingly and it runs much fast now.

[ Parent | Reply to this comment ]

Posted by Anonymous (84.113.xx.xx) on Thu 13 Mar 2008 at 06:52
For my Mediawiki i changed this and it works. If somebody needs it:

-------------------------------------------------------
Search in Mediawiki words or numbers with 3 characters:
1) ----------------------------------------------------
in
/etc/mysql/my.cnf
add the following line (if they do not exist):

[myisamchk]
ft_min_word_len = 3

[mysqld]
ft_min_word_len = 3

2) ----------------------------------------------------
Restart your MySQL server with this command in your shell:
/etc/init.d/mysql restart

3) ----------------------------------------------------
In phpmyadmin select your mediawiki-database, go to SQL and execute:
REPAIR TABLE yourwikiprefix_searchindex;

[ Parent | Reply to this comment ]

Sign In

Username:

Password:

[Register|Advanced]

 

Flattr

 

Current Poll

What do you use for configuration management?








( 676 votes ~ 10 comments )