Weblog entry #191 for simonw

Free text search in MySQL
Posted by simonw on Mon 15 Oct 2007 at 15:52

A new big game fishing website we did at work, uses the MySQL free text search for the product search.

With tweaking the minimum word length to 3 (ft_min_word_len), creating an index on the name field, and an index on the name, short_description, and long description, of each product (and product options), it was possible to add the scores from these two matches to produce a search (rather crudely) weighted on name. Obviously refinements on the weighting are possible by more matches, and adding some factors for the weighting "a * (MATCH 1) + b * (MATCH 2) + etc, where "a" and "b" are tuned to give the results you want.

alter table products 
 add FULLTEXT(name);
alter table products 
 add FULLTEXT(name,short_description,long_description);

However the MySQL way of matching for scoring means you list the full "MATCH .. AGAINST..." term in bother the "SELECT" and "WHERE" clause, which means that search queries rapidly become very verbose, especially if you start trying to weight term;

SELECT DISTINCT(products.id), products.name, products.thumb, products.short_description , 
(
 MATCH(products.name, products.short_description, products.long_description)
  AGAINST('search term')
 +
 MATCH(products.name) AGAINST ('search term') 
) as scoreproduct 
FROM products, product_options 
WHERE products.visible = 1 
 AND product_options.product_id = products.id 
 AND 
  ( 
   MATCH(products.name, products.short_description, products.long_description)
    AGAINST('search term')
   OR 
   MATCH(product_options.model_id,product_options.note) 
    AGAINST ('search term') 
  )
ORDER BY scoreproduct DESC;

Including the whole "MATCH ... AGAINST ..." repeatedly doesn't seem to fit the whole relational database/SQL model.

Beyond this one can use query expansion in MySQL, but the customer seems content with the results of queries like the one above, and query expansion tends to produce a lot of spurious results. I dare say that Google will do a better job of indexing it within the week, but sites that rely on Google to be their search engine look so terribly amateur, and will miss new products as they are added to the database. I much prefer working with Postgres, but MySQL has these features built in, and the Postgres free text search add-ons feel like a bigger hack, if only because it isn't built in yet.

For static sites we use swish-e, which does a pretty good job. Setting it up for a new site can be a bit of an exercise in love, but it does allow some pretty good refinement in the results. It can also do database driven sites with a bit of love and attention, but again you end up with asynchronous updates, and unless there is significant non-database content that people are searching for, it is more trouble than the MySQL FTS.

Always interested to learn what others are doing here. Seems to me that Google does it so well without any "visible" per site tweaks*, that it must be possible to do it better with "default" settings for most of the sites we do. Perhaps I should just reread the Swish-E manual pages, and play a little more.

I have on the job list, a task that involves doing this in the more general case at some point not too far in the future, as part of building a community based site, that must index sites of community members, turn out custom RSS feeds based on search terms, and maybe even "Google Alert"** style messages etc. Most, but so far not quite all, of the planned functionality is possible with third party tools like Yahoo pipes.

* I know Google do per site tweaks, but I'm guessing based on employees per websites on the Internet, it doesn't happen that often, even if several are doing it as a full time job.

** Who was the science fiction writer who predicted something like these many decades ago? I read the story, I think about an architect whose building falls down or some such vague recollection.

 

Comments on this Entry

Posted by Steve (80.68.xx.xx) on Mon 15 Oct 2007 at 16:37
[ Send Message | View Steve's Scratchpad | View Weblogs ]

This reminds me I really need a better search interface upon this site.

The fact is I use the tag cloud more often than the full-text search since it actually works better!

I will investigate this over the next week.

Steve

[ Parent | Reply to this comment ]

Posted by simonw (84.45.xx.xx) on Mon 15 Oct 2007 at 18:50
[ Send Message | View Weblogs ]
I've noticed the somewhat inconsistent search results, usually when trying to find my own blog posts (fortunately Google knows).

I did already review YAWNS for what we want, and it was a reasonable match (and a lot simpler than slashcode and such like). Although I don't recall looking at the tag cloud, which seems to work very well indeed, and I didn't realize it did per keyword RSS feeds.

I think the search is not including everything (blogs and articles), and not including the author field either. I have those search plugins for Firefox that use the blog search, and tag search, but I think folk want just a box they type in that searches everything, and returns it in a sensible order. Experience is "advanced search" options don't get much use, even if they have a nice simple interface. Heck even I usually tick most of the sections on such forms, because if I knew where it was I wouldn't be using the search ;)

MySQL offers "boolean" search features as well, but I think most "normal" users don't even know that most search engines support such feature, let alone have confidence in their ability to use them.

[ Parent | Reply to this comment ]

Posted by Steve (82.32.xx.xx) on Mon 15 Oct 2007 at 18:59
[ Send Message | View Steve's Scratchpad | View Weblogs ]

Yes the big problem at the moment is the separate searchs for articles, and weblog entries - and no explicit search for comments at all. There is currently a per-author search, but again that isn't tied into the search - instead it merely gives you a "Show me all articles posted by user XXX" feature which isn't terribly useful.

I did try to unify them at one point, but quickly became overwhelmed.

Nowadays I think I'd be more inclined to remove the search code and rely upon an installation of swish, perlfect search, or similar. Even a local spider would work nicely - and that would search comments by accident too!

As for the tag cloud, I like the per-keyword RSS feeds and use them a fair bit. But more useful than that I think are the "related tags". They are what I use to drill down and search around for something which I know has been posted in the past.

Steve

[ Parent | Reply to this comment ]

User Login

Username:

Password:

[ Advanced Login ]

Register Account

Quick Site Search