Writing a simple indexer and searcher with Lucy::Simple

Posted by Steve on Mon 22 Apr 2013 at 09:34

This site was previously hosted on a single machine, and was recently moved to a cluster. The move broke the search interface, which had to be reworked, and this article describes how the new site search was implemented.

There are several well-known crawlers, indexers, and search interfaces out there. Previously this site used mnogosearch, which was introduced and documented in the brief article "setting up a search engine for your website".

Unfortunately mnogosearch placed a severe load upon the (single) server, because it worked by:

  • Crawling every page of a website, via HTTP.
  • Indexing the contents of those pages, to a MySQL database.

The generated index was pretty efficient though, and visitors could easily query it, via a simple CGI script.

This time round I decided that I absolutely did not want any search system which involved crawling the website. Although the new, scalable cluster could handle the load, why waste time on that overhead when the article bodies are already stored in a database?

Looking around, there were several promising packages which would let me index a series of documents/articles, but given my preference for Perl code I was immediately drawn to Lucy::Simple, which is a stripped-down interface to the Apache Lucy search engine library.

So, what is it, and how does it work?

This is a package which lets you write your own indexing/searching system, giving you a simple means to add articles to an index and later retrieve them.

The following snippet shows us adding two "documents" to a new index:

#!/usr/bin/perl

use strict;
use warnings;
use Lucy::Simple;

#
# Ensure the index directory is both available and empty.
#
my $index = "/tmp/index";
system( "rm", "-rf", $index );
system( "mkdir", "-p", $index );


#  Create the helper.
my $lucy = Lucy::Simple->new( path => $index, language => 'en', );

# Add the first "document".
my %one = ( title => "This is a title" , body => "Body content", id => 1 );
$lucy->add_doc( \%one );

# Add the second "document".
my %two = ( title => "Special article" , body => "My content", id => 2 );
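$lucy->add_doc( \%two );

# The pending additions are committed to the index when $lucy is
# destroyed, which here happens as the script exits.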

As you can see we've added hashes to an index. The hashes in our example have three members: "title", "body", and "id". The great thing about Lucy::Simple is that you can add an arbitrary number of hash keys, and the value of each one will be indexed.

Once you've run this script you'll see that /tmp/index has been populated with various files:

root@da-misc ~ # ls /tmp/index/
locks  schema_1.json  seg_1  snapshot_1.json
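
To illustrate the point about arbitrary hash keys, here is a minimal sketch which indexes a list of hashes carrying an extra "author" key and then searches for a word that only appears in that key. The "author" key, its values, and the use of a temporary directory are assumptions made for the example; they are not part of this site's real indexer.

```perl
#!/usr/bin/perl

use strict;
use warnings;

use File::Temp qw( tempdir );
use Lucy::Simple;

# Hypothetical articles: "author" is an extra key added purely for illustration.
my @articles = (
    { id => 1, title => "This is a title", body => "Body content", author => "Steve" },
    { id => 2, title => "Special article", body => "My content",   author => "Bob"   },
);

# Use a throwaway directory so the example never touches a real index.
my $index = tempdir( CLEANUP => 1 );

my $lucy = Lucy::Simple->new( path => $index, language => 'en' );

# Every key of every hash becomes a searchable field.
$lucy->add_doc($_) for @articles;

# search() returns the total number of matching documents.
my $count = $lucy->search( query => 'steve' );
print "Documents matching 'steve': $count\n";
```

If the extra key is indexed as expected, the search should only match the first article, even though "steve" appears in neither its title nor its body.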

Of course the useful thing you can do with an index is search it. For that purpose we'll write a simple script, search.pl, which opens the index we've created and searches it for the term given on the command line:

#!/usr/bin/perl

use strict;
use warnings;

use Lucy::Search::IndexSearcher;

my $term = shift or die "Usage: $0 search-term\n";

my $searcher = Lucy::Search::IndexSearcher->new( index => '/tmp/index');

my $hits = $searcher->hits( query => $term );
while ( my $hit = $hits->next ) {
        print "Title: $hit->{title} - ID: $hit->{id}\n";
}

That script is pretty simple, and uses the index we've previously constructed to do the search. Any matching documents will have their title and ID printed out. Again, these values come from the hashes we added to the index:

root@da-misc ~ # perl search.pl body
Title: This is a title - ID: 1

root@da-misc ~ # perl search.pl article
Title: Special article - ID: 2

root@da-misc ~ # perl search.pl "1 or 2"
Title: This is a title - ID: 1
Title: Special article - ID: 2
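
Lucy::Simple can also run searches itself, via its search() and next() methods, so a separate IndexSearcher is not strictly required. The following is a minimal sketch of that approach, assuming a throwaway index built from the same two documents; the default term "content" and the page size of ten are illustrative choices only.

```perl
#!/usr/bin/perl

use strict;
use warnings;

use File::Temp qw( tempdir );
use Lucy::Simple;

# Build a small throwaway index holding the two example documents.
my $index = tempdir( CLEANUP => 1 );
my $lucy  = Lucy::Simple->new( path => $index, language => 'en' );
$lucy->add_doc( { title => "This is a title", body => "Body content", id => 1 } );
$lucy->add_doc( { title => "Special article", body => "My content",   id => 2 } );

# Term from the command line, falling back to a default for demonstration.
my $term = $ARGV[0] // 'content';

# search() commits any pending additions, runs the query, and returns the
# total number of matches; offset and num_wanted select one page of results.
my $total = $lucy->search(
    query      => $term,
    offset     => 0,
    num_wanted => 10,
);
print "$total document(s) match '$term'\n";

# next() walks through the hits in the requested page.
my @titles;
while ( my $hit = $lucy->next ) {
    push @titles, $hit->{title};
    print "Title: $hit->{title} - ID: $hit->{id}\n";
}
```

Raising offset by num_wanted for each page is one simple way to present paged results.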

The actual search interface on this site is a little more involved than the examples here, because it needs to cope with failure and uses templates for layout. But there isn't much more to it than you'd expect.
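
As a rough idea of what coping with failure might look like, the sketch below wraps the search in eval so that any exception raised while opening or querying the index produces a warning and an empty result rather than a crash. The safe_search helper is hypothetical and is not the code used on this site; it is only meant to show the general shape.

```perl
#!/usr/bin/perl

use strict;
use warnings;

use Lucy::Search::IndexSearcher;

# Hypothetical helper: returns a list of { id, title } hashes, or an
# empty list if the index could not be opened or searched.
sub safe_search {
    my ( $index, $term ) = @_;

    my @results;
    my $ok = eval {
        my $searcher = Lucy::Search::IndexSearcher->new( index => $index );
        my $hits = $searcher->hits( query => $term, num_wanted => 10 );
        while ( my $hit = $hits->next ) {
            push @results, { id => $hit->{id}, title => $hit->{title} };
        }
        1;    # reached only if nothing above died
    };
    warn "Search failed: $@" unless $ok;

    return @results;
}

my @found = safe_search( '/tmp/index', $ARGV[0] // 'article' );
print "Title: $_->{title} - ID: $_->{id}\n" for @found;
print "No results.\n" unless @found;
```

A web front end could call a helper like this and render a "no results" template whenever the list comes back empty.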

Because this site now runs on a cluster, the indexer runs on each node independently, and whichever node receives a search request simply uses its own local index.

 

 
