Weblog entry #187 for Steve

Site search implemented
Posted by Steve on Mon 22 Oct 2007 at 22:15

The previously described and much improved site search facility has now been made live.

On the one hand this is good; on the other, it is now an external dependency, so the code behaves differently on the live site and on my test installs at home. Ho hum.

Bug reports welcome. I'll set up the spider to reindex the site on a semi-daily basis...

Comments on this Entry

Posted by mbl (87.96.xx.xx) on Tue 23 Oct 2007 at 01:24
Thanks!
Minor annoyance: the search often returns results ending in "/print". I, for one, would prefer it if you didn't spider those.

BR,
/MBL

Posted by Steve (82.32.xx.xx) on Tue 23 Oct 2007 at 09:11
Well spotted, I will remove those.

Steve
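
For illustration only, the exclusion requested above could look something like the sketch below. This is not the site's actual spider code, which isn't shown in this thread; the function names `is_print_view` and `filter_crawl_queue` and the example paths are hypothetical, and Python is used purely for the sketch.

```python
from urllib.parse import urlsplit


def is_print_view(url: str) -> bool:
    """Return True when the URL's path ends in a "/print" segment."""
    # Only the path is inspected, so a query string or fragment
    # after "/print" does not hide the print view.
    path = urlsplit(url).path.rstrip("/")
    return path.endswith("/print")


def filter_crawl_queue(urls):
    """Drop print views from a list of URLs before they are indexed."""
    return [u for u in urls if not is_print_view(u)]


if __name__ == "__main__":
    # Hypothetical paths, used only to exercise the filter.
    queue = [
        "http://www.debian-administration.org/articles/1",
        "http://www.debian-administration.org/articles/1/print",
    ]
    for url in filter_crawl_queue(queue):
        print(url)
```

Matching on the whole trailing segment, rather than on the substring "print", avoids accidentally dropping pages whose names merely contain that word.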

Posted by dkg (67.101.xx.xx) on Tue 23 Oct 2007 at 17:08
Thanks, very nice, Steve!

The links are all full-URI links, though, which means that when you access the site via https and click on a result, you get taken back to the non-TLS version of the site. Could they be made host-relative links (i.e. absolute paths, without the protocol and hostname)?

Also, it would be nice to see them integrated with the rest of the site trimming (sidebars, etc). That way, if the search results don't show what you want, you can still get to the other site goodies.

Posted by Steve (80.68.xx.xx) on Tue 23 Oct 2007 at 17:14
I hadn't noticed the http vs. https thing - but I guess there isn't a neat solution to that, except to have the 'force SSL' checkbox on and use the advanced login.

(That is because this is a spider-based program, and it spiders only HTTP. Spidering both wouldn't solve the problem, because the results would then be an arbitrary mix of http and https links.)

As for site integration, I'm going to work on improving that. The big issue is that the search results page is static, so I can't easily inject the dynamic sidebars, but I should be able to add the header at least.

Steve

Posted by dkg (216.254.xx.xx) on Tue 23 Oct 2007 at 23:30
Could you add a page-generation-time filter that strips out any leading "http://www.debian-administration.org" from the links? That should just result in a clean protocol- and hostname-independent static output file.
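
A minimal sketch of such a filter, assuming the results page is available as an HTML string at generation time. The function name `strip_site_prefix` is hypothetical, only `href` attributes are handled, and Python is used just for illustration rather than being the language the site runs on.

```python
import re

# The prefix named in the comment above.
SITE_PREFIX = "http://www.debian-administration.org"

# Match href="<prefix>" or href="<prefix>/path", with either quote style.
# The closing quote is required, so a look-alike host such as
# "www.debian-administration.org.example" is left untouched.
_LINK_RE = re.compile(
    r'(href\s*=\s*)(["\'])' + re.escape(SITE_PREFIX) + r'(/[^"\']*)?\2',
    re.IGNORECASE,
)


def strip_site_prefix(html: str) -> str:
    """Rewrite links to the site so they carry no protocol or hostname."""

    def repl(match):
        # A bare link to the site root becomes "/" rather than an empty href.
        path = match.group(3) or "/"
        quote = match.group(2)
        return f"{match.group(1)}{quote}{path}{quote}"

    return _LINK_RE.sub(repl, html)


if __name__ == "__main__":
    # Hypothetical article path, used only as sample input.
    page = '<a href="http://www.debian-administration.org/articles/1">hit</a>'
    print(strip_site_prefix(page))
```

Links to other hosts pass through unchanged, so the output stays valid whether the page is served over http or https.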

Posted by Steve (82.32.xx.xx) on Tue 23 Oct 2007 at 23:32
Not too easily. I'd have to have the main CGI script which powers the site invoke the search script as another CGI, and then marshal stuff back and forth.

For marginal gain I'm not sure the effort is worthwhile.

Steve

Posted by dkg (216.254.xx.xx) on Wed 24 Oct 2007 at 18:16
Just noticed another issue: when I do a search for PS1 (looking for stuff about bash prompts), the majority of the links are titled "http://www.debian-administration.org/rec..." even though they're clearly different links. It looks like these are "recent comment" feeds that are being generated by the site and indexed by this system. Maybe the spider shouldn't index pages that are Content-Type application/rss+xml?
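
Roughly, such a check might look like the sketch below. It assumes the spider can see each response's Content-Type header before indexing; the names `FEED_TYPES` and `is_feed_content_type` are hypothetical, and including Atom alongside RSS is an extra assumption not raised in this thread.

```python
# Media types treated as feeds. Only application/rss+xml comes from the
# comment above; application/atom+xml is an added assumption.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


def is_feed_content_type(header_value: str) -> bool:
    """Return True when a Content-Type header value names a feed format."""
    # Drop parameters such as "; charset=utf-8" and normalise case,
    # since media types are case-insensitive.
    media_type = header_value.split(";", 1)[0].strip().lower()
    return media_type in FEED_TYPES


if __name__ == "__main__":
    for value in ("application/rss+xml; charset=utf-8", "text/html"):
        action = "skip" if is_feed_content_type(value) else "index"
        print(f"{value}: {action}")
```

Checking the header rather than the URL means feeds are skipped wherever they live on the site.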

Posted by Steve (82.32.xx.xx) on Wed 24 Oct 2007 at 18:43
Good catch. I'll update the spider.

Steve
