Weblog entry #60 for Steve
I've been flirting with different approaches to dealing with comments/weblogs. "Scoring" if you will.
There are several approaches that can be emulated: Slashdot (so-so), OSNews (broken), Advogato (person-rating), and Bayesian (non-human).
So far I've been writing simple code to see if I can decide which approach works best in practise. I like the idea of Bayesian filtering - but I can't decide how to code it.
So far I've been using the AI::NaiveBayes1 module to good effect. But I cannot make it work on "live" data.
This sample shows training the model on some text:
good text: 'Debian GNU/Linux Rocks'
good text: 'Testing this filter works'
good text: 'Steve Kemp is sleepy'
bad text: 'F**kers'
bad text: 'Linux fanbois!'
bad text: 'Get a life'
bad text: 'you suck!'
bad text: 'Redhat Rules'
Now testing line 'Linux Sucks!'
Good score: 0.279078013208562
Bad score: 0.960268432545589
BAD
So here we train the model with three pieces of "good" text, then five pieces of "bad" text. Afterwards we give a pair of scores to a new input - and if the bad score is the greater of the two we mark the comment as "bad".
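The train-then-score flow described above can be sketched in a few lines. This is a from-scratch multinomial naive Bayes in Python, not the AI::NaiveBayes1 module, so the class name, the tokenizer and the add-one smoothing are all assumptions made for illustration, and the scores it produces will not match the numbers in the sample output.

```python
import math
import re
from collections import Counter


class NaiveBayes:
    """Tiny multinomial naive Bayes over lowercase word tokens (illustrative only)."""

    def __init__(self):
        self.word_counts = {"good": Counter(), "bad": Counter()}
        self.doc_counts = Counter()

    @staticmethod
    def tokens(text):
        # Assumption: words are runs of letters, digits and '*'.
        return re.findall(r"[a-z0-9*]+", text.lower())

    def train(self, label, text):
        self.doc_counts[label] += 1
        self.word_counts[label].update(self.tokens(text))

    def scores(self, text):
        """Return normalised good/bad scores that sum to 1."""
        vocab = set(self.word_counts["good"]) | set(self.word_counts["bad"])
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label, counts in self.word_counts.items():
            total_words = sum(counts.values())
            # Class prior: fraction of training texts with this label.
            log_p = math.log(self.doc_counts[label] / total_docs)
            for word in self.tokens(text):
                # Add-one smoothing so unseen words do not zero the score.
                log_p += math.log((counts[word] + 1) / (total_words + len(vocab)))
            log_scores[label] = log_p
        peak = max(log_scores.values())
        weights = {k: math.exp(v - peak) for k, v in log_scores.items()}
        norm = sum(weights.values())
        return {k: w / norm for k, w in weights.items()}

    def classify(self, text):
        s = self.scores(text)
        return "BAD" if s["bad"] > s["good"] else "GOOD"


model = NaiveBayes()
for line in ["Debian GNU/Linux Rocks", "Testing this filter works",
             "Steve Kemp is sleepy"]:
    model.train("good", line)
for line in ["F**kers", "Linux fanbois!", "Get a life", "you suck!",
             "Redhat Rules"]:
    model.train("bad", line)

print(model.scores("Linux Sucks!"))
print(model.classify("Linux Sucks!"))  # BAD with this training set
```

With so little training text the verdict here is driven mostly by the class prior (five bad texts against three good), which is one reason the question of how to seed the database matters.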
All seems good. But there is a problem. How do I seed the database with "bad" comments?
I could ask people to click on links beneath comments they feel are bad in some way - but that means I'd have to retrain the model and update the scores of all the comments.
It also doesn't avoid people corrupting the model by scoring reasonable comments badly - and bringing down the effectiveness.
I guess I still have a lot to think about - definitely not an easy thing to do. I like Advogato's model, as that is "provably" hard to subvert - however I do not understand the mechanics well enough to implement it, even after trying several times.
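One possible way to blunt both problems, sketched here purely as an assumption and not something I have working: keep only word counts for the "bad" class, and fold a comment into them only once several distinct users have flagged it. Updating counts is cheap, and a single malicious click changes nothing. The `FlagQueue` name, the threshold of three and the whitespace tokenizer are all hypothetical, and this does nothing against a coordinated group of voters.

```python
from collections import Counter, defaultdict

VOTES_NEEDED = 3  # hypothetical threshold; would need tuning on real traffic


class FlagQueue:
    """Collects 'bad' flags and updates word counts once enough users agree."""

    def __init__(self, votes_needed=VOTES_NEEDED):
        self.votes_needed = votes_needed
        self.voters = defaultdict(set)    # comment id -> ids of users who flagged it
        self.bad_word_counts = Counter()  # incremental counts for the "bad" class
        self.trained = set()              # comment ids already folded into the counts

    def flag(self, comment_id, user_id, text):
        """Record a flag; return True if this flag triggered a model update."""
        self.voters[comment_id].add(user_id)  # repeat flags by one user count once
        if comment_id in self.trained:
            return False
        if len(self.voters[comment_id]) < self.votes_needed:
            return False
        # Incremental update: add this comment's words, no full retrain needed.
        self.bad_word_counts.update(text.lower().split())
        self.trained.add(comment_id)
        return True


queue = FlagQueue(votes_needed=2)
queue.flag(1, "alice", "you suck")   # first flag: nothing learned yet
queue.flag(1, "bob", "you suck")     # second distinct user: counts updated
print(queue.bad_word_counts)
```

Scores for existing comments would still have to be recomputed, but that could be done lazily when a comment is displayed, from the current counts, instead of being stored and updated in bulk.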
I guess that University education is good for something after all ;)
Comments on this Entry
I think analysis paralysis is setting in.
Just seed the database with old alt.troll and maybe some spam archives as bad, and some Linux newsgroup text and existing site content as good, and hope your algorithm is robust enough to learn the difference with a small amount of pollution.
Depends on the goals, but do you really want to filter to find the good stuff, or just filter down to draw your attention to the small amount of complete rubbish? I suspect if the latter, then it needn't be too clever.
Viagra, Cialis, Rhyolite, Satellite, Menwith Hill, Elint -- that should get your attention, and that of the kind folks at Echelon ;)
[ Send Message | View Steve's Scratchpad | View Weblogs ]
I've not seen the refutation proof yet, but I'll try to find it tomorrow or so.
You're probably right that removing the relatively small amount of bad content is possible using a simple mechanism, but I'm still not 100% sure what the best approach is.
I think that if it is going to happen, it must be as automatic as possible and have no obvious attacks.