How do you fight image-spam?
Posted by Anonymous on Sat 14 Oct 2006 at 19:25
Over the past few months there has been a dramatic rise in a new type of spam mailing, which consists of semi-random words and a real message embedded inside an image. How do you deal with this?
There is the gocr package, available in Debian Sarge and other releases, which attempts to perform OCR, but this process is very fragile.
Although fragile and fairly resource-intensive, OCR has been made available as a plugin to complex anti-spam solutions such as SpamAssassin. The Fuzzy OCR plugin appears to be the dominant solution right now.
But for those of us not using SpamAssassin, which solutions exist, and work?
How do you fight this problem?
Short of using image dimensions, or filtering all mail with an attachment, is there a simple solution?
[ Send Message | View Steve's Scratchpad | View Weblogs ]
I've been experimenting with gocr and ocrad for a day or two, with little success.
Both tools will only process "pnm" files rather than the .GIF or .JPG files I've been receiving. Converting many of the images I've got to that format just fails; however, when the conversion succeeds the OCR generally does a good job.
I'm assuming that the conversion fails because of images containing multiple images, or other perversities. Still, I've not explored trying to automate this with procmail, as I'm not sure how to go about extracting the attachments and working on them.
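For the attachment-extraction step, here is a minimal sketch using only Python's standard email package. It assumes the raw message is handed to the script as bytes (for instance piped in by a procmail recipe); the function names and the sample message are made up for illustration, and the conversion to PNM and the OCR itself are left out.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def extract_images(raw_message: bytes) -> list:
    """Return (filename, content_type, payload) for every image/* MIME part."""
    msg = message_from_bytes(raw_message, policy=policy.default)
    images = []
    for part in msg.walk():
        if part.get_content_maintype() == "image":
            name = part.get_filename() or "unnamed"
            # decode=True undoes the base64 transfer encoding.
            payload = part.get_payload(decode=True)
            images.append((name, part.get_content_type(), payload))
    return images


def build_sample() -> bytes:
    """Build a tiny sample message with one fake GIF attached (test data only)."""
    msg = EmailMessage()
    msg["Subject"] = "sample"
    msg.set_content("semi-random words")
    msg.add_attachment(b"GIF89a" + b"\x00" * 7, maintype="image",
                       subtype="gif", filename="stock.gif")
    return msg.as_bytes()


if __name__ == "__main__":
    for name, ctype, payload in extract_images(build_sample()):
        print(name, ctype, len(payload))
```

Each payload could then be written to a temporary file and passed to whatever converter and OCR tool you are experimenting with.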
[ Parent | Reply to this comment ]
I wouldn't do a full OCR on the image, as it is expensive. Do some mathematical tests on it which don't cost too much CPU time. If the recipient is eventually supposed to be able to read a text message in it, that surely shows up somehow...
If you don't need a general-purpose solution, I think it's easier to build rules on who is allowed to send images, and to what extent, and then do some simple tests on these.
cb
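One example of such a cheap test, as a rough sketch: reading the canvas size straight out of a GIF header, without decoding the image. The header layout (a 6-byte signature followed by little-endian 16-bit width and height) is part of the GIF format; the threshold and the function names are assumptions picked purely for illustration and would need tuning against real mail.

```python
import struct


def gif_dimensions(data: bytes):
    """Return (width, height) from a GIF header, or None if data is not a GIF."""
    if len(data) < 10 or data[:6] not in (b"GIF87a", b"GIF89a"):
        return None
    # Logical screen width and height: two little-endian 16-bit integers.
    width, height = struct.unpack("<HH", data[6:10])
    return width, height


def looks_suspicious(data: bytes, min_pixels: int = 100_000) -> bool:
    """Hypothetical rule: flag GIFs whose canvas has at least min_pixels pixels."""
    dims = gif_dimensions(data)
    return dims is not None and dims[0] * dims[1] >= min_pixels
```

A test like this only reads ten bytes, so it could be run on every image before deciding whether anything more expensive is worth doing.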
[ Parent | Reply to this comment ]
-----------------------------------------------------------------
Mon, 16 Oct 2006 17:05:55 IST:19356: SA: REPORT hits = 11.3/4.0
0.9 HTML_IMAGE_ONLY_24 BODY: HTML: images with 2000-2400 bytes of words
0.0 HTML_MESSAGE BODY: HTML included in message
0.9 HTML_10_20 BODY: Message is 10% to 20% HTML
0.0 MIME_HTML_ONLY BODY: Message only has text/html MIME parts
1.0 DC_IMG_HTML_RATIO RAW: Low rawbody to pixel area ratio
0.8 SARE_GIF_ATTACH FULL: Email has a inline gif
3.0 DC_GIF_UNO_LARGO Message contains a single large inline gif
1.7 SARE_GIF_STOX Inline Gif with little HTML
3.0 DC_IMAGE_SPAM_HTML Possible Image-only spam
-----------------------------------------------------------------
Hardik Dalwadi, National Innovation Foundation
[ Parent | Reply to this comment ]
Given that some of our customers are in porn, pharmaceuticals, or medicine, I think one can focus overly much on content.
Content does not make the email "bulk" or "unsolicited", which is what makes it spam. That said, distributed-checksum-type ideas are a form of content inspection that can identify "bulk".
Content filters may satisfy people by stopping content that they don't want, but that isn't necessarily the same thing as stopping bulk unsolicited email, or spam.
For all but the smallest email servers, content filters will likely have to be tuned to end users' needs, or unacceptable rates of false positives are likely to occur. This is as likely to apply to images in emails as anything else.
[ Parent | Reply to this comment ]
It leads to huge numbers of false positives and, most importantly, it's only a temporary solution. Spammers will keep inventing new content tricks to bypass the spam filter ruleset. A never-ending battle.
Wouldn't it be a great idea to set up an RBL system based on the checksum of a message?
Suppose a generic mailserver makes a CRC of an incoming e-mail (it could be of the message body or of the sender/subject). It then consults a global server to see how often this checksum is present in its database. If it's a new CRC the global server would store it as a new CRC, otherwise it would add +1 to the total count of this CRC in the database.
When the mailserver receives the CRC count from the global server it could reject the e-mail in question based on a local "reject_count" variable.
If enough mailservers join in, one could keep track of bulk messages sent out in the world.
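The counting scheme proposed above can be sketched roughly as follows. This is an illustration only: the "global server" is an in-memory stand-in, SHA-256 is swapped in for the CRC mentioned above, the whitespace and case normalisation is an added assumption, and every name except "reject_count" is invented for the example.

```python
import hashlib
from collections import Counter


class ChecksumRegistry:
    """Stand-in for the proposed global server: counts sightings per checksum."""

    def __init__(self):
        self.counts = Counter()

    def report(self, checksum: str) -> int:
        """Record one sighting and return the total count so far."""
        self.counts[checksum] += 1
        return self.counts[checksum]


def body_checksum(body: str) -> str:
    """Hash the body after collapsing whitespace and case (an assumption)."""
    normalized = " ".join(body.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def should_reject(registry: ChecksumRegistry, body: str, reject_count: int = 3) -> bool:
    """Reject once the same checksum has been reported reject_count times."""
    return registry.report(body_checksum(body)) >= reject_count
```

With reject_count set to 3, the first two sightings of a message are accepted and the third is rejected, which also shows why solicited bulk mail such as mailing lists would need an exemption.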
[ Parent | Reply to this comment ]
http://www.rhyolite.com/anti-spam/dcc/
But since emails are rarely exactly identical, it risks false positives: you have to make some assumptions about what counts as "the same message", or risk easy defeat, which is another difficult line to walk. Still, it stops the simple ones I'm sure, and raises the barrier to spammers, so those that fancy the idea should go for it.
One must also be careful with email lists, and other solicited bulk email, depending on how you implement such systems.
I'm sure more distributed antispam systems will evolve, since it is a natural way to spot bulk spamming. Indeed some RBLs already use the queries they get to identify new SMTP senders, and thus potential candidates for inclusion, which is a kind of dynamic antispam system.
[ Parent | Reply to this comment ]
It does its job at the MTA level (working with the e-mail envelope headers).
For general information, here are some good starting points:
http://en.wikipedia.org/wiki/Greylisting
http://www.greylisting.org/
http://greylisting.org/implementations/postfix.shtml (how can it be done with postfix)
--
Dániel Vásárhelyi
[ Parent | Reply to this comment ]
Or even the previous coverage here of Debian greylisting.
[ Parent | Reply to this comment ]
Although there is certainly room for abuse of blacklists, I think RBL checks have two strong advantages in the long run. They are not content-based and therefore cannot be circumvented by using images and similar tricks, reducing the amount of catch-up you have to play with spammers. Additionally, the DNS lookup required for an RBL check is a very cheap operation. Currently, your server might still have the spare cycles to do an OCR scan of every mail image, but when you're doing substantially more work to detect spam than it takes to generate it, a denial-of-service attack becomes possible by simply increasing the amount of e-mail that's expensive to check.
To employ RBL checks with Postfix 2.x, look into the reject_rbl_client directive. This will reject the delivery of e-mails from blacklisted servers (it won't just delete e-mail - a proper sender will still receive a bounce if his message is not delivered). If you want more flexibility, you can tag messages using rblcheck and filter them based on your own procmail rules:
apt-get install rblcheck procmail
Unfortunately this last process could use more documentation and examples than what is available here.
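To show why the RBL lookup is so cheap, here is a rough sketch of the DNS query behind it for IPv4: the client's address is reversed octet by octet, the blacklist zone is appended, and the resulting name is resolved; if it resolves, the address is listed. The zone "dnsbl.example.org" and the function names are placeholders, not a real blacklist or an existing tool's API.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name for an IPv4 blacklist lookup: reversed octets + zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip: str, zone: str, resolve=socket.gethostbyname) -> bool:
    """Return True if the query name resolves, i.e. the zone lists this IP."""
    try:
        resolve(dnsbl_query_name(ip, zone))
        return True
    except OSError:
        # Resolution failures (such as NXDOMAIN) are treated as "not listed".
        return False


if __name__ == "__main__":
    print(dnsbl_query_name("192.0.2.99", "dnsbl.example.org"))
```

The resolve parameter is there only so the logic can be exercised without touching the network; a real deployment would let Postfix or rblcheck do the lookup.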
[ Parent | Reply to this comment ]
A better idea is to use the RBL for "routing" messages into a greylist (or some other nasty non-blocking spam trap).
asd
--
Dániel Vásárhelyi
[ Parent | Reply to this comment ]
asd
--
Dániel Vásárhelyi
[ Parent | Reply to this comment ]
Related parts:
smtpd_recipient_restrictions =
    reject_non_fqdn_recipient
    reject_unknown_recipient_domain
    permit_mynetworks
    reject_unauth_destination
    check_recipient_access hash:/etc/postfix/maps/recipient_access
    check_sender_access hash:/etc/postfix/maps/sender_access
    check_client_access hash:/etc/postfix/maps/client_access
    check_policy_service inet:127.0.0.1:10000
    check_policy_service inet:127.0.0.1:2525
The last two lines are the "important factor":
the second-to-last does the SPF checking (as described at http://spf.pobox.com/, package name: whitelister)
the last does the greylisting (as described at http://www.greylisting.org, package name: postfix-gld)
The possible scenarios:
- the "first" rule (SPF) accepts the mail (reply: OK): the "second" rule (greylisting) isn't called at all, and the mail goes straight in
- the first rule gives no verdict (reply: DUNNO): the "second" rule is activated, so the mail is put on the greylist, and once a predefined number of seconds has passed, a later retry will be accepted
- the first rule rejects the mail (550): in this case SPF showed that the sender is not permitted to send mail from the actual host, so the mail is rejected and the second rule isn't called at all.
check_policy_service is described at:
http://www.postfix.org/SMTPD_POLICY_README.html
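As a rough illustration of what sits behind a check_policy_service line, here is a sketch of the greylisting decision. Postfix sends the service a block of name=value lines ending with an empty line and expects an "action=..." reply followed by an empty line, as the README above describes. The 300-second delay, the class and function names, and the in-memory table are assumptions for this example; this is not how postfix-gld itself is implemented, and the socket handling is left out.

```python
import time


def parse_policy_request(text: str) -> dict:
    """Parse the name=value lines Postfix sends to a policy service."""
    attrs = {}
    for line in text.splitlines():
        if "=" in line:
            name, value = line.split("=", 1)
            attrs[name] = value
    return attrs


class Greylist:
    """Defer unknown (client, sender, recipient) triplets until a delay has passed."""

    def __init__(self, delay_seconds: float = 300, clock=time.time):
        self.delay = delay_seconds
        self.clock = clock
        self.first_seen = {}  # triplet -> time of first delivery attempt

    def decide(self, attrs: dict) -> str:
        triplet = (attrs.get("client_address"),
                   attrs.get("sender"),
                   attrs.get("recipient"))
        now = self.clock()
        first = self.first_seen.setdefault(triplet, now)
        if now - first < self.delay:
            # Temporary failure: a legitimate MTA will retry later.
            return "action=defer_if_permit Greylisted, please try again later\n\n"
        # No objection from the greylist: hand the decision back to Postfix.
        return "action=dunno\n\n"
```

The clock is injectable only so the behaviour can be checked without waiting; a real service would also persist the table and expire old entries.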
I recommend that everyone use SPF; it can really help people fighting spam. Of course, the effectiveness of SPF depends heavily on how many sysadmins add that single TXT record to their domain, but if they do (like the biggest free mail providers: gmail, yahoo and even hotmail), the spam senders' ability to fake e-mail addresses shrinks significantly.
It's only one TXT record in your domain, and people who are using SPF will then accept mail from your domain only if it comes from your SMTP server...
asd
--
Dániel Vásárhelyi
[ Parent | Reply to this comment ]
Have the spammers abandoned SPF as well now?
[ Parent | Reply to this comment ]
That may be the case, but if a domain is hijacked for forged spam sending, adding an SPF record helps people who check SPF block the spam very quickly.
The next stage of SPF is a trust metric. Do you know if that's set up?
[ Parent | Reply to this comment ]
Catches 100% of them before the initial SMTP transaction finishes. And if there are any false positives the sender gets an email saying "Your email was rejected because you embedded an attached image in the body." or whatever you set it to. And if it was a spammer no bounce goes to the forged From: address, because the sending MTA delivers the bounce message. I'm extremely happy with it.
http://www.postfix.org/header_checks.5.html
[ Parent | Reply to this comment ]
Sounds pretty hard to me...
[ Parent | Reply to this comment ]