====== Fighting Image SPAM with FuzzyOCR and ImageInfo plugins ======

===== First we need some new packages =====

  # apt-get install lsb-base libice6 libx11-6 xlibs-data libsm6 x11-common
  # apt-get install gifsicle ocrad

You should not have both giflib-bin and libungif-bin installed. Simulate removing giflib-bin:

  # apt-get -s remove giflib-bin

If it's not installed, then you can move on. If it's the only thing that will be removed, then remove it. Note: an etch system may try to remove libungif-bin instead of giflib-bin. There is no need to remove libungif-bin.

  # apt-get remove giflib-bin

===== Continue to install other required programs. =====

  # apt-get install netpbm bzip2 libmldbm-perl libstring-approx-perl libmldbm-sync-perl
  # apt-get install liblog-agent-perl libdbi-perl libdbd-mysql-perl libtie-cache-perl

Download, extract, patch, compile and install libungif (even if you have libungif-bin installed):

  # cd /usr/local/src
  # wget http://internap.dl.sourceforge.net/sourceforge/libungif/libungif-4.1.4.tar.gz
  # tar xzvf libungif-4.1.4.tar.gz
  # cd libungif-4.1.4/util
  # wget http://users.own-hero.net/~decoder/fuzzyocr/giftext-segfault.patch
  # patch giftext.c < giftext-segfault.patch
  # cd ..
  # ./configure --prefix=/usr && make && make install

Download, extract, compile and install gocr:

  # cd /usr/local/src
  # wget http://www-e.uni-magdeburg.de/jschulen/ocr/gocr-0.44.tar.gz
  # tar xzvf gocr-0.44.tar.gz
  # cd gocr-0.44
  # ./configure --with-netpbm=/usr/lib --prefix=/usr && make && make install

At this point we should have all of these programs installed in /usr/bin:

  # which gifsicle
  # which giffix
  # which giftext
  # which gifinter
  # which giftopnm
  # which jpegtopnm
  # which pngtopnm
  # which bmptopnm
  # which tifftopnm
  # which ppmhist
  # which pamfile
  # which ocrad
  # which gocr
  # which pnmnorm
  # which pnminvert
  # which ppmtopgm

===== Install FuzzyOCR 3.5.1 =====

  # cd /usr/local/src
  # wget http://users.own-hero.net/~decoder/fuzzyocr/fuzzyocr-3.5.1-devel.tar.gz
  # tar xzvf fuzzyocr-3.5.1-devel.tar.gz
  # cd FuzzyOcr-3.5.1

If you are using netpbm < 10.34 (Debian uses 10.0-10.1) you need to apply these patches. They disable some features only available in newer versions:

  # wget http://www200.pair.com/mecham/spam/FuzzyOcr-3.5.0-rc1.netpbm_less_than_10.34.patch
  # wget http://www200.pair.com/mecham/spam/FuzzyOcr-3.5.0-rc1.netpbm_less_than_10.34.patch2
  # wget http://www200.pair.com/mecham/spam/FuzzyOcr-3.5.0-rc1.netpbm_less_than_10.34.patch3
  # patch -p0 < FuzzyOcr-3.5.0-rc1.netpbm_less_than_10.34.patch
  # patch -p0 < FuzzyOcr-3.5.0-rc1.netpbm_less_than_10.34.patch2
  # patch -p0 < FuzzyOcr-3.5.0-rc1.netpbm_less_than_10.34.patch3

Our Debian version of Netpbm (10.0) may not contain both of these:

  # which pamtopnm
  # which pamditherbw

Those programs are used together, so if either one is missing we need to disable any scansets that use them and also remove them from the preprocessors file. If you have both of these programs, you do not need to apply these patches.
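The check for pamtopnm and pamditherbw can also be scripted. The following is only a rough sketch: the function names missing_programs and needs_old_netpbm_patches are made up for this example and are not part of FuzzyOcr or Netpbm. It only reports which case applies; it does not download or apply anything.

```shell
#!/bin/sh
# Sketch only: the helper names below are hypothetical, not part of any package.

# Print each named program that cannot be found on PATH, one per line.
missing_programs() {
    for prog in "$@"; do
        command -v "$prog" >/dev/null 2>&1 || echo "$prog"
    done
}

# Succeeds (exit status 0) when at least one of the two Netpbm tools is missing.
needs_old_netpbm_patches() {
    [ -n "$(missing_programs pamtopnm pamditherbw)" ]
}

if needs_old_netpbm_patches; then
    echo "pamtopnm and/or pamditherbw missing: the old-netpbm scanset patches are needed"
else
    echo "pamtopnm and pamditherbw both found: the old-netpbm scanset patches can be skipped"
fi
```

The same missing_programs helper can be given the whole list of programs from the "which" checks earlier to see at a glance if anything failed to install.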
If you are missing either one of them, apply these patches:

  # wget http://www200.pair.com/mecham/spam/gary.3.5.0-rc1.old.netpbm.patch1
  # wget http://www200.pair.com/mecham/spam/gary.3.5.0-rc1.old.netpbm.patch2
  # wget http://www200.pair.com/mecham/spam/gary.3.5.0-rc1.old.netpbm.patch3
  # patch -p0 < gary.3.5.0-rc1.old.netpbm.patch1
  # patch -p0 < gary.3.5.0-rc1.old.netpbm.patch2
  # patch -p0 < gary.3.5.0-rc1.old.netpbm.patch3

===== Copy the files =====

  # cp -r FuzzyOcr /etc/mail/spamassassin
  # cp FuzzyOcr.cf /etc/mail/spamassassin
  # cp FuzzyOcr.pm /etc/mail/spamassassin
  # cp FuzzyOcr.preps /etc/mail/spamassassin
  # cp FuzzyOcr.scansets /etc/mail/spamassassin
  # cp FuzzyOcr.words /etc/mail/spamassassin

Configure FuzzyOcr.cf:

  # vi /etc/mail/spamassassin/FuzzyOcr.cf

Set log level to 2 (only while we test):

  #focr_verbose 3
  focr_verbose 2

Enable logging by uncommenting this:

  focr_logfile /tmp/FuzzyOcr.log

Uncomment these as well:

  focr_timeout 15
  focr_threshold 0.20

Set focr_base_score to 3:

  #focr_base_score 5
  focr_base_score 3

I change focr_add_score from the default of 1 to 0.5:

  #focr_add_score 0.375
  focr_add_score 0.5

I lower focr_corrupt_score:

  #focr_corrupt_score 2.5
  focr_corrupt_score 1.5

I lower focr_corrupt_unfixable_score:

  #focr_corrupt_unfixable_score 5
  focr_corrupt_unfixable_score 2.5

Start the test by linting spamassassin:

  # spamassassin --lint

If you get this error:

  Subroutine FuzzyOcr::O_NONBLOCK redefined at /usr/share/perl/5.8/Exporter.pm line 65.
   at /usr/lib/perl/5.8/POSIX.pm line 19

it appears to be related somehow to Net::Ident. If you run spamd with the --auth-ident option then you need this module and will have to live with the error message. It is harmless in itself, but it may be a problem if you depend on a clean --lint.
If you don't need Net::Ident (you don't have any programs that use it), then I suggest you remove it:

  # apt-get remove libnet-ident-perl

Now we start to test FuzzyOcr:

  # cd /usr/local/src/FuzzyOcr-3.5.1/samples
  # spamassassin -tD < ocr-animated.eml

The output should contain something like this:

  5.0 FUZZY_OCR BODY: Mail contains an image with common spam text inside
      Words found:
      "price" in 1 lines
      "company" in 1 lines
      "alert" in 1 lines
      "news" in 1 lines
      (6 word occurrences found)

If you did not get something similar, check the log for the last error message (if any):

  # cat /tmp/FuzzyOcr.log

For example, on a low-powered machine you may have to increase focr_timeout in /etc/mail/spamassassin/FuzzyOcr.cf.

Continue on to the next test:

  # spamassassin -tD < ocr-gif.eml

The output:

  1.5 FUZZY_OCR_WRONG_CTYPE BODY: Mail contains an image with wrong content-type set
      Image has format "GIF" but content-type is "image/jpeg"
  2.5 FUZZY_OCR_CORRUPT_IMG BODY: Mail contains a corrupted image
      Corrupt image: GIF-LIB error: Image is defective, decoding aborted.
  8.0 FUZZY_OCR BODY: Mail contains an image with common spam text inside
      Words found:
      "target" in 1 lines
      "service" in 1 lines
      "stock" in 2 lines
      "price" in 2 lines
      "company" in 1 lines
      "recommendation" in 1 lines
      (12 word occurrences found)

And the next test:

  # spamassassin -tD < ocr-jpg.eml

The output:

  5.0 FUZZY_OCR BODY: Mail contains an image with common spam text inside
      Words found:
      "levitra" in 1 lines
      "cialis" in 1 lines
      "viagra" in 2 lines
      (6 word occurrences found)

And the next test:

  # spamassassin -tD < ocr-obfuscated.eml

The output:

  3.0 FUZZY_OCR BODY: Mail contains an image with common spam text inside
      Words found:
      "profit" in 1 lines
      "profit" in 1 lines
      (2 word occurrences found)

Next test:

  # spamassassin -tD < ocr-png.eml

The output:

  15 FUZZY_OCR BODY: Mail contains an image with common spam text inside
      Words found:
      "buy" in 1 lines
      "target" in 2 lines
      "service" in 1 lines
      "stock" in 1 lines
      "investor" in 1 lines
      "price" in 3 lines
      "company" in 2 lines
      "trade" in 1 lines
      "software" in 1 lines
      "recommendation" in 1 lines
      "news" in 3 lines
      (25.5 word occurrences found)

And the last one:

  # spamassassin -tD < ocr-wrongext.eml

And the last output:

  1.5 FUZZY_OCR_WRONG_CTYPE BODY: Mail contains an image with wrong content-type set
      Image has format "GIF" but content-type is "image/jpeg"
  1.5 FUZZY_OCR_WRONG_EXTENSION BODY: Mail contains an image with wrong file extension
      Image has format "GIF" but file extension is "jpeg"
  2.5 FUZZY_OCR_CORRUPT_IMG BODY: Mail contains a corrupted image
      Corrupt image: GIF-LIB error: Image is defective, decoding aborted.
  8.0 FUZZY_OCR_KNOWN_HASH BODY: Mail contains an image with known hash
      Words found:
      "target" in 1 lines
      "service" in 1 lines
      "stock" in 2 lines
      "price" in 2 lines
      "company" in 1 lines
      "recommendation" in 1 lines
      (12 word occurrences found)

Restart amavisd-new:

  # /etc/init.d/amavis restart

Give ownership of the log file to the amavis user:

  # chown amavis:amavis /tmp/FuzzyOcr.log

Now send a test message with this image (http://www200.pair.com/mecham/spam/host.gif) as an attachment. After this, check your log:

  # tail -f /tmp/FuzzyOcr.log

There should be a part that looks like this:

  GIF: [192x361] host.gif (3696)
  Found: 1 images
  Found GIF header name="host.gif"
  Image is single non-interlaced...
  Image hashing disabled in configuration, skipping...
  Scanset Order: ocrad(0) ocrad-invert(0) gocr(0) gocr-180(0)
  Scanset "ocrad" found word "cialis" with fuzz of 0.0000
  line: "viagrd cialis letra "
  Scanset "ocrad" found word "viagra" with fuzz of 0.1667
  line: "viagrd cialis letra "
  Scanset "ocrad" found word "price" with fuzz of 0.0000
  line: "iowest onlie price garanteedi"
  Scanset "ocrad" found word "profit" with fuzz of 0.1667
  line: "w guarantee oo topqalityofthe prodit we oi"
  Scanset "ocrad" found word "prescription" with fuzz of 0.0000
  line: "uick here wo prescription re uiredi "
  Scanset "ocrad" generates enough hits (6), skipping further scansets...
  Message is spam, score = 6.500
  Words found:
  "drugs" in 1 lines
  "cialis" in 1 lines
  "viagra" in 1 lines
  "price" in 1 lines
  "profit" in 1 lines
  "prescription" in 1 lines
  (9 word occurrences found)

Edit FuzzyOcr.cf, turn off verbose logging and set focr_autodisable_score back to a suitable level:

  # vi /etc/mail/spamassassin/FuzzyOcr.cf

Set focr_verbose to 0:

  focr_verbose 0

and set focr_autodisable_score to the same value as my $sa_kill_level_deflt in amavisd.conf:

  focr_autodisable_score 8

Reload amavisd-new so the changes take effect:

  # amavisd-new reload

And keep an eye on the mail.log for a while:

  # tail -f /var/log/mail.log

Done.

ToDo: Logrotate
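After a later configuration change it can be handy to re-run all of the sample messages from the test section in one go. The following is a minimal sketch, not something shipped with FuzzyOcr: the names fuzzyocr_hits, run_samples and SAMPLES_DIR are made up for this example, and it assumes the samples directory is where it was unpacked above. It uses -t without -D so that only the report is printed, and does nothing if spamassassin is not on the PATH.

```shell
#!/bin/sh
# Sketch only: fuzzyocr_hits, run_samples and SAMPLES_DIR are hypothetical names.

# Keep only the report lines that mention a FuzzyOcr rule (reads stdin).
fuzzyocr_hits() {
    grep 'FUZZY_OCR' || true
}

# Assumed location of the sample messages unpacked earlier in this guide.
SAMPLES_DIR=${SAMPLES_DIR:-/usr/local/src/FuzzyOcr-3.5.1/samples}

# Run every ocr-*.eml sample through SpamAssassin in test mode.
run_samples() {
    for eml in "$SAMPLES_DIR"/ocr-*.eml; do
        [ -f "$eml" ] || continue
        echo "== $eml"
        spamassassin -t < "$eml" 2>/dev/null | fuzzyocr_hits
    done
}

if command -v spamassassin >/dev/null 2>&1; then
    run_samples
fi
```

Only the rule lines are shown, not the "Words found" details, so for a failing sample it is still worth running the single command with -tD and reading /tmp/FuzzyOcr.log.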