/forummsg/99 got me thinking: how can we fingerprint spam effectively? Just md5summing the whole message won't do, since many spams include a "randomness" factor. So here is a way I think we could fingerprint spam accurately.
First, we take the subject, remove all "Re:", "Fwd:", etc. tags, and MD5 what's left. This becomes the first print.
Then, we take the spam sample, remove all formatting (tags, whitespace, everything except a stream of numbers and text), and md5sum the first 250 chars. This becomes FingerPrint2.
Next, we take all URLs in the email and strip anything that is obviously junk, like runs of extra slashes (/////) after a URL and ?parameters, so we end up with a clean list of URLs. We then remove all duplicates, md5sum each URL left, and make those FingerPrint3, 4, etc.
Next, we take the "from" email address domain (e.g. hotmail.com), md5 it, and make it a print.
Finally, we take the server routing information and make each server name the mail passed through a print.
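The steps so far could be sketched roughly like this in Python. This is only an illustration under some assumptions of mine, not a finished design: the function names are made up, the message is assumed to be single-part text, the set of subject tags is just "Re:", "Fw:" and "Fwd:", server names are pulled from the "from"/"by" parts of Received headers with a simple regex, and server names are hashed like the other prints purely for uniformity.

```python
import hashlib
import re
from email import message_from_string
from email.utils import parseaddr
from urllib.parse import urlsplit

# Hypothetical patterns; real mail would need broader handling.
SUBJECT_TAGS = re.compile(r"^\s*(?:(?:re|fwd?)\s*:\s*)+", re.IGNORECASE)
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)
HOP_PATTERN = re.compile(r"\b(?:from|by)\s+([A-Za-z0-9.-]+)")


def md5_hex(text):
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def subject_print(subject):
    # Drop leading "Re:", "Fwd:" style tags, then hash what is left.
    return md5_hex(SUBJECT_TAGS.sub("", subject).strip())


def body_print(body):
    # Strip tags, keep only letters and digits, hash the first 250 chars.
    no_tags = re.sub(r"<[^>]*>", "", body)
    stream = re.sub(r"[^0-9A-Za-z]", "", no_tags)
    return md5_hex(stream[:250])


def url_prints(body):
    # Collapse repeated slashes, drop ?parameters and fragments, de-duplicate.
    cleaned = set()
    for raw in URL_PATTERN.findall(body):
        parts = urlsplit(raw)
        path = re.sub(r"/+", "/", parts.path).rstrip("/")
        cleaned.add(f"{parts.scheme.lower()}://{parts.netloc.lower()}{path}")
    return sorted(md5_hex(url) for url in cleaned)


def from_domain_print(from_header):
    # Hash only the domain part of the "from" address.
    address = parseaddr(from_header)[1]
    return md5_hex(address.rpartition("@")[2].lower())


def server_prints(received_headers):
    # One print per distinct server name seen in the Received headers.
    names = set()
    for header in received_headers:
        names.update(name.lower() for name in HOP_PATTERN.findall(header))
    return sorted(md5_hex(name) for name in names)


def fingerprint(raw_message):
    # Returns a list of (kind, digest) pairs for one message.
    msg = message_from_string(raw_message)
    payload = msg.get_payload()
    body = payload if isinstance(payload, str) else ""  # single-part only
    prints = [("subject", subject_print(msg.get("Subject", ""))),
              ("body", body_print(body))]
    prints += [("url", digest) for digest in url_prints(body)]
    prints.append(("from", from_domain_print(msg.get("From", ""))))
    prints += [("server", digest)
               for digest in server_prints(msg.get_all("Received", []))]
    return prints
```

Keeping the kind next to each digest means a subject hash can never accidentally match a URL hash.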
Now we end up with a list of "prints" for the spam message. We can print every message this way and then check for matches of 65% and greater (not sure how strict we want to be). So if two messages each have 7 prints, and 6 of those match (about 86%), we treat them as a single message chain.
We could also give prints different importance. So, subject has 1, body has 2, URLs have 2, from addresses 3, and servers 4, so if two messages have the same subject, they'll be paired with a higher percentage.
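Here is a minimal sketch of the matching step. Several things in it are my assumptions: prints are (kind, digest) pairs, the score is the weight of the shared prints divided by the total weight of whichever message has more, and the numbers 1/2/2/3/4 are treated as weights (they could just as well be read as a ranking, in which case the table would need flipping). `match_score` and `same_chain` are hypothetical names.

```python
# Weights taken from the numbers in the post, read here as "bigger = counts more".
WEIGHTS = {"subject": 1, "body": 2, "url": 2, "from": 3, "server": 4}


def match_score(prints_a, prints_b, weights=None):
    """Fraction of (weighted) prints two messages share, from 0.0 to 1.0."""
    set_a, set_b = set(prints_a), set(prints_b)

    def total(prints):
        # Every print counts as 1 unless a weight table is supplied.
        return sum(1 if weights is None else weights[kind] for kind, _ in prints)

    larger = max(total(set_a), total(set_b))
    if larger == 0:
        return 0.0
    return total(set_a & set_b) / larger


def same_chain(prints_a, prints_b, threshold=0.65, weights=None):
    # Two messages belong to one chain when the score reaches the threshold.
    return match_score(prints_a, prints_b, weights) >= threshold


msg_a = [("subject", "s1"), ("body", "b"), ("url", "u1"), ("url", "u2"),
         ("from", "f"), ("server", "m1"), ("server", "m2")]
msg_b = [("subject", "s2")] + msg_a[1:]  # only the subject differs

print(match_score(msg_a, msg_b))           # 6 of 7 prints match
print(match_score(msg_a, msg_b, WEIGHTS))  # 17 of 18 weight units match
```

Dividing by the larger message's total is one way to stop a message with a single print from scoring 100% against a message with many.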
Robin Monks
RoBiN {At} GmKiNg [dOt] OrG
Robin Monks wrote: Next, we
I'm not comfortable with using the "from" field. Most spam "from" addresses are fake and the ones that are real are usually spoofed. By considering the "from" field we could easily target or block emails from innocent people or websites. Even if we limit our hash to the domain in the from field we could still be targeting an innocent domain that a spammer chooses to spoof.
Re: fingerprinting images
I agree with your approach, however we will need to define a higher
logic / pattern search in mails.
The other day, I observed that I had two very similar-looking spams:
Both had different senders, different routes and mail text (garbage
text) but the common thing between them was an image. This image was
attached with the message and had exactly the same content (hoax stock
alert msg).
This is just my 2 cents, obviously we need more input and suggestions.
Cheers!
Jug.
We could also md5sum images. To make that more robust, we could remove all whitespace (white/black) margins from the image, keep only the first frame if it's animated, transform the result into a black-and-white bitmap, and md5sum that. Although more computationally intensive, it would ensure that even images with different comments/bits/formats would (in theory) be identical for fingerprinting.
See also "Fingerprinting" images for recognition (google reformatted PDF). Not sure if we can use that, but "fingerprinting" images will probably be necessary in some form.
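A rough sketch of the normalize-then-hash idea, assuming the image has already been decoded elsewhere (first frame only) into rows of grayscale values from 0 to 255. The function names, the threshold of 128, and the rule that the margin colour is whatever the top-left pixel is are all my own choices for illustration.

```python
import hashlib


def to_black_and_white(pixels, threshold=128):
    # 1 = light pixel, 0 = dark pixel.
    return [[1 if value >= threshold else 0 for value in row] for row in pixels]


def trim_margins(bitmap):
    # Drop outer rows and columns that consist only of the margin colour,
    # taken here to be the colour of the top-left pixel.
    if not bitmap or not bitmap[0]:
        return []
    margin = bitmap[0][0]
    rows = [i for i, row in enumerate(bitmap) if any(p != margin for p in row)]
    if not rows:
        return []  # the whole image is one colour
    cols = [j for j in range(len(bitmap[0]))
            if any(row[j] != margin for row in bitmap)]
    return [row[cols[0]:cols[-1] + 1] for row in bitmap[rows[0]:rows[-1] + 1]]


def image_print(pixels, threshold=128):
    trimmed = trim_margins(to_black_and_white(pixels, threshold))
    width = len(trimmed[0]) if trimmed else 0
    # Include the dimensions so differently shaped bitmaps cannot collide.
    header = f"{width}x{len(trimmed)}:".encode("ascii")
    data = bytes(p for row in trimmed for p in row)
    return hashlib.md5(header + data).hexdigest()


# The same two dark pixels with different margins and slightly different greys.
image_a = [[255, 255, 255, 255],
           [255, 0, 10, 255],
           [255, 255, 255, 255]]
image_b = [[250, 250, 250, 250, 250],
           [250, 250, 250, 250, 250],
           [250, 5, 0, 250, 250]]
print(image_print(image_a) == image_print(image_b))
```

Since this is still an exact hash, a single pixel that lands on the other side of the threshold changes the print.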
Robin
Image processing would
Image processing would definitely be CPU intensive.
Possible solutions could be:
@fungus_b: The only thing
@fungus_b: The only thing you need to find/fingerprint is the URL of the web site being spammed. Nothing more.
I have actually seen scores of messages which do not have any URL in them, e.g. false stock alerts, etc.
URL and route fingerprinting would definitely be our primary means of catching spam, but keeping the system open to other possibilities like pure image-based (inline or attached) spam would prove useful.
Nevertheless, this is entirely a group effort. I am very sure that once the project starts to kick in, it will be fun to figure out ways of gathering unique tidbits of a message. Surely we will have a sandbox to play with, but our work will only pay off when our system has the ability to automatically identify the uniqueness of the message no matter how random/obscure the mail is.
So, coming back to our discussion topic, aka 'Image processing', let's just share as many ins and outs as possible and try our best to embed it in the system IF it is viable.
- Scan random blocks
- Routine full image scans (1/n images)
- A simple filesize scan, etc. IMHO all are good options. :)
@Jugernaut
I have actually seen scores of messages which do not have any URL in them, e.g. false stock alerts, etc.
Me too, but I don't see what we can do about them. There's nothing for us to attack - no web site, nothing.
The FTC is interested in stock scams but the FTC already monitors spam for this sort of thing.
when our system has the ability to automatically identify the uniqueness of the message no matter how random/obscure the mail is.
The thing that really worries me is the one-sidedness of effort. We can work for a month on something that a spammer will defeat in ten minutes (randomizing/adding noise is easy to do). Let's not get sidetracked.
Defeating those would require
Defeating those would require OCR features built in, which would be a handy library for spammers to reference when building their own captcha-defeating software.
--
PharmaMaster is a jerk. Regulation of the internet prevents the majority (angry users) from kicking the arse of the minority (millionaire spammers).
Right, don't get sidetracked
I've noticed already that this image spam has speckles and dots in the image -- i.e., there's *already* a randomized component.
Don't worry about it. We have bigger fish to fry, and once we have a basic client that is effective against at least a significant chunk of spam... then we'll have publicity, and momentum, and the number of developers who might like to deal with issues like stock pump image-only spam will leap.
too many possibilities
I think there are too many possibilities in the spammers' hands to confuse any algorithm we create. Processing all incoming reports via a program would be the ideal way, of course, but I wouldn't throw away the idea of manual processing in some complicated cases. I think the goal of the processing algorithm should be:
1) processing as many reports as possible
2) recognizing reports which cannot be processed and forwarding them for manual analysis
Fingerprinting images....
I think there are too many possibilities in the spammers' hands to confuse any algorithm we create.
Yep. Not only that, but it takes us a month to code something and it takes the spammer ten minutes to add a few random dots to an image. It's a losing battle.
The only thing you need to find/fingerprint is the URL of the web site being spammed. Nothing more.
Re: Image processing would
Yes, just scanning the first 50x50-100x100 pixels would suffice. Also, sectioning this out to clients would be good, so long as there was a way to occasionally check whether the client was being truthful, perhaps by checking 1 in 25 images for verification.
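As a sketch of what that might look like, assuming images arrive as rows of grayscale values from 0 to 255 and clients send back one print per image. The names `block_print` and `audit_client`, and the idea of the server recomputing a random 1-in-25 sample, are hypothetical choices for illustration.

```python
import hashlib
import random


def block_print(pixels, size=50):
    # Hash only the top-left size x size block of the image.
    block = [row[:size] for row in pixels[:size]]
    width = len(block[0]) if block else 0
    header = f"{width}x{len(block)}:".encode("ascii")
    return hashlib.md5(header + bytes(v for row in block for v in row)).hexdigest()


def audit_client(reports, pixels_by_id, rng, one_in=25):
    """Recompute roughly one in `one_in` client prints on the server.

    Returns the ids of sampled images whose client print does not match.
    """
    mismatches = []
    for image_id, client_print in reports.items():
        if rng.randrange(one_in) != 0:
            continue  # not sampled this time
        if block_print(pixels_by_id[image_id]) != client_print:
            mismatches.append(image_id)
    return mismatches


image = [[x % 256 for x in range(60)] for _ in range(60)]
reports = {"img1": block_print(image), "img2": "not-a-real-print"}
images = {"img1": image, "img2": image}

# one_in=1 checks every report, which makes the example deterministic.
print(audit_client(reports, images, random.Random(0), one_in=1))
```

Because the client cannot tell which images will be rechecked, even a small sampling rate gives it a reason to report honestly.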
Robin
Original "Blue" Method?
How did the original Blue Frog accomplish fingerprinting? Did anyone from Blue Security share how the server component of their tool worked, or was it just the client-side that was free software?
Just the client was open
Just the client side was open source (to allay concerns of abuse). The server-side processing was very proprietary, and (I believe) was constantly evolving.