SameMail is a tool for identifying mail that other people have also received, which is very likely spam. It consists of a server program and a client program.
We aim to propose a protocol for handling mail contents securely while making spam easy to detect. We provide an open source (GPL) implementation of the server and client programs, but other implementations of the protocol are possible.
definition of spam
There are various definitions of spam. Here we define spam as nearly identical electronic mail sent to an unspecified number of people.
This definition makes it possible to identify spam simply by comparing a mail against mails received by others, without text analysis or blacklists. The exception is mail from mailing lists, which also reaches many recipients in nearly identical form.
mechanism of SameMail
SameMail is similar to a collaborative spam filter, but it does not depend on human judgment. All actions are performed automatically.
When a SameMail client receives an e-mail, it asks the SameMail server whether other clients have received the same e-mail. To protect privacy, the client should not send the e-mail itself to the server. Instead, it sends a hash of the e-mail (computed with a hash function such as MD5).
The SameMail server replies by comparing the query with the hashes it has stored. It then stores the hashed e-mail from the request for use in future queries.
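The query-then-store behaviour described above can be illustrated with a minimal in-memory sketch. This is not the actual SameMail server (which is a C and Perl CGI); the class name `FingerprintStore`, the use of `difflib.SequenceMatcher` as the diff, and the 0.8 similarity threshold are all assumptions made for illustration. A fingerprint here is a list of short per-line hashes.

```python
from difflib import SequenceMatcher


class FingerprintStore:
    """Hypothetical in-memory stand-in for the server-side store."""

    def __init__(self, threshold=0.8):
        # threshold is an assumed similarity cut-off, not a SameMail value
        self.threshold = threshold
        self.stored = []

    def query(self, fingerprint):
        """Count stored fingerprints similar to this one, then store it."""
        matches = 0
        for old in self.stored:
            # Compare the two sequences of per-line hashes, like a text diff.
            matcher = SequenceMatcher(None, old, fingerprint, autojunk=False)
            if matcher.ratio() >= self.threshold:
                matches += 1
        # Keep the query itself so that later queries can match it.
        self.stored.append(list(fingerprint))
        return matches
```

The first client to report a given mail gets a count of zero; later clients reporting a nearly identical mail get a positive count, even if a few lines (for example a user name or a random string) differ.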
A SameMail client may be a mailer, a POP proxy, an MTA, and so on. When a client finds that the same e-mail was received elsewhere, the next action is up to the client: it may mark the e-mail, delete it, or return it to the sender. The client we implemented is a POP3 proxy written in Java, and it rewrites the 'From' header.
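As one illustration of the "mark" action, the sketch below rewrites the From header when the server reports a match. The actual Java client's header format is not described here, so the function name `mark_from_header` and the `[SameMail xN]` prefix are hypothetical choices, not the real output format.

```python
from email import message_from_string


def mark_from_header(raw_mail, match_count, tag="SameMail"):
    """Return the mail text, prefixing From when matches were reported.

    The "[tag xN]" prefix is an assumed format used only for this sketch.
    """
    msg = message_from_string(raw_mail)
    if match_count > 0:
        original = msg.get("From", "")
        marked = "[%s x%d] %s" % (tag, match_count, original)
        if "From" in msg:
            # replace_header keeps the header in its original position
            msg.replace_header("From", marked)
        else:
            msg["From"] = marked
    return msg.as_string()
```

A mailer can then filter on the rewritten From header with its ordinary rules, so the marking step needs no special support in the mail reader itself.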
How to make hash (outline)
Spam is NOT exactly the same as what other people received. It may contain the recipient's user name or a random string. To allow a text diff on the server side, we make one hash per line. We cannot trust mail headers: From, Subject, and To are often forged.
First, remove the mail header. Then remove empty lines. Compute an MD5 hash for each remaining line. Take the 3rd and 6th bytes of each hash and join them.
MD5 original body
Important: Must Read for ALL. Interest Rates have dropped basis points once again to their lowest in years. We are now offering the lowest debt consolidation interest rates in history. Even if you just consolidated, we can save you more MONEY, faster! We can: * Consolidate All Loans Effectively & Efficiently * Give Loan Advice on the Best courses of Action * Allow for one New Low monthly payment (saving you even more!) * 99.9% of all Loans qualify & we do NO CREDIT CHECKS! All are approved in our program! TODAY'S LOW RATE IS 1.9% http://hogehoge/d1b2t3/?RefID=422904 To be removed from all our future corporate mailings please click below. http://hogehgoe/auto/index.htm
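The outline above (strip the header, drop empty lines, hash each line with MD5, keep two bytes per hash) can be sketched as follows. Several details are assumptions: the header is taken to end at the first blank line, "3rd and 6th byte" is read as 1-based positions in the raw 16-byte digest, the two bytes are hex-encoded, lines are encoded as UTF-8, and the function names are hypothetical.

```python
import hashlib
from typing import List


def split_body(raw_mail: str) -> str:
    """Drop the header block: everything up to the first blank line."""
    normalized = raw_mail.replace("\r\n", "\n")
    _header, sep, body = normalized.partition("\n\n")
    return body if sep else ""


def line_fingerprint(line: str) -> str:
    """MD5 the line and keep the 3rd and 6th digest bytes as hex."""
    digest = hashlib.md5(line.encode("utf-8")).digest()
    # Assumption: 1-based byte positions 3 and 6 are indices 2 and 5.
    return bytes([digest[2], digest[5]]).hex()


def mail_fingerprint(raw_mail: str) -> List[str]:
    """Return one short hash per non-empty body line."""
    body = split_body(raw_mail)
    return [line_fingerprint(ln) for ln in body.split("\n") if ln.strip()]
```

Because each line is hashed separately, two mails that differ only in a personalised line produce fingerprints that differ only at that position, and changes to the headers or to blank lines do not affect the result.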
This shows the basic idea; the real hash is more complex. The hash should include a check digit. It should be long enough to avoid collisions, yet short enough to handle easily.
Our client program is written in Java as a POP3 proxy. Our server program is written in C and Perl as an HTTP CGI, and its service is available on this site. Anyone can access the following address.
The client and server programs are included in the following archive. For ordinary use, you only need to compile and run the client program (this requires the Java 2 JRE).
- all files
A C language client is also available. It currently acts as a filter.
Headers shown in orange have been rewritten by the SameMail client. In this case, I received almost the same mail 8 times.
- Vipul's Razor
- Distributed Checksum Clearinghouse
- Project Proposal : Collaborative Spam Filtering
Nobuyuki Tsuchimura (tutimura(a)mist.i.u-tokyo.ac.jp; replace '(a)' with '@'). Modified on 1/5 13:18, 2004.