The art of autoresponders (part 2)

Armijn Hemel, July 17, 2009, 7826 views.


Using the ideas outlined in the first article, we made an autoresponder for our Postfix-based email system. It is a simple Python script that is launched every time a mail arrives for an address for which an autoresponder has been enabled.

The goal of this autoresponder is to determine as fast as possible whether a mail should get an autoreply. If it should not, the script should exit as early as possible.

The script works by performing quite a few checks in sequence. The checks are tweaked for our system and are ordered by how likely we think each one is to match first; on other sites the best order might be quite different.

First we import a whole lot of modules we will need:

#!/usr/bin/python
import smtplib
import email
import sys
import MySQLdb
import ConfigParser
import re
import syslog
# important, set default encoding for Python to utf-8!
import codecs
from email.MIMEText import MIMEText
from email.MIMEMultipart import MIMEMultipart
from string import Template
from datetime import datetime
import time

To help determine which checks should be run first, and to optimize the script for a particular setup, the rule that matches first is logged. If it turns out that most mails are caught by the last check, it is useful to run that check first. Of course, this requires a lot of data, and the results will definitely change over time as mails change: some headers fall out of use, new headers appear, and so on.

def writelog(recipient, sender, subject, status, reason):
  syslog.openlog("autoresponder", 0, syslog.LOG_LOCAL5)
  syslog.syslog("RECIPIENT (%s) SENDER (%s) SUBJECT (%s) STATUS (%s) REASON (%s)" %
         (recipient, sender, subject, status, reason))

The matching syslog configuration would look like this:

local5.*              /var/log/autoresponder
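To see which rule fires most often, the log can be tallied by its REASON field. The following is a minimal sketch, written for Python 3 and assuming the line format produced by writelog above; the name count_reasons and the sample lines are made up for illustration:

```python
import re
from collections import Counter

# The leading greedy ".*" skips the syslog prefix and makes the pattern
# pick the last "REASON (" on the line, in case the subject contains one.
REASON_RE = re.compile(r".*REASON \((.*)\)\s*$")

def count_reasons(lines):
    """Count how often each drop reason occurs in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = REASON_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Invented sample lines in the format written by writelog()
sample = [
    "RECIPIENT (-) SENDER (a@example.com) SUBJECT (hi) STATUS (dropped) REASON (spam)",
    "RECIPIENT (-) SENDER (b@example.com) SUBJECT (news) STATUS (dropped) REASON (mailing list)",
    "RECIPIENT (-) SENDER (c@example.com) SUBJECT (offer) STATUS (dropped) REASON (spam)",
]

# Print the reasons, most frequent first
for reason, count in count_reasons(sample).most_common():
    print(count, reason)
```

Run against the real /var/log/autoresponder, the reasons at the top of the output are candidates for moving to the front of the script.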

The following regular expression is used to determine if a mail address is well formed. If you are fluent in regular expressions you will of course see that it does not match every valid email address. It will, however, match most, if not all, normal mail addresses you will encounter in daily life. And remember, it is only used to determine if an autoreply should be sent.

def check_mail_address(s):
    # the pattern is split over two adjacent raw strings to keep the lines short
    res = re.search(r"((?:[\w+-][\w.+-]*[A-Za-z0-9]|[A-Za-z0-9][\w.+-]*"
                    r"[\w+-]|[A-Za-z0-9]+)@(?:[\w-]*\w\.)+[A-Za-z]{2,})", s)
    if res is not None:
        return res.groups()[0]
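A few sample inputs show what the expression accepts and rejects. This sketch repeats the function so that it runs on its own under Python 3; the sample strings are invented for illustration:

```python
import re

# Same pattern as in the article, compiled once
ADDRESS_RE = re.compile(
    r"((?:[\w+-][\w.+-]*[A-Za-z0-9]|[A-Za-z0-9][\w.+-]*[\w+-]|[A-Za-z0-9]+)"
    r"@(?:[\w-]*\w\.)+[A-Za-z]{2,})"
)

def check_mail_address(s):
    """Return the first address-like substring of s, or None."""
    res = ADDRESS_RE.search(s)
    if res is not None:
        return res.group(1)

# An address embedded in a display name is found
print(check_mail_address("John Doe <john.doe@example.com>"))
# No address at all
print(check_mail_address("no address here"))
# A quoted local part is valid mail syntax, but is not matched
print(check_mail_address('"john smith"@example.com'))
```

Because re.search is used, the function extracts an address from a longer string rather than validating the whole string.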

This script uses a MySQL database, which keeps history about which addresses have an autoresponder enabled, which addresses have already received an autoreply and when, and so on. The database configuration is kept in a separate configuration file, which is only readable by the autoresponder.

config = ConfigParser.ConfigParser()
config.readfp(open('/home/autoresponder/.my.cnf'))
host = config.get("client", "host")
user = config.get("client", "user")
passwd = config.get("client", "password")
db = "vmail"
try:
        conn = MySQLdb.connect(host = host,
                               user = user,
                               passwd = passwd,
                               db = db,
                               use_unicode = True,
                               charset = 'utf8')
except MySQLdb.Error, e:
        # without a database connection there is nothing useful to do
        sys.exit(0)
cursor = conn.cursor()
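The code above expects a [client] section with host, user and password keys. As a rough sketch of that layout, the following reads a sample file from a string. It is written for Python 3, where the module is called configparser; the sample values and the name read_db_settings are made up:

```python
import configparser

# Hypothetical contents of /home/autoresponder/.my.cnf; all values are made up.
SAMPLE_MY_CNF = """\
[client]
host = localhost
user = autoresponder
password = example-secret
"""

def read_db_settings(text):
    """Return host, user and password from the [client] section."""
    # interpolation=None keeps a '%' in the password from being
    # treated as the start of an interpolation reference
    config = configparser.ConfigParser(interpolation=None)
    config.read_string(text)
    return {key: config.get("client", key) for key in ("host", "user", "password")}

settings = read_db_settings(SAMPLE_MY_CNF)
print(settings["host"], settings["user"])
```

Keeping the credentials in a file that only the autoresponder account can read means the script itself contains no password.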

The original mail message is passed to this script via standard input and put into an email object for easier access to headers.

orig_mail = sys.stdin.read()
orig_msg = email.message_from_string(orig_mail)
# strip < and > from envelope sender ('' if the header is missing)
returnsender = (orig_msg.get('Return-Path') or '')[1:-1]
# replace newlines to prevent spam injection
try:
        orig_subject = orig_msg.get('Subject').replace('\n', ' ').strip()
except AttributeError:
        # the mail has no Subject header at all
        orig_subject = ""
# trim subject
orig_subject = orig_subject[:255]
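The same extraction can be tried out on a hand-written message. This is a minimal sketch for Python 3; the function name sender_and_subject and the sample message are assumptions, and missing headers are treated as empty strings:

```python
import email

def sender_and_subject(raw_mail):
    """Return (envelope sender, cleaned subject) for a raw message string."""
    msg = email.message_from_string(raw_mail)
    # A missing Return-Path yields an empty sender instead of an exception
    sender = (msg.get('Return-Path') or '').strip().strip('<>')
    # Replace line breaks with spaces, then trim to 255 characters
    subject = (msg.get('Subject') or '').replace('\r', ' ').replace('\n', ' ')
    return sender, subject.strip()[:255]

# Invented sample message
raw = (
    "Return-Path: <alice@example.com>\n"
    "Subject: Hello there\n"
    "\n"
    "Body text\n"
)
print(sender_and_subject(raw))
```

A bounce with an empty envelope sender ("Return-Path: <>") gives an empty sender string here, which the later checks can treat as a reason not to reply.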

The rest of the script mostly repeats the same steps:

fetch a certain header for the mail

check for certain values if it is there

exit the script (if we know we can safely ignore the mail), or continue with the next check
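These steps could also be written as a table of checks that is walked in order. This is only an illustration in Python 3 and not how the script in this article is organised; the names CHECKS and first_drop_reason are made up, and only two of the checks are shown:

```python
import email

# Each entry: (reason to log, predicate taking the parsed message).
# Reordering the checks then means reordering this list.
CHECKS = [
    ("spam", lambda msg: msg.get('X-Spam-Flag') == 'YES'),
    ("cron", lambda msg: msg.get('X-Cron-Env') is not None),
]

def first_drop_reason(msg, checks=CHECKS):
    """Return the reason of the first matching check, or None."""
    for reason, predicate in checks:
        if predicate(msg):
            return reason
    return None

# Invented sample message that carries a cron header
msg = email.message_from_string(
    "X-Cron-Env: <SHELL=/bin/sh>\nSubject: job\n\noutput\n"
)
print(first_drop_reason(msg))
```

A caller would log the returned reason and exit, or continue with processing when the result is None.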

The first test we do is to check if SpamAssassin thinks the mail is spam, or if it is likely to be spam ('spammy'). If it is (likely) spam, the script can safely exit and log that it was this test that caught it:

if(orig_msg.get('X-Spam-Flag') == 'YES'):
        writelog("-", returnsender, orig_subject, "dropped", "spam")
        sys.exit(0)

Bounces should not be replied to:

if "MAILER-DAEMON" in returnsender:
        writelog("-", returnsender, orig_subject, "dropped", "MAILER-DAEMON")
        sys.exit(0)
if "mailer-daemon" in returnsender:
        writelog("-", returnsender, orig_subject, "dropped", "mailer-daemon")
        sys.exit(0)

Mails that have RFC 3834 compliant headers which clearly indicate that the message was not sent by a human (that is: the header has a value other than 'no') can be discarded too. Ideally, this is how all programs would work:

header = orig_msg.get('Auto-Submitted')
if header != None:
        if header != "no":
                writelog("-", returnsender, orig_subject, "dropped", "Auto-Submitted")
                sys.exit(0)
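RFC 3834 defines the keywords 'no', 'auto-generated' and 'auto-replied' for this header and allows optional parameters after a semicolon. A slightly more tolerant version of the check, as a sketch in Python 3, could ignore those parameters and the case of the keyword; the name is_auto_submitted is made up and the case-insensitive comparison is a choice made here:

```python
import email

def is_auto_submitted(msg):
    """True if Auto-Submitted is present with a value other than 'no'."""
    header = msg.get('Auto-Submitted')
    if header is None:
        return False
    # Drop optional parameters after ';' and compare case-insensitively
    keyword = header.split(';', 1)[0].strip().lower()
    return keyword != 'no'

# Invented sample message, as another autoresponder might send it
reply = email.message_from_string(
    "Auto-Submitted: auto-replied\nSubject: Out of office\n\nI am away.\n"
)
print(is_auto_submitted(reply))
```

An autoresponder should also set 'Auto-Submitted: auto-replied' on the replies it sends itself, so that other systems can recognise them in the same way.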

Some mails actually have an 'X-Autogenerated' header set. Personally we think it would make more sense to use RFC 3834 headers instead:

header = orig_msg.get('X-Autogenerated')
if header == 'Reply':
        writelog("-", returnsender, orig_subject, "dropped", "autogenerated reply")
        sys.exit(0)

Mailing lists are often mentioned as the number one reason why you should not just send autoreplies to every mail you receive. There are many mailing list software implementations, and most of them set headers as described in RFC 2369, but other headers are in use as well:

mailingheaders = ['List-Id', 'List-Unsubscribe', 'List-Help',
                  'List-Info', 'X-List-ID', 'X-MDMailing-List',
                  'X-Mailing-List', 'X-list', 'X-Mailer-ListID',
                  'X-MailingID', 'X-Mailing-Id']
for mailheader in mailingheaders:
        header = orig_msg.get(mailheader)
        if header != None:
                writelog("-", returnsender, orig_subject, "dropped", "mailing list")
                sys.exit(0)

The 'Precedence' header is a fairly old header that indicates whether a mail was sent by a system, a list or a human. Mails with a value of 'list', 'bulk' or 'junk' can be ignored by an autoresponder:

precedence_headers = ['list', 'bulk', 'junk']
header = orig_msg.get('Precedence')
if header != None:
        if header in precedence_headers:
                writelog("-", returnsender, orig_subject, "dropped", "Precedence list, bulk or junk")
                sys.exit(0)

Most cron implementations set the 'X-Cron-Env' header. Some new cron implementations also use RFC 3834 headers. Other cron implementations don't set a special header at all:

header = orig_msg.get('X-Cron-Env')
if header != None:
        writelog("-", returnsender, orig_subject, "dropped", "cron")
        sys.exit(0)

Some systems set their own headers to indicate the mail is an autoresponse:

autoresponseheaders = ["X-Autoresponder", "X-Autorespond"]
for ar in autoresponseheaders:
        header = orig_msg.get(ar)
        if header != None:
                writelog("-", returnsender, orig_subject, "dropped", "X-Autorespond(er)")
                sys.exit(0)

Although many bugtrackers have started using RFC 3834 and Precedence headers, not all running instances use them yet. A simple check for various bugtracker headers is needed:

bugtrackers = ["X-Bugzilla-Product", "X-Bugzilla-Type",
               "RT-Originator", "X-Roundup-Version", "X-Trac-Version", "X-Whups-Generated"]
for bt in bugtrackers:
        header = orig_msg.get(bt)
        if header != None:
                writelog("-", returnsender, orig_subject, "dropped", "bugtracker/issuetracker")
                sys.exit(0)

Delivery reports and bounces can be ignored as well:

header = orig_msg.get('Content-Description')
if header == 'Delivery report':
        writelog("-", returnsender, orig_subject, "dropped", "bounce")
        sys.exit(0)

Many scripts give a clear indication in the envelope sender that the mail should not be replied to:

noreplies = ["noreply@", "no_reply@", "no-reply@", "do-not-reply@", "bounce@"]
for noreply in noreplies:
        if noreply in returnsender:
                writelog("-", returnsender, orig_subject, "dropped", "noreply")
                sys.exit(0)

Many scripts set the envelope sender to the name of a Unix system account. These can be ignored as well:

sysaccounts = ["www@", "www-data@", "wwwrun@", "nobody@",
               "apache@", "root@", "nagios@",
               "httpd@", "postmaster@", "svn@"]
for sysacct in sysaccounts:
        if re.match(sysacct, returnsender) != None:
                writelog("-", returnsender, orig_subject, "dropped", "system account")
                sys.exit(0)
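Both envelope sender checks look at the part before the @. As an alternative sketch in Python 3, the local part can be split off and compared against a set. The names below are made up, and the exact comparison is stricter than the substring test used for the noreply addresses: a sender such as messages-noreply@example.com would not be caught by it.

```python
# Local parts taken from the two lists in the article
NOREPLY_LOCALPARTS = {"noreply", "no_reply", "no-reply", "do-not-reply", "bounce"}
SYSTEM_LOCALPARTS = {"www", "www-data", "wwwrun", "nobody", "apache", "root",
                     "nagios", "httpd", "postmaster", "svn"}

def classify_sender(returnsender):
    """Return 'noreply', 'system account' or None for an envelope sender."""
    # Everything before the last '@', lowercased; '' if there is no '@'
    localpart = returnsender.rpartition('@')[0].lower()
    if localpart in NOREPLY_LOCALPARTS:
        return "noreply"
    if localpart in SYSTEM_LOCALPARTS:
        return "system account"
    return None

print(classify_sender("root@example.com"))
print(classify_sender("alice@example.com"))
```

Whether the exact or the substring comparison is the better fit depends on the mail a site actually receives, which is where the logging comes in.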

Various big sites set their own headers which you have to filter for.

LinkedIn, for example, does not add any header that marks its mails as autogenerated, so you actually have to filter on the 'Sender' field:

header = orig_msg.get('Sender')
if header == "messages-noreply@bounce.linkedin.com":
        writelog("-", returnsender, orig_subject, "dropped", "LinkedIn")
        sys.exit(0)

Google is another:

header = orig_msg.get('Reply-To')
if header == "adwords-noreply@google.com":
        writelog("-", returnsender, orig_subject, "dropped", "Google AdWords newsletter")
        sys.exit(0)
# Gmail address confirmation
header = orig_msg.get('X-Google-Address-Confirmation')
if header != None:
        writelog("-", returnsender, orig_subject, "dropped", "google address confirmation")
        sys.exit(0)

and Facebook:

header = orig_msg.get('X-Facebook-Notify')
if header != None:
        writelog("-", returnsender, orig_subject, "dropped", "FaceBook")
        sys.exit(0)

and PayPal:

header = orig_msg.get('X-Email-Type-Id')
if header != None:
        if 'PP' in header:
                writelog("-", returnsender, orig_subject, "dropped", "PayPal")
                sys.exit(0)

The biggest check in our script is for the X-Mailer header. This header is incredibly popular and a lot of organisations that send newsletters use it. The problem is that they all use different values. If you have a few hundred people on your system who are together subscribed to 400 different newsletters (which is not uncommon), you easily end up with a really big list of X-Mailer headers (edited here for brevity):

xmailer_headers = ["MediaWiki mailer"]
header = orig_msg.get('X-Mailer')
if header != None:
   for mailhdr in xmailer_headers:
       if mailhdr == header:
          writelog("-", returnsender, orig_subject, "dropped", "X-Mailer header filter : %s" % header)
          sys.exit(0)
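With a long list, an exact comparison misses values that differ only in a trailing version string. Whether that matters depends on the mailers involved, so the following is only a sketch under that assumption, in Python 3, using a prefix match. Only "MediaWiki mailer" comes from the list above; the second entry and the function name are invented:

```python
import email

# Hypothetical prefixes; only "MediaWiki mailer" comes from the article
XMAILER_PREFIXES = ("MediaWiki mailer", "ExampleNewsletterTool")

def xmailer_is_filtered(msg, prefixes=XMAILER_PREFIXES):
    """True if the X-Mailer header starts with one of the known prefixes."""
    header = msg.get('X-Mailer')
    # str.startswith accepts a tuple of candidate prefixes
    return header is not None and header.strip().startswith(prefixes)

# Invented sample message with a version number after the mailer name
msg = email.message_from_string(
    "X-Mailer: ExampleNewsletterTool 2.1\nSubject: News\n\nHello\n"
)
print(xmailer_is_filtered(msg))
```

A prefix match keeps the list shorter, at the price of occasionally dropping mail from a mailer whose name merely starts the same way.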

After these, and possibly more, checks it is time to process the mail, which we will do in the next article.
