Destress Your Mail Server with Postfix Troubleshooting Tips: Part 2

As I discussed in the previous article, it’s clearly important to know when your trusty mail transfer agent (MTA) is struggling to perform its duties. In this article, I’ll take a closer look at some warning signs and consider potential ways to mitigate issues when you spot that your server is stressed.

When there’s no stress causing your mail server to creak at the seams, a standard response time is, for all intents and purposes, immediate. It’s a very snappy retort with generally no latency visible to the human eye. If there are unwieldy, sizeable attachments, you may see a slight performance hit because network IO and disk IO obviously need to stretch their legs, limber up, and get involved.

Warning Signs

For Postfix, the point of no return is when the number of SMTP clients exceeds the number of SMTP server processes. There’s a whole host of reasons why a higher-than-usual volume of legitimate emails may arrive within a short period of time. For example, a mail server that has been unavailable for some time (due to maintenance, network outages, DNS problems, or other issues) might suddenly become available again and empty its previously undelivered queue. Or, of course, it could be due to an attack of varying flavors, and the deluge of email may not be legitimate or welcome at all.

On a basic level, you can probably spot a busy Postfix machine relatively easily. You can fire up a Telnet session to your mail server (running the command telnet <mail.chrisbinnie.tld> 25, replacing “mail.chrisbinnie.tld” with your MTA’s name) and look for a slow initial response. The response in question is the following greeting line, which acts as the opening gambit of the transaction:

220 mail.chrisbinnie.tld ESMTP Postfix

As I’ve said, these delays might be because of DNS issues somewhere in the transaction chain and can obviously exist even when the server is running without excessive load, so don’t immediately jump to conclusions.

Let’s consider another error. It’s possible that your logs contain multiple “lost connection after CONNECT” errors. Here a mail client is connecting but not completing the transaction and, frankly, just irritating your poor, old mail server. Note, from a security perspective, that it’s possible to attempt to tie your server in knots with many of these connections each second; you can generate this error by running a port scan, for example. The clever Postfix (as of version 2.3) adds a helpful reminder to your logs to help you figure out if you’re suffering from process depletion as follows:

warning: service "smtp" (25) has reached its process limit "30": 
  new clients may experience noticeable delays

In reality, all sorts of reasons could cause this issue, from saturated bandwidth, incorrect MTU, slow DNS lookups, reverse/forward DNS lookup mismatches, to misconfigurations at the other end of the connection.
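
Before chasing any of those causes, it helps to know how often the two log messages mentioned above actually appear. The following is a minimal sketch rather than an official Postfix tool: the function name count_stress_signs is made up for illustration, and the log path in the usage comment is an assumption (your distribution may well log elsewhere).

```shell
# Count two overload symptoms in a Postfix log file.
# Hypothetical usage: count_stress_signs /var/log/mail.log
# (the path is an assumption; adjust it to wherever your syslog writes mail logs)
count_stress_signs() (
    logfile=$1
    [ -r "$logfile" ] || { echo "cannot read log file: $logfile" >&2; exit 1; }

    # grep -c prints the number of matching lines. It exits non-zero when
    # the count is 0, so "|| true" keeps the function usable under "set -e".
    lost=$(grep -c 'lost connection after CONNECT' "$logfile" || true)
    limit=$(grep -c 'has reached its process limit' "$logfile" || true)

    printf 'lost_after_connect=%s process_limit_warnings=%s\n' "$lost" "$limit"
)
```

A handful of lost connections alongside zero process-limit warnings probably points at impatient clients or port scans rather than a genuinely overloaded server.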

There are a few things to bear in mind, however. These errors could potentially be a symptom of Postfix tripping over itself with any rules that process inbound emails before they hit its queue. If you’re stumped about what the root cause of such problems might be, then look for errors and typos in your config (or switch off pre-processing to see if it fixes it). In most cases, legitimate clients aren’t likely to generate these errors, so something else is to blame.

Hopefully, this highly logical, supplementary comment, which is apparently from the writer of Postfix, Wietse Venema, will help if you get stuck thanks to the appearance of such an issue:

“The client disconnected before Postfix could ask the KERNEL for the client IP address. Either your server is too slow or the client is too impatient.”

There’s additional wisdom regarding that comment online.

One final thing to assert is that, if your server’s capacity issues are only temporary, then clearly there’s much less chance of emails actually being lost. Retries should mean that deliveries will be successful when the condition subsides.

Destressing Yourself

From Postfix version 2.5 on, a clever feature was added to assist with automatically mitigating stressful operating conditions. Postfix calls it “stress-adaptive behavior” and, sadly, the solution only applies to your MTA and not real life, where it would be welcomed by many sysadmins, I’m sure.

According to the docs, if a public-facing service on your Postfix server runs into an “all server ports are busy” condition, then, as well as logging a warning, the master daemon will restart that service (and without interrupting existing network sessions). When the service comes back up, the eagle-eyed among you will notice that this option has been added to its command line: -o stress=yes. It would look something like this if you ran the ps -ef command:

1010  ??  S      11:11.11 smtpd -n smtp -t inet -u -c -o stress=yes

Should this cause consternation, note that under normal conditions the “yes” is missing but “-o stress=” is still present (an empty value means that the server is not stressed), and that this option only applies to public-facing services and not local ones.
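
If you would rather not squint at process listings, a small filter can do the looking for you. This is only a sketch: smtpd_stress_state is a hypothetical helper name, and it assumes that it is fed one command line per row, for example from ps -eo args (check that your ps supports those options).

```shell
# Read process command lines on stdin and tally the stress setting of
# every smtpd process found.
# Hypothetical usage: ps -eo args | smtpd_stress_state
smtpd_stress_state() {
    awk '
        # Only look at rows whose command is smtpd (with or without a path),
        # so that other processes mentioning "smtpd" are ignored.
        $1 ~ /(^|\/)smtpd$/ {
            if ($0 ~ /-o stress=yes/)    stressed++
            else if ($0 ~ /-o stress=/)  normal++
        }
        END { printf "stressed=%d normal=%d\n", stressed, normal }
    '
}
```

A non-zero “stressed” count suggests that Postfix has switched at least one service into its stress-adaptive mode.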

This super-clever option means that several additional options kick in. I was genuinely intrigued when I discovered that Postfix had this undeniably sophisticated capability. Let’s look at the options it controls in more detail.

The first change it makes is that the “smtpd_timeout” option is altered when Postfix feels stressed out. Normally, it would default to 300 seconds, but under heavy load this will drop to just 10 seconds. This will not meet RFC rules so it should only be used temporarily. Even lowering this to 5 seconds means that most clients will still manage to deliver email successfully, and those that are too slow will retry. So, in theory, if this is used briefly, then no valuable email will be lost.

Next, we get seriously medieval with an option that we’ve already looked at, namely “smtpd_hard_error_limit”. The usual default value of 20 drops to just one. Very sensibly, Postfix is far less sympathetic to mail client problems when it’s busy, as you can see.

Our battle-hardened MTA also massively reduces the value for “smtpd_junk_command_limit” all the way down to one instead of the default 100. Essentially, this should penalize any client that tries to keep connections open. It stops them from continually bombarding your server with HELO, EHLO, NOOP, RSET, VRFY, or ETRN commands.

The last three adjustments only apply to Postfix versions 2.6 and up, by all accounts. The first of them (controlled by the “smtpd_per_record_deadline” option) alters the way that Postfix handles the “smtpd_timeout” and “smtpd_starttls_timeout” options. Rather than applying the time limit to each individual read or write, Postfix applies it to the time taken to send or receive a complete record (an SMTP command line, an SMTP response line, an SMTP message content line, or a TLS protocol message).

The second countermeasure, from version 2.6 on and used as a stress relief mechanism, applies to “smtpd_starttls_timeout”, the time limit for TLS handshakes. Here we drop to the floor and offer 10 push-ups instead of 300. Clearly, any transaction involving encryption adds to server load, and again there should be retries so that no email is lost.

The third adjustment is stricter still: Postfix won’t wait up to six seconds for an address verification probe to complete. An explanation of how address verification works, from the unerring docs: “Address verification is a feature that allows the Postfix SMTP server to block a sender (MAIL FROM) or recipient (RCPT TO) address until the address has been verified to be deliverable.” Again, the functionality is very clever; (much) more information is available online.

The long and short of using this setting under heavy load is that if the result is not already in the address verification cache, it will reply immediately with “unverified_recipient_tempfail_action” or “unverified_sender_tempfail_action”. If this option is used as a temporary countermeasure, email should arrive after retries kick in. I would encourage a quick read of the information cited above for more detail.
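
To see how your own installation wires these options to the stress setting, you can ask Postfix for its built-in defaults. The following is a sketch under a few assumptions: show_stress_defaults is a made-up wrapper name, the parameter names “smtpd_per_record_deadline” and “address_verify_poll_count” come from the Postfix documentation, and an older Postfix may complain about parameters that it does not know.

```shell
# Print the built-in defaults of the stress-dependent parameters.
# "postconf -d" shows default values rather than what is set in main.cf.
show_stress_defaults() {
    postconf -d smtpd_timeout smtpd_hard_error_limit smtpd_junk_command_limit \
        smtpd_per_record_deadline smtpd_starttls_timeout address_verify_poll_count
}
```

On a version with stress-adaptive behavior, each value should contain an expression keyed on “stress” rather than a plain number. If it shows a plain number, your version probably has no stress-dependent default for that parameter.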

Permanence

What about if you encounter ongoing load issues, and you can’t bump up your server specification or bandwidth? Let’s consider a few options to assist in making useful changes permanent.

Think back to the “all server ports are busy” error again for a moment. To mitigate the effects of that particular issue, we need to increase the number of SMTP processes. I’m sure that you won’t be surprised to read that Postfix prefers, by default, not to aim for world domination and utilize every possible last modicum of system resource. Instead, it prefers to take a considerate approach. Let’s look at hogging more of your server’s overall resources in order to meet the demands of more simultaneous client connections.

We’ll avoid tinkering with the “master.cf” file and ignore the alternative way of achieving this outcome via that route (it’s considerably easier to break Postfix). Instead, we’ll edit or add the “default_process_limit” option in “main.cf”, setting a higher value as we do so. Remember to execute a reload following this change:

# postfix reload

The docs warn that you need Postfix version 2.4 or higher to configure more than 1,000 processes (and an operating system that supports BSD kqueue(2), Linux epoll(4), or Solaris “/dev/poll”) and that adding more processes soaks up more of your precious RAM, so tread carefully.
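
If you prefer not to edit “main.cf” by hand, the change and the reload can be combined. Treat this as a minimal sketch: set_process_limit is a hypothetical helper, it needs to run as root, and it deliberately does not suggest a number, because the right value depends on your RAM and workload.

```shell
# Set default_process_limit in main.cf, then reload Postfix.
# Hypothetical usage (the value is yours to choose): set_process_limit 200
set_process_limit() {
    # Refuse anything that is not a plain positive-looking number.
    case $1 in
        ''|*[!0-9]*) echo "usage: set_process_limit NUMBER" >&2; return 2 ;;
    esac
    # "postconf -e" edits main.cf in place; reload only if the edit worked.
    postconf -e "default_process_limit = $1" && postfix reload
}
```

Afterwards, running postconf default_process_limit should echo back the new value.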

You can apparently reduce the RAM hit by opting for “cdb” lookup tables instead of other alternatives, such as Berkeley DB’s hash table format. One quick caveat: In case you’re using a relatively old version, the “SMTPD_POLICY_README” might contain misleading information relating to fixed daemon processes, so even if you can’t install the latest version, view its README file.

If you’re in an unhappy job and want to spend less time with clients who pester you, then you can of course limit your server’s attention span instead of bumping up your RAM usage.

A popular addition to “greylisting” (an invaluable anti-spam measure) is to use Realtime Black Lists (RBLs), also called Realtime Blackhole Lists, which need to perform a remote, third-party lookup before an incoming email is deemed acceptable.

It’s common to add several RBLs to your configuration, but over time some of these blacklists are closed down or no longer useful due to obsolete or dated information. Additionally, it’s possible that you are duplicating your RBL lookups because a provider, such as the excellent Spamhaus, offers several blacklists in a combined format. The Postfix docs recommend disabling both obsolete and duplicated RBLs in order to remove their lookup times and therefore the time you spend on each transaction.

An (incomplete) example of how efficient Postfix is with RBL follows. Here, we are actually only querying the same RBL once with one lookup despite the repetition (added to your “main.cf” file again):

smtpd_client_restrictions =
       permit_mynetworks
       reject_rbl_client zen.spamhaus.org=127.0.0.10
       reject_rbl_client zen.spamhaus.org=127.0.0.11
       reject_rbl_client zen.spamhaus.org
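
When pruning obsolete or overlapping lists, a quick inventory of which RBL domains your restrictions actually query can help. This is a rough sketch only: rbl_domains is a made-up name, and it assumes one reject_rbl_client entry per line, as in the example above.

```shell
# Read restriction lines on stdin and print each RBL domain with the
# number of reject_rbl_client entries that mention it.
# Hypothetical usage: rbl_domains < /etc/postfix/main.cf
rbl_domains() {
    awk '
        $1 == "reject_rbl_client" {
            domain = $2
            sub(/=.*/, "", domain)   # drop an optional =127.0.0.x reply filter
            sub(/,$/, "", domain)    # tolerate a trailing comma
            count[domain]++
        }
        END { for (d in count) print d, count[d] }
    ' | sort
}
```

Several entries for the same domain are harmless, as noted above; what to look for is distinct domains that are obsolete or that duplicate what a combined list already covers.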

Additionally, if you add rules for checking email headers to a greater degree than is usual, then it’s more efficient to combine them as opposed to having them spread out in your config file as individual rules. The docs suggest this approach in the “/etc/postfix/header_checks” file:

if /^Subject:/
  /^Subject: virus found in mail from you/ reject
  /^Subject: ..other../ reject
endif

As you can see, more than one rule is held within one “if” statement to expedite its processing time.
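
It’s worth testing such rules before real email meets them. The sketch below leans on postmap’s query mode; check_header is a hypothetical wrapper, and it assumes that your “header_checks” file is a regexp-type table (use pcre: instead if that is what your “main.cf” specifies).

```shell
# Ask Postfix which header_checks action, if any, matches a header line.
# Hypothetical usage: check_header 'Subject: virus found in mail from you'
# An optional second argument overrides the table path.
check_header() {
    postmap -q "$1" "regexp:${2:-/etc/postfix/header_checks}"
}
```

If a rule matches, postmap prints the corresponding action; if nothing matches, it prints nothing and exits with a non-zero status.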

Innately Suspicious

When your MTA is stressed out, you might adopt a much stricter stance so that any machine connecting to you — which appears even a little bit suspicious — is presented with a cold shoulder.

We can use the SMTP 521 error code, which returns a response saying that a certain domain name does not accept email as per RFC 1846. This applies to newer versions of Postfix (2.6 onwards) and other codes are used in earlier versions. The sophisticated Postfix dutifully rejects the email and then stubbornly disconnects. This is Postfix acting as intransigent as it gets; there’s no patience and the connection is dropped even without receiving a remote QUIT command.

Testing

Previously, I mentioned that it is very easy to give Postfix headaches if you break the “master.cf” file. We need to carefully edit that file for these next two tweaks.

You can force Postfix into what I refer to as overload mode (where it deploys its countermeasures to combat stress) by diligently adding a line to the “master.cf” file. Introducing “-o stress=yes” as an option will enable that mode as shown in Listing 1:

# ==========================================================================
# service type  private unpriv  chroot  wakeup  maxproc command + args
#               (yes)   (yes)   (yes)   (never) (100)
# ==========================================================================
smtp      inet  n       -       n       -       -       smtpd
     -o stress=yes
     -o . . .

Listing 1: Enabling countermeasures manually, which usually Postfix will do automagically under heavy load.

After making the changes in Listing 1, remember to run this command:

# postfix reload

Conversely, if you would prefer to never allow automatic stress-adaptive behavior, then you simply alter “master.cf” as seen in Listing 2.

# =============================================================
# service type  private unpriv  chroot  wakeup  maxproc command
# =============================================================
# 
submission inet n       -       n       -       -       smtpd
        -o stress=
        -o . . .

Listing 2: Never permit Postfix’s anti-stress mode to automatically kick in.

Essentially, we’re adding the option but leaving it empty. Again, remember to run a config reload after deploying the changes seen in Listing 2.

Flesh Eating

Sadly, botnets continue to remain popular, and spambots also continue to relentlessly hammer poor, unsuspecting mail servers. As they search for the latest place to deposit their unwelcome payloads, clearly it’s important that MTAs evolve to keep up. As of Postfix 2.8, the Postscreen tool was introduced to help mitigate spambots. It cleverly achieves its welcome effects by handling several SMTP connections within just one system process apparently.

The efficacious Postscreen also keeps track of connections by using its own whitelist for clients that have been proven valid previously by passing a series of connection-related checks. It then remembers IP addresses within its whitelist and, if it deems one such machine is friend and not foe, then it will push through the email delivery onto an SMTP process. This cache of friendly IP addresses cuts down on how much effort Postfix applies to each valid email, thereby reducing load and delivery times. If you would like to read more about this clever addition to Postfix, please refer to the manual.

EOF

Tempting as it is to continue exploring the many other options that Postfix includes, unfortunately, there are simply too many to cover. What with talking about debugging, automatic stress-adaptive functionality, and briefly touching upon Postscreen, we haven’t really done justice to the powerhouse that is Postfix. The quality of the features included with Postfix, its reliability, and the sheer volume of choices available in how you deploy the superior MTA make this subject one for continued study.

The next time you encounter a config mistake, an unwelcome zombie attack, or simply need to check how your mail queue is performing, I hope that what I’ve covered here will leave you armed with the information required to make headway.

 

Chris Binnie is a Technical Consultant with 20 years of Linux experience and a writer for Linux Magazine and Admin Magazine. His new book Linux Server Security: Hack and Defend teaches you how to launch sophisticated attacks, make your servers invisible and crack complex passwords.