Technical deep-dive: deep queue assignment
In this new series, we ask our developers to provide insights into features and optimizations that are not very visible on the surface but have a lot of impact under the hood. For this edition, our lead C++ engineer Michael van der Werve is going to talk about a recently introduced optimization for queueing messages on three different levels: per IP, IP Pool, and domain.
In essence, sending an email is pretty simple: create an email, resolve the recipient's domain, open a connection to the server, and deliver the message to it. Although this is conceptually simple, numerous practical problems make it difficult. For one, the connection might not open, which means we should try again later. Perhaps the other server does not want to accept the email right now and greylists it, so it has to be tried again later from the same IP. Maybe some specific mailboxes do not exist. A plethora of things can go wrong in this process.
Making it even more complicated, MailerQ is not managing a single IP but, in some cases, many thousands of IPs! Each MTA IP has its own throttles and its own state, and can also have multiple connections to the same server. Overall, this greatly increases throughput, but it can also cause some extra problems.
Message queueing in MailerQ
We'll dive a bit into how MailerQ handles messages internally. You might have seen it in the frontend already, but there are three different stages for messages.
When the message is first consumed from the outbox, the message will be in memory. After the message is loaded in memory, MailerQ will append the message to its internal memory queues if there is still space, depending on the email throttle settings and the maximum number of messages that may be in memory simultaneously. Then, depending on the state of deliveries to the domain and previous attempts, MailerQ will either assign the message if it can be sent directly or it will park the message if there is no more space in the in-memory queues.
If everything goes well and the email may be sent directly or within a very short amount of time, the message moves to the assigned state. Every message in this state is somewhere between being scheduled inside MailerQ and about to be sent; it may even have been partly sent already! These are the messages that are practically out the door and are waiting for the receiving MTA's response. Usually, not too many messages can be in this state, since only one message may ever be scheduled per actual connection. This rule slightly decreases throughput per connection but results in a much more stable delivery pattern, so overall performance and reliability increase.
If the domain is currently not accepting emails or there is some other reason the message should wait, it is moved to the parked state, meaning that it currently resides in (or is being moved to) RabbitMQ and waits in a queue there until it is consumed again by MailerQ. These are queues like to:gmail.com, from:ip-pool to:hotmail.com, etc., and they can also be seen during pauses in the management console. MailerQ is not interested in the messages in the parked state because every path that such a message can take is blocked: either the deliveries have been paused in the interface, or none of the available servers are currently accepting emails. If there is a bit of load, most messages will actually be parked until we're ready for them. This parking makes sure that the critical path is not obstructed by irrelevant messages, which reduces the administration MailerQ needs to do and subsequently increases performance.
Tying it back to the in-memory messages, the maximum size of the in-memory queue can be configured in the email throttles, with the spillover being moved to the parked state. Raising these limits causes MailerQ to be less aggressive in parking emails in RabbitMQ. This can be better for performance with large domains, for example, where as many emails as possible should be kept ready.
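To make the three stages a bit more concrete, here is a minimal sketch of the decision described above. It is an illustration only, not MailerQ's actual code: the function name, its parameters, and the exact order of the checks are assumptions made for this example.

```python
from enum import Enum


class Stage(Enum):
    IN_MEMORY = "in memory"
    ASSIGNED = "assigned"
    PARKED = "parked"


def next_stage(path_blocked, connection_free, memory_queue_size, memory_queue_limit):
    """Pick the stage for a message (hypothetical rules, for illustration).

    path_blocked       -- deliveries are paused or no server accepts mail
    connection_free    -- a connection could take the message right now
    memory_queue_size  -- messages currently held in the in-memory queue
    memory_queue_limit -- assumed stand-in for the email throttle setting
    """
    if path_blocked:
        # Every path is blocked, so the message waits in RabbitMQ.
        return Stage.PARKED
    if connection_free:
        # At most one message is scheduled per connection.
        return Stage.ASSIGNED
    if memory_queue_size < memory_queue_limit:
        # Still room to keep the message ready in memory.
        return Stage.IN_MEMORY
    # Spillover is moved to the parked state.
    return Stage.PARKED
```

In this sketch, raising `memory_queue_limit` keeps more messages in memory before the spillover rule kicks in, which mirrors the effect of raising the limits in the email throttles.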
Queues per domain: potentially problematic under high load
In older versions of MailerQ (prior to 5.8), there was only a single queue on the domain level. This meant that, for example, every single email going to gmail.com ended up in the same internal queue. Depending on the usage, this can be a large performance problem, because two emailings going to gmail.com could end up competing with each other for resources even though their resource requirements do not overlap. For example, imagine a newsletter was sent to 10,000 recipients on gmail.com, and another single email was sent from MTA IP 2.3.4.5 to gmail.com. In the worst-case scenario, the second email would have had to wait until the entire first mailing was done before it could finally get scheduled, even though the newsletter did not use that IP at all!
In reality, it was not as dramatic as the example because we had some smart internal decision making on which emails should have more 'priority.' However, there were some cases where this popped up, especially under load.
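The worst case from the example can be sketched in a few lines. This is a deliberately naive model that assumes a strict first-in, first-out queue, which ignores the prioritization mentioned above; the newsletter IP 1.2.3.4 and the message names are made up for the illustration.

```python
from collections import defaultdict, deque

# 10,000 newsletter messages from one (hypothetical) IP, then one single email.
newsletter = [("1.2.3.4", f"newsletter-{i}") for i in range(10_000)]
single = ("2.3.4.5", "single-email")
messages = newsletter + [single]

# Old situation: one queue for everything going to gmail.com.
domain_queue = deque(messages)
ahead_in_domain_queue = list(domain_queue).index(single)

# Queueing per IP: the single email only shares a queue with its own IP.
ip_queues = defaultdict(deque)
for ip, name in messages:
    ip_queues[ip].append((ip, name))
ahead_in_ip_queue = list(ip_queues["2.3.4.5"]).index(single)

print(ahead_in_domain_queue)  # 10000 messages are ahead of the single email
print(ahead_in_ip_queue)      # 0 messages are ahead of the single email
```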
The solution: queueing on three levels
At MailerQ, we have a habit of critically looking at the designs and infrastructure and asking ourselves how to increase the application's performance and throughput. We researched and measured common usage and found that a lot could be improved.
Since MailerQ 5.8, messages can be queued on three levels: the domain level, the pool level, and the IP level. The domain level contains all the messages from any sender to a specific domain. The pool level contains all the messages that are queued from a given pool to a particular domain. This way, messages sent via some internal pool to gmail.com will no longer compete with messages sent via a customer pool to gmail.com! Lastly, the IP level contains all the messages to be sent from a specific IP to a particular domain. Among other things, this means MailerQ can now assign greylisted messages to this deeper queue, so they no longer compete for resources with untried messages.
Another benefit is that these queues can now opportunistically look upward: a connection that is going faster can acquire more messages to send from the levels above it. This decreases the overall administrative load, because each connection bypasses all the irrelevant messages that MailerQ should send via other IPs.
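One way to picture the three levels is as a queue key that gets more specific the deeper you go. The sketch below is an assumption about how such a key could look, not MailerQ's internal data structure; the names `QueueKey`, `queue_key`, and `level`, and the pool names, are invented for the example.

```python
from typing import NamedTuple, Optional


class QueueKey(NamedTuple):
    domain: str
    pool: Optional[str] = None
    ip: Optional[str] = None


def queue_key(domain, pool=None, ip=None):
    """Return the key of the deepest queue that applies to a message."""
    if ip is not None:
        # A message tied to one IP, for example after greylisting.
        return QueueKey(domain, ip=ip)
    if pool is not None:
        return QueueKey(domain, pool=pool)
    return QueueKey(domain)


def level(key):
    """Name the level a key belongs to."""
    if key.ip is not None:
        return "ip"
    if key.pool is not None:
        return "pool"
    return "domain"


internal = queue_key("gmail.com", pool="internal")
customer = queue_key("gmail.com", pool="customer")
greylisted = queue_key("gmail.com", ip="2.3.4.5")
```

Here `internal` and `customer` are different keys, so the two pools no longer share a queue for gmail.com, and `greylisted` sits on the IP level, apart from untried messages.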
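Looking upward can be sketched as a lookup order: a connection first checks the queue for its own IP, then the queue for its pool, and only then the domain queue. The class below models a single destination domain and is purely illustrative; the class name, its methods, and the strict ordering are assumptions, not a description of MailerQ's scheduler.

```python
from collections import deque


class DeepQueues:
    """Queues for one destination domain on three levels (illustrative)."""

    def __init__(self):
        self.domain = deque()  # messages any connection may send
        self.pools = {}        # pool name -> deque
        self.ips = {}          # IP address -> deque

    def push(self, message, pool=None, ip=None):
        """Store a message in the deepest queue that applies to it."""
        if ip is not None:
            self.ips.setdefault(ip, deque()).append(message)
        elif pool is not None:
            self.pools.setdefault(pool, deque()).append(message)
        else:
            self.domain.append(message)

    def pop_for(self, ip, pool):
        """Take the next message for a connection, looking upward."""
        for queue in (self.ips.get(ip), self.pools.get(pool), self.domain):
            if queue:  # skips missing (None) and empty queues
                return queue.popleft()
        return None


queues = DeepQueues()
queues.push("any sender", None, None)
queues.push("customer pool", pool="customer")
queues.push("greylisted", ip="2.3.4.5")

order = [queues.pop_for("2.3.4.5", "customer") for _ in range(4)]
```

In this example `order` ends up as the greylisted message, then the pool message, then the domain message, and finally `None` once nothing is left. A connection on another IP in another pool would never have to look at the first two messages at all.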
Practically, what you will see in the Management Console is that more messages are parked across more queues, and when you look at RabbitMQ, you will see many more queues than with old versions of MailerQ. This also makes it easier to monitor these queues externally and see where potential problems show up. MailerQ tries to keep the outbox as empty as possible since it is the single injection point. MailerQ cannot even see messages that are still stuck in the outbox, so a backed-up outbox would keep it from assigning emails that could already be sent, which has a negative impact on throughput.
MailerQ makes a lot of use of this new infrastructure internally, but it can only do so suboptimally if you are sending via a specific set of multiple IPs instead of using IP pools. In most cases, it is easy to upgrade to a pool instead, allowing you to take full advantage of the speedup this feature can bring.
If you need help or advice on upgrading your infrastructure to use IP pools and subsequently to make optimal use of these deep queues, let us know!