I made a mistake
Jack Ellis · Mar 7, 2024

Last month, I tweeted that I had rolled all our email report queries into one query, allowing us to deliver our email reports much faster.
Amazing, right?
Nope, let me share my mistakes.
Why did we do this?
Let's go back in time, and I'll explain how we'd been delivering email reports before January 2024. We would run a scheduled job to load the metadata for all the email reports we needed to send (email address, sites to include, etc.). We then dispatched a single job for each email report.
As we grew, we had to find ways to avoid overloading our database because these jobs would run concurrently on our Lambda (via Laravel Vapor) at scale. And we only had one queue for all our jobs (yes, I know, keep reading).
The first thing I did to protect our database was to introduce random delays when dispatching the jobs. The logic was that we could spread the jobs roughly evenly across 15 minutes, which would reduce the load. That worked until it didn't, and then we needed something new.
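In Laravel terms, that first approach looks something like this minimal sketch (the SendEmailReport job class and $reports collection are hypothetical names for illustration, not Fathom's actual code):

```php
<?php

use App\Jobs\SendEmailReport; // hypothetical job class for illustration

// Spread the report jobs across a 15-minute (900-second) window so they
// don't all hit the database at the same moment.
foreach ($reports as $report) {
    SendEmailReport::dispatch($report)
        ->delay(now()->addSeconds(random_int(0, 900)));
}
```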
The second thing I did was to use Laravel's native rate limiter to define how many email reports could run per minute. I opted for 40 email reports per minute. Again, this worked until it didn't, and then we saw our email reports fighting our other database workloads. The email report jobs wanted their queries executed ASAP, which I appreciate, but ingest and other areas were also demanding CPU.
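A minimal sketch of that kind of per-minute limit, using Laravel's RateLimiter facade and the RateLimited queue middleware (again, the job class name is an assumption):

```php
<?php

use Illuminate\Bus\Queueable;
use Illuminate\Cache\RateLimiting\Limit;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Queue\Middleware\RateLimited;
use Illuminate\Support\Facades\RateLimiter;

// In a service provider's boot() method: allow 40 jobs per minute.
RateLimiter::for('email-reports', fn () => Limit::perMinute(40));

// In the job class: attach the limiter as queue middleware. Jobs over
// the limit get released back onto the queue to retry later, instead of
// all querying the database at once.
class SendEmailReport implements ShouldQueue
{
    use Queueable;

    public function middleware(): array
    {
        return [new RateLimited('email-reports')];
    }
}
```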
The mistake
One day, I woke up to our dashboard running slowly. This made no sense because we're heavily over-provisioned for our typical workload. I opened the active queries window in our database portal, and the email reports were fighting for CPU, taking enormous amounts of time to execute. This was caused by the need to re-calculate the query plan cache at high concurrency following a database update and alongside our regular busy workload. This recipe for disaster made me realize how much tech debt we'd accumulated by not properly scaling our email reports system.
We jumped into action and even introduced a whole bunch of new metrics to our email report. After all, Paul (my cofounder) had redesigned the email report, so now was a good time to implement it.
The approach I chose was as follows:
- We would run the SendEmailReports(monthly) command as we already were, but generate roll-ups for every single site we provide analytics for and produce a "cache." One large query vs. multiple small queries.
- We would then dispatch the EmailReport notifications, which would now select from the "cache" we had already created. Much faster (see the sketch after this list).
- We ran the tests, and it was all beautiful.
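For concreteness, here's a rough sketch of that two-step shape. The table, columns, and date range are illustrative assumptions, not Fathom's actual schema:

```php
<?php

use App\Notifications\EmailReport; // hypothetical names for illustration
use Illuminate\Support\Facades\DB;

// $subscribers would come from the surrounding command; the period is
// last calendar month for a monthly report.
$start = now()->subMonth()->startOfMonth();
$end   = now()->subMonth()->endOfMonth();

// Step 1: one large roll-up query pre-aggregates stats for every site
// into a "cache" table.
DB::statement('
    INSERT INTO email_report_cache (site_id, period, pageviews, visitors)
    SELECT site_id, ?, COUNT(*), COUNT(DISTINCT visitor_id)
    FROM pageviews
    WHERE created_at BETWEEN ? AND ?
    GROUP BY site_id
', ['monthly', $start, $end]);

// Step 2: each notification reads its numbers from the pre-built cache
// instead of running its own heavy aggregation query.
foreach ($subscribers as $subscriber) {
    $subscriber->notify(new EmailReport('monthly'));
}
```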
We pushed it into production, and I ran the monthly email report. Our database immediately began to struggle. The query was fighting to grab all possible CPUs to deliver the query as fast as possible. We had to wait it out.
Once it had finished, I realized what I'd done wrong. Of course the database used all the CPU it could: I was running the query as an unrestricted user with access to 100% of the CPU and memory. I needed to fix this.
The other mistake
Our database software, SingleStore, has a feature called "Resource Governance," which was exactly what we needed. I created a new user called background-tasks and limited it to around 15-20% of the total CPU, meaning background tasks could never take our highest-priority workloads offline. I was pretty proud of myself and told the team we were sorted.
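On the application side, the only change needed is to point background work at the governed user. A minimal sketch of how that might look in Laravel's config/database.php; the connection names, env keys, and the singlestore driver (from the singlestoredb-laravel package) are assumptions, not Fathom's actual config:

```php
<?php

// config/database.php — a second connection that authenticates as the
// CPU-limited background-tasks user. The dashboard keeps the
// unrestricted default connection.
return [
    'connections' => [
        'singlestore' => [
            'driver'   => 'singlestore',
            'host'     => env('DB_HOST'),
            'database' => env('DB_DATABASE'),
            'username' => env('DB_USERNAME'),
            'password' => env('DB_PASSWORD'),
        ],
        'singlestore-background' => [
            'driver'   => 'singlestore',
            'host'     => env('DB_HOST'),
            'database' => env('DB_DATABASE'),
            'username' => env('DB_BG_USERNAME'), // the governed user
            'password' => env('DB_BG_PASSWORD'),
        ],
    ],
];
```

A job can then opt in with DB::connection('singlestore-background'), keeping the heavy report queries inside the governed resource limits.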
Fast-forward to the next email report, which was the following Monday, since we also deliver weekly email reports. Everything ran as smooth as butter. We send fewer weekly email reports than monthly ones, but the smooth run still left me confident in the new system. Case closed, problem solved, and back to some fun work.
We hit March 1st, 2024; it was time to send monthly email reports.
But the monthly email reports weren't sending...
An email from Sentry arrived containing the following error:
App\Jobs\SendEmailReports has timed out. It will be retried again.
The base job, which generates our entire cache and dispatches all email reports, was timing out. AWS Lambda has a maximum execution time of 15 minutes, and Jeff Bezos wasn't returning my calls. So, what were we supposed to do now?
Chunk it up, you've got to chunk it up
After working through a few ideas and getting nowhere, I was lying in bed with my wife, complaining about the situation.
I'd been juggling multiple things for days while trying to solve this issue, and I'd yet to get anywhere. I didn't want to create an EC2 server or a Fargate container for a single Laravel command; that felt ridiculous. Then my wife said, "Is there not a way you can spread it out again?" and it hit me. The only reliable way forward was to return to what we had been doing initially.
On March 6th, 2024, I changed everything, removing the initial cache generation and moving towards running multiple queries for every single email report. I didn't fully understand where I was going with this, but I knew I had to get into motion and out of my head. I figured I would simply bring the rate limiter back and let the jobs run repeatedly for hours, retrying as needed. This would be better than DDoSing our database with heavy analytics queries.
Joe Dixon is a hero
As I finished the rewrite, a memory of a blog post popped into my head. I was sure that Laravel Vapor had introduced a way to isolate queue workloads. Well, it turns out I was right; on January 30th, 2023, Joe Dixon announced Individual Queue Concurrency on the Laravel Vapor blog. This meant that one of Laravel Vapor's main limitations was now solved and had been for over a year.
We could introduce an isolated queue, with the concurrency limited to a sensible amount, and put every single email report in there. Then, the jobs would sit there, waiting their turn, with zero risk of overloading our database. This was everything we needed!
The final set-up
We set up a new queue called email-reports with a concurrency of 5. My logic was that a small concurrency limit doesn't matter because most email reports are rapid and easy, and it isn't a "per minute" limit; it's a concurrency limit, so workers move on to other jobs right away.
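On the application side, that just means dispatching every report onto the dedicated queue; the concurrency cap itself lives in the Vapor environment configuration (per the Individual Queue Concurrency feature above). A minimal sketch, with the same hypothetical job class as before:

```php
<?php

use App\Jobs\SendEmailReport; // hypothetical job class for illustration

// Every report job lands on the isolated queue. Vapor caps that queue at
// a concurrency of 5, so at most five reports touch the database at once
// while the default queue keeps processing everything else at full speed.
foreach ($reports as $report) {
    SendEmailReport::dispatch($report)->onQueue('email-reports');
}
```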
We kept the resource governor in place (thanks, SingleStore!) and combined it with the new isolated queue system… and it was beautiful.
We'd previously had email reports running for hours, causing all kinds of lag on our dashboard, but the new monthly email report set-up ran in under 20 minutes. I was stunned.
This solution will comfortably scale to millions of email reports, and I can spend my time working on more important things.
P.S. I'd like to give a huge thank you to Luis Neves, a staff engineer at SingleStore who went out of his way to email me some tips about using the resource governor. We now utilize this feature in multiple application areas, which makes me very happy; thank you, Luis!