What happened to our infrastructure when a customer got over 10 million page views in a few hours?
Yesterday was an exhausting day. I woke up at 6 AM to a 1.5 million job backlog in our queue and immediately jumped out of bed. Now, this has happened before, so I no longer stress about it, and I got on my computer.
The main detriment here was that people couldn’t see their current visitors & there was a delay in their dashboard stats. One of our big positioning points is that, unlike various competitors, we don’t run on cheap VPS hardware, we run on serverless infrastructure (Lambda, SQS, etc.) via Laravel Vapor. So we can handle floods of extra traffic without downtime. In this scenario, it was the equivalent of an extra 1 billion page views per month, which is a lot.
One of our customers, who were no strangers to viral content, had launched a new project that had gone unbelievably viral. In addition to their primary property, they were now receiving tens of millions of views in a very short time, causing the queue to pile up.
But why was the queue piling up? We’d built Fathom to handle infinite scale, running on serverless infrastructure, auto-scaling as needed. This didn’t make any sense. It should be processing the jobs without a hiccup.
We need more soldiers
One of the things we do in our configuration file is to limit the number of concurrent background workers. We do this intentionally because we’d rather accumulate a big backlog in our queue than have too much load fired at our database.
So my first instinct was “Oh, we need to increase the workers to clear out this queue”. And that’s exactly what I did for the first few hours. I doubled it, then doubled it again, and it was working through the backlog. But suddenly our database CPU was running at 97%.
I immediately scaled-down the background workers, as this was the highest I’d ever seen the database CPU at. After doing this, things did calm down but our backlog started increasing again.
Time to bring in my cofounder
Fathom Analytics has two founders, Paul Jarvis (who handles design) and myself (who handles the code/servers). So why on earth would I bring in a designer to help me with debugging server issues? Because he thinks differently to me and knows enough to have a stab at ideas. So we jumped on a call, I explained the problems and we talked through possible solutions. We were reluctant to upgrade the database as we didn’t know if it would include downtime (spoiler: it doesn’t).
One idea Paul came up with was to provision a high spec database that could accept all of the viral customer traffic, leaving thousands of others running on the main database. This idea was solid, and a great band-aid, but we would’ve then had to find a way to clean up (process) the data in this new database and it seemed too complex.
We eventually settled on the idea of rate-limiting. We would allow 500 page views to be processed from the queue per minute per site and postpone any excess. This would completely remove the blocking issue we had, where all of our customers were drowned out by a single site. Imagine that. We have thousands of sites on Fathom, some of them insanely popular, regularly going viral, and this one viral site blocked everyone. Never seen anything like it. But I digress.
We implemented rate-limiting and suddenly we were starting to see some of the blocked jobs being processed. We celebrated… for 30 minutes, and then we took a look at the size of the backlog. We had a compounding problem. The queue was only allowing the viral site 500 page views per minute, and it was doing so much more than that. So we removed the rate-limiting and returned to the drawing board.
We’re running out of time
As we hit midday in my part of the world, I was very aware that we were running out of time. We use a GDPR compliant hashing mechanism to track uniques (which we pioneered) and we only keep hashes until Midnight UTC. So we had to get this queue processed before midnight else we would have to somehow copy the hashes for an extra day to keep uniques correct, which we just didn’t want to do. This was our worst-case scenario. But what else could we try?
I had been messaging my buddy Chris Fidao throughout the day with thoughts and hoping he’d reply. He has a full-time job so obviously, he’s not going to be around all day but eventually, he replied to me. We spoke about a few different ideas, and after talking about what the various graphs in the database were showing, he realized that we were maxing out our IOPS. Now for those of you who don’t know, when you provision an RDS instance in AWS, you select how much storage you want for your SSD disk. If you choose a disk size of 100GB, you are given 300 IOPS. Now I’m not an expert on IOPS but it stands for Input/output operations per second onto the filesystem. It’s not as simple as 1 query = 1 I/O, but I have no idea how to calculate it. Anyway, this seemed to be the problem, so we decided to change things.
Throwing money at the problem
One of the reasons why Fathom isn’t a budget analytics provider is because we know we need contingency in our budget. And guess what? We needed that contingency for this problem, as we would rather spend a bunch of cash on solving the problem so that our customers wouldn’t be affected any longer.
We upgraded to 1000GB of storage, which would 10x our IOPS, allowing the jobs to run fast and get it all written to the database. We’d never done this before and had no idea what to expect. Would there be minor downtime? Would there be performance degradation? We didn’t know. But this was the only choice we had left.
We pulled the trigger and, after about an hour of waiting, we had increased IOPS. Time to increase the background workers and attack that backlog, that was now way past 2 million jobs.
Mama, I'm Coming Home
My first course of action was to ramp up the queue workers again now that I was confident that our IOPS could take it. And what do you know? Our backlog was coming down at a rapid speed. I was thanking Chris, and Paul was sending him limited edition Fathom cat hoodies to say thank you.
As we approached zero, I felt good about the world. I hugged my dog and re-enabled the aggregation systems. Everything was looking great. Oh, what a day we’ve had, time to relax.
A false sense of security
After going through all that drama, it was 7 PM and it was time to wind down, put my toddler to bed and go to sleep. We were into the next day (in the UTC timezone) so I was no longer worried about the risk of uniques with hashes.
Around 9 PM, I went into my office with plans to watch some YouTube videos & reply to some support tickets. I checked the queue and the backlog was back. This time it was at 70,000. And I get it, this may not seem a lot to some of you but we never see more than 20 jobs in our backlog, as they get processed unbelievably fast and we have adequate queue workers provisioned. So seeing a backlog of 70,000 was a bad sign.
I went straight to our database to check the CPU, and it was at 99%. Are you kidding me? How could this be happening? The lack of IOPS was the problem. That was only part of the problem.
I took a look at the database processes to try and get an idea about what was going on, and I could see a whole bunch of hanging queries that were running to select current visitors. This was the downside of how current visitors are designed because it reads off of the “aggregation backlog” (note: This is separate from the queue backlog, and consists of ‘processed views’ that are ready to aggregate aka roll-up). So I killed all of those hanging queries.
It was 10 PM and I wanted a quick fix. After speaking with Paul, I figured the database just couldn’t handle the query due to power. After all, the pageviews table that the current visitors' script uses has sufficient indexes, so it shouldn’t be taking 2 seconds for smaller sites and 40 seconds for bigger ones.
I doubled the specs of the database. It took around 1 hour to complete and we had 0-2 seconds of downtime (God bless failovers). Our MySQL database server now had 16 vCPUs and 64 GiB of Memory. I was expecting everything to become much faster and for our performance to improve. After all, it had more memory and CPU, and IOPS were way up.
Looking in the wrong places
Within 30 minutes, the CPU was up at 50%. What the hell? I checked the server processes and I noticed that a few select queries, that were run constantly, were taking multiple seconds. It was the queries that pulled data out of the backlog_countries, backlog_browsers table. Interesting. I took a look at the tables and, are you serious, I hadn’t added indexes to the backlog tables. There was a recent addition and, until now, had never become backlogged, so I hadn’t even considered adding an index, since they are typically processed within minutes. Well, because there were now millions of rows in each of them, queries had become much slower. So I added indexes and the common queries went from 2 seconds to 60 milliseconds. Fantastic news.
Fast forward to 12:15 AM, the CPU was still having challenges. The aggregation was running very fast and was on its way to “catch-up” after a day of craziness, so it wasn't causing the CPU issues. I hopped into the database and decided to test our current visitors' query, to make sure it is much faster after the upgrade.
I ran a query to select current visitors for the main Fathom Analytics website. 2 seconds to run. What the hell? How can such a tiny sample of data take that long to count / group, that doesn’t make sense.
I checked the index and could see that we had a perfect index in place, and the query was hitting it. I wondered if I needed the hostname & pathname in the index, but I saw that I couldn’t do that due to them being longtext. Wait, why is site_id a Varchar?
OH MY GOD.
There was a hangover from the past where we would put the site “code” into the pageviews table, and not the ID. We switched to a numeric ID and I had completely forgotten to change the column type to Unsigned Big Integer. I made the change to the table structure and that 2-second query became 58ms. I then ran the same query on the viral client's data and it came back so much faster. And that fixed the lagging current visitors problem. It’s funny because we’d been seeing some timeouts/connection issues for current visitors now and then but I had no idea I had put the site_id as a string, causing much slower queries.
After making these changes, the CPU was sitting comfortably at 6%. I sent a message to Paul “Man, these days are hard but the challenge is great.”. We put a banner on the dashboard letting people know that there was a slight lag, and then I went to bed.
The next day
I woke up the next morning and immediately checked the Fathom dashboard on my phone. I had instant dopamine as I saw that we had current visitors, meaning there was no backlog in the queue. Fantastic news. I checked the aggregation backlog and it was 700 entries long, and the queue backlog was empty. Fantastic. And here I am writing this article. The CPU has been sitting comfortably at 7% CPU since the changes at midnight, and customers are happy.
Will this happen again?
For the nerds reading this, the most obvious question is “What will you do when this happens again? When 3000 IOPS won’t be enough? How will you monitor it?”. And these are great questions.
We’ve had a plan in place for months to avoid situations like this. When a typical customer goes viral you’re probably dealing with a few million page views, maybe tens of millions over a day, but this was a whole new level of viral. But anyway, here’s what’s changing over the next few months:
1. We’re moving from MySQL to DynamoDB
I don’t like servers. I don’t want to have to worry about IOPS, slow scaling and all that nonsense. The fixed MySQL instance was one of the only things in our infrastructure that isn’t fully serverless, and I don’t like that. After reading Alex DeBrie’s incredible DynamoDB Book, I’ve decided that we’re going to completely nuke MySQL and move to On-Demand DynamoDB. This means that we can scale infinitely, get charged on what we use (yes there’s a premium for that benefit), and not have to worry about IOPS.
2. Stats are getting moved to a managed Elasticsearch service
After speaking with Elasticsearch genius, Peter Steenbergen, we’ve decided to go all-in on Elasticsearch. Before speaking with Peter, I had been looking at the Elastic.co managed service, and I was getting excited about it, especially with the option to have 3 availability zones (aka overkill). Peter showed me that scaling can be done quickly with zero downtime, various index roll-ups, and demoed a whole bunch of examples for my specific use case. I was already sold on Elasticsearch, which I’ve used before, but he was able to fill in so many gaps and show me so many things I didn’t know about.
Before you ask me “Have you thought about ABC?”. No, I haven’t. In the same way that when I chose MySQL for our database, I didn’t think about PostgreSQL, Oracle, MariaDB, or SQL Server. I like Elasticsearch, I have used it before, it’s very mature and it’s more than fast enough for our use case. I know that other options exist but I’m comfortable with Elasticsearch.
3. We’re going to be introducing Pusher
Another exciting change that we’ve not spoken about much before is the move to Pusher. At present we use a really old school way of doing things, where we poll via AJAX and query the database. Very inefficient and not great at scale. It’s done the job but we’re moving onto WebSockets for Version 3. This will mean that instead of queuing up a pageview & reading from the database, we will send data (stripped of any PII, of course) to the pusher and then the WebSocket will read it back. That’ll be great.
Overall, the day went well. I have been in high-pressure situations before. When I was younger, I was in a perceived “crazy high stress” situation, and I remember saying to myself “Remember this moment, it will pass” and from that moment, I respond to every situation with “This too shall pass”.
Also, remember that you won’t always know the answer to problems. Yesterday morning I had no clue. I saw no path. But I knew I had to try. All you need is the desire and a basic level of competence. Plus I’m friends with Chris, who is a server genius.
We’ve also created a status page, so our customers can now see what’s up the odd time that something is delayed.
Let me know on Twitter if this article was interesting, and I’ll make sure I share more about our experiences.