Let’s talk about Friday and the Cloudflare outage
On Friday 17th July, the internet collapsed for 27 minutes. Okay, maybe that’s a bit dramatic, but a huge amount of the internet went offline and Cloudflare was to blame. Credit where it’s due: whenever Cloudflare has issues, they publish a transparent blog post and explain how they’re going to fix it. The same can’t be said about a lot of software companies.
In total, the outage lasted less than half an hour, and a huge portion of the internet was completely unavailable. Normally, Cloudflare outages are a big popcorn moment for me, because they don’t affect us. But now that we offer uptime monitoring, this ended up being the craziest 27 minutes I’ve had for a long time. It wasn’t bad for everyone, and my cofounder Paul was enjoying his 5 minutes of fame (with his tweet that went viral-ish).
How much did the Cloudflare outage cost us?
One of the main purposes of uptime monitoring is to alert our customers. And we did. Multiple times. On and off again for half an hour. Here’s our cost breakdown:
- 8,350 emails - $12.50
- 4,000 SMS messages - $155
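For anyone curious about the per-message numbers, here is a quick back-of-the-envelope sketch in Python. It only uses the two line items listed above; the variable and function names are made up for illustration.

```python
# Figures copied from the cost breakdown above.
emails_sent, email_cost = 8_350, 12.50
sms_sent, sms_cost = 4_000, 155.00


def per_message(total_cost: float, count: int) -> float:
    """Average cost of a single message, in dollars."""
    return total_cost / count


# Combined cost of the two listed channels only.
total = email_cost + sms_cost

print(f"Email: ${per_message(email_cost, emails_sent):.4f} each")
print(f"SMS:   ${per_message(sms_cost, sms_sent):.4f} each")
print(f"Total: ${total:.2f}")
```

In other words, SMS was by far the expensive channel: fewer than half as many messages, but more than ten times the cost.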
And then we also sent Telegram / Slack / Discord notifications. Discord was offline, so it was impossible to deliver webhooks there. When it was all happening, I said to Paul: “Look, let’s expect this to cost us $1,000, then anything less is a bonus”. Luckily it wasn’t too bad.
Did we send too many notifications?
Absolutely we did. I personally received about 29 SMS messages, and a whole bunch of Telegram notifications. It was great to see that our uptime monitoring was working as expected, and that our service remained online and stable, but we sent an overwhelming number of notifications. Yes, we want you to know when your websites are offline, but clearly you don’t need to receive 20 notifications in such a short time frame.
What have we changed?
Whilst it was great that our service worked so fantastically, we've made some essential changes to our uptime monitoring feature. PingPing (our uptime monitoring service provider) worked unbelievably hard over the weekend to build new functionality for us, so a huge thank you to them.
We now have multi-region checks. This means that we will not send you notifications unless two different regions confirm that your website is down. This was something that a handful of customers requested, and it’s a big win for our service.
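To make the idea concrete, here is a minimal sketch of the "two regions must agree" rule. This is not PingPing's actual code; the function name, the region names, and the shape of the input are all assumptions for illustration.

```python
from typing import Mapping


def should_notify(region_is_up: Mapping[str, bool], required_failures: int = 2) -> bool:
    """Return True only when enough regions agree the site is down.

    region_is_up maps a (hypothetical) region name to True if the check
    from that region succeeded, False if it failed.
    """
    failures = sum(1 for is_up in region_is_up.values() if not is_up)
    return failures >= required_failures


# One region failing on its own is treated as a possible false alarm.
print(should_notify({"eu-west": False, "us-east": True}))
# Two regions failing counts as confirmed downtime.
print(should_notify({"eu-west": False, "us-east": False}))
```

The point of the design is that a network blip between one checker and your site no longer triggers an alert by itself.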
We’ve added a “How long should your site be offline before we notify you?” option. The idea here is that you might not care about 15 seconds of downtime, or even 30 seconds, but you do want to know if your website has been offline for 1 minute. We’ve updated all existing monitors to 1 minute, which we consider to be the best option as it allows for blips, but you are free to change that.
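Roughly speaking, the delay works like the sketch below. The function and parameter names are hypothetical, and the real implementation may track downtime differently; the only thing taken from above is the 1 minute default.

```python
from datetime import datetime, timedelta
from typing import Optional


def outage_exceeds_delay(
    down_since: Optional[datetime],
    now: datetime,
    notify_after: timedelta = timedelta(minutes=1),  # default from the post
) -> bool:
    """True once the site has been continuously down for at least notify_after."""
    if down_since is None:
        # The site is currently up, so there is nothing to report.
        return False
    return now - down_since >= notify_after


went_down = datetime(2020, 7, 17, 21, 0, 0)  # illustrative timestamp
print(outage_exceeds_delay(went_down, went_down + timedelta(seconds=30)))
print(outage_exceeds_delay(went_down, went_down + timedelta(seconds=60)))
```

A 30 second blip stays quiet, while an outage that reaches the full minute triggers a notification.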
Uptime check frequency option
We’ve added a “Check interval” option. For some of you, every 30 seconds is just too frequent, and you’re finding that your servers aren’t enjoying the extra load. Not a problem, you can now choose your interval per monitor.
We have implemented notification throttling so that we won’t send you a whole bunch of notifications at once in the future. The throttle allows a maximum of 4 notifications per hour. We believe this is adequate for most people. This is a starting point though, and we’ll adapt based on feedback.
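Here is one way a throttle like this could look. It is a sketch, not the production code: the class name is invented, and it assumes a rolling one-hour window tracked per site, which may not match how the real limit is counted.

```python
from collections import defaultdict, deque


class NotificationThrottle:
    """Allow at most max_per_window notifications per site in a rolling window."""

    def __init__(self, max_per_window: int = 4, window_seconds: float = 3600):
        self.max_per_window = max_per_window
        self.window_seconds = window_seconds
        # site -> timestamps (in seconds) of notifications already sent
        self._sent = defaultdict(deque)

    def allow(self, site: str, now: float) -> bool:
        """Record and permit a notification if the site is under its limit."""
        sent = self._sent[site]
        # Forget notifications that have aged out of the window.
        while sent and now - sent[0] >= self.window_seconds:
            sent.popleft()
        if len(sent) >= self.max_per_window:
            return False
        sent.append(now)
        return True


throttle = NotificationThrottle()
# Five alerts for the same site, one minute apart: the fifth is suppressed.
print([throttle.allow("example.com", t) for t in (0, 60, 120, 180, 240)])
```

Each site gets its own counter, so one noisy monitor doesn't use up the allowance for the others.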
With this cocktail of changes, we’re going to see a huge improvement to our service. Realistically, the probability of this kind of thing happening again is low, but we still need to prepare for it. If Cloudflare does go offline again, you’ll receive a maximum of 4 notifications per site per hour, and we will honor your delay settings.
So we hope you love the new changes and, as always, we’d love to hear any feedback you have. Let us know on Twitter.