Migrating a 2TB database in 7.5 minutes
Jack Ellis · Jun 5, 2023

In June 2023, we had a 2TB+ database with over 17 billion rows, and we needed to move it ASAP. The goal of the migration was to have zero seconds of downtime on our ingest (where we collect pageviews & events) and under 10 minutes of downtime for our dashboard & API.
But did we succeed?
Why are we doing another migration?
Before we get into things, let’s discuss why we’re migrating. After all, aren’t we happy with our current database provider (SingleStore)? Yes, but we were on their “old” managed service (“Cluster”). The new managed service (“Cloud”) is where active development happens, and we’d been advised to move there. Moving also meant we’d get a new layer of data redundancy for free, because all local disk operations would be synced to S3 (I have no idea how this works, don’t even ask, but I love it).
And the final WHY to address… Why on earth are we doing this ourselves? That’s a great question. The SingleStore support team had offered to perform the migration, but they told us it would mean multiple hours of downtime. That wasn’t acceptable to us. All of our data lives in a single database cluster, so the migration would take our entire application offline, meaning analytics would not be collected for customers. At that moment, we knew we’d have to do the migration ourselves.
Mental preparation
I always try to manage my psychology at the start of a migration. I don't want to feel too nervous or too confident; I want to be right in the middle of those feelings. Of course, you need confidence to handle things like this, but you also want a healthy amount of nerves so that experience and ego don't cause you to overlook something important.
This migration differed from the migrations I'd done before, where I was dealing with hundreds of millions of rows. This time I was dealing with billions, and that's a whole different game.
I would only succeed with this migration if I broke it into small parts. If I tried to view it as a single "whole piece of work," I would never complete it. And please don't get it twisted: I hit walls when planning the migration; I'm human. But I also know that the way to break through a wall is to always aim for smaller wins.
The past won’t help you here
As I began to think about the database migration, my brain felt like a clean slate. This was both a good feeling and a frustrating one. I needed to complete the whole migration within seven days and had yet to make a plan. So I read a previous migration blog post for inspiration. And fortunately for me, the author had shared his code and notes, and I was going to copy that exact process.
I started copying the code and experimenting, but we soon realized that this approach would only work for small amounts of data (under one billion rows). Otherwise, we would be here for weeks, trying to avoid overloading the source/target databases, spacing out the work and eating popcorn. We had far too much data. What a great problem to have!
How do we move this data?
Our tables can't be treated as equals within our database, and we have to migrate them in very different ways. For example, our Users table can be hit by INSERT, UPDATE, SELECT, and DELETE 24/7. It's not realistically possible for us, with our current set-up, to block writes to that table without application downtime. But if you look at our Pageviews table, we can easily modify things to make it append-only, with zero updates and zero deletes, making it easy to migrate without downtime.
Funny story, though: we did come up with a way to migrate all of the tables handling non-OLAP workloads without downtime. Here's the original plan we cooked up:
- Centralize our persistence into a repository or similar
- Add an atomic lock for all write operations to a table
- Write a migration command that loops through the tables, obtaining a lock per table. The lock would temporarily block the repository from writing to that table, and it would be released once the table had finished migrating; the repository’s atomic lock would have a long enough “wait time” to cover this
- You would then also need a dynamic “connections” table, where you’d keep either “old” or “new” for each table, so your repository could switch connections dynamically based on which tables had been migrated and which hadn’t
- You’d then also need to “group migrate” tables. For example, you can’t migrate “subscriptions” without migrating “users” because then your joins and transactions would break
We came close to building this solution, but the whole "group migrate" stuff made us kill this idea.
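Purely for illustration (we never shipped this), here's a minimal sketch of the dynamic "connections" table from that plan; the table name, columns, and values are hypothetical:

```sql
-- Hypothetical bookkeeping table: which database each application table currently lives on.
-- The repository would check this before writing and pick the "old" or "new" connection.
CREATE TABLE migration_connections (
    table_name VARCHAR(64) NOT NULL PRIMARY KEY,
    connection VARCHAR(3) NOT NULL DEFAULT 'old'  -- 'old' or 'new'
);

-- When a group of tables finishes migrating, the migration command flips them over together
UPDATE migration_connections SET connection = 'new' WHERE table_name IN ('users', 'subscriptions');
```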
Don’t touch our analytics
As I said above, migrating our analytics is easy if you make it immutable. So that was step one. Our analytics data is append-only. Yes, even updates. For example, if we want to set a session's bounce rate to 0, we insert a -1 to undo the "assumed bounce" of 1 on the first pageview. So our pageviews & events tables were already completely append-only. There was no way to edit the rows from other areas of the application either, so we were rolling. But wait, what about people importing their Google Analytics data? What about people wiping data from sites, or deleting sites entirely? The tables weren't yet immutable.
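To make "append-only updates" concrete, here's a hedged sketch of the correction pattern; the column list is abbreviated and the values are placeholders, not our real schema:

```sql
-- First pageview of a session: we assume a bounce (bounce_rate = 1)
INSERT INTO pageviews (site_id, `timestamp`, user_signature, pageviews, bounce_rate)
VALUES (123, '2023-06-10 12:00:00', 'abc123', 1, 1);

-- A second pageview arrives: rather than UPDATE-ing the first row, the new row carries
-- bounce_rate = -1, so SUM(bounce_rate) for the session nets out to 0
INSERT INTO pageviews (site_id, `timestamp`, user_signature, pageviews, bounce_rate)
VALUES (123, '2023-06-10 12:01:00', 'abc123', 1, -1);
```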
With this in mind, we looked to see what we had to do to make these tables immutable:
- We would need to disable our Google Analytics Importer for a few days temporarily
- We would have to remove the ability to wipe data from sites (note: we spent half a day building a "query queueing" solution, which would then be re-run against the pageviews table once the migration was done. This would allow us to keep import, wipe, etc., online but my friend Carl Sverre advised me against this; thanks Carl)
- We would need to turn off our "DeleteSites" background task, which took a soft deleted site and wiped all of the pageviews & events
This compromise was perfect for us. We are big fans of disabling functionality that isn’t “time critical” when performing maintenance. People don’t care. And if you add a notice on the page and say, “Yeah, we’re migrating our database right now and will be back by X of June,” nobody complains (and literally no one did!). We could’ve chosen to “queue the imports” or similar, but we didn’t want to create extra work for ourselves.
It took us under an hour to temporarily turn off various application pieces. Then, boom, our pageviews & events tables were immutable, ready to move.
The Migration
Now let's get into the migration. If you haven't had your coffee yet, I need you to go and grab one before we get into this next section. I'm going to break it up into phases that match the phases of the actual migration, so you'll be following the exact path I took.
Note: When I refer to OLTP data, I am talking about our Users, Sites, Subscriptions, etc., tables with OLTP workloads. When I say OLAP data, I am talking about our analytics tables (pageviews, events, etc.)
Phase one: Backup and restore
Our pageviews table was immutable. Well, kind of. Data could be inserted into the tables, but once it was there, it couldn't be changed or deleted (for now). Here were the steps I took next:
- We backed up the entire production database to S3. This is a fully online process that doesn't impact the day-to-day workload
- We then ran the following query on the production database to get an idea of how many records we had: `SELECT COUNT(*) FROM pageviews WHERE timestamp < '2023-06-16 00:00:00'`. You'll notice that we set a timestamp "cut-off point." This was a stable point: the pageview count before it wouldn't change, only after, and it covered the bulk of our pageviews. The total number of rows in this table was 17,237,198,280
- Once we'd done that, we were ready to restore the backup to the new database server
- Once that was restored, we ran the above query on our new database server, and the count matched
So far, so good; the database has been cloned from the old (current) database server to the new server.
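For reference, the backup and restore steps above look roughly like this in SingleStore SQL; the bucket, path, region, and credentials are placeholders, not our real configuration:

```sql
-- On the old cluster: a fully online backup straight to S3
BACKUP DATABASE source_database TO S3 "example-bucket/fathom-migration"
    CONFIG '{"region": "us-east-1"}'
    CREDENTIALS '{"aws_access_key_id": "...", "aws_secret_access_key": "..."}';

-- On the new cluster: restore that backup
RESTORE DATABASE source_database FROM S3 "example-bucket/fathom-migration"
    CONFIG '{"region": "us-east-1"}'
    CREDENTIALS '{"aws_access_key_id": "...", "aws_secret_access_key": "..."}';

-- Run on both clusters: the counts up to the cut-off point should match
SELECT COUNT(*) FROM pageviews WHERE `timestamp` < '2023-06-16 00:00:00';
```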
Phase two: Our new tables
Now we had to re-create the database to take advantage of bottomless storage (remember, the unlimited storage feature backed by S3) and enable flexible parallelism, a feature in our database software that allows for more flexible CPU utilization across partitions. I won't get into the details here, but I want to clarify that we couldn't just copy & restore the database; we had to do additional work. And since we had to do this re-creation, now was the time to re-structure things, so here's what we did:
- Created a second new database on the new server (this would be our new database)
- Re-created all the database tables, reviewing the schemas for inefficiencies (e.g. accidental INTs or BIGINTs when we could've used TINYINT)
- Re-created the pageviews table, sharding on the user_signature column (our anonymized session ID) and converting the field type to BINARY(32). We previously used VARCHAR(256) to store a SHA-256 hash, and it took up so much space. Also, joining on VARCHAR is slow; joining on BINARY is much faster. Huge shout out to Carl Sverre and Aaron Francis for their advice on this
- Re-created the events table, sharding on user_signature, doing the same binary(32) change, and switching all strings to TEXT to allow larger values here
Brilliant, our analytics tables were ready, and the expectation was that they’d now be much faster and we’d reduce storage space significantly. Our current database storage space was just over 2 TB.
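Here's a hedged sketch of what the re-created pageviews table could look like, using SingleStore's SHARD KEY syntax; the column list is abbreviated, the non-signature types are guesses, and the SORT KEY is my assumption rather than something stated above:

```sql
CREATE TABLE pageviews (
    client_id BIGINT,
    site_id BIGINT,
    `timestamp` DATETIME NOT NULL,
    user_signature BINARY(32) NOT NULL,  -- previously VARCHAR(256); SHA-256 stored as raw bytes
    pageviews TINYINT,
    bounce_rate TINYINT,                 -- allows the -1 correction rows
    -- ...remaining columns (hostname, pathname, referrer_*, utm_*, etc.) omitted...
    SHARD KEY (user_signature),          -- distribute rows by anonymized session ID
    SORT KEY (`timestamp`)               -- keep the columnstore ordered by time
);
```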
Phase three: Moving the pageviews
At this point, we had two databases sitting on our new database server. We had database one, a restored backup of the production database, and database two, which held our new tables, but was empty. It was time to do something about that.
The beauty here was that our database software allows us to run “cross-database” queries within a single database server. So we could insert into database two from database one. Lovely. And, hey, you’ll be pleased to see that I kept track of the query durations in the comments.
An important note for all of the queries is that we had to specify the fields since we wanted to UNHEX the user_signature to convert it to binary (remember, we changed that field to binary(32) above).
We ran the queries one by one; here's the first batch in full:

```sql
-- For 2018 to May 2021 (just over 2 hours; the connection closed, but the query kept running!)
INSERT INTO production.pageviews
    (client_id, site_id, `timestamp`, user_signature, pageviews, visits, known_durations,
     total_seconds, bounce_rate, sessions, hostname, pathname, uniques, exits,
     referrer_hostname, referrer_pathname, referrer_group, referrer_type, browser,
     country_code, device_type, utm_campaign, utm_content, utm_medium, utm_source,
     utm_term, keyword, q, ref, s, merged_from, meta_source, type)
SELECT
    client_id, site_id, `timestamp`, UNHEX(user_signature), pageviews, visits, known_durations,
    total_seconds, bounce_rate, sessions, hostname, pathname, uniques, exits,
    referrer_hostname, referrer_pathname, referrer_group, referrer_type, browser,
    country_code, device_type, utm_campaign, utm_content, utm_medium, utm_source,
    utm_term, keyword, q, ref, s, merged_from, meta_source, type
FROM source_database.pageviews
WHERE `timestamp` < '2021-05-01 00:00:00';
```

The remaining batches were exactly the same statement with a different `WHERE` clause, stepping through the rest of the data roughly four months at a time:

- May to Aug 2021 (`>= '2021-05-01'`, `< '2021-09-01'`): 611 million rows in 13 minutes
- Sep to Dec 2021 (`>= '2021-09-01'`, `< '2022-01-01'`): 1.2 billion rows in 22 minutes
- Jan to Apr 2022 (`>= '2022-01-01'`, `< '2022-05-01'`): 1.8 billion rows in 35 minutes
- May to Aug 2022 (`>= '2022-05-01'`, `< '2022-09-01'`): 2.8 billion rows in 50 minutes
- Sep to Dec 2022 (`>= '2022-09-01'`, `< '2023-01-01'`): 3.4 billion rows in 1 hour
- Jan to Apr 2023 (`>= '2023-01-01'`, `< '2023-05-01'`): 4.5 billion rows in 1 hour 30 minutes
- May to Jun 15, 2023 (`>= '2023-05-01'`, `< '2023-06-16'`): 1.9 billion rows in 34 minutes
Fantastic. All the pageviews up until 2023-06-16 00:00:00 had been moved across. To make sure we had copied the same number of rows, we ran `SELECT COUNT(*) FROM pageviews WHERE timestamp < '2023-06-16 00:00:00'` on both databases. I also ran a few other queries, such as some random SUMs and GROUP BYs, to make myself feel better about the integrity of the data.
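Those extra checks were nothing fancy; here are hedged examples of the kind of thing I ran on both the old and new tables (site 123 and the date range are placeholders):

```sql
-- The row count up to the cut-off point must match exactly on both databases
SELECT COUNT(*) FROM pageviews WHERE `timestamp` < '2023-06-16 00:00:00';

-- Spot-check aggregates for a single site and month; these should also match on both sides
SELECT site_id, SUM(pageviews), SUM(visits), SUM(bounce_rate)
FROM pageviews
WHERE site_id = 123
  AND `timestamp` >= '2023-01-01 00:00:00'
  AND `timestamp` < '2023-02-01 00:00:00'
GROUP BY site_id;
```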
Phase four: Moving events
It was then time to move the events. This was hilariously easy, as we only had around 320 million events to move. The query is pretty much the same as above, but for events, so I won't include it here. I'm only including this phase because it's how I separated the work.
Phase five: Moving over unchanged entities
In phase five, we moved over any tables that wouldn't change. These were our old custom domains table, the Laravel migrations table, and our Google Analytics importer tables, since we'd turned that area off completely.
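Because these tables were frozen, each one was a straightforward cross-database copy on the new server. A hedged example using the Laravel migrations table (I'm assuming its schema was unchanged, so a plain `SELECT *` works):

```sql
-- Copy a frozen table wholesale from the restored backup into the new database
INSERT INTO production.migrations
SELECT * FROM source_database.migrations;
```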
Phase six: Preparing for the real thing
Now we were getting ready to switch over the database connection details. Our biggest priority was keeping our ingest online. Our ingest is where we collect pageviews & events, and we could not let it fall offline. So we had to get it into position.
- Double-check new connection details for the database
- Set up a new connection in Laravel, then deploy a hidden page into ingest to make sure the connection works in that environment
- We needed to test that our new database could accept pageview inserts and that they worked as expected. Our tests are solid, but we all know how production can mess about, even if you have solid staging. So I created new behaviour for our usefathom.com website to test it all (there's a sketch of the change after this list):
- Insert the pageview/event into the new database
- Use UNHEX on the user_signature since it's now being stored as binary
- Test that it works as expected
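Here's a hedged sketch of the change to the ingest write; a simplified query with placeholder values rather than our actual ingest code:

```sql
-- Before: user_signature arrived as a 64-character hex SHA-256 string and was inserted as-is.
-- After: the column is BINARY(32), so we UNHEX the hex digest on the way in.
INSERT INTO pageviews (site_id, `timestamp`, user_signature, pageviews, bounce_rate)
VALUES (
    123,
    NOW(),
    UNHEX('9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08'),
    1,
    1
);
```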
Everything worked as expected, so we were ready to roll it out to everyone. Gulp.
Phase seven: It’s happening
The most critical part of our business (ingest) was working perfectly, so we were ready to roll.
- We added a notice to the top of the dashboard saying we would be undergoing maintenance, so pageviews would be lagging temporarily. We would be switching over to the new database, and, if you remember, we only had data up to 2023-06-16 00:00:00, and we would be running this migration around Jun 18
- Created a custom maintenance mode page that detailed what was happening
- Deployed the changes to production
- Updated the ingest environment file but didn't deploy ingest yet (Vapor allows you to make changes that go live the next time you deploy)
- Put the application into maintenance mode. This locked everyone out of making any changes. The only database activity we'd have on our "old" database would be cache reads/writes and data reads. Nothing could be updated or inserted in there. Perfect!
- Tested that maintenance mode was actually on (yes, I did this; don't ever assume)
- Migrated all OLTP tables from our current database (important note: this was not from the backup we took; this was from the live, now-frozen database)
- Once we'd copied all of the data over, we had to sync the auto-increment values on all tables, so we ran `AGGREGATOR SYNC AUTO_INCREMENT ALL`
- Deployed the ingest environment, forcing ingest to use the new environment file details and rolling out the new changes to everyone
- We checked that ingest was working, and it was buzzing beautifully; all page views were now coming into our new database
- We updated the environment file for our dashboard/API to point to the new database and deployed it
- We updated the name of our Fathom website to "Fathom Analytics New" in our new database
- We brought the dashboard/API back online by disabling maintenance mode, and I immediately checked our website. As expected, it displayed "Fathom Analytics New," so I knew it was using the correct database (again, always check your assumptions)
The total downtime for our dashboard & API was ~7.5 minutes.
The total downtime for our ingest environment, where we collect pageviews and events, was 0.00 minutes.
Phase eight: Missing data
Okay, so at this point, we were missing the data between 2023-06-16 00:00:00 (our cut-off point) and the moment of the switchover. Since the old database was now dead, with no new pageviews/events coming in, a query for everything >= 2023-06-16 00:00:00 would grab all of the missing rows.
We took the following steps:
- Checked Sentry for bugs
- Ran a backup to S3 on the old database server (so we could get the pageviews >= 2023-06-16 00:00:00)
- Restored that backup to the new database server as a separate database
Okay, so now we needed to copy over the missing pageviews, which was easy.
```sql
-- Migrate missing pageviews
-- 114,357,840 rows copied in 2 minutes
INSERT INTO production.pageviews
    (client_id, site_id, `timestamp`, user_signature, pageviews, visits, known_durations,
     total_seconds, bounce_rate, sessions, hostname, pathname, uniques, exits,
     referrer_hostname, referrer_pathname, referrer_group, referrer_type, browser,
     country_code, device_type, utm_campaign, utm_content, utm_medium, utm_source,
     utm_term, keyword, q, ref, s, merged_from, meta_source, type)
SELECT
    client_id, site_id, `timestamp`, UNHEX(user_signature), pageviews, visits, known_durations,
    total_seconds, bounce_rate, sessions, hostname, pathname, uniques, exits,
    referrer_hostname, referrer_pathname, referrer_group, referrer_type, browser,
    country_code, device_type, utm_campaign, utm_content, utm_medium, utm_source,
    utm_term, keyword, q, ref, s, merged_from, meta_source, type
FROM final_backup.pageviews
WHERE `timestamp` >= '2023-06-16 00:00:00';

-- Migrate missing events
-- 3,513,238 rows copied in 11 seconds
INSERT INTO production.events
    (uuid, site_id, goal_id, `timestamp`, user_signature, conversions, unique_conversions,
     `value`, hostname, pathname, referrer_hostname, referrer_pathname, referrer_group,
     referrer_type, browser, country_code, device_type, utm_campaign, utm_content,
     utm_medium, utm_source, utm_term, keyword, q, ref, s, merged_from)
SELECT
    uuid, site_id, goal_id, `timestamp`, UNHEX(user_signature), conversions, unique_conversions,
    `value`, hostname, pathname, referrer_hostname, referrer_pathname, referrer_group,
    referrer_type, browser, country_code, device_type, utm_campaign, utm_content,
    utm_medium, utm_source, utm_term, keyword, q, ref, s, merged_from
FROM final_backup.events
WHERE `timestamp` >= '2023-06-16 00:00:00';

-- The rest of the missing OLAP data was migrated the same way, so I won't include it here
```
I kept a "final backup" of our old database in S3 in case I had made a mistake with the data and needed to reference it, but, honestly, everything had gone perfectly.
And we were now entirely switched over to our new database. Also, our database had shrunk from over 2TB to closer to 1TB thanks to the binary field change.
I’m tired of migrations
We are migrating database set-ups more than any company I know because of business changes, data growth, more users, etc. But the problem we have is that we need to re-shard our database as we increase in size to improve performance. This is fine when you run a purely OLAP workload, but we also have our OLTP in the same database. And as you can see from above, it was the migration of the OLTP data that caused us downtime. I can migrate OLAP data in my sleep.
For our OLTP data, we can't afford downtime. We may be a small company, but our goal is to provide the most reliable analytics service in the world. Sure, our ingest didn't go offline during this migration, but I spent so much time planning how we'd move OLTP data over.
We'll soon explore moving our entire OLTP workload to another... planet.
UPDATE (13th July 2023): After writing this article, I had a conversation with SingleStore about concerns with re-sharding the database, downtime for upgrading OLAP, and various other things. Long story short, I had misunderstood a few things, including how features such as point-in-time recovery work. I'll admit when I'm wrong, and I was wrong about that. We've just signed a new six-figure deal with them and are excited to keep rocking.