Fathom retains IP addresses temporarily for security. How do you protect that data?
July 13, 2021
Here’s a question we recently got from a customer regarding Fathom being a privacy-first company and how that relates to the data we collect for our analytics. We like being as transparent as possible, especially when it comes to our privacy practices and data collection, so here’s the whole question and our answer.
Fathom retains logs of IP addresses that visit my website for 24 hours before the logs and visitor data are hashed. I was always taught that “what is collected can be intercepted,” so how do I avoid this? I would strongly prefer to be able to tell my visitors that their IP addresses are totally safe. I’m dealing with democracy, human rights, and privacy issues, so I need to be absolutely accurate with my statements on this topic. So my question is this: How can I be certain that the IP addresses you collect for security cannot be intercepted and used against my website visitors?
Everyone should be hyper-cautious about the data ANY third-party solution is collecting. Fathom has been built from the ground up to be as respectful as possible of every visitor’s data: we collect as little of it as possible and, when we do collect it, we anonymize or aggregate it.
Let’s walk you through how this came about and, very specifically, what we do.
First things first, we needed to find a way to protect our customers’ analytics data. We were getting flooded with spam attacks and are still hit every weekend. Back in November, we realized that we needed some access log functionality to establish patterns and measure IP activity. It’s easy to see the problem with keeping access logs, especially with a third-party service like Fathom, which runs on over 100,000 websites, some of them dealing with highly sensitive traffic (religion, politics, human rights, privacy, etc.). There was absolutely no way we could keep full access logs, even for just 24 hours, because we would then have an inventory of all access from a single IP address across multiple websites. That would be a disaster, so we didn’t pursue that route.
Fortunately, we have the option to keep redacted access logs and have them automatically wiped after 24 hours. This means that we keep records of IP addresses but no information about the website they visited. So if a government or malicious actor were to get hold of our redacted access logs, they would only see IPs, and they’d have no insight into which websites or pages a visitor viewed. Fathom is like a VPN. If a VPN service had only one customer, and authorities or malicious actors got hold of any kind of “connection logs” and “history”, all activity by those VPN IP addresses could be tied to an individual. But if thousands, tens of thousands, hundreds of thousands, or millions of people use the VPN, there is safety in numbers. That’s exactly how Fathom works and, whilst we’re already running on 100,000+ websites, our privacy improves as we are used on more websites.
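To make the idea of a redacted, short-lived access log concrete, here is a minimal Python sketch. It is an illustration only, not Fathom's actual code: the names `RedactedLogEntry`, `redact`, and `purge_expired`, and the request field names, are all hypothetical. The only details taken from the description above are that the log keeps the IP and a to-the-second time, keeps nothing about the website or page, and is wiped after 24 hours.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(hours=24)  # retention window stated in the post


@dataclass(frozen=True)
class RedactedLogEntry:
    """What survives redaction: an IP and a to-the-second timestamp."""
    ip: str
    seen_at: datetime


def redact(raw_request: dict) -> RedactedLogEntry:
    """Drop everything that identifies the site or page (hypothetical field names)."""
    # 'hostname' and 'path' are deliberately never copied into the log entry.
    return RedactedLogEntry(
        ip=raw_request["ip"],
        seen_at=raw_request["time"].replace(microsecond=0),
    )


def purge_expired(entries: list[RedactedLogEntry], now: datetime) -> list[RedactedLogEntry]:
    """Keep only entries younger than the retention window."""
    cutoff = now - RETENTION
    return [entry for entry in entries if entry.seen_at > cutoff]


now = datetime(2021, 7, 13, 12, 0, 0, tzinfo=timezone.utc)
raw = {
    "ip": "203.0.113.7",  # documentation-range address, not real data
    "time": now - timedelta(hours=1),
    "hostname": "example.org",
    "path": "/sensitive-page",
}
fresh = redact(raw)
stale = RedactedLogEntry(ip="203.0.113.8", seen_at=now - timedelta(hours=25))
kept = purge_expired([fresh, stale], now)  # only the 1-hour-old entry remains
```

In this sketch, anyone who obtained `kept` would see an IP and a time, but nothing that says which website or page was visited.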
Away from the redacted access logs, you then get to an interesting issue. Sure, the access logs may not have the website URLs, but what about the analytics database? Could you correlate browsing activity by matching timestamps in the access logs against timestamps in the database? You absolutely could if both stored timestamps at the same precision. So the way we approach this is that the access logs keep “to the second” activity from an IP address, with zero information about the website they visited, and then in our analytics database, we round the timestamp to the nearest minute. We receive far more page views in a single minute than we do in a single second, so any one log entry lines up with a large number of page views in the database. That means that if a malicious actor got hold of 24 hours of our redacted access logs and our entire analytics database, they couldn’t match the data. This is incredibly important to us. That’s the design we’ve put in place. And as I mentioned, we process millions of page views a day across hundreds of thousands of sites, so the sheer volume here helps obfuscate the log data.
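The effect of rounding can be shown with a small Python sketch. Again, this is an illustration under assumptions rather than our production code: `round_to_minute` and `candidate_log_times` are hypothetical names, and rounding half a minute upwards is one possible reading of “nearest minute”.

```python
from datetime import datetime, timedelta, timezone


def round_to_minute(ts: datetime) -> datetime:
    """Round to the nearest minute (30 seconds or more rounds up)."""
    shifted = ts + timedelta(seconds=30)
    return shifted.replace(second=0, microsecond=0)


def candidate_log_times(log_times: list[datetime], stored_minute: datetime) -> list[datetime]:
    """Every to-the-second log time that could belong to one stored page view."""
    return [t for t in log_times if round_to_minute(t) == stored_minute]


base = datetime(2021, 7, 13, 12, 0, 0, tzinfo=timezone.utc)
# Four to-the-second access log times: 12:00:31, 12:00:45, 12:01:10, 12:01:40
log_times = [base + timedelta(seconds=s) for s in (31, 45, 70, 100)]

# The database only ever sees the rounded value for a page view.
stored_minute = round_to_minute(log_times[1])  # 12:01:00

# Three of the four log entries are equally plausible matches.
matches = candidate_log_times(log_times, stored_minute)
```

With four requests, the ambiguity is small. With the volume of page views described above, each stored minute corresponds to a far larger set of log entries, which is what makes matching impractical.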
If a government wants to intercept a “raw IP” accessing the material, they can go to the ISP and request it. So we get into this “which government do you trust?” situation. For example, some EU customers have insisted that we offer a method for them to process all traffic within the EU using EU servers. If we went this route, that would mean that the US government couldn’t intercept that data, as they don’t have the legal ability (we’re a Canadian company). Other customers want EU citizen data to go through the EU and the rest of the world through the US. We’re currently building something called EU Isolation, and that feature will be able to handle that.
If you haven’t already read it, I recommend our data journey, which lays out exactly what we do with data that comes in from each and every page view.
I hope we’ve answered your question and you now feel more comfortable relaying this information to your own audience and visitors.
Update: We recently launched EU Isolation so we now process EU visitor data in the EU on EU servers owned by an EU company. We do this to be fully GDPR (Schrems II) compliant. We don't want our website analytics to break the law.