Lots of companies want to collect data about their users. This is a good thing, generally; being data-driven is important, and it's jolly hard to know where best to focus your efforts if you don't know what your people are like. However, this sort of data collection also gives people a sense of disquiet; what are you going to do with that data about me? How do I get you to stop using it? What conclusions are you drawing from it? I've spoken about this sense of disquiet in the past, and you can watch (or read) that talk for a lot more detail about how and why people don't like it.
So, what can we do about it? As I said, being data-driven is a good thing, and you can't be data-driven if you haven't got any data to be driven by. How do we enable people to collect data about you without compromising your privacy?
Well, there are some ways. Before I dive into them, though, a couple of brief asides: there are some people who believe that you shouldn't be allowed to collect any data on your users whatsoever; that the mere act of wanting to do so is in itself a compromise of privacy. This is not addressed to those people. What I want is a way that both sides can get what they want: companies and projects can be data-driven, and users don't get their privacy compromised. If what you want is that companies are banned from collecting anything... this is not for you. Most people are basically OK with the idea of data collection; they just don't want to be victimised by it, now or in the future, and it's that property that we want to protect.
Similarly, if you're a company who wants to know everything about each individual one of your users so you can sell that data for money, or exploit it on a user-by-user basis, this isn't for you either. Stop doing that.
Aggregation
The key point here is that, if you're collecting data about a load of users, you're usually doing so in order to look at it in aggregate; to draw conclusions about the general trends and the general distribution of your user base. And it's possible to do that data collection in ways that maintain the aggregate properties of it while making it hard or impossible for the company to use it to target individual users. That's what we want here: some way that the company can still draw correct conclusions from all the data when collected together, while preventing them from targeting individuals or knowing what a specific person said.
In the 1960s, Warner and Greenberg put together the randomised response technique for social science interviews. Basically, the idea here is that if you want to ask people questions about sensitive topics -- have they committed a crime? what are their sexual preferences? -- then you need to be able to draw aggregate conclusions about what percentages of people have done various things, but any one individual's ballot shouldn't be a confession that can be used against them. The technique varies a lot in exactly how it's applied, but the basic concept is that for any question, there's a random chance that the answerer should lie in their response. If some people lie in one direction (saying that they did a thing, when they didn't), and the same proportion of people lie in the other direction (saying they didn't do the thing when they did), then if you've got enough answerers, all the lies pretty much cancel out. So your aggregate statistics are still pretty much accurate -- you know that X percent of people did the thing -- but any one individual person's response isn't incriminating, because they might have been lying. This gives us the privacy protection we need for people, while preserving the aggregate properties that allow the survey-analysers to draw accurate conclusions.
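To make the mechanics concrete, here's a minimal sketch of the yes/no version of the idea in Python. The 20% lying chance, the simulated 30% "true" rate, and the function names are all made up for illustration; the point is just that any single answer is deniable, while the overall proportion can still be recovered as long as you know the lying chance (and it isn't exactly 50%, at which point the answers carry no information at all).

```python
import random

def randomised_response(truth, lie_probability=0.2):
    """Answer a yes/no question, lying with the given probability.
    Any single response is deniable, but across many responses the
    lies cancel out in a known way."""
    return truth if random.random() > lie_probability else not truth

def estimate_true_proportion(responses, lie_probability=0.2):
    """Recover the true 'yes' proportion from the noisy responses.
    If the true proportion is p and each answer is flipped with
    probability q, the observed proportion is p*(1-q) + (1-p)*q,
    which we can invert (as long as q isn't 0.5)."""
    observed = sum(responses) / len(responses)
    return (observed - lie_probability) / (1 - 2 * lie_probability)

# Simulate 100,000 respondents, 30% of whom really did the thing.
random.seed(1)
truths = [random.random() < 0.30 for _ in range(100_000)]
answers = [randomised_response(t) for t in truths]
print(f"raw 'yes' rate in the submitted answers: {sum(answers) / len(answers):.3f}")
print(f"estimated true rate after correction:    {estimate_true_proportion(answers):.3f}")
```

Any individual True in answers might be a lie, but the corrected aggregate comes out very close to the real 30%.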
It's a bit like ticket inspection on trains. Train companies realised a long time ago that you don't need to put a ticket inspector on every single train; you just need inspectors on enough trains that the chance of fare-dodgers being caught is high enough that they don't want to take the risk. Randomised response works on a similar principle of useful uncertainty: if you get a ballot from someone saying that they smoked marijuana, you can't know whether they were one of the people randomly selected to lie, so that answer isn't incriminating -- but the overall percentage of people who say they smoked will be roughly equal to the percentage of people who actually did.
A worked example
Let's imagine you're, say, an operating system vendor. You'd like to know what sorts of machines your users are installing on (Ubuntu are looking to do this as most other OSes already do), and so how much RAM those machines have would be a useful figure to know. (Lots of other stats would also be useful, of course, but we'll just look at one for now while we're explaining the process. And remember this all applies to any statistic you want to collect; it's not particular to OS vendors, or RAM. If you want to know how often your users open your app, or what country they're in, this process works too.)
So, we assume that the actual truth about how much RAM the users' computers have looks something like this graph. Remember, the company does not know this. They want to know it, but they currently don't.
So, how can they collect data to know this graph, without being able to tell how much RAM any one specific user has?
As described above, the way to do this is to randomise the responses. Let's say that we tell 20% of users to lie about their answer, one category up or down. So if you've really got 8GB of RAM, there's an 80% chance you tell the truth and a 20% chance you lie; 10% of users lie in a "downwards" direction, claiming to have 4GB of RAM when they've actually got 8GB, and 10% lie in an "upwards" direction and claim to have 16GB. Obviously, we wouldn't actually have the users lie -- the software that collects this info would randomly decide, with the above probabilities, whether to submit the correct figure or an adjacent one, and people wouldn't even know it was doing it; the deliberately incorrect data is only provided to the survey. (Your computer doesn't lie to you about how much RAM it's got, just to the company.)
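In code, that client-side randomisation might look something like the sketch below. The category list and function name are made up for this example, and the way the ends of the scale are handled (the report just stays put if there's no category to move to) is an assumption a real scheme would need to think about properly; this certainly isn't anyone's actual telemetry code.

```python
import random

# Hypothetical RAM categories, in GB, from smallest to largest.
RAM_CATEGORIES = [1, 2, 4, 8, 16, 32, 64]

def report_ram(true_ram_gb, lie_probability=0.2):
    """What the survey submission contains: the truth 80% of the time,
    one category lower 10% of the time, one category higher 10% of the
    time. At the ends of the scale the report just stays put, because
    there's no neighbouring category to move to -- one of the details a
    real scheme would need to decide properly."""
    index = RAM_CATEGORIES.index(true_ram_gb)
    roll = random.random()
    if roll < lie_probability / 2:
        index = max(index - 1, 0)                        # lie downwards
    elif roll < lie_probability:
        index = min(index + 1, len(RAM_CATEGORIES) - 1)  # lie upwards
    return RAM_CATEGORIES[index]

print(report_ram(8))  # usually 8; one time in ten it's 4, one time in ten it's 16
```

So, what does that do to the graph data?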
In this graph, users who gave accurate information are shown in green, and those whose software lied are shown in red. And the graph looks pretty much the same! Any one given user's answer is unreliable and can't be trusted, but the overall shape of the graph is pretty similar to the actual truth. There are still peaks at the most popular points, and still troughs at the unpopular ones. Each bar in the graph is reasonably accurate (accuracy figures are shown below each bar; they'll normally be around 90-95%, although because it's random they may fluctuate a little for you). So our company can draw conclusions from this data, and those conclusions will be generally correct. They'll have to take them with a small pinch of salt, because we've deliberately introduced inaccuracy, but the trends and the overall shape of the data will be good.
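To check that claim, here's a rough simulation of the whole thing: a made-up "true" distribution of 100,000 machines, perturbed with the same 20% lie rate, and then tallied up the way the company would tally it. The weights, the seed, and the per-bar ratio are all just for illustration.

```python
import random
from collections import Counter

# Carrying on from the sketch above: same made-up categories, same 20% lie rate.
RAM_CATEGORIES = [1, 2, 4, 8, 16, 32, 64]

def report_ram(true_ram_gb):
    # Stay put 80% of the time, move one category down or up 10% of the time each.
    i = RAM_CATEGORIES.index(true_ram_gb)
    j = random.choices([i, max(i - 1, 0), min(i + 1, len(RAM_CATEGORIES) - 1)],
                       weights=[80, 10, 10])[0]
    return RAM_CATEGORIES[j]

# A made-up "true" distribution across 100,000 machines, peaking at 8GB.
# The company never sees this list; it only ever sees the reports.
random.seed(2)
true_values = random.choices(RAM_CATEGORIES, weights=[2, 6, 20, 35, 25, 9, 3], k=100_000)
reported_values = [report_ram(v) for v in true_values]

true_counts, reported_counts = Counter(true_values), Counter(reported_values)
for ram in RAM_CATEGORIES:
    t, r = true_counts[ram], reported_counts[ram]
    # A ratio close to 1.00 means that bar is barely distorted.
    print(f"{ram:>3}GB   true: {t:>6}   reported: {r:>6}   ratio: {r / t:.2f}")
```

Any individual entry in reported_values might be a lie, but the reported bar heights track the true ones closely.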
The key point here is that, although you can see in the graph which answers are true and which aren't, the company can't. They aren't told whether an answer is true or a lie; they just get the information, with no indication of how accurate it is. They'll know the percentage chance that an answer is untrue, but they won't know whether any one given answer is.
Can we be more inaccurate? Well, here's a graph to play with. You can adjust what percentage of users' computers lie about their survey results by dragging the slider, and see what that does to the data.
[Interactive graph: a slider from 0% to 100% controls what proportion of submissions are deliberately incorrect.]
Even if you make every single user lie about their values, the shape of the graph isn't destroyed. Lying does tend to "flatten out" the graph, though: it pulls the tall peaks down and the low troughs up, and if absolutely everyone lies, things are flattened so much that the conclusions you draw are probably going to be wrong. But you can see from this that it ought to be possible to run the numbers and come up with a "lie" percentage which balances the company's need for accurate aggregate information against the user's need not to give accurate individual answers.
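Here's a rough sketch of that flattening effect, using the same made-up distribution as the simulation above and working with expected proportions rather than individual users:

```python
# The same made-up "true" distribution, as proportions of the user base.
true_share = [0.02, 0.06, 0.20, 0.35, 0.25, 0.09, 0.03]  # 1, 2, 4, 8, 16, 32, 64GB

def expected_reported_share(true_share, q):
    """Expected reported proportions when a fraction q of users lie,
    half of them one category down and half one category up
    (clamped at the ends of the scale)."""
    n = len(true_share)
    reported = [0.0] * n
    for i, share in enumerate(true_share):
        reported[i] += (1 - q) * share
        reported[max(i - 1, 0)] += (q / 2) * share
        reported[min(i + 1, n - 1)] += (q / 2) * share
    return reported

for q in (0.0, 0.2, 0.5, 1.0):
    shares = expected_reported_share(true_share, q)
    print(f"{q:4.0%} lying: tallest bar {max(shares):.1%}, shortest bar {min(shares):.1%}")
```

The tallest bar shrinks and the shortest one grows as the lie percentage rises; the shape survives moderate lying but washes out as you approach 100%.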
It is of course critical to this whole procedure that the lies cancel out, which means that they need to be evenly distributed. If everyone just makes up random answers then obviously this doesn't work; each answer has to start from the truth and then (maybe) be shifted in one direction or the other, with both directions equally likely.
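A small sketch of why that balance matters, again with made-up numbers: compare lies that are split evenly between "one category down" and "one category up" with lies that only ever go up.

```python
# Compare lies that are split evenly up/down with lies that only ever go up.
true_share = [0.02, 0.06, 0.20, 0.35, 0.25, 0.09, 0.03]

def expected_shares(true_share, q, up_fraction):
    """Expected reported proportions when a fraction q lie, with
    up_fraction of the liars moving one category up and the rest
    moving one category down (clamped at the ends)."""
    n = len(true_share)
    reported = [0.0] * n
    for i, share in enumerate(true_share):
        reported[i] += (1 - q) * share
        reported[min(i + 1, n - 1)] += q * up_fraction * share
        reported[max(i - 1, 0)] += q * (1 - up_fraction) * share
    return reported

def mean_category(shares):
    return sum(i * s for i, s in enumerate(shares))

print(f"true mean category:        {mean_category(true_share):.2f}")
print(f"balanced lying (20%):      {mean_category(expected_shares(true_share, 0.2, 0.5)):.2f}")
print(f"upward-only lying (20%):   {mean_category(expected_shares(true_share, 0.2, 1.0)):.2f}")
```

With balanced lies the average category barely moves; if every liar moves upwards, the whole distribution drifts up and the company's conclusions drift with it.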
This is a fairly simple description of the whole process of introducing noise into the data, and data scientists would be able to bring much more learning to bear on it. For example, how much does accuracy suffer if a response can lie by more than one "step" in either direction? Instead of n% truth and (100-n)% lies, should the lies be distributed normally around the true value? Is it possible to do this data collection without flattening the graph so much? The state of the art has also moved on since the 1960s: Dwork's influential 2006 paper on differential privacy goes into this in more detail. And obviously we'll be collecting more than one number -- someone gathering data about the computers their OS is installed on will also want version info, network connectivity, hardware stats, device vendor, and so on -- which is fine, because it's now safe to collect that data... but how do the accuracy figures change when there are lots of stats rather than just one? There will be better statistical ways to quantify how inaccurate the results are than my simple per-bar percentage measure, and better ways to tune the percentage of lying to give the best results for everyone. This whole topic seems like something that data scientists in various communities could really get their teeth into, providing great suggestions and help to companies who want to collect data in a responsible way.
Of course, this applies to any data you want to collect. Do you want analytics on how often your users open your app? What times of day they do it? Which OS version they're on? How long they spend using it? All of that still works in aggregate, but the things you're collecting aren't so personally invasive, because you don't know whether any individual user's records are lies. This needs careful thought: there has been plenty of research on deanonymising data, and the EFF's Panopticlick project shows how a combination of innocuous-looking data points can be cross-referenced to identify someone, which needs protecting against too. But that's what data science is for -- to tune the parameters used here so that individual privacy isn't compromised while the aggregate properties are preserved.
If a company is collecting info about you and isn't actually interested in tying your submitted records to you (see the earlier point about how this doesn't apply to companies who do want that, who are a whole different problem), then in theory none of this is needed. They don't have to collect IP addresses or usernames and record them against each submission, and if they don't want that information then they probably don't. But there's always the concern: what if they're really doing that and lying about it? Well, this is how we alleviate that problem. Even if a company actually is trying to collect personally identifiable data and is lying to us about it, it doesn't matter, because we protect ourselves by -- with a specific probability -- lying right back. And then everyone gets what they want. There's a certain sense of justice in that.