On Taxis and Rainbows: Anonymising is not easy

I’ve written a series of blog posts for the Rathenau Institute on ethical issues in data research. In this article I show that proper anonymisation of data is no easy task. This article is a translation of the Dutch original, and is available under CC-BY.

Introduction

In 2013 Chris Whong discovered that the data on all taxi rides in New York City could be requested under the Freedom of Information Law (FOIL). After filling out some forms, providing a large USB flash drive and waiting a few days, he got his flash drive back with nearly 20 gigabytes of data.

This data described the taxi rides in New York City over the past few years. Each taxi ride is described by a record that includes the start and end time, the start and end point, and the number of passengers. The cost of the ride and the tip amount were given in a separate file. In the original data all of this was linked to specific taxis; in the public dataset the taxi numbers appeared as seemingly random identifiers. In an attempt to anonymise the data while keeping it useful, the identifying taxi number had been encrypted in the disclosed data.

Chris Whong made some nice graphics and visualisations with the data he received. He analysed the data to find popular places where many taxis drive by, how often taxis drive, how tips are distributed, and so on. He also made his data available to others, so that other “civilian hackers” could work on it. Among these was someone who looked in more detail at the encrypted identification numbers.

This person noticed that one value appeared very often. On closer inspection it turned out to be the encryption of the number “0”. With this information, the rest of the numbers were quite easily broken by applying a so-called “rainbow table”. The taxi identification numbers have a fixed structure, so there are relatively few possible values. With a computer it is fairly simple to calculate the encryption of each of those values, and in this way simply look up the original taxi identifier. So it is quite easy to see which taxi has been active, and what its driver earned in a year, including tips.
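
The lookup described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reconstruction of the actual attack: it assumes the “encryption” was an unsalted MD5 hash, and it uses a hypothetical identifier format of digit, letter, digit, digit (such as “5X55”). Strictly speaking it builds a plain precomputed lookup table, a simpler relative of a rainbow table, which is sufficient when the number of possible values is this small.

```python
import hashlib
from itertools import product
from string import ascii_uppercase, digits


def md5_hex(text):
    """Return the hexadecimal MD5 digest of a string."""
    return hashlib.md5(text.encode("ascii")).hexdigest()


def build_lookup_table():
    """Map the digest of every possible identifier back to the identifier.

    Assumed (hypothetical) format: digit, letter, digit, digit,
    which gives only 10 * 26 * 10 * 10 = 26,000 candidates.
    """
    table = {}
    for d1, letter, d2, d3 in product(digits, ascii_uppercase, digits, digits):
        identifier = f"{d1}{letter}{d2}{d3}"
        table[md5_hex(identifier)] = identifier
    return table


lookup = build_lookup_table()

# A pseudonymised value as it might appear in the released data.
released_value = md5_hex("5X55")

# Reversing it is a single dictionary lookup.
recovered = lookup[released_value]
print(len(lookup), recovered)
```

Because every candidate can be enumerated in well under a second, the hash offers no real protection here; the weakness lies in the small, structured input space rather than in the hash function alone.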

But the drivers are not the only victims of this reidentification. Later, someone used photos of celebrities who were spotted in New York getting into or out of a taxi. Those photos are fairly easily linked to a time, and together with the number displayed on the taxi, the corresponding ride could be found in the data. This way it could be established that one celebrity tipped more generously than another.
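
Matching a photo to a ride amounts to filtering on the taxi number and a time window. The sketch below assumes the taxi identifiers have already been recovered; the records, the field layout and the ten-minute window are all made up for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical, made-up records: (taxi identifier, pickup time, tip in dollars).
rides = [
    ("5X55", datetime(2013, 7, 8, 22, 15), 0.00),
    ("5X55", datetime(2013, 7, 8, 23, 40), 5.50),
    ("7B21", datetime(2013, 7, 8, 22, 17), 2.00),
]


def rides_matching_photo(rides, taxi_id, photo_time, window_minutes=10):
    """Return rides by the given taxi that started close to the photo time."""
    window = timedelta(minutes=window_minutes)
    return [
        ride for ride in rides
        if ride[0] == taxi_id and abs(ride[1] - photo_time) <= window
    ]


# A photo shows someone getting into taxi 5X55 at about 22:20.
matches = rides_matching_photo(rides, "5X55", datetime(2013, 7, 8, 22, 20))
print(matches)
```

If exactly one ride remains, its start point, end point, fare and tip can all be attributed to the person in the photo.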

Clarification

The above case shows that anonymising data is not easy. The civil servant tried to use a simple operation to make the identification of taxi drivers impossible, but this did not work properly. The reidentification succeeded because of a combination of clumsy encryption and too much structure in the original data, which together made it possible to deanonymise the data.

Deanonymisation is not limited to the taxi data from New York; it can happen with all kinds of data. In the Netherlands, identifiers in medical or statistical data are often reduced to the zip code, gender and date of birth. This is done, for example, to study the effects of air pollution in certain areas and whether these effects differ between men and women.

A good indicator of anonymity in a dataset is k-anonymity (Sweeney 2002): given a set of attributes, how small is the group of people to which a record can be narrowed down? For example, how many people have exactly the same zip code, gender and date of birth? It appears that in the Netherlands a very large share of people are uniquely identifiable by those characteristics. Even if we restrict the data to the 4-digit zip code and date of birth, two-thirds of the Dutch are still uniquely identifiable (Koot 2010).
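
As a rough illustration of how such a measure could be computed, the sketch below groups records by a chosen set of attributes and reports the size of the smallest group (the k in k-anonymity) and the share of records that are unique. The records are made up, and the function names are chosen for this example only; the outcome says nothing about the real Dutch figures.

```python
from collections import Counter

# Hypothetical, made-up records: (4-digit zip code, gender, date of birth).
records = [
    ("3511", "F", "1980-02-14"),
    ("3511", "F", "1980-02-14"),
    ("3511", "M", "1975-11-03"),
    ("1012", "F", "1990-06-21"),
    ("1012", "F", "1990-06-21"),
    ("1012", "F", "1990-06-21"),
]


def group_sizes(records, columns):
    """Count how many records share each combination of the chosen columns."""
    return Counter(tuple(record[i] for i in columns) for record in records)


def k_anonymity(records, columns):
    """Return k: the size of the smallest group of indistinguishable records."""
    return min(group_sizes(records, columns).values())


def unique_fraction(records, columns):
    """Return the share of records that are the only one in their group."""
    sizes = group_sizes(records, columns)
    unique = sum(1 for size in sizes.values() if size == 1)
    return unique / len(records)


all_columns = (0, 1, 2)   # zip code, gender and date of birth
zip_only = (0,)           # a much coarser release

print(k_anonymity(records, all_columns))
print(unique_fraction(records, all_columns))
print(k_anonymity(records, zip_only))
```

A k of 1 means that at least one person is uniquely identifiable. Releasing fewer or coarser attributes raises k, at the cost of making the data less useful for analysis.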

Conclusion

Anonymisation of data is not easy. Both the structure of the source data and the identifying power of seemingly general attributes should be properly considered. A balance must be struck between what is actually needed for analysing the data and what is released. This consideration becomes increasingly difficult as more and more datasets become available. In America, you must be registered to vote, and this dataset is public. It can then be used to undo the anonymisation of other datasets (Sweeney 2002). In the Netherlands it is more difficult to actually reidentify a person by name, but it is feasible to narrow the data down to a unique person. With some extra effort, such a person can often be reidentified manually.
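
Linking a public register to an “anonymised” dataset is essentially a join on the shared attributes. The sketch below uses made-up data and hypothetical field names; it attaches a name only when the attributes match exactly one person in the register.

```python
# Hypothetical, made-up data. The "anonymised" set has no names,
# the public register does; both share zip code, gender and date of birth.
anonymised = [
    {"zip": "3511", "gender": "M", "born": "1975-11-03", "diagnosis": "asthma"},
    {"zip": "1012", "gender": "F", "born": "1990-06-21", "diagnosis": "none"},
]
public_register = [
    {"name": "A. Example", "zip": "3511", "gender": "M", "born": "1975-11-03"},
    {"name": "B. Example", "zip": "1012", "gender": "F", "born": "1990-06-21"},
    {"name": "C. Example", "zip": "1012", "gender": "F", "born": "1990-06-21"},
]

KEYS = ("zip", "gender", "born")


def reidentify(anonymised, register, keys=KEYS):
    """Attach a name to each anonymised record that matches exactly one person."""
    by_key = {}
    for person in register:
        by_key.setdefault(tuple(person[k] for k in keys), []).append(person["name"])

    linked = []
    for record in anonymised:
        names = by_key.get(tuple(record[k] for k in keys), [])
        if len(names) == 1:  # unique match: the record is reidentified
            linked.append((names[0], record["diagnosis"]))
    return linked


linked = reidentify(anonymised, public_register)
print(linked)
```

In this example the first record is linked to a name, while the second stays ambiguous because two people in the register share the same attributes.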

When dealing with requests for data sharing, consider not handing out the data at all, but letting researchers supply their analytical software and sharing only the results. Another option is to make strict agreements about the distribution and destruction of the provided data.

You are completely free to distribute and re-use this text, for example to start discussion with researchers and policy makers. If you do, I’d like to hear about your experiences. Translations of the other articles in this series are forthcoming.