This is a great article on how to improve privacy for individuals in datasets that are disseminated. Quick summary:
The paper talks about “quasi-identifiers”: combinations of attributes within the data that can be used to identify individuals. For example, the statistic given is that 87% of the population of the United States can be uniquely identified by gender, date of birth, and 5-digit zip code. Given that three-attribute “quasi-identifier”, a dataset that has only one record with any given combination of those fields is clearly not anonymous; most likely it identifies someone. A dataset is “k-anonymous” when, for any given quasi-identifier, every record is indistinguishable from at least k-1 others.
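To make the definition concrete, here is a minimal sketch of how you might measure k for a dataset. The function name `k_anonymity`, the field names, and the sample records are all made up for illustration; they are not from the paper.

```python
from collections import Counter

def k_anonymity(records, quasi_identifier):
    """Return the size of the smallest group of records that share the same
    quasi-identifier values. The dataset is k-anonymous for any k up to
    that size. An empty dataset returns 0."""
    groups = Counter(tuple(r[attr] for attr in quasi_identifier) for r in records)
    return min(groups.values()) if groups else 0

# Hypothetical records; birth_year stands in for a full date of birth.
records = [
    {"gender": "F", "birth_year": 1980, "zip": "02139", "diagnosis": "flu"},
    {"gender": "F", "birth_year": 1980, "zip": "02139", "diagnosis": "cancer"},
    {"gender": "M", "birth_year": 1975, "zip": "02139", "diagnosis": "flu"},
]

# The single male record is unique on the quasi-identifier, so k is 1.
k = k_anonymity(records, ("gender", "birth_year", "zip"))
```

A result of 1 means at least one person can be singled out by the quasi-identifier alone.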
The next concept is “l-diversity”. Say you have a group of k different records that all share a particular quasi-identifier. That’s good, in that an attacker cannot identify the individual based on the quasi-identifier. But what if the value they’re interested in (e.g. the individual’s medical diagnosis) is the same for every record in the group? There are 7 different records in a group, and I don’t know which one of them is Bob Smith, but since I know that all of them are flagged with a diagnosis of cancer, the data has “leaked” that Bob Smith has cancer. (Figuring this out is, unsurprisingly, called a “homogeneity attack”.) “l-diversity” describes how varied the target values within a group are: a group is l-diverse when it contains at least l well-represented values of the sensitive attribute.
The paper outlines the mathematical underpinnings of l-diversity, and shows that it is practical and can be implemented efficiently.
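A rough way to spot a homogeneity problem is to count the distinct target values in each group. The sketch below does only that, which is the simplest possible reading of l-diversity; the paper's formal definitions are stricter about what counts as well-represented. The function name, field names, and records are assumptions for illustration.

```python
from collections import defaultdict

def distinct_l_diversity(records, quasi_identifier, sensitive):
    """Return the smallest number of distinct sensitive values found in any
    group of records sharing the same quasi-identifier values.
    An empty dataset returns 0."""
    groups = defaultdict(set)
    for r in records:
        key = tuple(r[attr] for attr in quasi_identifier)
        groups[key].add(r[sensitive])
    return min((len(values) for values in groups.values()), default=0)

# Hypothetical, already-generalized records.
records = [
    {"age_range": "30-40", "zip": "021**", "diagnosis": "cancer"},
    {"age_range": "30-40", "zip": "021**", "diagnosis": "cancer"},
    {"age_range": "30-40", "zip": "021**", "diagnosis": "cancer"},
    {"age_range": "40-50", "zip": "021**", "diagnosis": "flu"},
    {"age_range": "40-50", "zip": "021**", "diagnosis": "asthma"},
]

# The 30-40 group is 3-anonymous but every record says "cancer", so l is 1.
l = distinct_l_diversity(records, ("age_range", "zip"), "diagnosis")
```

The first group shows why k-anonymity alone is not enough: nobody in it can be singled out, yet everyone's diagnosis is known.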
Improving both k-anonymity and l-diversity requires fuzzing the data a little bit. Broadly, there are three ways you can do this:
- You can generalize the data to make it less specific. (E.g. the age “34” becomes “30-40”, or a diagnosis of “Chronic Cough” becomes “Respiratory Disorder”.)
- You can suppress the data. Simply delete it. (Which leads us into our host of “missing data” questions)
- You can perturb the data. The actual value can be replaced with a random value drawn from the distribution of values for that field. In this way, the overall distribution of values for that field will remain roughly the same, but the individual data values will be wrong.
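The three approaches above can be sketched in a few lines each. These helpers are illustrative only: the names are made up, suppression is modelled as blanking a field with `None`, and perturbation is modelled as drawing a replacement from the values already observed in that field, which is just one of several ways to do it.

```python
import random

def generalize_age(age, width=10):
    """Replace an exact age with a range, e.g. 34 -> '30-40'."""
    low = (age // width) * width
    return f"{low}-{low + width}"

def suppress(record, field):
    """Return a copy of the record with the field's value blanked out."""
    out = dict(record)
    out[field] = None
    return out

def perturb(records, field, rng):
    """Return copies of the records in which the field's value is replaced
    by one drawn at random from the values observed for that field."""
    observed = [r[field] for r in records]
    return [{**r, field: rng.choice(observed)} for r in records]

# Hypothetical records.
records = [
    {"age": 34, "zip": "02139"},
    {"age": 47, "zip": "02139"},
    {"age": 29, "zip": "02141"},
]

generalized = [{**r, "age": generalize_age(r["age"])} for r in records]
suppressed = [suppress(r, "zip") for r in records]
perturbed = perturb(records, "age", random.Random(0))  # seeded for repeatability
```

Each helper returns new records rather than changing the originals, so the three results can be compared against the source data side by side.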