A short intro to privacy preservation

With the ever-growing number of datasets being published, there is an increased risk of personally identifiable information being leaked, directly or indirectly. The Netflix Prize dataset incident provides a concrete illustration of how quasi-identifying information can be combined with background knowledge to uncover people's real identities.

As regulators have put more emphasis on privacy preservation over the last two decades, various techniques have emerged that aim at safeguarding individuals' privacy. This notebook provides an overview of two mainstream privacy-preserving strategies, anonymization and differential privacy, and features concrete implementations of both techniques in Python.

Getting the dataset ready

Instead of using a real dataset containing actual personally identifiable information (PII), a safer option is to generate a sample dataset randomly.

We can now generate a dataset of 10,000 randomly created entries and save it in CSV format.
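One way to do this, sketched below, is with the Faker library; the column schema (name, gender, zipcode, birthdate, salary) and the file name sample_dataset.csv are illustrative choices rather than anything prescribed.

```python
import csv
import random

from faker import Faker

fake = Faker("en_US")
Faker.seed(42)
random.seed(42)

# Hypothetical schema: one direct identifier (name), a few
# quasi-identifiers, and one numeric attribute (salary).
with open("sample_dataset.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "gender", "zipcode", "birthdate", "salary"])
    for _ in range(10_000):
        writer.writerow([
            fake.name(),
            random.choice(["male", "female"]),
            fake.zipcode(),
            fake.date_of_birth(minimum_age=18, maximum_age=90).isoformat(),
            random.randint(20_000, 150_000),
        ])
```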

Anonymization in action

Preserving privacy through anonymization involves encrypting, removing, or replacing (using generalization, for instance) any personally identifiable information in a dataset. Below is an illustration of k-anonymity, a well-known data anonymization technique that relies on data suppression and generalization.

The unanonymized dataset
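Assuming the sample_dataset.csv file produced above, the raw data can be loaded and previewed with pandas:

```python
import pandas as pd

# Load the synthetic dataset generated earlier and preview it.
df = pd.read_csv("sample_dataset.csv")
df.head()
```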

The same dataset after k-anonymization
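Strictly speaking, k-anonymity is a property of the released data rather than a single algorithm, so the sketch below only approximates it with pandas: the direct identifier (name) is suppressed, the quasi-identifiers are generalized, and any row whose quasi-identifier combination still appears fewer than k times is dropped. The column names and k = 5 are assumptions carried over from the synthetic dataset above.

```python
K = 5  # every quasi-identifier combination must appear at least K times

anon = df.copy()

# Suppress the direct identifier entirely.
anon = anon.drop(columns=["name"])

# Generalize quasi-identifiers: keep only the 3-digit zip prefix
# and the birth year instead of the full date of birth.
anon["zipcode"] = anon["zipcode"].astype(str).str[:3] + "**"
anon["birthdate"] = pd.to_datetime(anon["birthdate"]).dt.year

# Suppress any remaining rows whose quasi-identifier combination
# is still too rare to blend into a group of at least K records.
quasi = ["gender", "zipcode", "birthdate"]
group_sizes = anon.groupby(quasi)["salary"].transform("size")
anon = anon[group_sizes >= K]

anon.head()
```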

Differential privacy at work

In a nutshell, differential privacy promises to make it nearly impossible for anyone to identify private information about an individual from a dataset. This is particularly important given the number of large datasets available today, many of which include quasi-identifying information such as zip code, gender, and birthdate; combined, those three attributes were enough to identify 87% of the US population.

A differentially private algorithm takes a dataset as input and injects noise into the identifying pieces of information it contains. The noise is generated randomly by leveraging statistical distributions such as the Laplace or Gaussian distribution. As a result, the identifying information is hidden behind the noise, protecting the privacy of the individuals whose information appears in the dataset.
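Before reaching for a library, the core idea can be sketched in a few lines: noise drawn from a Laplace distribution with scale sensitivity/epsilon is added to a query result. The values below (a counting query with sensitivity 1 and epsilon 0.5) are purely illustrative.

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon):
    # Scale b = sensitivity / epsilon: a smaller epsilon yields more
    # noise and therefore a stronger privacy guarantee.
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_value + noise

# Example: perturb the result of a counting query (sensitivity 1).
print(laplace_mechanism(true_value=42, sensitivity=1, epsilon=0.5))
```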

Below is an implementation of differential privacy in Python, using IBM's Diffprivlib library with the Laplace and Exponential distributions to generate random noise.

Install DiffPrivlib dependency
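In a notebook cell, the library (published on PyPI as diffprivlib) can be installed with pip:

```python
!pip install diffprivlib
```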

Randomize all numerical values through Laplace distribution
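A sketch of how the numeric columns could be perturbed with diffprivlib's Laplace mechanism (the keyword-only constructor below matches diffprivlib 0.4+). Approximating each column's sensitivity by its observed range, and epsilon = 1.0, are simplifying assumptions made for this synthetic data.

```python
from diffprivlib.mechanisms import Laplace

epsilon = 1.0  # illustrative privacy budget
dp_df = df.copy()

for col in dp_df.select_dtypes("number").columns:
    # Approximating sensitivity by the observed column range is a
    # simplification that is only reasonable for synthetic data.
    sensitivity = float(dp_df[col].max() - dp_df[col].min())
    mech = Laplace(epsilon=epsilon, sensitivity=sensitivity)
    dp_df[col] = dp_df[col].apply(mech.randomise)
```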

Randomize categorical values through Exponential distribution
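For categorical columns such as gender, diffprivlib's Exponential mechanism can select a (possibly swapped) category. In the sketch below, the utility is 1 for the true value and 0 for every other candidate, a simple randomized-response-style scoring chosen purely for illustration; the exact constructor signature may differ across diffprivlib versions.

```python
from diffprivlib.mechanisms import Exponential

def dp_category(value, candidates, epsilon=1.0):
    # Utility 1 for the true category and 0 for the rest, so the
    # mechanism usually keeps the true value but sometimes swaps it.
    utility = [1 if c == value else 0 for c in candidates]
    mech = Exponential(epsilon=epsilon, sensitivity=1,
                       utility=utility, candidates=candidates)
    return mech.randomise()

genders = sorted(dp_df["gender"].unique())
dp_df["gender"] = dp_df["gender"].apply(lambda v: dp_category(v, genders))
```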

Render final dataset
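The perturbed dataset can then be previewed and written back to disk (the output file name is arbitrary):

```python
dp_df.to_csv("dp_dataset.csv", index=False)
dp_df.head()
```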