Data scientists have an ethical (and often legal) responsibility to protect sensitive data. Differential privacy is a leading-edge approach that enables useful analysis while protecting individually identifiable data values.

A machine learning project typically involves an iterative process of data analysis to gain insights into the data and determine which variables are most likely to help build predictive models. Analyzing data usually involves aggregate and statistical functions that provide insights into the statistical distribution of variables and the relationships between them. With large volumes of data, the aggregations provide a level of abstraction; but with smaller amounts of data, or with repeated analyses, even aggregated results may reveal details about individual observations.

Differential privacy is a technique designed to preserve the privacy of individual data points by adding “noise” to the data. The goal is to add enough noise to protect individual values while ensuring that the overall statistical makeup of the data remains consistent, so that aggregations produce results statistically similar to those obtained from the original raw data.
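To make the idea concrete, here is a minimal sketch of the classic Laplace mechanism. This is illustrative only (it is not the SmartNoise implementation), and the mean, bounds, and row count are hypothetical. Noise is drawn from a Laplace distribution whose scale depends on the query's sensitivity and on epsilon:

import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon):
    # Add Laplace noise with scale = sensitivity / epsilon;
    # a smaller epsilon means more noise and stronger privacy.
    return true_value + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)

# Hypothetical example: mean age, values clamped to [0, 120], 10,000 rows.
# The sensitivity of a mean over clamped data is (upper - lower) / n.
private_mean = laplace_mechanism(41.9, sensitivity=(120 - 0) / 10000, epsilon=0.5)
print(private_mean)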

SmartNoise SDK

SmartNoise is a toolkit from OpenDP, a joint project between researchers at Microsoft, Harvard University, and other contributors that aims to provide building blocks for using differential privacy in data analysis and machine learning projects.


How to Install

!pip install opendp-smartnoise==0.1.3.1

Add Statistical Noise

You can use SmartNoise to create an analysis in which noise is added to the source data. The underlying mathematics of how the noise is added can be quite complex, but SmartNoise takes care of most of the details for you. However, there are a few concepts it’s useful to be aware of.

- Upper and lower bounds: Clamping is used to set upper and lower bounds on the values of a variable. This is required to ensure that the noise generated by SmartNoise is consistent with the expected distribution of the original data.
- Sample size: To generate consistent differentially private data for some aggregations, SmartNoise needs to know the size of the data sample to be generated.
- Epsilon: Put simply, epsilon is a non-negative value that provides an inverse measure of the amount of noise added to the data. A low epsilon results in a dataset with a greater level of privacy, while a high epsilon results in a dataset that is closer to the original data. Generally, you should use epsilon values between 0 and 1. Epsilon is related to another value named delta, which indicates the probability that a report generated by an analysis is not fully private.
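These concepts map directly to parameters in a SmartNoise analysis. The following is a minimal sketch using the opendp-smartnoise 0.1.x API; the file path, column names, bounds, and row count are hypothetical placeholders:

import opendp.smartnoise.core as sn

data_path = 'data.csv'       # hypothetical CSV file
cols = ['PatientID', 'Age']  # hypothetical column names

with sn.Analysis() as analysis:
    # Load the source data
    data = sn.Dataset(path=data_path, column_names=cols)

    # Differentially private mean of Age:
    # data_lower/data_upper clamp values to the expected range,
    # data_rows supplies the sample size, and epsilon controls
    # the privacy/accuracy trade-off.
    age_mean = sn.dp_mean(data=sn.to_float(data['Age']),
                          privacy_usage={'epsilon': 0.50},
                          data_lower=0.0,
                          data_upper=120.0,
                          data_rows=10000)

analysis.release()
print('Private mean age:', age_mean.value)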

As a rule of thumb, $\epsilon$ should be thought of as a small number, between approximately $1/1000$ and $1$. In each implementation of differential privacy, a value of $\epsilon$ should be carefully chosen to strike a reasonable compromise between privacy and accuracy.
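To see that trade-off concretely, here is a small sketch reusing the illustrative Laplace mechanism from earlier (not SmartNoise itself) to compare estimates at different values of $\epsilon$; all values are hypothetical:

import numpy as np

rng = np.random.default_rng(seed=0)
true_mean = 41.9                 # hypothetical true mean age
sensitivity = (120 - 0) / 10000  # sensitivity of the mean for clamped data

for epsilon in (0.001, 0.01, 0.1, 1.0):
    # Smaller epsilon -> larger noise scale -> more privacy, less accuracy
    estimate = true_mean + rng.laplace(scale=sensitivity / epsilon)
    print(f'epsilon={epsilon}: private estimate = {estimate:.3f}')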

See https://privacytools.seas.harvard.edu/files/privacytools/files/pedagogical-document-dp_0.pdf