A new way to look at data privacy

A privacy technique that protects sensitive data while maintaining a machine-learning model’s performance.


Concerns about information leaking from data processing are receiving increasing attention. Achieving usable security, meaning protection that offers high utility, low implementation overhead, and guarantees that can be meaningfully interpreted, is challenging.

MIT scientists have proposed a new privacy definition called Probably Approximately Correct (PAC) Privacy. It enables the user to add the smallest amount of noise possible while still protecting sensitive data.

Building on PAC Privacy, they also created a framework that automatically determines the minimal amount of noise that needs to be added. Because the framework does not require knowledge of a model’s internal workings or training procedure, it is easier to apply across many models and applications.

Srini Devadas, the Edwin Sibley Webster Professor of Electrical Engineering and co-author of a new paper on PAC Privacy, said, “PAC Privacy exploits the uncertainty or entropy of the sensitive data in a meaningful way, and this allows us to add, in many cases, an order of magnitude less noise. This framework allows us to understand the characteristics of arbitrary data processing and privatize it automatically without artificial modifications. While we are in the early days and doing simple examples, we are excited about the promise of this technique.”

PAC Privacy approaches the issue quite differently from existing definitions such as differential privacy. Instead of concentrating only on distinguishability, that is, whether an adversary can tell if a particular individual’s data was used, it characterizes how difficult it would be for an adversary to reconstruct any portion of randomly selected or generated sensitive data once the noise has been added.

After defining PAC Privacy, the researchers created an algorithm that automatically tells the user how much noise to add to a model to prevent an adversary from confidently reconstructing a close approximation of the sensitive data. It relies on the uncertainty, or entropy, in the original data from the adversary’s viewpoint. The algorithm guarantees privacy even if the adversary has infinite computing power.

This automatic technique randomly selects samples from a data distribution or a large data pool and runs the user’s machine-learning training algorithm on the subsampled data to produce a learned model as output. It repeats this on many different subsamples and measures the variance across all the outputs. That variance determines how much noise must be added; a lower variance means less noise is needed.
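
The subsample, train, and compare loop described above can be sketched in a few lines of Python. This is only a rough illustration under stated assumptions, not the authors’ algorithm: the function names, the toy "training" routine (a feature-wise mean), and the noise_multiplier knob are all hypothetical, and scaling Gaussian noise by the per-coordinate variance is a simplification of however the paper actually calibrates noise to a privacy target.

```python
import numpy as np


def train_mean_model(samples):
    """Hypothetical stand-in for a training algorithm: the 'model' is the feature-wise mean."""
    return samples.mean(axis=0)


def estimate_output_variance(data_pool, train_fn, subsample_size, num_trials, rng):
    """Run train_fn on random subsamples and return the per-coordinate variance of its outputs."""
    outputs = []
    for _ in range(num_trials):
        # Draw one random subsample (without replacement) from the data pool.
        idx = rng.choice(len(data_pool), size=subsample_size, replace=False)
        outputs.append(np.asarray(train_fn(data_pool[idx]), dtype=float))
    # Sample variance across trials, computed separately for each output coordinate.
    return np.stack(outputs).var(axis=0, ddof=1)


def add_scaled_noise(output, variance, noise_multiplier, rng):
    """Add Gaussian noise whose variance is noise_multiplier times the measured variance.

    Assumption: noise_multiplier is an illustrative knob, not a quantity from the paper.
    """
    std = np.sqrt(noise_multiplier * variance)
    return output + rng.normal(0.0, std, size=np.shape(output))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pool = rng.normal(loc=5.0, scale=2.0, size=(10_000, 3))  # synthetic "sensitive" data
    variance = estimate_output_variance(pool, train_mean_model, 500, 200, rng)
    released = add_scaled_noise(train_mean_model(pool), variance, 4.0, rng)
    print("per-coordinate output variance:", variance)
    print("noisy released model:", released)
```

Because the training routine is passed in as a plain function and only its outputs are inspected, the sketch treats it as a black box, which mirrors the article’s point that no knowledge of the model’s internals is required.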

PAC Privacy can be computationally expensive because it requires repeatedly training a machine-learning model on many subsamples of the data.

One way to improve PAC Privacy is to modify the user’s machine-learning training procedure so that it is more stable, meaning its output model does not change significantly when the input data is subsampled from a data pool. This stability produces smaller variances between subsample outputs, so the PAC Privacy algorithm would need to run fewer times to determine the ideal amount of noise, and it would also need to add less noise.
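
To make the link between stability and output variance concrete, here is a small Python sketch that compares a weakly and a strongly regularized ridge regression across random subsamples. Ridge regression is chosen purely as a familiar example of a training procedure whose stability can be tuned; the article does not say which modifications the researchers have in mind, and all names and parameter values below are hypothetical.

```python
import numpy as np


def fit_ridge(features, targets, reg_strength):
    """Closed-form ridge regression: solve (X^T X + reg_strength * I) w = X^T y."""
    dim = features.shape[1]
    gram = features.T @ features + reg_strength * np.eye(dim)
    return np.linalg.solve(gram, features.T @ targets)


def output_spread(features, targets, reg_strength, subsample_size, num_trials, rng):
    """Total variance of the fitted weights across random subsamples of the data."""
    weights = []
    for _ in range(num_trials):
        idx = rng.choice(len(features), size=subsample_size, replace=False)
        weights.append(fit_ridge(features[idx], targets[idx], reg_strength))
    # Sum the per-coordinate sample variances into one stability score.
    return float(np.stack(weights).var(axis=0, ddof=1).sum())


def make_toy_regression(num_rows, rng):
    """Synthetic linear data with unit-variance Gaussian noise (illustration only)."""
    features = rng.normal(size=(num_rows, 5))
    true_weights = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
    targets = features @ true_weights + rng.normal(scale=1.0, size=num_rows)
    return features, targets


if __name__ == "__main__":
    X, y = make_toy_regression(2000, np.random.default_rng(0))
    for reg in (0.001, 5000.0):
        spread = output_spread(X, y, reg, 50, 100, np.random.default_rng(1))
        print(f"reg_strength={reg}: output spread = {spread:.6f}")
```

In this toy setting the heavily regularized model varies far less from one subsample to the next, so a variance-based procedure would call for less noise. The trade-off is that such strong regularization also shrinks the weights toward zero and can hurt accuracy, so stability would need to be balanced against the model’s usefulness.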

Devadas added, “In the next few years, we would love to look a little deeper into this relationship between stability and privacy and the relationship between privacy and generalization error. We are knocking on a door here, but it is not clear yet where the door leads.”

Jeremy Goodsitt, senior machine learning engineer at Capital One, who was not involved with this research, said, “Obfuscating the usage of an individual’s data in a model is paramount to protecting their privacy. However, to do so can come at the cost of the data and, therefore model’s utility. PAC provides an empirical, black-box solution, which can reduce the added noise compared to current practices while maintaining equivalent privacy guarantees. In addition, its empirical approach broadens its reach to more data-consuming applications.”

Journal Reference:

  1. Hanshen Xiao, Srinivas Devadas. PAC Privacy: Automatic Privacy Measurement and Control of Data Processing. DOI: 10.48550/arXiv.2210.03458