Lee, Bonwoo

Differential privacy

Most researchers agree that a deterministic algorithm, e.g. anonymization, cannot protect the privacy of data owners, excluding trivial exceptions (e.g. constant prediction). For the history of this consensus, see why DP?.

To protect privacy in statistical analysis, one has to use an algorithm that returns randomized output, which is why we consider randomized algorithms (from now on, mechanisms). Let $\mathcal{X}$ be the data space our data originates from, and $\mathcal{Z}$ the space where our target statistics reside. Formally, we can define a mechanism as follows: for $n\in\mathbb{N},$

Definition. Mechanism A map $M:\mathcal{X}^n\rightarrow\mathcal{P}(\mathcal{Z})$ is called a randomized algorithm, or mechanism, where $\mathcal{P}(\mathcal{Z})$ denotes the collection of probability measures on $\mathcal{Z}$.

In nonprivate statistics, one uses a deterministic algorithm such as the sample average $\hat{\mu}=\frac{x_1+\cdots+x_n}{n}$, which maps $n$ samples to one real number. Such algorithms are said to be 'nonprivate' in the sense that one can easily detect a change of the input from the output. For example, suppose one receives the average income of a town every month and notices that the average income suddenly jumped compared to last month. Puzzled, one wonders what happened in the town and learns of the arrival of a new occupant. That person must have a high income. If one wishes, a simple calculation is enough to figure out the exact income, as shown below. One can detect and deduce the change of the input from the output, causing a privacy leakage.
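Concretely (a small worked calculation; the notation $\hat{\mu}'$ and $x_{n+1}$ for this month's average and the newcomer's income is introduced here for illustration), if last month's average $\hat{\mu}$ was taken over $n$ residents and this month's average $\hat{\mu}'$ over $n+1$ residents, the newcomer's income is recovered exactly:

$$x_{n+1}=(n+1)\,\hat{\mu}'-n\,\hat{\mu}.$$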

Then how is a mechanism different? Consider, for example, coin-tossing respondents. Suppose one wants to estimate the sex ratio of a certain group. To respect the respondents' privacy, the surveyor tosses a coin and allows a respondent to lie if it lands heads, and asks for the truth otherwise. The sex ratio estimated from the aggregated responses will be a random variable, not an exact value. Even when a new respondent is added, the surveyor cannot deduce that respondent's sex from the change in the output as long as the possibility of lying exists, which is why we can say the mechanism protects privacy.
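A rough sketch of how such a survey could be simulated and debiased is given below. It is an illustration, not part of the original argument: it assumes the coin lands heads with probability `p` and that a respondent who is allowed to lie always flips their answer.

```python
import random

random.seed(0)  # reproducibility of the sketch

def randomized_response(true_answers, p):
    """Each respondent flips their answer with probability p (the 'heads'
    case in which lying is allowed) and reports the truth otherwise."""
    return [1 - a if random.random() < p else a for a in true_answers]

def debiased_rate(reports, p):
    """E[observed rate] = p + (1 - 2p) * true_rate, so invert that
    relation to estimate the true proportion (requires p != 1/2)."""
    observed = sum(reports) / len(reports)
    return (observed - p) / (1 - 2 * p)

# Hypothetical group: 60% of 10,000 members have attribute 1.
truth = [1] * 6000 + [0] * 4000
reports = randomized_response(truth, p=0.3)
print(debiased_rate(reports, p=0.3))  # close to 0.6, but randomized
```

The estimate is noisy: a higher `p` hides each individual answer better but makes the aggregate estimate less accurate, which is exactly the trade-off discussed next.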

Naturally, the higher the probability of heads, the more private the mechanism, which means there exist different degrees of privacy. In traditional anonymization, a piece of confidential information was either masked or unmasked; it was protected or it was not. With a mechanism, however, information can be protected to different degrees. This was a new phenomenon, motivating the term 'differential privacy'. Then exactly how can we compare the privacy of given mechanisms?

Definition. Differential privacy For given $\epsilon,\delta\geq 0$, a mechanism $M:\mathcal{X}^n\rightarrow\mathcal{P}(\mathcal{Z})$ is $(\epsilon,\delta)$-differentially private if it satisfies $\mathbb{P}(M(x)\in S)\leq e^{\epsilon}\mathbb{P}(M(x')\in S)+\delta$ for every measurable $S\subset \mathcal{Z}$ and every $x,x'\in\mathcal{X}^n$ that differ in only one element.
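As a sanity check (a sketch under the same assumption as the simulation above: the coin lands heads with probability $p<1/2$ and a respondent always lies on heads), the per-respondent report of the coin-tossing survey satisfies this definition with $\delta=0$. For either possible answer $z$ and any two truths $x\neq x'$ of a single respondent,

$$\frac{\mathbb{P}(M(x)=z)}{\mathbb{P}(M(x')=z)}\leq\frac{1-p}{p}=e^{\epsilon},\qquad \epsilon=\log\frac{1-p}{p},$$

and summing over $z\in S$ gives the $(\epsilon,0)$-DP inequality. As $p$ approaches $1/2$, $\epsilon$ shrinks toward $0$, matching the earlier observation that a higher probability of heads means more privacy.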

The definition of differential privacy is