options(scipen = 999)
library(tidyverse)
library(urbnthemes)
library(smoothmest)
set_urbn_defaults()
7 Formally Private Definitions, Fundamental Mechanisms, and Algorithms
7.1 Recap
7.1.1 Formal privacy definitions
In general, formally private methods have the following features (Bowen and Garfinkel 2021):
- Ability to quantify and adjust the privacy-utility trade-off, typically through parameters.
- Ability to rigorously and mathematically prove the maximum privacy loss that can result from the release of information.
- Ability to compose multiple statistics. In other words, a data curator can compute the total privacy loss from multiple individual information releases.
7.1.2 Privacy loss budget
Differential privacy and other formal privacy definitions use the concept of a privacy loss budget, typically represented mathematically as \(\epsilon\). The privacy loss budget bounds the privacy risk associated with releasing data or query results.
(Note: \(\epsilon\) is not the only privacy loss parameter, but we will use it here as a general representation of the privacy loss budget.)
- A larger value of \(\epsilon\) increases the maximum disclosure risk (the upper bound of the disclosure risk) associated with a given release of information.
- larger \(\epsilon\) = less noise added to data = more accuracy, but less privacy
- smaller \(\epsilon\) = more noise added to data = less accuracy, but more privacy
- Extreme cases (note that these cases are not realistic in real-world applications, but are presented to demonstrate the intuition):
  - \(\epsilon \to \infty\): all privacy is lost; the data retain all utility, but no privacy
  - \(\epsilon \to 0\): no privacy is lost; the data are completely distorted and no utility remains
7.2 Formal Privacy Features
Formal privacy is a relatively new set of definitions for quantifying the worst-case amount of information disclosed from statistics calculated on a private database. We provide conceptual and mathematical definitions below.
7.2.1 \(\epsilon\)-Differential privacy
Formal privacy does not make assumptions about:

- how a data intruder will attack the data;
- the amount of external information or computing power an intruder has access to, now or in the future;
- which information in the data poses a higher disclosure risk (Near 2020).

Instead, formal privacy assumes the worst-case scenario:

- the intruder has information on every observation except one;
- the intruder has unlimited computational power;
- the missing observation is the most extreme possible observation (or an extreme outlier) that could alter the statistic.
We mathematically define several formally private definitions and key theorems. We use the following notation: \(X\in\mathbb{R}^{n\times r}\) is the confidential data set representing \(n\) data points and \(r\) variables and \(M:\mathbb{R}^{n\times r}\rightarrow\mathbb{R}^k\) denotes the statistical query, i.e., \(M\) is a function mapping \(X\) to \(k\) real numbers. We denote a randomized or noisy version of \(M\) using \(\mathcal{M}:\mathbb{R}^{n\times r}\rightarrow\mathbb{R}^k\), which is a function that satisfies a formally private definition.
DP is the most widely known formal privacy definition. Privacy experts often refer to the original definition of DP as pure-DP or \(\epsilon\)-DP.
Differential Privacy (Dwork, McSherry, et al. 2006): A sanitization algorithm, \(\mathcal{M}\), satisfies \(\epsilon\)-DP if for all subsets \(S\subseteq Range(\mathcal{M})\) and for all \(X,X'\) such that \(d(X,X')=1\), \[\begin{equation}\label{eqn:dp} \frac{\Pr(\mathcal{M}( X) \in S)}{ \Pr(\mathcal{M}( X')\in S)}\le \exp(\epsilon) \end{equation}\] where \(\epsilon>0\) is the privacy loss budget and \(d(X,X')=1\) represents the possible ways that \(X'\) differs from \(X\) by one record.
- \(\epsilon\) sits in the exponent, so the bound \(\exp(\epsilon)\) grows rapidly; for example, \(\epsilon = 1\) bounds the probability ratio at \(e \approx 2.72\).
- This is an inequality, not an equation; \(\epsilon\) is up to us to define and represents the upper bound on disclosure risk that we are comfortable with for our particular data.
7.2.2 Global Sensitivity
In addition to the privacy loss budget, most formally private methods rely on a concept called global sensitivity, which captures how much a single extreme record (an outlier) can change a statistic (Bowen and Garfinkel 2021). We can think of global sensitivity as another value that helps determine how much noise is needed to protect the released data or statistic, because some statistics are more sensitive to outliers than others.
Imagine the data we want to protect contain socioeconomic information and the question we want answered is, “What is the median wealth?” Under formal privacy, we must consider the change from the most extreme possible record that could exist in any given data set with demographic and financial information. In our example, that person is Elon Musk, who was the wealthiest person in the world in 2023.1 Whether Musk is present or absent in the data, the median wealth should not change much. This means we can provide a more accurate answer by altering the median wealth statistic less, because it is less sensitive to the extreme outlier, Musk. Consider, however, the question, “What is the average wealth?” Unlike the previous statistic, the answer would change significantly if Musk were present or absent from the data. To protect the extreme case at a given level of privacy loss, a formally private algorithm would need to provide a significantly less accurate answer by altering the statistic more.
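To see this intuition in miniature, consider a toy sketch (the numbers are invented purely for illustration):

# toy data: adding one extreme outlier barely moves the median but moves the mean a lot
wealth <- c(20, 40, 60, 80, 100)         # made-up wealth values
wealth_outlier <- c(wealth, 250000)      # add one extreme record

median(wealth_outlier) - median(wealth)  # small change: 10
mean(wealth_outlier) - mean(wealth)      # enormous change: about 41657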
7.2.3 \(l_1\)-Global Sensitivity
\(l_1\)-Global Sensitivity (Dwork, McSherry, et al. 2006): The \(l_1\)-global sensitivity calculates the maximum amount a statistic can change in absolute value terms with the addition or removal of the most extreme possible observation.
\(l_1\)-Global Sensitivity (Dwork, McSherry, et al. 2006): For all \(X,X'\) such that \(d(X,X')=1\), the global sensitivity of a function \(M\) is \[\begin{equation}\label{eqn:gs} \Delta_1 (M)= \underset{d(X,X')=1}{\text{sup}} \|M(X)-M(X') \|_1 \end{equation}\]
For scalars, the \(l_1\)-Global Sensitivity is \(|M(X) - M(X')|\).
7.2.4 \(l_2\)-Global Sensitivity
\(l_2\)-Global Sensitivity (Dwork, McSherry, et al. 2006): The \(l_2\)-global sensitivity calculates the maximum amount a statistic can change, measured in the \(l_2\) (Euclidean) norm (the square root of the sum of squared differences), with the addition or removal of the most extreme possible observation.
\(l_2\)-Global Sensitivity (Dwork, McSherry, et al. 2006): For all \(X,X'\) such that \(d(X,X')=1\), the global sensitivity of a function \(M\) is
\[\Delta_2 (M)= \underset{d(X,X')=1}{\text{sup}} \|M(X)-M(X') \|_2\] For scalars, the \(l_2\)-Global Sensitivity is \(\sqrt{(M(X) - M(X'))^2}\).
The concept of global sensitivity is straightforward, but calculating the global sensitivity of some statistics can be very difficult. For instance, we cannot calculate a finite global sensitivity for the sample mean if we do not bound the variable.
7.3 Exercise 1
Suppose we are interested in counting the number of sole proprietorships in Washington, DC. What are the \(l_1\) and \(l_2\) global sensitivities of this statistic?
In other words, what is the maximum difference between \(M(X)\) and \(M(X')\) when \(d(X,X')=1\)?
The answer is one. The most a count can change by adding or subtracting one observation is one!
\(\Delta_1 (M) = \Delta_2 (M) = 1\)
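We can verify this intuition with a quick toy check (not part of the formal proof):

# a count changes by exactly one when one record is added, however extreme
x <- c(4, 8, 15)       # toy data
x_prime <- c(x, 10^9)  # neighboring data with one extreme record added

abs(length(x_prime) - length(x))  # 1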
7.4 Exercise 2
Suppose we are interested in calculating the total income of sole proprietorships in Washington, DC. What are the \(l_1\) and \(l_2\) global sensitivities of this statistic?
In other words, what is the maximum difference between \(M(X)\) and \(M(X')\) when \(d(X,X')=1\)?
The answer is \(\infty\). A total can theoretically change by any amount with the addition or deletion of one observation.
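Again, a quick toy check makes the point (the added value is arbitrary):

# a sum can change by the full value of the added record, which is unbounded
x <- c(4, 8, 15)       # toy data
x_prime <- c(x, 10^9)  # neighboring data with one extreme record added

abs(sum(x_prime) - sum(x))  # 1e+09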
7.5 Statistics
7.5.1 Counts
Counts are the most well-studied statistics in differential privacy. With unbounded differential privacy, the global sensitivity of a count is always 1.
For example, assume we are counting the number of billionaires in the United States. The most the count can change with the addition or removal of Elon Musk is one.
7.5.2 Sums
Calculating the global sensitivity is more difficult for statistics other than counts. The global sensitivity of a sum is unbounded because the addition or removal of one row can theoretically change the sum by any amount.
One approach is to clip or truncate values (a minimal sketch follows the list below). If we assume that all observations are between 6 and 10, inclusive, then we can treat the global sensitivity as \(10 - 6 = 4\).
- Differential privacy does not hold if we look at the data to determine the bounds.
- Bounds that truncate actual values bias statistics.
- This assumption is often problematic with economic data where distributions can be highly skewed.
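Here is a minimal sketch of clipping, assuming the bounds of 6 and 10 come from auxiliary knowledge rather than from the confidential data (clip() is a hypothetical helper, not from any package):

# clip values to the assumed bounds [6, 10]
clip <- function(x, lower, upper) pmin(pmax(x, lower), upper)

clip(c(2, 7, 9, 15), lower = 6, upper = 10)  # returns 6 7 9 10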
7.5.3 Means
Means can be rewritten as two queries: a total divided by a count (a sketch follows this list).
- noisy sum / noisy count, using GS(sum) and GS(count) respectively
Sometimes the number of observations, n, is assumed to be known. In this case, the global sensitivity is smaller.
- GS(sum) / n if we assume n is known
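A hedged sketch of the two-query approach, assuming values are clipped to [0, 100] so GS(sum) = 100 and GS(count) = 1, with \(\epsilon\) split evenly between the two queries:

# noisy mean = noisy sum / noisy count (toy data, for illustration only)
x <- c(10, 40, 70, 90)  # toy bounded data
epsilon <- 1

noisy_sum <- sum(x) +
  smoothmest::rdoublex(n = 1, mu = 0, lambda = 100 / (epsilon / 2))
noisy_count <- length(x) +
  smoothmest::rdoublex(n = 1, mu = 0, lambda = 1 / (epsilon / 2))

noisy_sum / noisy_count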
7.6 DP Sanitizers
A sanitizer protects against disclosure risk. A differentially private sanitizer protects against disclosure risk and meets the definition of differential privacy. If we know the global sensitivity of a statistic, then we can often add noise in a way that satisfies differential privacy. We review three fundamental formally private sanitizers. We call these sanitizers fundamental because they are the original formally private sanitizers that privacy researchers still often use as building blocks for more sophisticated formally private methods.
7.6.1 Laplace sanitizer
Dwork, McSherry, et al. (2006) first proposed protecting statistics by adding noise from a Laplace distribution, called the Laplace mechanism, but we can think of it as a Laplace sanitizer. The Laplace distribution is centered at zero, and its scale is the ratio of the \(l_1\)-global sensitivity of the statistic to the privacy loss budget, \(\epsilon\). Since the distribution is centered at zero, there is a higher probability of adding very little or no noise to the statistic. For the noise variability, if \(\epsilon\) is large or the sensitivity of the statistic is low, then there is a higher probability of adding very little noise to the confidential data statistic. If \(\epsilon\) is small or the sensitivity of the statistic is high, then there is a higher probability of adding a lot of noise to the statistic.
The Laplace sanitizer satisfies \(\epsilon\)-DP by adding noise from a Laplace distribution to statistics. More sensitivity means more expected noise. More \(\epsilon\) means less expected noise.
Laplace Mechanism (Dwork, McSherry, et al. 2006): Given any function \(M:\mathbb{R}^{n\times r}\rightarrow\mathbb{R}^k\), the Laplace mechanism is defined as: \[\begin{equation}\label{eqn:lap} \mathcal{M}(X)=M(X)+(\eta_1,\ldots,\eta_k). \end{equation}\] where \((\eta_1,\ldots,\eta_k)\) are i.i.d. \(Laplace(0, \frac{\Delta_1(M)}{\epsilon})\).
7.6.2 Laplace sanitizer Example
Let’s consider a simple example with the Palmer Penguins data set. The data set contains 333 observations about Adelie, Chinstrap, and Gentoo penguins in Antarctica. Suppose we want to count how many penguins are Adelie penguins.
penguins <- palmerpenguins::penguins |>
  drop_na()

penguins
# A tibble: 333 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen 36.7 19.3 193 3450
5 Adelie Torgersen 39.3 20.6 190 3650
6 Adelie Torgersen 38.9 17.8 181 3625
7 Adelie Torgersen 39.2 19.6 195 4675
8 Adelie Torgersen 41.1 17.6 182 3200
9 Adelie Torgersen 38.6 21.2 191 3800
10 Adelie Torgersen 34.6 21.1 198 4400
# ℹ 323 more rows
# ℹ 2 more variables: sex <fct>, year <int>
The global sensitivity is \(\Delta_1(M) = 1\), so the Laplace scale parameter is \(\frac{\Delta_1(M)}{\epsilon} = \frac{1}{\epsilon}\). This means our formally private statistic is one draw from a Laplace distribution centered at the confidential statistic with scale parameter \(\frac{1}{\epsilon}\).
Below is code for drawing values from a Laplace distribution, which we will call laplace_sanitizer().
# function to draw Laplace noise for one or more statistics
laplace_sanitizer <- function(sensitivity, epsilon, n = 1) {

  # lambda (distribution width) is sensitivity / privacy loss budget
  l <- sensitivity / epsilon

  # draw from Laplace distribution
  noise <- smoothmest::rdoublex(n = n,      # number of draws (one per statistic)
                                mu = 0,     # centered at 0
                                lambda = l) # scale based on l calculated above

  return(noise)
}
Let’s calculate the statistic without any noise.
answer_conf <- sum(penguins$species == "Adelie")

answer_conf
[1] 146
Now, let’s calculate the statistic with noise that satisfies the definition of \(\epsilon\)-differential privacy.
set.seed(1)
answer_dp <- answer_conf + laplace_sanitizer(sensitivity = 1, epsilon = 0.1)

answer_dp
[1] 153.5518
Maybe we got a lucky or unlucky draw from the Laplace distribution. Let’s calculate this statistic 100 times to understand the distribution of noisy statistics. This is purely for demonstration to understand the expectation of the noisy statistic.
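A minimal sketch of that simulation, summarized rather than plotted:

# draw 100 noisy answers to see the spread of the sanitizer
set.seed(2)
answers_dp <- answer_conf + laplace_sanitizer(sensitivity = 1, epsilon = 0.1, n = 100)

summary(answers_dp)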
7.6.3 Gaussian Sanitizer
Similar to the Laplace mechanism, the Gaussian mechanism adds random noise from a Gaussian distribution. The Gaussian distribution is also centered at zero, and its standard deviation scales with the ratio of the \(l_2\)-global sensitivity of the statistic to the privacy loss budget, \(\epsilon\) (the Gaussian mechanism is calibrated with the \(l_2\)-global sensitivity, not the \(l_1\)).
The Gaussian sanitizer satisfies \((\epsilon,\delta)\)-DP by adding noise from a Gaussian distribution (also known as Normal distribution or bell curve) to statistics. More sensitivity means more expected noise. More \(\epsilon\) means less expected noise. More \(\delta\) means less expected noise.
Gaussian Mechanism (Dwork and Roth 2014): The Gaussian Mechanism satisfies \((\epsilon,\delta)\)-DP by adding Gaussian noise with zero mean and variance, \(\sigma^2\), such that
\[\mathcal{M}(X)=M(X)+(\eta_1,\ldots,\eta_k)\]
where \(\eta_1,\ldots,\eta_k\) are independently drawn from \(N(0, \sigma^2)\) with \(\sigma=\frac{\Delta_2(M)\sqrt{2 \log(1.25/\delta)}}{\epsilon}\).
This sanitizer includes two parameters: \(\epsilon\) and \(\delta\). We can think of \(\delta\) as a small probability that the bound created by \(\epsilon\) does not hold.
The Gaussian sanitizer uses \(l_2\)-Global Sensitivity.
7.6.4 Gaussian Sanitizer Example
We repeat the last example, except this time using the Gaussian sanitizer.
# function to draw Gaussian noise for one statistic
gaussian_sanitizer <- function(sensitivity, epsilon, delta) {

  # sigma (distribution width) depends on sensitivity, epsilon, and delta
  sigma <- (sensitivity * sqrt(2 * log(1.25 / delta))) / epsilon

  # draw from Gaussian distribution
  noise <- rnorm(n = 1,      # draw one observation (adding noise to one statistic)
                 mean = 0,   # centered at 0
                 sd = sigma) # scale based on sigma calculated above

  return(noise)
}
Let’s calculate the statistic without any noise.
answer_conf <- sum(penguins$species == "Adelie")

answer_conf
[1] 146
Now, let’s calculate the statistic with noise that satisfies the definition of \((\epsilon, \delta)\)-differential privacy.
set.seed(1)
answer_dp <- answer_conf + gaussian_sanitizer(sensitivity = 1, epsilon = 0.1, delta = 10^-7)

answer_dp
[1] 110.1865
Maybe we got a lucky or unlucky draw from the Normal distribution. Let’s calculate this statistic 100 times to understand the distribution of noisy statistics. This is purely for demonstration to understand the expectation of the noisy statistic.
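A minimal sketch of that simulation (gaussian_sanitizer() returns a single draw, so we replicate it):

# draw 100 noisy answers to see the spread of the sanitizer
set.seed(2)
answers_dp <- answer_conf +
  replicate(100, gaussian_sanitizer(sensitivity = 1, epsilon = 0.1, delta = 10^-7))

summary(answers_dp)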
For a single statistic at the same \(\epsilon\), the Gaussian sanitizer adds more noise than the Laplace sanitizer! So why do we even need a Gaussian sanitizer?
The Gaussian sanitizer can compose better for multiple queries. This is because the sum of two normally distributed random variables is normally distributed, but the sum of two Laplace-distributed variables is not Laplace-distributed. A comparison of the two noise scales appears below.
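For a single query with sensitivity 1, we can compare the expected noise scales directly (the values in the comments are approximate):

# Laplace scale vs. Gaussian standard deviation at the same epsilon
epsilon <- 0.1
delta <- 10^-7

1 / epsilon                            # Laplace scale: 10
sqrt(2 * log(1.25 / delta)) / epsilon  # Gaussian standard deviation: about 57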
7.7 Important Theorems
As mentioned earlier, formal privacy requires methods to compose or account for the total privacy loss from each public data release or statistic. For example, composition or accounting allows the data curator to track the total privacy loss from multiple summary tables or multiple statistics requests from several data users. This is the main advantage of formal privacy compared to traditional SDC methods, which cannot quantify the total privacy loss. There are two main composition theorems: sequential and parallel. We also cover another important theorem (post-processing) that is essential in developing formally private methods.
7.7.1 Sequential Composition Theorem
The sequential composition theorem allows the data users to calculate the privacy loss budget from multiple noisy statistics on the same part of the confidential data (Bun and Steinke 2016; McSherry 2009).
To help explain this concept, suppose we have an establishment-level economic data set that reports the state of operation, the number of employees, and the average income for each establishment. We want to conduct three different analyses that cost \(\epsilon_1=1\), \(\epsilon_2=0.5\), and \(\epsilon_3=0.5\), respectively. Since we are applying the three analyses to the entire data set, sequential composition requires us to add up the individual privacy loss budgets, i.e., \(\epsilon_{total}=\epsilon_1+\epsilon_2+\epsilon_3=2\). Figure 7.1 shows the application of sequential composition to our fictitious economic data set.
Mathematically, suppose each mechanism \(\mathcal{M}_j\) provides \(\epsilon_j\)-DP. Then the sequence of \(\mathcal{M}_j(X)\) applied to the same \(X\) provides \(\sum_{j=1}^J\epsilon_j\)-DP.
7.7.2 Sequential Composition Theorem Example
Let’s return to the penguins example from above. Suppose \(\epsilon = 1\) and we want to count the number of “Adelie” penguins and the number of “Chinstrap” penguins.
epsilon <- 1

set.seed(20220505)

sum(penguins$species == "Adelie") +
  laplace_sanitizer(sensitivity = 1, epsilon = epsilon / 2)
[1] 146.7052
sum(penguins$species == "Chinstrap") +
  laplace_sanitizer(sensitivity = 1, epsilon = epsilon / 2)
[1] 78.36747
For reference, let’s look at the truth.
sum(penguins$species == "Adelie")
[1] 146
sum(penguins$species == "Chinstrap")
[1] 68
7.7.3 Parallel Composition Theorem
The parallel composition theorem allows data users to calculate the privacy loss budget from multiple noisy statistics on different or disjoint parts of the confidential data (Bun and Steinke 2016; McSherry 2009).
Using the same example as before in Figure 7.1, suppose we apply three analyses to partitions of the data (i.e., the three different states) that cost \(\epsilon_1=1\), \(\epsilon_2=0.5\), and \(\epsilon_3=0.5\), respectively. Since we are applying the three analyses on disjoint subsets of the data, parallel composition states that the total privacy loss budget is the maximum privacy loss budget of the three analyses, i.e., \(\epsilon_{total}=\max(\epsilon_1,\epsilon_2,\epsilon_3)=1\). Figure 7.2 shows the application of parallel composition to our fictitious economic data set.
Mathematically, let \(D_j\) be disjoint subsets of the input domain \(D\). The sequence of \(\mathcal{M}_j(X\cap D_j)\) provides \(\max_{j \in \{1,\ldots,J\}} \epsilon_j\)-DP.
7.7.4 Parallel Composition Theorem Example
Let’s consider a larger data set with 53,940 observations about diamonds. Suppose we want to calculate a differentially private histogram of diamond sizes with bins (0, 1], (1, 2], (2, 3], (3, 4], (4, 5], and (5, 6] with \(\epsilon = 0.1\).
diamonds_conf <- count(diamonds, carat = ceiling(carat))

diamonds_conf
# A tibble: 6 × 2
carat n
<dbl> <int>
1 1 36438
2 2 15613
3 3 1857
4 4 27
5 5 4
6 6 1
One approach is to use \(\frac{\epsilon}{6} = \frac{0.1}{6}\) for each of the six counting queries. This is based on sequential composition.
epsilon <- 0.1

set.seed(10)

diamonds_conf <- bind_cols(
  diamonds_conf,
  sequential = diamonds_conf$n +
    laplace_sanitizer(sensitivity = 1, epsilon = epsilon / 6, n = 6)
)

diamonds_conf
# A tibble: 6 × 3
carat n sequential
<dbl> <int> <dbl>
1 1 36438 36437.
2 2 15613 15558.
3 3 1857 1902.
4 4 27 -67.5
5 5 4 17.9
6 6 1 66.2
The bins for carat partition the data set, and each bin is a disjoint subset of the data. Therefore, we can use parallel composition and get more accurate differentially private counts!
set.seed(11)

diamonds_conf <- bind_cols(
  diamonds_conf,
  parallel = diamonds_conf$n +
    laplace_sanitizer(sensitivity = 1, epsilon = epsilon, n = 6)
)

diamonds_conf
# A tibble: 6 × 4
carat n sequential parallel
<dbl> <int> <dbl> <dbl>
1 1 36438 36437. 36446.
2 2 15613 15558. 15683.
3 3 1857 1902. 1857.
4 4 27 -67.5 69.0
5 5 4 17.9 28.6
6 6 1 66.2 9.53
7.7.5 Post-Processing Theorem
Another important theorem is the post-processing theorem that allows the continued use of formally private information without losing the privacy guarantee (Bun and Steinke 2016; Dwork, McSherry, et al. 2006; Nissim, Raskhodnikova, and Smith 2007). In other words, if someone modifies a formally private data set or statistic without using additional information from the confidential data, then that data set or statistic is still formally private.
For example, if a formally private method reported 3.7 employees, then we could round that value to 4 without leaking more information. Simply put, the post-processing theorem makes the data usable after formally private noise is added.
Mathematically, if \(\mathcal{M}\) is a sanitizer that satisfies \(\epsilon\)-DP and \(g\) is any function, then \(g\left(\mathcal{M}(X)\right)\) also satisfies \(\epsilon\)-DP.
Post-processing also provides the opportunity to improve utility. Data stewards can use available public or expert knowledge to reduce the amount of noise without accruing additional privacy loss. The public information can come from data released without formal privacy or from individuals who are comfortable sharing their information without noise. Rounding and eliminating impossible values like negative counts are common types of post-processing. There are also types of post-processing that can improve accuracy by leveraging information calculated from the same data set.
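As a minimal sketch, we can post-process the noisy diamond counts created above by rounding and clipping negative values at zero; this consumes no additional privacy loss budget:

# round noisy counts and eliminate impossible negative values
diamonds_conf |>
  mutate(parallel_pp = pmax(round(parallel), 0))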
Formal privacy is transparent and allows users to account for the noise introduced into statistics. Post-processing can give up some of this transparency and make it more difficult to account for the noise added to statistics.
7.8 Exercise 3
Consider a simulated data set with information about small businesses (0-20 employees) in Texas and Vermont.
set.seed(20220509)

small_businesses <- bind_rows(
  Texas = tibble(
    employees = rpois(n = 100010, lambda = 10),
    income = rlnorm(n = 100010, meanlog = 10, sdlog = 2)
  ),
  Vermont = tibble(
    employees = rpois(n = 403, lambda = 10),
    income = rlnorm(n = 403, meanlog = 10, sdlog = 2)
  ),
  .id = "state"
) |>
  mutate(employees = if_else(employees > 20, 20L, employees))
Using the Laplace sanitizer, calculate the number of small businesses in Texas and Vermont (count) with the overall \(\epsilon = 0.1\). Use the parallel composition theorem.
ex3_conf <- count(small_businesses, state)

ex3_conf

set.seed(46)

bind_cols(
  ex3_conf,
  ex3_conf$n + laplace_sanitizer(
    sensitivity = ### ______,
    epsilon = ### ______,
    n = 2
  )
)
- Which state has more absolute error introduced into its count?
- Which state has more relative error introduced into its count?
The observations from Texas and Vermont are disjoint, so we can use the full \(\epsilon = 0.1\) for each statistic instead of splitting it across the statistics.
ex3_conf <- count(small_businesses, state)

ex3_conf
# A tibble: 2 × 2
state n
<chr> <int>
1 Texas 100010
2 Vermont 403
set.seed(46)

bind_cols(
  ex3_conf,
  n_dp = ex3_conf$n + laplace_sanitizer(
    sensitivity = 1,
    epsilon = 0.1,
    n = 2
  )
)
# A tibble: 2 × 3
state n n_dp
<chr> <int> <dbl>
1 Texas 100010 100029.
2 Vermont 403 418.
The absolute error is larger for Texas, but the relative error is much bigger for Vermont.
7.9 Exercise 4
Using the Laplace sanitizer, calculate the number of employees in the entire data set (sum) with the overall \(\epsilon = 0.1\). We know from auxiliary information that the number of employees varies from 0 to 20 because they are small businesses.
ex4_conf <- small_businesses |>
  summarize(employees = sum(employees))

set.seed(47)

bind_cols(
  ex4_conf,
  employees_dp = ex4_conf$employees + laplace_sanitizer(
    sensitivity = ### ______,
    epsilon = ### ______,
    n = 1
  )
)
ex4_conf <- small_businesses |>
  summarize(employees = sum(employees))

set.seed(47)

bind_cols(
  ex4_conf,
  employees_dp = ex4_conf$employees + laplace_sanitizer(
    sensitivity = 20,
    epsilon = 0.1,
    n = 1
  )
)
# A tibble: 1 × 2
employees employees_dp
<int> <dbl>
1 1003755 1003807.
7.10 Other Formal Privacy Definitions
7.10.1 Approximate Differential Privacy
Approximate Differential Privacy, also known as \((\epsilon, \delta)\)-Differential Privacy, is a relaxation of \(\epsilon\)-Differential Privacy. We saw this definition above with the Gaussian sanitizer.
\((\epsilon, \delta)\)-Differential Privacy (Dwork, Kenthapadi, et al. 2006): A sanitization algorithm, \(\mathcal{M}\), satisfies \((\epsilon, \delta)\)-DP if for all subsets \(S\subseteq Range(\mathcal{M})\) and for all \(X, X'\) such that \(d(X,X')=1\),
\[\Pr(\mathcal{M}( X) \in S)\le \exp(\epsilon) \Pr(\mathcal{M}( X')\in S) + \delta\] where \(\delta\in [0,1]\).
We can think of \(\delta\) as a small probability that the bound created by \(\epsilon\) does not hold. \(\epsilon\)-DP is a special case of \((\epsilon, \delta)\)-DP when \(\delta=0\).
7.10.2 Zero-Concentrated Differential Privacy
Zero-Concentrated Differential Privacy is another relaxation of \(\epsilon\)-Differential Privacy. This definition is used by the Census Bureau for the 2020 Decennial Census.
Zero-Concentrated Differential Privacy (Bun and Steinke 2016): A sanitization algorithm, \(\mathcal{M}\), satisfies \((\xi, \rho)\)-zero-concentrated differential privacy if for all \(X, X'\) that are \(d(X,X')=1\) and \(\alpha\in (1, \infty)\),
\[D_\alpha(\mathcal{M}(X)||\mathcal{M}(X'))\leq\xi+\rho\alpha\]
where \(D_\alpha(\mathcal{M}(X)||\mathcal{M}(X'))\) is the \(\alpha\)-Rényi divergence between the distribution of \(\mathcal{M}(X)\) and the distribution of \(\mathcal{M}(X')\), and \(\xi\) and \(\rho\) are positive constants.
Zero-Concentrated Differential Privacy is closely related to Rényi Differential Privacy; both are defined in terms of the Rényi divergence.
7.11 Unpacking \(\epsilon\)
Differential privacy states that the log of the ratio between the probability of seeing any given output when an individual observation is in the data and the probability of seeing that output when the observation is not in the data is bounded by the value of \(\epsilon\).
\[\frac{\Pr(\mathcal{M}( X) \in S)}{ \Pr(\mathcal{M}( X')\in S)}\le \exp(\epsilon)\]
The bound is in exponential units, so modest increases in \(\epsilon\) correspond with large increases in the ratio of the probabilities.
Early differential privacy researchers thought \(\epsilon = 1\) or \(\epsilon = 2\) were upper bounds on \(\epsilon\). Today, much higher values of \(\epsilon\) are used. The April 2021 Decennial Census demonstration data used \(\epsilon = 4.0\) and \(\epsilon = 10.3\) for the person-level file. The Decennial Census ended up using \(\epsilon = 17.14\) for the person-level file and \(\epsilon = 2.47\) for the housing-unit data, with \(\delta = 10^{-10}\) for each.
Let’s consider the ratios of the probabilities for different values of \(\epsilon\):
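(A small snippet like the following could reproduce the table; the last value is the Census \(\epsilon = 17.14\), which the tibble displays as 17.1.)

tibble(
  epsilon = c(0.25, 0.5, 0.75, 1, 2, 4, 6, 8, 10.3, 17.14),
  ratio = round(exp(epsilon))
)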
# A tibble: 10 × 2
epsilon ratio
<dbl> <dbl>
1 0.25 1
2 0.5 2
3 0.75 2
4 1 3
5 2 7
6 4 55
7 6 403
8 8 2981
9 10.3 29733
10 17.1 27784809
It is tough to reason about what a ratio of 27,784,809 even means.
7.12 Key Takeaways
- Differential privacy places a bound on the amount of information released under extreme assumptions about the knowledge of an attacker and their computing power.
- Global sensitivity measures how much a statistic can change with the addition or removal of the most extreme possible value.
- Sanitizers, like the Laplace sanitizer, satisfy differential privacy by adding a specific amount of random noise to statistics.
- Higher values of \(\epsilon\) mean more information is potentially released.
- Sanitizers applied to statistics with higher global sensitivity require more noise to satisfy a definition of differential privacy than statistics with lower global sensitivity.
7.13 Bonus Exercises
7.13.1 Exercise 5
The Laplace sanitizer uses \(l_2\)-global sensitivity.
- True
- False
The answer is false. The Laplace sanitizer uses \(l_1\)-global sensitivity; the Gaussian sanitizer uses \(l_2\)-global sensitivity.
7.13.2 Exercise 6
This theorem improves accuracy when data can be broken into disjoint subsets.
- Sequential composition theorem
- Postprocessing theorem
- Parallel composition theorem
The answer is the parallel composition theorem. It improves accuracy when data can be broken into disjoint subsets because each subset can use the full privacy loss budget.
At the time of the session, Elon Musk was the wealthiest person in the world.