options(scipen = 999)
library(tidyverse)
library(gt)
library(palmerpenguins)
library(urbnthemes)
library(here)
source(here("R", "create_table.R"))
Last week we learned how to generate parametric and non-parametric synthetic data. We generated synthetic data for the palmerpenguins data using sequential synthesis, with OLS regression and simple random sampling with replacement as the synthesis models.
Now that we've generated synthetic data, how do we evaluate it, both in terms of usefulness and disclosure risk?
As a refresher, general utility metrics measure the distributional similarity (i.e. all statistical properties) between the original and synthetic data.
General utility metrics are also known as "broad" or "global" utility metrics because they compare broad aspects of the differences between the synthetic and confidential data, often in the form of a single summary metric.
General utility metrics are useful because they can provide a sense of how “fit for use” your synthetic data is for analysis, without having to make assumptions about the kinds of analysis people might use the synthetic data for.
Some common utility metrics are listed below:
- For discrete variables, check whether the counts or relative frequencies are similar. You can also compare joint/relative frequency tables among pairs of variables.
- For continuous variables, check the mean, standard deviation, skewness, and kurtosis (i.e. the first four moments). It is also useful to check medians, percentiles, and the number of zero/non-zero values, particularly with economic data. (A short sketch of these checks follows this list.)
- Visualize and compare univariate distributions using density plots. We already did this last week with our datasets, but it is very useful to do for each variable you synthesize.
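A minimal sketch of these checks with the penguins data, using simple random sampling with replacement (last week's simplest synthesizer) as a stand-in called synth_penguins; replace it with your own synthetic data:
set.seed(20220217)
# Stand-in synthetic data: resample penguins with replacement
synth_penguins <- slice_sample(penguins, n = nrow(penguins), replace = TRUE)
combined <- bind_rows(
  confidential = penguins,
  synthetic    = synth_penguins,
  .id = "source"
)
# Continuous variable: compare the first four moments and the median
combined %>%
  group_by(source) %>%
  summarize(
    mean     = mean(bill_length_mm, na.rm = TRUE),
    sd       = sd(bill_length_mm, na.rm = TRUE),
    median   = median(bill_length_mm, na.rm = TRUE),
    skewness = mean(scale(bill_length_mm)^3, na.rm = TRUE),  # simple moment-based estimate
    kurtosis = mean(scale(bill_length_mm)^4, na.rm = TRUE)   # simple moment-based estimate
  )
# Discrete variable: compare relative frequencies
combined %>%
  count(source, species) %>%
  group_by(source) %>%
  mutate(share = n / sum(n))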
Correlation Fit: Measures how well the synthesizer recreates the linear relationships between variables in the confidential dataset.
Discriminant based methods: Can a model distinguish (i.e. discriminate) between records from the actual vs synthetic data?
Basic idea is to take the confidential data, smush it together with the synthetic data, and see if a model can distinguish (i.e. discriminate) between the two!
If the data synthesis process is good, then hopefully a model won’t be able to distinguish between the two.
As a visual example of how these discriminant-based methods work, imagine that we generated a really good synthetic dataset that closely aligned with the confidential data; the discriminant-based utility metrics described below would then come out close to their ideal values.
For all of the discriminant-based methods below, we generate propensity scores (i.e. the probability that a particular record belongs to the synthetic data) using a classifier model.
The first few steps for all the specific methods outlined below are the same:
Combine the synthetic and confidential data. Add an indicator variable with 0 for the confidential data and 1 for the synthetic data
species | bill_length_mm | sex | ind |
---|---|---|---|
Chinstrap | 49.5 | male | 0 |
... | ... | ... | ... |
Adelie | 46.0 | male | 1 |
Calculate propensity scores (i.e. probabilities for group membership) for whether a given row belongs to the synthetic dataset, typically with a classifier like logistic regression or CART.
species | bill_length_mm | sex | ind | prop_score |
---|---|---|---|---|
Chinstrap | 49.5 | male | 0 | 0.32 |
... | ... | ... | ... | ... |
Adelie | 46.0 | male | 1 | 0.64 |
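A hedged sketch of these two steps in R, using resampling with replacement as a stand-in synthesizer (swap in your own synthetic data) and logistic regression as the classifier; a CART model would slot into the same place:
set.seed(20220217)
synth_penguins <- slice_sample(penguins, n = nrow(penguins), replace = TRUE)  # stand-in synthesis
# Step 1: combine the datasets and add the indicator variable
disc_data <- bind_rows(
  penguins       %>% mutate(ind = 0),
  synth_penguins %>% mutate(ind = 1)
) %>%
  select(species, bill_length_mm, sex, ind) %>%
  drop_na()
# Step 2: fit a classifier and calculate propensity scores
disc_model <- glm(ind ~ species + bill_length_mm + sex,
                  data = disc_data,
                  family = binomial())
disc_data <- disc_data %>%
  mutate(prop_score = predict(disc_model, type = "response"))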
These propensity scores can be used to calculate various metrics for general utility, some of which are described below:
pMSE: Calculates the average Mean Squared Error (MSE) between the propensity scores and the expected probabilities:
Proposed by Woo et al. (2009) and enhanced by Snoke et al. (2018a).
After doing steps 1) and 2) above:
Calculate expected probability, i.e. the share of synthetic data in the combined data. In the cases where the synthetic and confidential datasets are of equal size, this will always be 0.5.
species | bill_length_mm | sex | ind | prop_score | exp_prob |
---|---|---|---|---|---|
Chinstrap | 49.5 | male | 0 | 0.32 | 0.5 |
... | ... | ... | ... | ... | ... |
Adelie | 46.0 | male | 1 | 0.64 | 0.5 |
\[pMSE = \frac{(0.32 - 0.5)^2 + ... + (0.64-0.5)^2}{N} \]
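Continuing the sketch above, pMSE is straightforward to compute from the propensity scores; here the synthetic and confidential data are the same size, so the expected probability works out to 0.5:
# pMSE: mean squared deviation of the propensity scores from the expected probability
exp_prob <- mean(disc_data$ind)  # share of synthetic rows in the combined data
pmse     <- mean((disc_data$prop_score - exp_prob)^2)
pmse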
Often people use the pMSE ratio, which is the pMSE score divided by its expected value under the null model (Snoke et al. 2018b).
The null model is the expected value of the pMSE score in the best-case scenario, when the model used to generate the data reflects the confidential data perfectly.
A pMSE ratio of 1 means that your synthetic data and confidential data are indistinguishable.
SPECKS: Synthetic data generation; Propensity score matching; Empirical Comparison via the Kolmogorov-Smirnov distance. After generating propensity scores (i.e. steps 1 and 2 from above), you:
Calculate the empirical CDFs of the propensity scores for the synthetic and confidential data, separately.
Calculate the Kolmogorov-Smirnov (KS) distance between the two empirical CDFs. The KS distance is the maximum vertical distance between two empirical CDF distributions.
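Continuing the same sketch, the KS distance can be computed with base R's ks.test(); only the D statistic matters here, not the p-value, and ties in the propensity scores may trigger a warning:
# SPECKS: KS distance between the two groups' propensity score distributions
ks_dist <- ks.test(
  disc_data$prop_score[disc_data$ind == 0],
  disc_data$prop_score[disc_data$ind == 1]
)$statistic
ks_dist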
AUC: Area under the Receiver Operating Characteristic (ROC) curve, a summary of how good your discriminator is.
In our context, high AUC = good at discriminating = poor synthesis.
In the best case, AUC = 0.5, because that means the discriminator is no better than a random guess.
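Continuing the same sketch, AUC can be computed from the propensity scores without extra packages via the rank-based (Mann-Whitney) formulation; ROC packages would give the same number:
# AUC, treating the synthetic rows (ind == 1) as the class being detected
ranks <- rank(disc_data$prop_score)
n1    <- sum(disc_data$ind == 1)
n0    <- sum(disc_data$ind == 0)
auc   <- (sum(ranks[disc_data$ind == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
auc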
Assume that you have a confidential dataset of the starwars data, named conf_data below. You have already synthesized a fully synthetic dataset, named synth_data, based on the confidential data. The conf_data looks like:
gender | height | mass |
---|---|---|
masculine | 172 | 77 |
masculine | 167 | 75 |
... | ... | ... |
And synth_data looks like:
gender | height | mass |
---|---|---|
masculine | 163.2612 | 99.68595 |
masculine | 150.4994 | 92.96685 |
... | ... | ... |
Question 1: Calculate the correlation fit between the synthetic and confidential data. Fill in the blanks and run the code below.
# Fill in the blanks below:
# The cor() function can take in a dataframe of numeric columns and compute
# correlations between every pair of columns, returning a correlation matrix
# (gender is a character column, so select only height and mass first)
conf_data_correlations = cor(###)
synth_data_correlations = cor(###)
correlation_differences = conf_data_correlations - synth_data_correlations
# Correlation fit is the sum of the square roots of the squared differences
# (i.e. the absolute differences) between corresponding correlations
cor_fit = sum(sqrt( ### ^2))
cor_fit
Question 2: Compare the univariate distributions for mass and height in the confidential and synthetic data using density plots. Fill in the blanks and run the code below.
combined_data = bind_rows("synthetic" = synth_data,
"confidential" = conf_data,
.id = "type")
# Create a density plot of the mass distributions
combined_data %>%
  ggplot(aes(x = ###, fill = type)) +
  geom_density(alpha = 0.4, color = "white")

# Create a density plot of the height distributions
combined_data %>%
  ggplot(aes(x = ###, fill = type)) +
  geom_density(alpha = 0.4, color = "white")
Specific utility metrics measure how suitable a synthetic dataset is for specific analyses.
These specific utility metrics will change from dataset to dataset, depending on what you’re using the data for.
A helpful rule of thumb: general utility metrics are useful for the data synthesizers to be convinced that they’re doing a good job. Specific utility metrics are useful to convince downstream data users that the data synthesizers are doing a good job.
Some examples of specific utility metrics, though again these will vary dramatically, are below.
Urban often uses microsimulation models, particularly with administrative tax data.
When we were synthesizing administrative tax data, one of the utility metrics was how close the microsimulation results (e.g. projected income, capital gains, dividends) were when using the confidential versus synthetic data as inputs to the tax calculator.
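The tax calculator itself is beyond this example, but as a toy illustration of the same idea with the penguins data (reusing the synth_penguins stand-in from the sketches above): run the analysis a downstream user cares about on both datasets and compare the results. The regression below is just an illustrative choice, not one of Urban's metrics:
# Fit the same analyst's model on the confidential and synthetic data
conf_fit  <- lm(body_mass_g ~ bill_length_mm + species, data = penguins)
synth_fit <- lm(body_mass_g ~ bill_length_mm + species, data = synth_penguins)
# Compare the coefficient estimates side by side
tibble(
  term         = names(coef(conf_fit)),
  confidential = coef(conf_fit),
  synthetic    = coef(synth_fit)
)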
Big picture: How often can we correctly re-identify confidential records from synthetic records for partially synthetic data?
For fully synthetic datasets, there is no one-to-one relationship between individuals and records, so identity disclosure risk is a little ill-defined. Generally, identity disclosure risk applies to partially synthetic datasets (or datasets protected with traditional SDC methods).
Most of these metrics rely on data maintainers essentially performing attacks against their synthetic data and seeing how successful they are at identifying individuals.
We start by making assumptions about the knowledge an attacker has (i.e. external publicly accessible data they have access to).
For each confidential record, the data attacker identifies a set of partially synthetic records which they believe contain the target record (i.e. potential matches) using the external variables as matching criteria.
There are distance-based and probability-based algorithms that can perform this matching. This matching process could be based on exact matches between variables or some relaxations (i.e. matching continuous variables within a certain radius of the target record, or matching adjacent categorical variables).
We then evaluate how accurate our re-identification process was using a variety of metrics.
As a simple example for the metrics we’re about to cover, imagine a data attacker has access to the following external data:
homeworld | species | name |
---|---|---|
Naboo | Gungan | Jar Jar Binks |
Naboo | Droid | R2-D2 |
And imagine that the first few rows of the partially synthetic released data look like this:
homeworld | species | skin_color |
---|---|---|
Tatooine | Human | fair |
Tatooine | Droid | gold |
Naboo | Droid | white, blue |
Tatooine | Human | white |
Alderaan | Human | light |
Tatooine | Human | light |
Note that the released partially synthetic data does not include names. But using some basic matching rules in combination with the external data, an attacker is able to identify the following potential matches for Jar Jar Binks and R2-D2, two characters in the Star Wars universe:
homeworld | species | skin_color |
---|---|---|
Potential Jar Jar matches | ||
Naboo | Gungan | orange |
Naboo | Gungan | grey |
Naboo | Gungan | green |
Potential R2-D2 Matches | ||
Naboo | Droid | white, blue |
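A minimal sketch of how such a candidate set could be built with exact matching on the attacker's key variables (homeworld and species); released_data below is a small hypothetical stand-in for the full released file, which contains more rows than the snippet shown above:
external_data <- tribble(
  ~homeworld, ~species, ~name,
  "Naboo",    "Gungan", "Jar Jar Binks",
  "Naboo",    "Droid",  "R2-D2"
)
# Hypothetical stand-in for the full partially synthetic released file
released_data <- tribble(
  ~homeworld, ~species, ~skin_color,
  "Tatooine", "Human",  "fair",
  "Naboo",    "Gungan", "orange",
  "Naboo",    "Gungan", "grey",
  "Naboo",    "Gungan", "green",
  "Naboo",    "Droid",  "white, blue",
  "Alderaan", "Human",  "light"
)
# Every released row that exactly matches a target's key variables is a potential match
potential_matches <- inner_join(external_data, released_data,
                                by = c("homeworld", "species"))
potential_matches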
And since we are the data maintainers, we can take a look at the confidential data and flag which of these potential matches are "true" matches: exactly one of the three Jar Jar Binks candidates is his real record, and the single R2-D2 candidate is his real record.
These matches above are counted in various ways to evaluate identity disclosure risk. Below are some of those specific metrics. Generally for a good synthesis, we want a low expected match rate and true match rate, and a high false match rate.
Expected Match Rate: On average, how likely is it to find a "correct" match among all potential matches? Essentially, the number of observations in the confidential data that an intruder can expect to correctly match. Higher expected match rate = higher identification disclosure risk.
In our example, this is \(\frac{1}{3} + 1 = 1.333\): a one-in-three chance of picking the true Jar Jar Binks record from his three candidates, plus the single (correct) R2-D2 match.
The two other risk metrics below focus on the subset of confidential records for which the intruder identifies a single unique match.
True Match Rate: The proportion of true unique matches among all confidential records. Higher true match rate = higher identification disclosure risk.
Assuming there are 100 rows in the confidential data in our example, this is \(\frac{1}{100} = 1\%\) (the single R2-D2 match is the only true unique match).
False Match Rate: The proportion of false matches among the set of unique matches. Lower false match rate = higher identification disclosure risk.
In our example, this is \(\frac{0}{1} = 0\%\): the only unique match (R2-D2) is a true match.
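A small sketch of this arithmetic, hard-coding the assumptions of the example above (100 confidential records, a three-record candidate set that contains the true Jar Jar Binks record, and a single correct R2-D2 match):
n_confidential <- 100  # assumed size of the confidential data
# Candidate set sizes and whether each set contains the true record
match_set_sizes    <- c(jar_jar = 3, r2d2 = 1)
set_contains_truth <- c(jar_jar = TRUE, r2d2 = TRUE)
# Expected match rate: chance of picking the true record at random from each set
expected_match_rate <- sum(set_contains_truth / match_set_sizes)  # 1/3 + 1 = 1.33
# True match rate: true unique matches among all confidential records
unique_sets     <- match_set_sizes == 1
true_match_rate <- sum(unique_sets & set_contains_truth) / n_confidential  # 1/100 = 1%
# False match rate: false matches among the unique matches
false_match_rate <- sum(unique_sets & !set_contains_truth) / sum(unique_sets)  # 0/1 = 0%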
Big picture: How well can we predict a sensitive attribute in a dataset using the synthetic data (and external data)?
Similar to above, you start by matching synthetic records to confidential records.
key variables: Variables that an attacker already knows about a record and can use to match.
target variables: Variables that an attacker wishes to know more or infer about using the synthetic data.
Individual CAP (correct attribution probability) risk measure: Probability that an intruder can correctly predict the value of the target variable for a record using the empirical distribution of this variable among synthetic observations with the same key variables.
Average CAP risk measure: Average of CAP measures across all records in the confidential data.
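A minimal sketch of the individual CAP calculation, assuming hypothetical data frames in which the key variables are homeworld and species and the target variable is skin_color (illustrative column names, not the synth_data from the earlier exercise):
# Individual CAP for one confidential record: among synthetic records that share
# its key variables, the share that also share its target value
individual_cap <- function(conf_row, synth_df) {
  matches <- synth_df %>%
    filter(homeworld == conf_row$homeworld, species == conf_row$species)
  if (nrow(matches) == 0) return(0)  # conventions differ when there are no key matches
  mean(matches$skin_color == conf_row$skin_color)
}
# Average CAP: mean of the individual measures across all confidential records, e.g.
# mean(sapply(seq_len(nrow(conf_df)),
#             function(i) individual_cap(conf_df[i, ], synth_df)))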
Membership Inference Tests: Can we perform a membership attack to determine if a particular record is in the confidential data?
Why is this important? Sometimes membership in a synthetic dataset is also confidential (e.g. a dataset of HIV positive patients or people who have experienced homelessness).
Also particularly useful for fully synthetic data where identity disclosure and attribute disclosure metrics don’t really make a lot of sense.
Assumes that attacker has access to a subset of the confidential data, and wants to tell if one or more records was used to generate the synthetic data.
Since we as data maintainers know the true answers, we can evaluate whether the attacker's guess is correct and break the results down in many ways (e.g. true positives, true negatives, false positives, and false negatives).
[Figure omitted; source: Mendelevitch and Lesh (2021)]
The “close enough” threshold is usually determined by a custom distance metric, like edit distance between text variables or numeric distance between continuous variables.
Often you will want to choose different distance thresholds and evaluate how your results change.
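A hedged sketch of one way to implement such a test, assuming all-numeric inputs and Euclidean distance on scaled columns; attacker_records, synth_num, threshold, and actually_in_training are all hypothetical names:
# Distance from each attacker record to its nearest synthetic record
nearest_synth_distance <- function(records, synth_num) {
  scaled <- scale(rbind(records, synth_num))  # put all columns on a common scale
  r_mat  <- scaled[seq_len(nrow(records)), , drop = FALSE]
  s_mat  <- scaled[-seq_len(nrow(records)), , drop = FALSE]
  apply(r_mat, 1, function(row) min(sqrt(colSums((t(s_mat) - row)^2))))
}
# Membership guess: flag a record as "used in synthesis" if any synthetic record
# falls within the chosen threshold, then compare the guesses to the truth:
# guess <- nearest_synth_distance(attacker_records, synth_num) <= threshold
# table(truth = actually_in_training, guess = guess)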
Copy Protection Metrics: Is our synthesizer simply memorizing the confidential data? i.e. Are our models too good?
Distance to Closest Record: Measures the distance between each real record (\(r\)) and the closest synthetic record (\(s_i\)), as determined by a distance calculation.
Many common distance metrics used in the literature including Euclidean distance, cosine distance, Gower distance, or Hamming distance (Mendelevitch and Lesh 2021).
Goal of this metric is to easily expose exact copies or simple perturbations of the real records that exist in the synthetic dataset.
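Distance to closest record can reuse the nearest_synth_distance helper from the membership inference sketch above. With the resampled synth_penguins stand-in from the first sketch, which literally memorizes rows of the confidential data, many of these distances are exactly zero, which is what this metric is designed to flag:
# Numeric-only versions of the confidential and (stand-in) synthetic data
conf_num  <- penguins       %>% select(where(is.numeric)) %>% drop_na()
synth_num <- synth_penguins %>% select(where(is.numeric)) %>% drop_na()
dcr <- nearest_synth_distance(conf_num, synth_num)
summary(dcr)
mean(dcr == 0)  # share of confidential records with an exact synthetic copy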
Following the same example with Jar Jar Binks above, let's assume that using external data an attacker was able to identify these three potential matches for Jar Jar Binks in the data. And because we have access to the confidential data, we know which one of these rows is a "correct" match.
homeworld | species | skin_color |
---|---|---|
Naboo | Gungan | green |
Naboo | Gungan | green |
Naboo | Gungan | grey |
Question 1: If an attacker randomly chooses one of these matches to be Jar Jar Binks, what is the probability they will be right?
Question 2: Assume that previously an attacker did not know the skin_color of Jar Jar Binks. Using this list of matches, what approaches could an attacker take to guess the skin_color of Jar Jar Binks? What is the accuracy of each of those approaches?
You can access the optional HW here. We promise it will only take a few minutes!
Snoke, Joshua, Gillian M Raab, Beata Nowok, Chris Dibben, and Aleksandra Slavkovic. 2018b. “General and Specific Utility Measures for Synthetic Data.” Journal of the Royal Statistical Society: Series A (Statistics in Society) 181 (3): 663–88.
Bowen, Claire McKay, Victoria Bryant, Leonard Burman, Surachai Khitatrakun, Robert McClelland, Philip Stallworth, Kyle Ueyama, and Aaron R Williams. 2020. “A Synthetic Supplemental Public Use File of Low-Income Information Return Data: Methodology, Utility, and Privacy Implications.” In International Conference on Privacy in Statistical Databases, 257–70. Springer.