Access a recording of Lesson 1 here.
Email Maddie Pickens (mpickens@urban.org) for lesson HTML files.
To follow along with optional code exercises, see “Exercise 3 (Optional)” for instructions on setting up an RStudio Cloud account and accessing the coding exercises (available in R and Python). This week’s lesson contains dropdowns with R code for reference (click boxes labeled “Code” to view). We will not discuss these exercises or the R code in class.
Synthetic data consists of pseudo or “fake” records that are statistically representative of the confidential data. Records are considered synthesized when they are replaced with draws from a model fitted to the confidential data.
Partially synthetic data only synthesizes some of the variables in the released data (generally those most sensitive to disclosure). In partially synthetic data, there remains a one-to-one mapping between confidential records and synthetic records. Below, we see an example of what a partially synthesized version of the above confidential data could look like.
Fully synthetic data synthesizes all values in the dataset with imputed amounts. The synthetic records no longer map directly onto the confidential records, but they remain statistically representative. Because fully synthetic data contain no actual observations, they protect against both attribute and identity disclosure. Below, we see an example of what a fully synthesized version of the confidential data shown above could look like.
Imagine you are running a conference with 80 attendees. You are collecting names and ages of all your attendees. Unfortunately, when the conference is over, you realize that only about half of the attendees listed their ages. One common imputation technique is to just replace the missing values with the mean age of those in the data.
Shown below is the distribution of the 40 age observations that are not missing.
And after imputation, the histogram looks like this:
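In code, mean imputation is a one-liner. Here is a minimal sketch with a hypothetical ages vector (toy values for illustration, not the conference data):

# hypothetical vector of attendee ages with missing values
ages <- c(34, 51, NA, 28, 45, NA, 39, 62, NA, 30)

# replace each missing value with the mean of the observed ages
ages_imputed <- ifelse(is.na(ages), mean(ages, na.rm = TRUE), ages)

# compare the distributions before and after imputation
summary(ages)
summary(ages_imputed)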
A more advanced implementation of synthetic data generation estimates a model for each variable, using previously synthesized variables as predictors. This iterative process is called sequential synthesis. It allows us to model multivariate relationships (or joint distributions) without being computationally expensive.
The process described above may be easier to understand with the following table:
Step | Outcome | Modeled with | Predicted with |
---|---|---|---|
1 | Sex | — | Random sampling with replacement |
2 | Age | Sex | Sampled Sex |
3 | Social Security Benefits | Sex, Age | Sampled Sex, Sampled Age |
… | … | … | … |
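A minimal sketch of these steps in R, using a hypothetical confidential data frame conf with sex, age, and Social Security benefit (ssb) columns (all simulated for illustration; the penguins example later in this lesson walks through a full workflow):

# hypothetical confidential data for illustration
set.seed(1)
conf <- data.frame(
  sex = sample(c("female", "male"), 100, replace = TRUE),
  age = rnorm(100, mean = 45, sd = 12)
)
conf$ssb <- 5000 + 100 * (conf$sex == "male") + 20 * conf$age + rnorm(100, sd = 300)

# step 1: sample sex with replacement from its observed distribution
sex_synth <- sample(conf$sex, size = nrow(conf), replace = TRUE)

# step 2: model age with sex, then predict using the sampled sex
age_lm <- lm(age ~ sex, data = conf)
age_synth <- predict(age_lm, newdata = data.frame(sex = sex_synth))

# step 3: model benefits with sex and age, then predict with the synthetic values
# (a real synthesis would add noise to these predictions, as shown later in this lesson)
ssb_lm <- lm(ssb ~ sex + age, data = conf)
ssb_synth <- predict(ssb_lm, newdata = data.frame(sex = sex_synth, age = age_synth))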
Parametric data synthesis is the process of data generation based on a parametric distribution or generative model.

- Parametric models assume a finite number of parameters that capture the complexity of the data.
- They are generally less flexible, but more interpretable, than nonparametric models.
- Examples: regression to assign an age variable, sampling from a probability distribution, Bayesian models, copula-based models.

Nonparametric data synthesis is the process of data generation that is not based on assumptions about an underlying distribution or model.

- Often, nonparametric methods use frequency proportions or marginal probabilities as weights for some type of sampling scheme.
- They are generally more flexible, but less interpretable, than parametric models.
- Examples: assigning gender based on underlying proportions, CART (classification and regression tree) models, RNN models, etc.
Important: Synthetic data are only as good as the models used for imputation!
Researchers can create any number of versions of a partially synthetic or fully synthetic dataset. Each version of the dataset is called an implicate; implicates are also referred to as replicates or simply "synthetic datasets."

- Multiple implicates are useful for understanding the uncertainty added by imputation and are required for calculating valid standard errors (see the sketch after this list).
- More than one implicate can be released for public use; each new release increases disclosure risk, but also allows for more complete analyses and better inferences, provided users apply the correct combining rules.
- Implicates can also be analyzed internally to find which version(s) of the dataset provide the most utility in terms of data quality.
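For intuition, here is a minimal sketch of one common combining rule for estimates computed on m partially synthetic implicates (fully synthetic data use a different rule). The estimates and variances vectors below are hypothetical, standing in for one coefficient estimate and its variance from each implicate:

# hypothetical: one estimate and its variance from each of m implicates
estimates <- c(3.1, 2.8, 3.4, 3.0)
variances <- c(0.20, 0.22, 0.19, 0.21)

m     <- length(estimates)
q_bar <- mean(estimates)   # combined point estimate
b_m   <- var(estimates)    # between-implicate variance
u_bar <- mean(variances)   # average within-implicate variance

# total variance under the partially synthetic combining rule
t_p <- u_bar + b_m / m
se  <- sqrt(t_p)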
You have a confidential dataset that contains information about dogs' weight and height. You decide to sequentially synthesize these two variables and write up your method below. Can you spot the mistake in the write-up?

"To create a synthetic record, first synthetic pet weight is assigned based on a random draw from a normal distribution with mean equal to the average of the confidential weights and standard deviation equal to the standard deviation of the confidential weights. Then the confidential height is regressed on the synthetic weights. Using the resulting regression coefficients, a synthetic height variable is generated for each row in the data using just the synthetic weights as an input."

Answer: height should be regressed on the confidential values for weight, rather than the synthetic values for weight.
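To make the corrected method concrete, here is a minimal sketch with a hypothetical dogs data frame (the data are simulated purely for illustration):

# hypothetical confidential data
set.seed(1)
dogs <- data.frame(weight = rnorm(100, mean = 30, sd = 5))
dogs$height <- 20 + 0.8 * dogs$weight + rnorm(100, sd = 2)

# step 1: synthetic weight from a normal distribution fit to confidential weight
weight_synth <- rnorm(nrow(dogs), mean = mean(dogs$weight), sd = sd(dogs$weight))

# step 2: regress height on the *confidential* weight (not the synthetic weight)
height_lm <- lm(height ~ weight, data = dogs)

# step 3: generate synthetic height using the synthetic weights as inputs
height_synth <- predict(height_lm, newdata = data.frame(weight = weight_synth))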
What are the privacy implications of releasing multiple versions of a synthetic dataset (implicates)? Do these implications change for partially vs. fully synthetic data?

- Releasing multiple implicates improves transparency and analytical value, but increases disclosure risk (it erodes "security through obscurity").
- Releasing partially synthetic implicates is riskier, since the non-synthesized values are identical across implicates and a one-to-one relationship remains between confidential and synthesized records.
What are the trade-offs of a partially synthetic dataset compared to a fully synthetic dataset?

- Synthesizing only some variables (partial synthesis) generally leads to higher utility in analysis, since the relationships involving the unsynthesized variables are by definition unchanged (Drechsler, Bender, and Rässler 2008).
- Disclosure in fully synthetic data is nearly impossible because all values are imputed, while partially synthetic data carries higher disclosure risk because confidential values remain in the dataset (Drechsler, Bender, and Rässler 2008).
- Accurately and exhaustively specifying variable relationships and constraints in fully synthetic data is difficult, and doing so incorrectly can bias results (Drechsler, Bender, and Rässler 2008).
- Partially synthetic data may be publicly perceived as more reliable than fully synthetic data.
The palmerpenguins Data

The palmerpenguins package contains data about three species of penguins collected from three islands in the Palmer Archipelago, Antarctica. We will use an adapted version of the dataset to demonstrate some of the concepts discussed above.
# load packages; palmerpenguins supplies the data
# note: create_table() is a table-formatting helper from the course materials,
# and scatter_grid() comes from the urbnthemes package
library(tidyverse)
library(palmerpenguins)
library(urbnthemes)

# create dataset we will be using
penguins <- penguins %>%
filter(species == "Adelie") %>%
select(
sex,
bill_length_mm,
flipper_length_mm
) %>%
drop_na()
penguins %>%
head() %>%
create_table()
sex | bill_length_mm | flipper_length_mm |
---|---|---|
male | 39.1 | 181 |
female | 39.5 | 186 |
female | 40.3 | 195 |
female | 36.7 | 193 |
male | 39.3 | 190 |
female | 38.9 | 181 |
The above code simplifies the dataset to only three variables and removes missing values in those variables. We will synthesize the sex, bill_length_mm, and flipper_length_mm variables in this dataset using some of the methods discussed above. Since we are synthesizing all three variables, our final version of the dataset is considered fully synthetic.
The sex variable

Let's start by synthesizing sex, a binary variable that can take a value of either "male" or "female". To synthesize this variable, we will identify the percentage of the data that falls into each category and use those proportions to generate records that mimic the properties of the confidential data.
# identify percentage of total that each category (sex) makes up
penguins %>%
count(sex) %>%
mutate(relative_frequency = n / sum(n)) %>%
create_table()
sex | n | relative_frequency |
---|---|---|
female | 73 | 0.5 |
male | 73 | 0.5 |
Using these proportions, we will now randomly sample with replacement to mimic the underlying distribution of sex.
# set a seed so pseudo-random processes are reproducible
set.seed(20220301)
# vector of sex categories
sex_categories <- c('female', 'male')
# size of sample to generate
synthetic_data_size <- nrow(penguins)
# probability weights
sex_probs <- penguins %>%
count(sex) %>%
mutate(relative_frequency = n / sum(n)) %>%
pull(relative_frequency)
# use the sample() function to generate a synthetic vector of sex values
sex_synthetic <- sample(
x = sex_categories,
size = synthetic_data_size,
replace = TRUE,
prob = sex_probs
)
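As a quick sanity check, we can confirm the sampled shares are close to the confidential proportions:

# quick check: the synthetic shares should be close to the confidential shares
prop.table(table(sex_synthetic))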
Our new sex_synthetic variable will form the foundation of our synthesized data.
# set a seed so pseudo-random processes are reproducible
set.seed(20220301)
# use the vector to generate the synthetic sex column
penguins_synthetic <- tibble(
sex = sex_synthetic
)
penguins_synthetic %>%
head() %>%
create_table()
sex |
---|
female |
male |
female |
male |
female |
male |
The bill_length_mm variable

Unlike sex, bill_length_mm is numeric.
summary(penguins$bill_length_mm)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 32.10 36.73 38.85 38.82 40.77 46.00
To synthesize this variable, we are going to predict bill_length_mm for each penguin using a linear regression with sex as a predictor.

Note that sex is a factor variable with possible values of "male" and "female", which can't directly be used in a regression. Under the hood, R converts the factor into a binary numeric variable (i.e., 0 or 1) and then runs the regression.
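To see this encoding directly, we can peek at the design matrix R constructs from the factor (using the penguins data frame created above):

# inspect the dummy coding R creates for the sex factor
head(model.matrix(~ sex, data = penguins))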
# linear regression
bill_length_lm <- lm(
formula = bill_length_mm ~ sex,
data = penguins
)
summary(bill_length_lm)
##
## Call:
## lm(formula = bill_length_mm ~ sex, data = penguins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.790 -1.357 0.076 1.393 5.610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2575 0.2524 147.608 < 2e-16 ***
## sexmale 3.1329 0.3570 8.777 4.44e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.157 on 144 degrees of freedom
## Multiple R-squared: 0.3485, Adjusted R-squared: 0.344
## F-statistic: 77.03 on 1 and 144 DF, p-value: 4.44e-15
Now that we have coefficients for the linear regression, we can generate our synthetic values. First, we try a straightforward prediction of bill lengths using the synthetic sex variable.
# predict bill length with model coefficients
bill_length_synthetic_method1 <- predict(
object = bill_length_lm,
newdata = penguins_synthetic
)
# add predictions to synthetic data as bill_length
penguins_synthetic_method1 <- penguins_synthetic %>%
mutate(bill_length_mm = bill_length_synthetic_method1)
penguins_synthetic_method1 %>%
head() %>%
create_table()
sex | bill_length_mm |
---|---|
female | 37.25753 |
male | 40.39041 |
female | 37.25753 |
male | 40.39041 |
female | 37.25753 |
male | 40.39041 |
And now we compare the univariate distributions of the confidential data to our newly synthesized variable with a graph.
# create dataframe with both confidential and synthesized values
compare_penguins <- bind_rows(
"confidential" = penguins,
"synthetic" = penguins_synthetic_method1,
.id = "data_source"
)
# plot comparison of bill_length_mm distributions
compare_penguins %>%
select(data_source, bill_length_mm) %>%
pivot_longer(-data_source, names_to = "variable") %>%
ggplot(aes(x = value, fill = data_source)) +
geom_density(alpha = 0.3) +
labs(title = "Comparison of Univariate Distributions",
subtitle = "Method 1") +
scatter_grid()
Simply using the predicted values from our linear regression does not give us enough variation in the synthetic variable.

To understand more about the predictions made by the linear regression model, let's dig into the predicted values for the first few rows of data and the corresponding synthetic sex.
# Look at first few rows of synthetic data
penguins_synthetic_method1 %>%
head() %>%
create_table()
sex | bill_length_mm |
---|---|
female | 37.25753 |
male | 40.39041 |
female | 37.25753 |
male | 40.39041 |
female | 37.25753 |
male | 40.39041 |
We know from our regression output above that the intercept (\(\beta_0\)) is 37.3 and the coefficient for a male penguin (\(\beta_1\)) is 3.1. Therefore, if the penguin is male, we have a predicted value (\(\hat{y}\)) of \(37.3 + 3.1 = 40.4\). If the penguin is female, our predicted value (\(\hat{y}\)) is just the intercept, 37.3.
Because sex takes only two values, our synthetic bill_length_mm also takes only two values. The model fit limits the possible variation, making our synthetic variable significantly less useful.
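A quick check of the naive predictions confirms they collapse to just two distinct values:

# the naive predictions take only two distinct values
unique(penguins_synthetic_method1$bill_length_mm)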
Instead, we can try a second method: for each record, draw the synthetic value from a normal distribution with mean equal to the regression prediction for that record and standard deviation equal to the residual standard error.
# set a seed so pseudo-random processes are reproducible
set.seed(20220301)
# predict bill length with model coefficients
bill_length_predicted <- predict(
object = bill_length_lm,
newdata = penguins_synthetic
)
# synthetic column using normal distribution centered on predictions with sd of residual standard error
bill_length_synthetic_method2 <- rnorm(
n = nrow(penguins_synthetic),
mean = bill_length_predicted,
sd = sigma(bill_length_lm)
)
# add predictions to synthetic data as bill_length
penguins_synthetic_method2 <- penguins_synthetic %>%
mutate(bill_length_mm = bill_length_synthetic_method2)
penguins_synthetic_method2 %>%
head() %>%
create_table()
sex | bill_length_mm |
---|---|
female | 38.76954 |
male | 42.85524 |
female | 38.52128 |
male | 38.15742 |
female | 35.04525 |
male | 39.22174 |
Now, we again compare the univariate distributions of the confidential data and the synthetic data we generated with this second method.
# create dataframe with both confidential and synthesized values
compare_penguins <- bind_rows(
"confidential" = penguins,
"synthetic" = penguins_synthetic_method2,
.id = "data_source"
)
# plot comparison of bill_length_mm distributions
compare_penguins %>%
select(data_source, bill_length_mm) %>%
pivot_longer(-data_source, names_to = "variable") %>%
ggplot(aes(x = value, fill = data_source)) +
geom_density(alpha = 0.3) +
labs(title = "Comparison of Univariate Distributions",
subtitle = "Method 2") +
scatter_grid()
We have much more variation with this new method, though the distributions still do not match perfectly. We choose this method's output as the synthetic bill_length_mm in our final synthesized dataset, which now has two columns.
# using method 2 as synthesized variable
penguins_synthetic <- penguins_synthetic %>%
mutate(bill_length_mm = bill_length_synthetic_method2)
penguins_synthetic %>%
head() %>%
create_table()
sex | bill_length_mm |
---|---|
female | 38.76954 |
male | 42.85524 |
female | 38.52128 |
male | 38.15742 |
female | 35.04525 |
male | 39.22174 |
The flipper_length_mm variable

The flipper_length_mm variable is also numeric, so we can follow the same steps we used to synthesize bill_length_mm.
summary(penguins$flipper_length_mm)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 172.0 186.0 190.0 190.1 195.0 210.0
This time, we regress flipper_length_mm on both sex and bill_length_mm.
# linear regression
flipper_length_lm <- lm(
formula = flipper_length_mm ~ sex + bill_length_mm,
data = penguins
)
summary(flipper_length_lm)
##
## Call:
## lm(formula = flipper_length_mm ~ sex + bill_length_mm, data = penguins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.0907 -2.8058 -0.0534 3.3284 15.8789
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 170.6183 8.7497 19.500 <2e-16 ***
## sexmale 3.1721 1.2422 2.554 0.0117 *
## bill_length_mm 0.4610 0.2341 1.970 0.0508 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.058 on 143 degrees of freedom
## Multiple R-squared: 0.1492, Adjusted R-squared: 0.1373
## F-statistic: 12.54 on 2 and 143 DF, p-value: 9.606e-06
Since we already know we prefer the method that draws from a normal distribution centered on the regression predictions, we will default to that.
# set a seed so pseudo-random processes are reproducible
set.seed(20220301)
# predict flipper length with model coefficients
flipper_length_predicted <- predict(
object = flipper_length_lm,
newdata = penguins_synthetic
)
# synthetic column using normal distribution centered on predictions with sd of residual standard error
flipper_length_synthetic <- rnorm(
n = nrow(penguins_synthetic),
mean = flipper_length_predicted,
sd = sigma(flipper_length_lm)
)
# add predictions to synthetic data as flipper_length
penguins_synthetic <- penguins_synthetic %>%
mutate(flipper_length_mm = flipper_length_synthetic)
With all three variables synthesized, we can compare the univariate distributions of the confidential data and the synthetic data (we have already done this for bill_length_mm).
# create dataframe with both confidential and synthesized values
compare_penguins <- bind_rows(
"confidential" = penguins,
"synthetic" = penguins_synthetic,
.id = "data_source"
)
# Write out final compare_penguins df for use in future exercises
dir.create(here::here("data/"), showWarnings = FALSE)
compare_penguins %>%
write_csv(here::here("data", "penguins_synthetic_and_confidential.csv"))
First, we can compare the distributions of the sex variable in the confidential and synthetic data.
sex_comparison <- compare_penguins %>%
select(data_source, sex) %>%
count(data_source, sex) %>%
group_by(data_source) %>%
mutate(relative_frequency = n / sum(n))
sex_comparison %>%
ggplot(aes(x = n, y = sex, fill = data_source)) +
geom_text(aes(label = n),
position = position_dodge(width = 0.5),
hjust = -0.4) +
geom_col(position = "dodge", alpha = 0.7) +
scale_x_continuous(expand = expansion(add = c(0, 15))) +
labs(y = "", x = "N", title = "Comparison of sex distribution")
# plot comparison of distributions
compare_penguins %>%
select(
data_source,
bill_length_mm,
flipper_length_mm
) %>%
pivot_longer(-data_source, names_to = "variable") %>%
ggplot(aes(x = value, fill = data_source)) +
geom_density(alpha = 0.3) +
facet_wrap(~variable, scales = "free") +
labs(title = "Comparison of Univariate Distributions",
subtitle = "Final synthetic product") +
scatter_grid()
Questions: sex synthesis

- Was the method we used parametric or nonparametric? Why?
- What do we notice about the synthetic variable compared to the original?
- We generated a new sex variable that was the same length as the confidential data. Was this necessary for the method to be applied correctly?

Questions: bill_length_mm synthesis

- Was the method we just used parametric or nonparametric? Why?
- What do we notice about the synthetic variable compared to the original?
- Are we synthesizing these data sequentially? How do you know? (Yes: the synthetic sex variable was used as a predictor.)

Questions: flipper_length_mm synthesis

- What do we notice about the synthetic variable compared to the original?
- What are the benefits and drawbacks of generating this variable sequentially?
Setup
Because the following exercise involves coding, it is optional and to be done on your own time.
To follow along, sign up for an RStudio Cloud account (if you have not already) and join the Allegheny-count-data-privacy-trainings space so that you can follow along with our code exercises. You can do that by using this link. You will know you've successfully joined the class space when you see the following in the top left of your RStudio Cloud account:

You can find Intro to R and RStudio resources here if you're unfamiliar with R. There is also support for Python if you wish to use that. But again, all the code examples are optional and to be done on your own time. We will not be going over them in these trainings!
For this exercise, we will use the starwars dataset from the dplyr package. We will practice sequentially synthesizing a binary variable (gender) and a numeric variable (height).
# run this to get the dataset we will work with
library(tidyverse)

starwars <- dplyr::starwars %>%
select(gender, height) %>%
drop_na()
starwars %>%
head() %>%
create_table()
gender | height |
---|---|
masculine | 172 |
masculine | 167 |
masculine | 96 |
masculine | 202 |
feminine | 150 |
masculine | 178 |
Question 1: Gender synthesis

Fill in the blanks in the following code to synthesize the gender variable using the underlying distribution present in the data.
# set a seed so pseudo-random processes are reproducible
set.seed(20220301)
# Fill in the blanks!
# vector of gender categories
gender_categories <- c('feminine', 'masculine')
# size of sample to generate
synthetic_data_size <- nrow(starwars)
# probability weights
gender_probs <- starwars %>%
count(gender) %>%
mutate(relative_frequency = ### ______) %>%
pull(relative_frequency)
# use sample function to generate synthetic vector of genders
gender_synthetic <- sample(
x = ###_____,
size = ###_____,
replace = ###_____,
prob = ###_____
)
# create starwars_synthetic dataset using generated variable
starwars_synthetic <- tibble(
gender = gender_synthetic
)
Question 2: Height synthesis

Similarly, fill in the blanks in the code to generate the height variable using a linear regression with gender as a predictor.
# set a seed so pseudo-random processes are reproducible
set.seed(20220301)
# Fill in the blanks!
# linear regression
height_lm <- lm(
formula = ###_____,
data = ###______
)
# predict height with model coefficients
height_predicted <- predict(
object = height_lm,
newdata = ###_____
)
# synthetic column using normal distribution centered on predictions with sd of residual standard error
height_synthetic <- rnorm(
n = ###_______,
mean = ###______,
sd = ###______
)
# add new values to synthetic data as height
starwars_synthetic <- starwars_synthetic %>%
mutate(height = height_synthetic)
You can access the feedback form, along with optional homework questions to test your understanding, here. We appreciate your feedback!
Bowen, Claire McKay, Fang Liu, and Bingyue Su. 2021. "Differentially Private Data Release via Statistical Election to Partition Sequentially." Metron 79 (1): 1–31.
Drechsler, Jörg, Stefan Bender, and Susanne Rässler. 2008. "Comparing Fully and Partially Synthetic Datasets for Statistical Disclosure Control in the German IAB Establishment Panel." Transactions on Data Privacy 1 (December): 105–30.
Little, Roderick J. A. 1993. "Statistical Analysis of Masked Data." Journal of Official Statistics 9 (2): 407–26.
Raghunathan, Trivellore E. 2021. "Synthetic Data." Annual Review of Statistics and Its Application 8: 129–40.
Rubin, Donald B. 1977. "Formalizing Subjective Notions About the Effect of Nonrespondents in Sample Surveys." Journal of the American Statistical Association 72 (359): 538–43.
Rubin, Donald B. 1993. "Statistical Disclosure Limitation." Journal of Official Statistics 9 (2): 461–68.