Access a recording of Lesson 1 here.
Email Maddie Pickens (mpickens@urban.org) for lesson HTML files.
To follow along with optional code exercises, see “Exercise 3 (Optional)” for instructions on setting up an RStudio Cloud account and accessing the coding exercises (available in R and Python). This week’s lesson contains dropdowns with R code for reference (click boxes labeled “Code” to view). We will not discuss these exercises or the R code in class.
Synthetic data consists of pseudo or “fake” records that are statistically representative of the confidential data. Records are considered synthesized when they are replaced with draws from a model fitted to the confidential data.
Partially synthetic data only synthesizes some of the variables in the released data (generally those most sensitive to disclosure). In partially synthetic data, there remains a one-to-one mapping between confidential records and synthetic records. Below, we see an example of what a partially synthesized version of the above confidential data could look like.
Fully synthetic data synthesizes all values in the dataset with imputed amounts. The synthetic records no longer map directly onto the confidential records, but they remain statistically representative. Because fully synthetic data contain no actual observations, they protect against both attribute and identity disclosure. Below, we see an example of what a fully synthesized version of the confidential data shown above could look like.
Imagine you are running a conference with 80 attendees. You are collecting names and ages of all your attendees. Unfortunately, when the conference is over, you realize that only about half of the attendees listed their ages. One common imputation technique is to just replace the missing values with the mean age of those in the data.
Shown below is the distribution of the 40 age observations that are not missing.
And after imputation, the histogram looks like this:
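In code, mean imputation is a one-liner. Here is a minimal sketch with a hypothetical ages vector (toy values for illustration, not the conference data):

# hypothetical vector of attendee ages with missing values
ages <- c(34, 51, NA, 28, 45, NA, 39, 62, NA, 30)

# replace each missing value with the mean of the observed ages
ages_imputed <- ifelse(is.na(ages), mean(ages, na.rm = TRUE), ages)

# compare the distributions before and after imputation
summary(ages)
summary(ages_imputed)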
A more advanced implementation of synthetic data generation estimates a model for each variable, using previously synthesized variables as predictors. This iterative process is called sequential synthesis. It allows us to model multivariate relationships (or joint distributions) without being computationally expensive.
The process described above may be easier to understand with the following table:
Step | Outcome | Modeled with | Predicted with |
---|---|---|---|
1 | Sex | — | Random sampling with replacement |
2 | Age | Sex | Sampled Sex |
3 | Social Security Benefits | Sex, Age | Sampled Sex, Sampled Age |
… | … | … | … |
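A minimal sketch of these steps in R, using a hypothetical confidential data frame conf with sex, age, and Social Security benefit (ssb) columns (all simulated for illustration; the penguins example later in this lesson walks through a full workflow):

# hypothetical confidential data for illustration
set.seed(1)
conf <- data.frame(
  sex = sample(c("female", "male"), 100, replace = TRUE),
  age = rnorm(100, mean = 45, sd = 12)
)
conf$ssb <- 5000 + 100 * (conf$sex == "male") + 20 * conf$age + rnorm(100, sd = 300)

# step 1: sample sex with replacement from its observed distribution
sex_synth <- sample(conf$sex, size = nrow(conf), replace = TRUE)

# step 2: model age with sex, then predict using the sampled sex
age_lm <- lm(age ~ sex, data = conf)
age_synth <- predict(age_lm, newdata = data.frame(sex = sex_synth))

# step 3: model benefits with sex and age, then predict with the synthetic values
# (a real synthesis would add noise to these predictions, as shown later in this lesson)
ssb_lm <- lm(ssb ~ sex + age, data = conf)
ssb_synth <- predict(ssb_lm, newdata = data.frame(sex = sex_synth, age = age_synth))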
Parametric data synthesis is the process of data generation based on a parametric distribution or generative model.

- Parametric models assume a finite number of parameters that capture the complexity of the data.
- They are generally less flexible, but more interpretable, than nonparametric models.
- Examples: regression to assign an age variable, sampling from a probability distribution, Bayesian models, copula-based models.

Nonparametric data synthesis is the process of data generation that is not based on assumptions about an underlying distribution or model.

- Often, nonparametric methods use frequency proportions or marginal probabilities as weights for some type of sampling scheme.
- They are generally more flexible, but less interpretable, than parametric models.
- Examples: assigning gender based on underlying proportions, CART (classification and regression tree) models, RNN models, etc.
Important: Synthetic data are only as good as the models used for imputation!
Researchers can create any number of versions of a partially synthetic or fully synthetic dataset. Each version of the dataset is called an implicate; implicates are also referred to as replicates or simply "synthetic datasets."

- Multiple implicates are useful for understanding the uncertainty added by imputation and are required for calculating valid standard errors (see the sketch after this list).
- More than one implicate can be released for public use; each new release increases disclosure risk, but also allows for more complete analyses and better inferences, provided users apply the correct combining rules.
- Implicates can also be analyzed internally to find which version(s) of the dataset provide the most utility in terms of data quality.
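For intuition, here is a minimal sketch of one common combining rule for estimates computed on m partially synthetic implicates (fully synthetic data use a different rule). The estimates and variances vectors below are hypothetical, standing in for one coefficient estimate and its variance from each implicate:

# hypothetical: one estimate and its variance from each of m implicates
estimates <- c(3.1, 2.8, 3.4, 3.0)
variances <- c(0.20, 0.22, 0.19, 0.21)

m     <- length(estimates)
q_bar <- mean(estimates)   # combined point estimate
b_m   <- var(estimates)    # between-implicate variance
u_bar <- mean(variances)   # average within-implicate variance

# total variance under the partially synthetic combining rule
t_p <- u_bar + b_m / m
se  <- sqrt(t_p)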
You have a confidential dataset that contains information about dogs' weight and height. You decide to sequentially synthesize these two variables and write up your method below. Can you spot the mistake in the write-up?

"To create a synthetic record, first synthetic pet weight is assigned based on a random draw from a normal distribution with mean equal to the average of the confidential weights and standard deviation equal to the standard deviation of the confidential weights. Then the confidential height is regressed on the synthetic weights. Using the resulting regression coefficients, a synthetic height variable is generated for each row in the data using just the synthetic weights as an input."

Answer: height should be regressed on the confidential values for weight, rather than the synthetic values for weight.
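To make the corrected method concrete, here is a minimal sketch with a hypothetical dogs data frame (the data are simulated purely for illustration):

# hypothetical confidential data
set.seed(1)
dogs <- data.frame(weight = rnorm(100, mean = 30, sd = 5))
dogs$height <- 20 + 0.8 * dogs$weight + rnorm(100, sd = 2)

# step 1: synthetic weight from a normal distribution fit to confidential weight
weight_synth <- rnorm(nrow(dogs), mean = mean(dogs$weight), sd = sd(dogs$weight))

# step 2: regress height on the *confidential* weight (not the synthetic weight)
height_lm <- lm(height ~ weight, data = dogs)

# step 3: generate synthetic height using the synthetic weights as inputs
height_synth <- predict(height_lm, newdata = data.frame(weight = weight_synth))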
What are the privacy implications of releasing multiple versions of a synthetic dataset (implicates)? Do these implications change for partially vs. fully synthetic data?

- Releasing multiple implicates improves transparency and analytical value, but increases disclosure risk (it erodes "security through obscurity").
- Releasing partially synthetic implicates is riskier, since the non-synthesized values are identical across implicates and a one-to-one relationship remains between confidential and synthesized records.
What are the trade-offs of a partially synthetic dataset compared to a fully synthetic dataset?

- Synthesizing only some variables (partial synthesis) generally leads to higher utility in analysis, since the relationships involving the unsynthesized variables are by definition unchanged (Drechsler, Bender, and Rässler 2008).
- Disclosure in fully synthetic data is nearly impossible because all values are imputed, while partially synthetic data carries higher disclosure risk because confidential values remain in the dataset (Drechsler, Bender, and Rässler 2008).
- Accurately and exhaustively specifying variable relationships and constraints in fully synthetic data is difficult, and doing so incorrectly can bias results (Drechsler, Bender, and Rässler 2008).
- Partially synthetic data may be publicly perceived as more reliable than fully synthetic data.
The palmerpenguins Data

The palmerpenguins package contains data about three species of penguins collected from three islands in the Palmer Archipelago, Antarctica. We will use an adapted version of the dataset to demonstrate some of the concepts discussed above.
# load packages; palmerpenguins supplies the data
# note: create_table() is a table-formatting helper from the course materials,
# and scatter_grid() comes from the urbnthemes package
library(tidyverse)
library(palmerpenguins)
library(urbnthemes)

# create dataset we will be using
penguins <- penguins %>%
filter(species == "Adelie") %>%
select(
sex,
bill_length_mm,
flipper_length_mm
) %>%
drop_na()
penguins %>%
head() %>%
create_table()
sex | bill_length_mm | flipper_length_mm |
---|---|---|
male | 39.1 | 181 |
female | 39.5 | 186 |
female | 40.3 | 195 |
female | 36.7 | 193 |
male | 39.3 | 190 |
female | 38.9 | 181 |
The above code simplifies the dataset to only three variables and removes missing values in those variables. We will synthesize the sex, bill_length_mm, and flipper_length_mm variables in this dataset using some of the methods discussed above. Since we are synthesizing all three variables, our final version of the dataset is considered fully synthetic.
The sex variable

Let's start by synthesizing sex, a binary variable that can take a value of either "male" or "female". To synthesize this variable, we will identify the percentage of the data that falls into each category and use those proportions to generate records that mimic the properties of the confidential data.
# identify percentage of total that each category (sex) makes up
penguins %>%
count(sex) %>%
mutate(relative_frequency = n / sum(n)) %>%
create_table()
sex | n | relative_frequency |
---|---|---|
female | 73 | 0.5 |
male | 73 | 0.5 |
Using these proportions, we will now randomly sample with replacement to mimic the underlying distribution of sex.
# set a seed so pseudo-random processes are reproducible
set.seed(20220301)
# vector of sex categories
sex_categories <- c('female', 'male')
# size of sample to generate
synthetic_data_size <- nrow(penguins)
# probability weights
sex_probs <- penguins %>%
count(sex) %>%
mutate(relative_frequency = n / sum(n)) %>%
pull(relative_frequency)
# use the sample() function to generate a synthetic vector of sex values
sex_synthetic <- sample(
x = sex_categories,
size = synthetic_data_size,
replace = TRUE,
prob = sex_probs
)
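As a quick sanity check, we can confirm the sampled shares are close to the confidential proportions:

# quick check: the synthetic shares should be close to the confidential shares
prop.table(table(sex_synthetic))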
Our new sex_synthetic variable will form the foundation of our synthesized data.
# set a seed so pseudo-random processes are reproducible
set.seed(20220301)
# use the vector to generate the synthetic sex column
penguins_synthetic <- tibble(
sex = sex_synthetic
)
penguins_synthetic %>%
head() %>%
create_table()
sex |
---|
female |
male |
female |
male |
female |
male |
The bill_length_mm variable

Unlike sex, bill_length_mm is numeric.
summary(penguins$bill_length_mm)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 32.10 36.73 38.85 38.82 40.77 46.00
To synthesize this variable, we are going to predict bill_length_mm for each penguin using a linear regression with sex as a predictor.

Note that sex is a factor variable with possible values of "male" and "female", which can't directly be used in a regression. Under the hood, R converts the factor into a binary numeric variable (i.e., 0 or 1) and then runs the regression.
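To see this encoding directly, we can peek at the design matrix R constructs from the factor (using the penguins data frame created above):

# inspect the dummy coding R creates for the sex factor
head(model.matrix(~ sex, data = penguins))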
# linear regression
bill_length_lm <- lm(
formula = bill_length_mm ~ sex,
data = penguins
)
summary(bill_length_lm)
##
## Call:
## lm(formula = bill_length_mm ~ sex, data = penguins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.790 -1.357 0.076 1.393 5.610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2575 0.2524 147.608 < 2e-16 ***
## sexmale 3.1329 0.3570 8.777 4.44e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.157 on 144 degrees of freedom
## Multiple R-squared: 0.3485, Adjusted R-squared: 0.344
## F-statistic: 77.03 on 1 and 144 DF, p-value: 4.44e-15
Now that we have coefficients for the linear regression, we can generate our synthetic values. First, we try a straightforward prediction of bill lengths using the synthetic sex variable.
# predict bill length with model coefficients
bill_length_synthetic_method1 <- predict(
object = bill_length_lm,
newdata = penguins_synthetic
)
# add predictions to synthetic data as bill_length
penguins_synthetic_method1 <- penguins_synthetic %>%
mutate(bill_length_mm = bill_length_synthetic_method1)
penguins_synthetic_method1 %>%
head() %>%
create_table()
sex | bill_length_mm |
---|---|
female | 37.25753 |
male | 40.39041 |
female | 37.25753 |
male | 40.39041 |
female | 37.25753 |
male | 40.39041 |
And now we compare the univariate distributions of the confidential data to our newly synthesized variable with a graph.
# create dataframe with both confidential and synthesized values
compare_penguins <- bind_rows(
"confidential" = penguins,
"synthetic" = penguins_synthetic_method1,
.id = "data_source"
)
# plot comparison of bill_length_mm distributions
compare_penguins %>%
select(data_source, bill_length_mm) %>%
pivot_longer(-data_source, names_to = "variable") %>%
ggplot(aes(x = value, fill = data_source)) +
geom_density(alpha = 0.3) +
labs(title = "Comparison of Univariate Distributions",
subtitle = "Method 1") +
scatter_grid()
Simply using the predicted values from our linear regression does not give us enough variation in the synthetic variable.

To understand more about the predictions made by the linear regression model, let's dig into the predicted values for the first few rows of data and the corresponding synthetic sex.
# Look at first few rows of synthetic data
penguins_synthetic_method1 %>%
head() %>%
create_table()
sex | bill_length_mm |
---|---|
female | 37.25753 |
male | 40.39041 |
female | 37.25753 |
male | 40.39041 |
female | 37.25753 |
male | 40.39041 |
We know from our regression output above that the intercept (\(\beta_0\)) is 37.3 and the coefficient for a male penguin (\(\beta_1\)) is 3.1. Therefore, if the penguin is male, we have a predicted value (\(\hat{y}\)) of \(37.3 + 3.1 = 40.4\). If the penguin is female, our predicted value (\(\hat{y}\)) is just the intercept, 37.3.
Because sex takes only two values, our synthetic bill_length_mm also takes only two values. The model fit limits the possible variation, making our synthetic variable significantly less useful.
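A quick check of the naive predictions confirms they collapse to just two distinct values:

# the naive predictions take only two distinct values
unique(penguins_synthetic_method1$bill_length_mm)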
Instead, we can try a second method: for each record, draw the synthetic value from a normal distribution with mean equal to the regression prediction for that record and standard deviation equal to the residual standard error.
# set a seed so pseudo-random processes are reproducible
set.seed(20220301)
# predict bill length with model coefficients
bill_length_predicted <- predict(
object = bill_length_lm,
newdata = penguins_synthetic
)
# synthetic column using normal distribution centered on predictions with sd of residual standard error
bill_length_synthetic_method2 <- rnorm(
n = nrow(penguins_synthetic),
mean = bill_length_predicted,
sd = sigma(bill_length_lm)
)
# add predictions to synthetic data as bill_length
penguins_synthetic_method2 <- penguins_synthetic %>%
mutate(bill_length_mm = bill_length_synthetic_method2)
penguins_synthetic_method2 %>%
head() %>%
create_table()
sex | bill_length_mm |
---|---|
female | 38.76954 |
male | 42.85524 |
female | 38.52128 |
male | 38.15742 |
female | 35.04525 |
male | 39.22174 |
Now, we again compare the univariate distributions of the confidential data and the synthetic data we generated with this second method.
# create dataframe with both confidential and synthesized values
compare_penguins <- bind_rows(
"confidential" = penguins,
"synthetic" = penguins_synthetic_method2,
.id = "data_source"
)
# plot comparison of bill_length_mm distributions
compare_penguins %>%
select(data_source, bill_length_mm) %>%
pivot_longer(-data_source, names_to = "variable") %>%
ggplot(aes(x = value, fill = data_source)) +
geom_density(alpha = 0.3) +
labs(title = "Comparison of Univariate Distributions",
subtitle = "Method 2") +
scatter_grid()
We have much more variation with this new method, though the distributions still do not match perfectly. We choose this method's output as the synthetic bill_length_mm in our final synthesized dataset, which now has two columns.
# using method 2 as synthesized variable
penguins_synthetic <- penguins_synthetic %>%
mutate(bill_length_mm = bill_length_synthetic_method2)
penguins_synthetic %>%
head() %>%
create_table()
sex | bill_length_mm |
---|---|
female | 38.76954 |
male | 42.85524 |
female | 38.52128 |
male | 38.15742 |
female | 35.04525 |
male | 39.22174 |
The flipper_length_mm variable

The flipper_length_mm variable is also numeric, so we can follow the same steps we used to synthesize bill_length_mm.
summary(penguins$flipper_length_mm)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 172.0 186.0 190.0 190.1 195.0 210.0
This time, we regress flipper_length_mm on both sex and bill_length_mm.
# linear regression
flipper_length_lm <- lm(
formula = flipper_length_mm ~ sex + bill_length_mm,
data = penguins
)
summary(flipper_length_lm)
##
## Call:
## lm(formula = flipper_length_mm ~ sex + bill_length_mm, data = penguins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.0907 -2.8058 -0.0534 3.3284 15.8789
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 170.6183 8.7497 19.500 <2e-16 ***
## sexmale 3.1721 1.2422 2.554 0.0117 *
## bill_length_mm 0.4610 0.2341 1.970 0.0508 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.058 on 143 degrees of freedom
## Multiple R-squared: 0.1492, Adjusted R-squared: 0.1373
## F-statistic: 12.54 on 2 and 143 DF, p-value: 9.606e-06
Since we already know we prefer the method that draws from a normal distribution centered on the regression predictions, we will default to that.
# set a seed so pseudo-random processes are reproducible
set.seed(20220301)
# predict flipper length with model coefficients
flipper_length_predicted <- predict(
object = flipper_length_lm,
newdata = penguins_synthetic
)
# synthetic column using normal distribution centered on predictions with sd of residual standard error
flipper_length_synthetic <- rnorm(
n = nrow(penguins_synthetic),
mean = flipper_length_predicted,
sd = sigma(flipper_length_lm)
)
# add predictions to synthetic data as flipper_length
penguins_synthetic <- penguins_synthetic %>%
mutate(flipper_length_mm = flipper_length_synthetic)
With all three variables synthesized, we can compare the univariate distributions of the confidential data and the synthetic data (we have already done this for bill_length_mm).
# create dataframe with both confidential and synthesized values
compare_penguins <- bind_rows(
"confidential" = penguins,
"synthetic" = penguins_synthetic,
.id = "data_source"
)
# Write out final compare_penguins df for use in future exercises
dir.create(here::here("data/"), showWarnings = FALSE)
compare_penguins %>%
write_csv(here::here("data", "penguins_synthetic_and_confidential.csv"))
First, we can compare the distributions of the sex variable in the confidential and synthetic data.
sex_comparison <- compare_penguins %>%
select(data_source, sex) %>%
count(data_source, sex) %>%
group_by(data_source) %>%
mutate(relative_frequency = n / sum(n))
sex_comparison %>%
ggplot(aes(x = n, y = sex, fill = data_source)) +
geom_text(aes(label = n),
position = position_dodge(width = 0.5),
hjust = -0.4) +
geom_col(position = "dodge", alpha = 0.7) +
scale_x_continuous(expand = expansion(add = c(0, 15))) +
labs(y = "", x = "N", title = "Comparison of sex distribution")
# plot comparison of distributions
compare_penguins %>%
select(
data_source,
bill_length_mm,
flipper_length_mm
) %>%
pivot_longer(-data_source, names_to = "variable") %>%
ggplot(aes(x = value, fill = data_source)) +
geom_density(alpha = 0.3) +
facet_wrap(~variable, scales = "free") +
labs(title = "Comparison of Univariate Distributions",
subtitle = "Final synthetic product") +
scatter_grid()
Questions: sex synthesis

- Was the method we used parametric or nonparametric? Why?
- What do we notice about the synthetic variable compared to the original?
- We generated a new sex variable that was the same length as the confidential data. Was this necessary for the method to be applied correctly?

Questions: bill_length_mm synthesis

- Was the method we just used parametric or nonparametric? Why?
- What do we notice about the synthetic variable compared to the original?
- Are we synthesizing these data sequentially? How do you know? (Yes: the synthetic sex variable was used as a predictor.)

Questions: flipper_length_mm synthesis

- What do we notice about the synthetic variable compared to the original?
- What are the benefits and drawbacks of generating this variable sequentially?
Setup
Because the following exercise involves coding, it is optional and to be done on your own time.
To follow along, sign up for an RStudio Cloud account (if you have not already) and join the Allegheny-count-data-privacy-trainings space so that you can follow along with our code exercises. You can do that by using this link. You will know you've successfully joined the class space when you see the following in the top left of your RStudio Cloud account:

You can find Intro to R and RStudio resources here if you're unfamiliar with R. There is also support for Python if you wish to use that. But again, all the code examples are optional and to be done on your own time. We will not be going over them in these trainings!
For this exercise, we will use the starwars dataset from the dplyr package. We will practice sequentially synthesizing a binary variable (gender) and a numeric variable (height).
# run this to get the dataset we will work with
library(tidyverse)

starwars <- dplyr::starwars %>%
select(gender, height) %>%
drop_na()
starwars %>%
head() %>%
create_table()
gender | height |
---|---|
masculine | 172 |
masculine | 167 |
masculine | 96 |
masculine | 202 |
feminine | 150 |
masculine | 178 |
Question 1: Gender synthesis

Fill in the blanks in the following code to synthesize the gender variable using the underlying distribution present in the data.
# set a seed so pseudo-random processes are reproducible
set.seed(20220301)
# Fill in the blanks!
# vector of gender categories
gender_categories <- c('feminine', 'masculine')
# size of sample to generate
synthetic_data_size <- nrow(starwars)
# probability weights
gender_probs <- starwars %>%
count(gender) %>%
mutate(relative_frequency = ### ______) %>%
pull(relative_frequency)
# use sample function to generate synthetic vector of genders
gender_synthetic <- sample(
x = ###_____,
size = ###_____,
replace = ###_____,
prob = ###_____
)
# create starwars_synthetic dataset using generated variable
starwars_synthetic <- tibble(
gender = gender_synthetic
)
Question 2: Height synthesis

Similarly, fill in the blanks in the code to generate the height variable using a linear regression with gender as a predictor.
# set a seed so pseudo-random processes are reproducible
set.seed(20220301)
# Fill in the blanks!
# linear regression
height_lm <- lm(
formula = ###_____,
data = ###______
)
# predict height with model coefficients
height_predicted <- predict(
object = height_lm,
newdata = ###_____
)
# synthetic column using normal distribution centered on predictions with sd of residual standard error
height_synthetic <- rnorm(
n = ###_______,
mean = ###______,
sd = ###______
)
# add new values to synthetic data as height
starwars_synthetic <- starwars_synthetic %>%
mutate(height = height_synthetic)
You can access the feedback form, along with optional homework questions to test your understanding, here. We appreciate your feedback!
Bowen, Claire McKay, Fang Liu, and Bingyue Su. 2021. "Differentially Private Data Release via Statistical Election to Partition Sequentially." Metron 79 (1): 1–31.
Drechsler, Jörg, Stefan Bender, and Susanne Rässler. 2008. "Comparing Fully and Partially Synthetic Datasets for Statistical Disclosure Control in the German IAB Establishment Panel." Transactions on Data Privacy 1 (December): 105–30.
Little, Roderick J. A. 1993. "Statistical Analysis of Masked Data." Journal of Official Statistics 9 (2): 407–26.
Raghunathan, Trivellore E. 2021. "Synthetic Data." Annual Review of Statistics and Its Application 8: 129–40.
Rubin, Donald B. 1977. "Formalizing Subjective Notions About the Effect of Nonrespondents in Sample Surveys." Journal of the American Statistical Association 72 (359): 538–43.
Rubin, Donald B. 1993. "Statistical Disclosure Limitation." Journal of Official Statistics 9 (2): 461–68.