Introduction to Synthetic Data

This three-part training series provides an overview of data privacy and synthetic data methods, with the goal of protecting privacy while maintaining data utility. The trainings cover the theory and concepts behind synthetic data and equip you with tools to apply these data privacy techniques to datasets. Specifically:

  • Week 1 will serve as an overview of data privacy concepts and techniques,
  • Week 2 will cover synthetic data creation and use, and
  • Week 3 will cover evaluating synthetic data using key privacy and utility metrics.


Motivation

This training series is provided in partnership with Allegheny County and the Western Pennsylvania Regional Data Center (WPRDC). While synthetic data has been used at the federal level, local governments and organizations often lack the human or computational resources required to implement synthetic data as a privacy-preserving technique. As part of a pilot program intended to understand and target the specific privacy-related needs of local governments, the Urban Institute is offering these trainings to any local stakeholders wishing to learn more about the creation, applications, and limitations of synthetic data.


Prerequisites

This course assumes some knowledge of general statistical concepts such as summary statistics and basic regression. No coding background is needed, but optional coding exercises will be provided in R and Python for interested users.


Instructors and Contact Information


Lesson Recordings

Day 1: Intro to Data Privacy and Data Synthesis
Day 2: Synthetic Data Methods
Day 3: Disclosure Risk and Utility Metrics and more Case Studies

Unfortunately, our Day 3 recording was not saved properly, so we have linked a recording from another training session we gave with mostly identical content.


Code Requirements

To run the R scripts in lessons_follow_along_code, you will need the following R packages installed:

  • tidyverse
  • palmerpenguins
  • kableExtra
  • gt
  • here
  • smoothmest
  • urbnthemes (can be installed with remotes::install_github("UrbanInstitute/urbnthemes"))

To run the Python scripts in lessons_follow_along_code, you will need the following Python packages installed:

  • pandas
  • numpy
  • seaborn
  • palmerpenguins
  • scipy
  • statsmodels
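
Before running the Python scripts, it can help to confirm that everything in the list above is importable. The following is a minimal sketch and not part of the course materials: the helper name missing_modules is hypothetical, and it assumes each package's import name matches the name listed above.

```python
from importlib.util import find_spec

# Names copied from the Python list above. Assumption: each package's
# import name is the same as its listed name.
REQUIRED_MODULES = [
    "pandas",
    "numpy",
    "seaborn",
    "palmerpenguins",
    "scipy",
    "statsmodels",
]


def missing_modules(names):
    """Return the entries of `names` that Python cannot locate for import."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are installed.")
```

Any names reported as missing can then be installed with your usual package manager (for example, pip or conda) before running the follow-along code.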