# avoid scientific notation in printed output
options(scipen = 999)

library(tidyverse)
library(urbnthemes)

# apply the Urban Institute ggplot2 theme defaults
set_urbn_defaults()
We will cover two case studies, where we will…
Note: Although the disclosure risk measures differ from those for synthetic data, we can use the same utility metrics as before to evaluate the quality of the formally private synthetic data.
The 2020 Census affects how the United States apportions the 435 seats in the U.S. House of Representatives, redraws voting district lines, plans for natural disasters, and allocates $1.5 trillion in federal funding, among many other things.
Since the 1920s, decennial census data have been altered with privacy-preserving methods. Several laws now require the U.S. Census Bureau to protect census data products; the most cited is Title 13, which protects individual-level data. Beyond the legal requirements, there are ethical considerations: some people might not be comfortable with others knowing there is a high concentration of certain racial groups, such as Asian Americans, given the lingering legacy of internment camps during World War II.
Why is the U.S. Census Bureau updating its disclosure avoidance system (DAS)?
Note: DAS is the overall statistical disclosure control methodology that the Census Bureau applies to protect their data products.
For more information about the reconstruction attack: “The Census Bureau’s Simulated Reconstruction-Abetted Re-identification Attack on the 2010 Census” webinar materials.
Note: The U.S. Census Bureau has received criticism for its reconstruction attack. Ruggles, Cleveland, and Van Riper (2021) claim that the Census Bureau did not test whether identifying individuals through the reconstruction attack performs better than random guessing. An analogy is a clinical trial, where you must have a control group to determine whether patients improve because of the treatment. The authors describe the U.S. Census Bureau’s reconstruction attack as a “treatment” group without a control group for comparison; in other words, one would expect some people in the treatment group to get better regardless of whether they received the treatment.
The entire population of the United States of America. “The goal is to count everyone once, only once, and in the right place.” The image above shows the Census Bureau geographic levels for the U.S.
The Census Bureau checked for several things, including, but not limited to:
Below is a summary of the method (Abowd et al. 2022).
The TopDown Algorithm (TDA) finds the optimal distribution of counts starting from the state level down to the census block level. Other hierarchies can be applied, such as bottom-up from census block to state, or from 4-digit NAICS code to 2-digit NAICS code.
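As a rough illustration of the top-down idea, here is a minimal sketch in R. It is not the Census Bureau’s actual implementation (the real TDA solves constrained optimization problems at each level of the hierarchy); it simply adds Laplace noise to a state total and its county counts, then post-processes the county counts to be nonnegative and consistent with the noisy state total. The counts and \(\epsilon\) values are hypothetical.

```r
set.seed(20230301)

# hypothetical county populations within one state
county_counts <- c(12000, 5400, 830, 47000)
state_total <- sum(county_counts)

# Laplace sampler: the difference of two iid exponentials
rlaplace <- function(n, scale) {
  rexp(n, rate = 1 / scale) - rexp(n, rate = 1 / scale)
}

# a count query has sensitivity 1, so the noise scale is 1 / epsilon
epsilon_state <- 1
epsilon_county <- 0.5

noisy_state <- state_total + rlaplace(1, scale = 1 / epsilon_state)
noisy_counties <- county_counts +
  rlaplace(length(county_counts), scale = 1 / epsilon_county)

# post-processing (no additional privacy cost): clip negative counts,
# then rescale so the counties sum to the noisy state total
noisy_counties <- pmax(noisy_counties, 0)
noisy_counties <- noisy_counties * (noisy_state / sum(noisy_counties))

round(noisy_counties)
```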
Suppose we have establishment data with the number of establishments at the state and county levels, the total net profit/loss, and the associated 3-digit NAICS code (i.e., data with four variables).
What hierarchy do you think would be best for this dataset and why?
The National Institute of Standards and Technology Public Safety Communications Research Division (NIST PSCR) hosted two Differential Privacy Data Challenges to encourage new innovations in formally private synthetic data methodologies (Ridgeway et al. 2020).
2018 Differential Privacy Synthetic Data Challenge used emergency response data and Census public use microdata.
2020 Differential Privacy Temporal Map Challenge used 911 incident data, American Community Survey demographic data, and Chicago taxi ride data.
We will focus on the 2018 challenge because NIST and the contestants have released more publications associated with that challenge.
The data challenge had three matches (Bowen and Snoke 2021):
A subset of the 2017 San Francisco Fire Department’s Call for Service data
# A tibble: 313,607 × 5
   `ALS Unit` `Final Priority` `Call Type Group` `Original Priority` Priority
        <dbl>            <dbl>             <dbl>               <dbl>    <dbl>
 1          0                0                 0                   0        0
 2          0                0                 0                   0        0
 3          0                0                 0                   0        0
 4          0                0                 0                   0        0
 5          0                0                 0                   0        0
 6          0                0                 1                   0        0
 7          0                0                 0                   0        0
 8          1                0                 0                   0        0
 9          1                0                 1                   0        0
10          0                0                 0                   0        0
# ℹ 313,597 more rows
A subset of the Vermont PUMS data
# A tibble: 211,228 × 10
   METAREA METAREAD SPLIT METRO SCHOOL   SEX RESPONDT SLREC LABFORCE SSENROLL
     <dbl>    <dbl> <dbl> <dbl>  <dbl> <dbl>    <dbl> <dbl>    <dbl>    <dbl>
 1       0        0     0     1      1     1        1     1        2        0
 2       0        0     0     1      1     2        2     1        1        0
 3       0        0     0     1      1     2        1     1        2        0
 4       0        0     0     1      2     1        1     1        1        0
 5       0        0     0     1      2     1        1     1        0        0
 6       0        0     0     1      2     2        1     1        0        0
 7       0        0     0     1      2     1        1     1        0        0
 8       0        0     0     1      1     2        1     1        0        0
 9       0        0     0     1      1     2        1     1        0        0
10       0        0     0     1      1     1        2     1        2        0
# ℹ 211,218 more rows
Note: Some contestants chose not to use \(\delta\) because they used pure \(\epsilon\)-DP.
NIST PSCR created a “clustering task”, “classification task”, and “regression task” (Bowen and Snoke 2021).
The “clustering” analysis compared the 3-way marginal density distributions between the original and synthetic datasets, where the utility score was the absolute difference between the density distributions. NIST PSCR repeated this calculation 100 times on randomly selected variables and then averaged the results for the final clustering score.
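Below is our own simplified sketch of that scoring idea, not NIST PSCR’s official scoring code: randomly pick three variables, compare the 3-way marginal densities of the original and synthetic data, and average the absolute differences across repetitions.

```r
# simplified version of the "clustering" score: the mean, across random
# variable triples, of the total absolute difference in 3-way marginal densities
marginal_diff_score <- function(original, synthetic, n_reps = 100) {
  scores <- replicate(n_reps, {
    vars <- sample(names(original), size = 3)

    orig_dens <- original |>
      count(across(all_of(vars)), name = "n_orig") |>
      mutate(p_orig = n_orig / sum(n_orig))

    synth_dens <- synthetic |>
      count(across(all_of(vars)), name = "n_synth") |>
      mutate(p_synth = n_synth / sum(n_synth))

    # cells that appear in only one data set contribute a density of zero
    full_join(orig_dens, synth_dens, by = vars) |>
      mutate(p_orig = replace_na(p_orig, 0),
             p_synth = replace_na(p_synth, 0)) |>
      summarize(diff = sum(abs(p_orig - p_synth))) |>
      pull(diff)
  })

  mean(scores)
}
```

Calling, for example, `marginal_diff_score(fire_data, synthetic_fire_data)` (both object names are hypothetical) would return a value near 0 when the synthetic data preserve the joint distributions well; note that NIST PSCR transformed its scores so that larger values were better.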
At a high level, the “classification” analysis tests the similarity between the original and synthetic data in randomly selected subsets of the joint distribution, so the term “classification” is slightly misleading.
The “regression” analysis used a two-part score system.
For all three NIST PSCR scoring metrics, a larger value in the challenge indicated that the synthetic data preserved the original data well.
We will look at two finalist submissions that performed well overall in the contest and compare their approaches.
DPFieldGroups - a nonparametric formally private method.
Maximum Spanning Tree (NIST-MST) - a parametric formally private method (McKenna, Miklau, and Sheldon 2021).
NIST PSCR announced the metric criteria at the start of each match, so the competitors could modify their approach based on the scoring metrics. This is a known issue in most data challenges.
Assumed the data are counts and treated the training data as public.
Which method do you think performed the best for the NIST Data Challenge when applied to all three datasets?
NIST-MST won the entire competition and went on to win the 2020 Differential Privacy Temporal Map Challenge.
Does the result from the previous question surprise you? Why or why not?
Think about the metrics NIST PSCR created for the challenge.
Which method do you think performed the best across a wider range of utility metrics?
Bowen and Snoke (2021) used the formally private synthetic datasets from the 2018 competition and applied various other utility metrics. They found that DPFieldGroups performed better overall compared with the other methods.
Google created the COVID-19 Community Mobility Reports to provide movement trends over time at different geographic regions for various categories, such as transit stations and workplaces (Aktay et al. 2020).
The figure below is a screenshot of the Google COVID-19 Community Mobility Report for Santa Fe County from December 4, 2020 to January 15, 2021. The plots show average movement increase or decrease for each category from the baseline.
Note: Google calculated the baseline as the median value for the corresponding day of the week during the 5-week period from January 3 to February 6, 2020. This baseline is fixed.
Note: For most location types, the Google researchers focused on the 12 active hours.
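To make the baseline definition concrete, here is a minimal sketch in R; `visits`, a data frame with `date` and `n_visits` columns for one place category in one county, is a hypothetical input.

```r
library(lubridate)

# baseline: median visits for each day of the week,
# computed over the fixed window January 3 - February 6, 2020
baseline <- visits |>
  filter(date >= ymd("2020-01-03"), date <= ymd("2020-02-06")) |>
  group_by(day_of_week = wday(date, label = TRUE)) |>
  summarize(baseline_visits = median(n_visits))

# each reported value is the percent change from that weekday's baseline
changes <- visits |>
  mutate(day_of_week = wday(date, label = TRUE)) |>
  left_join(baseline, by = "day_of_week") |>
  mutate(pct_change = (n_visits - baseline_visits) / baseline_visits * 100)
```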
The Google researchers simplified the geospatial problem by releasing the information at larger geography levels (e.g., county) and with no other types of information (e.g., demographic information).
The figure below is a diagram of Google’s approach (Aktay et al. 2020).
The figure below shows the privacy or noise parameters used for the Google COVID-19 Mobility Reports at each geographic level. Granularity level 0 corresponds to country/region, level 1 corresponds to top-level geopolitical subdivisions (e.g., U.S. states), and level 2 corresponds to higher-resolution granularity (e.g., U.S. counties).
Note: The Google researchers did not publish any target statistics for geographic regions smaller than 3km\(^2\).
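As a hedged sketch of how a granularity-specific privacy budget might be applied (the \(\epsilon\) values below are made up for illustration, not Google’s published parameters), one could add Laplace noise whose scale depends on the geographic level:

```r
# Laplace sampler, as in the earlier top-down sketch
rlaplace <- function(n, scale) {
  rexp(n, rate = 1 / scale) - rexp(n, rate = 1 / scale)
}

# hypothetical per-level budgets: finer geographies get smaller epsilon,
# and therefore more noise
epsilon_by_level <- c(`0` = 2, `1` = 1, `2` = 0.5)

# protect a daily visit count at a given granularity level; if each person
# contributes at most one visit per day, the count has sensitivity 1
noisy_count <- function(true_count, level) {
  scale <- 1 / epsilon_by_level[as.character(level)]
  true_count + rlaplace(1, scale = scale)
}

noisy_count(1250, level = 2)  # county-level count gets the most noise
```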
The figure below is a screenshot of the Google COVID-19 Community Mobility Report for Los Alamos County from December 4, 2020 to January 15, 2021. The plots show average movement increase or decrease for each category from the baseline. The * indicates that “The data doesn’t meet quality and privacy thresholds for every day in the chart.”
Do you agree or disagree with the Google researchers’ choice of utility metrics and privacy parameters?
What other utility metrics would you use to assess the quality of location and time data?
Which of these is a step (and not part of a step) in the overall statistical disclosure control framework?
The U.S. Census Bureau used what privacy definition for the 2020 DAS?
For the NIST PSCR Differential Privacy Data Challenge, what was NOT an assumption when developing their formally private methods prior to scoring?