Modern computing and technology have made it easy to collect and process large amounts of data.
But with that same computing power, malicious actors can now easily re-identify individuals by linking supposedly anonymized records with public databases.
This kind of attack is called a “record linkage” attack. The following are some examples of famous record linkage attacks.
In 1997, Massachusetts Governor Bill Weld announced the public release of insurance data for researchers, assuring the public that all personally identifiable information (PII) had been deleted. A few days later, Dr. Latanya Sweeney, then an MIT graduate student, mailed Weld's own medical records to his office. She had purchased voter data and linked Weld's birth date, gender, and ZIP code to his health records. And this was back in 1997, when computing power was minuscule and social media didn't exist!
A study by Dr. Latanya Sweeney based on the 1990 Census (Sweeney 2000) found that 87% of the US population had reported characteristics that likely made them unique based only on ZIP code, gender, and date of birth.
In 2019, The New York Times obtained a private company's database of cell phone locations. The journalists were able to identify a Microsoft engineer who traveled between his house and Microsoft's Seattle campus every workday. Then one day he visited Amazon's Seattle campus, and a month later he began going to Amazon's Seattle campus every workday instead.
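To make the record linkage attack concrete, here is a minimal sketch in R. The `voter_file` and `hospital_release` data frames, and every value in them, are invented for illustration; the point is simply that an exact join on a few quasi-identifiers (ZIP code, sex, and date of birth) can re-attach names to records that had their names removed.

```r
library(dplyr)

# Hypothetical public voter file: names attached to quasi-identifiers
voter_file <- tibble(
  name = c("Person A", "Person B"),
  zip  = c("02138", "02139"),
  sex  = c("male", "female"),
  dob  = as.Date(c("1945-07-31", "1970-01-15"))
)

# Hypothetical "anonymized" health release: names removed, quasi-identifiers kept
hospital_release <- tibble(
  zip       = c("02138", "02139"),
  sex       = c("male", "female"),
  dob       = as.Date(c("1945-07-31", "1970-01-15")),
  diagnosis = c("hypertension", "asthma")
)

# An exact join on the quasi-identifiers re-identifies both records
inner_join(voter_file, hospital_release, by = c("zip", "sex", "dob"))
```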
At the same time, releasing granular data can be of immense value to researchers. For example, cell phone data are invaluable for contact tracing, and granular medical data can lead to better treatments and the development of cures.
More granular data are also important for understanding equity, particularly for smaller populations and subgroups.
Data Privacy: The right of individuals to have control over their sensitive information.
Data privacy is a broad topic that includes data security, encryption, access to data, and more.
We will not be covering privacy breaches from unauthorized access to a database (i.e., hackers).
We are instead focused on privacy preserving access to data.
There is also a slight difference between Privacy and Confidentiality.
Privacy: The ability “to determine what information about ourselves we will share with others” (Fellegi 1972).
Confidentiality: “the agreement, explicit or implicit, between data subject and data collector regarding the extent to which access by others to personal information is allowed” (Fienberg and Jin 2018).
There is often a tension between privacy and utility of the data, known in the literature as the privacy-utility trade-off.
Data utility: How useful or accurate the data are for research and analysis (aka data quality/accuracy/usefulness).
Generally, higher utility means greater privacy risk, and vice versa.
In the data privacy ecosystem there are the following stakeholders:
Data users: Those who consume the data, e.g., researchers, decisionmakers, etc.
Data maintainers (aka data stewards/data owners): Those who own the data and are responsible for its safekeeping.
In addition, there are several versions of the data we should define:
Original dataset - uncleaned version (raw census microdata).
Confidential dataset (aka gold standard/actual data) - cleaned version (cleaned census microdata).
Public dataset - publicly released version (Census’ public tables/datasets).
Data users have traditionally gotten access to data via:
direct access to the confidential data if they are trusted users (e.g., Census Research Data Centers).
access to the public data/statistics (microdata, summary tables, or statistics) produced by data maintainers. This is how most people access the data, and what we’ll focus on.
Data maintainers have relied on Statistical Disclosure Control (SDC) methods to preserve data confidentiality when releasing public data.
Generally releasing confidential data involves the following steps to protect privacy:
Fill in the grey and green blanks in the following diagram with the terms you learned above. There is also one mistake; see if you can spot it.
Statistical Disclosure Control (SDC) or Statistical Disclosure Limitation (SDL) is a field of study that aims to develop methods for releasing high-quality data products while preserving the confidentiality of sensitive data.
SDC methods have existed within statistics and the social sciences since the mid-twentieth century.
Below is an opinionated and incomplete overview of various SDC methods. For this set of trainings, we will focus in depth on the methods in yellow.
Definitions of a few methods we won’t cover in detail. See Matthews and Harel (2011) for more information.
Suppression: Not releasing data about certain subgroups.
Swapping: The exchange of sensitive values among sample units with similar characteristics.
Generalization: Aggregating variables into larger units (e.g., reporting state rather than zip code) or top/bottom coding (limiting values below/above a threshold to the threshold value).
Noise Infusion: Adding random noise, often to continuous variables, which can maintain univariate distributions (a rough sketch of this and top coding follows this list).
Sampling: Only releasing a sample of individual records.
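As a rough illustration of two of the methods above, here is a minimal R sketch that applies top coding (a form of generalization) and noise infusion to a hypothetical income variable. The values, the threshold, and the noise scale are all invented for illustration; real applications would choose them much more carefully.

```r
set.seed(20220301)

# Hypothetical confidential income values
income <- c(23000, 48000, 51000, 75000, 250000, 1200000)

# Generalization via top coding: cap values above a chosen threshold
top_code_threshold <- 200000
income_top_coded <- pmin(income, top_code_threshold)

# Noise infusion: add mean-zero random noise to each value
income_noisy <- income + rnorm(length(income), mean = 0, sd = 5000)

data.frame(income, income_top_coded, income_noisy = round(income_noisy))
```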
The problem with the above approaches is that they can severely limit the utility of the data.
Mitra and Reiter (2006) found that swapping just 5 percent of the values of two identifying variables in the 1987 Survey of Youth in Custody invalidated statistical hypothesis tests in regression analyses.
Top/bottom coding eliminates information at the tails of the distributions, degrading analyses that depend on the entire distribution (Fuller 1993; Reiter, Wang, and Zhang, 2014).
A newer development in the SDC field with the potential to overcome some of these problems is Synthetic Data.
Synthetic data replaces actual records with statistically representative pseudo-records.
Synthetic records are typically generated from probability distributions or models that are identified as representative of the confidential data.
Generally, you can produce two kinds of synthetic data:
Partially Synthetic: Choosing a subset of sensitive variables to synthesize and keeping the other, non-sensitive variables the same. This means there is a 1:1 mapping between the actual records and the synthetic records (a minimal sketch of one approach follows this list).
Fully Synthetic: Synthesizing all variables in the data, so no confidential data are released. There is no longer a 1:1 mapping between the actual and synthetic records, but in the aggregate the synthetic dataset should be statistically representative of the actual dataset.
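As a rough sketch of one way a partially synthetic dataset could be generated, the R code below fits a simple model for a sensitive variable on the confidential data and then replaces the confidential values with draws from that model. The variable names, the plain linear model, and the use of `sigma(fit)` for the noise scale are illustrative assumptions; proper synthesis methods also account for model and parameter uncertainty.

```r
set.seed(20220301)

# Hypothetical confidential data: `age` treated as non-sensitive, `income` as sensitive
conf <- data.frame(
  age    = c(25, 31, 40, 52, 60, 38),
  income = c(31000, 45000, 62000, 80000, 71000, 55000)
)

# Fit a simple model for the sensitive variable using the non-sensitive variable
fit <- lm(income ~ age, data = conf)

# Partially synthetic data: keep `age`, replace `income` with draws from the model
synth <- conf
synth$income <- predict(fit, newdata = conf) +
  rnorm(nrow(conf), mean = 0, sd = sigma(fit))

synth
```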
Formal privacy centers around a formal mathematical definition of privacy that measures how much privacy leakage occurs.
A formally private method provides mathematically provable protection to respondents and allows policy makers to manage the trade-off between data accuracy and privacy protection through tunable parameters for multiple statistics.
Any SDC method/algorithm can be formally private if it meets the specific mathematical definition.
Formal privacy is a definition/threshold for methods to meet, not a method in and of itself.
Below in green are the SDC methods which have formally private versions.
There are two common kinds of formal privacy definitions.
Differential Privacy (DP): A mathematical definition of what it means to have privacy. The definition has a parameter called epsilon that controls how much privacy loss occurs (a minimal sketch of one DP mechanism follows this list).
Approximate Differential Privacy: A relaxation of DP within the same framework; it eases the privacy protection, which allows less noise to be added.
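For intuition about how epsilon shows up in practice, here is a minimal R sketch of the Laplace mechanism, one common way to satisfy DP when releasing a count: noise drawn from a Laplace distribution with scale sensitivity/epsilon is added to the true statistic. The sensitivity of 1 assumes a simple counting query, and the specific numbers are made up; smaller epsilon means more noise and stronger privacy.

```r
set.seed(20220301)

# Draw Laplace(0, scale) noise as the difference of two exponential draws
rlaplace <- function(n, scale) {
  rexp(n, rate = 1 / scale) - rexp(n, rate = 1 / scale)
}

true_count  <- 68   # e.g., the number of records in some table cell
sensitivity <- 1    # adding or removing one person changes a count by at most 1
epsilon     <- 0.5  # smaller epsilon = stronger privacy guarantee, more noise

noisy_count <- true_count + rlaplace(1, scale = sensitivity / epsilon)
noisy_count
```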
The key difference between traditional SDC methods and formally private SDC methods is the existence of an explicit threat model.
Formal privacy definitions assume the worst possible scenario: an attacker who knows every record in the dataset except one, has access to every possible version of the dataset (which accounts for all future data releases), and has unlimited computing power.
Formally private methods don't label particular variables as confidential or sensitive; instead, they treat all variables as sensitive due to the possibility of re-identification.
Generally there are 3 kinds of disclosure risk:
Identity disclosure risk: Occurs if the intruder associates a known individual with a released data record.
Attribute disclosure risk: Occurs if the intruder is able to determine some new characteristics of an individual based on the information available in the released data.
Inferential disclosure risk: Occurs if the intruder is able to determine the value of some characteristic of an individual more accurately with the released data than would otherwise have been possible.
Important note: acceptable disclosure risks are usually determined by law.
Generally there are 2 ways to measure utility of the data:
General Utility (aka global utility): Measures the distributional similarity between the confidential and synthetic data. An example is comparing summary statistics (e.g., mean, variance, skewness).
Specific Utility (aka outcome specific utility): Measures the similarity of results for specific analyses using both the confidential and synthetic data. An example is comparing coefficients in regression models.
Higher utility = higher accuracy and usefulness of the data, so this is a key part of selecting an appropriate SDC method.
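As a rough sketch of what computing these two kinds of metrics might look like in R, the code below compares a summary statistic (a general utility check) and a regression coefficient (a specific utility check) between a confidential and a synthetic dataset. The `conf` and `synth` data frames are simulated purely for illustration.

```r
set.seed(20220301)

# Simulated stand-ins for a confidential dataset and its synthetic counterpart
conf  <- data.frame(x = rnorm(100))
conf$y  <- 2 * conf$x + rnorm(100)
synth <- data.frame(x = rnorm(100))
synth$y <- 2 * synth$x + rnorm(100)

# General utility: compare univariate summary statistics across the two datasets
c(conf_mean_y = mean(conf$y), synth_mean_y = mean(synth$y))

# Specific utility: run the same regression on each dataset and compare coefficients
coef(lm(y ~ x, data = conf))
coef(lm(y ~ x, data = synth))
```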
Let’s say a researcher generates a synthetic version of a dataset on penguin species. The first 5 rows of the gold standard dataset look like this:
| species   | bill_length_mm | sex    |
|-----------|----------------|--------|
| Chinstrap | 51.3           | male   |
| Gentoo    | 44.0           | female |
| Chinstrap | 51.4           | male   |
| Chinstrap | 45.4           | female |
| Adelie    | 36.2           | female |
One metric to assess data utility is the overall count of each penguin species across the synthetic and gold standard data, which looks like this:
| species   | count (confidential data) | count (synthetic data) |
|-----------|---------------------------|------------------------|
| Adelie    | 152                       | 138                    |
| Chinstrap | 68                        | 68                     |
| Gentoo    | 124                       | 116                    |
Question 1: Would the above counts be considered a global utility metric or a specific utility metric and why?
Question 2: Let’s say that researchers decide that the sex of the penguins in the data is not confidential, but the species and bill length are. So they develop regression models that predict species and bill length from sex, use those models to estimate the species and bill length for each row in the data, and then release the result publicly. What specific Statistical Disclosure Control method are these researchers using?
Question 3: Researchers (Mayer, Mutchler, and Mitchell 2016) looked at telephone metadata, which included the times, durations, and outgoing numbers of telephone calls. They found that one of the records in the data placed frequent calls to a local firearm dealer that prominently advertises a specialty in the AR semiautomatic rifle platform. The participant also placed lengthy calls to the customer support hotline for a major firearm manufacturer that produces a popular AR line of rifles. Using publicly available data, the researchers were able to confirm the participant’s identity and confirm that he owned an AR-15. In this example, what kinds of disclosures happened? (Hint: there were two!)
If you want to follow along with the optional code exercises starting next week, you should sign up for an RStudio Cloud account and join the Allegheny-count-data-privacy-trainings space. You can do that by using this link. You will know you’ve successfully joined the class space when you see the following in the top left of your RStudio Cloud account:
You can find Intro to R and RStudio resources here if you’re unfamiliar with R. There will also be support for Python if you wish to use that. But again, all the code examples are optional and to be done on your own time. We will not be going over them in these trainings!
You can access the optional HW form here. We promise this won’t take more than 5 minutes and will give us valuable feedback on how we’re doing.
Bowen, Claire McKay. 2021. “Personal Privacy and the Public Good: Balancing Data Privacy and Data Utility.” Urban Institute. link
Benschop, T., and M. Welch. n.d. Statistical Disclosure Control for Microdata: A Practice Guide. https://sdcpractice.readthedocs.io/en/latest/
Fellegi, Ivan P. 1972. “On the Question of Statistical Confidentiality.” Journal of the American Statistical Association 67 (337): 7–18.
Fienberg, Stephen E, and Jiashun Jin. 2018. “Statistical Disclosure Limitation for Data Access.” In Encyclopedia of Database Systems (2nd Ed.).
Matthews, Gregory J, and Ofer Harel. 2011. “Data Confidentiality: A Review of Methods for Statistical Disclosure Limitation and Methods for Assessing Privacy.” Statistics Surveys 5: 1–29.
Mayer, Jonathan, Patrick Mutchler, and John C Mitchell. 2016. “Evaluating the Privacy Properties of Telephone Metadata.” Proceedings of the National Academy of Sciences 113 (20): 5536–41.
Sweeney, Latanya. 2000. “Simple Demographics Often Identify People Uniquely.” Health (San Francisco) 671 (2000): 1–34.