Cleaning NTIS Data: Before and After ntistools
Source: vignettes/before-and-after.Rmd
Why ntistools?
NTIS data-cleaning scripts tend to repeat the same
case_when() patterns over and over: combining binary
indicators, removing sentinels, collapsing Likert scales, labeling
columns, and so on. A typical script might contain 20+ blocks of nearly
identical code like this:
# This same pattern, repeated for every binary variable...
ProgDem_Veterans = case_when(
ProgDem_Veterans == 1 ~ "Serving veterans",
ProgDem_Veterans == 0 ~ "Not serving veterans",
.default = NA_character_
)
ntistools replaces each pattern with a single, purpose-built function so your scripts are shorter, easier to read, and less error-prone. This vignette walks through each function, showing the verbose “before” code alongside the concise “after” equivalent, all using a small synthetic dataset.
Setup
We start by creating a synthetic dataset that mimics the structure of real NTIS survey data. Each column represents a common pattern you’ll encounter:
library(ntistools)
library(dplyr)
survey <- data.frame(
# Binary geographic indicators (0 = no, 1 = yes)
GeoAreas_Local = c(1, 0, 0, NA, 0, 1),
GeoAreas_MultipleLocal = c(0, 0, 1, NA, 0, 0),
GeoAreas_RegionalWithin = c(0, 0, 0, NA, 0, 0),
GeoAreas_MultipleState = c(0, NA, 1, NA, 0, 0),
GeoAreas_RegionalAcross = c(0, NA, 0, NA, 0, 1),
# Disruption indicators (0 = no, 1 = yes)
LLLost = c(1, 0, 1, NA, 0, 1),
LLDelay = c(1, 1, 0, NA, 0, 1),
LLStop = c(0, 0, 0, NA, 0, 1),
# ProgDem-style columns (0 = no, 1 = yes primary, 2 = yes secondary, 98 = NA)
ProgDem_Children = c(0, 1, 2, 98, 0, 1),
ProgDem_Elders = c(1, 0, 98, 2, 0, 1),
# Sentinel values (98 = "don't know")
DonImportance = c(3, 1, 98, 2, 98, 4),
# Flag-based imputation columns
# PplSrv_NumWait_NA_X = 1 means "respondent confirmed they had zero people
# waiting", so NA should be replaced with 0
PplSrv_NumWait = c(50, NA, 10, NA, 30, NA),
PplSrv_NumWait_NA_X = c(0, 1, 0, 0, 0, 1),
# Imputation with year-suffixed variable names
# The variable has a year suffix but the flag column does not
Staff_RegVlntr_2023 = c(NA, 5, NA, 10, NA, 3),
Staff_RegVlntr_2024 = c(2, NA, NA, 7, NA, 4),
Staff_RegVlntr_NA = c(1, 0, 0, 0, 1, 0),
# Likert scales (1-5, with 97 = "not applicable")
FinanceChng_Benefits = c(1, 2, 3, 4, 5, 97),
FinanceChng_Salaries = c(5, 4, 3, 2, 1, 97),
FinanceChng_TotExp = c(2, 3, 97, 5, 1, 4),
# Finance change indicators (1 = yes, 0 = no, 97 = not applicable)
FinanceChng_Reserves = c(1, 0, 97, 1, 0, 1),
PrgSrvc_Suspend = c(1, 0, 97, 1, 0, 0),
# Parent/child regulation pattern
Regulations = c(1, 0, 97, 1, 0, 1),
Regulations_Federal = c(1, NA, NA, 0, NA, 1),
Regulations_State = c(0, NA, NA, 1, NA, 1),
# Staff filter columns
HaveStaff = c(1, 1, 0, 1, 0, 1),
StaffVacancies = c(3, 1, 2, 0, 4, 1),
BenefitsImpact = c(2, 3, 1, 4, 2, 3)
)
1. combine_binary() — Grouping binary indicators
The problem: You have several related 0/1 columns
and need a single column that is 1 if any of them are
1. For example, in the NTIS extraction script,
GeoAreas_Local, GeoAreas_MultipleLocal, and
GeoAreas_RegionalWithin are combined into
GeoAreas_ServedLocal.
Before
before <- survey %>%
mutate(
GeoAreas_Locally = case_when(
GeoAreas_Local == 1 | GeoAreas_MultipleLocal == 1 |
GeoAreas_RegionalWithin == 1 ~ 1,
is.na(GeoAreas_Local) & is.na(GeoAreas_MultipleLocal) &
is.na(GeoAreas_RegionalWithin) ~ NA,
.default = 0
)
)
before %>% select(starts_with("GeoAreas_L"), GeoAreas_MultipleLocal,
GeoAreas_RegionalWithin)
#> GeoAreas_Local GeoAreas_Locally GeoAreas_MultipleLocal
#> 1 1 1 0
#> 2 0 0 0
#> 3 0 1 1
#> 4 NA NA NA
#> 5 0 0 0
#> 6 1 1 0
#> GeoAreas_RegionalWithin
#> 1 0
#> 2 0
#> 3 0
#> 4 NA
#> 5 0
#> 6 0
After
after <- survey %>%
combine_binary(GeoAreas_Locally,
GeoAreas_Local, GeoAreas_MultipleLocal,
GeoAreas_RegionalWithin)
after %>% select(starts_with("GeoAreas_L"), GeoAreas_MultipleLocal,
GeoAreas_RegionalWithin)
#> GeoAreas_Local GeoAreas_Locally GeoAreas_MultipleLocal
#> 1 1 1 0
#> 2 0 0 0
#> 3 0 1 1
#> 4 NA NA NA
#> 5 0 0 0
#> 6 1 1 0
#> GeoAreas_RegionalWithin
#> 1 0
#> 2 0
#> 3 0
#> 4 NA
#> 5 0
#> 6 0
Option: strict_na for stricter NA handling
By default, combine_binary() only returns NA when
all source columns are NA. If a row has
c(0, NA), the result is 0.
Sometimes you need stricter behavior: if any source
is NA (and none are 1), the result should be NA. This matches the
binary_flag() function used in older NTIS scripts. Set
strict_na = TRUE to get this behavior.
# Data where one source is 0 and the other is NA
strict_example <- data.frame(
a = c(0, 1, NA, 0),
b = c(NA, NA, NA, 0)
)
# Default: lenient (0 + NA = 0)
combine_binary(strict_example, result_lenient, a, b) %>%
select(a, b, result_lenient)
#> a b result_lenient
#> 1 0 NA 0
#> 2 1 NA 1
#> 3 NA NA NA
#> 4 0 0 0
# Strict: any NA poisons the result (0 + NA = NA), unless a 1 is present
combine_binary(strict_example, result_strict, a, b, strict_na = TRUE) %>%
select(a, b, result_strict)
#> a b result_strict
#> 1 0 NA NA
#> 2 1 NA 1
#> 3 NA NA NA
#> 4 0 0 0
When to use strict_na = TRUE: Use it
when a missing value means you genuinely don’t know if the respondent
would have answered “yes.” In the NTIS extraction script, the
GeoAreas_ServedLocal and
GeoAreas_Servedmultistate variables use strict mode.
2. count_binary() — Any / count / all from binary indicators
Creates three summary columns at once from a set of binary columns:
- {prefix}_any: 1 if any source is 1
- {prefix}_count: how many sources are 1
- {prefix}_all: 1 if every source is 1
All three are NA when every source column is NA.
Before
before <- survey %>%
mutate(
LLDisruption_any = case_when(
LLLost == 1 | LLDelay == 1 | LLStop == 1 ~ 1,
is.na(LLLost) & is.na(LLDelay) & is.na(LLStop) ~ NA,
.default = 0
),
LLDisruption_count = case_when(
is.na(LLLost) & is.na(LLDelay) & is.na(LLStop) ~ NA_real_,
.default = rowSums(across(c(LLLost, LLDelay, LLStop),
~ . == 1), na.rm = TRUE)
),
LLDisruption_all = case_when(
LLLost == 1 & LLDelay == 1 & LLStop == 1 ~ 1,
is.na(LLLost) & is.na(LLDelay) & is.na(LLStop) ~ NA,
.default = 0
)
)
before %>% select(starts_with("LL"))
#> LLLost LLDelay LLStop LLDisruption_any LLDisruption_count LLDisruption_all
#> 1 1 1 0 1 2 0
#> 2 0 1 0 1 1 0
#> 3 1 0 0 1 1 0
#> 4 NA NA NA NA NA NA
#> 5 0 0 0 0 0 0
#> 6 1 1 1 1 3 1
After
after <- survey %>%
count_binary("LLDisruption", LLLost, LLDelay, LLStop)
after %>% select(starts_with("LL"))
#> LLLost LLDelay LLStop LLDisruption_any LLDisruption_count LLDisruption_all
#> 1 1 1 0 1 2 0
#> 2 0 1 0 1 1 0
#> 3 1 0 0 1 1 0
#> 4 NA NA NA NA NA NA
#> 5 0 0 0 0 0 0
#> 6 1 1 1 1 3 1
Tip: If you only need the _any column
(not _count or _all), use
combine_binary() instead — it creates a single column and
lets you name it whatever you want.
3. recode_binary() — Collapse multi-level to 0/1
The problem: NTIS demographic columns use 0 = “no”,
1 = “yes (primary)”, 2 = “yes (secondary)”, and 98 = “don’t know.” You
need to collapse these to simple 0/1. In the extraction script, all
ProgDem_ columns go through this step.
Before
recode_progdem <- function(x) {
case_when(
x == 0 ~ 0, x == 1 ~ 1, x == 2 ~ 1,
x == 98 ~ NA_real_, TRUE ~ NA_real_
)
}
before <- survey %>%
mutate(across(c(ProgDem_Children, ProgDem_Elders),
recode_progdem))
before %>% select(starts_with("ProgDem"))
#> ProgDem_Children ProgDem_Elders
#> 1 0 1
#> 2 1 0
#> 3 1 NA
#> 4 NA 1
#> 5 0 0
#> 6 1 1
After
after <- survey %>%
recode_binary(c("ProgDem_Children", "ProgDem_Elders"))
after %>% select(starts_with("ProgDem"))
#> ProgDem_Children ProgDem_Elders
#> 1 0 1
#> 2 1 0
#> 3 1 NA
#> 4 NA 1
#> 5 0 0
#> 6 1 1
Customizing value mappings
The defaults (ones = c(1, 2), zeros = 0,
na_values = 98) work for ProgDem_ columns, but
you can customize them for other patterns:
# Example: a column where 1-3 mean "yes" and 4-5 mean "no"
custom_example <- data.frame(rating = c(1, 2, 3, 4, 5, 98))
recode_binary(custom_example, "rating",
ones = c(1, 2, 3),
zeros = c(4, 5),
na_values = 98) %>%
pull(rating)
#> [1] 1 1 1 0 0 NA
4. recode_sentinel() — Remove sentinel values
Replace survey sentinel values (e.g., 98 = “don’t know”) with
NA. The extraction script removes 98s from dozens of
columns before analysis.
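For comparison, the hand-written version of this step might look like the sketch below. It is written in base R so it runs without any packages, and it is not the extraction script's actual code (which is not shown here). The `don_importance` vector is a hypothetical stand-alone copy of the DonImportance column from the Setup data.

```r
# Hypothetical stand-alone copy of survey$DonImportance from the Setup section
don_importance <- c(3, 1, 98, 2, 98, 4)

# Replace the sentinel 98 ("don't know") with NA and leave other values alone.
# %in% is used instead of == so that existing NAs never produce an NA index.
cleaned <- replace(don_importance, don_importance %in% 98, NA)

print(cleaned)
```

Repeating a line like this for dozens of columns is the boilerplate that recode_sentinel() is meant to replace.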
After
after <- survey %>%
recode_sentinel("DonImportance")
after %>% select(DonImportance)
#> DonImportance
#> 1 3
#> 2 1
#> 3 NA
#> 4 2
#> 5 NA
#> 6 4
Removing different sentinel values
By default, recode_sentinel() removes 98. You can change
which values to remove:
# Remove both 97 and 98
survey %>%
recode_sentinel("DonImportance", values = c(97, 98)) %>%
select(DonImportance)
#> DonImportance
#> 1 3
#> 2 1
#> 3 NA
#> 4 2
#> 5 NA
#> 6 4
5. impute_from_flag() — Fill NA when a flag says “skip was valid”
The problem: When a respondent legitimately skipped
a numeric question (e.g., “How many people are on your waitlist?” when
they have no waitlist), the value is NA but a companion
flag column (e.g., PplSrv_NumWait_NA_X) is set to 1. These
NAs should be replaced with 0 (not left as missing data).
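The rule described above can be sketched in base R as follows. This is an illustration of the logic only, not the package's implementation; `num_wait` and `num_wait_flag` are hypothetical stand-alone copies of the PplSrv_NumWait and PplSrv_NumWait_NA_X columns from the Setup data.

```r
# Hypothetical stand-alone copies of the two Setup columns
num_wait      <- c(50, NA, 10, NA, 30, NA)
num_wait_flag <- c(0, 1, 0, 0, 0, 1)   # 1 = the skip was valid

# A value is imputed only when it is missing AND the flag confirms the skip.
# Missing values without a flag (row 4) stay NA.
valid_skip <- is.na(num_wait) & num_wait_flag %in% 1
imputed <- ifelse(valid_skip, 0, num_wait)

print(imputed)
```

Row 4 is the case to watch: the value is missing but the flag is 0, so it remains genuinely missing rather than becoming a zero.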
After
after <- survey %>%
impute_from_flag("PplSrv_NumWait")
after %>% select(starts_with("PplSrv"))
#> PplSrv_NumWait PplSrv_NumWait_NA_X
#> 1 50 0
#> 2 0 1
#> 3 10 0
#> 4 NA 0
#> 5 30 0
#> 6 0 1
Option: flag_map for non-standard flag column names
Sometimes the variable name includes a year suffix (e.g.,
Staff_RegVlntr_2023) but the flag column does not (e.g.,
Staff_RegVlntr_NA). The default
paste0(var, flag_suffix) logic would look for
Staff_RegVlntr_2023_NA_X, which doesn’t exist.
Use flag_map to tell the function exactly which flag
column to use for each variable:
# Before: six separate case_when blocks in the extraction script
before <- survey %>%
mutate(
Staff_RegVlntr_2023 = case_when(
is.na(Staff_RegVlntr_2023) & Staff_RegVlntr_NA == 1 ~ 0,
.default = Staff_RegVlntr_2023
),
Staff_RegVlntr_2024 = case_when(
is.na(Staff_RegVlntr_2024) & Staff_RegVlntr_NA == 1 ~ 0,
.default = Staff_RegVlntr_2024
)
)
before %>% select(starts_with("Staff_RegVlntr"))
#> Staff_RegVlntr_2023 Staff_RegVlntr_2024 Staff_RegVlntr_NA
#> 1 0 2 1
#> 2 5 NA 0
#> 3 NA NA 0
#> 4 10 7 0
#> 5 0 0 1
#> 6 3 4 0
# After: one call handles both year columns
after <- survey %>%
impute_from_flag(
vars = c("Staff_RegVlntr_2023", "Staff_RegVlntr_2024"),
flag_map = c(Staff_RegVlntr_2023 = "Staff_RegVlntr_NA",
Staff_RegVlntr_2024 = "Staff_RegVlntr_NA")
)
after %>% select(starts_with("Staff_RegVlntr"))
#> Staff_RegVlntr_2023 Staff_RegVlntr_2024 Staff_RegVlntr_NA
#> 1 0 2 1
#> 2 5 NA 0
#> 3 NA NA 0
#> 4 10 7 0
#> 5 0 0 1
#> 6 3 4 0
Tip: You can mix flag_map and the
default suffix in the same call. Variables listed in
flag_map use their mapped flag; all others fall back to
paste0(var, flag_suffix).
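The fallback rule in this tip can be pictured with a small base R sketch. The helper `resolve_flag()` below is hypothetical and is not part of ntistools; it only illustrates how a mapped flag might take precedence over the default suffix. The "_NA_X" default is inferred from the column names used in this vignette.

```r
# Hypothetical helper (NOT an ntistools function): pick a flag column per variable
resolve_flag <- function(vars, flag_map = character(), flag_suffix = "_NA_X") {
  # Default for every variable: variable name plus the suffix
  flags <- paste0(vars, flag_suffix)
  names(flags) <- vars
  # Variables listed in flag_map use their mapped flag column instead
  mapped <- vars[vars %in% names(flag_map)]
  flags[mapped] <- flag_map[mapped]
  flags
}

flags <- resolve_flag(
  vars     = c("PplSrv_NumWait", "Staff_RegVlntr_2023"),
  flag_map = c(Staff_RegVlntr_2023 = "Staff_RegVlntr_NA")
)
print(flags)
```

Here PplSrv_NumWait falls back to PplSrv_NumWait_NA_X, while Staff_RegVlntr_2023 uses the explicitly mapped Staff_RegVlntr_NA.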
6. collapse_likert() — Likert 5-point to 3-point
Collapse a 5-point Likert scale (1–5) into 3 categories (1 = low, 2 =
mid, 3 = high), with 97 set to NA. The default
mapping = c(1, 1, 2, 3, 3) means:
| Original | Maps to | Meaning |
|---|---|---|
| 1 | 1 | Low (e.g., “Significant decrease”) |
| 2 | 1 | Low (e.g., “Some decrease”) |
| 3 | 2 | Mid (e.g., “No change”) |
| 4 | 3 | High (e.g., “Some increase”) |
| 5 | 3 | High (e.g., “Significant increase”) |
| 97 | NA | Not applicable |
Before
before <- survey %>%
mutate(
FinanceChng_Benefits = case_match(
FinanceChng_Benefits,
1 ~ 1, 2 ~ 1, 3 ~ 2, 4 ~ 3, 5 ~ 3, 97 ~ NA
)
)
#> Warning: There was 1 warning in `mutate()`.
#> ℹ In argument: `FinanceChng_Benefits = case_match(...)`.
#> Caused by warning:
#> ! `case_match()` was deprecated in dplyr 1.2.0.
#> ℹ Please use `recode_values()` instead.
before %>% select(FinanceChng_Benefits)
#> FinanceChng_Benefits
#> 1 1
#> 2 1
#> 3 2
#> 4 3
#> 5 3
#> 6 NA
After
after <- survey %>%
collapse_likert("FinanceChng_Benefits")
after %>% select(FinanceChng_Benefits)
#> FinanceChng_Benefits
#> 1 1
#> 2 1
#> 3 2
#> 4 3
#> 5 3
#> 6 NA
Applying to multiple columns at once
In the extraction script, many Likert columns are collapsed using the same mapping. Pass a character vector of column names:
likert_cols <- c("FinanceChng_Benefits", "FinanceChng_Salaries",
"FinanceChng_TotExp")
after <- survey %>%
collapse_likert(likert_cols)
after %>% select(all_of(likert_cols))
#> FinanceChng_Benefits FinanceChng_Salaries FinanceChng_TotExp
#> 1 1 3 1
#> 2 1 3 2
#> 3 2 2 NA
#> 4 3 1 3
#> 5 3 1 1
#> 6 NA NA 3
7. propagate_parent() — Push parent value to children
When a parent question (e.g., “Does your organization face government regulations?”) is 0 (“no”) or 97 (“N/A”), the follow-up questions should inherit that value instead of being left as NA.
Before
before <- survey %>%
mutate(across(
c(Regulations_Federal, Regulations_State),
~ case_when(
Regulations == 0 ~ 0,
Regulations == 97 ~ 97,
TRUE ~ .
)
))
before %>% select(starts_with("Regulations"))
#> Regulations Regulations_Federal Regulations_State
#> 1 1 1 0
#> 2 0 0 0
#> 3 97 97 97
#> 4 1 0 1
#> 5 0 0 0
#> 6 1 1 1
After
after <- survey %>%
propagate_parent(c("Regulations_Federal", "Regulations_State"),
parent_var = "Regulations")
after %>% select(starts_with("Regulations"))
#> Regulations Regulations_Federal Regulations_State
#> 1 1 1 0
#> 2 0 0 0
#> 3 97 97 97
#> 4 1 0 1
#> 5 0 0 0
#> 6 1 1 1
Customizing which parent values propagate
By default, values = c(0, 97) (both “no” and “not
applicable” propagate). You can change this:
# Only propagate 0 (not 97)
survey %>%
propagate_parent(c("Regulations_Federal", "Regulations_State"),
parent_var = "Regulations",
values = 0) %>%
select(starts_with("Regulations"))
#> Regulations Regulations_Federal Regulations_State
#> 1 1 1 0
#> 2 0 0 0
#> 3 97 NA NA
#> 4 1 0 1
#> 5 0 0 0
#> 6 1 1 1
8. apply_filter() — Set variables to NA for ineligible respondents
The problem: Some survey questions only apply to certain organizations. For example, questions about staff vacancies and benefits only make sense for organizations that have paid staff. For orgs without staff, these values should be set to NA so they don’t affect analysis.
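Written by hand, this step might look like the base R sketch below. It is an illustration under stated assumptions, not the extraction script's code: `have_staff` and `staff_vacancies` are hypothetical stand-alone copies of the Setup columns, and a missing HaveStaff is treated as ineligible, which may differ from how apply_filter() handles it.

```r
# Hypothetical stand-alone copies of the Setup columns
have_staff      <- c(1, 1, 0, 1, 0, 1)
staff_vacancies <- c(3, 1, 2, 0, 4, 1)

# Keep the value where the eligibility condition holds; otherwise set it to NA.
# %in% treats a missing have_staff as "not eligible" (an assumption of this sketch).
eligible <- have_staff %in% 1
staff_vacancies_clean <- ifelse(eligible, staff_vacancies, NA)

print(staff_vacancies_clean)
```

Note that a true zero (row 4) is preserved, because the organization has staff and reported no vacancies.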
After
after <- survey %>%
apply_filter(c("StaffVacancies", "BenefitsImpact"),
condition = HaveStaff == 1)
after %>% select(HaveStaff, StaffVacancies, BenefitsImpact)
#> HaveStaff StaffVacancies BenefitsImpact
#> 1 1 3 2
#> 2 1 1 3
#> 3 0 NA NA
#> 4 1 0 4
#> 5 0 NA NA
#> 6 1 1 3
Tip: The condition argument accepts any
expression you could use in dplyr::filter(). For example,
condition = HaveStaff == 1 & year == "2024" would only
keep values for orgs with staff in 2024.
9. label_binary() — Convert 0/1 to descriptive labels
The problem: After recoding to 0/1, you often need
to convert columns to descriptive strings for reporting. The NTIS
extraction script has 25+ blocks of identical case_when()
code like this, one per variable. label_binary() replaces
all of them in a single call.
Before
before <- survey %>%
mutate(
FinanceChng_Reserves = case_when(
FinanceChng_Reserves == 1 ~ "Drew on cash reserves",
FinanceChng_Reserves %in% c(0, 97) ~ "Did not draw on cash reserves",
.default = NA_character_
),
PrgSrvc_Suspend = case_when(
PrgSrvc_Suspend == 1 ~ "Paused or suspended services",
PrgSrvc_Suspend %in% c(0, 97) ~ "Did not pause or suspend services",
.default = NA_character_
)
)
before %>% select(FinanceChng_Reserves, PrgSrvc_Suspend)
#> FinanceChng_Reserves PrgSrvc_Suspend
#> 1 Drew on cash reserves Paused or suspended services
#> 2 Did not draw on cash reserves Did not pause or suspend services
#> 3 Did not draw on cash reserves Did not pause or suspend services
#> 4 Drew on cash reserves Paused or suspended services
#> 5 Did not draw on cash reserves Did not pause or suspend services
#> 6 Drew on cash reserves Did not pause or suspend services
After
after <- survey %>%
label_binary(
labels = list(
FinanceChng_Reserves = c("Drew on cash reserves",
"Did not draw on cash reserves"),
PrgSrvc_Suspend = c("Paused or suspended services",
"Did not pause or suspend services")
),
na_values = 97
)
after %>% select(FinanceChng_Reserves, PrgSrvc_Suspend)
#> FinanceChng_Reserves PrgSrvc_Suspend
#> 1 Drew on cash reserves Paused or suspended services
#> 2 Did not draw on cash reserves Did not pause or suspend services
#> 3 Did not draw on cash reserves Did not pause or suspend services
#> 4 Drew on cash reserves Paused or suspended services
#> 5 Did not draw on cash reserves Did not pause or suspend services
#> 6 Drew on cash reserves Did not pause or suspend services
How labels works
The labels argument is a named list. Each name is a
column in your data, and each value is a length-2 character vector:
c("label when 1", "label when 0").
Option: na_values for sentinel-as-false
Some NTIS variables treat 97 (“not applicable”) the same as 0 (“no”).
For example, if an organization answered “not applicable” to whether
they drew on cash reserves, that effectively means they did not. Set
na_values = 97 to map 97 to the false (second) label
instead of to NA.
# Without na_values: 97 becomes NA
survey %>%
label_binary(
labels = list(FinanceChng_Reserves = c("Drew", "Did not draw"))
) %>%
select(FinanceChng_Reserves)
#> FinanceChng_Reserves
#> 1 Drew
#> 2 Did not draw
#> 3 <NA>
#> 4 Drew
#> 5 Did not draw
#> 6 Drew
# With na_values = 97: 97 maps to the false label
survey %>%
label_binary(
labels = list(FinanceChng_Reserves = c("Drew", "Did not draw")),
na_values = 97
) %>%
select(FinanceChng_Reserves)
#> FinanceChng_Reserves
#> 1 Drew
#> 2 Did not draw
#> 3 Did not draw
#> 4 Drew
#> 5 Did not draw
#> 6 Drew
Labeling many columns at once
In the extraction script, 16 ProgDem_ columns are each
labeled with a “Serving…” / “Not serving…” pair. You can handle them all
in one call:
# First recode to 0/1 (2 = secondary yes), then label
after <- survey %>%
recode_binary(c("ProgDem_Children", "ProgDem_Elders")) %>%
label_binary(labels = list(
ProgDem_Children = c("Serving children and youth",
"Not serving children and youth"),
ProgDem_Elders = c("Serving seniors",
"Not serving seniors")
))
after %>% select(starts_with("ProgDem"))
#> ProgDem_Children ProgDem_Elders
#> 1 Not serving children and youth Serving seniors
#> 2 Serving children and youth Not serving seniors
#> 3 Serving children and youth <NA>
#> 4 <NA> Serving seniors
#> 5 Not serving children and youth Not serving seniors
#> 6 Serving children and youth Serving seniors
Option: custom true_values and false_values
For variables that don’t use the standard 0/1 coding:
# A column where 1-2 map to the first label and 3-9 map to the second
age_example <- data.frame(Dem_BChair_Age = c(1, 2, 3, 5, 97, NA))
label_binary(age_example,
labels = list(Dem_BChair_Age = c("Under 35", "35 or older")),
true_values = c(1, 2),
false_values = c(3, 4, 5, 6, 7, 8, 9),
na_values = 97
) %>%
pull(Dem_BChair_Age)
#> [1] "Under 35" "Under 35" "35 or older" "35 or older" "35 or older"
#> [6] NA
10. label_likert() — Label Likert scales with descriptive strings
The problem: After (or instead of) collapsing Likert
values to integers, you need descriptive string labels for reporting.
The extraction script has 8+ across() blocks doing this for
different column groups, each with slightly different labels.
label_likert() uses the same mapping vector
approach as collapse_likert(), then looks up the
corresponding label. Values in na_values get
na_label (default "Unsure") instead of
NA.
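The lookup described above can be sketched in base R as three small steps. This is an illustration of the idea only, not the package source; the variable names simply mirror the argument names used in this vignette, and `x` is a hypothetical stand-alone copy of FinanceChng_Benefits from the Setup data.

```r
# Illustration only: a base R sketch of the mapping-then-label lookup
x         <- c(1, 2, 3, 4, 5, 97)
mapping   <- c(1, 1, 2, 3, 3)   # position = original value, entry = collapsed category
labels    <- c("Decrease", "No change", "Increase")
na_values <- 97
na_label  <- "Unsure"

# Step 1: collapse 1-5 to 1-3; values outside 1..5 index past the end and become NA
collapsed <- mapping[x]
# Step 2: look up the label for each collapsed category (NA stays NA)
labelled <- labels[collapsed]
# Step 3: sentinel values receive na_label instead of NA
labelled[x %in% na_values] <- na_label

print(labelled)
```

Stopping after step 1 and setting the sentinel to NA gives the integer result that collapse_likert() is described as producing.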
Before
before <- survey %>%
mutate(
FinanceChng_Benefits = case_when(
FinanceChng_Benefits %in% c(1, 2) ~ "Decrease",
FinanceChng_Benefits == 3 ~ "No change",
FinanceChng_Benefits %in% c(4, 5) ~ "Increase",
FinanceChng_Benefits == 97 ~ "Unsure",
.default = NA_character_
)
)
before %>% select(FinanceChng_Benefits)
#> FinanceChng_Benefits
#> 1 Decrease
#> 2 Decrease
#> 3 No change
#> 4 Increase
#> 5 Increase
#> 6 Unsure
After
after <- survey %>%
label_likert("FinanceChng_Benefits",
labels = c("Decrease", "No change", "Increase"))
after %>% select(FinanceChng_Benefits)
#> FinanceChng_Benefits
#> 1 Decrease
#> 2 Decrease
#> 3 No change
#> 4 Increase
#> 5 Increase
#> 6 Unsure
Applying different labels to different column groups
The extraction script uses label_likert() for several
groups of columns, each with different label text. You can chain
multiple calls:
after <- survey %>%
label_likert("FinanceChng_Benefits",
labels = c("Decrease", "No change", "Increase")) %>%
label_likert("FinanceChng_Salaries",
labels = c("Decrease in salaries", "No change",
"Increase in salaries")) %>%
label_likert("FinanceChng_TotExp",
labels = c("Decrease in expenses", "No change",
"Increase in expenses"))
after %>% select(starts_with("FinanceChng"))
#> FinanceChng_Benefits FinanceChng_Salaries FinanceChng_TotExp
#> 1 Decrease Increase in salaries Decrease in expenses
#> 2 Decrease Increase in salaries No change
#> 3 No change No change Unsure
#> 4 Increase Decrease in salaries Increase in expenses
#> 5 Increase Decrease in salaries Decrease in expenses
#> 6 Unsure Unsure Increase in expenses
#> FinanceChng_Reserves
#> 1 1
#> 2 0
#> 3 97
#> 4 1
#> 5 0
#> 6 1
Option: custom na_label
The default na_label is "Unsure", but you
can change it:
survey %>%
label_likert("FinanceChng_Benefits",
labels = c("Decrease", "No change", "Increase"),
na_label = "Not applicable") %>%
select(FinanceChng_Benefits)
#> FinanceChng_Benefits
#> 1 Decrease
#> 2 Decrease
#> 3 No change
#> 4 Increase
#> 5 Increase
#> 6 Not applicable
Option: custom mapping
The default mapping = c(1, 1, 2, 3, 3) collapses a
5-point scale into 3 categories. You can provide a different mapping if
your scale is different:
# 3-point scale: keep each value as its own category
three_point <- data.frame(satisfaction = c(1, 2, 3, 97))
label_likert(three_point, "satisfaction",
labels = c("Dissatisfied", "Neutral", "Satisfied"),
mapping = c(1L, 2L, 3L)) %>%
pull(satisfaction)
#> [1] "Dissatisfied" "Neutral" "Satisfied" "Unsure"
Putting it all together
In a real NTIS cleaning script, you chain multiple ntistools calls into a single pipeline. Here’s a condensed example showing how the functions work together:
cleaned <- survey %>%
# Step 1: Combine related binary indicators into summary columns
combine_binary(GeoAreas_Locally,
GeoAreas_Local, GeoAreas_MultipleLocal,
GeoAreas_RegionalWithin) %>%
combine_binary(GeoAreas_Multistate,
GeoAreas_MultipleState, GeoAreas_RegionalAcross,
strict_na = TRUE) %>%
# Step 2: Create any/count/all summaries for disruption indicators
count_binary("LLDisruption", LLLost, LLDelay, LLStop) %>%
# Step 3: Recode multi-level demographics to simple 0/1
recode_binary(c("ProgDem_Children", "ProgDem_Elders")) %>%
# Step 4: Remove sentinel values
recode_sentinel("DonImportance") %>%
# Step 5: Fill in valid-skip NAs using flag columns
impute_from_flag("PplSrv_NumWait") %>%
impute_from_flag(
vars = c("Staff_RegVlntr_2023", "Staff_RegVlntr_2024"),
flag_map = c(Staff_RegVlntr_2023 = "Staff_RegVlntr_NA",
Staff_RegVlntr_2024 = "Staff_RegVlntr_NA")
) %>%
# Step 6: Propagate parent answers to children
propagate_parent(c("Regulations_Federal", "Regulations_State"),
parent_var = "Regulations") %>%
# Step 7: Set variables to NA for ineligible respondents
apply_filter(c("StaffVacancies", "BenefitsImpact"),
condition = HaveStaff == 1) %>%
# Step 8: Label binary columns for reporting
label_binary(
labels = list(
GeoAreas_Locally = c("Serving locally", "Not serving locally"),
FinanceChng_Reserves = c("Drew on reserves", "Did not draw"),
PrgSrvc_Suspend = c("Paused services", "Did not pause")
),
na_values = 97
) %>%
label_binary(
labels = list(
ProgDem_Children = c("Serving children", "Not serving children"),
ProgDem_Elders = c("Serving seniors", "Not serving seniors")
)
) %>%
# Step 9: Label Likert scales for reporting
label_likert(c("FinanceChng_Benefits", "FinanceChng_TotExp"),
labels = c("Decrease", "No change", "Increase")) %>%
label_likert("FinanceChng_Salaries",
labels = c("Decrease in wages", "No change",
"Increase in wages"))
glimpse(cleaned)
#> Rows: 6
#> Columns: 32
#> $ GeoAreas_Local <dbl> 1, 0, 0, NA, 0, 1
#> $ GeoAreas_MultipleLocal <dbl> 0, 0, 1, NA, 0, 0
#> $ GeoAreas_RegionalWithin <dbl> 0, 0, 0, NA, 0, 0
#> $ GeoAreas_MultipleState <dbl> 0, NA, 1, NA, 0, 0
#> $ GeoAreas_RegionalAcross <dbl> 0, NA, 0, NA, 0, 1
#> $ LLLost <dbl> 1, 0, 1, NA, 0, 1
#> $ LLDelay <dbl> 1, 1, 0, NA, 0, 1
#> $ LLStop <dbl> 0, 0, 0, NA, 0, 1
#> $ ProgDem_Children <chr> "Not serving children", "Serving children", "S…
#> $ ProgDem_Elders <chr> "Serving seniors", "Not serving seniors", NA, …
#> $ DonImportance <dbl> 3, 1, NA, 2, NA, 4
#> $ PplSrv_NumWait <dbl> 50, 0, 10, NA, 30, 0
#> $ PplSrv_NumWait_NA_X <dbl> 0, 1, 0, 0, 0, 1
#> $ Staff_RegVlntr_2023 <dbl> 0, 5, NA, 10, 0, 3
#> $ Staff_RegVlntr_2024 <dbl> 2, NA, NA, 7, 0, 4
#> $ Staff_RegVlntr_NA <dbl> 1, 0, 0, 0, 1, 0
#> $ FinanceChng_Benefits <chr> "Decrease", "Decrease", "No change", "Increase…
#> $ FinanceChng_Salaries <chr> "Increase in wages", "Increase in wages", "No …
#> $ FinanceChng_TotExp <chr> "Decrease", "No change", "Unsure", "Increase",…
#> $ FinanceChng_Reserves <chr> "Drew on reserves", "Did not draw", "Did not d…
#> $ PrgSrvc_Suspend <chr> "Paused services", "Did not pause", "Did not p…
#> $ Regulations <dbl> 1, 0, 97, 1, 0, 1
#> $ Regulations_Federal <dbl> 1, 0, 97, 0, 0, 1
#> $ Regulations_State <dbl> 0, 0, 97, 1, 0, 1
#> $ HaveStaff <dbl> 1, 1, 0, 1, 0, 1
#> $ StaffVacancies <dbl> 3, 1, NA, 0, NA, 1
#> $ BenefitsImpact <dbl> 2, 3, NA, 4, NA, 3
#> $ GeoAreas_Locally <chr> "Serving locally", "Not serving locally", "Ser…
#> $ GeoAreas_Multistate <int> 0, NA, 1, NA, 0, 1
#> $ LLDisruption_any <int> 1, 1, 1, NA, 0, 1
#> $ LLDisruption_count <int> 2, 1, 1, NA, 0, 3
#> $ LLDisruption_all <int> 0, 0, 0, NA, 0, 1
What used to be hundreds of lines of repetitive
case_when() calls becomes a single, readable pipeline. Each
step has a clear name that tells you what it does, and the options are
explicit rather than buried in copy-pasted code.
Quick reference
| Function | Input | Output | Key options |
|---|---|---|---|
| combine_binary() | 0/1 columns | Single 0/1 column | strict_na |
| count_binary() | 0/1 columns | _any, _count, _all columns | (none) |
| recode_binary() | Multi-level numeric | 0/1 integer | ones, zeros, na_values |
| recode_sentinel() | Numeric with sentinels | Numeric with NA | values |
| impute_from_flag() | Numeric + flag column | Numeric (NAs filled) | flag_suffix, flag_map, impute_value |
| collapse_likert() | 5-point Likert | 3-category integer | mapping, na_values |
| propagate_parent() | Parent + child columns | Children updated | values |
| apply_filter() | Any columns + condition | Columns with NAs added | condition |
| label_binary() | 0/1 columns | Character columns | true_values, false_values, na_values |
| label_likert() | Likert numeric | Character columns | mapping, labels, na_values, na_label |