4 Transforms Reference
The six pure column-transform functions in R/transforms/. Each:
- Takes a data.table (and column names or a list of column names for the multi-column transforms).
- Mutates the data.table in place via := or data.table::set, and returns it invisibly.
- Takes an optional log4r logger; logs warnings on per-row failures, and logs nothing if the logger is NULL.
- Is unit-tested in tests/test_transforms.R.
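The shared contract can be sketched as follows. This is a minimal illustration, not code from R/transforms/: transform_example and the value / value_int columns are hypothetical names, and the logging call assumes log4r's warn(logger, message) function.

```r
library(data.table)

# Hypothetical transform following the conventions above:
# mutate by reference, warn through an optional logger, return invisibly.
transform_example <- function(dt, logger = NULL) {
  stopifnot(is.data.table(dt))
  parsed <- suppressWarnings(as.integer(dt[["value"]]))
  # Rows that had a value but could not be parsed
  n_bad <- sum(is.na(parsed) & !is.na(dt[["value"]]))
  dt[, value_int := parsed]  # := adds the column in place, without copying dt
  if (n_bad > 0L && !is.null(logger)) {
    log4r::warn(logger, sprintf("transform_example: %d unparseable row(s)", n_bad))
  }
  invisible(dt)
}

dt <- data.table(value = c("1", "x", "3"))
transform_example(dt)  # no assignment needed; dt itself gains value_int
```

Because the data.table is modified by reference, callers do not need to reassign the result; the invisible return only exists so calls can be chained.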
4.1 R/transforms/tax_period.R
transform_tax_period(dt, logger = NULL, tax_year_bounds = tax_year_range())
Parses the 6-char YYYYMM source tax_period field into integer tax_year and tax_month derived columns. Keeps tax_period as character (preserves leading zeros).
Validates: 6 digits, tax_year in [1989, current_year+1], tax_month in [1, 12]. Invalid rows → NA on the derived cols (the source tax_period is preserved as-is), counted in the log warning.
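The validation rules above can be sketched on a plain character vector. This is an illustrative sketch only: parse_tax_period and its bounds argument are hypothetical, and the real tax_year_range() helper may return its bounds in a different shape. Here NA inputs are counted as invalid, which is also an assumption.

```r
# Hypothetical helper: parse YYYYMM strings into integer year and month.
# bounds is assumed to be c(min_year, max_year); the default mirrors
# the [1989, current_year + 1] rule described above.
parse_tax_period <- function(
    x,
    bounds = c(1989L, as.integer(format(Sys.Date(), "%Y")) + 1L)) {
  ok <- !is.na(x) & grepl("^[0-9]{6}$", x)   # exactly six digits
  year <- rep(NA_integer_, length(x))
  month <- rep(NA_integer_, length(x))
  year[ok] <- as.integer(substr(x[ok], 1, 4))
  month[ok] <- as.integer(substr(x[ok], 5, 6))
  valid <- ok & year >= bounds[1] & year <= bounds[2] &
    month >= 1L & month <= 12L
  year[!valid] <- NA_integer_                # invalid rows get NA on both
  month[!valid] <- NA_integer_
  list(tax_year = year, tax_month = month, n_invalid = sum(!valid))
}

res <- parse_tax_period(c("202112", "202113", "20211", "198812", NA))
# Only the first element passes all three checks.
```

The source strings are never modified, which matches the rule that tax_period is preserved as-is even when the derived columns are NA.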
4.2 R/transforms/ein.R
transform_ein(dt, logger = NULL)
Normalizes EINs to the canonical IRS display format XX-XXXXXXX (10 chars, hyphen between digits 2 and 3). Strips any embedded hyphens and whitespace in the source, zero-pads to 9 digits, then inserts the hyphen.
The hyphen forces character typing on CSV re-read (so fread can’t auto-detect as integer and strip leading zeros). Invalid sources (non-numeric, empty) → NA.
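A minimal sketch of the normalization steps described above, written against a character vector rather than a data.table. normalize_ein is a hypothetical name, and treating inputs longer than nine digits as invalid is an assumption not stated in this section.

```r
# Hypothetical helper: normalize EINs to XX-XXXXXXX.
normalize_ein <- function(x) {
  digits <- gsub("[-[:space:]]", "", x)               # strip hyphens and whitespace
  ok <- !is.na(digits) & grepl("^[0-9]{1,9}$", digits)
  out <- rep(NA_character_, length(x))                # invalid sources stay NA
  d <- digits[ok]
  padded <- paste0(strrep("0", 9L - nchar(d)), d)     # zero-pad to 9 digits
  out[ok] <- paste0(substr(padded, 1, 2), "-", substr(padded, 3, 9))
  out
}

eins <- normalize_ein(c("123456789", "12-3456789", " 1234567 ", "abc", "", NA))
```

In this sketch the third input, which has only seven digits, becomes "00-1234567", so the leading zeros survive a later CSV round trip as long as the column is read as character.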
4.3 R/transforms/subsection.R
transform_subsection_cd(dt, logger = NULL)
Coerces subsection_cd to integer, validates against KNOWN_SUBSECTION_CODES (loaded from data/lookups/subsection_codes.csv). Unknown codes → NA on subsection_cd.
Also derives the is_501c3 boolean column: TRUE iff subsection_cd == 3, strict (NA subsection → is_501c3 = FALSE, never NA). Strictness avoids three-valued logic for downstream filters.
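The strict two-valued behaviour of is_501c3 can be sketched as below. clean_subsection is a hypothetical name, and known_codes is a made-up stand-in for KNOWN_SUBSECTION_CODES, whose real contents come from data/lookups/subsection_codes.csv.

```r
# Hypothetical stand-in for KNOWN_SUBSECTION_CODES (illustrative values only).
known_codes <- c(2L, 3L, 4L, 5L, 6L, 7L)

# Hypothetical helper: coerce, validate, and derive the strict flag.
clean_subsection <- function(x, known = known_codes) {
  code <- suppressWarnings(as.integer(x))
  code[!(code %in% known)] <- NA_integer_        # unknown or unparseable -> NA
  # !is.na() guards the comparison so the flag is FALSE, never NA
  list(subsection_cd = code, is_501c3 = !is.na(code) & code == 3L)
}

sub <- clean_subsection(c("3", "03", "99", "x", NA))
```

With the guard in place, a downstream filter such as dt[is_501c3 == TRUE] or dt[!is_501c3] never has to reason about NA rows.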
4.4 R/transforms/financial_amounts.R
transform_financial_amounts(dt, cols, logger = NULL)
Coerces the columns named in cols to numeric. Per cell: strips $ and commas, converts parenthesized values (e.g., (500)) to negatives, and treats empty strings and "-" as NA.
Parse failures per column are counted and logged. Already-numeric columns are skipped (the harmonized columns coming from fread are usually already numeric; this transform handles the 2012 .dat case where everything reads as character).
System-boundary check: stops with an error if any column name in cols isn’t present in dt.
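The per-cell rules and the boundary check can be sketched like this. parse_amount and check_cols are hypothetical helper names, not functions from R/transforms/financial_amounts.R, and the error message wording is invented for illustration.

```r
# Hypothetical per-cell parser for character amounts.
parse_amount <- function(x) {
  s <- trimws(x)
  s[!is.na(s) & (s == "" | s == "-")] <- NA_character_   # empty and "-" -> NA
  neg <- !is.na(s) & grepl("^\\(.*\\)$", s)               # parens mean negative
  s <- gsub("[$,()]", "", s)                              # strip $, commas, parens
  out <- suppressWarnings(as.numeric(s))                  # failures become NA
  out[neg] <- -out[neg]
  out
}

# Hypothetical boundary check: fail fast on unknown column names.
check_cols <- function(dt_names, cols) {
  missing_cols <- setdiff(cols, dt_names)
  if (length(missing_cols) > 0L) {
    stop("Columns not found: ", paste(missing_cols, collapse = ", "))
  }
  invisible(TRUE)
}

amounts <- parse_amount(c("$1,234.50", "(500)", "-", "", " 42 ", "abc"))
```

Checking the column names before any cell is touched means a typo in cols fails the whole call up front instead of leaving the data.table half-converted.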
4.5 R/transforms/indicators.R
transform_indicators(dt, cols, logger = NULL)
Coerces the columns named in cols to logical. Per cell:
- {Y, y, 1, T, TRUE, true, True} → TRUE
- {N, n, 0, 2, F, FALSE, false, False} → FALSE
- Anything else (including empty, NA) → NA
"2" is in the FALSE set because the IRS shifted some binary _cd columns to a 1 = yes, 2 = no encoding in recent vintages (e.g. hospital_audited_attached_cd in py2022, qualified_health_plan_multi_state_cd in py2023). The widening is safe because real integer-code columns in the SOI extract are renamed off the _cd suffix at the crosswalk stage (e.g. prgmservcode2acd → program_revenue_code_2a) and therefore bypass this transform. See docs/11-upstream-source-quirks.qmd → “1/2 binary encoding on selected indicator columns (py2022+)” for the trigger that would require revisiting this rule.
Used by R/03_harmonize.R to sweep all harmonized columns ending in _cd (the IRS convention for yes/no indicators), except for subsection_cd which is an integer-coded categorical, not a binary indicator.
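The set-membership mapping described in this section can be sketched as below. parse_indicator, TRUE_SET, and FALSE_SET are hypothetical names; only the set contents are taken from the text above.

```r
# Accepted spellings, copied from the sets listed above.
TRUE_SET  <- c("Y", "y", "1", "T", "TRUE", "true", "True")
FALSE_SET <- c("N", "n", "0", "2", "F", "FALSE", "false", "False")

# Hypothetical per-cell mapper: anything outside both sets stays NA.
parse_indicator <- function(x) {
  s <- as.character(x)                 # also handles integer-typed 1/2 columns
  out <- rep(NA, length(s))            # logical NA by default
  out[s %in% TRUE_SET] <- TRUE
  out[s %in% FALSE_SET] <- FALSE
  out
}

flags <- parse_indicator(c("Y", "2", "", NA, "maybe", "1", "0"))
```

Exact matching against explicit sets, rather than a case-insensitive comparison, keeps unexpected spellings visible as NA instead of silently coercing them.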
4.6 R/transforms/efile_indicator.R
transform_efile_indicator(dt, logger = NULL)
The IRS e-file indicator uses inconsistent values across vintages. The harmonized source-column name also drifts: 990 + 990-EZ use elf in py2015–py2019, e-file (hyphenated) in py2020 of 990-EZ only, and efile from py2020/py2021 onward. 990-PF uses ELF everywhere except py2016, which uses ELFCD. All variants are mapped to efile_indicator in the crosswalk OVERRIDES.
Value encoding by vintage:
- py2015 990 + 990-EZ: E (electronic) / P (paper)
- py2016–py2017 990 + 990-EZ: Y / N
- py2018+: E / P again
The transform accepts all observed encodings plus boolean-like spellings:
- TRUE set: {E, e, Y, y, 1, T, TRUE, true, True}
- FALSE set: {P, p, N, n, 0, F, FALSE, false, False}
- Anything else → NA
Wired into apply_transforms() by column name (alongside ein, tax_period, subsection_cd) so it bypasses the generic _cd indicator sweep — efile_indicator doesn’t carry the _cd suffix in our harmonized naming.
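One way to collapse the per-vintage encodings into a single logical column is a named lookup vector. This is a sketch under assumptions: EFILE_LOOKUP and parse_efile are hypothetical names, and the actual transform may be implemented differently.

```r
# Hypothetical lookup built from the TRUE and FALSE sets listed above.
EFILE_LOOKUP <- c(
  setNames(rep(TRUE, 9),  c("E", "e", "Y", "y", "1", "T", "TRUE", "true", "True")),
  setNames(rep(FALSE, 9), c("P", "p", "N", "n", "0", "F", "FALSE", "false", "False"))
)

# Indexing by name returns NA for any value that is not a known key.
parse_efile <- function(x) unname(EFILE_LOOKUP[as.character(x)])

efile <- parse_efile(c("E", "P", "Y", "N", "X", NA))
```

Both the E/P vintages and the Y/N vintages land on the same TRUE/FALSE values, so downstream code does not need to know which plan year a row came from.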
4.7 Dispatch in R/03_harmonize.R
After the crosswalk renames source vars to harmonized names, apply_transforms() dispatches in this order:
1. transform_tax_period(dt), by name
2. transform_ein(dt), by name
3. transform_subsection_cd(dt), by name; derives is_501c3
4. transform_efile_indicator(dt), by name
5. transform_indicators(dt, cols = grep("_cd$", names(dt), value = TRUE)) minus subsection_cd, by suffix
6. transform_financial_amounts(dt, cols = everything_else), everything not consumed above
Order matters only for steps 5 and 6 (the suffix sweep and the catch-all must exclude the columns already handled by name in steps 1–4) and for step 3 (which produces is_501c3, used downstream by quality checks).
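The way the three column groups partition a table can be sketched with plain set operations. The column names below are hypothetical examples, and the sketch computes the groups from the names as they stand before any transform runs; whether apply_transforms() does the same is not stated here.

```r
# Hypothetical harmonized column names, for illustration only.
cols <- c("ein", "tax_period", "subsection_cd", "efile_indicator",
          "hospital_audited_attached_cd", "total_revenue", "total_assets")

# Steps 1-4: columns dispatched by exact name.
by_name <- c("ein", "tax_period", "subsection_cd", "efile_indicator")

# Step 5: suffix sweep; value = TRUE returns names rather than positions.
indicator <- setdiff(grep("_cd$", cols, value = TRUE), "subsection_cd")

# Step 6: everything not consumed by the groups above.
financial <- setdiff(cols, c(by_name, indicator))
```

Taking the name snapshot before steps 1 to 4 run keeps derived columns such as tax_year, tax_month, and is_501c3 out of the financial catch-all; if the names were read afterwards, those columns would need to be excluded explicitly.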
4.8 Testing
Rscript tests/test_transforms.R runs all 32 unit tests against synthetic data.tables (no real CSV reads). Targets: each transform’s happy path, NA propagation, edge cases (empty, whitespace, malformed values), and boundary errors for multi-column transforms.