4 Transforms Reference

The six column-transform functions in R/transforms/ share one contract. Each:

Takes a data.table (and column names or list of column names for the multi-column transforms).
Mutates the data.table in place via := or data.table::set, returns it invisibly.
Takes an optional log4r logger; logs warnings on per-row failures, no logging if NULL.
Is unit-tested in tests/test_transforms.R.
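A minimal sketch of that shared contract, for orientation only. `transform_trim_sketch` is a hypothetical function, not one of the six; it only shows in-place mutation via `data.table::set`, the optional logger, and the invisible return. It assumes log4r's `warn()` is the logging call used.

```r
library(data.table)

# Hypothetical transform illustrating the shared contract (not part of R/transforms/).
transform_trim_sketch <- function(dt, col, logger = NULL) {
  stopifnot(is.data.table(dt), col %in% names(dt))
  trimmed <- trimws(as.character(dt[[col]]))
  blank <- !is.na(trimmed) & trimmed == ""
  trimmed[blank] <- NA_character_
  # set() replaces the column by reference; dt itself is not copied.
  set(dt, j = col, value = trimmed)
  # Log only when a logger was supplied; NULL means silent.
  if (any(blank) && !is.null(logger)) {
    log4r::warn(logger, sprintf("%s: %d blank value(s) set to NA", col, sum(blank)))
  }
  invisible(dt)
}

dt <- data.table(name = c(" a ", "", "b"))
transform_trim_sketch(dt, "name")  # nothing printed; dt is modified in place
```

Because the return is invisible and the mutation is by reference, callers do not need to reassign the result.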

4.1 `R/transforms/tax_period.R`

transform_tax_period(dt, logger = NULL, tax_year_bounds = tax_year_range())

Parses the 6-char YYYYMM source tax_period field into integer tax_year and tax_month derived columns. Keeps tax_period as character (preserves leading zeros).

Validates: 6 digits, tax_year in [1989, current_year+1], tax_month in [1, 12]. Invalid rows → NA on the derived cols (the source tax_period is preserved as-is), counted in the log warning.
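The validation rules above can be sketched as follows. This is an illustration under assumptions, not the code in tax_period.R: the function name carries a `_sketch` suffix, the default bounds are computed inline rather than by `tax_year_range()`, and every invalid row (including NA sources) is counted.

```r
library(data.table)

transform_tax_period_sketch <- function(
    dt, logger = NULL,
    tax_year_bounds = c(1989L, as.integer(format(Sys.Date(), "%Y")) + 1L)) {
  tp <- as.character(dt[["tax_period"]])
  well_formed <- !is.na(tp) & grepl("^[0-9]{6}$", tp)

  yr <- rep(NA_integer_, length(tp))
  mo <- rep(NA_integer_, length(tp))
  yr[well_formed] <- as.integer(substr(tp[well_formed], 1L, 4L))
  mo[well_formed] <- as.integer(substr(tp[well_formed], 5L, 6L))

  # FALSE & NA is FALSE, so malformed rows are simply invalid.
  valid <- well_formed &
    yr >= tax_year_bounds[1] & yr <= tax_year_bounds[2] &
    mo >= 1L & mo <= 12L
  yr[!valid] <- NA_integer_
  mo[!valid] <- NA_integer_

  # Derived columns are added by reference; tax_period itself is untouched.
  set(dt, j = "tax_year", value = yr)
  set(dt, j = "tax_month", value = mo)

  if (any(!valid) && !is.null(logger)) {
    log4r::warn(logger, sprintf("tax_period: %d invalid row(s)", sum(!valid)))
  }
  invisible(dt)
}

dt <- data.table(tax_period = c("201912", "201913", "2019", "198812", NA))
transform_tax_period_sketch(dt, tax_year_bounds = c(1989L, 2026L))
```

Only the first row survives: month 13, a 4-character value, a pre-1989 year, and NA all yield NA in both derived columns.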

4.2 `R/transforms/ein.R`

transform_ein(dt, logger = NULL)

Normalizes EINs to the canonical IRS display format XX-XXXXXXX (10 chars, hyphen between digits 2 and 3). Strips any embedded hyphens and whitespace in the source, zero-pads to 9 digits, then inserts the hyphen.

The hyphen forces character typing on CSV re-read (so fread can’t auto-detect as integer and strip leading zeros). Invalid sources (non-numeric, empty) → NA.
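A sketch of the normalization steps on a plain character vector. `normalize_ein_sketch` is a hypothetical helper name, and treating any 1 to 9 digit string as valid is an assumption about what "non-numeric" excludes.

```r
normalize_ein_sketch <- function(x) {
  # Strip embedded hyphens and whitespace.
  cleaned <- gsub("[-[:space:]]", "", as.character(x))
  ok <- !is.na(cleaned) & grepl("^[0-9]{1,9}$", cleaned)

  out <- rep(NA_character_, length(cleaned))
  digits <- cleaned[ok]
  # Left-pad with zeros to 9 digits, then insert the hyphen after digit 2.
  padded <- paste0(strrep("0", 9L - nchar(digits)), digits)
  out[ok] <- paste0(substr(padded, 1L, 2L), "-", substr(padded, 3L, 9L))
  out
}

eins <- normalize_ein_sketch(c("12-3456789", " 1234567 ", "123456789", "abc", "", NA))
```

The second input shows the zero-padding case: seven digits become `00-1234567`, which a CSV reader cannot mistake for an integer.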

4.3 `R/transforms/subsection.R`

transform_subsection_cd(dt, logger = NULL)

Coerces subsection_cd to integer, validates against KNOWN_SUBSECTION_CODES (loaded from data/lookups/subsection_codes.csv). Unknown codes → NA on subsection_cd.

Also derives the is_501c3 boolean column: TRUE iff subsection_cd == 3, strict (NA subsection → is_501c3 = FALSE, never NA). Strictness avoids three-valued logic for downstream filters.
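A sketch of the coercion and the strict is_501c3 derivation. `known_codes_sketch` is a made-up stand-in for KNOWN_SUBSECTION_CODES (the real set lives in data/lookups/subsection_codes.csv and is not reproduced here), and the function name is hypothetical.

```r
library(data.table)

# Assumption: illustrative subset only, not the contents of the lookup CSV.
known_codes_sketch <- c(3L, 4L, 6L)

transform_subsection_cd_sketch <- function(dt, known = known_codes_sketch) {
  code <- suppressWarnings(as.integer(dt[["subsection_cd"]]))
  # Unknown codes become NA; NA %in% known is FALSE, so NA stays NA.
  code[!(code %in% known)] <- NA_integer_
  set(dt, j = "subsection_cd", value = code)
  # Strict flag: !is.na() guards the comparison so the result is never NA.
  set(dt, j = "is_501c3", value = !is.na(code) & code == 3L)
  invisible(dt)
}

dt <- data.table(subsection_cd = c("03", "4", "99", "x", NA))
transform_subsection_cd_sketch(dt)
```

With a two-valued is_501c3, a downstream filter such as `dt[is_501c3 == TRUE]` and its complement `dt[is_501c3 == FALSE]` together cover every row.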

4.4 `R/transforms/financial_amounts.R`

transform_financial_amounts(dt, cols, logger = NULL)

Coerces the columns named in cols to numeric. Per cell: strips $ and commas, converts parens-wrapped values (e.g., (500)) to negatives, treats empty / "-" as NA.

Parse failures per column are counted and logged. Already-numeric columns are skipped (the harmonized columns coming from fread are usually already numeric; this transform handles the 2012 .dat case where everything reads as character).

System-boundary check: stops with an error if any column name in cols isn’t present in dt.
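The per-cell rules, the skip for already-numeric columns, and the boundary check can be sketched together. Function names are hypothetical, the error message wording is invented, and counting a parse failure as "non-placeholder text that still became NA" is an assumption.

```r
library(data.table)

# Per-cell cleanup, vectorised over a whole column.
parse_amount_sketch <- function(x) {
  s <- trimws(as.character(x))
  s <- gsub("[$,]", "", s)                 # strip $ and thousands commas
  neg <- grepl("^\\(.*\\)$", s)            # accounting negatives such as (500)
  s[neg] <- paste0("-", substr(s[neg], 2L, nchar(s[neg]) - 1L))
  s[!is.na(s) & s %in% c("", "-")] <- NA_character_
  suppressWarnings(as.numeric(s))          # unparseable text becomes NA
}

transform_financial_amounts_sketch <- function(dt, cols, logger = NULL) {
  # System-boundary check: fail fast on unknown column names.
  missing_cols <- setdiff(cols, names(dt))
  if (length(missing_cols) > 0L) {
    stop("Columns not in dt: ", paste(missing_cols, collapse = ", "))
  }
  for (col in cols) {
    raw <- dt[[col]]
    if (is.numeric(raw)) next              # already numeric: nothing to do
    parsed <- parse_amount_sketch(raw)
    txt <- trimws(as.character(raw))
    n_fail <- sum(is.na(parsed) & !is.na(txt) & !(txt %in% c("", "-")))
    set(dt, j = col, value = parsed)
    if (n_fail > 0L && !is.null(logger)) {
      log4r::warn(logger, sprintf("%s: %d parse failure(s)", col, n_fail))
    }
  }
  invisible(dt)
}

dt <- data.table(
  revenue = c("$1,234.50", "(500)", "-", "", "abc"),
  assets  = c(1, 2, 3, 4, 5)
)
transform_financial_amounts_sketch(dt, c("revenue", "assets"))
```

Here `revenue` becomes 1234.5, -500, NA, NA, NA (one counted failure, for "abc"), while `assets` is skipped untouched.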

4.5 `R/transforms/indicators.R`

transform_indicators(dt, cols, logger = NULL)

Coerces the columns named in cols to logical. Per cell:

{Y, y, 1, T, TRUE, true, True} → TRUE
{N, n, 0, 2, F, FALSE, false, False} → FALSE
Anything else (including empty, NA) → NA
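The mapping above, sketched as a vectorised helper. `parse_indicator_sketch` and the two set names are hypothetical; matching is exact and case-sensitive, and no whitespace trimming is applied, which is an assumption.

```r
indicator_true  <- c("Y", "y", "1", "T", "TRUE", "true", "True")
indicator_false <- c("N", "n", "0", "2", "F", "FALSE", "false", "False")

parse_indicator_sketch <- function(x) {
  s <- as.character(x)
  out <- rep(NA, length(s))          # logical NA by default
  out[s %in% indicator_true]  <- TRUE
  out[s %in% indicator_false] <- FALSE
  out
}

flags <- parse_indicator_sketch(c("Y", "2", "n", "", "maybe", NA, "1"))
```

Starting from an all-NA vector means anything outside the two sets, including empty strings and NA, falls through to NA without a separate branch.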

"2" is in the FALSE set because IRS shifted some binary _cd columns to a 1 = yes, 2 = no encoding in recent vintages (e.g. hospital_audited_attached_cd in py2022, qualified_health_plan_multi_state_cd in py2023). The widening is safe because real integer-code columns in the SOI extract are renamed off the _cd suffix at the crosswalk stage (e.g. prgmservcode2acd → program_revenue_code_2a) and therefore bypass this transform. See docs/11-upstream-source-quirks.qmd → “1/2 binary encoding on selected indicator columns (py2022+)” for the trigger that would require revisiting this rule.

Used by R/03_harmonize.R to sweep all harmonized columns ending in _cd (the IRS convention for yes/no indicators), except subsection_cd, which is an integer-coded categorical rather than a binary indicator.

4.6 `R/transforms/efile_indicator.R`

transform_efile_indicator(dt, logger = NULL)

The IRS e-file indicator uses inconsistent values across vintages. The harmonized source-column name also drifts: 990 and 990-EZ use elf in py2015–py2019; in py2020 the 990-EZ alone uses e-file (hyphenated); efile applies from py2020 (990) or py2021 (990-EZ) onward. 990-PF uses ELF everywhere except py2016, which uses ELFCD. All variants are mapped to efile_indicator in the crosswalk OVERRIDES.

Value encoding by vintage:

py2015 990 + 990-EZ: E (electronic) / P (paper)
py2016–py2017 990 + 990-EZ: Y / N
py2018+: E / P again

The transform accepts all observed encodings plus boolean-like spellings:

TRUE set: {E, e, Y, y, 1, T, TRUE, true, True}
FALSE set: {P, p, N, n, 0, F, FALSE, false, False}
Anything else → NA
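One way to express these sets is a named lookup vector, sketched below. The names `efile_lookup` and `transform_efile_indicator_sketch` are hypothetical, and the lookup-vector approach is an illustration rather than the method used in efile_indicator.R.

```r
library(data.table)

efile_true  <- c("E", "e", "Y", "y", "1", "T", "TRUE", "true", "True")
efile_false <- c("P", "p", "N", "n", "0", "F", "FALSE", "false", "False")

# Named logical vector: names are raw values, elements are the harmonized flag.
efile_lookup <- c(
  setNames(rep(TRUE,  length(efile_true)),  efile_true),
  setNames(rep(FALSE, length(efile_false)), efile_false)
)

transform_efile_indicator_sketch <- function(dt) {
  raw <- as.character(dt[["efile_indicator"]])
  # Indexing by an unknown or NA name yields NA, which covers "anything else".
  set(dt, j = "efile_indicator", value = unname(efile_lookup[raw]))
  invisible(dt)
}

dt <- data.table(efile_indicator = c("E", "P", "Y", "N", "2", NA))
transform_efile_indicator_sketch(dt)
```

The E/P and Y/N vintages land on the same logical values, and "2" maps to NA here because, unlike the generic indicator sweep, this FALSE set does not include it.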

Wired into apply_transforms() by column name (alongside ein, tax_period, subsection_cd) so it bypasses the generic _cd indicator sweep — efile_indicator doesn’t carry the _cd suffix in our harmonized naming.

4.7 Dispatch in `R/03_harmonize.R`

After the crosswalk renames source vars to harmonized names, apply_transforms() dispatches in this order:

1. transform_tax_period(dt): by name
2. transform_ein(dt): by name
3. transform_subsection_cd(dt): by name; derives is_501c3
4. transform_efile_indicator(dt): by name
5. transform_indicators(dt, cols = setdiff(grep("_cd$", names(dt), value = TRUE), "subsection_cd")): by suffix
6. transform_financial_amounts(dt, cols = everything_else): everything not consumed above

Order matters only for steps 5 and 6 (the suffix sweep and the catch-all must exclude the identity cols handled by 1–4) and for step 3 (which produces is_501c3, used downstream by quality checks).
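The column partitioning behind this dispatch can be sketched without the transforms themselves. `plan_dispatch_sketch` is a hypothetical helper, `total_revenue` is an invented column name, and the sketch assumes the plan is computed from the post-crosswalk names before any transform runs, so derived columns (tax_year, tax_month, is_501c3) are never swept into the financial set.

```r
plan_dispatch_sketch <- function(col_names) {
  by_name <- c("tax_period", "ein", "subsection_cd", "efile_indicator")
  # value = TRUE returns names rather than integer positions.
  indicator_cols <- setdiff(grep("_cd$", col_names, value = TRUE), "subsection_cd")
  financial_cols <- setdiff(col_names, c(by_name, indicator_cols))
  list(
    by_name    = intersect(by_name, col_names),
    indicators = indicator_cols,
    financial  = financial_cols
  )
}

plan <- plan_dispatch_sketch(c(
  "ein", "tax_period", "subsection_cd", "efile_indicator",
  "hospital_audited_attached_cd", "total_revenue"
))
```

Each column lands in exactly one bucket, which is the property the ordering constraint on the last two steps is meant to protect.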

4.8 Testing

Rscript tests/test_transforms.R runs all 32 unit tests against synthetic data.tables (no real CSV reads). Targets: each transform’s happy path, NA propagation, edge cases (empty, whitespace, malformed values), and boundary errors for multi-column transforms.
