4  Transforms Reference

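This section documents the six pure column-transform functions in R/transforms/. Each takes the table dt and an optional logger (default NULL); the two multi-column transforms also take cols, the names of the columns to convert.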

4.1 R/transforms/tax_period.R

transform_tax_period(dt, logger = NULL, tax_year_bounds = tax_year_range())

Parses the 6-char YYYYMM source tax_period field into integer tax_year and tax_month derived columns. Keeps tax_period as character (preserves leading zeros).

Validates: 6 digits, tax_year in [1989, current_year+1], tax_month in [1, 12]. Invalid rows → NA on the derived cols (the source tax_period is preserved as-is), counted in the log warning.
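The validation rules above can be sketched in base R as follows. This is a minimal illustration, not the code in R/transforms/tax_period.R: parse_tax_period_sketch and its year_bounds argument are hypothetical names, it works on a plain vector rather than a data.table, and it counts NA inputs as invalid, which the real transform may or may not do.

```r
# Hypothetical helper: parse YYYYMM strings into tax_year / tax_month.
# year_bounds stands in for whatever tax_year_range() returns (assumption).
parse_tax_period_sketch <- function(
    tax_period,
    year_bounds = c(1989L, as.integer(format(Sys.Date(), "%Y")) + 1L)) {
  tp <- as.character(tax_period)

  # Rule 1: exactly six digits.
  well_formed <- !is.na(tp) & grepl("^[0-9]{6}$", tp)

  year  <- rep(NA_integer_, length(tp))
  month <- rep(NA_integer_, length(tp))
  year[well_formed]  <- as.integer(substr(tp[well_formed], 1, 4))
  month[well_formed] <- as.integer(substr(tp[well_formed], 5, 6))

  # Rules 2 and 3: year within bounds, month in 1..12.
  valid <- well_formed &
    year >= year_bounds[1] & year <= year_bounds[2] &
    month >= 1L & month <= 12L

  # An invalid row loses both derived values; the source string is untouched.
  year[!valid]  <- NA_integer_
  month[!valid] <- NA_integer_

  list(tax_period = tp, tax_year = year, tax_month = month,
       n_invalid = sum(!valid))
}

res <- parse_tax_period_sketch(c("202112", "202113", "198812", "2021"))
res$tax_year   # 2021 NA NA NA
res$tax_month  # 12 NA NA NA
res$n_invalid  # 3
```

The source strings come back unchanged, so a row that fails validation can still be inspected after the fact.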

4.2 R/transforms/ein.R

transform_ein(dt, logger = NULL)

Normalizes EINs to the canonical IRS display format XX-XXXXXXX (10 chars, hyphen between digits 2 and 3). Strips any embedded hyphens and whitespace in the source, zero-pads to 9 digits, then inserts the hyphen.

The hyphen forces character typing on CSV re-read (so fread can’t auto-detect as integer and strip leading zeros). Invalid sources (non-numeric, empty) → NA.
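A base-R sketch of the strip, pad, and hyphenate steps, under stated assumptions: normalize_ein_sketch is a hypothetical name, it operates on a plain vector, and it treats anything longer than 9 digits as invalid (the text above does not say how the real transform handles that case).

```r
# Hypothetical helper: normalize EINs to the XX-XXXXXXX display format.
normalize_ein_sketch <- function(ein) {
  # Strip embedded hyphens and whitespace.
  x <- gsub("[-[:space:]]", "", as.character(ein))

  # Valid sources are 1 to 9 digits after stripping (the upper bound is an assumption).
  ok <- !is.na(x) & grepl("^[0-9]{1,9}$", x)

  out <- rep(NA_character_, length(x))
  # Left-pad with zeros to 9 digits, then insert the hyphen after digit 2.
  padded <- paste0(strrep("0", 9L - nchar(x[ok])), x[ok])
  out[ok] <- paste0(substr(padded, 1, 2), "-", substr(padded, 3, 9))
  out
}

normalize_ein_sketch(c("12-3456789", "012345678", "1234567", "ABC", ""))
# "12-3456789" "01-2345678" "00-1234567" NA NA
```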

4.3 R/transforms/subsection.R

transform_subsection_cd(dt, logger = NULL)

Coerces subsection_cd to integer, then validates it against KNOWN_SUBSECTION_CODES (loaded from data/lookups/subsection_codes.csv). Unknown codes → NA on subsection_cd.

Also derives the is_501c3 boolean column: TRUE iff subsection_cd == 3, strict (NA subsection → is_501c3 = FALSE, never NA). Strictness avoids three-valued logic for downstream filters.
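The strict two-valued behaviour of is_501c3 is easiest to see in a small sketch. The names below are hypothetical, and KNOWN_CODES_SKETCH is an illustrative stand-in for the real lookup loaded from the CSV, not its actual contents.

```r
# Illustrative stand-in for KNOWN_SUBSECTION_CODES (assumption, not the real list).
KNOWN_CODES_SKETCH <- c(2L, 3L, 4L)

# Hypothetical helper: validate subsection codes and derive is_501c3.
transform_subsection_sketch <- function(subsection_cd, known = KNOWN_CODES_SKETCH) {
  # Non-numeric strings become NA; the coercion warning is expected here.
  code <- suppressWarnings(as.integer(as.character(subsection_cd)))

  # Unknown codes (and NA) are set to NA.
  code[!(code %in% known)] <- NA_integer_

  # Strict flag: an NA code yields FALSE, never NA.
  list(subsection_cd = code, is_501c3 = !is.na(code) & code == 3L)
}

res <- transform_subsection_sketch(c("3", "03", "4", "99", "x"))
res$subsection_cd  # 3 3 4 NA NA
res$is_501c3       # TRUE TRUE FALSE FALSE FALSE
```

Because is_501c3 never holds NA, a downstream filter such as dt[is_501c3 == TRUE] and its negation together cover every row.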

4.4 R/transforms/financial_amounts.R

transform_financial_amounts(dt, cols, logger = NULL)

Coerces the columns named in cols to numeric. Per cell: strips $ and commas, converts parens-wrapped values (e.g., (500)) to negatives, treats empty / "-" as NA.

Parse failures per column are counted and logged. Already-numeric columns are skipped (the harmonized columns coming from fread are usually already numeric; this transform handles the 2012 .dat case where everything reads as character).

System-boundary check: stops with an error if any column name in cols isn’t present in dt.
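The per-cell rules and the boundary check could look roughly like this in base R. Both function names are hypothetical, the sketch works on one vector at a time, and trimming surrounding whitespace is an added assumption not stated above.

```r
# Hypothetical helper: parse one character column of money amounts.
parse_amount_sketch <- function(x) {
  s <- trimws(as.character(x))   # whitespace trimming is an assumption
  s <- gsub("[$,]", "", s)       # strip $ and thousands separators

  # Accounting-style negatives: (500) means -500.
  negative <- grepl("^\\(.*\\)$", s)
  s <- sub("^\\((.*)\\)$", "\\1", s)

  # Empty and "-" cells are missing values, not parse failures.
  missing_in <- is.na(s) | s == "" | s == "-"
  s[missing_in] <- NA_character_

  value <- suppressWarnings(as.numeric(s))
  flip <- negative & !is.na(value)
  value[flip] <- -value[flip]

  # Failures: cells that had content but did not parse.
  list(value = value, n_failed = sum(is.na(value) & !missing_in))
}

# Hypothetical boundary check: fail fast on unknown column names.
check_cols_sketch <- function(df, cols) {
  missing_cols <- setdiff(cols, names(df))
  if (length(missing_cols) > 0) {
    stop("Columns not found: ", paste(missing_cols, collapse = ", "))
  }
  invisible(TRUE)
}

res <- parse_amount_sketch(c("$1,234.50", "(500)", "-", "", "12abc", "-75"))
res$value     # 1234.5 -500 NA NA NA -75
res$n_failed  # 1
```

Keeping missing inputs separate from parse failures means the logged count reflects only cells that carried content the parser could not read.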

4.5 R/transforms/indicators.R

transform_indicators(dt, cols, logger = NULL)

Coerces the columns named in cols to logical. Per cell:

  • {Y, y, 1, T, TRUE, true, True} → TRUE
  • {N, n, 0, 2, F, FALSE, false, False} → FALSE
  • Anything else (including empty, NA) → NA

"2" is in the FALSE set because IRS shifted some binary _cd columns to a 1 = yes, 2 = no encoding in recent vintages (e.g. hospital_audited_attached_cd in py2022, qualified_health_plan_multi_state_cd in py2023). The widening is safe because real integer-code columns in the SOI extract are renamed off the _cd suffix at the crosswalk stage (e.g. prgmservcode2acd → program_revenue_code_2a) and therefore bypass this transform. See docs/11-upstream-source-quirks.qmd → “1/2 binary encoding on selected indicator columns (py2022+)” for the trigger that would require revisiting this rule.

Used by R/03_harmonize.R to sweep all harmonized columns ending in _cd (the IRS convention for yes/no indicators), except for subsection_cd which is an integer-coded categorical, not a binary indicator.
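The three-way mapping for indicator cells can be sketched with two membership tests. The names here are hypothetical, and the sketch matches cells exactly (no trimming or case folding beyond the listed spellings), which may differ from the real transform.

```r
# Accepted spellings, copied from the lists above.
TRUE_SET_SKETCH  <- c("Y", "y", "1", "T", "TRUE", "true", "True")
FALSE_SET_SKETCH <- c("N", "n", "0", "2", "F", "FALSE", "false", "False")

# Hypothetical helper: map one column of indicator cells to logical.
to_indicator_sketch <- function(x) {
  s <- as.character(x)
  out <- rep(NA, length(s))            # default: anything unrecognised is NA
  out[s %in% TRUE_SET_SKETCH]  <- TRUE
  out[s %in% FALSE_SET_SKETCH] <- FALSE
  out
}

to_indicator_sketch(c("Y", "2", "1", "", NA, "maybe"))
# TRUE FALSE TRUE NA NA NA
```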

4.6 R/transforms/efile_indicator.R

transform_efile_indicator(dt, logger = NULL)

The IRS e-file indicator uses inconsistent values across vintages, and the source-column name drifts as well. 990 and 990-EZ use elf in py2015–py2019; 990-EZ alone uses e-file (hyphenated) in py2020; and efile takes over from py2020 (990) or py2021 (990-EZ) onward. 990-PF uses ELF everywhere except py2016, which uses ELFCD. All variants are mapped to efile_indicator in the crosswalk OVERRIDES.

Value encoding by vintage:

  • py2015 990 + 990-EZ: E (electronic) / P (paper)
  • py2016–py2017 990 + 990-EZ: Y / N
  • py2018+: E / P again

The transform accepts all observed encodings plus boolean-like spellings:

  • TRUE set: {E, e, Y, y, 1, T, TRUE, true, True}
  • FALSE set: {P, p, N, n, 0, F, FALSE, false, False}
  • Anything else → NA

Wired into apply_transforms() by column name (alongside ein, tax_period, subsection_cd) so it bypasses the generic _cd indicator sweep — efile_indicator doesn’t carry the _cd suffix in our harmonized naming.
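One way to express the accepted encodings is a single named lookup vector, so that E/P and Y/N vintages collapse to the same logical values. This is a sketch with hypothetical names, not the code in R/transforms/efile_indicator.R.

```r
# Lookup built from the TRUE and FALSE sets listed above.
EFILE_MAP_SKETCH <- c(
  setNames(rep(TRUE, 9),
           c("E", "e", "Y", "y", "1", "T", "TRUE", "true", "True")),
  setNames(rep(FALSE, 9),
           c("P", "p", "N", "n", "0", "F", "FALSE", "false", "False"))
)

# Hypothetical helper: names that are not in the lookup index to NA.
to_efile_sketch <- function(x) unname(EFILE_MAP_SKETCH[as.character(x)])

to_efile_sketch(c("E", "P"))       # py2015-style values: TRUE FALSE
to_efile_sketch(c("Y", "N"))       # py2016-style values: TRUE FALSE
to_efile_sketch(c("X", "", NA))    # NA NA NA
```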

4.7 Dispatch in R/03_harmonize.R

After the crosswalk renames source vars to harmonized names, apply_transforms() dispatches in this order:

  1. transform_tax_period(dt) — by name
  2. transform_ein(dt) — by name
  3. transform_subsection_cd(dt) — by name; derives is_501c3
  4. transform_efile_indicator(dt) — by name
  5. transform_indicators(dt, cols = setdiff(grep("_cd$", names(dt), value = TRUE), "subsection_cd")) — by suffix
  6. transform_financial_amounts(dt, cols = everything_else) — everything not consumed above

Order matters only for steps 5 and 6 (the suffix sweep and the catch-all must exclude the identity cols handled by 1–4) and for step 3 (which produces is_501c3, used downstream by quality checks).
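The column partition behind steps 1 to 6 can be sketched as plain set arithmetic on the harmonized names. plan_dispatch_sketch is a hypothetical helper, total_revenue is an invented column name used only for illustration, and the real apply_transforms() may assemble these sets differently.

```r
# Hypothetical helper: decide which transform each harmonized column goes to.
plan_dispatch_sketch <- function(col_names) {
  # Steps 1-4: columns dispatched by exact name.
  by_name <- intersect(
    c("tax_period", "ein", "subsection_cd", "efile_indicator"),
    col_names
  )

  # Step 5: suffix sweep. value = TRUE makes grep() return names rather than
  # positions; subsection_cd is excluded because it is a categorical code.
  indicators <- setdiff(grep("_cd$", col_names, value = TRUE), "subsection_cd")

  # Step 6: whatever is left goes to the financial-amount transform.
  amounts <- setdiff(col_names, c(by_name, indicators))

  list(by_name = by_name, indicators = indicators, amounts = amounts)
}

plan <- plan_dispatch_sketch(c("ein", "tax_period", "subsection_cd",
                               "efile_indicator",
                               "hospital_audited_attached_cd",
                               "total_revenue"))
plan$indicators  # "hospital_audited_attached_cd"
plan$amounts     # "total_revenue"
```

Each column lands in exactly one group, which is why the sweeps in steps 5 and 6 depend on the by-name set being fixed first.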

4.8 Testing

Rscript tests/test_transforms.R runs all 32 unit tests against synthetic data.tables (no real CSV reads). Targets: each transform’s happy path, NA propagation, edge cases (empty, whitespace, malformed values), and boundary errors for multi-column transforms.