12 Upstream Source-File Quirks
The IRS Statistics of Income (SOI) Tax-Exempt Organization extracts have accumulated systematic oddities over the years — header typos, file-format changes, missing publications, single-vintage coverage gaps. The pipeline compensates for each of them, but downstream users of CORE outputs benefit from knowing what was wrong upstream so that (a) anomalies in the harmonized data can be traced to a known cause rather than assumed to be a pipeline bug, and (b) longitudinal comparisons across affected years can be reasoned about explicitly.
Entries here describe what the source file does, how the pipeline handles it, and where to look in the code/config to adjust if the situation changes.
12.1 2015 header typos: Y substituted for 1, N substituted for 2
In the py2015 IRS source files, several column-name headers have the letters Y and N in place of the digits 1 and 2: within an affected header, each 1 appears as Y and each 2 as N. The substitution affects 6 columns in 15eofinextract990.dat and 1 column in 15eofinextractEZ.dat. The mechanism is unclear (plausibly a keypad data-entry artifact or a font-mapping bug at the IRS), but the substitution is internally consistent and one-to-one, which is why harmonization can recover the affected columns without ambiguity.
| Form | Typo’d source header | Canonical header (other years) | Harmonized name |
|---|---|---|---|
| 990 | s50Yc3or4947aYcd | s501c3or4947a1cd | s501c3_or_4947a1_cd |
| 990 | operateschoolsY70cd | operateschools170cd | operates_school_sec170_cd |
| 990 | filedf8N8Ncd | filedf8282cd | property_disposition_8282_cd |
| 990 | filedfY098ccd | filedf1098ccd | filed_form_1098c_cd |
| 990 | filedlieufY04Ycd | filedlieuf1041cd | filed_in_lieu_1041_cd |
| 990 | filedf7N0cd | filedf720cd | filed_form_720_cd |
| 990-EZ | filedfYYN0polcd | filedf1120polcd | filed_form_1120pol_cd |
The pipeline handles these by adding the typo’d source variants as additional rows in soi_990_crosswalk_OVERRIDES.csv and soi_990ez_crosswalk_OVERRIDES.csv, each pointing at the same harmonized name as the canonical (other-vintage) source variant. The harmonize step’s coalescing logic (multiple source vars → same harmonized name → first non-null wins) merges them into a single column in the output. Rows present only in py2015 take the typo’d value; rows from other vintages take the canonical value. Both end up populated in the tax_year output regardless of which processing-year extract supplied them.
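The coalescing rule described above can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the pipeline's R code; the `CROSSWALK` dict and `harmonize_row` function are hypothetical stand-ins for the OVERRIDES rows and the harmonize step, and the "Y"/"N" cell values are made up.

```python
# Hypothetical stand-in for two OVERRIDES rows: source header -> harmonized name.
# The typo'd py2015 variant and the canonical variant share one target.
CROSSWALK = {
    "filedf720cd": "filed_form_720_cd",  # canonical header (other vintages)
    "filedf7N0cd": "filed_form_720_cd",  # py2015 typo'd header
}


def harmonize_row(source_row, crosswalk=CROSSWALK):
    """Map source columns to harmonized names; the first non-null value wins."""
    out = {}
    for source_name, value in source_row.items():
        target = crosswalk.get(source_name)
        if target is None:
            continue  # unmapped source column: dropped
        if out.get(target) is None:
            out[target] = value
    return out


# Illustrative rows: each vintage carries only one of the two variants.
print(harmonize_row({"filedf7N0cd": "Y"}))  # {'filed_form_720_cd': 'Y'}
print(harmonize_row({"filedf720cd": "N"}))  # {'filed_form_720_cd': 'N'}
```

Because a given vintage carries only one of the two headers, the "first non-null wins" rule never has to arbitrate between conflicting values in this case.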
12.2 Source-file format boundary: .dat (space-delimited) → .csv (comma-delimited) at 2017→2018
py2012–py2017 extracts are space-delimited .dat files; py2018+ extracts are comma-delimited .csv files. The pipeline detects which by file extension, not by year, so the cutover is robust to IRS changing the year-of-cutover convention again. See R/data.R → SOURCE_FORMAT_FROM_PATH().
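As a rough illustration of extension-based detection, here is a Python analogue. It is a sketch under assumptions, not a translation of SOURCE_FORMAT_FROM_PATH(): the function name, the case-insensitive match, and the error on unknown extensions are all choices made for this example.

```python
from pathlib import Path

# Extension -> field delimiter, per the two formats described above.
DELIMITER_BY_EXTENSION = {".dat": " ", ".csv": ","}


def source_format_from_path(path):
    """Return the delimiter implied by the file extension, ignoring the year."""
    ext = Path(path).suffix.lower()  # case-insensitive match is an assumption
    if ext not in DELIMITER_BY_EXTENSION:
        raise ValueError(f"unrecognized source extension: {ext!r}")
    return DELIMITER_BY_EXTENSION[ext]


print(repr(source_format_from_path("15eofinextract990.dat")))  # ' '
print(repr(source_format_from_path("hypothetical_extract.csv")))  # ','
```

Nothing in the function inspects the processing year, so a future vintage that reverts to .dat would still be read with the right delimiter.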
12.3 Missing 990-PF publications for py2017, py2018, py2019
The IRS did not publish a 990-PF extract for processing years 2017, 2018, or 2019. As a result, 990-PF tax-year outputs for tax_years 2016–2018 have low row counts, which is expected: most 990-PF forms for those tax years would have come from the missing processing years. The pipeline logs each absent extract as “SKIP no published extract” (visible in phase 1 logs) and continues; downstream tax-year outputs for the affected years are smaller than the surrounding years but otherwise correct.
12.4 tax_period source column name drifts across years
The source column carrying tax period varies erratically: 990 cycles between tax_prd (py2012, py2014, py2015) and tax_pd (everywhere else); 990-EZ has five variants across years (tax_prd, tax_pd, taxprd, a_tax_prd, taxpd); 990-PF stays consistent at TAX_PRD. All variants map to the same tax_period harmonized column in the OVERRIDES files.
12.5 elf (e-file indicator) source column name drifts
Three different names in three consecutive years of 990-EZ:
- py2019: elf
- py2020: e-file (hyphenated)
- py2021–py2024: efile
All three map to efile_indicator in the OVERRIDES. The py2020 hyphenated variant was missed in the initial crosswalk and added retrospectively after surfacing in a full pipeline run.
A separate one-off rename: py2016 990-PF uses ELFCD for the same indicator (every other 990-PF vintage uses ELF).
12.6 Single-vintage coverage gap: 28 indicator columns in py2021 990
The py2021 IRS 990 extract under-populates a cluster of yes/no indicator (_cd) columns. Coverage recovers partially in py2022 and fully in py2023+. See docs/08-output-schema.qmd → “Tax_year 2021 990 has a single-vintage gap on 28 indicator columns” for the full column list and patterns.
12.7 Malformed-row truncation in py2014 990
The py2014 990 .dat file has one malformed row around row 2,980. With fread's default behavior the read would stop at the bad row, returning ~2,978 of the 299,405 rows. The pipeline sets fill = TRUE in read_source() to tolerate this without truncating; the malformed row’s values become NA but the remaining 299K rows are read intact.
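The difference between stopping at a bad row and filling it can be shown with a simplified Python analogue. This is not fread's actual algorithm: `read_space_delimited` is a hypothetical helper, the three-column sample is synthetic, and "malformed" is reduced here to "wrong number of fields".

```python
import io


def read_space_delimited(handle, fill=True):
    """Read space-delimited rows; pad short rows with None when fill is set."""
    header = handle.readline().split()
    rows = []
    for line in handle:
        fields = line.split()
        if len(fields) != len(header):
            if not fill:
                break  # mimic the truncation described above
            # Pad missing fields with None; zip() ignores any extra fields.
            fields = fields + [None] * (len(header) - len(fields))
        rows.append(dict(zip(header, fields)))
    return rows


# Synthetic sample: the second data row is missing its last field.
sample = "ein tax_pd totrev\n1 201312 100\n2 201312\n3 201312 300\n"
rows = read_space_delimited(io.StringIO(sample))
print(len(rows), rows[1]["totrev"])  # 3 None
```

With `fill=False` the same sample yields a single row, which is the small-scale version of losing all but ~2,978 rows of the py2014 file.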
12.8 non_pf_status_reason uses pre-2016 Schedule A line numbering
The non_pf_status_reason column (source nonpfrea) is SOI’s encoding of which line a filer checked in Schedule A Part I (“Reason for Public Charity Status”). The IRS revised Schedule A in 2016 to add “agricultural research organization” at line 9, shifting 509(a)(2) from line 9 to line 10 and the rest of the list downstream. SOI did not re-encode nonpfrea after the form revision. Code 9 in the data continues to mean 509(a)(2) (gross-receipts public charity), not “agricultural research organization” as the current form’s line numbering would suggest.
The fingerprint of this in the data: code 9 grows smoothly from ~34% of 990-EZ filers in tax_year 2010 to ~50% in tax_year 2024, with no step change at the 2016 form-revision boundary. If SOI had migrated to the new numbering in 2016, we’d see a discontinuity. Code 10 (which would be the new home of 509(a)(2) under the post-2016 numbering) stays at near-zero rows throughout the entire range — fully consistent with the pre-2016 interpretation where code 10 = public-safety testing (509(a)(4)), which is genuinely rare.
The full code mapping (verified against the 2011 Schedule A instructions in data/raw/forms/i990sa_2011.pdf and the 2010–2024 data distribution) lives in data/lookups/non_pf_status_reason_codes.csv. Code 16 (≤1,089 rows across 13 years) lacks a clean Schedule A line mapping and is flagged confidence = unknown.
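The "no step change" check can be reproduced along these lines. This is a minimal Python sketch: both function names are hypothetical, and the input is assumed to be an iterable of (tax_year, non_pf_status_reason) pairs drawn from the harmonized output.

```python
from collections import Counter


def code_share_by_year(records, code):
    """records: iterable of (tax_year, status_code). Returns {year: share}."""
    totals, hits = Counter(), Counter()
    for year, status in records:
        totals[year] += 1
        if status == code:
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}


def largest_yearly_jump(shares):
    """Return (year, change) for the largest absolute year-over-year change."""
    years = sorted(shares)
    changes = [(b, shares[b] - shares[a]) for a, b in zip(years, years[1:])]
    return max(changes, key=lambda item: abs(item[1]))
```

A re-encoding at the form revision would show up as one outsized change at the revision year; smooth growth produces changes of similar size in every year.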
12.9 1/2 binary encoding on selected indicator columns (py2022+)
Two binary indicator columns shifted their value encoding from Y/N (or 1/0) to 1 = yes, 2 = no in recent vintages:
- hospital_audited_attached_cd (source: hospaudfinstmtcd): surfaced in py2022 990 with 2,355 "1" and 1,245 "2" values, ~97% empty.
- qualified_health_plan_multi_state_cd (source: qualhlthplncd): surfaced in py2023 990 with 20 "1" and 4,788 "2" values, ~98.5% empty.
Both are unambiguously binary questions in the underlying form (audited financial statements attached? multi-state health plan?), and the directional yes/no skew in each is consistent with 1 = yes, 2 = no (audit attachment skews yes; multi-state health plans skew overwhelmingly no).
The pipeline handles this by including "2" in the INDICATOR_FALSE set of R/transforms/indicators.R. This is a global widening — it applies to every _cd-suffixed harmonized column — rather than a per-column override, on the rationale that:
- The two affected columns are not unique cases; IRS may extend the 1/2 encoding to additional binary _cd columns in future vintages without notice. A global rule covers them automatically.
- Real integer-code columns in the SOI extract (e.g. program service codes, miscellaneous revenue codes) are renamed off the _cd suffix at the crosswalk stage (prgmservcode2acd → program_revenue_code_2a, etc.) and therefore bypass transform_indicators entirely. So no existing column is at risk of being miscoded by the widened rule.
Revisit if: a _cd-suffixed harmonized column is ever introduced that uses trinary encoding (e.g. 0 = no, 1 = yes, 2 = unknown). The rule above would then silently miscode "2" as FALSE for that column. The fix in that case would be either (a) rename the new column off the _cd suffix at the crosswalk stage, matching how integer-code columns are already handled, or (b) switch to a per-column override list.
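A minimal Python sketch of the widened rule follows. It is not the code in R/transforms/indicators.R: only the presence of "2" in the false set comes from the text above, while the other set members, the `INDICATOR_TRUE` name, and the whitespace and case handling are assumptions.

```python
# Assumed token sets; only the "2" entry is taken from the text above.
INDICATOR_TRUE = {"Y", "1"}
INDICATOR_FALSE = {"N", "0", "2"}  # "2" added to cover the 1 = yes, 2 = no encoding


def transform_indicator(value):
    """Map a raw _cd value to True, False, or None (missing or unrecognized)."""
    if value is None:
        return None
    token = str(value).strip().upper()
    if token in INDICATOR_TRUE:
        return True
    if token in INDICATOR_FALSE:
        return False
    return None


print([transform_indicator(v) for v in ["1", "2", "Y", "N", ""]])
# [True, False, True, False, None]
```

The sketch also makes the "Revisit if" risk concrete: a trinary column where "2" means unknown would come out as False here, with nothing to signal the miscoding.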
12.10 Filer-supplied balance-sheet identity failures (~12% of 990-EZ rows)
About 11.8–12.1% of 990-EZ rows in tax_years 2013–2020 violate the identity total_assets_eoy = total_liabilities_eoy + total_net_assets_eoy. The rate trends down to 6.6–7.2% in 2021–2023 and 3.8% in 2024. Pre-2013 vintages (sourced from the 2012 SOI extract) show 5.5–9.3% on the bulk-row years (2011, 2012) and noisier rates on the small legacy CORE samples that contribute to 2000–2010.
In 2017 990-EZ (representative year, 218,552 rows, 25,914 failures):
- Sign distribution of assets − liabilities − net_assets is mixed: 59% positive, 41% negative. Not a single systematic offset.
- Median |diff| is $568; long tail to $713,640.
- Among failing rows, only 0.9% have net_assets == assets, i.e. this is not the “liabilities subtraction missing during harmonization” signature.
The reviewer-suggested hypothesis (“swap between networthend and totnetassetsend at the 2012→2013 boundary causing systematic harmonization failures”) is not supported by the data: a wrong column mapping from 2013+ would produce ~100% failures from a fixed year onward, not a 12% plateau drifting downward. The three harmonized columns (total_assets_eoy, total_liabilities_eoy, total_net_assets_eoy) each map to a single SOI source variant that is stable across 2012+ extracts.
The likeliest causes are upstream: small-org filers using fund-accounting conventions, accumulated-deficit reporting outside the simple identity, restated values across periods, or genuine filing errors. The pipeline does not recompute or override these — doing so would silently overwrite filer-supplied figures and erase real signal. A derived balance_sheet_identity_ok diagnostic flag would be the right approach for downstream filtering; one is not currently emitted.
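Until such a flag is emitted, a downstream user could derive one along these lines. This is a sketch in Python, not pipeline code: the function borrows the flag name suggested above, and the `tolerance` parameter and the None-for-missing behavior are assumptions.

```python
def balance_sheet_identity_ok(assets, liabilities, net_assets, tolerance=0):
    """Check total_assets_eoy == total_liabilities_eoy + total_net_assets_eoy.

    Returns None when any input is missing, so that "cannot be checked"
    stays distinct from "checked and failed".
    """
    if assets is None or liabilities is None or net_assets is None:
        return None
    return abs(assets - liabilities - net_assets) <= tolerance


print(balance_sheet_identity_ok(1000, 400, 600))   # True
print(balance_sheet_identity_ok(1000, 400, 32))    # False
print(balance_sheet_identity_ok(1000, None, 600))  # None
```

Keeping this as a separate flag leaves the filer-supplied figures untouched; a nonzero `tolerance` is an analyst choice, for example to ignore small rounding differences.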
12.11 Schedule A _sec170 / _sec509 blocks are filer-test-specific, zeros are not necessarily missing
501(c)(3) public-support tests come in two flavors — Schedule A Part II (§170(b)(1)(A)(vi), “one-third public support” test) and Part III (§509(a)(2), “gross-receipts” test). Each filer uses one test; the other block is irrelevant to that filer.
IRS-supplied extracts fill the unused block with literal 0, not NULL. In 2017 990-EZ:
- gifts_grants_received_sec170 is 0 on 74.8% of rows (NA count: 0).
- total_gifts_grants_received_sec509 is 0 on 60.9% of rows (NA count: 0).
- 819 rows have both blocks nonzero: orgs that switched tests mid-period or filed inconsistently.
The filer’s chosen test is encoded in non_pf_status_reason (see data/lookups/non_pf_status_reason_codes.csv). Cross-referencing that column gives an analyst the basis to decide which block’s zeros are meaningful and which mean “did not use this test.” The pipeline does not coerce these zeros to NA — that would require interpreting filer intent (a research decision, not a transformation), and would silently erase legitimate zero-value filings on whichever test the filer did use.
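An analyst-side classification of those zeros might look like the sketch below. It is Python for illustration and deliberately outside the pipeline: the function names and return labels are hypothetical, only the code 9 = 509(a)(2) mapping comes from this chapter, and the codes for the §170 test must be supplied by the caller from data/lookups/non_pf_status_reason_codes.csv.

```python
SEC509_CODES = {9}  # from this chapter: code 9 = 509(a)(2), the gross-receipts test


def block_in_use(status_code, sec170_codes, sec509_codes=SEC509_CODES):
    """Return which support-test block the filer's status code points to."""
    if status_code in sec509_codes:
        return "sec509"
    if status_code in sec170_codes:
        return "sec170"
    return None


def interpret_value(column_block, value, status_code, sec170_codes):
    """Classify a value in a _sec170 or _sec509 column."""
    if value != 0:
        return "reported"
    used = block_in_use(status_code, sec170_codes)
    if used is None:
        return "undetermined"
    return "true_zero" if used == column_block else "test_not_used"


# Placeholder set for illustration only; take the real codes from the lookup CSV.
sec170_codes = {7}
print(interpret_value("sec170", 0, 9, sec170_codes))  # test_not_used
print(interpret_value("sec509", 0, 9, sec170_codes))  # true_zero
```

The classification is returned as a label rather than written back over the data, which keeps the research decision visible and reversible.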
12.12 EZ filers exceeding the 990-EZ filing thresholds
Form 990-EZ is restricted to organizations under $200k gross receipts and $500k assets. In tax_year 2017, 174 EZ filers report total_revenue > $200k and 81 report total_assets_eoy > $500k.
Most are legitimate: the IRS threshold is on gross receipts, not net revenue, so a fundraiser-heavy org can clear $200k in revenue (net of large fundraising expenses) while staying under the gross-receipts cap. The remainder are orgs that filed the wrong form; IRS accepts these rather than rejecting. These rows are not pipeline artifacts and should be treated as legitimate signal that a small fraction of EZ filings would, by analyst convention, belong in the 990-full series.
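Analysts who want to set these rows aside can flag them without dropping them. A minimal Python sketch, with hypothetical flag names; note that it compares total_revenue against the gross-receipts cap, mirroring the counts above, because that is the comparison this section reports.

```python
GROSS_RECEIPTS_CAP = 200_000  # 990-EZ gross-receipts threshold cited above
TOTAL_ASSETS_CAP = 500_000    # 990-EZ total-assets threshold cited above


def ez_threshold_flags(total_revenue, total_assets_eoy):
    """Flag a 990-EZ row whose reported values exceed the filing thresholds."""
    def over(value, cap):
        return value is not None and value > cap

    return {
        "revenue_over_200k": over(total_revenue, GROSS_RECEIPTS_CAP),
        "assets_over_500k": over(total_assets_eoy, TOTAL_ASSETS_CAP),
    }
```

Flagging rather than filtering keeps the rows available for analyses that treat them as legitimate EZ filings.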
12.13 Filer-supplied negatives on conceptually non-negative fields
In tax_year 2017 990-EZ:
- total_contributions: 12 negatives, min −$100,000
- program_service_revenue: 54 negatives, min −$955,195
- gifts_grants_received_sec170: 8 negatives, min −$137,821
None of these are amendments. Only 3 rows in the entire 218,552-row file have is_amendment == TRUE, and zero of them overlap with the negatives. The values are IRS-supplied as-filed — plausibly refunded contributions exceeding receipts in a period, restated values, filer data-entry artifacts, or year-end netting that produces a negative aggregate. The pipeline does not sign-flip, zero out, or clip these; they are passed through faithfully.
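The counts above can be reproduced with a small summary along these lines. This is a Python sketch with a hypothetical helper name; rows are assumed to be dicts keyed by harmonized column name, with is_amendment as a boolean.

```python
def negative_summary(rows, column):
    """Count negatives in a column, their minimum, and overlap with amendments."""
    negatives = [r for r in rows if r.get(column) is not None and r[column] < 0]
    return {
        "n_negative": len(negatives),
        "min_value": min((r[column] for r in negatives), default=None),
        "n_amended": sum(1 for r in negatives if r.get("is_amendment")),
    }
```

An `n_amended` of zero alongside a nonzero `n_negative` is the pattern described above: negatives that cannot be explained as amended returns.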
12.14 Schedule B’s filename is f990ezb, not f990sb
IRS publishes blank Schedule B (Schedule of Contributors) under the basename f990ezb for all three forms it accompanies — 990, 990-EZ, and 990-PF. The ez in the filename is misleading; the same PDF is the official contributor schedule for every 990-series filer that has reportable contributions. The naming likely dates back to when Schedule B was introduced for 990-EZ first and the filename was never normalized when the schedule was extended.
All other schedules follow the regular f990s<letter> pattern (f990sa, f990sc, f990sd, …). scripts/download_irs_forms.R hard-codes f990ezb as the canonical basename so the archive picks up Schedule B for every year that publishes it.
12.15 Trailing-empty header columns in py2020+ CSVs
py2020+ .csv extracts have trailing commas in their header rows, which fread interprets as additional empty-named columns. For 990 py2020, that’s 5 trailing empties (v247–v251); for 990-EZ py2020, 1 trailing empty (v73). Pre-checks ignore empty-named columns when counting against the expected vintage column count; the harmonize step drops them as unmapped. They are not real data columns.
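A pre-check of this kind can be sketched as follows. It is Python for illustration, not the pipeline's pre-check: both function names are hypothetical, and the header is assumed to arrive as a list of raw names with empty strings for the trailing-comma columns.

```python
def split_trailing_empties(header):
    """Return (named_columns, n_trailing_empty) for a raw header row."""
    named = list(header)
    while named and not named[-1].strip():
        named.pop()
    return named, len(header) - len(named)


def header_matches_vintage(header, expected_count):
    """Compare the named-column count against the expected vintage count."""
    named, _ = split_trailing_empties(header)
    return len(named) == expected_count


# Synthetic header shaped like 990 py2020: 246 named columns, 5 trailing empties.
header = [f"col{i}" for i in range(1, 247)] + [""] * 5
print(split_trailing_empties(header)[1])    # 5
print(header_matches_vintage(header, 246))  # True
```

Only trailing empties are stripped; an empty name in the middle of the header would still count against the expected total and surface as a mismatch.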
12.16 Adding a new entry
When a pipeline run surfaces a new upstream quirk, add a section here describing:
- What the source file does — concrete observation (specific column name / value / row count, with year / form).
- How the pipeline handles it — which file got the fix (crosswalk OVERRIDES, transform, config, reader option), and the principle behind it.
- How to widen the handling if the same quirk recurs in a future vintage.
Keep each entry self-contained — a user troubleshooting an anomaly should be able to land on one section and confirm whether their observation matches a known cause without reading the whole chapter.