9 Output Schema

Note

TODO: Auto-generated harmonized column reference per form, populated from FINAL crosswalks at book build time.

9.1 Per-output dictionary

Each (tax_year, form) output has a companion dictionary at processed/core/{tax_year}/{form}/core_{tax_year}_{form}_dictionary.csv:

col	source
`harmonized_name`	FINAL crosswalk
`description`	FINAL crosswalk
`source_var`	FINAL crosswalk (synonyms joined with `\\|`)
`source_location`	FINAL crosswalk
`data_type`	inferred from harmonized output
`n_rows`	output
`n_nonnull`	output
`null_pct`	derived
`n_distinct`	output
`min_value`	numeric only
`max_value`	numeric only
`years_present`	FINAL crosswalk

9.2 Lookup tables for coded columns

Categorical-coded columns (integer codes like subsection_cd, non_pf_status_reason) have companion lookup CSVs in data/lookups/. Each lookup file has columns code, label, irc_ref, source, confidence.

subsection_codes.csv — IRS-internal EO subsection codes, verified against IRM 25.7.1 Exhibit 25.7.1-4. Maintained independently from the parallel lookup in nccs-data-bmf; both should derive from the same IRM upstream. Cross-check periodically — divergences should be investigated, not silently retained.
non_pf_status_reason_codes.csv — codes 0–16 for the non_pf_status_reason column. SOI’s encoding is locked to the pre-2016 Schedule A Part I line numbering, not the current form: code 9 is 509(a)(2) (gross-receipts public charity, ~42% of small filers), not “agricultural research organization” as the 2016+ form numbering would suggest. The 2016 Schedule A revision inserted “agricultural research organization” at line 9 and shifted everything downstream, but SOI never re-encoded. Code distribution across 2010–2024 confirms this: code 9 grows smoothly (34% → 50% of 990-EZ filers) with no step change at the 2016 form-revision boundary. Code 10 (public-safety testing, 509(a)(4)) stays at near-zero throughout, consistent with how rare those organizations actually are. Code 16 (≤1,089 rows across all 13 years) lacks a clean Schedule A line mapping and is flagged confidence = unknown in the lookup.

When adding a new coded column, follow the same pattern: build the lookup from an authoritative source (IRM, IRS Pub 557, or the form’s instructions PDF) and mark each row’s confidence as verified / inferred / unknown.

9.3 NA semantics

Two important conventions when interpreting NA in CORE outputs:

Financial fields use 0 instead of NA. The IRS SOI extract fills 0 for every financial column on every row, regardless of whether the organization actually engaged in the activity. So gross_income_gaming_activities = 0 could mean either “didn’t run gaming” or “ran gaming and grossed $0.” Do not assume is.na(x) finds missing data on financial columns — use x > 0 to identify orgs that reported activity.

Vintage-padded columns are 100% NA in pre-vintage tax-year files. When a harmonized column doesn’t exist in the source extract for a given vintage (e.g., efile_indicator for 2012 990-EZ), the pipeline NA-fills it for schema stability. Within a single (tax_year, form) output file, vintage-padded columns are either 0% NA (vintage has the field) or 100% NA (vintage doesn’t). Use the years_present column in the crosswalk to predict which.

The 2012 IRS 990 SOI extract is unusually sparse. Per the IRS variable matrix:

Processing year	990 source col count
2012	62
2013	245
2014+	245-246

The 2012 extract was the inaugural release of the SOI EO program and IRS published with a limited initial schema. As a result, tax_year 2008–2012 returns harmonized from the 2012 processing-year extract will have heavy NAs on most 990 columns. Tax_year 2013+ outputs sourced from 2013+ extracts will have the full ~246-col schema. This 4× schema jump between the inaugural year and the second year is much larger than typical year-over-year drift, so it’s worth flagging to downstream users.

A single tax-year file is built from multiple processing-year extracts. A given (tax_year, form) output is the union of all rows for that tax year across every processing-year extract the pipeline has ingested. Each contributing row’s columns are filled based on the processing-year schema, not the tax_year. So NA rates for vintage-dependent columns within a single tax-year file depend on the mix of processing-year extracts that supplied rows to it. Pre-2013 tax years on columns added in 2013+ will have substantial NA rates that cannot be fully resolved — the IRS extracts that captured most of those filings simply didn’t include the column. This is a real limitation of the source data, not a pipeline bug.

Tax_year 2021 990 has a single-vintage gap on 28 indicator columns. The py2021 IRS 990 extract is populated for fewer rows than usual on a cluster of yes/no indicator (_cd) columns. Coverage recovers partially in py2022 and fully in py2023+, so most affected rows in the tax_year 2021 output (which is built mainly from py2021) carry NA where py2020 and py2023 would carry Y/N. Two patterns within the cluster:

Deep drop, then recovery — ~17% complete in 2021, ~88% in 2022, 100% in 2023. 18 columns including donor_restricted_endowments_cd, business_txn_with_35pct_entity_cd, excess_business_holdings_cd, filed_form_8886t_cd, filed_form_720_cd, filed_in_lieu_1041_cd, hospital_audited_attached_cd, nondeductible_disclosure_cd, act_on_behalf_of_issuer_cd, escrow_account_cd, bond_proceeds_invested_cd, notify_donor_value_cd, distribution_to_donor_cd, s4966_distributions_cd, filed_form_990t_cd, filed_form_8899_cd, filed_form_1098c_cd.
Shallow drop, then recovery — ~65–85% complete in 2021, ~94–97% in 2022, 100% in 2023. 9 columns: received_funds_for_premiums_cd, paid_premiums_personal_benefit_cd, backup_withholding_compliance_cd, filed_employment_tax_returns_cd, property_disposition_8282_cd, quid_pro_quo_contributions_cd, transfer_to_noncharitable_org_cd, lobbying_activities_cd, excess_benefit_transaction_cd, aware_prior_excess_benefit_cd.

One outlier, qualified_health_plan_multi_state_cd, drops in 2021 (~14%) and stays sparse in 2022 (~3%), recovering only partially in 2023 (~88%) — a different shape that suggests a column rename or definition change in the IRS extract rather than a publication gap.

This is an upstream SOI-extract pattern, not a harmonization bug. Downstream uses of 2021 990 indicator columns should treat NA on these fields as “value not published in the extract,” not as “filer left it blank.” For longitudinal comparisons spanning 2021, prefer columns that don’t appear in the affected cluster, or aggregate across multiple tax years to dilute the gap.

9.4 Three “year” columns

Every CORE output has at least two year columns, and 990-PF has a third. They mean different things — pick deliberately when grouping or joining.

Column	Source	Meaning	Forms
`tax_year`	Pipeline-derived (`substr(tax_period, 1, 4)`)	Calendar year the fiscal year ended. The partition key.	All
`soi_year`	IRS-assigned (`TAX_YR` / `tax_yr`)	Calendar year covering most of the filer’s activity per IRS SOI convention. Differs from `tax_year` by 1 for non-calendar-year filers ending before December.	990-PF only
`extract_year`	Pipeline-added	Year the IRS SOI extract that contained this row was processed and published. Provenance, not a fiscal concept.	All

For calendar-year filers (~84% of 990-PF in tax_year 2011), soi_year = tax_year. For fiscal-year filers ending Jan–Nov, soi_year = tax_year - 1.

Use tax_year for strict fiscal-period-end groupings; use soi_year for IRS’s calendar-year activity proxy.

9.5 Universal columns

Present in every output:

ein — 9-digit EIN, hyphenated as XX-XXXXXXX. The hyphen forces character typing on CSV re-read so leading zeros survive.
tax_period — 6-char YYYYMM (preserved as character).
tax_year — int, derived from substr(tax_period, 1, 4).
tax_month — int 1–12.
subsection_cd — IRC EO subsection code (1–93 per IRM 25.7.1). See data/lookups/subsection_codes.csv.
is_501c3 — boolean, subsection_cd == 3. Replaces the legacy 501C3/501CE partition.
extract_year — int, processing year of the IRS SOI extract that supplied this row.
is_amendment — boolean, TRUE if the same (ein, tax_period) appears in an earlier extract_year.

9.6 Merged-panel columns

Output files under data/processed_merged/ (produced by R/run_build_panel.R) carry two additional columns that the SOI-current and legacy single-pipeline outputs do not:

source_pipeline — "legacy" or "soi_current". The primary origin of the row. Overlap rows (same (ein, tax_period) in both pipelines) are tagged "soi_current" because SOI-precedence resolved the shared columns.
has_legacy_augment — boolean. TRUE iff the row’s (ein, tax_period) appeared in both pipelines, i.e. the SOI row was augmented by legacy-only columns. FALSE on single-pipeline rows.

Together these two columns let analysts filter “as-filed only” (source_pipeline == "soi_current" & !has_legacy_augment), “NCCS legacy only” (source_pipeline == "legacy"), or “rows where both sources had something to say” (has_legacy_augment). Where SOI-precedence overwrote a non-NA legacy value on a shared column, the discarded legacy value is recorded in data/logs/merge_disagreements_{year}_{form}.csv — see docs/09-legacy-harmonization.qmd (“Merge phase”) for the rule + the as-filed-vs-imputed caveat.

9.7 Natural primary key

The natural primary key is (ein, tax_period). Duplicates on this key are exactly the rows where is_amendment == TRUE — the same filer submitted the same tax period in a later processing year (typically an amended return). Deduplication policy is the consumer’s choice: keep only is_amendment == FALSE for the original filings, or keep the latest extract_year per (ein, tax_period) for amendment-superseded values.

9.8 `tax_year` is constant within a single output file

Each CORE output is partitioned by tax_year (the first four characters of tax_period). Within a single (tax_year, form) file every row has the same tax_year value by construction. The column is retained — rather than dropped — so that downstream code stacking multiple-year outputs gets a working group/join key without re-derivation, and so that schemas match across years. Treat the redundancy within one file as a feature, not waste.

9.9 No top-coding or winsorization

CORE outputs are not top-coded, winsorized, censored, or otherwise capped. Maximum values reflect the largest filer-supplied figure in the IRS SOI extract for that (tax_year, form, column) cell, unchanged. Identical maxes across related columns (e.g. total_contributions and total_revenue both maxing at the same value in a given year) reflect real-world equality for one filer (revenue = contributions when contributions are the only revenue source), not capping. The dictionary’s max_value field is the raw extract maximum.

9.10 `is_501c3` is not always TRUE in 990-PF

About 6% of 990-PF rows are §4947(a)(1) non-exempt charitable trusts treated as private foundations (subsection_cd = 92, is_501c3 = FALSE). They file 990-PF because IRC treats them as private foundations for excise-tax purposes, but they are not 501(c)(3) exempt organizations. The is_501c3 column makes this distinction queryable without requiring users to memorize subsection codes.

§4947(a) trusts do not appear in 990 or 990-EZ in practice. Verified empirically against the full harmonized output (4.0M 990 rows across 25 tax_years, 2.8M 990-EZ rows across 27 tax_years): zero rows with subsection_cd == 92 in either form. The schema accommodates the code in case a future vintage introduces one, but is_501c3 == FALSE & form %in% c("990", "990ez") should not return §4947(a) trusts on any data currently in scope. If you need every §4947(a) trust regardless of form, query the 990pf series alone.

9.1 Per-output dictionary

9.2 Lookup tables for coded columns

9.3 NA semantics

9.4 Three “year” columns

9.5 Universal columns

9.6 Merged-panel columns

9.7 Natural primary key

9.8 tax_year is constant within a single output file

9.9 No top-coding or winsorization

9.10 is_501c3 is not always TRUE in 990-PF

9.8 `tax_year` is constant within a single output file

9.10 `is_501c3` is not always TRUE in 990-PF