10  Legacy Harmonization

This chapter covers the legacy CORE pipeline for pre-2012 NCCS files. It is currently planned, not built — the SOI-current pipeline (R/run_pipeline.R) is what actually runs today. The eventual entry point will be R/run_legacy_pipeline.R. This chapter documents the scope, the source-data shape, and the design decisions already locked in so that the build, when it happens, doesn’t relitigate them.

10.1 Why a separate pipeline

The pre-2012 NCCS files have a different structural shape from the IRS SOI extracts:

  • Different partition scheme. Legacy files split organizations into 501C3 (501(c)(3) only) and 501CE (everything else), with parallel *-PZ (Form 990, with 990-EZ filings interleaved) and *-PF (private foundation 990-PF) variants. The SOI-current pipeline uses subsection_cd and is_501c3 as data columns rather than partition keys, so partitioning has to be unwound at ingest.
  • Different schema. Legacy column inventory (per the 2026-05-08 audit) ranges from 104 to 276 columns per file depending on vintage, with significant schema expansions in 1997 and 2008. Column names are heavily abbreviated NCCS legacy strings (e.g. ACCNTSPAYABLEEND, EXGRROOT, LBEBOYEstimate) rather than IRS SOI names. The crosswalk surface is therefore disjoint from the SOI crosswalks; reusing the existing FINAL crosswalks won’t work.
  • No 990-EZ source. The IRS did not separately publish a 990-EZ extract before 2012. Pre-2012 990-EZ filings exist in the NCCS PZ data interleaved with 990 filings, identifiable only by the EZ form-type indicator in a metadata column. Section 10.5 records the planned handling: they stay inside the 990combined series rather than forming a series of their own.

The two pipelines write to the same (tax_year, form) output grain so analysts can read across them, but they cannot share an orchestrator or a crosswalk.

10.2 Source-data inventory

The audit captured in the project_legacy_inventory memory entry found:

| Series | Years available | Source | Note |
|---|---|---|---|
| PZ (501c3 + 501ce) | 1989–2019 | s3://nccsdata/legacy/core/ | 2012+ are NCCS+SOI hybrids; skip |
| PF | 1989–2015, 2019 | s3://nccsdata/legacy/core/ | 2016/2017/2018 missing from bucket |
| PC | 2012–2019 only | s3://nccsdata/legacy/core/ | PC scope didn’t exist pre-2012; out of legacy scope |

Column-count history confirms the schema instability:

| Era | PZ cols | PF cols | What changed |
|---|---|---|---|
| 1989–1996 | 104–117 | 105–121 | Compact original schema |
| 1997 | ~164–167 | | Major schema expansion |
| 1998–2011 | 140–276 (variable) | ~170–223 | Year-to-year drift; needs per-vintage crosswalk |
| 2012+ | 152–168 (PZ) | ~170 (PF) | NCCS + SOI hybrid; handled by SOI-current pipeline |

The 2012+ files in the bucket are not raw legacy data — they’re NCCS metadata merged with SOI-derived financial columns. The SOI side of that merge is reproducible from irs.gov/pub/irs-soi/, so the legacy pipeline scope strictly stops at 2011.

10.3 Scope decisions (locked in)

  • In scope: raw legacy *-PZ (union of 501C3 + 501CE) and *-PF files, 1989–2011 only.
  • Out of scope: the 2012+ NCCS+SOI hybrid files in the same bucket. Those years are covered by the SOI-current pipeline, which goes back to the IRS source instead of the hybrid intermediary.
  • Unrecoverable: standalone 990-EZ panel pre-2012. The IRS never published one; the EZ rows that exist sit inside the PZ files.
  • Earliest year: 1989. Pre-1989 panels exist for some subsets but are not in s3://nccsdata/legacy/core/. Sourcing them would be a separate project; 1989+ already gives 35+ years of coverage. Defer indefinitely.

10.4 Schema-instability decisions (locked in)

The 1997 expansion and the 1998–2011 column drift both produce vintage-padded columns in the harmonized output: columns that exist in only a subset of source files and are NA in the rest. Two policy decisions govern how the pipeline handles this.

10.4.1 1997 schema expansion: NA-pad pre-1997

Columns introduced in 1997 (which roughly doubled PZ schema width from ~117 to ~165 columns) are NA-padded for 1989–1996 rows in the harmonized output, rather than dropped from those years’ files.

Rationale. This matches the SOI-current pipeline’s vintage-padding convention (see 08-output-schema.qmd → “Vintage-padded columns are 100% NA in pre-vintage tax-year files”). A single unified schema across all years lets analysts stack legacy outputs without per-year column reconciliation, and the NA pattern is itself a useful signal — n_nonnull in the dictionary will be zero for pre-1997 rows on those columns, surfacing the cutover automatically.

Trade-off. Pre-1997 files carry ~50 extra all-NA columns each. Storage cost is small (NA serializes cheaply in CSV); analyst friction is lower than the partition-drop alternative, which would require year-aware joins.
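
As a minimal sketch of the NA-padding behavior (toy tables and illustrative column names such as new_in_1997, not the real legacy schema), stacking a pre-1997 and a 1997 table with fill = TRUE yields one schema in which the newer column is all-NA for the older year:

```r
library(data.table)

# Toy stand-ins for harmonized per-year tables; column names are illustrative only.
pz_1996 <- data.table(tax_year = 1996L,
                      ein = c("010000001", "010000002"),
                      total_revenue = c(100, 250))
pz_1997 <- data.table(tax_year = 1997L,
                      ein = "010000003",
                      total_revenue = 75,
                      new_in_1997 = 12)  # hypothetical column introduced in 1997

# fill = TRUE pads columns that a vintage lacks with NA instead of failing.
panel <- rbindlist(list(pz_1996, pz_1997), use.names = TRUE, fill = TRUE)

# Per-year non-null counts make the cutover visible: zero before 1997.
cutover <- panel[, .(n_nonnull = sum(!is.na(new_in_1997))), by = tax_year]
print(cutover)
```

The per-year count is the same signal the dictionary's n_nonnull is described as carrying; how the real dictionary computes it is not shown here.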

10.4.2 1998–2011 column drift: opt-in threshold ≥3 tax years

The BASELINE crosswalk builder includes a legacy source column in the harmonized schema only if it appears in source files for at least 3 distinct tax years within the 1998–2011 window. Columns appearing in only 1 or 2 vintages are dropped at the crosswalk stage — they would otherwise produce harmonized columns that are 95–99% NA and offer no longitudinal signal.

Rationale. A column that appears in only one year is overwhelmingly noise: a vintage-specific NCCS field, an experimental schema addition that didn’t stick, or a typo that the crosswalk-auditor missed. A column that appears in 3+ years is plausibly a real schema element worth preserving. The threshold is conservative (filters out roughly the noisiest 5–10% of legacy columns) without being so high that it cuts genuinely sparse but real fields.

Trade-off. Some real-but-short-lived schema elements (e.g. a field present only in 2007 and 2008 before being dropped) get cut. Affected columns can be opted back in via the OVERRIDES crosswalk if a specific analysis needs them — same escape valve as on the SOI-current side.

Threshold reference. “3 tax years” means 3 distinct values of tax_year derived from the source filename, irrespective of how many 501C3/501CE/PF variants of that year contain the column. The threshold is hard-coded in the BASELINE-builder script and may be tunable later if downstream needs surface.
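
The threshold rule can be sketched as follows. This is an illustration only: the header inventory layout, the filenames, and the min_years variable are assumptions, not the BASELINE-builder's actual code.

```r
library(data.table)

# Hypothetical header inventory: one row per (source file, column name).
headers <- data.table(
  file = c("501C3-1998-PZ.csv", "501CE-1998-PZ.csv", "501C3-1999-PZ.csv",
           "501C3-2000-PZ.csv", "501C3-2007-PZ.csv", "501C3-2008-PZ.csv"),
  column = c("TOTREV", "TOTREV", "TOTREV",
             "TOTREV", "SHORTLIVED", "SHORTLIVED")
)

# Derive tax_year from the first four-digit year in the filename.
headers[, tax_year := as.integer(
  regmatches(file, regexpr("(19|20)[0-9]{2}", file))
)]

min_years <- 3L  # assumed name for the hard-coded threshold

# Count DISTINCT tax years per column, so the two 1998 variants count once.
counts <- headers[tax_year >= 1998L & tax_year <= 2011L,
                  .(n_years = uniqueN(tax_year)),
                  by = column]

keep <- counts[n_years >= min_years, column]
print(keep)
```

Here TOTREV appears in three distinct years and is kept, while SHORTLIVED (2007 and 2008 only) falls below the threshold and would need an OVERRIDES entry to come back.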

10.4.3 Schema parity with SOI-current: drop BMF-origin columns

Legacy CORE files carry NCCS-appended BMF metadata columns (NCCSKEY, EOSTATUS, NTEE*, ACCPER, INC_CODE, FNDCD, AFFILCD, …) that the SOI-current CORE files do not carry. These columns are dropped from the legacy harmonized output.

Rationale. Two reasons:

  1. Cross-pipeline parity. If legacy output carried BMF columns and SOI-current didn’t, analysts stacking the panels would see a column footprint that depends on tax year — exactly the friction harmonization is meant to eliminate.
  2. Freshness. BMF columns frozen into a 1995 CORE file are a 1995 snapshot. Analysts who want EOSTATUS for a 1995 row are better served joining against the current BMF master, which has lineage information the frozen snapshot lacks.

Mechanism. Columns missing a description in the CORE dictionary (i.e., not part of the CORE financial-data schema, only present as appended BMF metadata) are pre-marked in the OVERRIDES crosswalk with harmonized_name = "". The harmonize step skips empty harmonized names. The row is retained in OVERRIDES so the decision is auditable and reversible — a specific column can be re-added by setting a real name.
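
A small sketch of the skip-on-empty-name mechanism, assuming an OVERRIDES table with a source_name column (an assumed name) alongside harmonized_name; the rows and the total_revenue target are illustrative:

```r
library(data.table)

# Hypothetical OVERRIDES rows; BMF-origin columns carry harmonized_name = "".
overrides <- data.table(
  source_name     = c("EIN", "TOTREV", "NTEE1", "EOSTATUS"),
  harmonized_name = c("ein", "total_revenue", "", "")
)

raw <- data.table(EIN = "010000001", TOTREV = 100,
                  NTEE1 = "B", EOSTATUS = "01")

# Rows with an empty name stay in `overrides` (auditable) but are not applied.
active <- overrides[!is.na(harmonized_name) & nzchar(harmonized_name)]

harmonized <- raw[, .SD, .SDcols = active$source_name]
setnames(harmonized, active$source_name, active$harmonized_name)
print(names(harmonized))
```

Re-adding a dropped column is then a one-cell change: give its OVERRIDES row a non-empty harmonized_name.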

How to get BMF context downstream. Analysts join the harmonized legacy output against the BMF master (s3://nccsdata/geocoding/bmf-master/merged/bmf_master_geocoded.parquet, refreshed yearly). Primary join key: ein. Secondary key for subsection-level filtering: CORE subsection_cd ↔︎ BMF subsection_code. The two pipelines maintain parallel copies of data/lookups/subsection_codes.csv, both derived from IRS IRM 25.7.1 Exhibit 25.7.1-4; cross-check periodically.
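
The downstream join might look like the sketch below. Toy tables stand in for the harmonized legacy output and the BMF master; the ntee_code column name is an assumption, and reading the real parquet file is left out.

```r
library(data.table)

# Toy stand-in for harmonized legacy output.
legacy <- data.table(ein = c("010000001", "010000002"),
                     tax_year = 1995L,
                     subsection_cd = c(3L, 4L))

# Toy stand-in for the BMF master; ntee_code is a hypothetical column name.
bmf <- data.table(ein = c("010000001", "010000003"),
                  subsection_code = c(3L, 6L),
                  ntee_code = c("B20", "A10"))

# Left join on ein keeps every legacy row; unmatched rows get NA BMF fields.
joined <- merge(legacy, bmf, by = "ein", all.x = TRUE)

# Secondary check: does CORE subsection_cd agree with BMF subsection_code?
joined[, subsection_matches := subsection_cd == subsection_code]
print(joined)
```

Rows whose ein is absent from the current BMF keep NA in the BMF columns, which is worth counting before any subsection-level filtering.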

Trade-off. Legacy rows lose direct access to NCCS-historical BMF snapshots (the value of EOSTATUS as recorded in the 1995 NCCS CORE build, specifically). For most analyses this is fine — current BMF is more accurate. For longitudinal studies of organizational status changes it matters, but those analyses should pull from a versioned BMF history rather than CORE-side frozen snapshots anyway.

10.5 Series mapping (planned)

| Series | Pre-2012 source | 2012+ source |
|---|---|---|
| 990 | (none; pre-2012 990 rows live inside legacy PZ along with 990-EZ rows) | IRS eoextract990 (SOI-current) |
| 990ez | (none as standalone) | IRS eoextract990EZ (SOI-current) |
| 990pf | raw legacy *-PF | IRS eoextract990PF (SOI-current) |
| 990combined | raw legacy *-PZ (501C3 ∪ 501CE) | derived from 990 + 990-EZ on shared 53 cols |

The legacy pipeline does not attempt to demultiplex pre-2012 PZ rows into separate 990 and 990ez series, because there isn’t enough signal in the data to do so reliably across vintages. Instead, all pre-2012 PZ filings land in 990combined regardless of whether they were filed on Form 990 or 990-EZ.

10.6 Schema width: why legacy and SOI-current 990combined differ

A naive column-count comparison — legacy 990combined ≈ 180 cols vs SOI-current 990combined 53 cols — looks like a yawning gap. It is mostly an artifact of three orthogonal factors, only one of which is “NCCS derivations.” Understanding them matters for any merge / dedup decision because some legacy-only columns are real form fields with no SOI-current analog, while others are reasonably-droppable noise.

| Factor | Approximate cols | Nature |
|---|---|---|
| SOI-current 990combined is an intersect of SOI 990 + SOI 990-EZ | ~67 cols would be retained if we compared against SOI 990 (full, ~255 cols) instead of the 53-col intersect | Pipeline-design choice, not a data limitation |
| Form redesign in 2008: Schedule A Parts IV/VI (public-support tests, lobbying detail) and pre-2008 line items present in legacy but consolidated / removed post-2008 | ~25-40 cols | Genuinely upstream; IRS changed the form |
| NCCS derivations: *Estimate cols, NCCS-imputed values for missing fields | ~5-10 cols | NCCS curation choices |

Two consequences worth pinning down:

  1. Most of the legacy “extra” columns are real form fields, not derivation noise. A column-level merge that preserves legacy-only cols is preserving genuine pre-2008 Form 990 detail that has no post-2008 analog. Dropping them on dedup loses analytical fidelity for any longitudinal study spanning the 2008 redesign.

  2. The 53-col narrowness of SOI-current 990combined is partly downstream of the legacy pipeline’s own design constraint. Legacy can’t demultiplex pre-2012 PZ rows into separate 990 vs 990-EZ series (no FORMTYPE column at row level in early vintages), so SOI-current’s 990combined was defined as the intersect of 990 + 990-EZ to give analysts a panel that stacks cleanly across the 2011 / 2012 boundary. If legacy could demux, SOI-current would publish 990 and 990-EZ separately and the apparent width gap would close.

This framing should guide the legacy / SOI-current merge: legacy is wider than SOI 990combined for reasons that don’t reduce to “NCCS curation noise,” so a row-level dedup that drops the legacy view on overlap rows is throwing away real form-evolution detail. A column-level merge with a SOI-precedence rule for shared columns preserves both provenance and width.

10.7 subsection as column, not partition

Legacy files use a filename-level split (501C3-* versus 501CE-* within the *-PZ series) to separate 501(c)(3) filings from everything else. The harmonized output drops that filename-level distinction and keeps subsection_cd + is_501c3 as ordinary columns. At ingest time the legacy pipeline will:

  1. Read both 501C3-* and 501CE-* files for the year.
  2. Union them with data.table::rbindlist(..., fill = TRUE, use.names = TRUE).
  3. Map the source SUBSECCD (or analog) into the harmonized subsection_cd integer column.
  4. Derive is_501c3 := subsection_cd == 3L as a strict boolean (matching SOI-current semantics).

The result is a single panel per tax_year, queryable by subsection without needing to know which legacy file the row originated from.
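
The four ingest steps can be sketched on toy data as below. The column values are made up, EXTRA is a hypothetical column present in only one file, and treating an NA subsection as FALSE is an assumption about what "strict boolean" means.

```r
library(data.table)

# Step 1: toy stand-ins for the year's 501C3-* and 501CE-* files.
c3 <- data.table(EIN = c("010000001", "010000002"),
                 SUBSECCD = c("03", "03"),
                 TOTREV = c(100, 250))
ce <- data.table(EIN = "010000003",
                 SUBSECCD = "04",
                 TOTREV = 75,
                 EXTRA = 1)  # hypothetical column present only in 501CE

# Step 2: union by column name, NA-padding columns one side lacks.
pz <- rbindlist(list(c3, ce), use.names = TRUE, fill = TRUE)

# Step 3: map the source code into the harmonized integer column.
pz[, subsection_cd := as.integer(SUBSECCD)]

# Step 4: never-NA boolean (assumption: an NA subsection counts as FALSE).
pz[, is_501c3 := !is.na(subsection_cd) & subsection_cd == 3L]
print(pz)
```

After the union, nothing in the table records which file a row came from; subsection_cd and is_501c3 carry that information instead.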

10.8 Planned design (mirror BMF’s legacy adapter pattern)

The pipeline will follow the same architecture as the SOI-current side:

  • Per-field pure transforms under R/transforms/ — most are already implemented and reusable as-is. transform_ein, transform_subsection_cd, transform_financial_amounts, transform_indicators apply unchanged. transform_tax_period and transform_efile_indicator may need legacy-specific synonyms added (legacy column names differ).
  • Per-form crosswalks under data/crosswalks/ with the same BASELINE / OVERRIDES / FINAL split as the SOI-current files. Names: legacy_pz_crosswalk_*.csv and legacy_pf_crosswalk_*.csv. The BASELINE will be drafted from a regenerated data/raw/legacy_inventory/headers_by_file.tsv once the inventory script is rewritten (it was deleted in the 2026-05-12 cleanup; the conclusions are captured in memory).
  • Quality framework under R/quality/. run_pre_checks_one() and the check_* validators in post_checks.R are pipeline-agnostic and work as-is against legacy outputs. The YoY tripwire (check_row_count_vs_prior) also works without modification.
  • Orchestrator as R/run_legacy_pipeline.R, modeled after R/run_pipeline.R with the same phased structure: download/inventory → unpack → pre-checks → harmonize → quality → dictionary → render → upload. No derive_combined phase needed because 990combined IS the legacy pipeline’s primary output, not a derivation from it.

The reusable transforms and quality validators are the main payoff of having built the SOI-current pipeline first — they collapse what could have been a long second build into mostly a crosswalk-authoring exercise.

10.9 Output integration with SOI-current

The two pipelines write to the same (tax_year, form) output grain. Where they overlap, SOI wins — the IRS extracts are the authoritative source for 2012+ and have been since the SOI EO program launched.

The “clean cutover at 2011/2012” framing from the planning phase turned out to be wrong in practice: both pipelines partition by the first 4 chars of TAXPER (per CLAUDE.md), not by publication year. The 2011 NCCS legacy file contains late filers with TAXPER in 2012, and SOI 2012+ extracts contain late filers with TAXPER going back into the 1990s. Real (ein, tax_period) overlap exists across the boundary in both directions. The merge phase below was built to handle this.
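
A toy illustration of why the boundary leaks, assuming TAXPER is a YYYYMM string (the rows are invented): partitioning a single 2011-vintage file by the first four characters of TAXPER produces more than one tax_year.

```r
library(data.table)

# Invented rows from one 2011-vintage legacy file, including a late filer.
legacy_2011_file <- data.table(ein = c("010000001", "010000002"),
                               TAXPER = c("201112", "201206"))

# Partition key is the first 4 characters of TAXPER, not the file vintage.
legacy_2011_file[, tax_year := as.integer(substr(TAXPER, 1, 4))]

by_year <- split(legacy_2011_file, by = "tax_year")
print(names(by_year))
```

The second row lands in the 2012 partition even though it came from the 2011 file, which is exactly the overlap the merge phase has to resolve.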

Pre-2012 990-PF output from the legacy pipeline fills the gap below the SOI-current 990pf series. The combined panel runs 1989–present.

10.10 Merge phase (Option D, shipped 2026-05-15)

R/04_legacy_merge.R + R/run_build_panel.R produce a third tree, data/intermediate/harmonized_merged/, by column-level merging of harmonized_legacy/ (legacy pipeline) and harmonized/ (SOI-current pipeline) per (tax_year, form). Output forms are 990combined and 990pf — the two forms the legacy crosswalks target.

Rule per overlap row (matched on (ein, tax_period)):

  • Shared harmonized columns: SOI value wins where non-NA; legacy fills where SOI is NA. NA defers rather than overwrites.
  • Legacy-only columns: legacy value (NA on SOI-only rows).
  • SOI-only columns: SOI value (NA on legacy-only rows).

Row tags: source_pipeline ∈ {legacy, soi_current} (primary origin) + has_legacy_augment: bool (TRUE iff overlap row).

Disagreement audit: per (year, form) CSV at data/logs/merge_disagreements_{year}_{form}.csv capturing every shared-column row where both sides are non-NA but values disagree. The audit is the only place the legacy value survives an overwrite — analysts who need it can join the audit back onto the merged panel.
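
The merge rule, row tags, and disagreement audit can be sketched together on toy data. This is not the code in R/04_legacy_merge.R: the helper columns in_soi / in_legacy, the audit layout, and tagging overlap rows as soi_current are assumptions made for the illustration.

```r
library(data.table)

legacy <- data.table(ein = c("1", "2"), tax_period = c("201112", "201112"),
                     program_service_revenue = c(500, 40),
                     total_revenue = c(900, 80),
                     legacy_only_col = c(7, 8))
soi <- data.table(ein = c("2", "3"), tax_period = c("201112", "201212"),
                  program_service_revenue = c(0, 10),
                  total_revenue = c(NA, 20),
                  soi_only_col = c("a", "b"))

keys <- c("ein", "tax_period")
shared <- setdiff(intersect(names(legacy), names(soi)), keys)

# Presence flags (hypothetical helper columns) to tag row origin after the join.
soi_f <- copy(soi)[, in_soi := TRUE]
legacy_f <- copy(legacy)[, in_legacy := TRUE]
m <- merge(soi_f, legacy_f, by = keys, all = TRUE,
           suffixes = c(".soi", ".legacy"))

m[, source_pipeline := fifelse(!is.na(in_soi), "soi_current", "legacy")]
m[, has_legacy_augment := !is.na(in_soi) & !is.na(in_legacy)]

# Audit: one row per (record, shared column) where both sides are non-NA and differ.
audit <- rbindlist(lapply(shared, function(col) {
  s <- m[[paste0(col, ".soi")]]
  l <- m[[paste0(col, ".legacy")]]
  bad <- which(!is.na(s) & !is.na(l) & s != l)
  data.table(ein = m$ein[bad], tax_period = m$tax_period[bad],
             column = rep(col, length(bad)),
             soi_value = as.character(s[bad]),
             legacy_value = as.character(l[bad]))
}))

# SOI wins where non-NA; legacy fills where SOI is NA.
for (col in shared) {
  set(m, j = col, value = fcoalesce(m[[paste0(col, ".soi")]],
                                    m[[paste0(col, ".legacy")]]))
  m[, c(paste0(col, ".soi"), paste0(col, ".legacy")) := NULL]
}
m[, c("in_soi", "in_legacy") := NULL]
print(m)
print(audit)
```

For the overlap row (ein "2"), the merged program_service_revenue is the SOI 0 and the legacy 40 survives only in the audit, while total_revenue is filled from legacy because the SOI side is NA.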

10.10.1 As-filed vs. imputed semantics (important caveat)

The merged panel reports as-filed IRS values for shared columns where both pipelines populate them. This is mechanical (SOI-precedence applied uniformly), but the consequence is worth knowing: where NCCS’s legacy CORE files carried imputed or derived values that the IRS extract reports as 0 or blank, the merged panel takes the IRS value.

The clearest example is program_service_revenue for 990EZ filers. In 2011, ~66% of SOI 990EZ rows have program_service_revenue = $0 because filers leave Part I Line 2 blank and aggregate revenue elsewhere on the form. NCCS legacy may have backfilled a value for those rows from other lines. The merged 2011 panel reports the SOI $0. The legacy non-zero value is still recoverable via the disagreement audit (data/logs/merge_disagreements_2011_990combined.csv).

The same applies to categorical columns like subsection_cd: where legacy and SOI disagree on subsection classification (legacy=4 → SOI=3 is the most common pattern in 2011, with 893 rows), the merged panel takes SOI. This is generally correct because SOI reflects what the org wrote on its return header, while legacy may carry an older BMF-snapshot classification.

Crosswalk consistency was verified during this design call: PROGREV (legacy), totprgmrevnue (SOI 990), and prgmservrev (SOI 990EZ) all correctly map to harmonized program_service_revenue from semantically equivalent IRS lines. The disagreements are real cross-source data divergence, not a crosswalk bug. See the 2026-05-15 pause memo for the investigation trail.

10.10.2 Why standalone orchestrator

R/run_build_panel.R is its own entry point rather than a tail step of either pipeline because it depends on outputs from both. Wiring it into one would force a specific run order to fire it. Cost: a single extra Rscript invocation. Benefit: rebuilding the merged panel after a single-side change is one command, and the precondition (both harmonized trees exist) is explicit.

10.10.3 Upload target

Merged-panel output writes to data/processed_merged/ locally and s3://nccsdata/processed_merged/core/ on upload via R/08_upload.R::run_upload_merged() (wired into R/run_build_panel.R as phase 8). New prefix, not a replacement for processed/ — this keeps the SOI-current processed/ tier untouched during the merged-panel bake-in period and gives analysts an explicit choice. HTML quality reports for the merged tier are git-tracked under docs/quality-reports/merged/ and served via GitHub Pages, mirroring the SOI-current docs/quality-reports/ flow.

10.10.4 Per-pipeline logs to avoid RDS clobbering

Phase 5 (quality) names its RDS files quality_{form}_{tax_year}.rds with no pipeline tag. SOI-current, legacy, and merge all produce a 990combined/2011 quality report, so writing all three to a shared data/logs/ would clobber. Each orchestrator therefore passes its own logs_dir to run_quality() (and matching logs_dir to run_render_reports()):

| Orchestrator | logs_dir | reports_root |
|---|---|---|
| R/run_pipeline.R (SOI-current) | data/logs/ | docs/quality-reports/ |
| R/run_legacy_pipeline.R | data/logs/legacy/ | docs/quality-reports/legacy/ |
| R/run_build_panel.R (merge) | data/logs/merged/ | docs/quality-reports/merged/ |

The legacy + merged subdirs ride along automatically with the existing recursive aws s3 sync data/logs/ in phase 8 of the SOI-current pipeline.
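
To see why separate logs_dir values are enough, the sketch below builds the RDS path for the same (form, tax_year) under each orchestrator's directory. rds_path() is a hypothetical helper written for this illustration, not a function in the pipeline; only the quality_{form}_{tax_year}.rds naming comes from the text above.

```r
# Per-orchestrator log directories, as listed in the table above.
logs_dirs <- c(soi_current = "data/logs",
               legacy      = "data/logs/legacy",
               merged      = "data/logs/merged")

# Hypothetical helper mirroring the quality_{form}_{tax_year}.rds naming.
rds_path <- function(logs_dir, form, tax_year) {
  file.path(logs_dir, sprintf("quality_%s_%d.rds", form, tax_year))
}

paths <- vapply(logs_dirs, rds_path, character(1),
                form = "990combined", tax_year = 2011L)
print(paths)

# Same file name in all three cases, but three distinct full paths.
stopifnot(anyDuplicated(paths) == 0L)
```

The file name is identical across pipelines by design; only the directory disambiguates, so any new orchestrator needs its own logs_dir as well.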

10.11 Cross-references

  • Reusable transforms: docs/03-transforms-reference.qmd.
  • Crosswalk workflow: docs/04-crosswalks.qmd.
  • Quality validators that apply across both pipelines: docs/05-quality-gates.qmd.
  • Architecture diagram showing both pipelines: docs/01-architecture.qmd.
  • Bucket inventory and column-count details: project_legacy_inventory memory entry; the local data/raw/legacy_inventory/ working files were deleted in the 2026-05-12 cleanup since the conclusions are in memory.