10 Legacy BMF Harmonization

11 Overview

Legacy NCCS 501CX-NONPROFIT-PX BMF files (1989–2022) use NCCS-curated column names that differ from the current IRS BMF schema. The legacy pipeline harmonizes these files, runs the same transforms as the modern pipeline, and writes a slim per-file processed output that contains only columns whose underlying input was actually populated.

Legacy files are NCCS profile extracts, not raw IRS BMF — they include geographic enrichments and analytic flags but drop many raw IRS fields (filing requirements, affiliation, deductibility, ICO, STREET, activity, classification, status, organization type, asset/income codes, PF filing). 14 current-schema columns therefore have no legacy source at all and are excluded from the slim output.

12 Pipeline Differences

R/run_legacy_pipeline.R mirrors R/run_pipeline.R with these changes:

Phase	Modern	Legacy
1 Extraction	Download from `s3://nccsdata/raw/bmf/`	Download from `s3://nccsdata/legacy/bmf/` (set `LEGACY_BMF_YEAR`/`LEGACY_BMF_MONTH`) or read from a local path (`LEGACY_BMF_FILE`)
1.5 Harmonization	n/a	`harmonize_legacy_bmf()` — uppercase, drop, rename, NA-fill
2 Pre-validation	Requires all 31 `BMF_REQUIRED_COLUMNS`	Requires only 7 `BMF_LEGACY_MIN_COLUMNS` populated
3–9 Transforms	unchanged	unchanged (same code path; NA inputs propagate)
10 Intermediate parquet	full schema	full schema (audit/debug)
11 Processed CSV	full schema	slim schema (only populated cols)
S3 upload	`intermediate/bmf/YYYY_MM/`, `processed/bmf/YYYY_MM/`	`intermediate/bmf-legacy/YYYY_MM/`, `processed/bmf-legacy/YYYY_MM/`

13 Harmonization Steps

Implemented in harmonize_legacy_bmf(dt, crosswalk) in R/legacy_bmf_adapter.R:

Uppercase column names — handles case-aliasing across years (e.g., cZFiler vs czFiler, ZFiler vs ZFILER, accper vs ACCPER, RandNum vs randnum).
Reject unknown columns — every column in the input file must have a row in XWALK-BMF-V2.0.csv. Unknown columns trigger a hard stop. This is the safety net against silently leaving columns behind.
Drop columns marked disposition=drop — NCCS analytic flags, NTEE refinements, geographic enrichments, single-year columns.
Rename columns marked disposition=rename to their current_name target.
NA-fill missing required columns — any BMF_REQUIRED_COLUMNS member not produced by step 4 is added as an NA character column, so existing transforms run unmodified. The transforms propagate NA through their outputs.

The output is a data.table with exactly the 31 columns in BMF_REQUIRED_COLUMNS, in arbitrary order.

14 Crosswalk: `XWALK-BMF-V2.0.csv`

Built from two sources:

Scraped dictionaries — 73 of 88 listed 501CX-NONPROFIT-PX dictionary pages from the NCCS catalog (https://nccs.urban.org/nccs/catalogs/catalog-bmf.html). The other 15 are “data unavailable” placeholders. 47 distinct upper-cased columns observed across the 73 dictionaries.
Direct file inspection — the dictionaries undercount real file contents. The 1989-06 raw file has 42 columns vs. the 28 listed in its dictionary; the extras are accumulated NCCS reprocessing artifacts (e.g., Ntee2007a, Confidence2007a, nteeFinal, *Old1 suffixes). When a new legacy file surfaces unknown columns, add their dispositions to the crosswalk before re-running.

14.1 Schema

Column	Description
`legacy_name_upper`	Upper-cased legacy column name (key)
`legacy_case_variants`	All observed source casings (semicolon-separated)
`current_name`	Target column in current schema (blank if `drop`)
`disposition`	`rename` or `drop`
`subcategory`	`identity`, `alias`, `nccs-curated`, `nccs-flag`, `nccs-derived`, `ntee-refinement`, `geographic`, `single-year`, `other`
`n_dicts_observed`	How many of the 73 dictionaries contain this column
`year_month_first`, `year_month_last`	Year-month range observed
`legacy_label`, `legacy_dtype`, `legacy_length`, `legacy_description`	Metadata from the source dictionary
`notes`	Provenance/semantic notes

14.2 Renames

18 legacy columns map to current schema columns. Notable cases:

cTotRev → REVENUE_AMT — NCCS-curated total revenue from Form 990 (Part I, line 12). The only legacy source for revenue; provenance differs from the modern pipeline’s REVENUE_AMT (which is raw BMF). Flagged in the notes field.
zip5 → ZIP — current pipeline’s ZIP transform internally parses 5+4.
INCOME → INCOME_AMT — Gross receipts from Form 990, semantically aligned with INCOME_AMT.
Identity renames (EIN, NAME, CITY, STATE) — same name in both schemas.

14.3 Drops

59 legacy columns are dropped (out of 78 total crosswalk rows; 19 are renames). Categories:

NCCS NTEE refinements (NTEE1, MAJGRPB, ntmaj5/10/12, NTEEconf, LEVEL1-4, NTEEDRAFT, NTEE2006, NTEE2007A, NTEEFINAL, NTEEOLDB, etc.) — current pipeline derives these from NTEE_CD.
NCCS analytic flags (Filer, cFiler, ZFiler, cZFiler, cFinSrc, outnccs, OUTREAS, RandNum, ADDRESSCHANGED) — Core file derivations and per-snapshot diff metadata, not raw fields.
NCCS curated F990 derivations (cAssets, cTaxPer, ASSETS_OLD, INCOME_OLD) — redundant with the raw BMF aliases we already keep (ASSETS, TAXPER) or prior-snapshot history we don’t propagate.
Geographic enrichments (FIPS, MSA_NECH, PMSA, MSA, LATITUDE, LONGITUDE, fipsold, *Old1 historical values, CBSA, DIVCODE, GEOCD, GEOCDF, GEOCODE, GEOEXCLUDE) — current pipeline runs its own geocoding via R/run_geocoding.R.
NCCS-curated activity splits (ACTIV1, ACTIV2, ACTIV3) — per-vintage 3-way splits of the IRS 9-char ACTIVITY field; the current pipeline expects the unsplit form.
Single-year columns (ZIP 1989-only, ORGTYPCD, v_errorCd).
Other (NAICS — derived from NTEE in current pipeline; IRS990N, EPOST, FISYR_IMAGE — NCCS-curated 990-N and image metadata).

15 Current-Schema Columns With No Legacy Source

These 14 columns will always be NA in legacy intermediate output and are dropped from legacy processed output:

ICO, STREET, AFFILIATION, CLASSIFICATION, DEDUCTIBILITY, ACTIVITY, ORGANIZATION (1989 file is the exception via ORGCD), STATUS, ASSET_CD, INCOME_CD, PF_FILING_REQ_CD, REGION, RYEAR, ID.

Several legacy aliases only appear in some years (e.g., TAXPER is 2006-11+, GEN is 2010-08+, FRCD is 2002-01+, SEC_NAME is 2002-07+). Files from earlier years will have additional columns that are NA-filled.

16 Pre-2003 5-character NTEE Codes

Legacy BMF files from roughly 1995–2003 contain ~5 % of NTEE codes in NCCS’s pre-NTEE-CC 5-character form (letter + 4 digits, e.g. A0120, S0241, B1170). The structure is:

Position 1: NTEE major-group letter (A–Z)
Positions 2-3: NTEEv2 organization-type encoding (01=AA, 02=MT, 03=PA, 05=RP, 11=MS, 12=MM, 19=NS, default RG)
Positions 4-5: decile-subdecile

Modern NTEE-CC is at most 4 chars, so the validator in R/transform_ntee_code.R would normally hard-stop on these. In legacy mode (legacy_mode = TRUE, set automatically by run_legacy_pipeline.R):

The 5-char codes pass length validation with a warning rather than an error.
The vendored crosswalk data/lookup/ntee_legacy_5char_lookup.csv (1,597 rows; 942 of length 5) maps codes like A0120 → ART-A20-AA. Matched rows override nteev2_subsector, nteev2_code, and nteev2_org_type directly.
Unmatched 5-char codes (~5 % long-tail) fall back to formulaic derivation: nteev2_code = paste0(major_group, substr(raw, 4, 5)) (e.g. A1120 → A20). Subsector and org_type are still resolved correctly by the standard fcase logic on .first / .char23.

The crosswalk is vendored from Nonprofit-Open-Data-Collective/mission-taxonomies. scripts/check_ntee_legacy_coverage.R reports per-vintage match percentages against any locally-saved Phase 4 checkpoints.

17 Phase 11 Output Schema

compute_legacy_output_columns(populated_raw_columns) in R/legacy_bmf_adapter.R returns the per-file keep-list. The mapping from each current-schema raw column to its transform output columns is encoded in RAW_TO_OUTPUT_MAP. Output columns derived from multiple inputs (org_addr_full, org_addr_is_missing) are kept only when all their inputs were populated (encoded in DERIVED_OUTPUT_COLUMNS).

This means a 1989-06 file with no STREET will:

Keep org_addr_city*, org_addr_state*, org_addr_zip* (city, state, zip were populated)
Drop org_addr_street*, org_addr_full, org_addr_is_missing (depend on STREET)

18 Output Files

For a legacy file BMF-YYYY-MM-501CX-NONPROFIT-PX.csv:

data/intermediate/bmf_legacy_YYYY_MM_intermediate.parquet — full ~110-column schema, includes NA-filled columns. Use for audit and debugging.
data/processed/bmf_legacy_YYYY_MM_processed.csv — slim schema, varies per file based on populated inputs.
data/processed/bmf_legacy_YYYY_MM_data_dictionary.csv — column metadata for the slim schema.
data/quality/bmf_legacy_YYYY_MM_harmonization_report.json — per-file audit: input column count, drops, renames, NA-fills.
data/quality/bmf_legacy_YYYY_MM_quality_report.json — same shape as modern pipeline quality reports.
docs/quality-reports/bmf_legacy_YYYY_MM_quality_report.html — rendered quality report.

19 Known-bad upstream vintages

Three NCCS-published legacy files have a structurally different schema that the harmonization pipeline cannot consume:

BMF-2017-09-501CX-NONPROFIT-PX.csv
BMF-2017-12-501CX-NONPROFIT-PX.csv
BMF-2018-12-501CX-NONPROFIT-PX.csv

These three files contain sequence-ID values in the EIN column (e.g. 000000001 for VOLUNTEERS OF AMERICA INC) instead of real 9-digit IRS EINs, and a non-standard 8-character TAXPER encoding (e.g. 20001050) that doesn’t decode to any valid YYYYMM format. The 2019 and 2020+ vintages produced by NCCS use the standard schema again, so the issue appears scoped to those three months.

scripts/run_all_legacy.sh skips them by default via SKIP_VINTAGES="2017-09,2017-12,2018-12". Override with SKIP_VINTAGES="" if upstream NCCS files have been corrected.

20 Adding New Legacy Files

Upload the CSV to s3://nccsdata/legacy/bmf/ (preferred) or drop it in data/raw/legacy/.

Run:

# From S3:
LEGACY_BMF_YEAR  <- 2013
LEGACY_BMF_MONTH <- 7
source("R/run_legacy_pipeline.R")

# Or from a local path:
LEGACY_BMF_FILE <- "data/raw/legacy/<filename>.csv"
source("R/run_legacy_pipeline.R")

If pre-harmonization fails with “Legacy BMF contains N column(s) not present in XWALK-BMF-V2.0.csv”:
- Add a row to data/crosswalks/XWALK-BMF-V2.0.csv for each unknown column (legacy_name_upper, legacy_case_variants, disposition, subcategory, etc.).
- Set n_dicts_observed=0 and leave year_month_* blank to flag “discovered in raw file, not in dictionary.”
- Re-run.

21 Implementation Notes

paste0(names(dt)) not names(dt) — data.table::names(dt) returns a reference that mutates with setnames/:=NULL operations. The harmonization adapter uses paste0() to force a fresh character vector for the audit trail. See R/legacy_bmf_adapter.R.
validate_join_success() zero-row guard — R/input_validation.R short-circuits when the dim table is empty (e.g., all-NA CLASSIFICATION input → empty dim_classification). Without this, the empty-table case produces NaN that breaks the threshold comparison.
HTML rendering is non-fatal — wrapped in tryCatch so a missing quarto R package or render failure doesn’t abort the pipeline. The JSON quality report is already saved by that point.