10  Legacy BMF Harmonization

11 Overview

Legacy NCCS 501CX-NONPROFIT-PX BMF files (1989–2022) use NCCS-curated column names that differ from the current IRS BMF schema. The legacy pipeline harmonizes these files, runs the same transforms as the modern pipeline, and writes a slim per-file processed output that contains only columns whose underlying input was actually populated.

Legacy files are NCCS profile extracts, not raw IRS BMF — they include geographic enrichments and analytic flags but drop many raw IRS fields (filing requirements, affiliation, deductibility, ICO, STREET, activity, classification, status, organization type, asset/income codes, PF filing). 14 current-schema columns therefore have no legacy source at all and are excluded from the slim output.

12 Pipeline Differences

R/run_legacy_pipeline.R mirrors R/run_pipeline.R with these changes:

Phase Modern Legacy
1 Extraction Download from s3://nccsdata/raw/bmf/ Download from s3://nccsdata/legacy/bmf/ (set LEGACY_BMF_YEAR/LEGACY_BMF_MONTH) or read from a local path (LEGACY_BMF_FILE)
1.5 Harmonization n/a harmonize_legacy_bmf() — uppercase, drop, rename, NA-fill
2 Pre-validation Requires all 31 BMF_REQUIRED_COLUMNS Requires only 7 BMF_LEGACY_MIN_COLUMNS populated
3–9 Transforms unchanged unchanged (same code path; NA inputs propagate)
10 Intermediate parquet full schema full schema (audit/debug)
11 Processed CSV full schema slim schema (only populated cols)
S3 upload intermediate/bmf/YYYY_MM/, processed/bmf/YYYY_MM/ intermediate/bmf-legacy/YYYY_MM/, processed/bmf-legacy/YYYY_MM/

13 Harmonization Steps

Implemented in harmonize_legacy_bmf(dt, crosswalk) in R/legacy_bmf_adapter.R:

  1. Uppercase column names — handles case-aliasing across years (e.g., cZFiler vs czFiler, ZFiler vs ZFILER, accper vs ACCPER, RandNum vs randnum).
  2. Reject unknown columns — every column in the input file must have a row in XWALK-BMF-V2.0.csv. Unknown columns trigger a hard stop. This is the safety net against silently leaving columns behind.
  3. Drop columns marked disposition=drop — NCCS analytic flags, NTEE refinements, geographic enrichments, single-year columns.
  4. Rename columns marked disposition=rename to their current_name target.
  5. NA-fill missing required columns — any BMF_REQUIRED_COLUMNS member not produced by step 4 is added as an NA character column, so existing transforms run unmodified. The transforms propagate NA through their outputs.

The output is a data.table with exactly the 31 columns in BMF_REQUIRED_COLUMNS, in arbitrary order.

14 Crosswalk: XWALK-BMF-V2.0.csv

Built from two sources:

  1. Scraped dictionaries — 73 of 88 listed 501CX-NONPROFIT-PX dictionary pages from the NCCS catalog (https://nccs.urban.org/nccs/catalogs/catalog-bmf.html). The other 15 are “data unavailable” placeholders. 47 distinct upper-cased columns observed across the 73 dictionaries.

  2. Direct file inspection — the dictionaries undercount real file contents. The 1989-06 raw file has 42 columns vs. the 28 listed in its dictionary; the extras are accumulated NCCS reprocessing artifacts (e.g., Ntee2007a, Confidence2007a, nteeFinal, *Old1 suffixes). When a new legacy file surfaces unknown columns, add their dispositions to the crosswalk before re-running.

14.1 Schema

Column Description
legacy_name_upper Upper-cased legacy column name (key)
legacy_case_variants All observed source casings (semicolon-separated)
current_name Target column in current schema (blank if drop)
disposition rename or drop
subcategory identity, alias, nccs-curated, nccs-flag, nccs-derived, ntee-refinement, geographic, single-year, other
n_dicts_observed How many of the 73 dictionaries contain this column
year_month_first, year_month_last Year-month range observed
legacy_label, legacy_dtype, legacy_length, legacy_description Metadata from the source dictionary
notes Provenance/semantic notes

14.2 Renames

18 legacy columns map to current schema columns. Notable cases:

  • cTotRevREVENUE_AMT — NCCS-curated total revenue from Form 990 (Part I, line 12). The only legacy source for revenue; provenance differs from the modern pipeline’s REVENUE_AMT (which is raw BMF). Flagged in the notes field.
  • zip5ZIP — current pipeline’s ZIP transform internally parses 5+4.
  • INCOMEINCOME_AMT — Gross receipts from Form 990, semantically aligned with INCOME_AMT.
  • Identity renames (EIN, NAME, CITY, STATE) — same name in both schemas.

14.3 Drops

59 legacy columns are dropped (out of 78 total crosswalk rows; 19 are renames). Categories:

  • NCCS NTEE refinements (NTEE1, MAJGRPB, ntmaj5/10/12, NTEEconf, LEVEL1-4, NTEEDRAFT, NTEE2006, NTEE2007A, NTEEFINAL, NTEEOLDB, etc.) — current pipeline derives these from NTEE_CD.
  • NCCS analytic flags (Filer, cFiler, ZFiler, cZFiler, cFinSrc, outnccs, OUTREAS, RandNum, ADDRESSCHANGED) — Core file derivations and per-snapshot diff metadata, not raw fields.
  • NCCS curated F990 derivations (cAssets, cTaxPer, ASSETS_OLD, INCOME_OLD) — redundant with the raw BMF aliases we already keep (ASSETS, TAXPER) or prior-snapshot history we don’t propagate.
  • Geographic enrichments (FIPS, MSA_NECH, PMSA, MSA, LATITUDE, LONGITUDE, fipsold, *Old1 historical values, CBSA, DIVCODE, GEOCD, GEOCDF, GEOCODE, GEOEXCLUDE) — current pipeline runs its own geocoding via R/run_geocoding.R.
  • NCCS-curated activity splits (ACTIV1, ACTIV2, ACTIV3) — per-vintage 3-way splits of the IRS 9-char ACTIVITY field; the current pipeline expects the unsplit form.
  • Single-year columns (ZIP 1989-only, ORGTYPCD, v_errorCd).
  • Other (NAICS — derived from NTEE in current pipeline; IRS990N, EPOST, FISYR_IMAGE — NCCS-curated 990-N and image metadata).

15 Current-Schema Columns With No Legacy Source

These 14 columns will always be NA in legacy intermediate output and are dropped from legacy processed output:

ICO, STREET, AFFILIATION, CLASSIFICATION, DEDUCTIBILITY, ACTIVITY, ORGANIZATION (1989 file is the exception via ORGCD), STATUS, ASSET_CD, INCOME_CD, PF_FILING_REQ_CD, REGION, RYEAR, ID.

Several legacy aliases only appear in some years (e.g., TAXPER is 2006-11+, GEN is 2010-08+, FRCD is 2002-01+, SEC_NAME is 2002-07+). Files from earlier years will have additional columns that are NA-filled.

16 Pre-2003 5-character NTEE Codes

Legacy BMF files from roughly 1995–2003 contain ~5 % of NTEE codes in NCCS’s pre-NTEE-CC 5-character form (letter + 4 digits, e.g. A0120, S0241, B1170). The structure is:

  • Position 1: NTEE major-group letter (A–Z)
  • Positions 2-3: NTEEv2 organization-type encoding (01=AA, 02=MT, 03=PA, 05=RP, 11=MS, 12=MM, 19=NS, default RG)
  • Positions 4-5: decile-subdecile

Modern NTEE-CC is at most 4 chars, so the validator in R/transform_ntee_code.R would normally hard-stop on these. In legacy mode (legacy_mode = TRUE, set automatically by run_legacy_pipeline.R):

  1. The 5-char codes pass length validation with a warning rather than an error.
  2. The vendored crosswalk data/lookup/ntee_legacy_5char_lookup.csv (1,597 rows; 942 of length 5) maps codes like A0120ART-A20-AA. Matched rows override nteev2_subsector, nteev2_code, and nteev2_org_type directly.
  3. Unmatched 5-char codes (~5 % long-tail) fall back to formulaic derivation: nteev2_code = paste0(major_group, substr(raw, 4, 5)) (e.g. A1120A20). Subsector and org_type are still resolved correctly by the standard fcase logic on .first / .char23.

The crosswalk is vendored from Nonprofit-Open-Data-Collective/mission-taxonomies. scripts/check_ntee_legacy_coverage.R reports per-vintage match percentages against any locally-saved Phase 4 checkpoints.

17 Phase 11 Output Schema

compute_legacy_output_columns(populated_raw_columns) in R/legacy_bmf_adapter.R returns the per-file keep-list. The mapping from each current-schema raw column to its transform output columns is encoded in RAW_TO_OUTPUT_MAP. Output columns derived from multiple inputs (org_addr_full, org_addr_is_missing) are kept only when all their inputs were populated (encoded in DERIVED_OUTPUT_COLUMNS).

This means a 1989-06 file with no STREET will:

  • Keep org_addr_city*, org_addr_state*, org_addr_zip* (city, state, zip were populated)
  • Drop org_addr_street*, org_addr_full, org_addr_is_missing (depend on STREET)

18 Output Files

For a legacy file BMF-YYYY-MM-501CX-NONPROFIT-PX.csv:

  • data/intermediate/bmf_legacy_YYYY_MM_intermediate.parquet — full ~110-column schema, includes NA-filled columns. Use for audit and debugging.
  • data/processed/bmf_legacy_YYYY_MM_processed.csv — slim schema, varies per file based on populated inputs.
  • data/processed/bmf_legacy_YYYY_MM_data_dictionary.csv — column metadata for the slim schema.
  • data/quality/bmf_legacy_YYYY_MM_harmonization_report.json — per-file audit: input column count, drops, renames, NA-fills.
  • data/quality/bmf_legacy_YYYY_MM_quality_report.json — same shape as modern pipeline quality reports.
  • docs/quality-reports/bmf_legacy_YYYY_MM_quality_report.html — rendered quality report.

19 Known-bad upstream vintages

Three NCCS-published legacy files have a structurally different schema that the harmonization pipeline cannot consume:

  • BMF-2017-09-501CX-NONPROFIT-PX.csv
  • BMF-2017-12-501CX-NONPROFIT-PX.csv
  • BMF-2018-12-501CX-NONPROFIT-PX.csv

These three files contain sequence-ID values in the EIN column (e.g. 000000001 for VOLUNTEERS OF AMERICA INC) instead of real 9-digit IRS EINs, and a non-standard 8-character TAXPER encoding (e.g. 20001050) that doesn’t decode to any valid YYYYMM format. The 2019 and 2020+ vintages produced by NCCS use the standard schema again, so the issue appears scoped to those three months.

scripts/run_all_legacy.sh skips them by default via SKIP_VINTAGES="2017-09,2017-12,2018-12". Override with SKIP_VINTAGES="" if upstream NCCS files have been corrected.

20 Adding New Legacy Files

  1. Upload the CSV to s3://nccsdata/legacy/bmf/ (preferred) or drop it in data/raw/legacy/.

  2. Run:

    # From S3:
    LEGACY_BMF_YEAR  <- 2013
    LEGACY_BMF_MONTH <- 7
    source("R/run_legacy_pipeline.R")
    
    # Or from a local path:
    LEGACY_BMF_FILE <- "data/raw/legacy/<filename>.csv"
    source("R/run_legacy_pipeline.R")
  3. If pre-harmonization fails with “Legacy BMF contains N column(s) not present in XWALK-BMF-V2.0.csv”:

    • Add a row to data/crosswalks/XWALK-BMF-V2.0.csv for each unknown column (legacy_name_upper, legacy_case_variants, disposition, subcategory, etc.).
    • Set n_dicts_observed=0 and leave year_month_* blank to flag “discovered in raw file, not in dictionary.”
    • Re-run.

21 Implementation Notes

  • paste0(names(dt)) not names(dt)data.table::names(dt) returns a reference that mutates with setnames/:=NULL operations. The harmonization adapter uses paste0() to force a fresh character vector for the audit trail. See R/legacy_bmf_adapter.R.
  • validate_join_success() zero-row guardR/input_validation.R short-circuits when the dim table is empty (e.g., all-NA CLASSIFICATION input → empty dim_classification). Without this, the empty-table case produces NaN that breaks the threshold comparison.
  • HTML rendering is non-fatal — wrapped in tryCatch so a missing quarto R package or render failure doesn’t abort the pipeline. The JSON quality report is already saved by that point.