10 Legacy BMF Harmonization
11 Overview
Legacy NCCS 501CX-NONPROFIT-PX BMF files (1989–2022) use NCCS-curated column names that differ from the current IRS BMF schema. The legacy pipeline harmonizes these files, runs the same transforms as the modern pipeline, and writes a slim per-file processed output that contains only columns whose underlying input was actually populated.
Legacy files are NCCS profile extracts, not raw IRS BMF — they include geographic enrichments and analytic flags but drop many raw IRS fields (filing requirements, affiliation, deductibility, ICO, STREET, activity, classification, status, organization type, asset/income codes, PF filing). 14 current-schema columns therefore have no legacy source at all and are excluded from the slim output.
12 Pipeline Differences
R/run_legacy_pipeline.R mirrors R/run_pipeline.R with these changes:
| Phase | Modern | Legacy |
|---|---|---|
| 1 Extraction | Download from s3://nccsdata/raw/bmf/ |
Download from s3://nccsdata/legacy/bmf/ (set LEGACY_BMF_YEAR/LEGACY_BMF_MONTH) or read from a local path (LEGACY_BMF_FILE) |
| 1.5 Harmonization | n/a | harmonize_legacy_bmf() — uppercase, drop, rename, NA-fill |
| 2 Pre-validation | Requires all 31 BMF_REQUIRED_COLUMNS |
Requires only 7 BMF_LEGACY_MIN_COLUMNS populated |
| 3–9 Transforms | unchanged | unchanged (same code path; NA inputs propagate) |
| 10 Intermediate parquet | full schema | full schema (audit/debug) |
| 11 Processed CSV | full schema | slim schema (only populated cols) |
| S3 upload | intermediate/bmf/YYYY_MM/, processed/bmf/YYYY_MM/ |
intermediate/bmf-legacy/YYYY_MM/, processed/bmf-legacy/YYYY_MM/ |
13 Harmonization Steps
Implemented in harmonize_legacy_bmf(dt, crosswalk) in R/legacy_bmf_adapter.R:
- Uppercase column names — handles case-aliasing across years (e.g.,
cZFilervsczFiler,ZFilervsZFILER,accpervsACCPER,RandNumvsrandnum). - Reject unknown columns — every column in the input file must have a row in
XWALK-BMF-V2.0.csv. Unknown columns trigger a hard stop. This is the safety net against silently leaving columns behind. - Drop columns marked
disposition=drop— NCCS analytic flags, NTEE refinements, geographic enrichments, single-year columns. - Rename columns marked
disposition=renameto theircurrent_nametarget. - NA-fill missing required columns — any
BMF_REQUIRED_COLUMNSmember not produced by step 4 is added as an NA character column, so existing transforms run unmodified. The transforms propagate NA through their outputs.
The output is a data.table with exactly the 31 columns in BMF_REQUIRED_COLUMNS, in arbitrary order.
14 Crosswalk: XWALK-BMF-V2.0.csv
Built from two sources:
Scraped dictionaries — 73 of 88 listed 501CX-NONPROFIT-PX dictionary pages from the NCCS catalog (
https://nccs.urban.org/nccs/catalogs/catalog-bmf.html). The other 15 are “data unavailable” placeholders. 47 distinct upper-cased columns observed across the 73 dictionaries.Direct file inspection — the dictionaries undercount real file contents. The 1989-06 raw file has 42 columns vs. the 28 listed in its dictionary; the extras are accumulated NCCS reprocessing artifacts (e.g.,
Ntee2007a,Confidence2007a,nteeFinal,*Old1suffixes). When a new legacy file surfaces unknown columns, add their dispositions to the crosswalk before re-running.
14.1 Schema
| Column | Description |
|---|---|
legacy_name_upper |
Upper-cased legacy column name (key) |
legacy_case_variants |
All observed source casings (semicolon-separated) |
current_name |
Target column in current schema (blank if drop) |
disposition |
rename or drop |
subcategory |
identity, alias, nccs-curated, nccs-flag, nccs-derived, ntee-refinement, geographic, single-year, other |
n_dicts_observed |
How many of the 73 dictionaries contain this column |
year_month_first, year_month_last |
Year-month range observed |
legacy_label, legacy_dtype, legacy_length, legacy_description |
Metadata from the source dictionary |
notes |
Provenance/semantic notes |
14.2 Renames
18 legacy columns map to current schema columns. Notable cases:
cTotRev→REVENUE_AMT— NCCS-curated total revenue from Form 990 (Part I, line 12). The only legacy source for revenue; provenance differs from the modern pipeline’sREVENUE_AMT(which is raw BMF). Flagged in thenotesfield.zip5→ZIP— current pipeline’s ZIP transform internally parses 5+4.INCOME→INCOME_AMT— Gross receipts from Form 990, semantically aligned withINCOME_AMT.- Identity renames (
EIN,NAME,CITY,STATE) — same name in both schemas.
14.3 Drops
59 legacy columns are dropped (out of 78 total crosswalk rows; 19 are renames). Categories:
- NCCS NTEE refinements (
NTEE1,MAJGRPB,ntmaj5/10/12,NTEEconf,LEVEL1-4,NTEEDRAFT,NTEE2006,NTEE2007A,NTEEFINAL,NTEEOLDB, etc.) — current pipeline derives these fromNTEE_CD. - NCCS analytic flags (
Filer,cFiler,ZFiler,cZFiler,cFinSrc,outnccs,OUTREAS,RandNum,ADDRESSCHANGED) — Core file derivations and per-snapshot diff metadata, not raw fields. - NCCS curated F990 derivations (
cAssets,cTaxPer,ASSETS_OLD,INCOME_OLD) — redundant with the raw BMF aliases we already keep (ASSETS,TAXPER) or prior-snapshot history we don’t propagate. - Geographic enrichments (
FIPS,MSA_NECH,PMSA,MSA,LATITUDE,LONGITUDE,fipsold,*Old1historical values,CBSA,DIVCODE,GEOCD,GEOCDF,GEOCODE,GEOEXCLUDE) — current pipeline runs its own geocoding viaR/run_geocoding.R. - NCCS-curated activity splits (
ACTIV1,ACTIV2,ACTIV3) — per-vintage 3-way splits of the IRS 9-char ACTIVITY field; the current pipeline expects the unsplit form. - Single-year columns (
ZIP1989-only,ORGTYPCD,v_errorCd). - Other (
NAICS— derived from NTEE in current pipeline;IRS990N,EPOST,FISYR_IMAGE— NCCS-curated 990-N and image metadata).
15 Current-Schema Columns With No Legacy Source
These 14 columns will always be NA in legacy intermediate output and are dropped from legacy processed output:
ICO, STREET, AFFILIATION, CLASSIFICATION, DEDUCTIBILITY, ACTIVITY, ORGANIZATION (1989 file is the exception via ORGCD), STATUS, ASSET_CD, INCOME_CD, PF_FILING_REQ_CD, REGION, RYEAR, ID.
Several legacy aliases only appear in some years (e.g., TAXPER is 2006-11+, GEN is 2010-08+, FRCD is 2002-01+, SEC_NAME is 2002-07+). Files from earlier years will have additional columns that are NA-filled.
16 Pre-2003 5-character NTEE Codes
Legacy BMF files from roughly 1995–2003 contain ~5 % of NTEE codes in NCCS’s pre-NTEE-CC 5-character form (letter + 4 digits, e.g. A0120, S0241, B1170). The structure is:
- Position 1: NTEE major-group letter (A–Z)
- Positions 2-3: NTEEv2 organization-type encoding (
01=AA,02=MT,03=PA,05=RP,11=MS,12=MM,19=NS, default RG) - Positions 4-5: decile-subdecile
Modern NTEE-CC is at most 4 chars, so the validator in R/transform_ntee_code.R would normally hard-stop on these. In legacy mode (legacy_mode = TRUE, set automatically by run_legacy_pipeline.R):
- The 5-char codes pass length validation with a warning rather than an error.
- The vendored crosswalk
data/lookup/ntee_legacy_5char_lookup.csv(1,597 rows; 942 of length 5) maps codes likeA0120→ART-A20-AA. Matched rows overridenteev2_subsector,nteev2_code, andnteev2_org_typedirectly. - Unmatched 5-char codes (~5 % long-tail) fall back to formulaic derivation:
nteev2_code = paste0(major_group, substr(raw, 4, 5))(e.g.A1120→A20). Subsector and org_type are still resolved correctly by the standard fcase logic on.first/.char23.
The crosswalk is vendored from Nonprofit-Open-Data-Collective/mission-taxonomies. scripts/check_ntee_legacy_coverage.R reports per-vintage match percentages against any locally-saved Phase 4 checkpoints.
17 Phase 11 Output Schema
compute_legacy_output_columns(populated_raw_columns) in R/legacy_bmf_adapter.R returns the per-file keep-list. The mapping from each current-schema raw column to its transform output columns is encoded in RAW_TO_OUTPUT_MAP. Output columns derived from multiple inputs (org_addr_full, org_addr_is_missing) are kept only when all their inputs were populated (encoded in DERIVED_OUTPUT_COLUMNS).
This means a 1989-06 file with no STREET will:
- Keep
org_addr_city*,org_addr_state*,org_addr_zip*(city, state, zip were populated) - Drop
org_addr_street*,org_addr_full,org_addr_is_missing(depend on STREET)
18 Output Files
For a legacy file BMF-YYYY-MM-501CX-NONPROFIT-PX.csv:
data/intermediate/bmf_legacy_YYYY_MM_intermediate.parquet— full ~110-column schema, includes NA-filled columns. Use for audit and debugging.data/processed/bmf_legacy_YYYY_MM_processed.csv— slim schema, varies per file based on populated inputs.data/processed/bmf_legacy_YYYY_MM_data_dictionary.csv— column metadata for the slim schema.data/quality/bmf_legacy_YYYY_MM_harmonization_report.json— per-file audit: input column count, drops, renames, NA-fills.data/quality/bmf_legacy_YYYY_MM_quality_report.json— same shape as modern pipeline quality reports.docs/quality-reports/bmf_legacy_YYYY_MM_quality_report.html— rendered quality report.
19 Known-bad upstream vintages
Three NCCS-published legacy files have a structurally different schema that the harmonization pipeline cannot consume:
BMF-2017-09-501CX-NONPROFIT-PX.csvBMF-2017-12-501CX-NONPROFIT-PX.csvBMF-2018-12-501CX-NONPROFIT-PX.csv
These three files contain sequence-ID values in the EIN column (e.g. 000000001 for VOLUNTEERS OF AMERICA INC) instead of real 9-digit IRS EINs, and a non-standard 8-character TAXPER encoding (e.g. 20001050) that doesn’t decode to any valid YYYYMM format. The 2019 and 2020+ vintages produced by NCCS use the standard schema again, so the issue appears scoped to those three months.
scripts/run_all_legacy.sh skips them by default via SKIP_VINTAGES="2017-09,2017-12,2018-12". Override with SKIP_VINTAGES="" if upstream NCCS files have been corrected.
20 Adding New Legacy Files
Upload the CSV to
s3://nccsdata/legacy/bmf/(preferred) or drop it indata/raw/legacy/.Run:
# From S3: LEGACY_BMF_YEAR <- 2013 LEGACY_BMF_MONTH <- 7 source("R/run_legacy_pipeline.R") # Or from a local path: LEGACY_BMF_FILE <- "data/raw/legacy/<filename>.csv" source("R/run_legacy_pipeline.R")If pre-harmonization fails with “Legacy BMF contains N column(s) not present in XWALK-BMF-V2.0.csv”:
- Add a row to
data/crosswalks/XWALK-BMF-V2.0.csvfor each unknown column (legacy_name_upper,legacy_case_variants,disposition,subcategory, etc.). - Set
n_dicts_observed=0and leaveyear_month_*blank to flag “discovered in raw file, not in dictionary.” - Re-run.
- Add a row to
21 Implementation Notes
paste0(names(dt))notnames(dt)—data.table::names(dt)returns a reference that mutates withsetnames/:=NULLoperations. The harmonization adapter usespaste0()to force a fresh character vector for the audit trail. SeeR/legacy_bmf_adapter.R.validate_join_success()zero-row guard —R/input_validation.Rshort-circuits when the dim table is empty (e.g., all-NA CLASSIFICATION input → emptydim_classification). Without this, the empty-table case produces NaN that breaks the threshold comparison.- HTML rendering is non-fatal — wrapped in
tryCatchso a missingquartoR package or render failure doesn’t abort the pipeline. The JSON quality report is already saved by that point.