12  Master BMF

The Master BMF is a single-row-per-EIN view consolidating every nonprofit ever observed across both the current monthly IRS BMF pipeline and the legacy NCCS 501CX-NONPROFIT-PX pipeline. It is a derived artifact — the per-vintage processed CSVs remain the source of truth for historical detail.

12.1 Motivation

The processed BMF outputs are vintage-specific: there is one CSV per month (current) and one CSV per scraped legacy snapshot (1989–2016). For many downstream uses — particularly geocoding — what matters is “every EIN that has ever been a recognized nonprofit, with its most recent address.” Stacking 100+ vintages and deduplicating to one row per EIN yields exactly that view.

12.2 Schema

The Master BMF schema is the union of the current and legacy processed schemas, plus seven new columns that describe each EIN’s appearance across vintages:

| Column | Type | Description |
|---|---|---|
| bmf_source | char | current or legacy — pipeline that produced the surviving record |
| bmf_vintage_ym | char(7) | YYYY-MM of the vintage providing the surviving row (= last_vintage_ym) |
| first_vintage_ym | char(7) | Earliest vintage (YYYY-MM) in which the EIN was observed |
| last_vintage_ym | char(7) | Latest vintage (YYYY-MM) in which the EIN was observed |
| first_year_in_bmf | int | YEAR(first_vintage_ym) — convenience for filtering |
| last_year_in_bmf | int | YEAR(last_vintage_ym) |
| bmf_vintages_observed | int | Count of vintages in which this EIN appeared (≥ 1) |

Columns from the legacy slim per-vintage schema that don’t exist in the current schema are filled with NULL for current-sourced rows, and vice versa. DuckDB’s union_by_name handles this automatically.
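
The NULL-fill behavior is easy to see in isolation. A toy sketch from R (the EINs and column names are made up; the same alignment applies whether it comes from read_csv_auto(..., union_by_name = true) or an explicit UNION ALL BY NAME):

library(DBI)
library(duckdb)

con <- dbConnect(duckdb::duckdb())
# Columns present on only one side come back NULL on the other.
dbGetQuery(con, "
  SELECT '11-1111111' AS ein, 'x' AS current_only_col
  UNION ALL BY NAME
  SELECT '22-2222222' AS ein, 'y' AS legacy_only_col
")
#          ein current_only_col legacy_only_col
# 1 11-1111111                x            <NA>
# 2 22-2222222             <NA>               y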

12.3 Dedup logic

ROW_NUMBER() OVER (
  PARTITION BY ein
  ORDER BY bmf_vintage_ym DESC, bmf_source ASC
) = 1

For each EIN, keep the row from the newest vintage. When the same vintage_ym appears in both pipelines (the 2014–2016 overlap window where current-IRS extracts and NCCS legacy snapshots both exist), the current pipeline wins because its schema is richer (full IRS BMF fields vs. NCCS-curated subset).

Since 'current' sorts before 'legacy' alphabetically, bmf_source ASC is the correct tiebreaker.
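
In context, a sketch of the full dedup statement, assuming the union_by_name stack (with ein, bmf_source, and bmf_vintage_ym already derived) is staged in a table whose name, stacked, is illustrative:

library(DBI)
library(duckdb)

con <- dbConnect(duckdb::duckdb())
# Keep exactly one row per EIN: newest vintage wins, then 'current'
# beats 'legacy' on a vintage tie.
dbExecute(con, "
  CREATE OR REPLACE TABLE bmf_master AS
  SELECT *
  FROM stacked
  QUALIFY ROW_NUMBER() OVER (
    PARTITION BY ein
    ORDER BY bmf_vintage_ym DESC, bmf_source ASC
  ) = 1
")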

12.4 Inputs

The build reads CSVs directly from S3 via DuckDB’s httpfs extension:

| Pattern | Source |
|---|---|
| s3://nccsdata/processed/bmf/YYYY_MM/bmf_YYYY_MM_processed.csv | current |
| s3://nccsdata/processed/bmf-legacy/YYYY_MM/bmf_legacy_YYYY_MM_processed.csv | legacy |

Both come from the existing pipelines (run_pipeline.R and run_legacy_pipeline.R respectively) and are not regenerated by the master build — Master is purely a stack-and-dedup over their outputs.
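
To read these interactively, load httpfs and point DuckDB at the bucket. A sketch (the region value is an assumption; credentials here are pulled from the standard AWS environment variables):

library(DBI)
library(duckdb)

con <- dbConnect(duckdb::duckdb())
dbExecute(con, "INSTALL httpfs")
dbExecute(con, "LOAD httpfs")
dbExecute(con, "SET s3_region = 'us-east-1'")   # assumption: bucket region
dbExecute(con, sprintf("SET s3_access_key_id = '%s'",
                       Sys.getenv("AWS_ACCESS_KEY_ID")))
dbExecute(con, sprintf("SET s3_secret_access_key = '%s'",
                       Sys.getenv("AWS_SECRET_ACCESS_KEY")))
# Glob over all current-pipeline vintages:
dbGetQuery(con, "
  SELECT COUNT(*) AS n
  FROM read_csv_auto(
    's3://nccsdata/processed/bmf/*/bmf_*_processed.csv',
    union_by_name = true, all_varchar = true
  )
")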

12.5 Outputs

| Local path | S3 path |
|---|---|
| data/master/bmf_master.parquet | s3://nccsdata/master/bmf/bmf_master.parquet |
| data/master/bmf_master.csv | s3://nccsdata/master/bmf/bmf_master.csv |
| data/master/bmf_master_data_dictionary.csv | s3://nccsdata/master/bmf/bmf_master_data_dictionary.csv |
| data/quality/bmf_master_quality_report.json | s3://nccsdata/master/bmf/bmf_master_quality_report.json |

The parquet is zstd-compressed for distribution efficiency. The CSV contains the same rows and is convenient for spreadsheet tools.
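
Both files fall out of plain COPY statements. A sketch, assuming the final table is staged in DuckDB as bmf_master:

library(DBI)
library(duckdb)

con <- dbConnect(duckdb::duckdb())
# Parquet with zstd compression for distribution:
dbExecute(con, "
  COPY bmf_master TO 'data/master/bmf_master.parquet'
  (FORMAT PARQUET, COMPRESSION ZSTD)
")
# Same rows as CSV for spreadsheet tools:
dbExecute(con, "
  COPY bmf_master TO 'data/master/bmf_master.csv'
  (FORMAT CSV, HEADER true)
")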

12.6 Quality checks

Reported by R/quality/master_post_checks.R:

  • total_rows == distinct_eins (hard gate — must hold or the build fails; see the sketch after this list)
  • Rows by bmf_source
  • Histogram of bmf_vintage_ym (which vintages contributed surviving rows)
  • Year range over first_year_in_bmf and last_year_in_bmf
  • Distribution of bmf_vintages_observed
  • Per-column non-null percentage
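
A minimal sketch of the hard gate, assuming the master table is staged in DuckDB as bmf_master:

library(DBI)
library(duckdb)

con <- dbConnect(duckdb::duckdb())
gate <- dbGetQuery(con, "
  SELECT COUNT(*) AS total_rows, COUNT(DISTINCT ein) AS distinct_eins
  FROM bmf_master
")
# Fail the build loudly if any EIN survived with more than one row.
stopifnot(gate$total_rows == gate$distinct_eins)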

12.7 Update cadence

Rebuild the Master BMF after each new monthly current BMF lands. The legacy half is static (no new vintages will be produced), so the incremental cost each month is reading one new monthly CSV plus restacking everything else. With DuckDB on a c5.18xlarge, the full build runs in a few minutes — there is no need for an incremental strategy.

12.8 Geocoding integration

R/run_master_geocoding.R provides export/merge modes that mirror the per-month geocoding workflow but operate on bmf_master.parquet:

# Step 1: write geocoder batches (dedups on org_addr_full first;
# typical ~30-40 % reduction in geocoder rows because many EINs
# share addresses).
MASTER_GEOCODING_MODE <- "export"
source("R/run_master_geocoding.R")

# ... upload each batch CSV to the Urban Institute geocoder,
#     download geocoded outputs to data/geocoding/master/output/ ...

# Step 2: merge geocoded lat/lon back onto the master, propagating
# to all EINs that share each unique address.
MASTER_GEOCODING_MODE <- "merge"
source("R/run_master_geocoding.R")

Outputs: data/geocoding/master/merged/bmf_master_geocoded.{parquet,csv}, a JSON quality report, and bmf_master_geocoded_data_dictionary.csv. Uploads go to s3://nccsdata/geocoding/bmf-master/merged/ when ENABLE_S3_UPLOAD is set.

The geocoded master keeps the all-VARCHAR convention for the BMF columns, but geo_lat and geo_lon are explicitly cast to DOUBLE since they are the highest-value columns of the geocoded output.
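
A sketch of that cast step (the _raw staging-table name is illustrative; TRY_CAST leaves malformed strings as NULL instead of erroring):

library(DBI)
library(duckdb)

con <- dbConnect(duckdb::duckdb())
# Keep every column as-is except the two coordinates.
dbExecute(con, "
  CREATE OR REPLACE TABLE bmf_master_geocoded AS
  SELECT * REPLACE (
    TRY_CAST(geo_lat AS DOUBLE) AS geo_lat,
    TRY_CAST(geo_lon AS DOUBLE) AS geo_lon
  )
  FROM bmf_master_geocoded_raw
")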

12.8.1 Coverage and the legacy-row NA contract

Roughly 40 % of master rows have NA geo_lat/geo_lon. This is by design, not a geocoder failure — the geocoder match rate on the rows we send is 100 %.

The cause: every bmf_source = "legacy" row (~1.48 M of 3.67 M) originates from an NCCS 501CX-NONPROFIT-PX file whose source schema has no STREET column. The legacy slim-output rule in R/legacy_bmf_adapter.R (DERIVED_OUTPUT_COLUMNS) only emits org_addr_full when all four of STREET/CITY/STATE/ZIP were populated in the source, so org_addr_full is dropped from every legacy slim CSV. After union_by_name builds the master, those rows carry NA org_addr_full and are filtered out at export (R/master_geocoding.R requires non-empty org_addr_full).
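
The contract is easy to verify on a local copy; a quick count by source against the geocoded parquet (path as documented above):

library(DBI)
library(duckdb)

con <- dbConnect(duckdb::duckdb())
# COUNT(geo_lat) counts non-NULL values only.
dbGetQuery(con, "
  SELECT bmf_source,
         COUNT(*) AS n_rows,
         ROUND(100.0 * COUNT(geo_lat) / COUNT(*), 1) AS pct_with_lat
  FROM 'data/geocoding/master/merged/bmf_master_geocoded.parquet'
  GROUP BY bmf_source
")
# Expect ~100% for current-source rows and 0% for legacy-source rows.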

We deliberately do not pre-bake ZIP- or city-centroid lat/lon for those rows. Pre-baked centroids would (a) commit every downstream consumer to one TIGER vintage, (b) conflate precision tiers in the same geo_lat/geo_lon column, and (c) destroy information vs. exposing the clean components.

For geographic analysis on legacy-source rows, join the already-cleaned address components (org_addr_zip5, org_addr_state, org_addr_city — all present in the geocoded master output via union_by_name) to a TIGER layer at your chosen vintage:

# R
library(tigris)
zcta <- zctas(year = 2020)             # pick the vintage matching your analysis
joined <- merge(bmf, zcta, by.x = "org_addr_zip5", by.y = "ZCTA5CE20")
# Python
import pygris
zcta = pygris.zctas(year=2020)
joined = bmf.merge(zcta, left_on="org_addr_zip5", right_on="ZCTA5CE20")

From there, derive whatever representation the analysis needs (polygon, centroid, population-weighted centroid, county aggregate, …). Street-level geocoding is only available for current-source rows because address-range interpolation needs the Urban geocoder’s lookups; for legacy rows the TIGER join is the best available precision.
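
For instance, a sketch deriving ZCTA centroids from the joined object above using sf (the projection choice, EPSG:5070, is an assumption; pick one suited to your study area):

library(sf)

joined_sf <- st_as_sf(joined)           # restore sf class after merge()
# Project to an equal-area CRS before centroiding, then report in WGS84.
centroids <- st_centroid(st_transform(joined_sf, 5070))
coords    <- st_coordinates(st_transform(centroids, 4326))
joined_sf$zcta_lon <- coords[, "X"]
joined_sf$zcta_lat <- coords[, "Y"]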

Cadence is ad-hoc: rebuild the master, then re-export and merge. The dedup step uses a manifest (bmf_master_geocoder_addr_lookup.parquet) written by the export step and consumed by the merge step, so don’t delete the input directory between steps.

12.9 Running it

# On EC2 (after setup_ec2.sh and AWS credentials configured):
bash scripts/run_master.sh

In RStudio:

source("R/run_master_pipeline.R")

Configuration knobs at the top of R/run_master_pipeline.R:

  • DUCKDB_MEMORY_LIMIT — default "100GB" (sized for c5.18xlarge, 144 GB RAM). DuckDB spills to disk above the limit, so this is a soft ceiling — undersizing it just makes the build slower; it won’t OOM the host. Tune to leave ~30 % of system RAM free for the OS, R session, and CSV parser buffers.
  • DUCKDB_THREADS — NULL uses all cores.
  • DUCKDB_DB_PATH — NULL uses an in-memory database; set a path to persist staging tables for debugging.
  • ENABLE_S3_UPLOAD — TRUE by default.

12.9.1 Sizing for other instance types

The defaults at the top of run_master_pipeline.R are tuned for the c5.18xlarge instance the project runs on (72 vCPU / 144 GB RAM). On any smaller host, drop DUCKDB_MEMORY_LIMIT proportionally before sourcing the script. Suggested starting points:

| Instance | vCPU / RAM | DUCKDB_MEMORY_LIMIT | Notes |
|---|---|---|---|
| c5.18xlarge | 72 / 144 GB | "100GB" (default) | What the project is sized for; build is CPU-bound |
| m6i.4xlarge | 16 / 64 GB | "45GB" | Comfortable margin |
| m6i.2xlarge | 8 / 32 GB | "22GB" | Fine; CSV parse is the bottleneck, not memory |
| m6i.xlarge | 4 / 16 GB | "10GB" | Works — DuckDB will spill aggressively to disk |
| Laptop (8 GB RAM) | any / 8 GB | "5GB" | Will spill heavily; expect noticeably longer runtime |

Do not size DUCKDB_MEMORY_LIMIT to consume more than ~70 % of system RAM — the R session, the CSV parser, and the OS need headroom. If the build fails with an out-of-memory error mid-run, lower the limit further; DuckDB will still complete using disk-backed intermediate spills, just more slowly.

If you’re CPU-constrained and want to leave cycles for other work, also set DUCKDB_THREADS <- 4 (or whatever you can spare) before sourcing.
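
Putting the overrides together for, say, an m6i.2xlarge:

# Set the knobs before sourcing, as described above, so the script
# picks them up instead of its c5.18xlarge defaults.
DUCKDB_MEMORY_LIMIT <- "22GB"
DUCKDB_THREADS      <- 4
source("R/run_master_pipeline.R")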

12.10 Caveats

  • Address history is lost. Master keeps only the most recent address per EIN. For longitudinal address research, use the per-vintage processed CSVs.

  • Revoked / terminated orgs are included. The Master BMF is every EIN ever observed; consumers can filter on status_code if only currently-active orgs are wanted.

  • Schema is sparse for legacy-only EINs. EINs that disappeared before 2014 carry only the legacy slim schema; current-pipeline columns (e.g. NTEEv2 fields, modern address quality flags) are NULL. The bmf_source column makes this explicit.

  • All columns are VARCHAR. The master build reads every column as a string (all_varchar = true) because DuckDB’s UNION ALL BY NAME refuses to combine columns where the legacy side inferred VARCHAR (typically because the column was all-empty in the slim schema) and the current side inferred BIGINT or another numeric type. Consumers should cast columns they care about after reading. Recommended casts (a sketch applying them in R follows the table):

    | Column | R | Python |
    |---|---|---|
    | asset_amount, income_amount, revenue_amount | as.numeric() | pd.to_numeric(..., errors='coerce') |
    | tax_period_ymd, ruling_date | as.Date() | pd.to_datetime(..., errors='coerce') |
    | first_year_in_bmf, last_year_in_bmf | as.integer() | pd.to_numeric(..., downcast='integer') |
    | bmf_vintages_observed | as.integer() | pd.to_numeric(..., downcast='integer') |
    | All *_is_missing / *_provided flags | as.logical() | .astype(bool) |
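
    A sketch applying those casts in R after reading the parquet (assumes the flag columns hold 'TRUE'/'FALSE' strings and the date columns are ISO yyyy-mm-dd; adjust if yours differ):

    library(arrow)
    library(dplyr)

    bmf <- read_parquet("data/master/bmf_master.parquet") |>
      mutate(
        across(c(asset_amount, income_amount, revenue_amount), as.numeric),
        across(c(tax_period_ymd, ruling_date), as.Date),   # ISO yyyy-mm-dd strings
        across(c(first_year_in_bmf, last_year_in_bmf,
                 bmf_vintages_observed), as.integer),
        across(c(ends_with("_is_missing"), ends_with("_provided")), as.logical)
      )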