13 State Data Marts
The state data marts split the geocoded Master BMF into one file per US state / territory so end users can pull only the rows they need instead of downloading the full ~3 GB unified file.
13.1 Motivation
bmf_master_geocoded.csv is 3.1 GB. Most analyses are state-scoped (an Urban Institute team studying nonprofits in Ohio, a state attorney general’s office, a researcher writing about Texas foundations), so a partitioned distribution layer dramatically cuts download size and load time without touching the underlying master.
13.2 Partition key
Rows are partitioned on org_addr_state — the cleaned mailing-address state from the BMF. This is the org’s reported state (populated for ~99.8 % of rows), not the geocoded state, so coverage is essentially complete. Rows where org_addr_state is missing or empty are bucketed into state=ZZ so they remain discoverable rather than being dropped.
The partition column is renamed state in the parquet output (so queries can WHERE state = 'NY' without remembering the longer underlying name).
13.3 Outputs
Two formats are produced from the same source:
| Format | Layout | Use case |
|---|---|---|
| Parquet | data/master/state_marts/parquet/state=XX/part-0.parquet |
DuckDB / Athena / pandas — Hive-partitioned |
| CSV | data/master/state_marts/csv/bmf_master_XX.csv |
Spreadsheet tools, R/Python users who want a single file |
S3 mirrors the same structure under s3://nccsdata/master/bmf/state_marts/.
The Hive layout means a DuckDB query like
SELECT * FROM read_parquet('s3://nccsdata/master/bmf/state_marts/parquet/**/*.parquet',
hive_partitioning = 1)
WHERE state = 'NY'reads only the state=NY/ partition.
13.4 Coverage
The build produces 63 marts on the current geocoded master:
- 50 US states + DC
- 5 US territories: PR, VI, GU, AS, MP
- 3 APO/FPO military codes: AA, AE, AP
- 3 Compact-of-Free-Association codes: FM, MH, PW
- 1 missing-state bucket: ZZ
13.5 Running it
source("R/run_master_state_marts.R")The script reads data/geocoding/master/merged/bmf_master_geocoded.parquet, writes both formats to data/master/state_marts/, and (if ENABLE_S3_UPLOAD <- TRUE) uploads everything to S3.
13.6 Update cadence
Rebuild the marts whenever the geocoded master is rebuilt — i.e. after a new monthly current BMF lands and the master + master geocoding workflow is re-run. The state marts are a pure derivation of the geocoded master; they have no other inputs.
13.7 Caveats
- Mailing state, not operational state. A nonprofit headquartered in DE that operates nationally appears only in the
DEmart. Use the geocoded master directly for analyses that need the geocoded state (geo_state_abbr). - All-VARCHAR schema. Same convention as the master — consumers should cast numeric and date columns after reading. See Chapter 12 for the recommended cast list.
- No schema changes. The marts contain the exact same columns as the geocoded master plus the
statepartition column. - Counts won’t match sector-in-brief. A nonprofit count from these marts (filtering
last_year_in_bmf, partitioned on the mailing-state keyorg_addr_state) runs slightly below thesector-in-briefnumber_nonprofitsfigure for the same place/year. That product counts by an active-window rule (org_year_first ≤ Y ≤ org_year_last) and by geocoded county/state — both by design, not a data error. For “how many active nonprofits in place X,” cite sector-in-brief; query these marts directly only for EIN-level detail. See the “Counting semantics” note in thenccs-contractssector-in-brief.yml/bmf-master-geocoded.ymlcontracts for the rule and a worked example.
Three defensible queries give three different numbers — all correct for their definition:
| Query | Geography | “Active” rule | Count |
|---|---|---|---|
subsection_code == 3, latest vintage |
mailing (org_addr_state) |
present in 2026-05 file | 10,912 |
| + public-charity filter | mailing | present in 2026-05 file | 10,246 |
| + active-window | mailing | last_year_in_bmf == 2026 |
10,358 |
| sector-in-brief / dashboard | geocoded (Census State) |
active-window | 10,368 |
Two gotchas behind the spread:
- 501(c)(3) ≠ 501(c)(3) public charity. Every private foundation is also a 501(c)(3). Filtering only
subsection_code == 3keeps the ~660 DC private foundations (foundation_codein {2,3,4}) and 4 suspense orgs (code 9). A public-charity count restricts tofoundation_code %in% c(10:18, 21:24). - The +122 vs. the dashboard is ~112 active-window (orgs last seen in an early-2026 vintage but not the May file) + ~10 geocoded-vs-mailing geography. Cite 10,368 (sector-in-brief) for the public figure.