13  State Data Marts

The state data marts split the geocoded Master BMF into one file per US state / territory so end users can pull only the rows they need instead of downloading the full ~3 GB unified file.

13.1 Motivation

bmf_master_geocoded.csv is 3.1 GB. Most analyses are state-scoped (an Urban Institute team studying nonprofits in Ohio, a state attorney general’s office, a researcher writing about Texas foundations), so a partitioned distribution layer dramatically cuts download size and load time without touching the underlying master.

13.2 Partition key

Rows are partitioned on org_addr_state — the cleaned mailing-address state from the BMF. This is the org’s reported state (populated for ~99.8 % of rows), not the geocoded state, so coverage is essentially complete. Rows where org_addr_state is missing or empty are bucketed into state=ZZ so they remain discoverable rather than being dropped.

The partition column is renamed state in the parquet output (so queries can WHERE state = 'NY' without remembering the longer underlying name).

13.3 Outputs

Two formats are produced from the same source:

Format Layout Use case
Parquet data/master/state_marts/parquet/state=XX/part-0.parquet DuckDB / Athena / pandas — Hive-partitioned
CSV data/master/state_marts/csv/bmf_master_XX.csv Spreadsheet tools, R/Python users who want a single file

S3 mirrors the same structure under s3://nccsdata/master/bmf/state_marts/.

The Hive layout means a DuckDB query like

SELECT * FROM read_parquet('s3://nccsdata/master/bmf/state_marts/parquet/**/*.parquet',
                           hive_partitioning = 1)
WHERE state = 'NY'

reads only the state=NY/ partition.

13.4 Coverage

The build produces 63 marts on the current geocoded master:

  • 50 US states + DC
  • 5 US territories: PR, VI, GU, AS, MP
  • 3 APO/FPO military codes: AA, AE, AP
  • 3 Compact-of-Free-Association codes: FM, MH, PW
  • 1 missing-state bucket: ZZ

13.5 Running it

source("R/run_master_state_marts.R")

The script reads data/geocoding/master/merged/bmf_master_geocoded.parquet, writes both formats to data/master/state_marts/, and (if ENABLE_S3_UPLOAD <- TRUE) uploads everything to S3.

13.6 Update cadence

Rebuild the marts whenever the geocoded master is rebuilt — i.e. after a new monthly current BMF lands and the master + master geocoding workflow is re-run. The state marts are a pure derivation of the geocoded master; they have no other inputs.

13.7 Caveats

  • Mailing state, not operational state. A nonprofit headquartered in DE that operates nationally appears only in the DE mart. Use the geocoded master directly for analyses that need the geocoded state (geo_state_abbr).
  • All-VARCHAR schema. Same convention as the master — consumers should cast numeric and date columns after reading. See Chapter 12 for the recommended cast list.
  • No schema changes. The marts contain the exact same columns as the geocoded master plus the state partition column.