Read NCCS BMF Data from S3 — nccs

Reads the NCCS Business Master File (BMF) stored as a parquet file in a public S3 bucket. Supports predicate-pushdown filtering on state, county, NTEE classification (subsector, code, NTEEv2 code, major group), exempt organization type, financial size, and BMF recency for efficient reads.

Usage

nccs_read(
  state = NULL,
  county = NULL,
  ntee_subsector = NULL,
  ntee_major_group = NULL,
  ntee_code = NULL,
  nteev2_code = NULL,
  exempt_org_type = NULL,
  org_type = c("all", "501c3", "public_charity", "pc", "private_foundation", "pf",
    "non_501c3"),
  size_metric = c("revenue", "income", "asset"),
  size_min = NULL,
  size_max = NULL,
  min_last_year = NULL,
  cache = TRUE,
  cache_max_age = 30L,
  coerce = TRUE,
  columns = NULL,
  collect = TRUE
)

Arguments

state

Character vector of two-letter state abbreviations (e.g., `"PA"`, `c("PA", "NY")`). Filters `org_addr_state`.

county

Character vector of county names (e.g., `"Lackawanna County"`). Filters `geo_county`.

ntee_subsector

Character vector of NTEEv2 subsector values. Accepts either subsector codes (`"UNI"`, `"ART"`) or human-readable names (`"Universities"`, `"Arts, Culture and Humanities"`), matched case-insensitively. See [nccs_catalog()] for valid values. Filters `nteev2_subsector`.
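The case-insensitive matching of codes and names can be pictured with the sketch below. It is only an illustration: `resolve_subsector()` and the two-entry lookup are hypothetical, the real list of values comes from [nccs_catalog()], and normalizing every input to its code is an assumption about how matching might work rather than the package's documented behavior.

```r
# Two-entry lookup built from the examples above; the full list is longer.
subsectors <- c(UNI = "Universities", ART = "Arts, Culture and Humanities")

# Hypothetical helper: accept codes or names in any case, return codes.
resolve_subsector <- function(x, lookup = subsectors) {
  key     <- toupper(x)
  by_code <- match(key, toupper(names(lookup)))
  by_name <- match(key, toupper(lookup))
  idx     <- ifelse(is.na(by_code), by_name, by_code)
  if (anyNA(idx)) {
    stop("Unknown subsector: ", paste(x[is.na(idx)], collapse = ", "))
  }
  names(lookup)[idx]
}

resolved <- resolve_subsector(c("uni", "arts, culture and humanities"))
```

Rejecting unknown values early, as the sketch does, is one possible design; it contrasts with the unvalidated code arguments below, where a typo simply yields no rows.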

ntee_major_group

Character vector of single-letter NTEE major groups (`"A"` through `"Z"`). Filters `ntee_code_major_group`. See [nccs_catalog("ntee_major_group", labels = TRUE)] for descriptions.

ntee_code

Character vector of standardized 3-character NTEE codes (e.g., `"B40"`, `c("A20", "A23")`). Filters `ntee_code_clean`. Codes are not validated; an invalid code simply returns no rows.

nteev2_code

Character vector of NTEEv2 3-character codes. Filters `nteev2_code`. Not validated.

exempt_org_type

Character vector of exempt organization type descriptions. See [nccs_catalog()] for valid values. Filters `exempt_organization_type`.

org_type

Convenience filter that combines `subsection_code` and `foundation_code` into the common 501(c)(3) cuts (all 501(c)(3)s, public charities, private foundations) plus their complement. One of:

`"all"` (default) — no filter.
`"501c3"` — every 501(c)(3) organization (`subsection_code == "3"`).
`"public_charity"` / `"pc"` — 501(c)(3) public charities (foundation codes other than the three private-foundation types).
`"private_foundation"` / `"pf"` — 501(c)(3) private foundations (`foundation_code` in `"2"`, `"3"`, `"4"`).
`"non_501c3"` — everything that is not a 501(c)(3).
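As a rough illustration of how these options could translate into row filters, the sketch below applies the same logic to a small in-memory data frame. The helper `org_type_mask()` and the toy data are hypothetical and not part of the package; the real function builds its predicate against the parquet file. Treating a 501(c)(3) row with a missing `foundation_code` as a public charity is an assumption of this sketch.

```r
# Toy stand-in for the two BMF columns involved (values are made up).
bmf <- data.frame(
  ein             = c("001", "002", "003", "004"),
  subsection_code = c("3", "3", "4", "3"),
  foundation_code = c("15", "4", NA, "2"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: logical row mask for a given org_type value.
org_type_mask <- function(df, org_type = "all") {
  is_c3 <- !is.na(df$subsection_code) & df$subsection_code == "3"
  # %in% returns FALSE for NA, so a missing foundation code is never "pf".
  is_pf <- is_c3 & df$foundation_code %in% c("2", "3", "4")
  switch(org_type,
    all                = rep(TRUE, nrow(df)),
    "501c3"            = is_c3,
    public_charity     = ,
    pc                 = is_c3 & !is_pf,
    private_foundation = ,
    pf                 = is_pf,
    non_501c3          = !is_c3,
    stop("Unknown org_type: ", org_type)
  )
}

private_foundations <- bmf[org_type_mask(bmf, "pf"), ]
```

In this toy data the public-charity and private-foundation masks are disjoint and together cover exactly the rows selected by `"501c3"`.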

size_metric

One of `"revenue"`, `"income"`, or `"asset"` indicating which financial amount to use with `size_min` / `size_max`. Defaults to `"revenue"`. The underlying columns are stored upstream as character; this function casts to numeric inside the predicate so the filter pushes down to Arrow.

size_min, size_max

Numeric. Optional inclusive bounds on the `size_metric` amount. `NULL` (default) leaves a side unbounded. Rows with `NA` for the chosen metric are dropped when either bound is set.
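The bound semantics described above (cast from character, inclusive limits, `NA` dropped once a bound is set) can be sketched on a toy data frame. `size_filter()` is a hypothetical helper written for this illustration, and the assumption that `size_metric = "revenue"` corresponds to the `revenue_amount` column is inferred from the column names listed under "Column coercion"; the real function applies the filter inside the Arrow query rather than in base R.

```r
# Toy data: amounts stored as character, as in the upstream file.
bmf <- data.frame(
  ein            = c("001", "002", "003", "004"),
  revenue_amount = c("500000", "1000000", NA, "25000000"),
  stringsAsFactors = FALSE
)

# Hypothetical helper mirroring the documented semantics.
size_filter <- function(df, column, size_min = NULL, size_max = NULL) {
  # With no bounds set, nothing is filtered and NA rows are kept.
  if (is.null(size_min) && is.null(size_max)) return(df)
  amount <- as.numeric(df[[column]])
  keep <- !is.na(amount)                 # NA rows drop once a bound is set
  if (!is.null(size_min)) keep <- keep & amount >= size_min  # inclusive
  if (!is.null(size_max)) keep <- keep & amount <= size_max  # inclusive
  df[keep, , drop = FALSE]
}

mid_sized <- size_filter(bmf, "revenue_amount", size_min = 1e6, size_max = 1e7)
```

Here only EIN `002` survives: it sits exactly on the lower bound, which is inclusive, while the `NA` row and the rows outside the range are removed.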

min_last_year

Integer. If set, restricts results to EINs whose `last_year_in_bmf` (the calendar year of the most recent BMF vintage in which the EIN appeared) is at least this value. Use this as a recency / "still active" filter — e.g., `min_last_year = 2024` keeps organizations seen in BMF in 2024 or later.

cache

Controls local caching of the master parquet. The S3 file is hundreds of MB; without caching each call re-downloads it. `TRUE` (default) caches in `tools::R_user_dir("nccsdata", "cache")`. A character path uses that directory instead. `FALSE` skips the cache and reads directly from S3 every call. See [nccs_cache_dir()] and [nccs_cache_clear()].

cache_max_age

Integer. Maximum age in days before the cached parquet is considered stale and re-downloaded. Defaults to 30 (the master is rebuilt monthly upstream). Ignored when `cache = FALSE`.
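A staleness check of this kind is typically based on the cached file's modification time. The sketch below shows one way it could work; `cache_is_stale()` is a hypothetical helper, and whether the package compares with `>` or `>=`, or uses the modification time at all, is an assumption.

```r
# Hypothetical helper: TRUE when the file is missing or older than max_age_days.
cache_is_stale <- function(path, max_age_days = 30L, now = Sys.time()) {
  if (!file.exists(path)) return(TRUE)
  age_days <- as.numeric(difftime(now, file.info(path)$mtime, units = "days"))
  age_days > max_age_days
}

# Demonstrate with a throwaway file instead of the real cached parquet.
tmp <- tempfile(fileext = ".parquet")
writeLines("placeholder", tmp)

fresh_now    <- cache_is_stale(tmp)                                # just written
stale_later  <- cache_is_stale(tmp, now = Sys.time() + 31 * 86400) # 31 days on
missing_file <- cache_is_stale(tempfile())                         # never cached
```

Passing `now` explicitly keeps the check easy to exercise without waiting for a file to age.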

coerce

Logical. If `TRUE` (default), known character-typed financial, date, and indicator columns are coerced to their natural R types after collection (see "Column coercion" below). Set to `FALSE` to leave every column as published upstream. Only takes effect when `collect = TRUE`; with `collect = FALSE` the lazy Arrow query is returned untouched.

columns

Column selection. `NULL` (default) returns a sensible default subset. A character vector returns those specific columns. `"all"` returns every column (warning: the result can exceed 400 MB). Columns used in active filters are always included.

collect

Logical. If `TRUE` (default), collects the result into a tibble. If `FALSE`, returns a lazy Arrow query for further dplyr operations.

Value

A tibble (if `collect = TRUE`) or an Arrow Dataset query (if `collect = FALSE`).

Details

Reads the rolling "master" geocoded BMF at `s3://nccsdata/geocoding/bmf-master/merged/bmf_master_geocoded.parquet`. For a specific dated monthly snapshot, see [nccs_vintage_url()] — those artifacts are CSVs with per-vintage schemas and are not exposed through this function.

Column coercion

The upstream BMF parquet stores most columns as `character` by design (vintage-stacking requires it). When `coerce = TRUE`, `nccs_read()` casts the following on the collected tibble:

Numeric: `asset_amount`, `income_amount`, `revenue_amount`, `geo_score`, `geo_distance`.
Date (`YYYY-MM-DD`): `ruling_date`, `tax_period_ymd`.
Logical indicators (via [nccs_as_indicator()] with the `yn` scheme): `group_exemption_is_member`, `org_addr_is_po_box`, `org_addr_is_rural_route`, `org_addr_has_special_chars`, `org_addr_is_missing`, `org_addr_missing_number`, `org_addr_state_invalid`, `ruling_date_is_missing`, `tax_period_is_missing`, `in_care_of_name_provided`.

Code-like columns (e.g. `subsection_code`, `classification_code`, ZIPs) are intentionally left as `character` — they are identifiers, not numbers. Columns not in the result (because of `columns` selection) are silently skipped.
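The coercion rules above can be pictured on a toy data frame. This is a sketch under stated assumptions, not the package's implementation: the column lists are shortened, the data is made up, and the mapping of `"Y"` to `TRUE` and `"N"` to `FALSE` is a guess at what the `yn` scheme means (the real conversion goes through [nccs_as_indicator()]).

```r
# Toy "as published" data: every column is character.
raw <- data.frame(
  ein                = c("001", "002"),
  subsection_code    = c("3", "4"),
  revenue_amount     = c("1250000", NA),
  ruling_date        = c("1998-07-01", "2011-03-15"),
  org_addr_is_po_box = c("Y", "N"),
  stringsAsFactors = FALSE
)

# Shortened versions of the documented column groups.
numeric_cols   <- c("asset_amount", "income_amount", "revenue_amount")
date_cols      <- c("ruling_date", "tax_period_ymd")
indicator_cols <- c("org_addr_is_po_box", "org_addr_is_missing")

coerced <- raw
# intersect() skips columns absent from the result, as described above.
for (col in intersect(numeric_cols, names(coerced))) {
  coerced[[col]] <- as.numeric(coerced[[col]])
}
for (col in intersect(date_cols, names(coerced))) {
  coerced[[col]] <- as.Date(coerced[[col]], format = "%Y-%m-%d")
}
for (col in intersect(indicator_cols, names(coerced))) {
  # Assumed yn scheme: "Y" -> TRUE, "N" -> FALSE, anything else -> NA.
  coerced[[col]] <- unname(c(Y = TRUE, N = FALSE)[coerced[[col]]])
}
```

After this runs, `subsection_code` and `ein` are still character, while the amount, date, and indicator columns carry their natural R types.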

Examples

if (FALSE) { # \dontrun{
# All Pennsylvania nonprofits
pa <- nccs_read(state = "PA")

# Universities in PA seen in BMF in 2024 or later
pa_uni <- nccs_read(
  state = "PA",
  ntee_subsector = "Universities",
  min_last_year = 2024
)

# Arts orgs (major group A) with revenue between $1M and $10M
arts_mid <- nccs_read(
  ntee_major_group = "A",
  size_metric = "revenue",
  size_min = 1e6,
  size_max = 1e7
)

# Only 501(c)(3) private foundations in California
ca_pf <- nccs_read(state = "CA", org_type = "private_foundation")

# Lazy query for custom dplyr chains
query <- nccs_read(state = "PA", collect = FALSE)
result <- query |>
  dplyr::filter(geo_county == "Lackawanna County") |>
  dplyr::collect()
} # }