Reads the NCCS Business Master File (BMF) stored as a parquet file in a public S3 bucket. Supports predicate-pushdown filtering on state, county, NTEE classification (subsector, code, NTEEv2 code, major group), exempt organization type, financial size, and BMF recency for efficient reads.
Usage
nccs_read(
  state = NULL,
  county = NULL,
  ntee_subsector = NULL,
  ntee_major_group = NULL,
  ntee_code = NULL,
  nteev2_code = NULL,
  exempt_org_type = NULL,
  org_type = c("all", "501c3", "public_charity", "pc", "private_foundation", "pf",
    "non_501c3"),
  size_metric = c("revenue", "income", "asset"),
  size_min = NULL,
  size_max = NULL,
  min_last_year = NULL,
  cache = TRUE,
  cache_max_age = 30L,
  coerce = TRUE,
  columns = NULL,
  collect = TRUE
)
Arguments
- state
Character vector of two-letter state abbreviations (e.g., `"PA"`, `c("PA", "NY")`). Filters `org_addr_state`.
- county
Character vector of county names (e.g., `"Lackawanna County"`). Filters `geo_county`.
- ntee_subsector
Character vector of NTEEv2 subsector values. Accepts either subsector codes (`"UNI"`, `"ART"`) or human-readable names (`"Universities"`, `"Arts, Culture and Humanities"`), matched case-insensitively. See [nccs_catalog()] for valid values. Filters `nteev2_subsector`.
- ntee_major_group
Character vector of single-letter NTEE major groups (`"A"` through `"Z"`). Filters `ntee_code_major_group`. See [nccs_catalog("ntee_major_group", labels = TRUE)] for descriptions.
- ntee_code
Character vector of standardized 3-character NTEE codes (e.g., `"B40"`, `c("A20", "A23")`). Filters `ntee_code_clean`. Not validated — invalid codes just return no rows.
- nteev2_code
Character vector of NTEEv2 3-character codes. Filters `nteev2_code`. Not validated.
- exempt_org_type
Character vector of exempt organization type descriptions. See [nccs_catalog()] for valid values. Filters `exempt_organization_type`.
- org_type
Convenience filter that combines `subsection_code` and `foundation_code` into the three common 501(c)(3) cuts (all 501(c)(3)s, public charities, private foundations) plus their complement. One of:
`"all"` (default) — no filter.
`"501c3"` — every 501(c)(3) organization (`subsection_code == "3"`).
`"public_charity"` / `"pc"` — 501(c)(3) public charities (foundation codes other than the three private-foundation types).
`"private_foundation"` / `"pf"` — 501(c)(3) private foundations (`foundation_code` in `"2"`, `"3"`, `"4"`).
`"non_501c3"` — everything that is not a 501(c)(3).
- size_metric
One of `"revenue"`, `"income"`, or `"asset"` indicating which financial amount to use with `size_min` / `size_max`. Defaults to `"revenue"`. The underlying columns are stored upstream as character; this function casts to numeric inside the predicate so the filter pushes down to Arrow.
- size_min, size_max
Numeric. Optional inclusive bounds on the `size_metric` amount. `NULL` (default) leaves a side unbounded. Rows with `NA` for the chosen metric are dropped when either bound is set.
- min_last_year
Integer. If set, restricts results to EINs whose `last_year_in_bmf` (the calendar year of the most recent BMF vintage in which the EIN appeared) is at least this value. Use this as a recency / "still active" filter — e.g., `min_last_year = 2024` keeps organizations seen in BMF in 2024 or later.
- cache
Controls local caching of the master parquet. The S3 file is hundreds of MB; without caching each call re-downloads it. `TRUE` (default) caches in `tools::R_user_dir("nccsdata", "cache")`. A character path uses that directory instead. `FALSE` skips the cache and reads directly from S3 every call. See [nccs_cache_dir()] and [nccs_cache_clear()].
- cache_max_age
Integer. Maximum age in days before the cached parquet is considered stale and re-downloaded. Defaults to 30 (the master is rebuilt monthly upstream). Ignored when `cache = FALSE`.
- coerce
Logical. If `TRUE` (default), known character-typed financial, date, and indicator columns are coerced to their natural R types after collection (see "Column coercion" below). Set to `FALSE` to leave every column as published upstream. Only takes effect when `collect = TRUE`; with `collect = FALSE` the lazy Arrow query is returned untouched.
- columns
Column selection. `NULL` (default) returns a sensible default subset. A character vector returns those specific columns. `"all"` returns all columns (warning: 400+ MB). Columns used in active filters are always included.
- collect
Logical. If `TRUE` (default), collects the result into a tibble. If `FALSE`, returns a lazy Arrow query for further dplyr operations.
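The `org_type` and `size_min` / `size_max` rules described above can be pictured on a toy data frame. This is a minimal base R sketch, not the package's implementation: `org_type_mask()` and `size_mask()` are hypothetical helpers, the sample rows are made up, and the mapping from `size_metric` to a `<metric>_amount` column is an assumption based on the column names listed under "Column coercion".

```r
# Toy stand-in for a few BMF rows; amounts are character, as upstream.
bmf <- data.frame(
  ein             = c("01", "02", "03", "04", "05"),
  subsection_code = c("3", "3", "3", "4", "3"),
  foundation_code = c("15", "4", "16", "0", "3"),
  revenue_amount  = c("500000", "2500000", NA, "12000000", "1000000"),
  stringsAsFactors = FALSE
)

# Hypothetical helper mirroring the documented org_type cuts.
org_type_mask <- function(df, org_type = "all") {
  is_c3 <- df$subsection_code == "3"
  is_pf <- is_c3 & df$foundation_code %in% c("2", "3", "4")
  switch(org_type,
    all                = rep(TRUE, nrow(df)),
    "501c3"            = is_c3,
    public_charity     = ,
    pc                 = is_c3 & !is_pf,
    private_foundation = ,
    pf                 = is_pf,
    non_501c3          = !is_c3,
    stop("unknown org_type: ", org_type)
  )
}

# Hypothetical helper: cast the character amount to numeric, apply
# inclusive bounds, and drop NA amounts whenever a bound is set.
size_mask <- function(df, size_metric = "revenue",
                      size_min = NULL, size_max = NULL) {
  amount <- suppressWarnings(as.numeric(df[[paste0(size_metric, "_amount")]]))
  keep <- rep(TRUE, nrow(df))
  if (!is.null(size_min)) keep <- keep & !is.na(amount) & amount >= size_min
  if (!is.null(size_max)) keep <- keep & !is.na(amount) & amount <= size_max
  keep
}

pf_eins  <- bmf$ein[org_type_mask(bmf, "pf")]
mid_eins <- bmf$ein[size_mask(bmf, size_min = 1e6, size_max = 1e7)]
```

In this toy data, row `"05"` sits exactly on `size_min` and is kept because the bounds are inclusive, while row `"03"` is dropped because its amount is `NA`.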
Details
Reads the rolling "master" geocoded BMF at `s3://nccsdata/geocoding/bmf-master/merged/bmf_master_geocoded.parquet`. For a specific dated monthly snapshot, see [nccs_vintage_url()] — those artifacts are CSVs with per-vintage schemas and are not exposed through this function.
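The master parquet read here is the file that `cache` and `cache_max_age` manage. As a rough illustration of an age-based staleness rule, the base R sketch below treats a cached file as stale when it is missing or its modification time is more than `max_age_days` old. `cache_is_stale()` is a hypothetical helper, and whether the package compares with `>` or `>=`, or uses modification time at all, is an assumption.

```r
# Hypothetical helper: TRUE when the cached file is missing or older
# than max_age_days (measured from its modification time).
cache_is_stale <- function(path, max_age_days = 30L, now = Sys.time()) {
  if (!file.exists(path)) return(TRUE)
  age_days <- as.numeric(difftime(now, file.mtime(path), units = "days"))
  age_days > max_age_days
}

# Demo with a temporary file standing in for the cached parquet.
cached <- tempfile(fileext = ".parquet")
missing_before <- cache_is_stale(cached)   # file does not exist yet
invisible(file.create(cached))
fresh <- cache_is_stale(cached)            # just created

# Pretend the file was written 45 days ago.
invisible(Sys.setFileTime(cached, Sys.time() - 45 * 24 * 3600))
old <- cache_is_stale(cached)
```

Under these assumptions a freshly written file is reused, while one older than the 30-day default would trigger a re-download.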
Column coercion
The upstream BMF parquet stores most columns as `character` by design (vintage-stacking requires it). When `coerce = TRUE`, `nccs_read()` casts the following on the collected tibble:
- Numeric: `asset_amount`, `income_amount`, `revenue_amount`, `geo_score`, `geo_distance`.
- Date (`YYYY-MM-DD`): `ruling_date`, `tax_period_ymd`.
- Logical indicators (via [nccs_as_indicator()] with the `yn` scheme): `group_exemption_is_member`, `org_addr_is_po_box`, `org_addr_is_rural_route`, `org_addr_has_special_chars`, `org_addr_is_missing`, `org_addr_missing_number`, `org_addr_state_invalid`, `ruling_date_is_missing`, `tax_period_is_missing`, `in_care_of_name_provided`.
Code-like columns (e.g. `subsection_code`, `classification_code`, ZIPs) are intentionally left as `character` — they are identifiers, not numbers. Columns not in the result (because of `columns` selection) are silently skipped.
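The casts above can be approximated in base R. The sketch below is illustrative only: `coerce_sketch()` is a hypothetical helper, the sample rows are made up, and mapping `"Y"` / `"N"` to `TRUE` / `FALSE` is an assumption about what the `yn` scheme means rather than a description of [nccs_as_indicator()].

```r
# Toy input with every column stored as character, as upstream.
raw <- data.frame(
  revenue_amount     = c("1500000", "", NA),
  ruling_date        = c("1998-07-01", "2011-12-15", NA),
  org_addr_is_po_box = c("Y", "N", NA),
  subsection_code    = c("3", "3", "4"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: cast known columns, skip any that are absent,
# and leave everything else (e.g. code-like columns) as character.
coerce_sketch <- function(df,
                          numeric_cols = c("asset_amount", "income_amount",
                                           "revenue_amount", "geo_score",
                                           "geo_distance"),
                          date_cols = c("ruling_date", "tax_period_ymd"),
                          yn_cols = c("org_addr_is_po_box")) {
  for (col in intersect(numeric_cols, names(df))) {
    df[[col]] <- suppressWarnings(as.numeric(df[[col]]))
  }
  for (col in intersect(date_cols, names(df))) {
    df[[col]] <- as.Date(df[[col]], format = "%Y-%m-%d")
  }
  for (col in intersect(yn_cols, names(df))) {
    # Assumed yn mapping: "Y" -> TRUE, "N" -> FALSE, anything else -> NA.
    df[[col]] <- ifelse(df[[col]] == "Y", TRUE,
                        ifelse(df[[col]] == "N", FALSE, NA))
  }
  df
}

out <- coerce_sketch(raw)
str(out)
```

Using `intersect()` against `names(df)` is one simple way to get the "silently skipped" behavior for columns that were not selected.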
Examples
if (FALSE) { # \dontrun{
# All Pennsylvania nonprofits
pa <- nccs_read(state = "PA")

# Universities in PA seen in BMF in 2024 or later
pa_uni <- nccs_read(
  state = "PA",
  ntee_subsector = "Universities",
  min_last_year = 2024
)

# Arts orgs (major group A) with revenue between $1M and $10M
arts_mid <- nccs_read(
  ntee_major_group = "A",
  size_metric = "revenue",
  size_min = 1e6,
  size_max = 1e7
)

# Only 501(c)(3) private foundations in California
ca_pf <- nccs_read(state = "CA", org_type = "private_foundation")

# Lazy query for custom dplyr chains
query <- nccs_read(state = "PA", collect = FALSE)
result <- query |>
  dplyr::filter(geo_county == "Lackawanna County") |>
  dplyr::collect()
} # }