Overview
The nccsdata package provides access to nonprofit
organization data from the National Center for Charitable Statistics
(NCCS). It reads IRS Business Master File (BMF) data stored as parquet
files in a public S3 bucket, with support for efficient filtering by
state, county, NTEE subsector, and exempt organization type.
The package requires no API keys or authentication — the data is publicly accessible.
Exploring the Data Dictionary
Before querying data, you can explore the 97 available columns using
nccs_dictionary():
library(nccsdata)
# See all available columns
nccs_dictionary()
# Find geocoding-related columns
nccs_dictionary("geo")
# Find NTEE classification columns
nccs_dictionary("ntee")
Discovering Filter Values
Use nccs_catalog() to see the valid values for each
filter before querying:
# NTEE v2 subsector codes
nccs_catalog("ntee_subsector")
#> [1] "ART" "EDU" "ENV" "HEL" "HMS" "HOS" "IFA" "MMB" "PSB" "REL" "UNI" "UNU"
# State and territory codes
nccs_catalog("state")
# Exempt organization types (e.g., 501(c)(3), 501(c)(4), etc.)
nccs_catalog("exempt_org_type")
Reading Data
The core function is nccs_read(), which reads BMF data
from S3 with predicate-pushdown filtering for efficient reads.
Filter by Organization Type
The org_type argument bundles the most common 501(c)(3)
cuts so you don’t have to manage subsection_code and
foundation_code by hand:
# All 501(c)(3) public charities
pc <- nccs_read(org_type = "public_charity")
# California private foundations
ca_pf <- nccs_read(state = "CA", org_type = "private_foundation")
# Everything that is not a 501(c)(3)
non_c3 <- nccs_read(org_type = "non_501c3")
Pass org_type = "501c3" to keep both public charities
and private foundations, or "all" (default) to skip the
filter entirely.
Selecting Columns
The BMF parquet file contains 97 columns and is over 400 MB. By
default, nccs_read() returns a curated subset of commonly
needed columns. You can customize this:
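A minimal sketch of column selection follows. It assumes nccs_read() accepts a columns argument like the one nccs_read_core() takes; that argument name is an assumption, so confirm it with ?nccs_read. The column names are ones mentioned elsewhere in this vignette.

```r
library(nccsdata)

# ASSUMPTION: `columns` mirrors the argument of nccs_read_core();
# confirm the real argument name with ?nccs_read.
ca_small <- nccs_read(
  state = "CA",
  columns = c("geo_county", "nteev2_subsector", "asset_amount")
)

# Inspect which columns came back
names(ca_small)
```

Requesting only the columns an analysis needs keeps the collected tibble small, which matters for a file of this size.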
Lazy Evaluation
Set collect = FALSE to get a lazy Arrow query instead of
a tibble. This is useful for building custom dplyr chains before
collecting:
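A hedged sketch of a lazy query is shown here. It uses only the state and collect arguments described in this vignette, and assumes that geo_county and nteev2_subsector are present in the lazy query and that "EDU" is a value of nteev2_subsector. The dplyr package must be installed.

```r
library(nccsdata)

# Build a lazy Arrow query; nothing is pulled into memory yet
pa_query <- nccs_read(state = "PA", collect = FALSE)

# Chain dplyr verbs, then collect() to materialise a tibble
edu_by_county <- pa_query |>
  dplyr::filter(nteev2_subsector == "EDU") |>
  dplyr::count(geo_county) |>
  dplyr::collect()

head(edu_by_county)
```

Only the rows and columns that survive the chain are brought into R when collect() runs.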
Column Types
The published BMF parquet stores most columns as
character (the upstream pipeline stacks vintages and writes
them out as strings). By default nccs_read() coerces
well-known columns to their natural R types on the collected tibble:
- asset_amount, income_amount, revenue_amount, geo_score, geo_distance → numeric.
- ruling_date, tax_period_ymd → Date.
- The org_addr_is_*, ruling_date_is_missing, tax_period_is_missing, and group_exemption_is_member style flags → logical (via nccs_as_indicator()).
ID-like codes (subsection_code,
classification_code, ZIPs, NTEE codes) stay as
character — they’re identifiers, not numbers. Pass
coerce = FALSE to skip all coercion and get every column
exactly as published.
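To make the coercion concrete, here is a base R sketch on a toy data frame. The values are made up, and the string encodings (ISO dates, "0"/"1" flags) are assumptions; nccs_as_indicator() may apply a different rule to the published data.

```r
# Toy stand-in for an uncoerced read: every column arrives as character.
# Values and formats are hypothetical.
raw <- data.frame(
  subsection_code = c("03", "04"),
  asset_amount = c("125000", "980"),
  ruling_date = c("1999-05-01", NA),
  ruling_date_is_missing = c("0", "1"),
  stringsAsFactors = FALSE
)

typed <- raw
typed$asset_amount <- as.numeric(raw$asset_amount)
# Assumes ISO-formatted date strings
typed$ruling_date <- as.Date(raw$ruling_date)
# Assumes flags are stored as "0"/"1"
typed$ruling_date_is_missing <- raw$ruling_date_is_missing == "1"

# subsection_code is left as character: "03" must not become 3
sapply(typed, class)
```

Keeping codes as character preserves leading zeros, which is the reason identifiers are excluded from coercion.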
Local Caching
The master BMF parquet is hundreds of MB. nccs_read()
caches it in tools::R_user_dir("nccsdata", "cache") on
first use and reuses the local copy on subsequent calls (in the same
session and across sessions). The cached file is refreshed automatically
when it’s older than cache_max_age days (30 by default,
matching the upstream monthly rebuild cadence). Disable with
cache = FALSE, or force a refresh:
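The cache location can also be inspected with base R. The sketch below lists cached files and their age in days. The commented-out unlink() call is a generic fallback rather than the package's own refresh switch, and it assumes the cache consists of ordinary files in that directory.

```r
# Directory that nccs_read() is described as caching into (requires R >= 4.0)
cache_dir <- tools::R_user_dir("nccsdata", "cache")

# List whatever is cached there (empty if nothing has been downloaded yet)
cached <- list.files(cache_dir, full.names = TRUE)

# Age of each cached file in days, for comparison with cache_max_age
age_days <- as.numeric(difftime(Sys.time(), file.mtime(cached), units = "days"))
data.frame(file = basename(cached), age_days = round(age_days, 1))

# Generic fallback (assumption): removing the files should make the next
# read download a fresh copy.
# unlink(cached)
```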
Reading Historical Vintages
nccs_read() reads the rolling geocoded master, which the
upstream pipeline rebuilds every month. For analyses that need a
specific dated snapshot (papers, dashboards, regulatory work), use
nccs_vintage_url() to get the S3 URI for that month, then
read the CSV directly:
uri <- nccs_vintage_url("2023_07")
dict <- arrow::read_csv_arrow(
  nccs_vintage_url("2023_07", kind = "dictionary")
)
bmf_2023_07 <- arrow::read_csv_arrow(uri)
Vintage snapshots are not geocoded; they carry raw Form 990
address columns instead of the master’s geo_* columns.
Schemas also drift across vintages and across the legacy seam
(pre-modern vintages live under a bmf-legacy path; pass
legacy = TRUE to nccs_vintage_url()). Always
inspect the matching data dictionary before filtering or joining.
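One way to check for schema drift before joining is to compare column names across the two snapshots. The sketch below uses toy data frames with hypothetical column names. Real snapshots would come from arrow::read_csv_arrow() as shown above.

```r
# Toy stand-ins for two vintages; column names and values are hypothetical
bmf_a <- data.frame(
  ein = c("010000001", "010000002"),
  name = c("A", "B"),
  state = c("PA", "CA")
)
bmf_b <- data.frame(
  ein = c("010000002", "010000003"),
  org_name = c("B", "C"),
  state = c("CA", "NY")
)

# Columns common to both snapshots, and columns unique to each
shared <- intersect(names(bmf_a), names(bmf_b))
only_a <- setdiff(names(bmf_a), names(bmf_b))
only_b <- setdiff(names(bmf_b), names(bmf_a))

# Join on a key present in both, keeping only the shared columns
merged <- merge(bmf_a[shared], bmf_b[shared], by = "ein", suffixes = c("_a", "_b"))
```

Columns that appear in only_a or only_b are the ones to look up in the matching data dictionary before deciding whether they represent the same field under a different name.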
Form 990 Filings (CORE Series)
The BMF is one row per registered organization. To work with one row per filing, use the CORE Series — Form 990 filings harmonized into a single schema across vintages. Three tiers are published:
- "merged" (canonical, default): legacy + SOI-current merged on (ein, tax_period). One row per (ein, tax_period), 1987-2024, forms 990combined and 990pf. Deduplicated: keeps the first occurrence per key.
- "soi": IRS SOI annual extracts, harmonized. 2012-2024, forms 990, 990ez, 990pf, 990combined. Carries an is_amendment flag if you need to separate originals from revisions.
- "legacy": NCCS legacy CORE files, harmonized. 1987-2011, forms 990combined and 990pf.
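The "keeps the first occurrence per key" rule for the merged tier can be illustrated in base R on toy rows. This is only a sketch of the idea; how the upstream pipeline orders rows before deduplicating is not described here, and the values are made up.

```r
# Toy filings: the first two rows share the same (ein, tax_period) key
filings <- data.frame(
  ein = c("010000001", "010000001", "010000002"),
  tax_period = c("202012", "202012", "202012"),
  total_revenue = c(100, 120, 50),
  stringsAsFactors = FALSE
)

# duplicated() flags every repeat after the first row for each key
keep <- !duplicated(filings[c("ein", "tax_period")])
merged_like <- filings[keep, ]
merged_like
```

Because later duplicates are dropped, the "soi" tier with its is_amendment flag is the place to look when amended filings matter.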
# Inspect a partition's columns before reading
dict <- nccs_core_columns("merged", 2020, "990combined")
head(dict[, c("harmonized_name", "data_type", "description")])
# Read one partition with column projection
df <- nccs_read_core(
  tier = "merged",
  tax_year = 2020,
  form = "990combined",
  columns = c("ein", "tax_period", "total_revenue", "total_expenses")
)
# Or build a lazy query for custom filters
nccs_read_core("merged", 2020, "990combined", collect = FALSE) |>
  dplyr::filter(subsection_cd == 3) |>
  dplyr::select(ein, tax_period, total_revenue) |>
  dplyr::collect()
For multi-year analyses, pass a vector to tax_year. The
function stacks the requested partitions into one Arrow dataset (so
filters and column projection still push down across all of them) and,
in an interactive session, reports the total download size before
fetching any partitions it doesn’t already have cached.
panel <- nccs_read_core(
  tier = "merged",
  tax_year = 2015:2022,
  form = "990combined",
  columns = c("ein", "tax_period", "total_revenue", "total_expenses")
)
nccs_core_coverage() returns row counts for every
published partition of a tier, which is useful for sanity checks and
for quickly spotting coverage gaps (e.g. SOI’s missing 2017-2019 990pf
partitions).
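Spotting a gap in a coverage table can be sketched as follows. The table is a toy stand-in: its column names and row counts are hypothetical and may not match what nccs_core_coverage() returns. Only the 2017-2019 gap itself comes from the text above.

```r
# Hypothetical coverage table for one form; row counts are made up
coverage <- data.frame(
  tax_year = c(2014, 2015, 2016, 2020, 2021),
  form = "990pf",
  n_rows = c(90, 95, 97, 110, 112)
)

# Every year between the first and last published partition
expected <- seq(min(coverage$tax_year), max(coverage$tax_year))

# Years with no partition at all
missing_years <- setdiff(expected, coverage$tax_year)
missing_years
```

Running a check like this before building a panel avoids silently treating an unpublished year as a year with no filings.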
Summarizing
nccs_summary() produces grouped count summaries:
pa <- nccs_read(state = "PA")
# Total count
nccs_summary(pa)
# Count by county
nccs_summary(pa, group_by = "geo_county")
# Count by county and NTEE subsector
nccs_summary(pa, group_by = c("geo_county", "nteev2_subsector"))
Saving Results
Write summary results to CSV:
pa <- nccs_read(
  state = "PA",
  county = c("Lackawanna County", "Luzerne County", "Wayne County")
)
nccs_summary(
  pa,
  group_by = c("geo_county", "nteev2_subsector"),
  output_csv = "nepa_nonprofit_counts.csv"
)
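For intuition about what a grouped count written to CSV looks like, here is a base R sketch on toy rows. It does not reproduce the exact output of nccs_summary(), whose column names may differ. The count column n and the temporary file path are illustrative choices.

```r
# Toy rows standing in for nccs_read() output
orgs <- data.frame(
  geo_county = c("Luzerne County", "Luzerne County", "Wayne County"),
  nteev2_subsector = c("EDU", "EDU", "HEL"),
  stringsAsFactors = FALSE
)

# One row per (county, subsector) pair with its count
counts <- aggregate(
  n ~ geo_county + nteev2_subsector,
  data = transform(orgs, n = 1L),
  FUN = sum
)

# Write to a temporary CSV and read it back to confirm the round trip
path <- tempfile(fileext = ".csv")
write.csv(counts, path, row.names = FALSE)
back <- read.csv(path, stringsAsFactors = FALSE)
back
```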