nccsdata provides tools to download, filter, and analyze nonprofit organization data from the National Center for Charitable Statistics (NCCS). It reads IRS Business Master File (BMF) data stored as parquet files in a public S3 bucket, with support for predicate-pushdown filtering by state, county, NTEE subsector, and exempt organization type.
Note: This is version 2.0.0, a ground-up rewrite of the package. The v1 API (get_data(), preview_sample(), parse_ntee()) has been replaced. See the migration section below.
Installation
Install the development version from GitHub:
# install.packages("devtools")
devtools::install_github("UrbanInstitute/nccsdata")
Usage
Reading BMF data
nccs_read() downloads BMF data from S3 with optional filters. Filtering happens at the Arrow level via predicate pushdown, so only matching rows are read into memory.
library(nccsdata)
# All Pennsylvania nonprofits (default columns)
pa <- nccs_read(state = "PA")
# Arts nonprofits in New York
ny_arts <- nccs_read(state = "NY", ntee_subsector = "ART")
# Select specific columns
pa_slim <- nccs_read(
  state = "PA",
  columns = c("ein", "org_name_display", "geo_county", "income_amount")
)
# Lazy query for custom dplyr pipelines
query <- nccs_read(state = "PA", collect = FALSE)
result <- query |>
  dplyr::filter(geo_county == "Lackawanna County") |>
  dplyr::collect()
Summarizing data
nccs_summary() produces grouped count summaries from a collected data frame.
pa <- nccs_read(state = "PA")
# Total count
nccs_summary(pa)
# Count by county
nccs_summary(pa, group_by = "geo_county")
# Count by county and subsector, export to CSV
nccs_summary(pa, group_by = c("geo_county", "nteev2_subsector"),
             output_csv = "pa_counts.csv")
Discovering valid filter values
nccs_catalog() lists valid values for nccs_read() filters without any network calls.
nccs_catalog("state")
nccs_catalog("ntee_subsector")
nccs_catalog("exempt_org_type")
# Pass `labels = TRUE` for a code + description tibble, sourced from the
# bundled BMF lookup tables.
nccs_catalog("ntee_subsector", labels = TRUE)
nccs_catalog("foundation_code", labels = TRUE)
Cleaning external data
The BMF returned by nccs_read() is already normalized upstream, but two helpers are exposed for users joining external CSVs or API responses against it:
# Coerce EINs in any format to canonical XX-XXXXXXX
nccs_normalize_ein(c("123456789", "12-3456789", 1234567))
#> [1] "12-3456789" "12-3456789" "00-1234567"
# Coerce IRS binary-indicator columns to logical
nccs_as_indicator(c("Y", "N", "1", "2"))
#> [1] TRUE FALSE TRUE FALSE
# e-file indicator accepts E/P (2015, 2018+) and Y/N (2016-2017)
nccs_as_indicator(c("E", "P", "Y", "N"), scheme = "efile")
Browsing the data dictionary
nccs_dictionary() returns a tibble describing all BMF columns, with optional pattern filtering.
# All columns
nccs_dictionary()
# Find geocoding-related columns
nccs_dictionary("geo")
# Find NTEE-related columns
nccs_dictionary("ntee")
Scope and design
nccsdata is intentionally a lean reader. A few principles shape what is, and is not, in the package:
- No re-cleaning of upstream data. The BMF and CORE Series parquet files are cleaned by the sibling ETL pipelines (nccs-data-bmf, nccs-data-core). EIN normalization, NTEE decoding, geocoding, and subsection labeling are done before publication. We don’t re-implement them here.
- The two exceptions are helpers for external data. nccs_normalize_ein() and nccs_as_indicator() exist so you can bring your own CSVs (member rosters, survey extracts, donor lists) into the same shape as the package’s output before joining.
- One opinionated analytic helper: inflation adjustment. nccs_deflate() and the bundled annual cpi_u series are included because real-dollar conversion needs a reference table the user otherwise has to fetch themselves, and the conversion itself is mechanical and uncontroversial.
- No canonical financial ratios. Operating margin, program-expense ratio, fundraising efficiency, months of operating reserves, and similar measures are deliberately not bundled. Their definitions vary by analyst (which numerator, which denominator, which exclusions), and shipping one canonical version would make the package take editorial sides. They’re also one-line mutate() calls on the columns CORE already provides.
- Lean dependencies. Hard imports are arrow, dplyr, utils. Anything heavier (sf, tigris, ggplot2, data.table) belongs in vignettes that show how to combine nccsdata with those packages, not as a dependency.
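The external-data case in the list above can be pictured with a small base R sketch. The function normalize_ein_sketch() and both toy tables are hypothetical stand-ins, not the package's implementation. The sketch assumes that normalization means stripping non-digits, left-padding to nine digits, and hyphenating after the second digit, which is consistent with the nccs_normalize_ein() output shown earlier.

```r
# Hypothetical helper for illustration only; in practice use nccs_normalize_ein().
normalize_ein_sketch <- function(x) {
  # Keep digits only (drops hyphens, spaces, and other separators)
  digits <- gsub("[^0-9]", "", as.character(x))
  # Left-pad with zeros to nine digits
  padded <- paste0(strrep("0", pmax(0, 9 - nchar(digits))), digits)
  # Assumption: empty or over-long inputs cannot be valid EINs
  padded[nchar(digits) == 0 | nchar(digits) > 9] <- NA_character_
  # Insert the hyphen after the two-digit prefix: XX-XXXXXXX
  ifelse(is.na(padded), NA_character_,
         paste0(substr(padded, 1, 2), "-", substr(padded, 3, 9)))
}

# Toy external roster with inconsistent EIN formats (made-up values)
roster <- data.frame(
  ein = c("123456789", "1234567"),
  member_tier = c("gold", "silver"),
  stringsAsFactors = FALSE
)

# Toy stand-in for BMF output, which already uses the canonical format
bmf_toy <- data.frame(
  ein = c("12-3456789", "00-1234567"),
  org_name_display = c("Example Org One", "Example Org Two"),
  stringsAsFactors = FALSE
)

# Normalize first, then join on the shared key
roster$ein <- normalize_ein_sketch(roster$ein)
joined <- merge(roster, bmf_toy, by = "ein")
```

Normalizing before the join matters because a character key such as "123456789" would otherwise never match "12-3456789", and the join would silently drop those rows.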
If you want to build analytic functionality on top of this package, the right pattern is a downstream package or notebook that imports nccsdata and adds your team’s preferred ratio definitions.
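As a minimal sketch of that downstream pattern, the ratios mentioned above can be written as ordinary mutate() calls. The column names (total_revenue, total_expenses, program_expenses) and the figures are hypothetical placeholders, not actual CORE column names or data, and the two definitions shown are only one of several reasonable choices.

```r
library(dplyr)

# Toy stand-in for a CORE extract; column names and values are hypothetical
core_toy <- tibble(
  ein = c("12-3456789", "00-1234567"),
  total_revenue = c(500000, 120000),
  total_expenses = c(450000, 150000),
  program_expenses = c(360000, 90000)
)

# One team's preferred definitions, kept outside nccsdata
ratios <- core_toy |>
  mutate(
    # Surplus or deficit as a share of revenue
    operating_margin = (total_revenue - total_expenses) / total_revenue,
    # Share of spending that goes to programs
    program_expense_ratio = program_expenses / total_expenses
  )
```

Keeping these definitions in your own code makes the choice of numerator and denominator explicit and easy to change.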
Migrating from v1
| v1 function | v2 replacement |
|---|---|
| get_data() | nccs_read() |
| preview_sample() | nccs_summary() |
| ntee_preview() / parse_ntee() | nccs_catalog("ntee_subsector") |
Key changes:
- Data source moved from legacy Core/BMF CSVs to geocoded BMF parquet files on S3.
- Filtering now uses Arrow predicate pushdown instead of downloading full files.
- Dependencies reduced from 12 packages to 3 (arrow, dplyr, utils).
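To illustrate the predicate-pushdown change in general terms, the sketch below uses arrow and dplyr directly on a small local parquet file. It is not the package's internal code: the toy table, its state column, and the temporary file are assumptions made for the example, and the real data lives on S3 with its own schema.

```r
library(arrow)
library(dplyr)

# Toy table standing in for the BMF (made-up values, hypothetical columns)
toy <- data.frame(
  ein = c("12-3456789", "98-7654321", "11-1111111"),
  state = c("PA", "NY", "PA"),
  stringsAsFactors = FALSE
)

# Write it to a temporary parquet file
path <- tempfile(fileext = ".parquet")
write_parquet(toy, path)

# Opening the dataset reads only metadata; no rows are loaded yet
ds <- open_dataset(path)

# The filter is recorded lazily and evaluated by Arrow during the scan
query <- ds |> filter(state == "PA")

# collect() runs the scan and materializes only the matching rows
pa_rows <- collect(query)
```

The same idea underlies the collect = FALSE option shown earlier: the filter is described first and the data is read only when collect() is called.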
Documentation
Full documentation is available at https://urbaninstitute.github.io/nccsdata/.
Getting help
- Browse the getting started vignette
- Open an issue on GitHub
- Contact the maintainer at tpoongundranar@urban.org