nccsdata Part 1: Downloading Data
Introduction
The nccsdata package equips users with tools to read, filter, and append metadata to publicly available NCCS Core and Business Master File (BMF) datasets. Its key features include the following:
- downloading legacy Core and BMF datasets from the NCCS Data Archive across multiple years to construct panel datasets for research
- providing information on the National Taxonomy of Exempt Entities (NTEE) used by the IRS and NCCS to classify nonprofits
- providing information on US census units that can be used to filter downloaded data based on geography
- constructing summary tables for downloaded data
In part one of this four-part series on the nccsdata package, we introduce the package and outline the process of downloading NCCS legacy data using the get_data() function. Parts two through four cover NTEE codes, census data, and summary tables.
Installation
You can install the development version of nccsdata directly from its GitHub repository with:
install.packages( "devtools" )
devtools::install_github( "UrbanInstitute/nccsdata" )
Next, load in the package with:
library( nccsdata )
Downloading Data
Use nccsdata to download legacy Core data from 1989 to 2019 for charities, nonprofits, or private foundations that have filed their required IRS forms, including Form 990, Form 990EZ, or both.
These data can be filtered by organization type, the type of IRS form filed, NTEE codes, and geographic units from the US census.
core_2005_nonprofit_pz <-
get_data( dsname = "core",
time = "2005",
scope.orgtype = "NONPROFIT",
scope.formtype = "PZ" )
#> Requested files have a total size of 82.6 MB. Proceed
#> with download? Enter Y/N (Yes/no/cancel)
tibble::as_tibble( core_2005_nonprofit_pz )
#> # A tibble: 157,211 × 150
#> NTEECC new.code type.org broad.category major.group univ hosp two.digit
#> <chr> <chr> <chr> <chr> <chr> <lgl> <lgl> <chr>
#> 1 J40 RG-HMS-J40 RG HMS J FALSE FALSE 40
#> 2 W30 RG-PSB-W30 RG PSB W FALSE FALSE 30
#> 3 W30 RG-PSB-W30 RG PSB W FALSE FALSE 30
#> 4 W30 RG-PSB-W30 RG PSB W FALSE FALSE 30
#> 5 W30 RG-PSB-W30 RG PSB W FALSE FALSE 30
#> 6 Y42 RG-MMB-Y42 RG MMB Y FALSE FALSE 42
#> 7 S41 RG-PSB-S41 RG PSB S FALSE FALSE 41
#> 8 N60 RG-HMS-N60 RG HMS N FALSE FALSE 60
#> 9 S41 RG-PSB-S41 RG PSB S FALSE FALSE 41
#> 10 S41 RG-PSB-S41 RG PSB S FALSE FALSE 41
#> # ℹ 157,201 more rows
#> # ℹ 142 more variables: further.category <int>, division.subdivision <chr>,
#> # broad.category.description <chr>, major.group.description <chr>,
#> # code.name <chr>, division.subdivision.description <chr>, keywords <chr>,
#> # further.category.desciption <chr>, ntee2.code <chr>, EIN <chr>,
#> # TAXPER <int>, STYEAR <int>, CONT <int>, DUES <int>, SECUR <int64>,
#> # SALESEXP <int64>, INVINC <int>, SOLICIT <int>, GOODS <int>, GRPROF <int>, …
core_2016_artnonprofits_newyork <-
get_data( dsname = "core",
time = "2016",
scope.orgtype = "NONPROFIT",
scope.formtype = "PZ",
ntee = "ART",
geo.state = "NY" )
#> Requested files have a total size of 113.6 MB. Proceed
#> with download? Enter Y/N (Yes/no/cancel)
tibble::as_tibble( core_2016_artnonprofits_newyork )
#> # A tibble: 346 × 168
#> NTEECC new.code type.org broad.category major.group univ hosp two.digit
#> <chr> <chr> <chr> <chr> <chr> <lgl> <lgl> <chr>
#> 1 A01 AA-ART-A00 AA ART A FALSE FALSE 1
#> 2 A01 AA-ART-A00 AA ART A FALSE FALSE 1
#> 3 A03 PA-ART-A00 PA ART A FALSE FALSE 3
#> 4 A03 PA-ART-A00 PA ART A FALSE FALSE 3
#> 5 A03 PA-ART-A00 PA ART A FALSE FALSE 3
#> 6 A03 PA-ART-A00 PA ART A FALSE FALSE 3
#> 7 A03 PA-ART-A00 PA ART A FALSE FALSE 3
#> 8 A03 PA-ART-A00 PA ART A FALSE FALSE 3
#> 9 A03 PA-ART-A00 PA ART A FALSE FALSE 3
#> 10 A03 PA-ART-A00 PA ART A FALSE FALSE 3
#> # ℹ 336 more rows
#> # ℹ 160 more variables: further.category <int>, division.subdivision <chr>,
#> # broad.category.description <chr>, major.group.description <chr>,
#> # code.name <chr>, division.subdivision.description <chr>, keywords <chr>,
#> # further.category.desciption <chr>, ntee2.code <chr>, EIN <int>,
#> # ACCPER <int>, ACTIV1 <dbl>, ACTIV2 <dbl>, ACTIV3 <dbl>, ADDRESS <chr>,
#> # AFCD <dbl>, ASS_BOY <dbl>, ASS_EOY <dbl>, BOND_BOY <dbl>, BOND_EOY <dbl>, …
Data are downloaded with the get_data() function. In this story, we provide several examples illustrating how this function can be used to retrieve these legacy data.
Downloading Core Data
With get_data(), we can define the type of data, the desired time range (in years), organization type, and form type using the arguments dsname, time, scope.orgtype, and scope.formtype, respectively.
The acceptable values for these arguments are as follows:
- dsname: The type of data to download ("core" or "bmf" in the examples in this story)
- time: Any year from 1989 to 2019 for which data are available (see the full catalog)
- scope.orgtype:
  - CHARITIES: All charities
  - NONPROFIT: All nonprofits
  - PRIVFOUND: All private foundations
- scope.formtype:
  - PC: Nonprofits that file the full IRS Form 990
  - EZ: Nonprofits that file 990EZs only
  - PZ: Nonprofits that file both full Form 990s and 990EZs
  - PF: Private foundation filings
For example, the code snippet below downloads NCCS core data from the year 2015 for all nonprofits that file both full 990s and 990EZs:
core <-
get_data( dsname = "core",
time = "2015",
scope.orgtype = "NONPROFIT",
scope.formtype = "PZ" )
Whenever get_data() is used, the user will encounter a prompt that reports the download size of the requested data and asks for permission to proceed. This allows the user to cancel downloads that are too large for their computer or internet connection before they start.
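Because the package is meant to support panel datasets spanning multiple years, it can help to see how yearly downloads might be stacked. The sketch below is one possible approach, not the package's own method: build_panel(), the panel.year column, and fake_fetch() are hypothetical names introduced here, and fake_fetch() stands in for get_data() so the example runs offline. It coerces EIN to character because the printed outputs above show EIN as <chr> in one year and <int> in another.

```r
# Hypothetical helper: fetch one data frame per year and stack them.
# `fetch` is any function that takes a year string and returns a data frame.
build_panel <- function(years, fetch) {
  yearly <- lapply(years, function(yr) {
    df <- as.data.frame(fetch(yr))
    df$EIN <- as.character(df$EIN)  # EIN type can differ between years
    df$panel.year <- yr             # hypothetical column recording the source year
    df
  })
  # Keep only the columns present in every year so rbind() succeeds
  shared <- Reduce(intersect, lapply(yearly, names))
  do.call(rbind, lapply(yearly, function(df) df[, shared, drop = FALSE]))
}

# Offline stand-in for get_data(); the values are made up for illustration
fake_fetch <- function(yr) {
  if (yr == "2014") {
    data.frame(EIN = c(111L, 222L), CONT = c(10L, 20L))
  } else {
    data.frame(EIN = c("111", "333"), CONT = c(15L, 30L), EXTRA = c(1L, 2L))
  }
}

panel <- build_panel(c("2014", "2015"), fake_fetch)
print(panel)
```

With the package loaded, fetch could instead be a wrapper such as function(yr) get_data(dsname = "core", time = yr, scope.orgtype = "NONPROFIT", scope.formtype = "PZ"); each call would then trigger its own download prompt.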
Filtering Data Downloads Using NTEE Codes
To further refine data downloads, get_data() can also pull only a subset of the data based on NTEE classifications using its various ntee-associated arguments, as shown in this example:
core_art <-
get_data( dsname = "core",
time = "2015",
scope.orgtype = "NONPROFIT",
scope.formtype = "PZ",
ntee = c("ART") )
In the above code snippet, we pull the same dataset but only select rows belonging to nonprofits involved in the Arts, Culture and Humanities. A full description of NTEE codes is available here.
The available ntee arguments are:
- ntee: Any valid full or partial NTEE code
- ntee.group: Level 1 of a full NTEE code
- ntee.code: Levels 2-4 of a full NTEE code
- ntee.orgtype: Level 5 of a full NTEE code
Part two of this series of data stories covers NTEE codes, their structures, and additional associated NTEE functions in greater detail.
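In the printed tibbles earlier in this story, the new.code column (for example "RG-HMS-J40") appears to combine the type.org, broad.category, and major.group values shown beside it. The sketch below splits a few of those printed codes in base R to illustrate that structure and then keeps the "ART" rows. It is an illustration of the code layout as it appears in the output, not the package's own parsing logic, and the parsed data frame is a hypothetical name.

```r
# new.code values copied from the printed output above
codes <- c("RG-HMS-J40", "RG-PSB-W30", "AA-ART-A00", "PA-ART-A00")

# Split each code on hyphens into a 3-column character matrix
parts <- do.call(rbind, strsplit(codes, "-", fixed = TRUE))

parsed <- data.frame(
  new.code       = codes,
  type.org       = parts[, 1],               # e.g. "RG"
  broad.category = parts[, 2],               # e.g. "HMS"
  major.group    = substr(parts[, 3], 1, 1), # first letter of the last part
  stringsAsFactors = FALSE
)

# Keep only rows in the Arts, Culture and Humanities broad category
art <- parsed[parsed$broad.category == "ART", ]
print(art)
```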
Filtering Data By Geography
We can filter the data by US census units through get_data()'s geo arguments. The code snippet below shows how to retrieve rows about nonprofits in New York City using geo.state and geo.city for state- and city-level filtering, respectively.
core_NYC <-
get_data( dsname = "core",
time = "2015",
scope.orgtype = "NONPROFIT",
scope.formtype = "PZ",
geo.state = "NY",
geo.city = "New York City" )
Additional geo arguments can be used to subset the data by county (geo.county) and region (geo.region).
To target a specific place, combine geo arguments. For example, for Allen County, Indiana, and San Francisco, California, we would use:
- geo.state = "IN", geo.county = "Allen" for "Allen, IN"
- geo.state = "CA", geo.city = "San Francisco" for "San Francisco, CA"
get_data() layers these filters to subset the data based on the desired geographic unit. If only one argument is used, it returns all rows falling within the requested geographic region (for example, geo.region = "south" returns all rows from the southern states, and geo.city = "Lebanon" returns all rows belonging to cities with the name "Lebanon").
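The layering behavior described above can be illustrated with a toy data frame in base R. This is only a sketch of the filtering semantics, not how get_data() is implemented: filter_geo(), the orgs data frame, and its column names are hypothetical and the rows are made up.

```r
# Made-up organizations; column names are illustrative only
orgs <- data.frame(
  name  = c("A", "B", "C", "D"),
  state = c("PA", "NH", "NY", "PA"),
  city  = c("Lebanon", "Lebanon", "New York", "Pittsburgh"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: each supplied filter narrows the rows further
filter_geo <- function(df, state = NULL, city = NULL) {
  keep <- rep(TRUE, nrow(df))
  if (!is.null(state)) keep <- keep & df$state %in% state
  if (!is.null(city))  keep <- keep & df$city %in% city
  df[keep, , drop = FALSE]
}

# City alone matches every city called Lebanon, in any state
lebanon_any <- filter_geo(orgs, city = "Lebanon")

# Adding the state narrows the result to Lebanon, PA
lebanon_pa <- filter_geo(orgs, state = "PA", city = "Lebanon")

print(lebanon_any)
print(lebanon_pa)
```

The first call returns organizations A and B, while the second returns only A, which is why combining arguments matters when a city or county name is shared across states.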
For more in-depth information on these geographic filters and additional geography-related functions, refer to part three of this series.
Appending BMF Data to Core Data
get_data() automatically appends NTEE metadata to the requested dataset and can be configured to append BMF data to any downloaded Core dataset.
Appending metadata from the IRS Business Master File (BMF) requires downloading an additional 185 MB and can be toggled on or off with append.bmf.
corebmf <-
get_data( dsname = "core",
time = "2015",
scope.orgtype = "NONPROFIT",
scope.formtype = "PZ",
append.bmf = TRUE )
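Conceptually, appending BMF metadata resembles joining two tables on a shared organization identifier. The sketch below assumes the join key is EIN and that every Core row is kept (a left join); both are assumptions made for illustration, since this story does not describe how the package performs the append. The data frames and the NAME column are made up.

```r
# Toy Core and BMF tables; values and the NAME column are hypothetical
core_toy <- data.frame(
  EIN  = c("111", "222", "333"),
  CONT = c(10L, 20L, 30L),
  stringsAsFactors = FALSE
)
bmf_toy <- data.frame(
  EIN  = c("111", "333", "444"),
  NAME = c("Org One", "Org Three", "Org Four"),
  stringsAsFactors = FALSE
)

# Left join on EIN (assumed key): keep every Core row, add BMF columns
merged <- merge(core_toy, bmf_toy, by = "EIN", all.x = TRUE)
print(merged)
```

Core rows without a BMF match (EIN "222" here) keep their Core values and receive NA in the BMF columns.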
Downloading BMF Data
The geo and ntee arguments discussed earlier can also be applied to download and filter BMF data. The code snippet below returns a subset of the BMF for California-based nonprofits in the Arts, Culture, and Humanities group:
bmf <-
get_data( dsname = "bmf",
ntee = c("ART"),
geo.state = c("CA") )
Conclusion
With the get_data() function, researchers familiar with R can easily access NCCS data for use in their work. Further details about the package are available on the official nccsdata package website.