Introduction

This guide outlines some useful workflows for pulling data sets commonly used by the Urban Institute.

library(tidycensus)

library(tidycensus) by Kyle Walker (complete intro here) is the best tool for accessing some Census data sets in R from the Census Bureau API. The package returns tidy data frames and, when geometry = TRUE is added, attaches the matching shapefile geometry to each row.

You will need to apply for a Census API key and add it to your R session. Don’t add your API key to your script and don’t add it to a GitHub repository!
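
One way to keep the key out of scripts is to store it in an environment variable and read it at run time. The sketch below assumes the key is stored under the name CENSUS_API_KEY (the name passed to Sys.getenv() elsewhere in this guide), for example as a line in ~/.Renviron. The object names are illustrative.

```r
# Read the key from an environment variable instead of typing it into a script.
# Assumption: the key is stored as CENSUS_API_KEY, e.g. in ~/.Renviron.
census_key <- Sys.getenv("CENSUS_API_KEY")

# Sys.getenv() returns "" when the variable is not set
has_key <- nzchar(census_key)

if (!has_key) {
  message("CENSUS_API_KEY is not set. Add it to ~/.Renviron and restart R.")
}
```

tidycensus also provides census_api_key(), which can write the key to .Renviron when called with install = TRUE. Run it once in the console rather than saving the call in a script.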

Here is a simple example for one state with shapefiles:

library(tidyverse)
library(purrr)
library(tidycensus)

# pull median household income and shapefiles for Census tracts in Alabama
get_acs(geography = "tract",
        variables = "B19013_001",
        state = "01",
        year = 2015,
        geometry = TRUE,
        progress = FALSE)
Simple feature collection with 1181 features and 5 fields (with 1 geometry empty)
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -88.47323 ymin: 30.22333 xmax: -84.88908 ymax: 35.00803
Geodetic CRS:  NAD83
First 10 features:
         GEOID                                          NAME   variable
1  01003010500     Census Tract 105, Baldwin County, Alabama B19013_001
2  01003011501  Census Tract 115.01, Baldwin County, Alabama B19013_001
3  01009050500      Census Tract 505, Blount County, Alabama B19013_001
4  01015981901 Census Tract 9819.01, Calhoun County, Alabama B19013_001
5  01025957700     Census Tract 9577, Clarke County, Alabama B19013_001
6  01025958002  Census Tract 9580.02, Clarke County, Alabama B19013_001
7  01031011000      Census Tract 110, Coffee County, Alabama B19013_001
8  01033020500     Census Tract 205, Colbert County, Alabama B19013_001
9  01037961200      Census Tract 9612, Coosa County, Alabama B19013_001
10 01039961700  Census Tract 9617, Covington County, Alabama B19013_001
   estimate   moe                       geometry
1     41944  8100 MULTIPOLYGON (((-87.80249 3...
2     41417 14204 MULTIPOLYGON (((-87.71719 3...
3     40055  8054 MULTIPOLYGON (((-86.75735 3...
4        NA    NA MULTIPOLYGON (((-86.01323 3...
5     32708  4806 MULTIPOLYGON (((-88.1805 31...
6     29048 14759 MULTIPOLYGON (((-87.98623 3...
7     44732  7640 MULTIPOLYGON (((-85.92018 3...
8     49052  6543 MULTIPOLYGON (((-87.76733 3...
9     31957  9954 MULTIPOLYGON (((-86.46069 3...
10    32697  6021 MULTIPOLYGON (((-86.6998 31...
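
The moe column is the margin of error for each estimate. ACS margins of error are published at the 90 percent confidence level, so an approximate confidence interval is the estimate plus or minus moe. A minimal sketch using the first three rows of the output above, retyped by hand into a data frame with an illustrative name:

```r
# First three rows of the output above, retyped by hand (geometry omitted)
income <- data.frame(
  GEOID    = c("01003010500", "01003011501", "01009050500"),
  estimate = c(41944, 41417, 40055),
  moe      = c(8100, 14204, 8054),
  stringsAsFactors = FALSE
)

# Approximate 90 percent confidence bounds: estimate +/- margin of error
income$lower <- income$estimate - income$moe
income$upper <- income$estimate + income$moe

income
```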

Smaller geographies like Census tracts can only be pulled state-by-state. This example demonstrates how to iterate across FIPS codes to pull Census tracts for multiple states. The process is as follows:

  1. Pick the variables of interest
  2. Create a vector of state FIPS codes for the states of interest
  3. Create a custom function that works on a single state FIPS code
  4. Iterate the function along the vector of state FIPS codes with map_df() from library(purrr)
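
The four steps can be tried offline before any API calls are made. The sketch below swaps the real request for a stand-in function, get_stub(), which is hypothetical and only echoes its inputs, so it shows the iteration pattern and nothing else. It uses map() followed by list_rbind(), which purrr 1.0.0 and later recommend in place of the superseded map_df(). Both approaches row-bind the per-state results.

```r
library(purrr)

# 1. variables of interest
vars <- c("B19013_001")

# 2. states of interest: alabama, alaska, arizona
state_fips <- c("01", "02", "04")

# 3. stand-in for a function that queries one state (hypothetical; no API call)
get_stub <- function(fips) {
  data.frame(state = fips, variable = vars, stringsAsFactors = FALSE)
}

# 4. iterate over the states and row-bind the results into one data frame
result <- list_rbind(map(state_fips, get_stub))

result
```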

Here is an example that pulls median household income at the Census tract level for multiple states:

# variables of interest
vars <- c(
  "B19013_001"  # median household income estimate
)

# states of interest: alabama, alaska, arizona
state_fips <- c("01", "02", "04")
    
# create a custom function that works for one state
get_income <- function(state_fips) {

  income_data <- get_acs(geography = "tract",
                         variables = vars,
                         state = state_fips,
                         year = 2015)

  return(income_data)

}

# iterate the function
map_df(.x = state_fips, # iterate along the vector of state fips codes
       .f = get_income) # apply get_income() to each fips code
# A tibble: 2,874 × 5
   GEOID       NAME                                        varia…¹ estim…²   moe
   <chr>       <chr>                                       <chr>     <dbl> <dbl>
 1 01001020100 Census Tract 201, Autauga County, Alabama   B19013…   61838 11900
 2 01001020200 Census Tract 202, Autauga County, Alabama   B19013…   32303 13538
 3 01001020300 Census Tract 203, Autauga County, Alabama   B19013…   44922  5629
 4 01001020400 Census Tract 204, Autauga County, Alabama   B19013…   54329  7003
 5 01001020500 Census Tract 205, Autauga County, Alabama   B19013…   51965  6935
 6 01001020600 Census Tract 206, Autauga County, Alabama   B19013…   63092  9585
 7 01001020700 Census Tract 207, Autauga County, Alabama   B19013…   34821  7867
 8 01001020801 Census Tract 208.01, Autauga County, Alaba… B19013…   73728  2447
 9 01001020802 Census Tract 208.02, Autauga County, Alaba… B19013…   60063  8602
10 01001020900 Census Tract 209, Autauga County, Alabama   B19013…   41287  7857
# … with 2,864 more rows, and abbreviated variable names ¹​variable, ²​estimate

library(tidycensus) works well with library(tidyverse) and enables access to geospatial data, but it covers only some Census Bureau data sets. The next package has less functionality but can access any data set available on the Census API.


library(censusapi)

library(censusapi) by Hannah Recht (complete intro here) can access any table published through the Census Bureau API. A full listing is available here.

You will need to apply for a Census API key and add it to your R session. Don’t add your API key to your script and don’t add it to a GitHub repository!

Here is a simple example that pulls median household income and its margin of error for Census tracts in Alabama:

library(tidyverse)
library(purrr)
library(censusapi)

# variables of interest
vars <- c(
  "B19013_001E",  # median household income estimate
  "B19013_001M"   # median household income margin of error
)

getCensus(name = "acs/acs5",
          key = Sys.getenv("CENSUS_API_KEY"),
          vars = vars,
          region = "tract:*",
          regionin = "state:01",
          vintage = 2015) %>%
  as_tibble()
# A tibble: 1,181 × 5
   state county tract  B19013_001E B19013_001M
   <chr> <chr>  <chr>        <dbl>       <dbl>
 1 01    103    005109       29644        4098
 2 01    103    005106       35864        3443
 3 01    103    005107       66739        5468
 4 01    103    005108       64632        9804
 5 01    103    005701       46306        7926
 6 01    103    005702       47769       12939
 7 01    105    686800       30662        7299
 8 01    009    050102       43325        9484
 9 01    009    050300       37548        9655
10 01    009    050700       46452        5167
# … with 1,171 more rows
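
Unlike tidycensus, getCensus() returns the state, county, and tract codes in separate columns rather than as a single GEOID. Pasting the three together gives the 11-digit tract GEOID format shown in the tidycensus output earlier in this guide. A minimal sketch using the first three rows above, retyped by hand; the object name is illustrative:

```r
# First three rows of the output above, retyped by hand
tracts <- data.frame(
  state  = c("01", "01", "01"),
  county = c("103", "103", "103"),
  tract  = c("005109", "005106", "005107"),
  stringsAsFactors = FALSE
)

# 2-digit state + 3-digit county + 6-digit tract = 11-digit GEOID
tracts$GEOID <- paste0(tracts$state, tracts$county, tracts$tract)

tracts$GEOID
```

Keeping the codes as character strings matters here, because converting them to numbers would drop the leading zeros.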

Smaller geographies like Census tracts can only be pulled state-by-state. This example demonstrates how to iterate across FIPS codes to pull Census tracts for multiple states. The process is as follows:

  1. Pick the variables of interest
  2. Create a vector of state FIPS codes for the states of interest
  3. Create a custom function that works on a single state FIPS code
  4. Iterate the function along the vector of state FIPS codes with map_df() from library(purrr)
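
FIPS codes are character strings with leading zeros, and getCensus() takes the state through regionin as a string such as "state:01". A small sketch of steps 2 and 3, assuming the codes start out as numbers (which can happen when they are read from a spreadsheet). The object names are illustrative.

```r
# State FIPS codes stored as numbers lose their leading zeros
fips_numeric <- c(1, 2, 4)

# Pad to two digits so 1 becomes "01"
state_fips <- sprintf("%02d", as.integer(fips_numeric))

# Build the regionin strings for a tract-level request, one per state
regionin <- paste0("state:", state_fips)

regionin
```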

Here is an example that pulls median household income at the Census tract level for multiple states:

# variables of interest
vars <- c(
  "B19013_001E",  # median household income estimate
  "B19013_001M"   # median household income margin of error
)

# states of interest: alabama, alaska, arizona
state_fips <- c("01", "02", "04")
    
# create a custom function that works for one state
get_income <- function(state_fips) {

  income_data <- getCensus(name = "acs/acs5",
                           key = Sys.getenv("CENSUS_API_KEY"),
                           vars = vars,
                           region = "tract:*",
                           regionin = paste0("state:", state_fips),
                           vintage = 2015)

  return(income_data)

}

# iterate the function
map_df(.x = state_fips, # iterate along the vector of state fips codes
       .f = get_income) %>% # apply get_income() to each fips code
  as_tibble()
# A tibble: 2,874 × 5
   state county tract  B19013_001E B19013_001M
   <chr> <chr>  <chr>        <dbl>       <dbl>
 1 01    103    005109       29644        4098
 2 01    103    005106       35864        3443
 3 01    103    005107       66739        5468
 4 01    103    005108       64632        9804
 5 01    103    005701       46306        7926
 6 01    103    005702       47769       12939
 7 01    105    686800       30662        7299
 8 01    009    050102       43325        9484
 9 01    009    050300       37548        9655
10 01    009    050700       46452        5167
# … with 2,864 more rows