Introduction


This guide will teach you the concepts and code you will need for mapping and geospatial analysis in R. This is a long guide, so if you need something specific, we encourage you to scroll to the appropriate section using the Table of Contents on the left. If you just want copy and pasteable code to create different kinds of maps, head to the Map Gallery. Now let’s start mapping!

Geospatial Workflow

This picture below outlines what we think are the main steps in a geospatial workflow. This guide will be split into sections describing each of the steps.

Should this be a map?

The Urban Institute Data Visualization Style Guide offers some blunt but useful suggestions for maps:

Just because you’ve got geographic data, doesn’t mean that you have to make a map. Many times, there are more efficient storyforms that will get your point across more clearly. If your data shows a very clear geographic trend or if the absolute location of a place or event matters, maps might be the best approach, but sometimes the reflexive impulse to map the data can make you forget that showing the data in another form might answer other—and sometimes more important—questions.

So we would encourage you to think critically before making a map.

Why map with R?

R can have a steeper learning curve than point-and-click tools - like QGIS or ArcGIS - for geospatial analysis and mapping. But creating maps in R has many advantages including:

  1. Reproducibility: By creating maps with R code, you can easily share the outputs and the code that generated the output with collaborators, allowing them to replicate your work and catch errors easily.

  2. Iteration: With point and click software like ArcGIS, making 50 maps would be 50 times the work/time. But using R, we can easily make make many iterations of the same map with a few changes to the code.

  3. Easy Updates: Writing code provides a roadmap for others (and future you!) to quickly update parts of the map as needed. Say for example a collaborator wanted to change the legend colors of 50 state maps. With R, this is possible in just a few seconds!

  4. An Expansive ecosystem: There are several R packages that make it very easy to get spatial data, create static and interactive maps, and perform spatial analyses. This feature rich package ecosystem which all play nice together is frankly unmatched by other programming languages and even point and click tools like QGIS and ArcGIS. Some of these R packages include:

    • sf: For managing and analyzing spatial dataframes
    • tigris: For downloading in Census geographies
    • ggplot2: For making publication ready static maps
    • urbnmapr: For automatically adding Urban styling to static maps
    • mapview: For making expxploratory interactive maps
  5. Cost: Most point-and-click tools for geospatial analysis are proprietary and expensive. R is free open-source software. The software and most of its packages can be used for free by anyone for almost any use case.

Helpful Learning Resources

In addition to this guide, you may want to look at these other helpful resources:

Get Spatial Data


library(sf)

The short version

library(sf) stores geospatial data, which are points (a single longitude/latitude), lines (a pair of connected points), or polygons (a collection of points which make a polygon) in a geometry column within R dataframes

This is what sf dataframe looks like in the console:

dc_parks <- st_read("mapping/data/dc_parks.geojson", 
                    quiet = TRUE)

# Print just the NAME and geometry column
dc_parks %>%
  select(NAME) %>%
  head(2)
## Simple feature collection with 2 features and 1 field
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -77.01063 ymin: 38.81718 xmax: -76.9625 ymax: 38.89723
## Geodetic CRS:  WGS 84
##                           NAME                       geometry
## 1 Kingman and Heritage Islands MULTIPOLYGON (((-76.96566 3...
## 2              Bald Eagle Hill MULTIPOLYGON (((-77.01063 3...

The long version

The sf library is a key tool for reading in, managing, and working with spatial data in R. sf stands for simple features (not San Francisco you Bay Area folks) and denotes a way to describe the spatial attributes of real life objects. The R object you will be working with most frequently for mapping is an sf dataframe. An sf dataframe is essentially a regular R dataframe, with a couple of extra features for use in mapping. These extra features exclusive to sf dataframes include:

  • sticky geometry columns
  • attached coordinate reference systems
  • some other spatial metadata

The most important of the above list is the sticky geometry column, which is a magical column that contains all of the geographic information for each row of data. Say for example you had a sf dataframe of all DC census tracts. Then the geometry column would contain all of the geographic points used to define DC census tract polygons. The stickiness of this column means that no matter what data munging/filtering you do, you will not be able to drop or delete the geometry column. Below is a graphic to help you understand this:

credits: @allisonhorst

This is what an sf dataframe looks like in the console:

# Read in spatial data about DC parks from DC Open Data Portal
dc_parks  <- st_read("https://opendata.arcgis.com/api/v3/datasets/287eaa2ecbff4d699762bbc6795ffdca_9/downloads/data?format=geojson&spatialRefId=4326",
                    quiet = TRUE)

# dc_parks <- st_read("mapping/data/dc_parks.geojson")

# Select just a few columns for readability
dc_parks <- dc_parks %>%
  select(NAME, geometry)

# Print to the console
dc_parks
## Simple feature collection with 256 features and 1 field
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -77.11113 ymin: 38.81718 xmax: -76.91108 ymax: 38.98811
## Geodetic CRS:  WGS 84
## First 10 features:
##                           NAME                       geometry
## 1              Plymouth Circle MULTIPOLYGON (((-77.04677 3...
## 2       Triangle Park RES 0566 MULTIPOLYGON (((-77.04481 3...
## 3               Shepherd Field MULTIPOLYGON (((-77.03528 3...
## 4  Marvin Caplan Memorial Park MULTIPOLYGON (((-77.03027 3...
## 5             Pinehurst Circle MULTIPOLYGON (((-77.06643 3...
## 6   Triangle Park 3278    0801 MULTIPOLYGON (((-77.01759 3...
## 7                 Fort Stevens MULTIPOLYGON (((-77.02988 3...
## 8     Takoma Recreation Center MULTIPOLYGON (((-77.01794 3...
## 9      Takoma Community Center MULTIPOLYGON (((-77.01716 3...
## 10      Triangle Park RES 0648 MULTIPOLYGON (((-77.03362 3...

Note that there is some spatial metadata such as the Geometry Type, Bounding Box, and CRS which shows up as a header before the actual contents of the dataframe.

Since sf dataframes operate similarly to regular dataframes, we can use all our familiar tidyverse functions for data wrangling, including select, filter, rename, mutate, group_by and summarize. The sf package also has many functions that provide easy ways to replicate common tasks done in other GIS software like spatial joins, clipping, and buffering. Almost all of the mapping and geospatial analysis methods described in this guide rely on you having an sf dataframe. So let’s talk about how to get one!

Importing spatial data

Getting an sf dataframe is always the first step in the geospatial workflow. Here’s how to import spatial data for…

States and counties

We highly recommend using the library(urbnmapr) package, which was created by folks here at Urban to easily create state and county level maps. The get_urbn_map() function in the package allows you to read in spatial data on states and counties, with options to include territories. Importantly, it will also display AL and HI as insets on the map in accordance with the Urban Institute Data Visualization Style Guide. For information on how to install urbnmapr, see the GitHub repository.

Below is an example of how you would use urbnmapr to get an sf dataframe of all the states or counties in the US.

library(urbnmapr)

# Get state data
states <- get_urbn_map("states", sf = TRUE)

# Can also get county data
counties <- get_urbn_map("counties", sf = TRUE)

Other Census geographies

Use the library(tigris) package, which allows you to easily download TIGER and other cartographic boundaries from the US Census Bureau. In order to automatically load in the boundaries as sf objects, run once per R session.

library(tigris) has all the standard census geographies, including census tracts, counties, CBSAs, ZCTAs, congressional districts, tribal areas, and more. It also includes other elements such as water, roads, and military bases.

By default, libraray(tigris) will download large very large and detailed TIGER line boundary files. For thematic mapping, the smaller cartographic boundary files are a better choice, as they are clipped to the shoreline, generalized, and therefore usually smaller in size without losing too much accuracy. To load cartographic boundaries, use the cb = TRUE argument. If you are doing detailed geospatial analysis and need the most detailed shapefiles, then you should use the detailed TIGER line boundary files and set cb = FALSE.

Below is an example of how you would use library(tigris) to get a sf dataframe of all Census tracts in DC for 2019.

library(tigris)

# Only need to set once per script
options(tigris_class = "sf")

dc_tracts <- tracts(
  state = "DC",
  cb = TRUE,
  year = 2019
)

Unlike library(urbnmapr), different functions are used to get geographic data for different geographic levels. For instance, the blocks() function will load census block group data, and the tracts() function will load tract data. Other functions include block_groups(), zctas() , and core_based_statistical_areas(). For the full list of supported geographies and functions, see the package vignette.

For folks interested in pulling in Census demographic information along with Census geographies, we recommend checking out the sister package to library(tigris): library(tidycensus). That package allows you to download in Census variables and Census geographic data simultaneously.

Countries

We recommend using the library(rnaturalearth) package, which is similar to library(tigris) but allows you to download and use boundaries beyond the US. Instead of setting class to sf one time per session as we did with library(tigris), you must set the returnclass = "sf" argument each time you use a function from the package. Below is an example of downloading in an sf dataframe of all the countries in the world.

library(rnaturalearth)

world <- ne_countries(returnclass = "sf")

ggplot() +
  geom_sf(data = world, mapping = aes())

Your own files

Shapefiles/GeoJSONS

Shapefiles and GeoJSONs are 2 common spatial file formats you will found out in the wild. library(sf) has a function called st_read which allows you to easily read in these files as sf dataframes. The only required argument is dsn or data source name. This is the filepath of the .shp file or the .geojson file on your local computer. For geojsons, dsn can also be a URL.

Below is an example of reading in a shapefile of fire stations in DC which is stored in mapping/data/shapefiles/. Note that shapefiles are actually stored as 6+ different files inside a folder. You need to provide the filepath to the file ending in .shp.

library(sf)

# Print out all files in the directory
list.files("mapping/data/shapefiles")
## [1] "Fire_Stations.cpg" "Fire_Stations.dbf" "Fire_Stations.prj"
## [4] "Fire_Stations.shp" "Fire_Stations.shx" "Fire_Stations.xml"
# Read in .shp file
dc_firestations <- st_read(
  dsn = "mapping/data/shapefiles/Fire_Stations.shp",
  quiet = TRUE
)

And now dc_firestations is an sf dataframe you can use for all your mapping needs! st_read supports reading in a wide variety of other spatial file formats, including geodatabases, KML files, and over 200 others. For an incomplete list, please see the this sf vignette.

CSVs or dataframes with lat/lons

If you have a CSV with geographic information stored in columns, you will need to read in the CSV as a regular R dataframe and then convert to an sf dataframe. library(sf) contains the st_as_sf() function for converting regular R dataframes into an sf dataframe. The two arguments you must specify for this function are:

  • coords: A length 2 vector with the names of the columns corresponding to longitude and latitude (in that order!). For example, c("lon", "lat").
  • crs: The CRS (coordinate references system) for your longitude/latitude coordinates. Remember you need to specify both the
    authority and the SRID code, for example (“EPSG:4326”). For more information on finding and setting CRS codes, please see the CRS section.

Below is an example of reading in data from a CSV and converting it to an sf dataframe.

library(sf)

# Read in dataset of state capitals which is stored as a csv
state_capitals <- read_csv("mapping/data/state-capitals.csv")

state_capitals <- state_capitals %>%
  # Specify names of the lon/lat columns in the CSV to use to make geometry col
  st_as_sf(
    coords = c("longitude", "latitude"),
    crs = 4326
  )

One common mistake is that before converting to an sf dataframe, you must drop any rows that have NA values for latitude or longitude. If your data contains NA values, then the st_as_sf() function will throw an error.

Appending spatial info to your data

Oftentimes, the data you are working with will just have state or county identifiers - like FIPS codes or state abbreviations - but will not contain any geographic information. In this case, you must do the extra work of downloading in the geographic data as an sf dataframe and then joining your non-spatial data to the spatial data. Generally this involves 3 steps:

  1. Reading in your own data as a data frame
  2. Reading in the geographic data as an sf dataframe
  3. Using left_join to merge the geographic data with your own non spatial data and create a new expanded sf dataframe

Let’s say we had a dataframe on CHIP enrollment by state with state abbreviations.

# read the state CHIP data
chip_by_state <- read_csv("mapping/data/chip-enrollment.csv") %>%
  # clean column names so there are no random spaces/uppercase letters
  janitor::clean_names()

# print to the console
chip_by_state %>% head()
## # A tibble: 6 × 3
##   state      chip_enrollment state_abbreviation
##   <chr>                <dbl> <chr>             
## 1 Alabama             150040 AL                
## 2 Alaska               15662 AK                
## 3 Arizona              88224 AZ                
## 4 Arkansas            120863 AR                
## 5 California         2022213 CA                
## 6 Colorado            167227 CO

In order to convert this to an sf dataframe, we need to read in the spatial boundaries for each state and append it to our dataframe. Here is how we do that with get_urbn_map() and left_join() .

library(urbnmapr)

# read in state geographic data from urbnmapr
states <- get_urbn_map(map = "states", sf = TRUE)

# left join state geographies to chip data
chip_with_geographies <- states %>%
  left_join(
    chip_by_state,
    # Specify join column, which are slightly differently named in states and chip
    # respectively
    by = c("state_abbv" = "state_abbreviation")
  )

chip_with_geographies %>%
  select(state_fips, state_abbv, chip_enrollment)
## Simple feature collection with 51 features and 3 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -2600000 ymin: -2363000 xmax: 2516374 ymax: 732352.2
## Projected CRS: US National Atlas Equal Area
## First 10 features:
##    state_fips state_abbv chip_enrollment                       geometry
## 1          01         AL          150040 MULTIPOLYGON (((1150023 -15...
## 2          04         AZ           88224 MULTIPOLYGON (((-1386136 -1...
## 3          08         CO          167227 MULTIPOLYGON (((-786661.9 -...
## 4          09         CT           25551 MULTIPOLYGON (((2156197 -83...
## 5          12         FL          374884 MULTIPOLYGON (((1953691 -20...
## 6          13         GA          232050 MULTIPOLYGON (((1308636 -10...
## 7          16         ID           35964 MULTIPOLYGON (((-1357097 78...
## 8          18         IN          114927 MULTIPOLYGON (((1042064 -71...
## 9          20         KS           79319 MULTIPOLYGON (((-174904.2 -...
## 10         22         LA          161565 MULTIPOLYGON (((1075669 -15...

Project

Coordinate Reference Systems

The short version

Just watch this video and know the following:

  • All spatial data has a CRS, which specifies how to identify a location on earth.

  • It’s important that all spatial datasets you are working with be in the same CRS. You can find the CRS with st_crs() and change the CRS with st_transform().

  • The Urban Institute Style Guide requires the use of the Atlas Equal Earth Projection ("ESRI:102003") for national maps. For state and local maps, use this handy guide to find an appropriate State Plane projection.

The long version

Coordinate reference systems (CRS) specify the 3d shape of the earth and optionally how we project that 3d shape onto a 2d surface. They are an important part of working with spatial data as you need to ensure that all the data you are working with are in the same CRS in order for spatial operations and maps to be accurate.

CRS can be specified either by name (ie Maryland State Plane) or Spatial Reference System IDentifier (SRID). THe SRID is a numeric identifier that uniquely identifies a coordinate reference system. Generally when referring to an SRID, you need to refer to an authority (ie the data source) and a unique ID. An example is EPSG:26985 which refers to the Maryland State plane projection from the EPSG, or ESRI:102003 which refers to the Atlas Equal Area projection from ESRI. Most CRS codes will be from the EPSG, and some from ESRI and others. A good resource for finding/validating CRS codes is epsg.io.

Sidenote - EPSG stands for the now defunct European Petroleum Survey Group. And while oil companies have generally been terrible for the earth, the one nice thing they did for the earth was to set up common standards for coordinate reference systems.

You might be thinking well isn’t the earth just a sphere? Why do we need all this complicated stuff? And the answer is well the earth is kind of a sphere, but it’s really more of a misshapen ellipsoid which is pudgier at the equator than at the poles. To visualize how coordinate reference systems work, imagine that the earth is a (lumpy) orange. Now peel the skin off an orange and try to flatten it. There are many ways to do it, but all will create distortions of some kind. The CRS will give us the formula we’ve used to specify the shape of the orange (usually a sphere or ellipsoid of some kind) and optionally, specify how we flattened the orange into 2d.

Broadly, there are two kinds of Coordinate Reference Systems:

  1. Geographic coordinate systems

    • (sometimes called unprojected coordinate systems)
    • Specifies a 3d shape for the earth
    • Uses a spheroid/ellipsoid to approximate shape of the earth
    • Usually use decimal degree units (ie latitude/longitude) to identify locations on earth
  2. Projected coordinate systems

    • Specifies a 3d shape for the earth + a 2d mapping

      • Is a geographic coordinate system + a projection

        credit: xkcd

      • projection: mathematical formula used to convert a 3d coordinate system to a 2d flat coordinate system

      • Many different kinds of projections, including Equal Area, Equidistant, Conformal, etc

      • All projections distort the true shape of the earth in some way, either in terms of shape, area, or angle. Required xkcd comic

      • Usually use linear units (ie feet, meters) and therefore useful for distance based spatial operations (ie creating buffers)

Finding the CRS

If you are lucky, your data will have embedded CRS data that will be automatically detected when the file is read in. This is usually the case for GeoJSONS (.geojson) and shapefiles (.shp). When you use st_read() on these files, you should see the CRS displayed in the metadata:

You can also the st_crs() function to find the CRS. The CRS code is located at the end in ID[authority, SRID].

st_crs(dc_firestations)
## Coordinate Reference System:
##   User input: WGS 84 
##   wkt:
## GEOGCRS["WGS 84",
##     DATUM["World Geodetic System 1984",
##         ELLIPSOID["WGS 84",6378137,298.257223563,
##             LENGTHUNIT["metre",1]]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433]],
##     CS[ellipsoidal,2],
##         AXIS["latitude",north,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433]],
##         AXIS["longitude",east,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433]],
##     ID["EPSG",4326]]

Sometimes, the CRS will be blank or NA as the dataset did not specify the CRS. In that case you MUST find and set the CRS for your data before proceeding with analysis. Below are some good rules of thumb for finding out what the CRS for your data is:

  • For geojsons, the CRS should always be EPSG:4326 (or WGS 84). The official geojson specification states that this is the only valid CRS for geojsons, but in the wild, this may not be true 100% of the time.
  • For shapefiles, there should be a file that ends in .proj in the same directory as the .shp file. This file contains the projection information for that file and should be used automatically when reading in shapefiles.
  • For CSV’s with latitude/longitude columns, the CRS is usually EPSG:4326 (or WGS 84).
  • Look at the metadata and any accompanying documentation to see if the coordinate reference system for the data is specified

If none of the above rules of thumb apply to you, check out the crsuggest R package.

Once you’ve identified the appropriate CRS, you can set the CRS for your data with st_crs():

# If you are certain that your data contains coordinates in the ESRI Atlas Equal Earth projections
st_crs(some_sf_dataframe) <- st_crs("ESRI:102003")

Transforming the CRS

Often you will need to change the CRS for your sf dataframe so that all datasets you are using have the same CRS, or to use a projected CRS for performing more accurate spatial operations. You can do this with st_transform:

# Transforming CRS from WGS 84 to Urban required Equal Earth Projection
state_capitals <- state_capitals %>% st_transform("ESRI:102003")

st_transform() also allows you to just use the CRS of another sf dataframe when transforming.

# transform CRS of chip_with_geographies to be the same as CRS of dc_firestations
chip_with_geographies <- chip_with_geographies %>%
  st_transform(crs = st_crs(state_capitals))

If you are working with local data, you should use an appropriate state plane projection instead of the Atlas Equal Earth projection which is meant for national maps. library(crsuggest) can simplify the process of picking an appropriate state plane CRS.

library(crsuggest)

suggest_crs(dc_firestations) %>%
  # Use the value in the "crs_code" column to transform CRS's
  head(4)
## # A tibble: 4 × 6
##   crs_code crs_name        crs_type crs_gcs crs_units crs_proj4                 
##   <chr>    <chr>           <chr>      <dbl> <chr>     <chr>                     
## 1 6488     NAD83(2011) / … project…    6318 us-ft     +proj=lcc +lat_0=37.66666…
## 2 6487     NAD83(2011) / … project…    6318 m         +proj=lcc +lat_0=37.66666…
## 3 3582     NAD83(NSRS2007… project…    4759 us-ft     +proj=lcc +lat_0=37.66666…
## 4 3559     NAD83(NSRS2007… project…    4759 m         +proj=lcc +lat_0=37.66666…

Map

In order to start mapping, you need an sf dataframe. If you don’t have one, see the Get Spatial Data section above.

The basics

library(ggplot2)

Most mapping in R fits the same theoretical framework as plotting in R using library(ggplot2). To learn more about ggplot2, visit the Data Viz page or read the official ggplot book.

The key function for mapping is the special geom_sf() function which works with sf dataframes. This function magically detects whether you have point or polygon spatial data and displays the results on a map.

A simple map

To make a simple map, add geom_sf() to a ggplot() and set data = an_sf_dataframe. Below is code for making a map of all 50 states using library(urbnmapr):

library(urbnmapr)

states <- get_urbn_map("states", sf = TRUE)

ggplot() +
  geom_sf(
    data = states,
    mapping = aes()
  )

Styling

library(urbnthemes)

library(urbnthemes) automatically styles maps in accordance with the Urban Institute Data Visualization Style Guide. By using library(urbnthemes), you can create publication ready maps you can immediately drop in to Urban research briefs or blog posts.

To install urbnthemes, visit the package’s GitHub repository and follow the instructions. There are 2 ways to use the urbnthemes functions:

library(urbnthemes)

# You can either run this once per script to automatically style all maps with
# the Urban theme
set_urbn_defaults(style = "map")

# Or you can add `+ theme_urbn_map()` to the end of every map you make
ggplot() +
  geom_sf(states, mapping = aes()) +
  theme_urbn_map()

Layering

You can layer multiple points/lines/polygons on top of each other using the + operator from library(ggplot2). The shapes will appear from bottom to top (ie the last mapped object will show up on top). It is important that all layers are in the same CRS (coordinate reference system).

state_capitals <- state_capitals %>%
  # This will change CRS to ESRI:102003 and shift the AK and HI state capitals
  # point locations to the appropriate locations on the inset maps.
  tigris::shift_geometry() %>%
  # For now filter out AL and HI as their state capitals will be slightly off.
  filter(!state %in% c("Alaska", "Hawaii"))

ggplot() +
  geom_sf(
    data = states,
    mapping = aes()
  ) +
  # Note we change the data argument
  geom_sf(
    data = state_capitals,
    mapping = aes(),
    # urbnthemes library has urbn color palettes built in.
    color = palette_urbn_main["yellow"],
    size = 2.0
  ) +
  theme_urbn_map()

Fill and Outline Colors

The same commands used to change colors, opacity, lines, size, etc. in charts can be used for maps too. To change the colors of the map , just use the fill = and color = parameters in geom_sf(). fill will change the fill color of polygons; color will change the color of polygon outlines, lines, and points.

Generally, maps that show the magnitude of a variable use the blue sequential ramp and maps that display positives and negatives use the diverging color ramp.library(urbnthemes) contains inbuilt. helper variables (like palette_urbn_main) for accessing color palettes from the Urban Data Viz Style guide. If for example you want states to be Urban’s magenta color:

ggplot() +
  geom_sf(states,
    mapping = aes(),
    # Adjust polygon fill color
    fill = palette_urbn_main["magenta"],
    # Adjust polygon outline color
    color = "white"
  ) +
  theme_urbn_map()

Adding text

You can also add text, like state abbreviations, directly to your map using geom_sf_text and the helper function get_urbn_labels().

library(urbnmapr)

ggplot() +
  geom_sf(states,
    mapping = aes(),
    color = "white"
  ) +
  theme_urbn_map() +
  # Generates dataframe of state abbv and appropriate location to plot them
  geom_sf_text(
    data = get_urbn_labels(
      map = "states",
      sf = TRUE
    ),
    aes(label = state_abbv),
    size = 3
  )

There’s also geom_sf_label() if you want labels with a border.