Introduction

This guide will teach you the concepts and code you will need for mapping and geospatial analysis in R. This is a long guide, so if you need something specific, we encourage you to scroll to the appropriate section using the Table of Contents on the left. If you just want copy and pasteable code to create different kinds of maps, head to the Map Gallery.

Now let’s start mapping!

Geospatial Workflow

The picture below outlines what we think are the main steps in a geospatial workflow. This guide is split into sections describing each of these steps.

Should this be a map?

The Urban Institute Data Visualization Style Guide offers some blunt but useful suggestions for maps:

Just because you’ve got geographic data, doesn’t mean that you have to make a map. Many times, there are more efficient storyforms that will get your point across more clearly. If your data shows a very clear geographic trend or if the absolute location of a place or event matters, maps might be the best approach, but sometimes the reflexive impulse to map the data can make you forget that showing the data in another form might answer other—and sometimes more important—questions.

So we would encourage you to think critically before making a map.

Why map with R?

R can have a steeper learning curve than point-and-click tools - like QGIS or ArcGIS - for geospatial analysis and mapping. But creating maps in R has many advantages, including:

  1. Reproducibility: By creating maps with R code, you can easily share the outputs and the code that generated the output with collaborators, allowing them to replicate your work and catch errors easily.

  2. Iteration: With point-and-click software like ArcGIS, making 50 maps would be 50 times the work/time. But using R, we can easily make many iterations of the same map with a few changes to the code.

  3. Easy Updates: Writing code provides a roadmap for others (and future you!) to quickly update parts of the map as needed. Say for example a collaborator wanted to change the legend colors of 50 state maps. With R, this is possible in just a few seconds!

  4. An expansive ecosystem: There are several R packages that make it very easy to get spatial data, create static and interactive maps, and perform spatial analyses. This feature-rich ecosystem of packages that all play nicely together is frankly unmatched by other programming languages, and even by point-and-click tools like QGIS and ArcGIS. Some of these R packages include:

    • sf: For managing and analyzing spatial dataframes
    • tigris: For downloading Census geographies
    • ggplot2: For making publication-ready static maps
    • urbnmapr: For automatically adding Urban styling to static maps
    • mapview: For making exploratory interactive maps
  5. Cost: Most point-and-click tools for geospatial analysis are proprietary and expensive. R is free open-source software. The software and most of its packages can be used for free by anyone for almost any use case.

Helpful Learning Resources

In addition to this guide, you may want to look at these other helpful resources:

Get Spatial Data


library(sf)

The short version

library(sf) stores geospatial data, which are points (a single longitude/latitude), lines (a set of connected points), or polygons (a collection of points which make up a polygon), in a geometry column within R dataframes.

This is what an sf dataframe looks like in the console:

dc_parks <- st_read("mapping/data/dc_parks.geojson",
                    quiet = TRUE)

# Print just the NAME and geometry column
dc_parks %>%
  select(NAME) %>%
  head(2)
Simple feature collection with 2 features and 1 field
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -77.01063 ymin: 38.81718 xmax: -76.9625 ymax: 38.89723
Geodetic CRS:  WGS 84
                          NAME                       geometry
1 Kingman and Heritage Islands MULTIPOLYGON (((-76.96566 3...
2              Bald Eagle Hill MULTIPOLYGON (((-77.01063 3...

The long version

The sf library is a key tool for reading in, managing, and working with spatial data in R. sf stands for simple features (not San Francisco, you Bay Area folks) and denotes a way to describe the spatial attributes of real-life objects. The R object you will be working with most frequently for mapping is an sf dataframe. An sf dataframe is essentially a regular R dataframe, with a couple of extra features for use in mapping. These extra features exclusive to sf dataframes include:

  • sticky geometry columns
  • attached coordinate reference systems
  • some other spatial metadata

The most important of the above list is the sticky geometry column, which is a magical column that contains all of the geographic information for each row of data. Say, for example, you had an sf dataframe of all DC census tracts. The geometry column would then contain all of the geographic points used to define the DC census tract polygons. The stickiness of this column means that no matter what data munging/filtering you do, you will not be able to drop or delete the geometry column. Below is a graphic to help you understand this:

credits: @allisonhorst

This is what an sf dataframe looks like in the console:

# Read in spatial data about DC parks from DC Open Data Portal
dc_parks <- st_read("https://opendata.arcgis.com/api/v3/datasets/287eaa2ecbff4d699762bbc6795ffdca_9/downloads/data?format=geojson&spatialRefId=4326",
                    quiet = TRUE)

# dc_parks <- st_read("mapping/data/dc_parks.geojson")

# Select just a few columns for readability
dc_parks <- dc_parks %>%
  select(NAME, geometry)

# Print to the console
dc_parks
Simple feature collection with 256 features and 1 field
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -77.11113 ymin: 38.81718 xmax: -76.91108 ymax: 38.98811
Geodetic CRS:  WGS 84
First 10 features:
                          NAME                       geometry
1              Plymouth Circle MULTIPOLYGON (((-77.04677 3...
2       Triangle Park RES 0566 MULTIPOLYGON (((-77.04481 3...
3               Shepherd Field MULTIPOLYGON (((-77.03528 3...
4  Marvin Caplan Memorial Park MULTIPOLYGON (((-77.03027 3...
5             Pinehurst Circle MULTIPOLYGON (((-77.06643 3...
6   Triangle Park 3278    0801 MULTIPOLYGON (((-77.01759 3...
7                 Fort Stevens MULTIPOLYGON (((-77.02988 3...
8     Takoma Recreation Center MULTIPOLYGON (((-77.01794 3...
9      Takoma Community Center MULTIPOLYGON (((-77.01716 3...
10      Triangle Park RES 0648 MULTIPOLYGON (((-77.03362 3...

Note that there is some spatial metadata, such as the Geometry Type, Bounding Box, and CRS, which shows up as a header before the actual contents of the dataframe.

Since sf dataframes operate similarly to regular dataframes, we can use all our familiar tidyverse functions for data wrangling, including select, filter, rename, mutate, group_by and summarize. The sf package also has many functions that provide easy ways to replicate common tasks done in other GIS software like spatial joins, clipping, and buffering. Almost all of the mapping and geospatial analysis methods described in this guide rely on you having an sf dataframe. So let’s talk about how to get one!
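
For example, here is a minimal sketch (using the dc_parks dataframe from above) showing that dplyr verbs work as usual, and that the geometry column sticks around even when you don't select it:

library(dplyr)

# Tidyverse verbs work on sf dataframes just like on regular dataframes
dc_parks %>%
  mutate(name_upper = toupper(NAME)) %>%
  # Even though we only select name_upper, the sticky geometry column is kept
  select(name_upper) %>%
  head(2)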

Importing spatial data

Getting an sf dataframe is always the first step in the geospatial workflow. Here’s how to import spatial data for…

States and counties

We highly recommend using the library(urbnmapr) package, which was created by folks here at Urban to easily create state and county level maps. The get_urbn_map() function in the package allows you to read in spatial data on states and counties, with options to include territories. Importantly, it will also display AK and HI as insets on the map in accordance with the Urban Institute Data Visualization Style Guide. For information on how to install urbnmapr, see the GitHub repository.

Below is an example of how you would use urbnmapr to get an sf dataframe of all the states or counties in the US.

library(urbnmapr)

# Get state data
states <- get_urbn_map("states", sf = TRUE)

# Can also get county data
counties <- get_urbn_map("counties", sf = TRUE)

Other Census geographies

Use the library(tigris) package, which allows you to easily download TIGER and other cartographic boundaries from the US Census Bureau. To automatically load in the boundaries as sf objects, run options(tigris_class = "sf") once per R session.

library(tigris) has all the standard census geographies, including census tracts, counties, CBSAs, ZCTAs, congressional districts, tribal areas, and more. It also includes other elements such as water, roads, and military bases.

By default, library(tigris) will download very large and detailed TIGER/Line boundary files. For thematic mapping, the smaller cartographic boundary files are a better choice, as they are clipped to the shoreline and generalized, and are therefore usually smaller in size without losing too much accuracy. To load cartographic boundaries, use the cb = TRUE argument. If you are doing detailed geospatial analysis and need the most detailed shapefiles, then you should use the TIGER/Line boundary files and set cb = FALSE.

Below is an example of how you would use library(tigris) to get an sf dataframe of all Census tracts in DC for 2019.

library(tigris)

# Only need to set once per script
options(tigris_class = "sf")

dc_tracts <- tracts(
  state = "DC",
  cb = TRUE,
  year = 2019
)

Unlike library(urbnmapr), different functions are used to get geographic data for different geographic levels. For instance, the blocks() function will load census block data, and the tracts() function will load tract data. Other functions include block_groups(), zctas(), and core_based_statistical_areas(). For the full list of supported geographies and functions, see the package vignette.
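
For example, here is a sketch of downloading the cartographic boundary files for all block groups in DC, following the same pattern as tracts() above:

library(tigris)

# Same pattern as tracts(), but for block groups
dc_block_groups <- block_groups(
  state = "DC",
  cb = TRUE,
  year = 2019
)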

For folks interested in pulling in Census demographic information along with Census geographies, we recommend checking out the sister package to library(tigris): library(tidycensus). That package allows you to download Census variables and Census geographic data simultaneously.
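
As a quick illustration (a sketch, not a full tutorial), library(tidycensus) can return an ACS variable and tract geometries in one call; note that you will need to sign up for a free Census API key first:

library(tidycensus)

# Download median household income and tract geometries for DC in one call.
# Requires a (free) Census API key; see ?census_api_key.
dc_income <- get_acs(
  geography = "tract",
  variables = "B19013_001", # median household income
  state = "DC",
  year = 2019,
  geometry = TRUE
)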

Countries

We recommend using the library(rnaturalearth) package, which is similar to library(tigris) but allows you to download and use boundaries beyond the US. Instead of setting the class to sf once per session as we did with library(tigris), you must set the returnclass = "sf" argument each time you use a function from the package. Below is an example of downloading an sf dataframe of all the countries in the world.

library(rnaturalearth)

world <- ne_countries(returnclass = "sf")

ggplot() +
  geom_sf(data = world, mapping = aes())

Your own files

Shapefiles/GeoJSONS

Shapefiles and GeoJSONs are two common spatial file formats you will find out in the wild. library(sf) has a function called st_read() which allows you to easily read in these files as sf dataframes. The only required argument is dsn, or data source name. This is the filepath of the .shp file or the .geojson file on your local computer. For GeoJSONs, dsn can also be a URL.

Below is an example of reading in a shapefile of fire stations in DC which is stored in mapping/data/shapefiles/. Note that shapefiles are actually stored as 6+ different files inside a folder. You need to provide the filepath to the file ending in .shp.

library(sf)

# Print out all files in the directory
list.files("mapping/data/shapefiles")
[1] "Fire_Stations.cpg" "Fire_Stations.dbf" "Fire_Stations.prj"
[4] "Fire_Stations.shp" "Fire_Stations.shx" "Fire_Stations.xml"
# Read in .shp file
dc_firestations <- st_read(
  dsn = "mapping/data/shapefiles/Fire_Stations.shp",
  quiet = TRUE
)

And now dc_firestations is an sf dataframe you can use for all your mapping needs! st_read() supports reading in a wide variety of other spatial file formats, including geodatabases, KML files, and over 200 others. For an incomplete list, please see this sf vignette.
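
For instance, reading a layer out of a geodatabase might look like the sketch below (the file path and layer name here are hypothetical):

library(sf)

# List the layers available in the geodatabase
st_layers(dsn = "mapping/data/some_geodatabase.gdb")

# Read in one specific layer as an sf dataframe
my_layer <- st_read(
  dsn = "mapping/data/some_geodatabase.gdb",
  layer = "name_of_layer"
)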

CSVs or dataframes with lat/lons

If you have a CSV with geographic information stored in columns, you will need to read in the CSV as a regular R dataframe and then convert to an sf dataframe. library(sf) contains the st_as_sf() function for converting regular R dataframes into an sf dataframe. The two arguments you must specify for this function are:

  • coords: A length-2 vector with the names of the columns corresponding to longitude and latitude (in that order!). For example, c("lon", "lat").
  • crs: The CRS (coordinate reference system) for your longitude/latitude coordinates. Remember you need to specify both the authority and the SRID code, for example "EPSG:4326". For more information on finding and setting CRS codes, please see the CRS section.

Below is an example of reading in data from a CSV and converting it to an sf dataframe.

library(sf)

# Read in dataset of state capitals which is stored as a csv
state_capitals <- read_csv("mapping/data/state-capitals.csv")

state_capitals <- state_capitals %>%
  # Specify names of the lon/lat columns in the CSV to use to make geometry col
  st_as_sf(
    coords = c("longitude", "latitude"),
    crs = 4326
  )

One common gotcha: before converting to an sf dataframe, you must drop any rows that have NA values for latitude or longitude. If your data contains NA values, the st_as_sf() function will throw an error.
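
Here is a sketch of dropping those rows first (the name state_capitals_raw and the longitude/latitude column names are hypothetical):

library(sf)
library(dplyr)

# Drop rows with missing coordinates *before* converting to sf
state_capitals_sf <- state_capitals_raw %>%
  filter(!is.na(longitude), !is.na(latitude)) %>%
  st_as_sf(
    coords = c("longitude", "latitude"),
    crs = 4326
  )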

Appending spatial info to your data

Oftentimes, the data you are working with will just have state or county identifiers - like FIPS codes or state abbreviations - but will not contain any geographic information. In this case, you must do the extra work of downloading the geographic data as an sf dataframe and then joining your non-spatial data to the spatial data. Generally this involves 3 steps:

  1. Reading in your own data as a dataframe
  2. Reading in the geographic data as an sf dataframe
  3. Using left_join() to merge the geographic data with your own non-spatial data and create a new expanded sf dataframe

Let’s say we had a dataframe on CHIP enrollment by state with state abbreviations.

# read the state CHIP data
chip_by_state <- read_csv("mapping/data/chip-enrollment.csv") %>%
  # clean column names so there are no random spaces/uppercase letters
  janitor::clean_names()

# print to the console
chip_by_state %>% head()
# A tibble: 6 × 3
  state      chip_enrollment state_abbreviation
  <chr>                <dbl> <chr>             
1 Alabama             150040 AL                
2 Alaska               15662 AK                
3 Arizona              88224 AZ                
4 Arkansas            120863 AR                
5 California         2022213 CA                
6 Colorado            167227 CO                

In order to convert this to an sf dataframe, we need to read in the spatial boundaries for each state and append them to our dataframe. Here is how we do that with get_urbn_map() and left_join().

library(urbnmapr)

# read in state geographic data from urbnmapr
states <- get_urbn_map(map = "states", sf = TRUE)

# left join state geographies to chip data
chip_with_geographies <- states %>%
  left_join(
    chip_by_state,
    # Specify the join columns, which are named slightly differently in
    # states and chip_by_state respectively
    by = c("state_abbv" = "state_abbreviation")
  )

chip_with_geographies %>%
  select(state_fips, state_abbv, chip_enrollment)
Simple feature collection with 51 features and 3 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -2600000 ymin: -2363000 xmax: 2516374 ymax: 732352.2
Projected CRS: NAD27 / US National Atlas Equal Area
First 10 features:
   state_fips state_abbv chip_enrollment                       geometry
1          01         AL          150040 MULTIPOLYGON (((1150023 -15...
2          04         AZ           88224 MULTIPOLYGON (((-1386136 -1...
3          08         CO          167227 MULTIPOLYGON (((-786661.9 -...
4          09         CT           25551 MULTIPOLYGON (((2156197 -83...
5          12         FL          374884 MULTIPOLYGON (((1953691 -20...
6          13         GA          232050 MULTIPOLYGON (((1308636 -10...
7          16         ID           35964 MULTIPOLYGON (((-1357097 78...
8          18         IN          114927 MULTIPOLYGON (((1042064 -71...
9          20         KS           79319 MULTIPOLYGON (((-174904.2 -...
10         22         LA          161565 MULTIPOLYGON (((1075669 -15...

Project

Coordinate Reference Systems

The short version

Just watch this video and know the following:

  • All spatial data has a CRS, which specifies how to identify a location on earth.

  • It’s important that all spatial datasets you are working with be in the same CRS. You can find the CRS with st_crs() and change the CRS with st_transform().

  • The Urban Institute Style Guide requires the use of the Atlas Equal Earth Projection ("ESRI:102003") for national maps. For state and local maps, use this handy guide to find an appropriate State Plane projection.
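
In code, those two functions look like this (a sketch with a generic sf dataframe named some_sf_dataframe):

library(sf)

# Find the current CRS
st_crs(some_sf_dataframe)

# Transform to the projection required for national maps
some_sf_dataframe <- st_transform(some_sf_dataframe, "ESRI:102003")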

The long version

Coordinate reference systems (CRS) specify the 3d shape of the earth and optionally how we project that 3d shape onto a 2d surface. They are an important part of working with spatial data as you need to ensure that all the data you are working with are in the same CRS in order for spatial operations and maps to be accurate.

A CRS can be specified either by name (ie Maryland State Plane) or by Spatial Reference System IDentifier (SRID). The SRID is a numeric identifier that uniquely identifies a coordinate reference system. Generally, when referring to an SRID, you need to refer to an authority (ie the data source) and a unique ID. An example is EPSG:26985, which refers to the Maryland State Plane projection from the EPSG, or ESRI:102003, which refers to the Atlas Equal Earth projection from ESRI. Most CRS codes are from the EPSG, with some from ESRI and other authorities. A good resource for finding/validating CRS codes is epsg.io.

Sidenote - EPSG stands for the now-defunct European Petroleum Survey Group. And while oil companies have generally been terrible for the earth, one nice thing they did was set up common standards for coordinate reference systems.

You might be thinking: isn't the earth just a sphere? Why do we need all this complicated stuff? Well, the earth is kind of a sphere, but it's really more of a misshapen ellipsoid that is pudgier at the equator than at the poles. To visualize how coordinate reference systems work, imagine that the earth is a (lumpy) orange. Now peel the skin off the orange and try to flatten it. There are many ways to do it, but all will create distortions of some kind. The CRS gives us the formula we've used to specify the shape of the orange (usually a sphere or ellipsoid of some kind) and, optionally, how we flattened the orange into 2d.

Broadly, there are two kinds of Coordinate Reference Systems:

  1. Geographic coordinate systems

    • (sometimes called unprojected coordinate systems)
    • Specifies a 3d shape for the earth
    • Uses a spheroid/ellipsoid to approximate the shape of the earth
    • Usually use decimal degree units (ie latitude/longitude) to identify locations on earth

  2. Projected coordinate systems

    • Specifies a 3d shape for the earth + a 2d mapping

      • Is a geographic coordinate system + a projection (credit: xkcd)

      • projection: a mathematical formula used to convert a 3d coordinate system to a flat 2d coordinate system

      • There are many different kinds of projections, including Equal Area, Equidistant, and Conformal

      • All projections distort the true shape of the earth in some way, either in terms of shape, area, or angle. Required xkcd comic

      • Usually use linear units (ie feet, meters) and are therefore useful for distance-based spatial operations (ie creating buffers)
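
One quick way to check which kind of CRS your data has: sf's st_is_longlat() returns TRUE for geographic (unprojected) coordinates. A sketch using the dc_firestations dataframe from above:

library(sf)

# TRUE for a geographic CRS (decimal degrees), FALSE for a projected CRS
st_is_longlat(dc_firestations)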

Finding the CRS

If you are lucky, your data will have embedded CRS data that is automatically detected when the file is read in. This is usually the case for GeoJSONs (.geojson) and shapefiles (.shp). When you use st_read() on these files, you should see the CRS displayed in the metadata header.

You can also use the st_crs() function to find the CRS. The CRS code is located at the end, in ID[authority, SRID].

st_crs(dc_firestations)
Coordinate Reference System:
  User input: WGS 84 
  wkt:
GEOGCRS["WGS 84",
    DATUM["World Geodetic System 1984",
        ELLIPSOID["WGS 84",6378137,298.257223563,
            LENGTHUNIT["metre",1]]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["latitude",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["longitude",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    ID["EPSG",4326]]

Sometimes, the CRS will be blank or NA as the dataset did not specify the CRS. In that case you MUST find and set the CRS for your data before proceeding with analysis. Below are some good rules of thumb for finding out what the CRS for your data is:

  • For geojsons, the CRS should always be EPSG:4326 (or WGS 84). The official geojson specification states that this is the only valid CRS for geojsons, but in the wild, this may not be true 100% of the time.
  • For shapefiles, there should be a file that ends in .prj in the same directory as the .shp file. This file contains the projection information and is used automatically when reading in shapefiles.
  • For CSVs with latitude/longitude columns, the CRS is usually EPSG:4326 (or WGS 84).
  • Look at the metadata and any accompanying documentation to see if the coordinate reference system for the data is specified.

If none of the above rules of thumb apply to you, check out the crsuggest R package.

Once you’ve identified the appropriate CRS, you can set the CRS for your data with st_crs():

# If you are certain that your data contains coordinates in the ESRI Atlas Equal Earth projection
st_crs(some_sf_dataframe) <- st_crs("ESRI:102003")

Transforming the CRS

Often you will need to change the CRS for your sf dataframe so that all datasets you are using have the same CRS, or to use a projected CRS for performing more accurate spatial operations. You can do this with st_transform():

# Transforming CRS from WGS 84 to Urban required Equal Earth Projection
state_capitals <- state_capitals %>% st_transform("ESRI:102003")

st_transform() also allows you to just use the CRS of another sf dataframe when transforming.

# Transform the CRS of chip_with_geographies to match the CRS of state_capitals
chip_with_geographies <- chip_with_geographies %>%
  st_transform(crs = st_crs(state_capitals))

If you are working with local data, you should use an appropriate state plane projection instead of the Atlas Equal Earth projection which is meant for national maps. library(crsuggest) can simplify the process of picking an appropriate state plane CRS.

library(crsuggest)

suggest_crs(dc_firestations) %>%
  # Use the value in the "crs_code" column to transform CRS's
  head(4)
# A tibble: 4 × 6
  crs_code crs_name                         crs_type crs_gcs crs_units crs_proj4
  <chr>    <chr>                            <chr>      <dbl> <chr>     <chr>    
1 6488     NAD83(2011) / Maryland (ftUS)    project…    6318 us-ft     +proj=lc…
2 6487     NAD83(2011) / Maryland           project…    6318 m         +proj=lc…
3 3582     NAD83(NSRS2007) / Maryland (ftU… project…    4759 us-ft     +proj=lc…
4 3559     NAD83(NSRS2007) / Maryland       project…    4759 m         +proj=lc…

Map

In order to start mapping, you need an sf dataframe. If you don’t have one, see the Get Spatial Data section above.

The basics

library(ggplot2)

Most mapping in R fits the same theoretical framework as plotting in R using library(ggplot2). To learn more about ggplot2, visit the Data Viz page or read the official ggplot book.

The key function for mapping is the special geom_sf() function, which works with sf dataframes. This function magically detects whether you have point, line, or polygon spatial data and displays the results on a map.

A simple map

To make a simple map, add geom_sf() to a ggplot() and set data = an_sf_dataframe. Below is code for making a map of all 50 states using library(urbnmapr):

library(urbnmapr)

states <- get_urbn_map("states", sf = TRUE)

ggplot() +
  geom_sf(
    data = states,
    mapping = aes()
  )

Styling

library(urbnthemes)

library(urbnthemes) automatically styles maps in accordance with the Urban Institute Data Visualization Style Guide. By using library(urbnthemes), you can create publication-ready maps you can immediately drop into Urban research briefs or blog posts.

To install urbnthemes, visit the package’s GitHub repository and follow the instructions. There are 2 ways to use the urbnthemes functions:

library(urbnthemes)

# You can either run this once per script to automatically style all maps with
# the Urban theme
set_urbn_defaults(style = "map")

# Or you can add `+ theme_urbn_map()` to the end of every map you make
ggplot() +
  geom_sf(states, mapping = aes()) +
  theme_urbn_map()

Layering

You can layer multiple points/lines/polygons on top of each other using the + operator from library(ggplot2). The shapes will appear from bottom to top (ie the last mapped object will show up on top). It is important that all layers are in the same CRS (coordinate reference system).

state_capitals <- state_capitals %>%
  # This will change CRS to ESRI:102003 and shift the AK and HI state capitals
  # point locations to the appropriate locations on the inset maps.
  tigris::shift_geometry() %>%
  # For now, filter out AK and HI as their state capitals would be slightly off.
  filter(!state %in% c("Alaska", "Hawaii"))

ggplot() +
  geom_sf(
    data = states,
    mapping = aes()
  ) +
  # Note we change the data argument
  geom_sf(
    data = state_capitals,
    mapping = aes(),
    # urbnthemes library has urbn color palettes built in.
    color = palette_urbn_main["yellow"],
    size = 2.0
  ) +
  theme_urbn_map()

Fill and Outline Colors

The same commands used to change colors, opacity, lines, size, etc. in charts can be used for maps too. To change the colors of the map, just use the fill = and color = parameters in geom_sf(). fill will change the fill color of polygons; color will change the color of polygon outlines, lines, and points.

Generally, maps that show the magnitude of a variable use the blue sequential ramp, and maps that display positives and negatives use the diverging color ramp. library(urbnthemes) contains built-in helper variables (like palette_urbn_main) for accessing color palettes from the Urban Data Viz Style Guide. If, for example, you want states to be Urban's magenta color:

ggplot() +
  geom_sf(states,
    mapping = aes(),
    # Adjust polygon fill color
    fill = palette_urbn_main["magenta"],
    # Adjust polygon outline color
    color = "white"
  ) +
  theme_urbn_map()

Adding text

You can also add text, like state abbreviations, directly to your map using geom_sf_text() and the helper function get_urbn_labels().

library(urbnmapr)

ggplot() +
  geom_sf(states,
    mapping = aes(),
    color = "white"
  ) +
  theme_urbn_map() +
  # Generates dataframe of state abbv and appropriate location to plot them
  geom_sf_text(
    data = get_urbn_labels(
      map = "states",
      sf = TRUE
    ),
    aes(label = state_abbv),
    size = 3
  )

There’s also geom_sf_label() if you want labels with a border.
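
A sketch of the same map with geom_sf_label() instead:

ggplot() +
  geom_sf(states,
    mapping = aes(),
    color = "white"
  ) +
  theme_urbn_map() +
  # Same as geom_sf_text(), but draws a filled box behind each label
  geom_sf_label(
    data = get_urbn_labels(map = "states", sf = TRUE),
    aes(label = state_abbv),
    size = 3
  )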

Spatial Operations

Cropping

Cropping (or clipping) is geographically filtering an sf dataframe to just the area we are interested in. Say we wanted to look at the roads around Fire Station 24 in DC.

library(tigris)
library(units)

dc_firestations <- dc_firestations %>%
  st_transform("EPSG:6487")


# Draw 500 meter circle around one fire station
fire_station_24_buffered <- dc_firestations %>%
  filter(NAME == "Engine 24 Station") %>%
  st_buffer(set_units(500, "meter"))

# Get listing of all roads in DC
dc_roads <- roads(
  state = "DC",
  county = "District of Columbia",
  class = "sf",
  progress_bar = FALSE
) %>%
  st_transform("EPSG:6487")

# View roads on top of fire_station
ggplot() +
  # Order matters! We need to plot the buffered fire station first, and then
  # the roads on top to see the roads that overlap it
  geom_sf(
    data = fire_station_24_buffered,
    fill = palette_urbn_cyan[1],
    color = palette_urbn_cyan[7]
  ) +
  geom_sf(
    data = dc_roads,
    color = palette_urbn_gray[7]
  ) +
  theme_urbn_map()

We can clip the larger roads dataframe to just roads that overlap with the circle around the fire station with st_intersection().

# Use st_intersection() to crop the roads data to just roads within the
# fire_station radius
dc_roads_around_fire_station_24_buffered <- fire_station_24_buffered %>%
  st_intersection(dc_roads)

ggplot() +
  geom_sf(
    data = fire_station_24_buffered,
    fill = palette_urbn_cyan[1],
    color = palette_urbn_cyan[7]
  ) +
  geom_sf(
    data = dc_roads_around_fire_station_24_buffered,
    color = palette_urbn_gray[7]
  ) +
  theme_urbn_map()

More Coming Soon!

Calculating Distance

Spatial Joins

Point to Polygon

Polygon to Polygon

Aggregating

Drive/Transit times

Geocoding

Geocoding is the process of turning text (usually addresses) into geographic coordinates (usually latitudes/longitudes) for use in mapping. For Urban researchers, we highly recommend using the Urban geocoder, as it is fast, accurate, designed to work with sensitive/confidential data, and, most importantly, free to use for Urban researchers! To learn about how we set up and chose the geocoder for the Urban Institute, you can read our Data@Urban blog.

Cleaning Addresses

The single most important factor in getting accurately geocoded data is having clean, well-structured address data. This can prove difficult, as address data out in the wild is often messy and unstandardized. While the rules for cleaning addresses are very data-specific, below are some examples of clean addresses you should aim for in your data cleaning process:

f_address                                                     Type of address
123 Troy Drive, Pillowtown, CO, 92432                         residential address
789 Abed Avenue, Apt 666, Blanketsburg, CO, 92489             residential apartment address
Shirley Boulevard and Britta Drive, Blanketsburg, CO, 92489   street intersection
Pillowtown, CO                                                city
92489, CO                                                     zip code

All that being said, our geocoder is pretty tolerant of different address formats, typos/spelling errors and missing states, zip codes, etc. So don’t spend too much time cleaning every address in the data. Also note that while our geocoder is able to geocode cities and zip codes, it will return the lat/lon of the center of the city/zip code, which may not be what you want.

Instructions

To use the Urban geocoder, you will need to:

  1. Generate a CSV with a column named f_address which contains the addresses in single-line format (ie 123 Abed Avenue, Blanketsburg, CO, 94328). This means that if you have addresses split across multiple columns (ie Address, City, State, and Zip columns), you will need to concatenate them into one column; see the sketch after this list. Also see our address cleaning section above.

  2. Go to the Urban geocoder and answer the initial questions. These will tell you whether your data is non-confidential or confidential, and allow you to upload your CSV for geocoding.

  3. Wait for an email telling you your results are ready. If your data is non-confidential, this email will contain a link to your geocoded results. This link expires in 24 hours, so make sure to download your data before then. If your data is confidential, the email will contain a link to the location on the Y Drive where your confidential geocoded data is stored. You can specify this output folder when submitting the CSV in step 1.
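
As an example of step 1, below is a sketch of concatenating separate address columns into a single f_address column with tidyr::unite() (the dataframe addresses and its column names address, city, state, and zip are hypothetical):

library(tidyverse)

# Combine the separate address pieces into one single-line address column
addresses <- addresses %>%
  unite(
    col = "f_address",
    address, city, state, zip,
    sep = ", ",
    na.rm = TRUE
  )

# Write out a CSV to upload to the geocoder
write_csv(addresses, "addresses_for_geocoding.csv")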

Geocoder outputs

The geocoded file will be your original data, plus a few more columns (including latitude and longitude). The list below describes each of the new columns that have been appended to your original data. It's very important that you take a look at the Addr_type column in the CSV before doing further analysis to check the accuracy of the geocoding process.
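
For example, a quick sketch of tabulating Addr_type after reading in your results (the file name here is hypothetical):

library(tidyverse)

# Read in the geocoded results
geocoded <- read_csv("geocoded_results.csv")

# Count match levels; many non-street-level matches warrant a closer look
geocoded %>%
  count(Addr_type, sort = TRUE)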

Match_addr: The actual address that the inputted address was matched to. This is the address that the geocoder used to get latitudes/longitudes. If there are potentially many typos or nonstandard address formats in your data file, you will want to take a close look at this column to confirm that the matched address correctly handled typos and badly formatted addresses.

Longitude: The WGS 84 datum longitude (EPSG:4326).

Latitude: The WGS 84 datum latitude (EPSG:4326).

Addr_type: The match level for a geocode request. This should be used as an indicator of the precision of geocode results. Generally, Subaddress, PointAddress, StreetAddress, and StreetInt represent accurate matches. The list below contains all possible values for this field. Green values represent High accuracy matches, yellow represents Medium accuracy matches, and red represents Low accuracy/inaccurate matches. If you have many yellow and red values in your data, you should manually check the results before proceeding with analysis. All possible values:

Subaddress: A subset of a PointAddress that represents a house or building subaddress location, such as an apartment unit, floor, or individual building within a complex. The UnitName, UnitType, LevelName, LevelType, BldgName, and BldgType field values help to distinguish subaddresses which may be associated with the same PointAddress. Reference data consists of point features with associated house number, street name, and subaddress elements, along with administrative divisions and optional postal code; for example, 3836 Emerald Ave, Suite C, La Verne, CA, 91750.

PointAddress: A street address based on points that represent house and building locations. Typically, this is the most spatially accurate match level. Reference data contains address points with associated house numbers and street names, along with administrative divisions and optional postal code. The X / Y (Longitude/Latitude) and geometry output values for a PointAddress match represent the street entry location for the address; this is the location used for routing operations. The DisplayX and DisplayY values represent the rooftop, or actual, location of the address. Example: 380 New York St, Redlands, CA, 92373.

StreetAddress: A street address that differs from PointAddress because the house number is interpolated from a range of numbers. Reference data contains street center lines with house number ranges, along with administrative divisions and optional postal code information, for example, 647 Haight St, San Francisco, CA, 94117.

StreetInt: A street address consisting of a street intersection along with city and optional state and postal code information. This is derived from StreetAddress reference data, for example, Redlands Blvd & New York St, Redlands, CA, 92373.

StreetName: Similar to a street address but without the house number. Reference data contains street centerlines with associated street names (no numbered address ranges), along with administrative divisions and optional postal code, for example, W Olive Ave, Redlands, CA, 92373.

StreetAddressExt: An interpolated street address match that is returned when parameter matchOutOfRange=true and the input house number exceeds the house number range for the matched street segment.

DistanceMarker: A street address that represents the linear distance along a street, typically in kilometers or miles, from a designated origin location. Example: Carr 682 KM 4, Barceloneta, 00617.

PostalExt: A postal code with an additional extension, such as the United States Postal Service ZIP+4. Reference data is postal code points with extensions, for example, 90210-3841.

POI: Points of interest. Reference data consists of administrative division place-names, businesses, landmarks, and geographic features, for example, Golden Gate Bridge.

Locality: A place-name representing a populated place. The Type output field provides more detailed information about the type of populated place. Possible Type values for Locality matches include Block, Sector, Neighborhood, District, City, MetroArea, County, State or Province, Territory, Country, and Zone. Example: Bogotá, COL.

PostalLoc: A combination of postal code and city name. Reference data is typically a union of postal boundaries and administrative (locality) boundaries, for example, 7132 Frauenkirchen.

Postal: Postal code. Reference data is postal code points, for example, 90210 USA.
Score: A number from 1–100 indicating the degree to which the input tokens in a geocoding request match the address components in a candidate record. A score of 100 represents a perfect match, while lower scores represent decreasing match accuracy.

Status: Indicates whether a batch geocode request results in a match, tie, or unmatched. Possible values include:

M - Match. The returned address matches the input address and is the highest scoring candidate.

T - Tied. The returned address matches the input address but has the same score as one or more additional candidates.

U - Unmatched. No addresses match the inputted address.
geometry: The WKT (well-known text) representation of the latitude and longitude. This column may be useful if you're reading the CSV into R, Python, or ArcGIS.

Region: The state that Match_addr is located in.

RegionAbbr: The abbreviated state name. For example, CA for California.

Subregion: The county that the input address is located in.

MetroArea: The name of the metropolitan area that Match_addr is located in. This field may be blank if the input address is not located within a metro area.

City: The city that Match_addr is located in.

Nbrhd: The neighborhood that Match_addr is located in. Note these are ESRI-defined neighborhoods, which may or may not align with other sources' neighborhood definitions.


Geospatial Modeling

Coming soon!

Bibliography and references


sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.5.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] mapview_2.11.0      geofacet_0.2.0      urbnthemes_0.0.2   
 [4] urbnmapr_0.0.0.9002 sf_1.0-9            here_1.0.1         
 [7] kableExtra_1.3.4    knitr_1.40          forcats_0.5.2      
[10] stringr_1.4.1       dplyr_1.0.10        purrr_0.3.5        
[13] readr_2.1.3         tidyr_1.2.1         tibble_3.1.8       
[16] ggplot2_3.4.0       tidyverse_1.3.2    

loaded via a namespace (and not attached):
  [1] leafem_0.2.0            googledrive_2.0.0       colorspace_2.0-3       
  [4] ellipsis_0.3.2          class_7.3-20            leaflet_2.1.1          
  [7] rprojroot_2.0.3         satellite_1.0.4         snakecase_0.11.0       
 [10] base64enc_0.1-3         fs_1.5.2                rstudioapi_0.14        
 [13] proxy_0.4-27            farver_2.1.1            ggrepel_0.9.2          
 [16] bit64_4.0.5             fansi_1.0.3             lubridate_1.9.0        
 [19] xml2_1.3.3              codetools_0.2-18        extrafont_0.18         
 [22] jsonlite_1.8.3          broom_1.0.1             Rttf2pt1_1.3.11        
 [25] dbplyr_2.2.1            png_0.1-7               rgeos_0.6-1            
 [28] compiler_4.2.2          httr_1.4.4              backports_1.4.1        
 [31] assertthat_0.2.1        fastmap_1.1.0           gargle_1.2.1           
 [34] cli_3.4.1               leaflet.providers_1.9.0 htmltools_0.5.4        
 [37] tools_4.2.2             gtable_0.3.1            glue_1.6.2             
 [40] Rcpp_1.0.9              leaflet.extras2_1.2.0   raster_3.6-14          
 [43] cellranger_1.1.0        imguR_1.0.3             vctrs_0.5.1            
 [46] svglite_2.1.1           extrafontdb_1.0         crosstalk_1.2.0        
 [49] xfun_0.34               rvest_1.0.3             timechange_0.1.1       
 [52] lifecycle_1.0.3         renv_0.16.0             googlesheets4_1.0.1    
 [55] terra_1.7-3             scales_1.2.1            vroom_1.6.0            
 [58] clisymbols_1.2.0        hms_1.1.2               parallel_4.2.2         
 [61] RColorBrewer_1.1-3      yaml_2.3.6              curl_4.3.3             
 [64] gridExtra_2.3           stringi_1.7.8           leafpop_0.1.0          
 [67] e1071_1.7-12            rlang_1.0.6             pkgconfig_2.0.3        
 [70] systemfonts_1.0.4       geogrid_0.1.1           evaluate_0.18          
 [73] lattice_0.20-45         htmlwidgets_1.6.1       labeling_0.4.2         
 [76] bit_4.0.5               tidyselect_1.2.0        magrittr_2.0.3         
 [79] R6_2.5.1                generics_0.1.3          DBI_1.1.3              
 [82] pillar_1.8.1            haven_2.5.1             withr_2.5.0            
 [85] units_0.8-0             sp_1.6-0                janitor_2.2.0          
 [88] modelr_0.1.10           crayon_1.5.2            uuid_1.1-0             
 [91] KernSmooth_2.23-20      utf8_1.2.2              tzdb_0.3.0             
 [94] rmarkdown_2.18          jpeg_0.1-9              rnaturalearth_0.3.2    
 [97] grid_4.2.2              readxl_1.4.1            reprex_2.0.2           
[100] digest_0.6.30           classInt_0.4-8          webshot_0.5.4          
[103] brew_1.0-8              stats4_4.2.2            munsell_0.5.0          
[106] viridisLite_0.4.1       tidylog_1.0.2