14  County FIPS Crosswalk

The county FIPS crosswalk maps the geocoder’s dirty county labels to the canonical Census county identity — a stable FIPS GEOID plus the official NAMELSAD name — so downstream consumers can canonicalize county names and filter by a collision-proof key.

14.1 Motivation

The Urban Institute geocoder returns a free-text county name (geo_county, from the geocoder’s Subregion field) and a coordinate, but no FIPS code. Joining or filtering on the raw name is fragile:

  • Name collisions. A bare label like Baltimore could mean Baltimore city (FIPS 24510) or Baltimore County (24005) — two distinct county-equivalents. Filtering by name silently merges or misroutes them; filtering by FIPS cannot.
  • Spelling / formatting drift. St. Louis vs St Louis, City of Alexandria vs Alexandria city, accented Juana Díaz vs Juana Díaz Municipio.
  • No stable join key for attaching county-level data (ACS, BLS, rural-urban codes, etc.), all of which key on FIPS.

Rather than bake FIPS columns into the Master BMF — which would pin every consumer to one TIGER vintage and bloat the file (see ADR 0016 / 12) — we publish a separate, small crosswalk that consumers join themselves. County boundaries are stable enough that a single crosswalk artifact is the safe exception to the “don’t expose vintaged geography” rule.

14.2 What it is not

This crosswalk does not overwrite geo_county in the Master BMF (sector-in-brief reads that field verbatim), and it does not add geography columns to the master. It is an optional, consumer-composed join layer. It carries county identity only — no tract, block, place, or ZCTA.

Roll up to metro areas? The county FIPS produced here is the join key for the CBSA Crosswalk, which maps each county to its metropolitan / micropolitan statistical area.

14.3 Schema

One row per distinct (geo_state_abbr, geo_county_raw) pair, so a consumer left_join never fans out.

| Column | Type | Description |
|---|---|---|
| geo_state_abbr | chr | USPS state / territory abbreviation (join key) |
| geo_county_raw | chr | The dirty geocoder county label, verbatim (join key) |
| geo_county_fips | chr | 5-char county GEOID, leading zeros preserved (NA if unresolved) |
| state_fips | chr | 2-char state FIPS, leading zeros preserved |
| geo_county_canonical | chr | Census NAMELSAD (e.g. Baltimore County, Alexandria city) |
| resolution | chr | One of: resolved, ambiguous, unresolved, deferred_ct_planning_region |
| tiger_year | int | TIGER/Census boundary vintage the FIPS/name come from (2023) |

The FIPS columns are strings, not integers — leading zeros (01001 Autauga AL, 06037 Los Angeles CA, 09xxx CT) are significant and must be preserved on read.
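The two invariants above (unique join keys, zero-padded string FIPS) are cheap to assert after reading the file. The following is a minimal base-R sketch on a toy stand-in for the crosswalk; `check_crosswalk` is a hypothetical helper, not something this repo is known to ship.

```r
# Toy stand-in for the published crosswalk (column names follow the schema).
xwalk <- data.frame(
  geo_state_abbr  = c("AL", "CA", "MD"),
  geo_county_raw  = c("Autauga", "Los Angeles", "Baltimore"),
  geo_county_fips = c("01001", "06037", NA),
  state_fips      = c("01", "06", "24"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: stop early if a schema invariant is violated.
check_crosswalk <- function(x) {
  keys <- x[, c("geo_state_abbr", "geo_county_raw")]
  fips <- x$geo_county_fips[!is.na(x$geo_county_fips)]
  stopifnot(
    "join keys must be unique"       = !any(duplicated(keys)),
    "FIPS columns must be character" = is.character(x$geo_county_fips) &&
                                       is.character(x$state_fips),
    "county FIPS must be 5 digits"   = all(grepl("^[0-9]{5}$", fips)),
    "state FIPS must be 2 digits"    = all(grepl("^[0-9]{2}$", x$state_fips))
  )
  invisible(TRUE)
}

check_crosswalk(xwalk)

# If an upstream step has already coerced a FIPS code to integer, re-pad it:
sprintf("%05d", 1001L)  # "01001"
```

Running the check immediately after the read catches a lossy round-trip (for example through a CSV reader that guesses integer columns) before any join depends on the keys.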

14.4 How it is built

The build is intentionally isolated from the pipeline runtime: sf and tigris are used only here, never in a pipeline run.

scripts/read_county_points.R    # single S3 read -> local cache
scripts/build_county_fips_crosswalk.R
  1. One read of s3://nccsdata/geocoding/bmf-master/merged/bmf_master_geocoded.parquet pulls distinct (geo_state_abbr, geo_county) plus a ~1 km-gridded spread of representative points, weighted by org count per cell, and caches them locally (data/crosswalks/_county_points_cache.parquet).

  2. tigris::counties(cb = TRUE, year = 2023) provides the county polygons and gazetteer (50 states + DC + PR + the island areas).

  3. Each cached point is assigned a county by st_within (point-in-polygon).

  4. Each (state, raw label) group is resolved by name match against the TIGER gazetteer, corroborated by org-mass:

    • Name match is authoritative for canonicalization. The normalized label (de-accented, City of X → X city, Saint → St, trailing County/Parish/Borough/Municipio stripped, city kept) is matched to the in-state gazetteer. A unique match resolves.
    • Org mass is the fallback and the corroborator. Votes are weighted by how many orgs carry the label in each cell — not by distinct cells — so a thin sliver of border-spillover cannot outvote the county core. When the name does not match, a county holding ≥ 90 % of in-state org mass resolves the label.
    • Genuine ambiguity is surfaced, never guessed. A bare label that could mean either an independent city or its namesake county (Baltimore, Richmond, St. Louis), or a name that maps to two in-state counties, is left ambiguous with geo_county_fips = NA.
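The normalization rules listed above can be sketched in base R as follows. This is an illustration under assumptions, not the build script's code: `normalize_county_label` is a hypothetical name, the accent map covers only a few characters, and lower-casing and dropping periods are extra assumptions made so that St. Louis and St Louis compare equal. The independent-city ambiguity check for bare labels is not reproduced here.

```r
# Hypothetical normalizer following the rules described in the text.
normalize_county_label <- function(x) {
  # De-accent with a small illustrative map (not exhaustive).
  x <- chartr("\u00e1\u00e9\u00ed\u00f3\u00fa\u00fc\u00f1", "aeiouun", x)
  x <- tolower(trimws(x))
  x <- gsub(".", "", x, fixed = TRUE)                     # "st. louis" -> "st louis"
  x <- sub("^city of (.+)$", "\\1 city", x)               # "city of x" -> "x city"
  x <- sub("^saint ", "st ", x)                           # "saint x"   -> "st x"
  x <- sub(" (county|parish|borough|municipio)$", "", x)  # "city" is deliberately kept
  gsub(" +", " ", x)
}

normalize_county_label(c("City of Alexandria", "Alexandria city"))
normalize_county_label(c("St. Louis", "Saint Louis", "St Louis"))

# Toy in-state gazetteer: a label resolves only when exactly one row matches.
gaz <- data.frame(
  geoid    = c("51510", "51013"),
  namelsad = c("Alexandria city", "Arlington County"),
  stringsAsFactors = FALSE
)
gaz$key <- normalize_county_label(gaz$namelsad)
hits <- gaz$geoid[gaz$key == normalize_county_label("City of Alexandria")]
hits  # one hit, so the label resolves
```

Normalizing both sides with the same function keeps the comparison symmetric: the gazetteer's NAMELSAD and the geocoder's raw label are reduced to the same key space before matching.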

Anything not cleanly resolved is written to data/crosswalks/county_fips_crosswalk_audit.csv with the candidate GEOIDs and their mass shares — evidence, not a verdict.
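The org-mass fallback and the mass shares recorded in the audit file can be illustrated with a small base-R sketch. The function names (`mass_shares`, `resolve_by_mass`) and the toy vote table are hypothetical; only the weighting by org count and the 90 % threshold come from the description above.

```r
# Toy votes for one (state, raw label) group: one row per grid cell, with the
# county the cell fell in and the number of orgs carrying the label in that cell.
votes <- data.frame(
  geoid  = c("06037", "06037", "06059", "06059", "06059"),
  n_orgs = c(940L, 40L, 10L, 5L, 5L),
  stringsAsFactors = FALSE
)

# Share of org mass per candidate county (weights are orgs, not cells).
mass_shares <- function(votes) {
  mass <- tapply(votes$n_orgs, votes$geoid, sum)
  out <- data.frame(
    geoid = names(mass),
    share = as.numeric(mass) / sum(mass),
    stringsAsFactors = FALSE
  )
  out[order(-out$share), ]
}

# Resolve only when the leading county clears the threshold; otherwise NA.
resolve_by_mass <- function(votes, threshold = 0.90) {
  s <- mass_shares(votes)
  if (s$share[1] >= threshold) s$geoid[1] else NA_character_
}

mass_shares(votes)      # 06037 holds 980 of 1000 orgs
resolve_by_mass(votes)  # resolves to 06037 even though 06059 has more cells (3 vs 2)
```

In this toy group a cell-count vote would favor 06059, which is the border-spillover failure the org weighting is meant to avoid. A group where no county reaches the threshold returns NA and would be left for the audit file.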

# Rebuild (creds exported for the one-time read):
eval "$(aws configure export-credentials --profile thiya --format env)"
Rscript scripts/read_county_points.R
Rscript scripts/build_county_fips_crosswalk.R   # TIGER_YEAR=2023 by default

14.5 Coverage (TIGER 2023)

Of 3,635 distinct (state, raw county) pairs:

| Resolution | Count | What it means |
|---|---|---|
| resolved | 3,615 | Unique FIPS + canonical name attached |
| deferred_ct_planning_region | 8 | CT old-county labels; resolve by coordinate via the companion |
| ambiguous | 8 | Cannot be canonicalized to one FIPS (see below) |
| unresolved | 4 | Label names a county that is not in the org’s state (source error) |

The 8 ambiguous labels are all genuine, not artifacts:

  • Independent-city / namesake-county collisions: MD Baltimore; MO St Louis and St. Louis; VA Fairfax, Richmond, and Roanoke. A bare label cannot be assigned to the city or the county with certainty.
  • Abolished Alaska census area: Valdez-Cordova (split into Chugach 02063 and Copper River 02066 in 2019).
  • American Samoa: West maps to two districts.

Connecticut is handled separately. In 2022 Census retired CT’s eight historical counties for nine planning regions (09110–09190), so an old <name> County label (Fairfield, Hartford, … all eight) genuinely spans multiple new GEOIDs and cannot be canonicalized at the (state, county) grain. Rather than guess (or, worse, let org-mass silently collapse the counties that happen to clear the 90 % threshold onto a single region), every CT <name> County label is marked deferred_ct_planning_region and routed to the coordinate-keyed CT planning-region companion. (Bare CT town labels such as Hartford or New Haven sit in exactly one region and still resolve normally.)

The 4 unresolved labels (CT Westchester, DE Dorchester, ID Whitman, PA Garrett) name counties belonging to a different state than the org’s geo_state_abbr — upstream data errors, each covering a handful of rows.

14.6 How to use it

The crosswalk is published to s3://nccsdata/crosswalks/county-fips/ (a _manifest.json alongside records the row count, sha256, and tiger_year). Consumers join it themselves; nothing in this repo or the Master BMF depends on it.

1 — Canonicalize names and attach FIPS. Left-join onto a geocoded table on the two raw keys:

library(dplyr); library(arrow)
xwalk <- read_parquet("county_fips_crosswalk.parquet")  # FIPS cols are chr

bmf_geo |>
  left_join(xwalk, by = c("geo_state_abbr", "geo_county" = "geo_county_raw")) |>
  # geo_county_fips / geo_county_canonical now attached; NA where ambiguous
  mutate(geo_county_display = coalesce(geo_county_canonical, geo_county))

2 — Filter by FIPS, pulling the codes from the crosswalk (never hardcode them). This is collision-proof where names are not:

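# Baltimore *County* only (not Baltimore city). Filter on state as well,
# because canonical county names can repeat across states:
balt_co <- xwalk |>
  filter(geo_state_abbr == "MD", geo_county_canonical == "Baltimore County") |>
  pull(geo_county_fips) |>
  unique()                           # "24005"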

bmf_geo |>
  left_join(xwalk, by = c("geo_state_abbr", "geo_county" = "geo_county_raw")) |>
  filter(geo_county_fips %in% balt_co)

3 — Get the canonical county list for a state (e.g. to populate a dropdown) straight from the crosswalk’s resolved rows.
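A base-R sketch of item 3 follows, on a toy crosswalk with invented rows (with the published file, `xwalk` would come from `read_parquet` instead). Several raw labels can share one canonical county, so the result is de-duplicated.

```r
# Toy crosswalk rows; the real file has the same column names.
xwalk <- data.frame(
  geo_state_abbr       = c("MD", "MD", "MD", "MD", "VA"),
  geo_county_raw       = c("Baltimore County", "Baltimore city",
                           "City of Baltimore", "Baltimore", "Alexandria city"),
  geo_county_fips      = c("24005", "24510", "24510", NA, "51510"),
  geo_county_canonical = c("Baltimore County", "Baltimore city",
                           "Baltimore city", NA, "Alexandria city"),
  resolution           = c("resolved", "resolved", "resolved",
                           "ambiguous", "resolved"),
  stringsAsFactors = FALSE
)

# Resolved rows for one state, reduced to distinct (FIPS, canonical name) pairs.
md <- subset(xwalk,
             geo_state_abbr == "MD" & resolution == "resolved",
             select = c(geo_county_fips, geo_county_canonical))
md_counties <- unique(md)
md_counties <- md_counties[order(md_counties$geo_county_fips), ]
md_counties
```

Keeping the FIPS next to the display name lets a dropdown show the canonical name while submitting the collision-proof code.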

Because the join is on the raw label, rows whose label is ambiguous, unresolved, or deferred_ct_planning_region receive NA FIPS — handle them explicitly (the coalesce above keeps the original label as a display fallback) rather than letting them silently drop. For the CT rows specifically, recover the planning region by coordinate via the CT planning-region companion.
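One way to make that handling explicit is to split the joined table by why the FIPS is missing. The sketch below uses base R and toy data; `org_id` and the `not_in_crosswalk` tag are hypothetical names introduced only for illustration.

```r
# Toy geocoded table and crosswalk (invented rows, schema column names).
bmf_geo <- data.frame(
  org_id         = c("A", "B", "C", "D"),   # hypothetical identifier
  geo_state_abbr = c("MD", "MD", "CT", "TX"),
  geo_county     = c("Baltimore County", "Baltimore",
                     "Hartford County", "Not A County"),
  stringsAsFactors = FALSE
)
xwalk <- data.frame(
  geo_state_abbr  = c("MD", "MD", "CT"),
  geo_county_raw  = c("Baltimore County", "Baltimore", "Hartford County"),
  geo_county_fips = c("24005", NA, NA),
  resolution      = c("resolved", "ambiguous", "deferred_ct_planning_region"),
  stringsAsFactors = FALSE
)

# Left join on the two raw keys (base-R equivalent of the left_join above).
joined <- merge(bmf_geo, xwalk,
                by.x = c("geo_state_abbr", "geo_county"),
                by.y = c("geo_state_abbr", "geo_county_raw"),
                all.x = TRUE)
stopifnot(nrow(joined) == nrow(bmf_geo))  # unique keys: no fan-out, no drops

# Labels absent from the crosswalk come back with NA resolution; tag them
# so they are not confused with known-ambiguous labels.
joined$resolution[is.na(joined$resolution)] <- "not_in_crosswalk"

needs_review <- joined[is.na(joined$geo_county_fips), ]
table(needs_review$resolution)
```

A label missing from the crosswalk altogether is a signal that the geocoded Master BMF has gained new county labels, which is one of the rebuild triggers.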

14.7 Maintenance

The crosswalk is keyed to a TIGER vintage (tiger_year, currently 2023). Rebuild and re-publish when either the geocoded Master BMF gains materially new county labels or a newer TIGER vintage changes county boundaries. Re-publishing is idempotent — R/publish_county_fips_crosswalk.R compares the parquet’s sha256 against the remote _manifest.json and skips an unchanged file.
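The skip-if-unchanged decision can be sketched as a small pure function. This is an assumption-laden illustration, not the contents of R/publish_county_fips_crosswalk.R: `needs_publish` is a hypothetical name, the manifest is assumed to parse into a list with a `sha256` field, and the hash strings are placeholders rather than real digests.

```r
# Hypothetical helper: publish only when the local hash differs from the
# hash recorded in the remote manifest (or when no manifest exists yet).
needs_publish <- function(local_sha256, remote_manifest) {
  if (is.null(remote_manifest)) {
    return(TRUE)  # nothing published yet
  }
  !identical(tolower(local_sha256), tolower(remote_manifest$sha256))
}

# Placeholder manifest; the field names are assumptions based on the prose.
manifest <- list(rows = 3635L, sha256 = "placeholder-hash-1", tiger_year = 2023L)

needs_publish("placeholder-hash-1", manifest)  # FALSE: unchanged, skip upload
needs_publish("placeholder-hash-2", manifest)  # TRUE: content changed
needs_publish("placeholder-hash-1", NULL)      # TRUE: first publish
```

Comparing content hashes rather than timestamps is what makes a re-run safe: rebuilding the same parquet bytes produces the same digest and therefore no upload.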