14 County FIPS Crosswalk
The county FIPS crosswalk maps the geocoder’s dirty county labels to the canonical Census county identity — a stable FIPS GEOID plus the official NAMELSAD name — so downstream consumers can canonicalize county names and filter by a collision-proof key.
14.1 Motivation
The Urban Institute geocoder returns a free-text county name (geo_county, from the geocoder’s Subregion field) and a coordinate, but no FIPS code. Joining or filtering on the raw name is fragile:
- Name collisions. A bare label like `Baltimore` could mean Baltimore city (FIPS `24510`) or Baltimore County (`24005`), two distinct county-equivalents. Filtering by name silently merges or misroutes them; filtering by FIPS cannot.
- Spelling / formatting drift. `St. Louis` vs `St Louis`, `City of Alexandria` vs `Alexandria city`, accented `Juana Díaz` vs `Juana Díaz Municipio`.
- No stable join key for attaching county-level data (ACS, BLS, rural-urban codes, etc.), all of which key on FIPS.
Rather than bake FIPS columns into the Master BMF — which would pin every consumer to one TIGER vintage and bloat the file (see ADR 0016 / 12) — we publish a separate, small crosswalk that consumers join themselves. County boundaries are stable enough that a single crosswalk artifact is the safe exception to the “don’t expose vintaged geography” rule.
14.2 What it is not
This crosswalk does not overwrite geo_county in the Master BMF (sector-in-brief reads that field verbatim), and it does not add geography columns to the master. It is an optional, consumer-composed join layer. It carries county identity only — no tract, block, place, or ZCTA.
Roll up to metro areas? The county FIPS produced here is the join key for the CBSA Crosswalk, which maps each county to its metropolitan / micropolitan statistical area.
14.3 Schema
One row per distinct (geo_state_abbr, geo_county_raw) pair, so a consumer left_join never fans out.
| Column | Type | Description |
|---|---|---|
| geo_state_abbr | chr | USPS state / territory abbreviation (join key) |
| geo_county_raw | chr | The dirty geocoder county label, verbatim (join key) |
| geo_county_fips | chr | 5-char county GEOID, leading zeros preserved (NA if unresolved) |
| state_fips | chr | 2-char state FIPS, leading zeros preserved |
| geo_county_canonical | chr | Census NAMELSAD (e.g. Baltimore County, Alexandria city) |
| resolution | chr | One of resolved, ambiguous, unresolved, deferred_ct_planning_region |
| tiger_year | int | TIGER/Census boundary vintage the FIPS/name come from (2023) |
The FIPS columns are strings, not integers: leading zeros (`01001` Autauga AL, `06037` Los Angeles CA, `09xxx` CT) are significant and must be preserved on read.
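The two schema guarantees above (unique join keys, FIPS held as zero-padded strings) can be checked on read. The following is a minimal sketch in base R; the toy rows and the helper name `check_xwalk` are illustrative assumptions, not part of the published artifact or the build scripts.

```r
# Toy crosswalk rows (illustrative only; real rows come from the parquet).
xwalk <- data.frame(
  geo_state_abbr  = c("AL", "CA", "MD"),
  geo_county_raw  = c("Autauga", "Los Angeles", "Baltimore"),
  geo_county_fips = c("01001", "06037", NA),
  state_fips      = c("01", "06", "24"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE when the join keys are unique and the FIPS
# columns are character strings of the expected width (NA is allowed).
check_xwalk <- function(x) {
  keys_unique <- !any(duplicated(x[c("geo_state_abbr", "geo_county_raw")]))
  fips_ok <- is.character(x$geo_county_fips) &&
    all(is.na(x$geo_county_fips) | grepl("^[0-9]{5}$", x$geo_county_fips))
  state_ok <- is.character(x$state_fips) &&
    all(is.na(x$state_fips) | grepl("^[0-9]{2}$", x$state_fips))
  keys_unique && fips_ok && state_ok
}

check_xwalk(xwalk)

# Coercing a FIPS to integer drops the leading zero ...
as.integer("01001")
# ... and left-padding back to width 5 repairs an accidental numeric read.
sprintf("%05d", 1001L)
```

A duplicated `(geo_state_abbr, geo_county_raw)` pair would make a consumer's left join fan out, so a check like this is worth running before joining.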
14.4 How it is built
The build is intentionally isolated from the pipeline runtime: sf and tigris are used only here, never in a pipeline run.
scripts/read_county_points.R # single S3 read -> local cache
scripts/build_county_fips_crosswalk.R
- One read of `s3://nccsdata/geocoding/bmf-master/merged/bmf_master_geocoded.parquet` pulls distinct `(geo_state_abbr, geo_county)` plus a ~1 km-gridded spread of representative points, weighted by org count per cell, and caches them locally (`data/crosswalks/_county_points_cache.parquet`).
- `tigris::counties(cb = TRUE, year = 2023)` provides the county polygons and gazetteer (50 states + DC + PR + the island areas).
- Each cached point is assigned a county by `st_within` (point-in-polygon).
- Each `(state, raw label)` group is resolved by name match against the TIGER gazetteer, corroborated by org-mass:
  - Name match is authoritative for canonicalization. The normalized label (de-accented, `City of X` → `X city`, `Saint` → `St`, trailing `County` / `Parish` / `Borough` / `Municipio` stripped, `city` kept) is matched to the in-state gazetteer. A unique match resolves.
  - Org mass is the fallback and the corroborator. Votes are weighted by how many orgs carry the label in each cell, not by distinct cells, so a thin sliver of border-spillover cannot outvote the county core. When the name does not match, a county holding ≥ 90 % of in-state org mass resolves the label.
  - Genuine ambiguity is surfaced, never guessed. A bare label that could mean either an independent city or its namesake county (`Baltimore`, `Richmond`, `St. Louis`), or a name that maps to two in-state counties, is left `ambiguous` with `geo_county_fips = NA`.
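The label normalization described in the name-match rule can be sketched in base R as follows. This is an illustration of the listed rules only, not the code in `scripts/build_county_fips_crosswalk.R`: the function name `normalize_county_label`, the explicit accent table, and the final lower-casing are assumptions.

```r
# Hypothetical normaliser following the rules listed above.
normalize_county_label <- function(x) {
  # De-accent a small, explicit set of characters (written as \u escapes so
  # the source file's encoding does not matter); extend as needed.
  x <- chartr(
    "\u00e1\u00e9\u00ed\u00f3\u00fa\u00fc\u00f1\u00c1\u00c9\u00cd\u00d3\u00da\u00dc\u00d1",
    "aeiouunAEIOUUN", x
  )
  x <- trimws(gsub("\\s+", " ", x, perl = TRUE))
  # "City of X" -> "X city"
  x <- sub("^City of (.+)$", "\\1 city", x, ignore.case = TRUE, perl = TRUE)
  # "Saint" -> "St", and "St." -> "St"
  x <- gsub("\\bSaint\\b", "St", x, ignore.case = TRUE, perl = TRUE)
  x <- gsub("\\bSt\\.", "St", x, perl = TRUE)
  # Strip a trailing County / Parish / Borough / Municipio; "city" is kept.
  x <- sub("\\s+(County|Parish|Borough|Municipio)$", "", x,
           ignore.case = TRUE, perl = TRUE)
  tolower(x)  # assumption: matching is case-insensitive
}

# Drifted spellings collapse to one key; city and county stay distinct.
normalize_county_label(c("St. Louis", "St Louis", "Saint Louis"))
normalize_county_label(c("City of Alexandria", "Alexandria city"))
normalize_county_label(c("Juana D\u00edaz Municipio", "Juana Diaz"))
normalize_county_label(c("Baltimore County", "Baltimore city"))
```

Keeping `city` while stripping `County` is what lets an explicit `Baltimore city` label resolve while a bare `Baltimore` stays ambiguous.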
Anything not cleanly resolved is written to data/crosswalks/county_fips_crosswalk_audit.csv with the candidate GEOIDs and their mass shares — evidence, not a verdict.
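The org-mass fallback and the mass shares recorded in the audit file might look roughly like the sketch below. It assumes a per-label table with one row per grid cell; the helper name `resolve_by_mass`, the column names `geoid` and `n_orgs`, and the toy counts are all hypothetical.

```r
# Hypothetical helper: resolve one (state, raw label) group by org mass.
# `votes` has one row per grid cell: the county GEOID the cell's point fell
# in, and how many orgs carry the label in that cell.
resolve_by_mass <- function(votes, threshold = 0.90) {
  mass  <- tapply(votes$n_orgs, votes$geoid, sum)  # weight by orgs, not cells
  share <- mass / sum(mass)
  top   <- which.max(share)
  resolved <- share[[top]] >= threshold
  list(
    resolved        = resolved,
    geo_county_fips = if (resolved) names(share)[top] else NA_character_,
    shares          = share  # candidate GEOIDs and their mass shares
  )
}

# Toy label: two core cells in one county, one thin spillover cell next door.
votes <- data.frame(
  geoid  = c("06037", "06037", "06059"),
  n_orgs = c(120, 75, 5)
)
res <- resolve_by_mass(votes)
res$geo_county_fips
res$shares
```

Counting cells instead of orgs would give the core county only two of three votes here, which is below the threshold; weighting by org count gives it 195 of 200.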
# Rebuild (creds exported for the one-time read):
eval "$(aws configure export-credentials --profile thiya --format env)"
Rscript scripts/read_county_points.R
Rscript scripts/build_county_fips_crosswalk.R # TIGER_YEAR=2023 by default
14.5 Coverage (TIGER 2023)
Of 3,635 distinct (state, raw county) pairs:
| Resolution | Count | What it means |
|---|---|---|
| resolved | 3,615 | Unique FIPS + canonical name attached |
| deferred_ct_planning_region | 8 | CT old-county labels; resolve by coordinate via the companion |
| ambiguous | 8 | Cannot be canonicalized to one FIPS (see below) |
| unresolved | 4 | Label names a county that is not in the org’s state (source error) |
The 8 ambiguous labels are all genuine, not artifacts:
- Independent-city / namesake-county collisions: `MD Baltimore`, `MO St Louis` / `St. Louis`, `VA Fairfax` / `Richmond` / `Roanoke`. A bare label cannot be assigned to the city or the county with certainty.
- Abolished Alaska census area: `Valdez-Cordova` (split into Chugach `02063` and Copper River `02066` in 2019).
- American Samoa: `West` maps to two districts.
Connecticut is handled separately. In 2022 Census retired CT’s eight historical counties for nine planning regions (09110–09190), so an old <name> County label (Fairfield, Hartford, … all eight) genuinely spans multiple new GEOIDs and cannot be canonicalized at the (state, county) grain. Rather than guess — or, worse, let org-mass silently collapse the counties that happen to clear the 90 % threshold onto a single region — every CT <name> County label is marked deferred_ct_planning_region and routed to the coordinate-keyed CT planning-region companion. (Bare CT town labels such as Hartford or New Haven sit in exactly one region and still resolve normally.)
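The Connecticut rule can be expressed as a simple label test. This is a sketch under the assumption that the deferral is decided from the label text alone; the helper name `is_ct_deferred` is hypothetical and the actual build script may implement the rule differently.

```r
# The eight historical Connecticut counties retired by Census in 2022.
ct_old_counties <- c("Fairfield", "Hartford", "Litchfield", "Middlesex",
                     "New Haven", "New London", "Tolland", "Windham")

# Hypothetical helper: TRUE for a CT "<name> County" label, FALSE for bare
# town labels such as "Hartford", which still resolve normally.
is_ct_deferred <- function(state, label) {
  label      <- trimws(label)
  has_suffix <- grepl("\\s+County$", label, ignore.case = TRUE)
  base       <- sub("\\s+County$", "", label, ignore.case = TRUE)
  state == "CT" & has_suffix & base %in% ct_old_counties
}

is_ct_deferred("CT", c("Hartford County", "Hartford", "New Haven County"))
is_ct_deferred("MA", "Middlesex County")
```

Checking the state as well as the name matters because several of these county names (Middlesex, for example) also exist outside Connecticut.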
The 4 unresolved labels (CT Westchester, DE Dorchester, ID Whitman, PA Garrett) name counties belonging to a different state than the org’s geo_state_abbr — upstream data errors, each covering a handful of rows.
14.6 How to use it
The crosswalk is published to s3://nccsdata/crosswalks/county-fips/ (a _manifest.json alongside records the row count, sha256, and tiger_year). Consumers join it themselves; nothing in this repo or the Master BMF depends on it.
1. Canonicalize names and attach FIPS. Left-join onto a geocoded table on the two raw keys:
library(dplyr); library(arrow)
xwalk <- read_parquet("county_fips_crosswalk.parquet") # FIPS cols are chr
# bmf_geo: your geocoded table, carrying geo_state_abbr and geo_county
bmf_geo |>
  left_join(xwalk, by = c("geo_state_abbr", "geo_county" = "geo_county_raw")) |>
  # geo_county_fips / geo_county_canonical now attached; NA where ambiguous
  mutate(geo_county_display = coalesce(geo_county_canonical, geo_county))
2. Filter by FIPS, pulling the codes from the crosswalk (never hardcode them). This is collision-proof where names are not:
# Baltimore *County* only (not Baltimore city).
# Filter on state too: canonical names can repeat across states.
balt_co <- xwalk |>
  filter(geo_state_abbr == "MD", geo_county_canonical == "Baltimore County") |>
  pull(geo_county_fips) |>
  unique() # "24005"
bmf_geo |>
  left_join(xwalk, by = c("geo_state_abbr", "geo_county" = "geo_county_raw")) |>
  filter(geo_county_fips %in% balt_co)
3. Get the canonical county list for a state (e.g. to populate a dropdown) straight from the crosswalk’s resolved rows.
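The third use, building a canonical county list for one state, might be sketched like this. It is written in base R so it runs without extra packages; the helper name `counties_for_state` and the toy rows are illustrative assumptions.

```r
# Toy crosswalk rows (illustrative only; real rows come from the parquet).
xwalk <- data.frame(
  geo_state_abbr       = c("MD", "MD", "MD", "VA"),
  geo_county_raw       = c("Baltimore County", "Baltimore city",
                           "Baltimore", "Alexandria city"),
  geo_county_fips      = c("24005", "24510", NA, "51510"),
  geo_county_canonical = c("Baltimore County", "Baltimore city",
                           NA, "Alexandria city"),
  resolution           = c("resolved", "resolved", "ambiguous", "resolved"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one row per canonical county in a state, resolved only.
counties_for_state <- function(xwalk, state) {
  keep <- xwalk$geo_state_abbr == state & xwalk$resolution %in% "resolved"
  out  <- unique(xwalk[keep, c("geo_county_fips", "geo_county_canonical")])
  out  <- out[order(out$geo_county_canonical), ]
  rownames(out) <- NULL
  out
}

md <- counties_for_state(xwalk, "MD")
md
```

The `unique()` step matters because several raw labels can map to the same county, and a dropdown should list each county once, keyed by its FIPS.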
Because the join is on the raw label, rows whose label is ambiguous, unresolved, or deferred_ct_planning_region receive NA FIPS — handle them explicitly (the coalesce above keeps the original label as a display fallback) rather than letting them silently drop. For the CT rows specifically, recover the planning region by coordinate via the CT planning-region companion.
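One way to handle the NA-FIPS rows explicitly is to split the joined table instead of filtering it, so every row is accounted for. This is a minimal base R sketch; the toy table and its `org_id` column are hypothetical.

```r
# Toy joined table (illustrative rows): what a left join onto the crosswalk
# might yield for four orgs.
joined <- data.frame(
  org_id          = 1:4,
  geo_state_abbr  = c("MD", "MD", "CT", "CT"),
  geo_county      = c("Baltimore County", "Baltimore",
                      "Hartford County", "Westchester"),
  geo_county_fips = c("24005", NA, NA, NA),
  resolution      = c("resolved", "ambiguous",
                      "deferred_ct_planning_region", "unresolved"),
  stringsAsFactors = FALSE
)

# Split rather than filter, so rows without a FIPS are kept and counted.
has_fips   <- !is.na(joined$geo_county_fips)
keyed      <- joined[has_fips, ]
needs_work <- joined[!has_fips, ]

# Tally the unkeyed rows by reason.
table(needs_work$resolution, useNA = "ifany")

# CT rows go on to the coordinate-based companion; %in% avoids NA surprises
# for labels that are missing from the crosswalk entirely.
ct_rows <- needs_work[needs_work$resolution %in% "deferred_ct_planning_region", ]

# Nothing was dropped along the way.
stopifnot(nrow(keyed) + nrow(needs_work) == nrow(joined))
```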
14.7 Maintenance
The crosswalk is keyed to a TIGER vintage (tiger_year, currently 2023). Rebuild and re-publish when either the geocoded Master BMF gains materially new county labels or a newer TIGER vintage changes county boundaries. Re-publishing is idempotent — R/publish_county_fips_crosswalk.R compares the parquet’s sha256 against the remote _manifest.json and skips an unchanged file.
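The idempotency check can be illustrated with a local stand-in for the remote manifest. This is a sketch, not the contents of `R/publish_county_fips_crosswalk.R`: it assumes the `digest` and `jsonlite` packages are installed, that the manifest stores the hash under a key named `sha256`, and the helper name `needs_publish` is hypothetical.

```r
library(digest)    # sha256 of a file
library(jsonlite)  # read / write the manifest

# Hypothetical helper: TRUE when the local file's sha256 differs from the
# one recorded in the manifest, or when no manifest exists yet.
needs_publish <- function(parquet_path, manifest_path) {
  local_sha <- digest(parquet_path, algo = "sha256", file = TRUE)
  if (!file.exists(manifest_path)) return(TRUE)
  remote <- fromJSON(manifest_path)
  !identical(remote$sha256, local_sha)
}

# Demo with temp files standing in for the parquet and the remote manifest.
f <- tempfile(fileext = ".parquet")
writeLines("demo bytes", f)
m <- tempfile(fileext = ".json")

first_run <- needs_publish(f, m)   # no manifest yet, so publish

write_json(
  list(sha256 = digest(f, algo = "sha256", file = TRUE), tiger_year = 2023L),
  m, auto_unbox = TRUE
)
second_run <- needs_publish(f, m)  # hashes match, so skip

writeLines("changed bytes", f)
third_run <- needs_publish(f, m)   # content changed, so publish again
```

Comparing content hashes rather than timestamps means a rebuild that produces byte-identical output is skipped.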