17 NTEE-Resolved Crosswalk
The NTEE-resolved crosswalk is a per-EIN lookup that recovers a usable NTEE classification for organizations whose NTEE_CD is blank in the current IRS BMF but was populated in some earlier vintage. It consolidates the raw NTEE code observed for each EIN across every vintage of both pipelines (current monthly + legacy 501CX) and exposes several resolved views so the consumer — not this repo — picks the rule that fits their analysis.
Like the geography crosswalks, it is a separate join layer, not columns on the Master BMF (ADR 0016 consumer-composes).
17.1 Why it exists
The IRS periodically NULLs NTEE_CD for organizations that were previously classified. The motivating case is Carnegie Mellon University (EIN 25-0969449): NTEE_CD is empty in the current BMF but B43 (a university code) in the legacy NCCS files. After the NTEE cleaner fix (ADR 0032), CMU still resolves to UNU/Z99 in the current vintage — because the source code is genuinely absent. No cleaner can recover a code the IRS removed; only a cross-vintage lookup can.
Aggregating across vintages recovers a code for the large population of such EINs while leaving the current value visible and untouched, so a consumer can choose between “what the IRS says today” and “the best code we have ever seen for this org.”
17.2 What it is
One row per EIN (universe = any EIN that carries an NTEE in at least one vintage). Each row exposes four parallel resolutions plus provenance — “expose all, no opinionated pick”:
EIN ──ntee_resolved_crosswalk──▶ ntee_current (latest current vintage; may be NULL)
ntee_most_recent (newest non-null, any source)
ntee_modal (most-observed across vintages)
ntee_code_distribution (full per-code history, JSON)
Each resolution carries its _subsector (NTEEv2 sector, e.g. UNI, HOS, EDU) and _nteev2 (full NTEEv2 code), so “universities” is ntee_most_recent_subsector == 'UNI', not a substring guess.
17.3 Source & vintage
Built from ntee_code_raw in the all-columns intermediate parquets of both pipelines:
| Pattern | Source |
|---|---|
s3://nccsdata/intermediate/bmf/*/*.parquet |
current |
s3://nccsdata/intermediate/bmf-legacy/*/*.parquet |
legacy |
Raw codes are vintage-invariant — the verbatim IRS/NCCS source string, independent of which cleaner version processed that vintage — so no reprocessing of old vintages is required. The distinct raw codes are cleaned once through the current, fixed transform_ntee_code() (reused, never reimplemented), so the resolved codes always reflect the latest cleaner logic.
17.4 Schema
One row per ein. 18 columns.
| Column | Type | Description |
|---|---|---|
ein |
chr | Formatted EIN (XX-XXXXXXX) — join key |
ein_prefixed |
chr | Coercion-safe lowercase EIN key (ein-XX-XXXXXXX); bijective with ein (ADR 0036) |
EIN2 |
chr | Legacy-compatibility alias (EIN-XX-XXXXXXX); same key, legacy ecosystem format (ADR 0036) |
ntee_current |
chr | Cleaned NTEE-CC from the latest current vintage; NA if never in current or IRS-NULLed |
ntee_current_subsector |
chr | NTEEv2 subsector for the current code |
ntee_current_nteev2 |
chr | Full NTEEv2 code for the current code |
current_vintage |
chr | YYYY_MM of the latest current vintage observed (may be NA) |
ntee_most_recent |
chr | Cleaned NTEE-CC from the newest vintage carrying a non-null code (any source) |
ntee_most_recent_subsector |
chr | NTEEv2 subsector for the most-recent code |
ntee_most_recent_nteev2 |
chr | Full NTEEv2 code for the most-recent code |
ntee_most_recent_vintage |
chr | YYYY_MM that supplied ntee_most_recent |
ntee_most_recent_source |
chr | current or legacy — pipeline that supplied ntee_most_recent |
ntee_modal |
chr | Most-observed cleaned code across vintages (ties broken by recency) |
ntee_modal_subsector |
chr | NTEEv2 subsector for the modal code |
ntee_modal_nteev2 |
chr | Full NTEEv2 code for the modal code |
ntee_modal_n |
int | Number of vintage-observations supporting the modal code |
ntee_code_distribution |
chr | JSON keyed on cleaned NTEE-CC: {code: {n, first, last}} — full history |
n_distinct_codes |
int | Count of distinct cleaned codes ever observed for the EIN |
n_vintages_with_ntee |
int | Total vintage-observations that carried an NTEE |
ntee_agreement |
chr | single (one observation) / unanimous (all agree) / mixed |
einis the formatted string. In the all-columns intermediate parquet the formatted EIN collides case-insensitively with the raw integerEINand surfaces asein_1; the build resolves the column defensively from the schema.
17.5 How it is built
eval "$(aws configure export-credentials --profile <profile> --format env)"
Rscript scripts/build_ntee_resolved_crosswalk.R- Over every intermediate parquet, project only
(ein, ntee_code_raw)via DuckDBhttpfsand aggregate per(ein, src, raw)with first/last vintage. Parquet column projection means only a small slice of the ~36 GB corpus crosses the wire, so this runs locally; no EC2 needed. - Clean the (small) set of distinct raw codes through
transform_ntee_code()— legacy codes via thelegacy_mode5-char crosswalk, current codes via the standard path — and join the cleaned forms back. - Resolve the four per-EIN views and write
parquet+csvtodata/crosswalks/.
The heavy GROUP BY spills to disk: the script sets a temp_directory, preserve_insertion_order=false, and an env-overridable memory_limit / threads, so it completes on a RAM-limited host (it streams to disk rather than OOM-ing). It also INSTALL/LOADs the DuckDB aws extension, which the S3 credential_chain secret requires.
17.6 Coverage
Built over the 2026-06 corpus (current + legacy intermediate parquets):
| Metric | Value |
|---|---|
| EINs with a resolvable NTEE | 3,613,958 |
| EINs with a usable current NTEE | 1,559,544 |
| EINs with NULL/blank current NTEE | 2,054,414 |
…of which recovered a code via ntee_most_recent |
2,031,485 |
ntee_agreement = unanimous |
2,585,156 (71.5 %) |
ntee_agreement = mixed |
777,236 (21.5 %) |
ntee_agreement = single |
251,566 (7.0 %) |
The bulk of the NULL-current population is legacy-only EINs (never present in a current vintage); ntee_most_recent gives them a classification anchored to their last observed code. The CMU sanity check passes: ntee_current = NA, ntee_most_recent = B43 (UNI), ntee_modal = B43.
17.7 How to use it
Published to s3://nccsdata/crosswalks/ntee-resolved/ (parquet + csv + _manifest.json). Join by ein and pick the column that matches your question:
library(dplyr); library(arrow)
ntee <- read_parquet("ntee_resolved_crosswalk.parquet")
bmf |>
left_join(ntee, by = "ein") |>
# "best available" classification: current if present, else most-recent
mutate(ntee_best = coalesce(ntee_current, ntee_most_recent),
subsector = coalesce(ntee_current_subsector, ntee_most_recent_subsector)) |>
# e.g. all universities ever recognized
filter(subsector == "UNI")- Use
ntee_currentto match exactly what the IRS reports today. - Use
ntee_most_recent(or thecoalesceabove) to recover NULL-source orgs. - Use
ntee_modalfor the org’s most-typical historical classification. - Use
ntee_code_distribution+ntee_agreementto audit reclassifications.
17.8 Maintenance
Rebuild and re-publish (R/publish_ntee_resolved_crosswalk.R, idempotent on sha256) after each new monthly current BMF lands, so ntee_current tracks the newest vintage. The legacy half is static. The manifest records the input prefixes and the sha256 of transform_ntee_code.R and the legacy 5-char lookup, so a change to either is visible in the published provenance.