flowchart TB
classDef node fill:#d2d2d2,stroke:#333,color:#000
subgraph Monthly["Monthly Pipeline Runs"]
BMF1[("BMF 2024-01")]
BMF2[("BMF 2024-02")]
BMF3[("BMF 2024-03")]
BMFN[("BMF YYYY-MM")]
end
subgraph Post["Post-Processing (After All BMFs)"]
COMPARE["Compare Records<br/>Across Snapshots"]
DETECT["Detect Changes"]
end
subgraph Dims["Dimension Tables"]
DIM1[("dim_org_name")]
DIM2[("dim_org_address")]
DIM3[("dim_ntee")]
end
BMF1 --> COMPARE
BMF2 --> COMPARE
BMF3 --> COMPARE
BMFN --> COMPARE
COMPARE --> DETECT
DETECT --> DIM1
DETECT --> DIM2
DETECT --> DIM3
style Monthly fill:#fff9e6,stroke:#fdbf11
style Post fill:#fee39b,stroke:#fdbf11
style Dims fill:#fdbf11,stroke:#d19c0f
5 Dimension Tables
6 Dimension Table Architecture
The BMF pipeline generates dimension tables that track changes across multiple BMF snapshots over time. These tables follow the SCD Type 2 (Slowly Changing Dimension) pattern from data warehousing, preserving complete historical records.
The dimension tables described in this chapter (dim_org_name, dim_org_address, dim_ntee) are not yet created by this pipeline. Currently, this pipeline processes individual monthly BMF snapshots.
Once all historical BMF files have been processed, dimension table generation will be integrated into the pipeline to track changes across time. The schemas and patterns documented here describe the planned output structure.
6.1 Overview
Dimension tables will be created after all BMF files are processed, not during individual pipeline runs. They compare records across monthly BMF snapshots to detect and track changes in:
- Organization names
- Organization addresses
- NTEE classifications
6.2 SCD Type 2 Pattern
Each dimension table tracks changes using effective date ranges:
| Column | Purpose |
|---|---|
dim_id |
Surrogate primary key (unique per record version) |
ein |
Foreign key to BMF fact table |
effective_start_date |
First BMF date with this attribute value |
effective_end_date |
Last BMF date (NULL if currently active) |
is_current |
TRUE if this is the current active record |
This enables:
- Point-in-time queries: What was org X’s name in 2020?
- Change detection: When did org Y move addresses?
- Trend analysis: How many orgs changed NTEE codes between years?
7 dim_org_name
Tracks changes to organization names over time.
7.1 Schema
| Column | Type | Description |
|---|---|---|
dim_id |
integer | Surrogate primary key |
ein |
character | Employer Identification Number (FK) |
org_name_display |
character | Title-cased organization name suitable for display purposes |
org_name_join |
character | Standardized name for matching and joining (uppercase punctuation removed) |
org_legal_suffix |
character | Legal entity suffix extracted from name (Inc Corp LLC Foundation etc.) |
org_parent_name |
character | Parent organization name if this is a subordinate/chapter organization |
effective_start_date |
date | First BMF date with this name |
effective_end_date |
date | Last BMF date (NULL if current) |
is_current |
logical | TRUE if this is the current record |
7.2 Example Data
| dim_id | ein | org_name_display | effective_start_date | effective_end_date | is_current |
|---|---|---|---|---|---|
| 1 | 12-3456789 | Old Name Foundation | 2020-01-01 | 2023-06-01 | FALSE |
| 2 | 12-3456789 | New Name Foundation | 2023-07-01 | NULL | TRUE |
| 3 | 98-7654321 | Stable Organization Inc | 2018-01-01 | NULL | TRUE |
7.3 Usage Examples
7.3.1 Find organizations that changed names
library(data.table)
library(arrow)
# Planned output location once dimension tables are generated
dim_org_name <- arrow::read_parquet("data/processed/dim_org_name.parquet")
setDT(dim_org_name)
# Organizations with multiple name records (indicates name changes)
name_changes <- dim_org_name[, .N, by = ein][N > 1]
print(paste(nrow(name_changes), "organizations have changed names"))7.3.2 Get current name for all organizations
current_names <- dim_org_name[is_current == TRUE]7.3.3 View name history for a specific organization
dim_org_name[ein == "12-3456789"][order(effective_start_date)]8 dim_org_address
Tracks changes to organization addresses over time.
8.1 Schema
| Column | Type | Description |
|---|---|---|
dim_id |
integer | Surrogate primary key |
ein |
character | Employer Identification Number (FK) |
org_addr_street |
character | Standardized street address with USPS abbreviations |
org_addr_city |
character | Cleaned city name |
org_addr_state |
character | Two-letter state abbreviation |
org_addr_zip5 |
character | 5-digit ZIP code |
org_addr_zip4 |
character | 4-digit ZIP code extension (if available) |
org_addr_full |
character | Complete formatted address string |
effective_start_date |
date | First BMF date at this address |
effective_end_date |
date | Last BMF date (NULL if current) |
is_current |
logical | TRUE if this is the current record |
8.2 Example Data
| dim_id | ein | org_addr_city | org_addr_state | effective_start_date | effective_end_date | is_current |
|---|---|---|---|---|---|---|
| 1 | 12-3456789 | New York | NY | 2019-01-01 | 2022-03-01 | FALSE |
| 2 | 12-3456789 | Los Angeles | CA | 2022-04-01 | NULL | TRUE |
| 3 | 98-7654321 | Chicago | IL | 2015-01-01 | NULL | TRUE |
8.3 Usage Examples
8.3.1 Find organizations that relocated
# Planned output location once dimension tables are generated
dim_org_address <- arrow::read_parquet("data/processed/dim_org_address.parquet")
setDT(dim_org_address)
# Organizations with multiple address records
relocations <- dim_org_address[, .N, by = ein][N > 1]
print(paste(nrow(relocations), "organizations have changed addresses"))8.3.2 Get current address for all organizations
current_addresses <- dim_org_address[is_current == TRUE]8.3.3 Find organizations that moved to a specific state
# Organizations that moved TO California
moved_to_ca <- dim_org_address[
org_addr_state == "CA" & is_current == TRUE
][
ein %in% dim_org_address[org_addr_state != "CA", ein]
]8.3.4 Track address history for an organization
dim_org_address[ein == "12-3456789"][order(effective_start_date)]9 dim_ntee
Tracks changes to NTEE classification codes over time.
9.1 Schema
| Column | Type | Description |
|---|---|---|
dim_id |
integer | Surrogate primary key |
ein |
character | Employer Identification Number (FK) |
ntee_code_clean |
character | Standardized 3-character NTEE code |
ntee_code_definition |
character | Full description of NTEE classification |
ntee_code_major_group |
character | NTEE major group letter (A-Z) indicating broad category |
naics_code |
character | North American Industry Classification System code derived from NTEE |
nteev2_code |
character | NTEEv2 code portion (3 characters) |
nteev2_subsector |
character | NTEEv2 subsector code (e.g. UNI HOS ART ENV) |
nteev2_org_type |
character | NTEEv2 organization type (RG=Regular AA=Alliance etc.) |
effective_start_date |
date | First BMF with this NTEE code |
effective_end_date |
date | Last BMF date (NULL if current) |
is_current |
logical | TRUE if this is the current record |
9.2 Example Data
| dim_id | ein | ntee_code_clean | ntee_code_major_group | effective_start_date | effective_end_date | is_current |
|---|---|---|---|---|---|---|
| 1 | 12-3456789 | B20 | B | 2018-01-01 | 2021-12-01 | FALSE |
| 2 | 12-3456789 | E20 | E | 2022-01-01 | NULL | TRUE |
| 3 | 98-7654321 | A20 | A | 2016-01-01 | NULL | TRUE |
9.3 Usage Examples
9.3.1 Find organizations that changed NTEE codes
# Planned output location once dimension tables are generated
dim_ntee <- arrow::read_parquet("data/processed/dim_ntee.parquet")
setDT(dim_ntee)
# Organizations with NTEE code changes
ntee_changes <- dim_ntee[, .N, by = ein][N > 1]
print(paste(nrow(ntee_changes), "organizations have changed NTEE codes"))9.3.2 Analyze major group changes
# Find orgs that changed major groups (e.g., from Education to Health)
major_group_changes <- dim_ntee[, .(
n_major_groups = uniqueN(ntee_code_major_group)
), by = ein][n_major_groups > 1]9.3.3 Get current NTEE classification for all organizations
current_ntee <- dim_ntee[is_current == TRUE]9.3.4 Track NTEE history for an organization
dim_ntee[ein == "12-3456789"][order(effective_start_date)]10 Entity Relationship Diagram
erDiagram
BMF_FACT ||--o{ DIM_ORG_NAME : "ein"
BMF_FACT ||--o{ DIM_ORG_ADDRESS : "ein"
BMF_FACT ||--o{ DIM_NTEE : "ein"
BMF_FACT {
string ein PK "XX-XXXXXXX"
string org_name_display
string org_addr_full
string ntee_code_clean
string subsection_code
date ruling_date
numeric asset_amount
numeric income_amount
string activity_code_definitions "aggregated"
string all_classifications_string "aggregated"
}
DIM_ORG_NAME {
int dim_id PK
string ein FK
string org_name_display
string org_name_join
string org_legal_suffix
string org_parent_name
date effective_start_date
date effective_end_date
boolean is_current
}
DIM_ORG_ADDRESS {
int dim_id PK
string ein FK
string org_addr_street
string org_addr_city
string org_addr_state
string org_addr_zip5
string org_addr_full
date effective_start_date
date effective_end_date
boolean is_current
}
DIM_NTEE {
int dim_id PK
string ein FK
string ntee_code_clean
string ntee_code_definition
string ntee_code_major_group
string naics_code
string nteev2_code
date effective_start_date
date effective_end_date
boolean is_current
}
11 Note: Activity and Classification Codes
Activity codes and classification codes are not stored in separate dimension tables. Instead, they are aggregated into semicolon-separated strings in the main BMF fact table:
| Fact Table Column | Description |
|---|---|
activity_code |
Three 3-digit activity codes concatenated (9 characters total) |
activity_code_definitions |
Semicolon-separated descriptions of activity codes |
activity_code_categories |
Semicolon-separated activity categories |
all_classifications_string |
Semicolon-separated list of all classification descriptions |
This approach was chosen because:
- Activity codes rarely change - Organizations typically maintain the same activity codes over time
- Classification codes are subsection-specific - Their meaning depends on the subsection code, making a separate dimension table less useful
- Aggregated strings simplify queries - Most use cases need all codes together, not individual lookups
For detailed activity or classification analysis, the aggregated strings can be split:
library(data.table)
bmf <- arrow::read_parquet("data/processed/bmf_2025_01_processed.parquet")
setDT(bmf)
# Split activity definitions into separate rows
activity_long <- bmf[, .(
activity_def = unlist(strsplit(activity_code_definitions, "; "))
), by = ein]
# Count organizations by activity
activity_long[, .N, by = activity_def][order(-N)]