5  Dimension Tables

6 Dimension Table Architecture

The BMF pipeline generates dimension tables that track changes across multiple BMF snapshots over time. These tables follow the SCD Type 2 (Slowly Changing Dimension) pattern from data warehousing, preserving complete historical records.

NotePlanned Functionality

The dimension tables described in this chapter (dim_org_name, dim_org_address, dim_ntee) are not yet created by this pipeline. Currently, this pipeline processes individual monthly BMF snapshots.

Once all historical BMF files have been processed, dimension table generation will be integrated into the pipeline to track changes across time. The schemas and patterns documented here describe the planned output structure.

6.1 Overview

Dimension tables will be created after all BMF files are processed, not during individual pipeline runs. They compare records across monthly BMF snapshots to detect and track changes in:

  • Organization names
  • Organization addresses
  • NTEE classifications

flowchart TB
    classDef node fill:#d2d2d2,stroke:#333,color:#000

    subgraph Monthly["Monthly Pipeline Runs"]
        BMF1[("BMF 2024-01")]
        BMF2[("BMF 2024-02")]
        BMF3[("BMF 2024-03")]
        BMFN[("BMF YYYY-MM")]
    end

    subgraph Post["Post-Processing (After All BMFs)"]
        COMPARE["Compare Records<br/>Across Snapshots"]
        DETECT["Detect Changes"]
    end

    subgraph Dims["Dimension Tables"]
        DIM1[("dim_org_name")]
        DIM2[("dim_org_address")]
        DIM3[("dim_ntee")]
    end

    BMF1 --> COMPARE
    BMF2 --> COMPARE
    BMF3 --> COMPARE
    BMFN --> COMPARE
    COMPARE --> DETECT
    DETECT --> DIM1
    DETECT --> DIM2
    DETECT --> DIM3

    style Monthly fill:#fff9e6,stroke:#fdbf11
    style Post fill:#fee39b,stroke:#fdbf11
    style Dims fill:#fdbf11,stroke:#d19c0f

6.2 SCD Type 2 Pattern

Each dimension table tracks changes using effective date ranges:

Column Purpose
dim_id Surrogate primary key (unique per record version)
ein Foreign key to BMF fact table
effective_start_date First BMF date with this attribute value
effective_end_date Last BMF date (NULL if currently active)
is_current TRUE if this is the current active record

This enables:

  • Point-in-time queries: What was org X’s name in 2020?
  • Change detection: When did org Y move addresses?
  • Trend analysis: How many orgs changed NTEE codes between years?

7 dim_org_name

Tracks changes to organization names over time.

7.1 Schema

Column Type Description
dim_id integer Surrogate primary key
ein character Employer Identification Number (FK)
org_name_display character Title-cased organization name suitable for display purposes
org_name_join character Standardized name for matching and joining (uppercase punctuation removed)
org_legal_suffix character Legal entity suffix extracted from name (Inc Corp LLC Foundation etc.)
org_parent_name character Parent organization name if this is a subordinate/chapter organization
effective_start_date date First BMF date with this name
effective_end_date date Last BMF date (NULL if current)
is_current logical TRUE if this is the current record

7.2 Example Data

dim_id ein org_name_display effective_start_date effective_end_date is_current
1 12-3456789 Old Name Foundation 2020-01-01 2023-06-01 FALSE
2 12-3456789 New Name Foundation 2023-07-01 NULL TRUE
3 98-7654321 Stable Organization Inc 2018-01-01 NULL TRUE

7.3 Usage Examples

7.3.1 Find organizations that changed names

library(data.table)
library(arrow)

# Planned output location once dimension tables are generated
dim_org_name <- arrow::read_parquet("data/processed/dim_org_name.parquet")
setDT(dim_org_name)

# Organizations with multiple name records (indicates name changes)
name_changes <- dim_org_name[, .N, by = ein][N > 1]
print(paste(nrow(name_changes), "organizations have changed names"))

7.3.2 Get current name for all organizations

current_names <- dim_org_name[is_current == TRUE]

7.3.3 View name history for a specific organization

dim_org_name[ein == "12-3456789"][order(effective_start_date)]

8 dim_org_address

Tracks changes to organization addresses over time.

8.1 Schema

Column Type Description
dim_id integer Surrogate primary key
ein character Employer Identification Number (FK)
org_addr_street character Standardized street address with USPS abbreviations
org_addr_city character Cleaned city name
org_addr_state character Two-letter state abbreviation
org_addr_zip5 character 5-digit ZIP code
org_addr_zip4 character 4-digit ZIP code extension (if available)
org_addr_full character Complete formatted address string
effective_start_date date First BMF date at this address
effective_end_date date Last BMF date (NULL if current)
is_current logical TRUE if this is the current record

8.2 Example Data

dim_id ein org_addr_city org_addr_state effective_start_date effective_end_date is_current
1 12-3456789 New York NY 2019-01-01 2022-03-01 FALSE
2 12-3456789 Los Angeles CA 2022-04-01 NULL TRUE
3 98-7654321 Chicago IL 2015-01-01 NULL TRUE

8.3 Usage Examples

8.3.1 Find organizations that relocated

# Planned output location once dimension tables are generated
dim_org_address <- arrow::read_parquet("data/processed/dim_org_address.parquet")
setDT(dim_org_address)

# Organizations with multiple address records
relocations <- dim_org_address[, .N, by = ein][N > 1]
print(paste(nrow(relocations), "organizations have changed addresses"))

8.3.2 Get current address for all organizations

current_addresses <- dim_org_address[is_current == TRUE]

8.3.3 Find organizations that moved to a specific state

# Organizations that moved TO California
moved_to_ca <- dim_org_address[
  org_addr_state == "CA" & is_current == TRUE
][
  ein %in% dim_org_address[org_addr_state != "CA", ein]
]

8.3.4 Track address history for an organization

dim_org_address[ein == "12-3456789"][order(effective_start_date)]

9 dim_ntee

Tracks changes to NTEE classification codes over time.

9.1 Schema

Column Type Description
dim_id integer Surrogate primary key
ein character Employer Identification Number (FK)
ntee_code_clean character Standardized 3-character NTEE code
ntee_code_definition character Full description of NTEE classification
ntee_code_major_group character NTEE major group letter (A-Z) indicating broad category
naics_code character North American Industry Classification System code derived from NTEE
nteev2_code character NTEEv2 code portion (3 characters)
nteev2_subsector character NTEEv2 subsector code (e.g. UNI HOS ART ENV)
nteev2_org_type character NTEEv2 organization type (RG=Regular AA=Alliance etc.)
effective_start_date date First BMF with this NTEE code
effective_end_date date Last BMF date (NULL if current)
is_current logical TRUE if this is the current record

9.2 Example Data

dim_id ein ntee_code_clean ntee_code_major_group effective_start_date effective_end_date is_current
1 12-3456789 B20 B 2018-01-01 2021-12-01 FALSE
2 12-3456789 E20 E 2022-01-01 NULL TRUE
3 98-7654321 A20 A 2016-01-01 NULL TRUE

9.3 Usage Examples

9.3.1 Find organizations that changed NTEE codes

# Planned output location once dimension tables are generated
dim_ntee <- arrow::read_parquet("data/processed/dim_ntee.parquet")
setDT(dim_ntee)

# Organizations with NTEE code changes
ntee_changes <- dim_ntee[, .N, by = ein][N > 1]
print(paste(nrow(ntee_changes), "organizations have changed NTEE codes"))

9.3.2 Analyze major group changes

# Find orgs that changed major groups (e.g., from Education to Health)
major_group_changes <- dim_ntee[, .(
  n_major_groups = uniqueN(ntee_code_major_group)
), by = ein][n_major_groups > 1]

9.3.3 Get current NTEE classification for all organizations

current_ntee <- dim_ntee[is_current == TRUE]

9.3.4 Track NTEE history for an organization

dim_ntee[ein == "12-3456789"][order(effective_start_date)]

10 Entity Relationship Diagram

erDiagram
    BMF_FACT ||--o{ DIM_ORG_NAME : "ein"
    BMF_FACT ||--o{ DIM_ORG_ADDRESS : "ein"
    BMF_FACT ||--o{ DIM_NTEE : "ein"

    BMF_FACT {
        string ein PK "XX-XXXXXXX"
        string org_name_display
        string org_addr_full
        string ntee_code_clean
        string subsection_code
        date ruling_date
        numeric asset_amount
        numeric income_amount
        string activity_code_definitions "aggregated"
        string all_classifications_string "aggregated"
    }

    DIM_ORG_NAME {
        int dim_id PK
        string ein FK
        string org_name_display
        string org_name_join
        string org_legal_suffix
        string org_parent_name
        date effective_start_date
        date effective_end_date
        boolean is_current
    }

    DIM_ORG_ADDRESS {
        int dim_id PK
        string ein FK
        string org_addr_street
        string org_addr_city
        string org_addr_state
        string org_addr_zip5
        string org_addr_full
        date effective_start_date
        date effective_end_date
        boolean is_current
    }

    DIM_NTEE {
        int dim_id PK
        string ein FK
        string ntee_code_clean
        string ntee_code_definition
        string ntee_code_major_group
        string naics_code
        string nteev2_code
        date effective_start_date
        date effective_end_date
        boolean is_current
    }


11 Note: Activity and Classification Codes

Activity codes and classification codes are not stored in separate dimension tables. Instead, they are aggregated into semicolon-separated strings in the main BMF fact table:

Fact Table Column Description
activity_code Three 3-digit activity codes concatenated (9 characters total)
activity_code_definitions Semicolon-separated descriptions of activity codes
activity_code_categories Semicolon-separated activity categories
all_classifications_string Semicolon-separated list of all classification descriptions

This approach was chosen because:

  1. Activity codes rarely change - Organizations typically maintain the same activity codes over time
  2. Classification codes are subsection-specific - Their meaning depends on the subsection code, making a separate dimension table less useful
  3. Aggregated strings simplify queries - Most use cases need all codes together, not individual lookups

For detailed activity or classification analysis, the aggregated strings can be split:

library(data.table)

bmf <- arrow::read_parquet("data/processed/bmf_2025_01_processed.parquet")
setDT(bmf)

# Split activity definitions into separate rows
activity_long <- bmf[, .(
  activity_def = unlist(strsplit(activity_code_definitions, "; "))
), by = ein]

# Count organizations by activity
activity_long[, .N, by = activity_def][order(-N)]