5 Dimension Tables

6 Dimension Table Architecture

The BMF pipeline generates dimension tables that track changes across multiple BMF snapshots over time. These tables follow the SCD Type 2 (Slowly Changing Dimension) pattern from data warehousing, preserving complete historical records.

Planned Functionality

The dimension tables described in this chapter (dim_org_name, dim_org_address, dim_ntee) are not yet created by this pipeline. Currently, this pipeline processes individual monthly BMF snapshots.

Once all historical BMF files have been processed, dimension table generation will be integrated into the pipeline to track changes across time. The schemas and patterns documented here describe the planned output structure.

6.1 Overview

Dimension tables will be created after all BMF files are processed, not during individual pipeline runs. They compare records across monthly BMF snapshots to detect and track changes in:

Organization names
Organization addresses
NTEE classifications

flowchart TB
    classDef node fill:#d2d2d2,stroke:#333,color:#000

    subgraph Monthly["Monthly Pipeline Runs"]
        BMF1[("BMF 2024-01")]
        BMF2[("BMF 2024-02")]
        BMF3[("BMF 2024-03")]
        BMFN[("BMF YYYY-MM")]
    end

    subgraph Post["Post-Processing (After All BMFs)"]
        COMPARE["Compare Records<br/>Across Snapshots"]
        DETECT["Detect Changes"]
    end

    subgraph Dims["Dimension Tables"]
        DIM1[("dim_org_name")]
        DIM2[("dim_org_address")]
        DIM3[("dim_ntee")]
    end

    BMF1 --> COMPARE
    BMF2 --> COMPARE
    BMF3 --> COMPARE
    BMFN --> COMPARE
    COMPARE --> DETECT
    DETECT --> DIM1
    DETECT --> DIM2
    DETECT --> DIM3

    style Monthly fill:#fff9e6,stroke:#fdbf11
    style Post fill:#fee39b,stroke:#fdbf11
    style Dims fill:#fdbf11,stroke:#d19c0f

6.2 SCD Type 2 Pattern

Each dimension table tracks changes using effective date ranges:

Column	Purpose
`dim_id`	Surrogate primary key (unique per record version)
`ein`	Foreign key to BMF fact table
`effective_start_date`	First BMF date with this attribute value
`effective_end_date`	Last BMF date (NULL if currently active)
`is_current`	TRUE if this is the current active record

This enables:

Point-in-time queries: What was org X’s name in 2020?
Change detection: When did org Y move addresses?
Trend analysis: How many orgs changed NTEE codes between years?

7 dim_org_name

Tracks changes to organization names over time.

7.1 Schema

Column	Type	Description
`dim_id`	integer	Surrogate primary key
`ein`	character	Employer Identification Number (FK)
`org_name_display`	character	Title-cased organization name suitable for display purposes
`org_name_join`	character	Standardized name for matching and joining (uppercase punctuation removed)
`org_legal_suffix`	character	Legal entity suffix extracted from name (Inc Corp LLC Foundation etc.)
`org_parent_name`	character	Parent organization name if this is a subordinate/chapter organization
`effective_start_date`	date	First BMF date with this name
`effective_end_date`	date	Last BMF date (NULL if current)
`is_current`	logical	TRUE if this is the current record

7.2 Example Data

dim_id	ein	org_name_display	effective_start_date	effective_end_date	is_current
1	12-3456789	Old Name Foundation	2020-01-01	2023-06-01	FALSE
2	12-3456789	New Name Foundation	2023-07-01	NULL	TRUE
3	98-7654321	Stable Organization Inc	2018-01-01	NULL	TRUE

7.3 Usage Examples

7.3.1 Find organizations that changed names

library(data.table)
library(arrow)

# Planned output location once dimension tables are generated
dim_org_name <- arrow::read_parquet("data/processed/dim_org_name.parquet")
setDT(dim_org_name)

# Organizations with multiple name records (indicates name changes)
name_changes <- dim_org_name[, .N, by = ein][N > 1]
print(paste(nrow(name_changes), "organizations have changed names"))

7.3.2 Get current name for all organizations

current_names <- dim_org_name[is_current == TRUE]

7.3.3 View name history for a specific organization

dim_org_name[ein == "12-3456789"][order(effective_start_date)]

8 dim_org_address

Tracks changes to organization addresses over time.

8.1 Schema

Column	Type	Description
`dim_id`	integer	Surrogate primary key
`ein`	character	Employer Identification Number (FK)
`org_addr_street`	character	Standardized street address with USPS abbreviations
`org_addr_city`	character	Cleaned city name
`org_addr_state`	character	Two-letter state abbreviation
`org_addr_zip5`	character	5-digit ZIP code
`org_addr_zip4`	character	4-digit ZIP code extension (if available)
`org_addr_full`	character	Complete formatted address string
`effective_start_date`	date	First BMF date at this address
`effective_end_date`	date	Last BMF date (NULL if current)
`is_current`	logical	TRUE if this is the current record

8.2 Example Data

dim_id	ein	org_addr_city	org_addr_state	effective_start_date	effective_end_date	is_current
1	12-3456789	New York	NY	2019-01-01	2022-03-01	FALSE
2	12-3456789	Los Angeles	CA	2022-04-01	NULL	TRUE
3	98-7654321	Chicago	IL	2015-01-01	NULL	TRUE

8.3 Usage Examples

8.3.1 Find organizations that relocated

# Planned output location once dimension tables are generated
dim_org_address <- arrow::read_parquet("data/processed/dim_org_address.parquet")
setDT(dim_org_address)

# Organizations with multiple address records
relocations <- dim_org_address[, .N, by = ein][N > 1]
print(paste(nrow(relocations), "organizations have changed addresses"))

8.3.2 Get current address for all organizations

current_addresses <- dim_org_address[is_current == TRUE]

8.3.3 Find organizations that moved to a specific state

# Organizations that moved TO California
moved_to_ca <- dim_org_address[
  org_addr_state == "CA" & is_current == TRUE
][
  ein %in% dim_org_address[org_addr_state != "CA", ein]
]

8.3.4 Track address history for an organization

dim_org_address[ein == "12-3456789"][order(effective_start_date)]

9 dim_ntee

Tracks changes to NTEE classification codes over time.

9.1 Schema

Column	Type	Description
`dim_id`	integer	Surrogate primary key
`ein`	character	Employer Identification Number (FK)
`ntee_code_clean`	character	Standardized 3-character NTEE code
`ntee_code_definition`	character	Full description of NTEE classification
`ntee_code_major_group`	character	NTEE major group letter (A-Z) indicating broad category
`naics_code`	character	North American Industry Classification System code derived from NTEE
`nteev2_code`	character	NTEEv2 code portion (3 characters)
`nteev2_subsector`	character	NTEEv2 subsector code (e.g. UNI HOS ART ENV)
`nteev2_org_type`	character	NTEEv2 organization type (RG=Regular AA=Alliance etc.)
`effective_start_date`	date	First BMF with this NTEE code
`effective_end_date`	date	Last BMF date (NULL if current)
`is_current`	logical	TRUE if this is the current record

9.2 Example Data

dim_id	ein	ntee_code_clean	ntee_code_major_group	effective_start_date	effective_end_date	is_current
1	12-3456789	B20	B	2018-01-01	2021-12-01	FALSE
2	12-3456789	E20	E	2022-01-01	NULL	TRUE
3	98-7654321	A20	A	2016-01-01	NULL	TRUE

9.3 Usage Examples

9.3.1 Find organizations that changed NTEE codes

# Planned output location once dimension tables are generated
dim_ntee <- arrow::read_parquet("data/processed/dim_ntee.parquet")
setDT(dim_ntee)

# Organizations with NTEE code changes
ntee_changes <- dim_ntee[, .N, by = ein][N > 1]
print(paste(nrow(ntee_changes), "organizations have changed NTEE codes"))

9.3.2 Analyze major group changes

# Find orgs that changed major groups (e.g., from Education to Health)
major_group_changes <- dim_ntee[, .(
  n_major_groups = uniqueN(ntee_code_major_group)
), by = ein][n_major_groups > 1]

9.3.3 Get current NTEE classification for all organizations

current_ntee <- dim_ntee[is_current == TRUE]

9.3.4 Track NTEE history for an organization

dim_ntee[ein == "12-3456789"][order(effective_start_date)]

10 Entity Relationship Diagram

erDiagram
    BMF_FACT ||--o{ DIM_ORG_NAME : "ein"
    BMF_FACT ||--o{ DIM_ORG_ADDRESS : "ein"
    BMF_FACT ||--o{ DIM_NTEE : "ein"

    BMF_FACT {
        string ein PK "XX-XXXXXXX"
        string org_name_display
        string org_addr_full
        string ntee_code_clean
        string subsection_code
        date ruling_date
        numeric asset_amount
        numeric income_amount
        string activity_code_definitions "aggregated"
        string all_classifications_string "aggregated"
    }

    DIM_ORG_NAME {
        int dim_id PK
        string ein FK
        string org_name_display
        string org_name_join
        string org_legal_suffix
        string org_parent_name
        date effective_start_date
        date effective_end_date
        boolean is_current
    }

    DIM_ORG_ADDRESS {
        int dim_id PK
        string ein FK
        string org_addr_street
        string org_addr_city
        string org_addr_state
        string org_addr_zip5
        string org_addr_full
        date effective_start_date
        date effective_end_date
        boolean is_current
    }

    DIM_NTEE {
        int dim_id PK
        string ein FK
        string ntee_code_clean
        string ntee_code_definition
        string ntee_code_major_group
        string naics_code
        string nteev2_code
        date effective_start_date
        date effective_end_date
        boolean is_current
    }

11 Note: Activity and Classification Codes

Activity codes and classification codes are not stored in separate dimension tables. Instead, they are aggregated into semicolon-separated strings in the main BMF fact table:

Fact Table Column	Description
`activity_code`	Three 3-digit activity codes concatenated (9 characters total)
`activity_code_definitions`	Semicolon-separated descriptions of activity codes
`activity_code_categories`	Semicolon-separated activity categories
`all_classifications_string`	Semicolon-separated list of all classification descriptions

This approach was chosen because:

Activity codes rarely change - Organizations typically maintain the same activity codes over time
Classification codes are subsection-specific - Their meaning depends on the subsection code, making a separate dimension table less useful
Aggregated strings simplify queries - Most use cases need all codes together, not individual lookups

For detailed activity or classification analysis, the aggregated strings can be split:

library(data.table)

bmf <- arrow::read_parquet("data/processed/bmf_2025_01_processed.parquet")
setDT(bmf)

# Split activity definitions into separate rows
activity_long <- bmf[, .(
  activity_def = unlist(strsplit(activity_code_definitions, "; "))
), by = ein]

# Count organizations by activity
activity_long[, .N, by = activity_def][order(-N)]