flowchart TB
classDef node fill:#d2d2d2,stroke:#333,color:#000
subgraph Sources[Data Sources]
IRS[(IRS BMF)]
LOOKUP[(Lookup Tables)]
end
subgraph Transform[Transformation Layer]
T1[Identity Transforms]
T2[Classification Transforms]
T3[Activity Transforms]
T4[Temporal Transforms]
T5[Financial Transforms]
T6[Filing Transforms]
end
subgraph Outputs[Outputs]
FACT[(Fact Table)]
QUALITY[(Quality Report)]
end
IRS --> T1
IRS --> T2
IRS --> T3
IRS --> T4
IRS --> T5
IRS --> T6
LOOKUP --> T2
LOOKUP --> T3
LOOKUP --> T5
LOOKUP --> T6
T1 --> FACT
T2 --> FACT
T3 --> FACT
T4 --> FACT
T5 --> FACT
T6 --> FACT
FACT --> QUALITY
style Sources fill:#fff9e6,stroke:#fdbf11
style Transform fill:#fee39b,stroke:#fdbf11
style Outputs fill:#fdbf11,stroke:#d19c0f
3 Data Lineage
4 Data Flow Overview
This chapter traces data from raw IRS columns through transformations to final output columns.
4.1 High-Level Data Flow
4.2 Source-to-Target Column Mapping
4.2.1 Identity Fields
flowchart LR
classDef node fill:#d2d2d2,stroke:#333,color:#000
subgraph Raw[Raw IRS Columns]
EIN_R[EIN]
NAME_R[NAME]
SORT_R[SORT_NAME]
ICO_R[ICO]
GROUP_R[GROUP]
RULING_R[RULING]
end
subgraph Output[Output Columns]
ein[ein]
ein_raw[ein_raw]
org_name[org_name_display]
org_join[org_name_join]
org_suffix[org_legal_suffix]
org_parent[org_parent_name]
dba[dba_name]
dba_raw[dba_name_raw]
ico[in_care_of_name_clean]
ico_flag[in_care_of_name_provided]
gen[group_exemption_number]
gen_flag[group_exemption_is_member]
ruling[ruling_date]
ruling_flag[ruling_date_is_missing]
end
EIN_R --> ein
EIN_R --> ein_raw
NAME_R --> org_name
NAME_R --> org_join
NAME_R --> org_suffix
NAME_R --> org_parent
SORT_R --> dba
SORT_R --> dba_raw
ICO_R --> ico
ICO_R --> ico_flag
GROUP_R --> gen
GROUP_R --> gen_flag
RULING_R --> ruling
RULING_R --> ruling_flag
style Raw fill:#fff9e6,stroke:#fdbf11
style Output fill:#fdbf11,stroke:#d19c0f
4.2.2 Subsection and Classification Fields
The CLASSIFICATION field contains up to 4 single-digit codes. These are unpivoted into an internal dimension table, joined with lookups, then aggregated back to the main table.
flowchart LR
classDef node fill:#d2d2d2,stroke:#333,color:#000
subgraph Raw[Raw IRS Columns]
SUB[SUBSECTION]
CLASS[CLASSIFICATION]
end
subgraph Process[Processing]
SPLIT[Split into up to 4 codes]
end
subgraph Lookup[Lookup Tables]
L1[(classification_subsection)]
end
subgraph DIM[Internal Dimension Table]
DIM_CL[dim_classification]
end
subgraph Output[Output Columns]
sub_code[subsection_code]
org_type[exempt_organization_type]
class_code[classification_code]
all_class[all_classifications_string]
end
SUB --> L1 --> sub_code
SUB --> L1 --> org_type
CLASS --> SPLIT --> L1 --> DIM_CL
DIM_CL --> class_code
DIM_CL --> all_class
style Raw fill:#fff9e6,stroke:#fdbf11
style Process fill:#fee39b,stroke:#fdbf11
style Lookup fill:#fee39b,stroke:#fdbf11
style DIM fill:#fee39b,stroke:#fdbf11
style Output fill:#fdbf11,stroke:#d19c0f
4.2.3 Other Classification Fields
Simple lookup joins for affiliation, deductibility, foundation, organization, and status codes.
flowchart LR
classDef node fill:#d2d2d2,stroke:#333,color:#000
subgraph Raw[Raw IRS Columns]
AFF[AFFILIATION]
DED[DEDUCTIBILITY]
FND[FOUNDATION]
ORG[ORGANIZATION]
STAT[STATUS]
end
subgraph Lookup[Lookup Tables]
L2[(affiliation_code)]
L3[(deductibility_code)]
L4[(foundation_code)]
L5[(organization_code)]
L6[(status_code)]
end
subgraph Output[Output Columns]
aff_code[affiliation_code + _definition]
ded_code[deductibility_code + _definition]
fnd_code[foundation_code + _definition]
org_code[organization_code + _definition]
stat_code[status_code + _definition]
end
AFF --> L2 --> aff_code
DED --> L3 --> ded_code
FND --> L4 --> fnd_code
ORG --> L5 --> org_code
STAT --> L6 --> stat_code
style Raw fill:#fff9e6,stroke:#fdbf11
style Lookup fill:#fee39b,stroke:#fdbf11
style Output fill:#fdbf11,stroke:#d19c0f
4.2.4 Activity Fields
The ACTIVITY field contains 9 characters (3 activity codes × 3 chars each). These are unpivoted into an internal dimension table, joined with lookups, then aggregated back to the main table as semicolon-separated strings.
flowchart LR
classDef node fill:#d2d2d2,stroke:#333,color:#000
subgraph Raw[Raw IRS Columns]
ACT[ACTIVITY]
end
subgraph Process[Processing]
SPLIT[Split into 3 codes]
end
subgraph Lookup[Lookup Tables]
L1[(activity_code)]
end
subgraph DIM[Internal Dimension Table]
DIM_ACT[dim_activity]
end
subgraph Output[Output Columns]
act_code[activity_code]
act_def[activity_code_definitions]
act_cat[activity_code_categories]
end
ACT --> SPLIT --> L1 --> DIM_ACT
DIM_ACT --> act_code
DIM_ACT --> act_def
DIM_ACT --> act_cat
style Raw fill:#fff9e6,stroke:#fdbf11
style Process fill:#fee39b,stroke:#fdbf11
style Lookup fill:#fee39b,stroke:#fdbf11
style DIM fill:#fee39b,stroke:#fdbf11
style Output fill:#fdbf11,stroke:#d19c0f
4.2.5 NTEE Fields
NTEE codes are normalized and joined with multiple lookup tables to derive classifications and crosswalks.
flowchart LR
classDef node fill:#d2d2d2,stroke:#333,color:#000
subgraph Raw[Raw IRS Columns]
NTEE[NTEE_CD]
end
subgraph Process[Processing]
NORM[Normalize and validate]
end
subgraph Lookup[Lookup Tables]
L2[(ntee_code)]
L3[(ntee_major_group)]
L4[(ntee_activity_type)]
end
subgraph Output[Output Columns]
ntee_clean[ntee_code_clean]
ntee_def[ntee_code_definition]
ntee_major[ntee_code_major_group]
naics[naics_code]
nteev2[nteev2_code]
end
NTEE --> NORM
NORM --> L2 --> ntee_clean
NORM --> L2 --> ntee_def
NORM --> L3 --> ntee_major
NORM --> L4 --> naics
NORM --> L4 --> nteev2
style Raw fill:#fff9e6,stroke:#fdbf11
style Process fill:#fee39b,stroke:#fdbf11
style Lookup fill:#fee39b,stroke:#fdbf11
style Output fill:#fdbf11,stroke:#d19c0f
4.2.6 Address Fields: STREET
flowchart LR
classDef node fill:#d2d2d2,stroke:#333,color:#000
subgraph Raw[Raw IRS Column]
STREET[STREET]
end
subgraph Output[Output Columns]
street_raw[org_addr_street_raw]
street[org_addr_street]
end
subgraph Flags[Quality Flags]
po_box[org_addr_is_po_box]
rural[org_addr_is_rural_route]
special[org_addr_has_special_chars]
no_num[org_addr_missing_number]
end
STREET --> street_raw
STREET --> street
STREET --> po_box
STREET --> rural
STREET --> special
STREET --> no_num
style Raw fill:#fff9e6,stroke:#fdbf11
style Output fill:#fee39b,stroke:#fdbf11
style Flags fill:#fdbf11,stroke:#d19c0f
4.2.7 Address Fields: CITY
flowchart LR
classDef node fill:#d2d2d2,stroke:#333,color:#000
subgraph Raw[Raw IRS Column]
CITY[CITY]
end
subgraph Output[Output Columns]
city_raw[org_addr_city_raw]
city[org_addr_city]
end
CITY --> city_raw
CITY --> city
style Raw fill:#fff9e6,stroke:#fdbf11
style Output fill:#fdbf11,stroke:#d19c0f
4.2.8 Address Fields: STATE
flowchart LR
classDef node fill:#d2d2d2,stroke:#333,color:#000
subgraph Raw[Raw IRS Column]
STATE[STATE]
end
subgraph Output[Output Columns]
state_raw[org_addr_state_raw]
state[org_addr_state]
end
subgraph Flags[Quality Flags]
invalid[org_addr_state_invalid]
end
STATE --> state_raw
STATE --> state
STATE --> invalid
style Raw fill:#fff9e6,stroke:#fdbf11
style Output fill:#fee39b,stroke:#fdbf11
style Flags fill:#fdbf11,stroke:#d19c0f
4.2.9 Address Fields: ZIP
flowchart LR
classDef node fill:#d2d2d2,stroke:#333,color:#000
subgraph Raw[Raw IRS Column]
ZIP[ZIP]
end
subgraph Output[Output Columns]
zip_raw[org_addr_zip_raw]
zip5[org_addr_zip5]
zip4[org_addr_zip4]
zip[org_addr_zip]
end
ZIP --> zip_raw
ZIP --> zip5
ZIP --> zip4
ZIP --> zip
style Raw fill:#fff9e6,stroke:#fdbf11
style Output fill:#fdbf11,stroke:#d19c0f
4.2.10 Address Fields: Derived
These fields are derived from all four address components.
flowchart LR
classDef node fill:#d2d2d2,stroke:#333,color:#000
subgraph Raw[Raw IRS Columns]
STREET[STREET]
CITY[CITY]
STATE[STATE]
ZIP[ZIP]
end
subgraph Output[Output Columns]
full[org_addr_full]
missing[org_addr_is_missing]
end
STREET --> full
CITY --> full
STATE --> full
ZIP --> full
STREET --> missing
CITY --> missing
STATE --> missing
ZIP --> missing
style Raw fill:#fff9e6,stroke:#fdbf11
style Output fill:#fdbf11,stroke:#d19c0f
4.2.11 Temporal Fields
flowchart LR
classDef node fill:#d2d2d2,stroke:#333,color:#000
subgraph Raw[Raw IRS Columns]
TAX[TAX_PERIOD]
ACCT[ACCT_PD]
end
subgraph Output[Output Columns]
tax_str[tax_period_ym_str]
tax_ymd[tax_period_ymd]
tax_miss[tax_period_is_missing]
acct[accounting_period]
end
TAX --> tax_str
TAX --> tax_ymd
TAX --> tax_miss
ACCT --> acct
style Raw fill:#fff9e6,stroke:#fdbf11
style Output fill:#fdbf11,stroke:#d19c0f
4.2.12 Financial Fields
flowchart LR
classDef node fill:#d2d2d2,stroke:#333,color:#000
subgraph Raw[Raw IRS Columns]
ASSET_CD[ASSET_CD]
INCOME_CD[INCOME_CD]
ASSET_AMT[ASSET_AMT]
INCOME_AMT[INCOME_AMT]
REVENUE_AMT[REVENUE_AMT]
end
subgraph Lookup[Lookup Tables]
L1[(asset_code)]
L2[(income_code)]
end
subgraph Output[Output Columns]
asset_code[asset_code]
asset_def[asset_code_definition]
income_code[income_code]
income_def[income_code_definition]
asset_amt[asset_amount]
income_amt[income_amount]
revenue_amt[revenue_amount]
end
ASSET_CD --> L1 --> asset_code
ASSET_CD --> L1 --> asset_def
INCOME_CD --> L2 --> income_code
INCOME_CD --> L2 --> income_def
ASSET_AMT --> asset_amt
INCOME_AMT --> income_amt
REVENUE_AMT --> revenue_amt
style Raw fill:#fff9e6,stroke:#fdbf11
style Lookup fill:#fee39b,stroke:#fdbf11
style Output fill:#fdbf11,stroke:#d19c0f
4.2.13 Filing Fields
flowchart LR
classDef node fill:#d2d2d2,stroke:#333,color:#000
subgraph Raw[Raw IRS Columns]
FILING[FILING_REQ_CD]
PF_FILING[PF_FILING_REQ_CD]
end
subgraph Lookup[Lookup Tables]
L1[(filing_requirement_code)]
L2[(pf_filing_requirement_code)]
end
subgraph Output[Output Columns]
filing_code[filing_requirement_code]
filing_def[filing_requirement_code_definition]
pf_code[pf_filing_requirement_code]
pf_def[pf_filing_requirement_code_definition]
end
FILING --> L1 --> filing_code
FILING --> L1 --> filing_def
PF_FILING --> L2 --> pf_code
PF_FILING --> L2 --> pf_def
style Raw fill:#fff9e6,stroke:#fdbf11
style Lookup fill:#fee39b,stroke:#fdbf11
style Output fill:#fdbf11,stroke:#d19c0f
4.3 Complete Column Lineage Matrix
4.3.1 Raw IRS Columns (31 columns)
| Raw Column | Data Type | Transform | Output Column(s) |
|---|---|---|---|
| EIN | char(9) | transform_ein() |
ein, ein_raw |
| NAME | char | transform_organization_name() |
org_name_raw, org_name_join, org_name_display, org_parent_name, org_legal_suffix |
| ICO | char | transform_bmf_ico_name() |
in_care_of_name_raw, in_care_of_name_clean, in_care_of_name_provided |
| STREET | char | transform_address() |
org_addr_street_raw, org_addr_street |
| CITY | char | transform_address() |
org_addr_city_raw, org_addr_city |
| STATE | char(2) | transform_address() |
org_addr_state_raw, org_addr_state |
| ZIP | char | transform_address() |
org_addr_zip_raw, org_addr_zip5, org_addr_zip4, org_addr_zip |
| (derived) | - | transform_address() |
org_addr_full, quality flags (6 columns) |
| GROUP | int | transform_bmf_group_exemption_number() |
group_exemption_number_raw, group_exemption_number, group_exemption_is_member |
| SUBSECTION | int | transform_bmf_subsection_classification_codes() |
subsection_code, exempt_organization_type |
| AFFILIATION | int | transform_bmf_affiliation_code() |
affiliation_code, affiliation_code_definition |
| CLASSIFICATION | char(4) | transform_bmf_subsection_classification_codes() |
classification_code, all_classifications_string |
| RULING | int(6) | transform_bmf_ruling_date() |
ruling_date_ym_str, ruling_date, ruling_date_is_missing |
| DEDUCTIBILITY | int | transform_bmf_deductibility_code() |
deductibility_code, deductibility_code_definition |
| FOUNDATION | int | transform_bmf_foundation_code() |
foundation_code, foundation_code_definition |
| ACTIVITY | int(9) | transform_bmf_activity_code() |
activity_code, activity_code_definitions, activity_code_categories |
| ORGANIZATION | int | transform_bmf_organization_code() |
organization_code, organization_code_definition |
| STATUS | int | transform_bmf_status_code() |
status_code, status_code_definition |
| TAX_PERIOD | int(6) | transform_tax_period() |
tax_period_ym_str, tax_period_ymd, tax_period_is_missing |
| ASSET_CD | int | transform_bmf_asset_code() |
asset_code, asset_code_definition |
| INCOME_CD | int | transform_bmf_income_code() |
income_code, income_code_definition |
| FILING_REQ_CD | int | transform_bmf_filing_requirement_code() |
filing_requirement_code, filing_requirement_code_definition |
| PF_FILING_REQ_CD | int | transform_bmf_pf_filing_requirement_code() |
pf_filing_requirement_code, pf_filing_requirement_code_definition |
| ACCT_PD | int | transform_accounting_period() |
accounting_period |
| ASSET_AMT | numeric | transform_asset_amount() |
asset_amount |
| INCOME_AMT | numeric | transform_income_amount() |
income_amount |
| REVENUE_AMT | numeric | transform_revenue_amount() |
revenue_amount |
| NTEE_CD | char(4) | transform_ntee_code() |
ntee_code_raw, ntee_code_clean, ntee_code_definition, ntee_code_major_group, ntee_common_code, ntee_common_code_definition, naics_code, nteev2, nteev2_code, nteev2_subsector, nteev2_org_type |
| SORT_NAME | char | transform_dba_name() |
dba_name_raw, dba_name |
| REGION | int | (none) | Removed in processed output |
| RYEAR | int | (none) | Removed in processed output |
| ID | int | (none) | Removed in processed output |
4.3.2 Output Columns by Category
4.3.2.1 Identity Columns
| Column | Type | Source | Description |
|---|---|---|---|
ein |
char(11) | EIN | Employer Identification Number formatted as XX-XXXXXXX |
ein_raw |
char | EIN | Original 9-digit EIN value from source file without formatting |
org_name_raw |
char | NAME | Original organization name exactly as it appears in the source file |
org_name_join |
char | NAME | Standardized name for matching and joining (uppercase punctuation removed) |
org_name_display |
char | NAME | Title-cased organization name suitable for display purposes |
org_legal_suffix |
char | NAME | Legal entity suffix extracted from name (Inc Corp LLC Foundation etc.) |
org_parent_name |
char | NAME | Parent organization name if this is a subordinate/chapter organization |
in_care_of_name_raw |
char | ICO | Original In Care Of field from source file |
in_care_of_name_clean |
char | ICO | Cleaned ICO name with standardized formatting |
in_care_of_name_provided |
bool | ICO | Boolean indicating whether an ICO name was provided |
group_exemption_number_raw |
char | GROUP | Original group exemption number from source file |
group_exemption_number |
char(4) | GROUP | Cleaned group exemption number (GEN) |
group_exemption_is_member |
bool | GROUP | Boolean indicating if organization is a member of a group filing |
ruling_date_ym_str |
char(6) | RULING | Ruling date as YYYYMM string |
ruling_date |
Date | RULING | Date of IRS ruling granting exempt status. If missing assigned 1900-01-01 |
ruling_date_is_missing |
bool | RULING | TRUE if ruling date is missing or invalid |
dba_name_raw |
char | SORT_NAME | Original secondary/DBA name from source file |
dba_name |
char | SORT_NAME | Cleaned Doing Business As name |
4.3.2.2 Address Columns
| Column | Type | Source | Description |
|---|---|---|---|
org_addr_street_raw |
char | STREET | Original street address from source file |
org_addr_street |
char | STREET | Standardized street address with USPS abbreviations |
org_addr_city_raw |
char | CITY | Original city name from source file |
org_addr_city |
char | CITY | Cleaned city name |
org_addr_state_raw |
char | STATE | Original state code from source file |
org_addr_state |
char(2) | STATE | Two-letter state abbreviation |
org_addr_zip_raw |
char | ZIP | Original ZIP code from source file |
org_addr_zip5 |
char(5) | ZIP | 5-digit ZIP code |
org_addr_zip4 |
char(4) | ZIP | 4-digit ZIP code extension (if available) |
org_addr_zip |
char(10) | ZIP | Full ZIP code (5 or 9 digits) |
org_addr_full |
char | derived | Complete formatted address string |
org_addr_is_missing |
bool | derived | TRUE if street address is missing or empty |
org_addr_is_po_box |
bool | STREET | TRUE if address is a P.O. Box |
org_addr_is_rural_route |
bool | STREET | TRUE if address is a rural route |
org_addr_has_special_chars |
bool | STREET | TRUE if address contains unusual special characters |
org_addr_missing_number |
bool | STREET | TRUE if street address lacks a street number |
org_addr_state_invalid |
bool | STATE | TRUE if state code is not a valid US state/territory |
4.3.2.3 Classification Columns
| Column | Type | Source | Description |
|---|---|---|---|
subsection_code |
char | SUBSECTION | IRS subsection code (e.g. 03 for 501(c)(3) 04 for 501(c)(4)) |
classification_code |
char | CLASSIFICATION | IRS classification code indicating organization type within subsection |
exempt_organization_type |
char | SUBSECTION + lookup | Human-readable exempt organization type based on subsection |
all_classifications_string |
char | CLASSIFICATION + lookup | Semicolon-separated list of all classification descriptions |
affiliation_code |
int | AFFILIATION | Code indicating relationship to parent organization (1-9) |
affiliation_code_definition |
char | lookup | Description of affiliation relationship |
deductibility_code |
int | DEDUCTIBILITY | Code indicating deductibility status of contributions (1-4) |
deductibility_code_definition |
char | lookup | Description of contribution deductibility |
foundation_code |
int | FOUNDATION | Foundation status code (00-99) per IRS determination |
foundation_code_definition |
char | lookup | Description of foundation/public charity status |
organization_code |
int | ORGANIZATION | Code for type of organization (corporation trust etc.) |
organization_code_definition |
char | lookup | Description of organization type |
status_code |
int | STATUS | IRS determination status code (01-99) |
status_code_definition |
char | lookup | Description of exempt status |
4.3.2.4 Activity Columns
| Column | Type | Source | Description |
|---|---|---|---|
activity_code |
char | ACTIVITY | Three 3-digit activity codes concatenated (9 characters total) |
activity_code_definitions |
char | ACTIVITY + lookup | Semicolon-separated descriptions of activity codes |
activity_code_categories |
char | ACTIVITY + lookup | Semicolon-separated activity categories |
ntee_code_raw |
char | NTEE_CD | Original NTEE code from source file (1-4 characters) |
ntee_code_clean |
char | NTEE_CD | Standardized 3-character NTEE code |
ntee_code_definition |
char | lookup | Full description of NTEE classification |
ntee_code_major_group |
char | NTEE_CD | NTEE major group letter (A-Z) indicating broad category |
ntee_common_code |
char | NTEE_CD | Common code suffix for 4-character NTEE codes |
ntee_common_code_definition |
char | lookup | Description of common code suffix |
naics_code |
char | lookup | North American Industry Classification System code derived from NTEE |
nteev2 |
char | derived | Full NTEEv2 code in SUBSECTOR-CODE-TYPE format |
nteev2_code |
char | derived | NTEEv2 code portion (3 characters) |
nteev2_subsector |
char | derived | NTEEv2 subsector code (e.g. UNI HOS ART ENV) |
nteev2_org_type |
char | derived | NTEEv2 organization type (RG=Regular AA=Alliance etc.) |
4.3.2.5 Temporal Columns
| Column | Type | Source | Description |
|---|---|---|---|
tax_period_ym_str |
char | TAX_PERIOD | Tax period end date as YYYYMM string |
tax_period_ymd |
Date | TAX_PERIOD | Tax period end date in YYYY-MM-DD format. If missing assigned 1900-01-01 |
tax_period_is_missing |
bool | TAX_PERIOD | TRUE if tax period is missing |
accounting_period |
char | ACCT_PD | Month when organization’s accounting period ends (01-12) |
4.3.2.6 Financial Columns
| Column | Type | Source | Description |
|---|---|---|---|
asset_code |
int | ASSET_CD | Asset amount range code (0-9) |
asset_code_definition |
char | lookup | Description of asset range (e.g. $100000 to $499999) |
income_code |
int | INCOME_CD | Income amount range code (0-9) |
income_code_definition |
char | lookup | Description of income range |
asset_amount |
numeric | ASSET_AMT | Total assets in dollars (most recent return) |
income_amount |
numeric | INCOME_AMT | Total income in dollars (can be negative) |
revenue_amount |
numeric | REVENUE_AMT | Total revenue in dollars (can be negative) |
4.3.2.7 Filing Columns
| Column | Type | Source | Description |
|---|---|---|---|
filing_requirement_code |
int | FILING_REQ_CD | Code indicating required annual return form (0-6) |
filing_requirement_code_definition |
char | lookup | Description of filing requirement (990 990-EZ 990-N etc.) |
pf_filing_requirement_code |
int | PF_FILING_REQ_CD | Private foundation filing requirement code |
pf_filing_requirement_code_definition |
char | lookup | Description of private foundation filing requirement |
4.4 Dimension Table Lineage
4.4.1 Activity Code Dimension Table
flowchart TB
classDef node fill:#d2d2d2,stroke:#333,color:#000
subgraph Input[Input]
ACT[ACTIVITY column]
end
subgraph Split[Split 9-char field]
C1[Code 1]
C2[Code 2]
C3[Code 3]
end
subgraph Melt[Melt to Long Format]
R1[Row 1: ein + code]
R2[Row 2: ein + code]
R3[Row 3: ein + code]
end
subgraph Join[Join with Lookup]
J1[Add definition]
J2[Add definition]
J3[Add definition]
end
subgraph Output[dim_activity Table]
OUT[ein + activity_code + definition + category]
end
ACT --> C1
ACT --> C2
ACT --> C3
C1 --> R1
C2 --> R2
C3 --> R3
R1 --> J1
R2 --> J2
R3 --> J3
J1 --> OUT
J2 --> OUT
J3 --> OUT
style Input fill:#fff9e6,stroke:#fdbf11
style Split fill:#fee39b,stroke:#fdbf11
style Melt fill:#fee39b,stroke:#fdbf11
style Join fill:#fee39b,stroke:#fdbf11
style Output fill:#fdbf11,stroke:#d19c0f
4.4.2 Classification Dimension Table
flowchart TB
classDef node fill:#d2d2d2,stroke:#333,color:#000
subgraph Input[Input]
CLASS[CLASSIFICATION column]
SUB[SUBSECTION column]
end
subgraph Extract[Extract up to 4 codes]
C1[Code 1]
C2[Code 2]
C3[Code 3]
C4[Code 4]
end
subgraph Filter[Remove 0 codes]
F[Valid codes only]
end
subgraph Join[Join with Lookup]
J[Add classification_description]
end
subgraph Output[dim_classification Table]
OUT[ein + subsection_code + cl_code + description]
end
CLASS --> C1
CLASS --> C2
CLASS --> C3
CLASS --> C4
SUB --> Join
C1 --> Filter
C2 --> Filter
C3 --> Filter
C4 --> Filter
Filter --> Join
Join --> Output
style Input fill:#fff9e6,stroke:#fdbf11
style Extract fill:#fee39b,stroke:#fdbf11
style Filter fill:#fee39b,stroke:#fdbf11
style Join fill:#fee39b,stroke:#fdbf11
style Output fill:#fdbf11,stroke:#d19c0f
4.5 Data Type Transformations
| Source Type | Target Type | Transform | Notes |
|---|---|---|---|
| char(9) | char(11) | EIN formatting | Adds hyphen: XX-XXXXXXX |
| char | char | Name cleaning | Title case, suffix extraction |
| int(6) | Date | Date parsing | YYYYMM → YYYY-MM-01 |
| int(9) | multiple | Activity unpivot | One field → 3 codes + descriptions |
| int | int + char | Lookup join | Adds *_definition column |
| numeric | numeric | Amount validation | Validates numeric values, warns on unexpected negatives |
4.6 Validation Rules by Column
| Column | Validation | Action on Failure |
|---|---|---|
| EIN | 9 digits after cleaning | Warning if format mismatch |
| RULING | 6 digits (YYYYMM) | Replace with sentinel date |
| TAX_PERIOD | 6 digits (YYYYMM) | Replace with sentinel date |
| ASSET_AMT | Numeric, no negative | Warning on negative values |
| INCOME_AMT | Numeric | Allow negative (can be loss) |
| Lookup joins | Match rate > 99% | Warning if unmatched > 1% |