3  Data Lineage

4 Data Flow Overview

This chapter traces data from raw IRS columns through transformations to final output columns.

4.1 High-Level Data Flow

flowchart TB
    classDef node fill:#d2d2d2,stroke:#333,color:#000

    subgraph Sources[Data Sources]
        IRS[(IRS BMF)]
        LOOKUP[(Lookup Tables)]
    end

    subgraph Transform[Transformation Layer]
        T1[Identity Transforms]
        T2[Classification Transforms]
        T3[Activity Transforms]
        T4[Temporal Transforms]
        T5[Financial Transforms]
        T6[Filing Transforms]
    end

    subgraph Outputs[Outputs]
        FACT[(Fact Table)]
        QUALITY[(Quality Report)]
    end

    IRS --> T1
    IRS --> T2
    IRS --> T3
    IRS --> T4
    IRS --> T5
    IRS --> T6

    LOOKUP --> T2
    LOOKUP --> T3
    LOOKUP --> T5
    LOOKUP --> T6

    T1 --> FACT
    T2 --> FACT
    T3 --> FACT
    T4 --> FACT
    T5 --> FACT
    T6 --> FACT
    FACT --> QUALITY

    style Sources fill:#fff9e6,stroke:#fdbf11
    style Transform fill:#fee39b,stroke:#fdbf11
    style Outputs fill:#fdbf11,stroke:#d19c0f

4.2 Source-to-Target Column Mapping

4.2.1 Identity Fields

flowchart LR
    classDef node fill:#d2d2d2,stroke:#333,color:#000

    subgraph Raw[Raw IRS Columns]
        EIN_R[EIN]
        NAME_R[NAME]
        SORT_R[SORT_NAME]
        ICO_R[ICO]
        GROUP_R[GROUP]
        RULING_R[RULING]
    end

    subgraph Output[Output Columns]
        ein[ein]
        ein_raw[ein_raw]
        org_name[org_name_display]
        org_join[org_name_join]
        org_suffix[org_legal_suffix]
        org_parent[org_parent_name]
        dba[dba_name]
        dba_raw[dba_name_raw]
        ico[in_care_of_name_clean]
        ico_flag[in_care_of_name_provided]
        gen[group_exemption_number]
        gen_flag[group_exemption_is_member]
        ruling[ruling_date]
        ruling_flag[ruling_date_is_missing]
    end

    EIN_R --> ein
    EIN_R --> ein_raw
    NAME_R --> org_name
    NAME_R --> org_join
    NAME_R --> org_suffix
    NAME_R --> org_parent
    SORT_R --> dba
    SORT_R --> dba_raw
    ICO_R --> ico
    ICO_R --> ico_flag
    GROUP_R --> gen
    GROUP_R --> gen_flag
    RULING_R --> ruling
    RULING_R --> ruling_flag

    style Raw fill:#fff9e6,stroke:#fdbf11
    style Output fill:#fdbf11,stroke:#d19c0f

4.2.2 Subsection and Classification Fields

The CLASSIFICATION field contains up to 4 single-digit codes. These are unpivoted into an internal dimension table, joined with lookups, then aggregated back to the main table.

flowchart LR
    classDef node fill:#d2d2d2,stroke:#333,color:#000

    subgraph Raw[Raw IRS Columns]
        SUB[SUBSECTION]
        CLASS[CLASSIFICATION]
    end

    subgraph Process[Processing]
        SPLIT[Split into up to 4 codes]
    end

    subgraph Lookup[Lookup Tables]
        L1[(classification_subsection)]
    end

    subgraph DIM[Internal Dimension Table]
        DIM_CL[dim_classification]
    end

    subgraph Output[Output Columns]
        sub_code[subsection_code]
        org_type[exempt_organization_type]
        class_code[classification_code]
        all_class[all_classifications_string]
    end

    SUB --> L1 --> sub_code
    SUB --> L1 --> org_type
    CLASS --> SPLIT --> L1 --> DIM_CL
    DIM_CL --> class_code
    DIM_CL --> all_class

    style Raw fill:#fff9e6,stroke:#fdbf11
    style Process fill:#fee39b,stroke:#fdbf11
    style Lookup fill:#fee39b,stroke:#fdbf11
    style DIM fill:#fee39b,stroke:#fdbf11
    style Output fill:#fdbf11,stroke:#d19c0f

4.2.3 Other Classification Fields

Simple lookup joins for affiliation, deductibility, foundation, organization, and status codes.

flowchart LR
    classDef node fill:#d2d2d2,stroke:#333,color:#000

    subgraph Raw[Raw IRS Columns]
        AFF[AFFILIATION]
        DED[DEDUCTIBILITY]
        FND[FOUNDATION]
        ORG[ORGANIZATION]
        STAT[STATUS]
    end

    subgraph Lookup[Lookup Tables]
        L2[(affiliation_code)]
        L3[(deductibility_code)]
        L4[(foundation_code)]
        L5[(organization_code)]
        L6[(status_code)]
    end

    subgraph Output[Output Columns]
        aff_code[affiliation_code + _definition]
        ded_code[deductibility_code + _definition]
        fnd_code[foundation_code + _definition]
        org_code[organization_code + _definition]
        stat_code[status_code + _definition]
    end

    AFF --> L2 --> aff_code
    DED --> L3 --> ded_code
    FND --> L4 --> fnd_code
    ORG --> L5 --> org_code
    STAT --> L6 --> stat_code

    style Raw fill:#fff9e6,stroke:#fdbf11
    style Lookup fill:#fee39b,stroke:#fdbf11
    style Output fill:#fdbf11,stroke:#d19c0f

4.2.4 Activity Fields

The ACTIVITY field contains 9 characters (3 activity codes × 3 chars each). These are unpivoted into an internal dimension table, joined with lookups, then aggregated back to the main table as semicolon-separated strings.

flowchart LR
    classDef node fill:#d2d2d2,stroke:#333,color:#000

    subgraph Raw[Raw IRS Columns]
        ACT[ACTIVITY]
    end

    subgraph Process[Processing]
        SPLIT[Split into 3 codes]
    end

    subgraph Lookup[Lookup Tables]
        L1[(activity_code)]
    end

    subgraph DIM[Internal Dimension Table]
        DIM_ACT[dim_activity]
    end

    subgraph Output[Output Columns]
        act_code[activity_code]
        act_def[activity_code_definitions]
        act_cat[activity_code_categories]
    end

    ACT --> SPLIT --> L1 --> DIM_ACT
    DIM_ACT --> act_code
    DIM_ACT --> act_def
    DIM_ACT --> act_cat

    style Raw fill:#fff9e6,stroke:#fdbf11
    style Process fill:#fee39b,stroke:#fdbf11
    style Lookup fill:#fee39b,stroke:#fdbf11
    style DIM fill:#fee39b,stroke:#fdbf11
    style Output fill:#fdbf11,stroke:#d19c0f

4.2.5 NTEE Fields

NTEE codes are normalized and joined with multiple lookup tables to derive classifications and crosswalks.

flowchart LR
    classDef node fill:#d2d2d2,stroke:#333,color:#000

    subgraph Raw[Raw IRS Columns]
        NTEE[NTEE_CD]
    end

    subgraph Process[Processing]
        NORM[Normalize and validate]
    end

    subgraph Lookup[Lookup Tables]
        L2[(ntee_code)]
        L3[(ntee_major_group)]
        L4[(ntee_activity_type)]
    end

    subgraph Output[Output Columns]
        ntee_clean[ntee_code_clean]
        ntee_def[ntee_code_definition]
        ntee_major[ntee_code_major_group]
        naics[naics_code]
        nteev2[nteev2_code]
    end

    NTEE --> NORM
    NORM --> L2 --> ntee_clean
    NORM --> L2 --> ntee_def
    NORM --> L3 --> ntee_major
    NORM --> L4 --> naics
    NORM --> L4 --> nteev2

    style Raw fill:#fff9e6,stroke:#fdbf11
    style Process fill:#fee39b,stroke:#fdbf11
    style Lookup fill:#fee39b,stroke:#fdbf11
    style Output fill:#fdbf11,stroke:#d19c0f

4.2.6 Address Fields: STREET

flowchart LR
    classDef node fill:#d2d2d2,stroke:#333,color:#000

    subgraph Raw[Raw IRS Column]
        STREET[STREET]
    end

    subgraph Output[Output Columns]
        street_raw[org_addr_street_raw]
        street[org_addr_street]
    end

    subgraph Flags[Quality Flags]
        po_box[org_addr_is_po_box]
        rural[org_addr_is_rural_route]
        special[org_addr_has_special_chars]
        no_num[org_addr_missing_number]
    end

    STREET --> street_raw
    STREET --> street
    STREET --> po_box
    STREET --> rural
    STREET --> special
    STREET --> no_num

    style Raw fill:#fff9e6,stroke:#fdbf11
    style Output fill:#fee39b,stroke:#fdbf11
    style Flags fill:#fdbf11,stroke:#d19c0f

4.2.7 Address Fields: CITY

flowchart LR
    classDef node fill:#d2d2d2,stroke:#333,color:#000

    subgraph Raw[Raw IRS Column]
        CITY[CITY]
    end

    subgraph Output[Output Columns]
        city_raw[org_addr_city_raw]
        city[org_addr_city]
    end

    CITY --> city_raw
    CITY --> city

    style Raw fill:#fff9e6,stroke:#fdbf11
    style Output fill:#fdbf11,stroke:#d19c0f

4.2.8 Address Fields: STATE

flowchart LR
    classDef node fill:#d2d2d2,stroke:#333,color:#000

    subgraph Raw[Raw IRS Column]
        STATE[STATE]
    end

    subgraph Output[Output Columns]
        state_raw[org_addr_state_raw]
        state[org_addr_state]
    end

    subgraph Flags[Quality Flags]
        invalid[org_addr_state_invalid]
    end

    STATE --> state_raw
    STATE --> state
    STATE --> invalid

    style Raw fill:#fff9e6,stroke:#fdbf11
    style Output fill:#fee39b,stroke:#fdbf11
    style Flags fill:#fdbf11,stroke:#d19c0f

4.2.9 Address Fields: ZIP

flowchart LR
    classDef node fill:#d2d2d2,stroke:#333,color:#000

    subgraph Raw[Raw IRS Column]
        ZIP[ZIP]
    end

    subgraph Output[Output Columns]
        zip_raw[org_addr_zip_raw]
        zip5[org_addr_zip5]
        zip4[org_addr_zip4]
        zip[org_addr_zip]
    end

    ZIP --> zip_raw
    ZIP --> zip5
    ZIP --> zip4
    ZIP --> zip

    style Raw fill:#fff9e6,stroke:#fdbf11
    style Output fill:#fdbf11,stroke:#d19c0f

4.2.10 Address Fields: Derived

These fields are derived from all four address components.

flowchart LR
    classDef node fill:#d2d2d2,stroke:#333,color:#000

    subgraph Raw[Raw IRS Columns]
        STREET[STREET]
        CITY[CITY]
        STATE[STATE]
        ZIP[ZIP]
    end

    subgraph Output[Output Columns]
        full[org_addr_full]
        missing[org_addr_is_missing]
    end

    STREET --> full
    CITY --> full
    STATE --> full
    ZIP --> full
    STREET --> missing
    CITY --> missing
    STATE --> missing
    ZIP --> missing

    style Raw fill:#fff9e6,stroke:#fdbf11
    style Output fill:#fdbf11,stroke:#d19c0f

4.2.11 Temporal Fields

flowchart LR
    classDef node fill:#d2d2d2,stroke:#333,color:#000

    subgraph Raw[Raw IRS Columns]
        TAX[TAX_PERIOD]
        ACCT[ACCT_PD]
    end

    subgraph Output[Output Columns]
        tax_str[tax_period_ym_str]
        tax_ymd[tax_period_ymd]
        tax_miss[tax_period_is_missing]
        acct[accounting_period]
    end

    TAX --> tax_str
    TAX --> tax_ymd
    TAX --> tax_miss
    ACCT --> acct

    style Raw fill:#fff9e6,stroke:#fdbf11
    style Output fill:#fdbf11,stroke:#d19c0f

4.2.12 Financial Fields

flowchart LR
    classDef node fill:#d2d2d2,stroke:#333,color:#000

    subgraph Raw[Raw IRS Columns]
        ASSET_CD[ASSET_CD]
        INCOME_CD[INCOME_CD]
        ASSET_AMT[ASSET_AMT]
        INCOME_AMT[INCOME_AMT]
        REVENUE_AMT[REVENUE_AMT]
    end

    subgraph Lookup[Lookup Tables]
        L1[(asset_code)]
        L2[(income_code)]
    end

    subgraph Output[Output Columns]
        asset_code[asset_code]
        asset_def[asset_code_definition]
        income_code[income_code]
        income_def[income_code_definition]
        asset_amt[asset_amount]
        income_amt[income_amount]
        revenue_amt[revenue_amount]
    end

    ASSET_CD --> L1 --> asset_code
    ASSET_CD --> L1 --> asset_def
    INCOME_CD --> L2 --> income_code
    INCOME_CD --> L2 --> income_def
    ASSET_AMT --> asset_amt
    INCOME_AMT --> income_amt
    REVENUE_AMT --> revenue_amt

    style Raw fill:#fff9e6,stroke:#fdbf11
    style Lookup fill:#fee39b,stroke:#fdbf11
    style Output fill:#fdbf11,stroke:#d19c0f

4.2.13 Filing Fields

flowchart LR
    classDef node fill:#d2d2d2,stroke:#333,color:#000

    subgraph Raw[Raw IRS Columns]
        FILING[FILING_REQ_CD]
        PF_FILING[PF_FILING_REQ_CD]
    end

    subgraph Lookup[Lookup Tables]
        L1[(filing_requirement_code)]
        L2[(pf_filing_requirement_code)]
    end

    subgraph Output[Output Columns]
        filing_code[filing_requirement_code]
        filing_def[filing_requirement_code_definition]
        pf_code[pf_filing_requirement_code]
        pf_def[pf_filing_requirement_code_definition]
    end

    FILING --> L1 --> filing_code
    FILING --> L1 --> filing_def
    PF_FILING --> L2 --> pf_code
    PF_FILING --> L2 --> pf_def

    style Raw fill:#fff9e6,stroke:#fdbf11
    style Lookup fill:#fee39b,stroke:#fdbf11
    style Output fill:#fdbf11,stroke:#d19c0f

4.3 Complete Column Lineage Matrix

4.3.1 Raw IRS Columns (31 columns)

Raw Column Data Type Transform Output Column(s)
EIN char(9) transform_ein() ein, ein_raw
NAME char transform_organization_name() org_name_raw, org_name_join, org_name_display, org_parent_name, org_legal_suffix
ICO char transform_bmf_ico_name() in_care_of_name_raw, in_care_of_name_clean, in_care_of_name_provided
STREET char transform_address() org_addr_street_raw, org_addr_street
CITY char transform_address() org_addr_city_raw, org_addr_city
STATE char(2) transform_address() org_addr_state_raw, org_addr_state
ZIP char transform_address() org_addr_zip_raw, org_addr_zip5, org_addr_zip4, org_addr_zip
(derived) - transform_address() org_addr_full, quality flags (6 columns)
GROUP int transform_bmf_group_exemption_number() group_exemption_number_raw, group_exemption_number, group_exemption_is_member
SUBSECTION int transform_bmf_subsection_classification_codes() subsection_code, exempt_organization_type
AFFILIATION int transform_bmf_affiliation_code() affiliation_code, affiliation_code_definition
CLASSIFICATION char(4) transform_bmf_subsection_classification_codes() classification_code, all_classifications_string
RULING int(6) transform_bmf_ruling_date() ruling_date_ym_str, ruling_date, ruling_date_is_missing
DEDUCTIBILITY int transform_bmf_deductibility_code() deductibility_code, deductibility_code_definition
FOUNDATION int transform_bmf_foundation_code() foundation_code, foundation_code_definition
ACTIVITY int(9) transform_bmf_activity_code() activity_code, activity_code_definitions, activity_code_categories
ORGANIZATION int transform_bmf_organization_code() organization_code, organization_code_definition
STATUS int transform_bmf_status_code() status_code, status_code_definition
TAX_PERIOD int(6) transform_tax_period() tax_period_ym_str, tax_period_ymd, tax_period_is_missing
ASSET_CD int transform_bmf_asset_code() asset_code, asset_code_definition
INCOME_CD int transform_bmf_income_code() income_code, income_code_definition
FILING_REQ_CD int transform_bmf_filing_requirement_code() filing_requirement_code, filing_requirement_code_definition
PF_FILING_REQ_CD int transform_bmf_pf_filing_requirement_code() pf_filing_requirement_code, pf_filing_requirement_code_definition
ACCT_PD int transform_accounting_period() accounting_period
ASSET_AMT numeric transform_asset_amount() asset_amount
INCOME_AMT numeric transform_income_amount() income_amount
REVENUE_AMT numeric transform_revenue_amount() revenue_amount
NTEE_CD char(4) transform_ntee_code() ntee_code_raw, ntee_code_clean, ntee_code_definition, ntee_code_major_group, ntee_common_code, ntee_common_code_definition, naics_code, nteev2, nteev2_code, nteev2_subsector, nteev2_org_type
SORT_NAME char transform_dba_name() dba_name_raw, dba_name
REGION int (none) Removed in processed output
RYEAR int (none) Removed in processed output
ID int (none) Removed in processed output

4.3.2 Output Columns by Category

4.3.2.1 Identity Columns

Column Type Source Description
ein char(11) EIN Employer Identification Number formatted as XX-XXXXXXX
ein_raw char EIN Original 9-digit EIN value from source file without formatting
org_name_raw char NAME Original organization name exactly as it appears in the source file
org_name_join char NAME Standardized name for matching and joining (uppercase punctuation removed)
org_name_display char NAME Title-cased organization name suitable for display purposes
org_legal_suffix char NAME Legal entity suffix extracted from name (Inc Corp LLC Foundation etc.)
org_parent_name char NAME Parent organization name if this is a subordinate/chapter organization
in_care_of_name_raw char ICO Original In Care Of field from source file
in_care_of_name_clean char ICO Cleaned ICO name with standardized formatting
in_care_of_name_provided bool ICO Boolean indicating whether an ICO name was provided
group_exemption_number_raw char GROUP Original group exemption number from source file
group_exemption_number char(4) GROUP Cleaned group exemption number (GEN)
group_exemption_is_member bool GROUP Boolean indicating if organization is a member of a group filing
ruling_date_ym_str char(6) RULING Ruling date as YYYYMM string
ruling_date Date RULING Date of IRS ruling granting exempt status. If missing assigned 1900-01-01
ruling_date_is_missing bool RULING TRUE if ruling date is missing or invalid
dba_name_raw char SORT_NAME Original secondary/DBA name from source file
dba_name char SORT_NAME Cleaned Doing Business As name

4.3.2.2 Address Columns

Column Type Source Description
org_addr_street_raw char STREET Original street address from source file
org_addr_street char STREET Standardized street address with USPS abbreviations
org_addr_city_raw char CITY Original city name from source file
org_addr_city char CITY Cleaned city name
org_addr_state_raw char STATE Original state code from source file
org_addr_state char(2) STATE Two-letter state abbreviation
org_addr_zip_raw char ZIP Original ZIP code from source file
org_addr_zip5 char(5) ZIP 5-digit ZIP code
org_addr_zip4 char(4) ZIP 4-digit ZIP code extension (if available)
org_addr_zip char(10) ZIP Full ZIP code (5 or 9 digits)
org_addr_full char derived Complete formatted address string
org_addr_is_missing bool derived TRUE if street address is missing or empty
org_addr_is_po_box bool STREET TRUE if address is a P.O. Box
org_addr_is_rural_route bool STREET TRUE if address is a rural route
org_addr_has_special_chars bool STREET TRUE if address contains unusual special characters
org_addr_missing_number bool STREET TRUE if street address lacks a street number
org_addr_state_invalid bool STATE TRUE if state code is not a valid US state/territory

4.3.2.3 Classification Columns

Column Type Source Description
subsection_code char SUBSECTION IRS subsection code (e.g. 03 for 501(c)(3) 04 for 501(c)(4))
classification_code char CLASSIFICATION IRS classification code indicating organization type within subsection
exempt_organization_type char SUBSECTION + lookup Human-readable exempt organization type based on subsection
all_classifications_string char CLASSIFICATION + lookup Semicolon-separated list of all classification descriptions
affiliation_code int AFFILIATION Code indicating relationship to parent organization (1-9)
affiliation_code_definition char lookup Description of affiliation relationship
deductibility_code int DEDUCTIBILITY Code indicating deductibility status of contributions (1-4)
deductibility_code_definition char lookup Description of contribution deductibility
foundation_code int FOUNDATION Foundation status code (00-99) per IRS determination
foundation_code_definition char lookup Description of foundation/public charity status
organization_code int ORGANIZATION Code for type of organization (corporation trust etc.)
organization_code_definition char lookup Description of organization type
status_code int STATUS IRS determination status code (01-99)
status_code_definition char lookup Description of exempt status

4.3.2.4 Activity Columns

Column Type Source Description
activity_code char ACTIVITY Three 3-digit activity codes concatenated (9 characters total)
activity_code_definitions char ACTIVITY + lookup Semicolon-separated descriptions of activity codes
activity_code_categories char ACTIVITY + lookup Semicolon-separated activity categories
ntee_code_raw char NTEE_CD Original NTEE code from source file (1-4 characters)
ntee_code_clean char NTEE_CD Standardized 3-character NTEE code
ntee_code_definition char lookup Full description of NTEE classification
ntee_code_major_group char NTEE_CD NTEE major group letter (A-Z) indicating broad category
ntee_common_code char NTEE_CD Common code suffix for 4-character NTEE codes
ntee_common_code_definition char lookup Description of common code suffix
naics_code char lookup North American Industry Classification System code derived from NTEE
nteev2 char derived Full NTEEv2 code in SUBSECTOR-CODE-TYPE format
nteev2_code char derived NTEEv2 code portion (3 characters)
nteev2_subsector char derived NTEEv2 subsector code (e.g. UNI HOS ART ENV)
nteev2_org_type char derived NTEEv2 organization type (RG=Regular AA=Alliance etc.)

4.3.2.5 Temporal Columns

Column Type Source Description
tax_period_ym_str char TAX_PERIOD Tax period end date as YYYYMM string
tax_period_ymd Date TAX_PERIOD Tax period end date in YYYY-MM-DD format. If missing assigned 1900-01-01
tax_period_is_missing bool TAX_PERIOD TRUE if tax period is missing
accounting_period char ACCT_PD Month when organization’s accounting period ends (01-12)

4.3.2.6 Financial Columns

Column Type Source Description
asset_code int ASSET_CD Asset amount range code (0-9)
asset_code_definition char lookup Description of asset range (e.g. $100000 to $499999)
income_code int INCOME_CD Income amount range code (0-9)
income_code_definition char lookup Description of income range
asset_amount numeric ASSET_AMT Total assets in dollars (most recent return)
income_amount numeric INCOME_AMT Total income in dollars (can be negative)
revenue_amount numeric REVENUE_AMT Total revenue in dollars (can be negative)

4.3.2.7 Filing Columns

Column Type Source Description
filing_requirement_code int FILING_REQ_CD Code indicating required annual return form (0-6)
filing_requirement_code_definition char lookup Description of filing requirement (990 990-EZ 990-N etc.)
pf_filing_requirement_code int PF_FILING_REQ_CD Private foundation filing requirement code
pf_filing_requirement_code_definition char lookup Description of private foundation filing requirement

4.4 Dimension Table Lineage

4.4.1 Activity Code Dimension Table

flowchart TB
    classDef node fill:#d2d2d2,stroke:#333,color:#000

    subgraph Input[Input]
        ACT[ACTIVITY column]
    end

    subgraph Split[Split 9-char field]
        C1[Code 1]
        C2[Code 2]
        C3[Code 3]
    end

    subgraph Melt[Melt to Long Format]
        R1[Row 1: ein + code]
        R2[Row 2: ein + code]
        R3[Row 3: ein + code]
    end

    subgraph Join[Join with Lookup]
        J1[Add definition]
        J2[Add definition]
        J3[Add definition]
    end

    subgraph Output[dim_activity Table]
        OUT[ein + activity_code + definition + category]
    end

    ACT --> C1
    ACT --> C2
    ACT --> C3
    C1 --> R1
    C2 --> R2
    C3 --> R3
    R1 --> J1
    R2 --> J2
    R3 --> J3
    J1 --> OUT
    J2 --> OUT
    J3 --> OUT

    style Input fill:#fff9e6,stroke:#fdbf11
    style Split fill:#fee39b,stroke:#fdbf11
    style Melt fill:#fee39b,stroke:#fdbf11
    style Join fill:#fee39b,stroke:#fdbf11
    style Output fill:#fdbf11,stroke:#d19c0f

4.4.2 Classification Dimension Table

flowchart TB
    classDef node fill:#d2d2d2,stroke:#333,color:#000

    subgraph Input[Input]
        CLASS[CLASSIFICATION column]
        SUB[SUBSECTION column]
    end

    subgraph Extract[Extract up to 4 codes]
        C1[Code 1]
        C2[Code 2]
        C3[Code 3]
        C4[Code 4]
    end

    subgraph Filter[Remove 0 codes]
        F[Valid codes only]
    end

    subgraph Join[Join with Lookup]
        J[Add classification_description]
    end

    subgraph Output[dim_classification Table]
        OUT[ein + subsection_code + cl_code + description]
    end

    CLASS --> C1
    CLASS --> C2
    CLASS --> C3
    CLASS --> C4
    SUB --> Join
    C1 --> Filter
    C2 --> Filter
    C3 --> Filter
    C4 --> Filter
    Filter --> Join
    Join --> Output

    style Input fill:#fff9e6,stroke:#fdbf11
    style Extract fill:#fee39b,stroke:#fdbf11
    style Filter fill:#fee39b,stroke:#fdbf11
    style Join fill:#fee39b,stroke:#fdbf11
    style Output fill:#fdbf11,stroke:#d19c0f

4.5 Data Type Transformations

Source Type Target Type Transform Notes
char(9) char(11) EIN formatting Adds hyphen: XX-XXXXXXX
char char Name cleaning Title case, suffix extraction
int(6) Date Date parsing YYYYMM → YYYY-MM-01
int(9) multiple Activity unpivot One field → 3 codes + descriptions
int int + char Lookup join Adds *_definition column
numeric numeric Amount validation Validates numeric values, warns on unexpected negatives

4.6 Validation Rules by Column

Column Validation Action on Failure
EIN 9 digits after cleaning Warning if format mismatch
RULING 6 digits (YYYYMM) Replace with sentinel date
TAX_PERIOD 6 digits (YYYYMM) Replace with sentinel date
ASSET_AMT Numeric, no negative Warning on negative values
INCOME_AMT Numeric Allow negative (can be loss)
Lookup joins Match rate > 99% Warning if unmatched > 1%