```mermaid
flowchart LR
    subgraph input["Input"]
        A[("S3 Bucket<br/>nccsdata/raw/bmf")]
    end
    subgraph process["Processing"]
        B["ETL Pipeline<br/>data.table | R"]
    end
    subgraph intermediate["Intermediate"]
        C[("Parquet<br/>All Columns")]
        D[("Quality Report<br/>JSON")]
    end
    subgraph final["Final Outputs"]
        F[("Processed BMF<br/>CSV")]
        G[("Data Dictionary<br/>CSV")]
        H[("Quality Report<br/>HTML")]
    end
    subgraph storage["S3 Storage"]
        E[("nccsdata/<br/>intermediate/bmf")]
        I[("nccsdata/<br/>processed/bmf")]
    end
    A --> B
    B --> C
    B --> D
    B --> G
    C --> E
    D --> E
    C --> F
    D --> H
    F --> I
    G --> I
    style input fill:#fff9e6,stroke:#fdbf11
    style process fill:#fee39b,stroke:#fdbf11
    style intermediate fill:#fee39b,stroke:#fdbf11
    style final fill:#fdbf11,stroke:#d19c0f
    style storage fill:#fdbf11,stroke:#d19c0f
```
# BMF Data Pipeline Guide

## 1 Overview
This guide documents the IRS Business Master File (BMF) Data Processing Pipeline developed by the National Center for Charitable Statistics (NCCS) at the Urban Institute.
The BMF is the IRS’s comprehensive database of tax-exempt organizations, containing information on approximately 1.9 million nonprofits in the United States.
### 1.1 What This Pipeline Does
The pipeline:
- Extracts raw BMF data from the S3 bucket (monthly combined CSV)
- Transforms 31 raw columns into ~76 standardized, validated columns
- Loads intermediate data as Parquet files containing all columns for manual inspection
- Validates intermediate datasets, producing `.json` and `.html` quality reports
- Cuts the ~76 transformed columns into a `.csv` data mart with an accompanying data dictionary
### 1.2 Quick Start

#### 1.2.1 Prerequisites
```r
# Required packages
install.packages(c(
  "data.table", # High-performance data manipulation
  "arrow",      # Parquet I/O
  "aws.s3",     # S3 bucket operations
  "stringr",    # String operations
  "lubridate",  # Date parsing
  "openxlsx",   # Excel lookup files
  "here",       # Path management
  "purrr",      # Functional programming
  "jsonlite",   # Quality report serialization
  "quarto"      # Optional: for rendering HTML quality reports
))
```

AWS credentials must be configured via environment variables or `~/.aws/credentials`.
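If credentials are not yet configured, they can be set as environment variables before running the pipeline. A minimal sketch, assuming the standard AWS variable names that `aws.s3` reads; the region value is a placeholder to adjust for the bucket's actual region:

```r
library(aws.s3)

# Standard AWS environment variables picked up by aws.s3
Sys.setenv(
  "AWS_ACCESS_KEY_ID"     = "your-key-id",
  "AWS_SECRET_ACCESS_KEY" = "your-secret-key",
  "AWS_DEFAULT_REGION"    = "us-east-1"  # assumption: adjust to the bucket's region
)

# Sanity check: confirm the raw-data bucket is reachable
stopifnot(bucket_exists("nccsdata"))
```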
#### 1.2.2 Run the Pipeline
```r
# From the project root directory - downloads the most recent BMF from S3
source("R/run_pipeline.R")

# To process a specific month:
BMF_YEAR  <- 2025
BMF_MONTH <- 1
source("R/run_pipeline.R")

# To list available BMF files in S3:
source("R/config.R")
list_available_bmf_files()
```

The pipeline will:
- Download the BMF from the S3 bucket `nccsdata/raw/bmf/`
- Run pre-transformation validation
- Execute 24 transformation functions across 11 phases
- Save checkpoints after each phase (to `data/checkpoints/`)
- Generate a quality report
- Save intermediate outputs containing all columns to `data/intermediate/`
- Output the final CSV file, with transformed columns and a data dictionary, to `data/processed/`
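The per-phase checkpoints mean a failed run can resume from the last completed phase rather than starting over. A minimal sketch of that save/load pattern; the function names and file-naming scheme here are illustrative, not the pipeline's actual API:

```r
library(arrow)

# Hypothetical checkpoint helpers (names are illustrative, not the pipeline's API)
save_checkpoint <- function(dt, phase) {
  path <- file.path("data", "checkpoints", sprintf("phase_%02d.parquet", phase))
  write_parquet(dt, path)
}

load_last_checkpoint <- function() {
  files <- sort(list.files("data/checkpoints", pattern = "\\.parquet$",
                           full.names = TRUE))
  if (length(files) == 0) return(NULL)
  read_parquet(tail(files, 1))
}
```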
### 1.3 Pipeline Outputs

#### 1.3.1 Intermediate Outputs (`data/intermediate/`)
| Output File | Description | Rows |
|---|---|---|
| `bmf_YYYY_MM_intermediate.parquet` | All columns (raw + transformed) for inspection | ~1.9M |
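The intermediate Parquet file can be inspected without pulling all ~1.9M rows into memory, for example with `arrow::open_dataset()`. A sketch, assuming a January 2025 run so the file name follows the documented `YYYY_MM` pattern:

```r
library(arrow)

# Open the file lazily (no data is read yet)
ds <- open_dataset("data/intermediate/bmf_2025_01_intermediate.parquet")
names(ds)  # all raw + transformed column names
nrow(ds)   # row count, computed from Parquet metadata

# Materialize only the first few columns for closer inspection
head(read_parquet("data/intermediate/bmf_2025_01_intermediate.parquet",
                  col_select = 1:5))
```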
#### 1.3.2 Processed Outputs (`data/processed/`)
| Output File | Description | Rows |
|---|---|---|
| `bmf_YYYY_MM_processed.csv` | Transformed columns only (raw columns removed) | ~1.9M |
| `bmf_YYYY_MM_data_dictionary.csv` | Column metadata and statistics | - |
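Because the processed output is a plain CSV, it loads directly with `data.table` and can be cross-checked against the data dictionary. A sketch, assuming a January 2025 run; the dictionary column name `variable` is an assumption about its schema:

```r
library(data.table)

bmf  <- fread("data/processed/bmf_2025_01_processed.csv")
dict <- fread("data/processed/bmf_2025_01_data_dictionary.csv")

# Every column in the data mart should be described in the dictionary
# ("variable" as the dictionary's name field is an assumption)
setdiff(names(bmf), dict$variable)
```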
#### 1.3.3 Quality Reports
| Output File | Location | Description |
|---|---|---|
| `bmf_YYYY_MM_quality_report.json` | `data/quality/` | Quality metrics and validation results |
| `bmf_YYYY_MM_quality_report.html` | `docs/quality-reports/` | Human-readable quality report (published to GitHub Pages) |
Checkpoints are saved to `data/checkpoints/` after each transformation phase.
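The JSON report is machine-readable, so validation results can be pulled into R for programmatic checks. A sketch, assuming a January 2025 run; the report's internal structure is not documented here, so it is explored with `str()` rather than assumed:

```r
library(jsonlite)

report <- fromJSON("data/quality/bmf_2025_01_quality_report.json")
str(report, max.level = 1)  # inspect the report's top-level fields
```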
### 1.4 Documentation Structure
| Chapter | Description |
|---|---|
| Architecture | 11-phase pipeline overview, orchestration, checkpointing |
| Data Lineage | Column mappings, transformation flows, data dictionary |
| Transform Reference | All 24 transform functions with inputs/outputs |
| Dimension Tables | SCD Type 2 patterns, schemas, join examples |
| Quality Gates | Pre/post validation, troubleshooting |
| Configuration | Config files, lookup tables, dependencies |
| Developer Guide | Code patterns, contributing, testing |
| Lookup Tables | Complete lookup table definitions and values |
### 1.5 IRS BMF Source
The raw BMF data is published by the IRS at: IRS Exempt Organizations BMF Extract
Field definitions are documented in the IRS Internal Revenue Manual: IRM 25.7.1
### 1.6 Repository Structure
```
nccs-data-bmf/
├── R/
│   ├── run_pipeline.R       # Pipeline orchestrator
│   ├── config.R             # S3 configuration and lookups
│   ├── checkpoints.R        # Save/load intermediate states
│   ├── input_validation.R   # Shared validation functions
│   ├── utils/               # Logging and utilities
│   ├── quality/             # Pre/post quality checks
│   └── [transform files]    # 21 transformation modules (24 functions)
├── data/
│   ├── raw/                 # Downloaded BMF files
│   ├── lookup/              # Reference/lookup tables
│   ├── dictionaries/        # Column descriptions for data dictionary
│   ├── checkpoints/         # Intermediate saves
│   ├── intermediate/        # Intermediate parquet outputs
│   ├── processed/           # Final output files
│   └── quality/             # Quality reports
└── docs/                    # This guidebook
```