BMF Data Pipeline Guide

Authors

Jesse Lecy

Thiyaghessan

Published

February 4, 2026

1 Overview

This guide documents the IRS Business Master File (BMF) Data Processing Pipeline developed by the National Center for Charitable Statistics (NCCS) at the Urban Institute.

The BMF is the IRS’s comprehensive database of tax-exempt organizations, containing information on approximately 1.9 million nonprofits in the United States.

1.1 What This Pipeline Does

flowchart LR
    subgraph input["Input"]
        A[("S3 Bucket<br/>nccsdata/raw/bmf")]
    end

    subgraph process["Processing"]
        B["ETL Pipeline<br/>data.table | R"]
    end

    subgraph intermediate["Intermediate"]
        C[("Parquet<br/>All Columns")]
        D[("Quality Report<br/>JSON")]
    end

    subgraph final["Final Outputs"]
        F[("Processed BMF<br/>CSV")]
        G[("Data Dictionary<br/>CSV")]
        H[("Quality Report<br/>HTML")]
    end

    subgraph storage["S3 Storage"]
        E[("nccsdata/<br/>intermediate/bmf")]
        I[("nccsdata/<br/>processed/bmf")]
    end

    A --> B
    B --> C
    B --> D
    B --> G
    C --> E
    D --> E
    C --> F
    D --> H
    F --> I
    G --> I

    style input fill:#fff9e6,stroke:#fdbf11
    style process fill:#fee39b,stroke:#fdbf11
    style intermediate fill:#fee39b,stroke:#fdbf11
    style final fill:#fdbf11,stroke:#d19c0f
    style storage fill:#fdbf11,stroke:#d19c0f

The pipeline:

  1. Extracts raw BMF data from the S3 bucket (a monthly combined CSV)
  2. Transforms 31 raw columns into ~76 standardized, validated columns
  3. Loads intermediate data as Parquet files containing all columns for manual inspection
  4. Validates intermediate data sets, producing .json and .html quality reports
  5. Cuts the ~76 transformed columns into a .csv data mart with an accompanying data dictionary
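The five steps above can be sketched in R. This is an illustrative outline only: the helper functions (extract_bmf, transform_bmf, validate_bmf, transformed_columns) are hypothetical stand-ins for the pipeline's internal modules, not its actual API.

```r
# Sketch of the five ETL stages; helper function names are hypothetical.
library(data.table)
library(arrow)

run_bmf_sketch <- function(year, month) {
  raw <- extract_bmf(year, month)        # 1. pull monthly combined CSV from S3
  dt  <- transform_bmf(raw)              # 2. 31 raw cols -> ~76 standardized cols

  # 3. intermediate Parquet with all columns, for manual inspection
  write_parquet(dt, sprintf("data/intermediate/bmf_%d_%02d_intermediate.parquet",
                            year, month))

  report <- validate_bmf(dt)             # 4. quality metrics -> .json / .html

  # 5. cut transformed columns into the CSV data mart
  final <- dt[, .SD, .SDcols = transformed_columns(dt)]
  fwrite(final, sprintf("data/processed/bmf_%d_%02d_processed.csv", year, month))

  invisible(report)
}
```

The real orchestration lives in R/run_pipeline.R; this sketch only mirrors the data flow shown in the diagram above.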

1.2 Quick Start

1.2.1 Prerequisites

# Required packages
install.packages(c(
  "data.table",   # High-performance data manipulation
  "arrow",        # Parquet I/O
  "aws.s3",       # S3 bucket operations
  "stringr",      # String operations
  "lubridate",    # Date parsing
  "openxlsx",     # Excel lookup files
  "here",         # Path management
  "purrr",        # Functional programming
  "jsonlite",     # Quality report serialization
  "quarto"        # Optional: for rendering HTML quality reports
))

AWS credentials must be configured via environment variables or ~/.aws/credentials.
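For example, credentials can be supplied from R before sourcing the pipeline. The values below are placeholders, and the region is an assumption; aws.s3 reads these standard AWS environment variables.

```r
# Placeholder credentials; aws.s3 picks these up from the environment.
Sys.setenv(
  "AWS_ACCESS_KEY_ID"     = "YOUR_ACCESS_KEY",
  "AWS_SECRET_ACCESS_KEY" = "YOUR_SECRET_KEY",
  "AWS_DEFAULT_REGION"    = "us-east-1"   # assumed region; adjust as needed
)
```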

1.2.2 Run the Pipeline

# From the project root directory - downloads the most recent BMF from S3
source("R/run_pipeline.R")

# To process a specific month:
BMF_YEAR <- 2025
BMF_MONTH <- 1
source("R/run_pipeline.R")

# To list available BMF files in S3:
source("R/config.R")
list_available_bmf_files()

The pipeline will:

  • Download BMF from S3 bucket nccsdata/raw/bmf/
  • Run pre-transformation validation
  • Execute 24 transformation functions across 11 phases
  • Save checkpoints after each phase (to data/checkpoints/)
  • Generate a quality report
  • Save intermediate outputs (all columns) to data/intermediate/
  • Write the final CSV and its data dictionary (transformed columns only) to data/processed/

1.3 Pipeline Outputs

1.3.1 Intermediate Outputs (data/intermediate/)

Output File                       Description                                     Rows
bmf_YYYY_MM_intermediate.parquet  All columns (raw + transformed) for inspection  ~1.9M
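The intermediate Parquet file can be inspected directly with arrow. The file name below follows the bmf_YYYY_MM pattern for an assumed January 2025 run; substitute your own year and month.

```r
library(arrow)
library(data.table)

# Load the intermediate output (all raw + transformed columns) for inspection.
# File name assumes a January 2025 run of the pipeline.
dt <- as.data.table(read_parquet("data/intermediate/bmf_2025_01_intermediate.parquet"))

# Quick checks: dimensions and column names
dim(dt)
names(dt)
```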

1.3.2 Processed Outputs (data/processed/)

Output File                      Description                                     Rows
bmf_YYYY_MM_processed.csv        Transformed columns only (raw columns removed)  ~1.9M
bmf_YYYY_MM_data_dictionary.csv  Column metadata and statistics                  -
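The processed CSV and its dictionary pair naturally in data.table. As above, the file names assume a January 2025 run, and the dictionary's exact column layout should be checked against the actual file.

```r
library(data.table)

# Read the data mart and its accompanying dictionary
# (file names assume a January 2025 pipeline run).
bmf  <- fread("data/processed/bmf_2025_01_processed.csv")
dict <- fread("data/processed/bmf_2025_01_data_dictionary.csv")

# Consult the dictionary before using a column; the dictionary's
# column layout is assumed here -- inspect the file to confirm.
head(dict)
```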

1.3.3 Quality Reports

Output File                      Location               Description
bmf_YYYY_MM_quality_report.json  data/quality/          Quality metrics and validation results
bmf_YYYY_MM_quality_report.html  docs/quality-reports/  Human-readable quality report (published to GitHub Pages)

Checkpoints are saved to data/checkpoints/ after each transformation phase.

1.4 Documentation Structure

Chapter              Description
Architecture         11-phase pipeline overview, orchestration, checkpointing
Data Lineage         Column mappings, transformation flows, data dictionary
Transform Reference  All 24 transform functions with inputs/outputs
Dimension Tables     SCD Type 2 patterns, schemas, join examples
Quality Gates        Pre/post validation, troubleshooting
Configuration        Config files, lookup tables, dependencies
Developer Guide      Code patterns, contributing, testing
Lookup Tables        Complete lookup table definitions and values

1.5 IRS BMF Source

The raw BMF data is published by the IRS at: IRS Exempt Organizations BMF Extract

Field definitions are documented in the IRS Internal Revenue Manual: IRM 25.7.1

1.6 Repository Structure

nccs-data-bmf/
├── R/
│   ├── run_pipeline.R             # Pipeline orchestrator
│   ├── config.R                   # S3 configuration and lookups
│   ├── checkpoints.R              # Save/load intermediate states
│   ├── input_validation.R         # Shared validation functions
│   ├── utils/                     # Logging and utilities
│   ├── quality/                   # Pre/post quality checks
│   └── [transform files]          # 21 transformation modules (24 functions)
├── data/
│   ├── raw/                       # Downloaded BMF files
│   ├── lookup/                    # Reference/lookup tables
│   ├── dictionaries/              # Column descriptions for data dictionary
│   ├── checkpoints/               # Intermediate saves
│   ├── intermediate/              # Intermediate parquet outputs
│   ├── processed/                 # Final output files
│   └── quality/                   # Quality reports
└── docs/                          # This guidebook