```mermaid
flowchart LR
    subgraph input["Input"]
        A[("S3 Bucket<br/>nccsdata/raw/bmf")]
    end
    subgraph process["Processing"]
        B["ETL Pipeline<br/>data.table | R"]
    end
    subgraph intermediate["Intermediate"]
        C[("Parquet<br/>All Columns")]
        D[("Quality Report<br/>JSON")]
    end
    subgraph final["Final Outputs"]
        F[("Processed BMF<br/>CSV")]
        G[("Data Dictionary<br/>CSV")]
        H[("Quality Report<br/>HTML")]
    end
    subgraph storage["S3 Storage"]
        E[("nccsdata/<br/>intermediate/bmf")]
        I[("nccsdata/<br/>processed/bmf")]
    end
    A --> B
    B --> C
    B --> D
    B --> G
    C --> E
    D --> E
    C --> F
    D --> H
    F --> I
    G --> I
    style input fill:#fff9e6,stroke:#fdbf11
    style process fill:#fee39b,stroke:#fdbf11
    style intermediate fill:#fee39b,stroke:#fdbf11
    style final fill:#fdbf11,stroke:#d19c0f
    style storage fill:#fdbf11,stroke:#d19c0f
```
# BMF Data Pipeline Guide

## 1 Overview
This guide documents the IRS Business Master File (BMF) Data Processing Pipeline developed by the National Center for Charitable Statistics (NCCS) at the Urban Institute.
The BMF is the IRS’s comprehensive database of tax-exempt organizations, containing information on approximately 1.9 million nonprofits in the United States.
### 1.1 What This Pipeline Does
The pipeline:
- Extracts raw BMF data from the S3 bucket (monthly combined CSV)
- Transforms 31 raw columns into ~76 standardized, validated columns
- Loads intermediate data as Parquet files containing all columns for manual inspection
- Validates intermediate datasets, producing `.json` and `.html` quality reports
- Cuts the ~76 transformed columns into a `.csv` data mart with an accompanying data dictionary
### 1.2 Quick Start

#### 1.2.1 Prerequisites
```r
# Required packages
install.packages(c(
  "data.table", # High-performance data manipulation
  "arrow",      # Parquet I/O
  "aws.s3",     # S3 bucket operations
  "stringr",    # String operations
  "lubridate",  # Date parsing
  "openxlsx",   # Excel lookup files
  "here",       # Path management
  "purrr",      # Functional programming
  "jsonlite",   # Quality report serialization
  "quarto"      # Optional: for rendering HTML quality reports
))
```

AWS credentials must be configured via environment variables or `~/.aws/credentials`.
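If credentials are not yet configured, they can be set as environment variables before running the pipeline. A minimal sketch, assuming the standard AWS variable names that `aws.s3` reads; the region value is a placeholder to adjust for the bucket's actual region:

```r
library(aws.s3)

# Standard AWS environment variables picked up by aws.s3
Sys.setenv(
  "AWS_ACCESS_KEY_ID"     = "your-key-id",
  "AWS_SECRET_ACCESS_KEY" = "your-secret-key",
  "AWS_DEFAULT_REGION"    = "us-east-1"  # assumption: adjust to the bucket's region
)

# Sanity check: confirm the raw-data bucket is reachable
stopifnot(bucket_exists("nccsdata"))
```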
#### 1.2.2 Run the Pipeline
```r
# From the project root directory - downloads the most recent BMF from S3
source("R/run_pipeline.R")

# To process a specific month:
BMF_YEAR  <- 2025
BMF_MONTH <- 1
source("R/run_pipeline.R")

# To list available BMF files in S3:
source("R/config.R")
list_available_bmf_files()
```

The pipeline will:
- Download the BMF from the S3 bucket `nccsdata/raw/bmf/`
- Run pre-transformation validation
- Execute 24 transformation functions across 11 phases
- Save checkpoints after each phase (to `data/checkpoints/`)
- Generate a quality report
- Save intermediate outputs containing all columns to `data/intermediate/`
- Output the final CSV file, with transformed columns and a data dictionary, to `data/processed/`
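The per-phase checkpoints mean a failed run can resume from the last completed phase rather than starting over. A minimal sketch of that save/load pattern; the function names and file-naming scheme here are illustrative, not the pipeline's actual API:

```r
library(arrow)

# Hypothetical checkpoint helpers (names are illustrative, not the pipeline's API)
save_checkpoint <- function(dt, phase) {
  path <- file.path("data", "checkpoints", sprintf("phase_%02d.parquet", phase))
  write_parquet(dt, path)
}

load_last_checkpoint <- function() {
  files <- sort(list.files("data/checkpoints", pattern = "\\.parquet$",
                           full.names = TRUE))
  if (length(files) == 0) return(NULL)
  read_parquet(tail(files, 1))
}
```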
### 1.3 Pipeline Outputs

#### 1.3.1 Intermediate Outputs (`data/intermediate/`)
| Output File | Description | Rows |
|---|---|---|
| `bmf_YYYY_MM_intermediate.parquet` | All columns (raw + transformed) for inspection | ~1.9M |
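The intermediate Parquet file can be inspected without pulling all ~1.9M rows into memory, for example with `arrow::open_dataset()`. A sketch, assuming a January 2025 run so the file name follows the documented `YYYY_MM` pattern:

```r
library(arrow)

# Open the file lazily (no data is read yet)
ds <- open_dataset("data/intermediate/bmf_2025_01_intermediate.parquet")
names(ds)  # all raw + transformed column names
nrow(ds)   # row count, computed from Parquet metadata

# Materialize only the first few columns for closer inspection
head(read_parquet("data/intermediate/bmf_2025_01_intermediate.parquet",
                  col_select = 1:5))
```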
#### 1.3.2 Processed Outputs (`data/processed/`)
| Output File | Description | Rows |
|---|---|---|
| `bmf_YYYY_MM_processed.csv` | Transformed columns only (raw columns removed) | ~1.9M |
| `bmf_YYYY_MM_data_dictionary.csv` | Column metadata and statistics | - |
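Because the processed output is a plain CSV, it loads directly with `data.table` and can be cross-checked against the data dictionary. A sketch, assuming a January 2025 run; the dictionary column name `variable` is an assumption about its schema:

```r
library(data.table)

bmf  <- fread("data/processed/bmf_2025_01_processed.csv")
dict <- fread("data/processed/bmf_2025_01_data_dictionary.csv")

# Every column in the data mart should be described in the dictionary
# ("variable" as the dictionary's name field is an assumption)
setdiff(names(bmf), dict$variable)
```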
#### 1.3.3 Quality Reports
| Output File | Location | Description |
|---|---|---|
| `bmf_YYYY_MM_quality_report.json` | `data/quality/` | Quality metrics and validation results |
| `bmf_YYYY_MM_quality_report.html` | `docs/quality-reports/` | Human-readable quality report (published to GitHub Pages) |
Checkpoints are saved to `data/checkpoints/` after each transformation phase.
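The JSON report is machine-readable, so validation results can be pulled into R for programmatic checks. A sketch, assuming a January 2025 run; the report's internal structure is not documented here, so it is explored with `str()` rather than assumed:

```r
library(jsonlite)

report <- fromJSON("data/quality/bmf_2025_01_quality_report.json")
str(report, max.level = 1)  # inspect the report's top-level fields
```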
### 1.4 Documentation Structure
| Chapter | Description |
|---|---|
| Architecture | 11-phase pipeline overview, orchestration, checkpointing |
| Data Lineage | Column mappings, transformation flows, data dictionary |
| Transform Reference | All 24 transform functions with inputs/outputs |
| Dimension Tables | SCD Type 2 patterns, schemas, join examples |
| Quality Gates | Pre/post validation, troubleshooting |
| Configuration | Config files, lookup tables, dependencies |
| Developer Guide | Code patterns, contributing, testing |
| Lookup Tables | Complete lookup table definitions and values |
### 1.5 IRS BMF Source
The raw BMF data is published by the IRS at: IRS Exempt Organizations BMF Extract
Field definitions are documented in the IRS Internal Revenue Manual: IRM 25.7.1
### 1.6 Repository Structure
```
nccs-data-bmf/
├── R/
│   ├── run_pipeline.R       # Pipeline orchestrator
│   ├── config.R             # S3 configuration and lookups
│   ├── checkpoints.R        # Save/load intermediate states
│   ├── input_validation.R   # Shared validation functions
│   ├── utils/               # Logging and utilities
│   ├── quality/             # Pre/post quality checks
│   └── [transform files]    # 21 transformation modules (24 functions)
├── data/
│   ├── raw/                 # Downloaded BMF files
│   ├── lookup/              # Reference/lookup tables
│   ├── dictionaries/        # Column descriptions for data dictionary
│   ├── checkpoints/         # Intermediate saves
│   ├── intermediate/        # Intermediate parquet outputs
│   ├── processed/           # Final output files
│   └── quality/             # Quality reports
└── docs/                    # This guidebook
```