8 Configuration Reference
This chapter documents all configuration files, lookup tables, and dependencies.
8.1 Pipeline Configuration
File: R/run_pipeline.R (top section)
# Enable checkpointing (saves intermediate results to parquet)
ENABLE_CHECKPOINTS <- TRUE
CHECKPOINT_DIR <- "data/checkpoints"
# Enable strict quality gates (stops on validation failure)
STRICT_QUALITY_GATES <- TRUE
# Enable S3 upload of processed data and quality report
ENABLE_S3_UPLOAD <- TRUE
# BMF source configuration - set before sourcing to override defaults
# If not set, downloads most recent BMF file from S3
if (!exists("BMF_YEAR")) BMF_YEAR <- NULL # e.g., 2025
if (!exists("BMF_MONTH")) BMF_MONTH <- NULL # e.g., 1
# Processing year/month - set automatically from downloaded file
PROCESSING_YEAR <- NULL
PROCESSING_MONTH <- NULL

| Setting | Type | Default | Description |
|---|---|---|---|
| `ENABLE_CHECKPOINTS` | logical | `TRUE` | Save intermediate results after each phase |
| `CHECKPOINT_DIR` | character | `"data/checkpoints"` | Directory for checkpoint files |
| `STRICT_QUALITY_GATES` | logical | `TRUE` | Stop pipeline on validation failure |
| `BMF_YEAR` | integer | `NULL` | Override to download a specific year (otherwise most recent) |
| `BMF_MONTH` | integer | `NULL` | Override to download a specific month (otherwise most recent) |
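As the comments in the code above note, `BMF_YEAR` and `BMF_MONTH` are guarded with `exists()`, so they can be set in the session before sourcing the pipeline; a minimal sketch of that pattern (the year and month values here are illustrative):
# Request a specific BMF file before sourcing (values are illustrative)
BMF_YEAR <- 2025
BMF_MONTH <- 1
source("R/run_pipeline.R")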
8.2 Data Source Configuration
File: R/config.R
8.2.1 S3 Bucket Configuration
BMF files are downloaded from S3 bucket nccsdata. A Lambda function ingests monthly BMF files from the IRS and deposits them at raw/bmf/YYYY-MM-BMF.csv.
# S3 bucket configuration
BMF_S3_BUCKET <- "nccsdata"
BMF_S3_PREFIX <- "raw/bmf/"
BMF_S3_INTERMEDIATE_PREFIX <- "intermediate/bmf/"
BMF_S3_PROCESSED_PREFIX <- "processed/bmf/"
8.2.2 S3 Download Functions
# Download BMF from S3 (defaults to most recent file)
bmf_raw <- download_bmf_from_s3(
bucket = BMF_S3_BUCKET,
prefix = BMF_S3_PREFIX,
year = NULL, # Optional: specific year
month = NULL # Optional: specific month
)
# List available BMF files in S3
list_available_bmf_files(bucket = BMF_S3_BUCKET, prefix = BMF_S3_PREFIX)
# [1] "2025-01" "2024-12" "2024-11" ...AWS Credentials: Requires AWS credentials configured via environment variables or ~/.aws/credentials.
8.2.3 S3 Upload Configuration
Processed BMF files and quality reports can be automatically uploaded to S3 after processing. The pipeline uploads to two locations:
- Intermediate - All columns (raw + transformed) as parquet
- Processed - Transformed columns only as CSV, plus data dictionary
# S3 upload prefixes (in config.R)
BMF_S3_INTERMEDIATE_PREFIX <- "intermediate/bmf/"
BMF_S3_PROCESSED_PREFIX <- "processed/bmf/"
# Enable/disable upload in run_pipeline.R
ENABLE_S3_UPLOAD <- TRUE

| Setting | Type | Default | Description |
|---|---|---|---|
| `BMF_S3_INTERMEDIATE_PREFIX` | character | `"intermediate/bmf/"` | S3 prefix for intermediate outputs (all columns) |
| `BMF_S3_PROCESSED_PREFIX` | character | `"processed/bmf/"` | S3 prefix for processed outputs (transformed only) |
| `ENABLE_S3_UPLOAD` | logical | `TRUE` | Upload processed files to S3 after pipeline completes |
8.2.3.1 Upload Functions
# Generic upload function
upload_to_s3(local_file, s3_key, bucket = BMF_S3_BUCKET)
# Upload intermediate BMF (all columns) and quality report
upload_bmf_results(parquet_path, quality_report_path, year, month)
# Upload processed BMF (transformed columns only), quality report, and data dictionary
upload_processed_bmf(csv_path, quality_report_path, dictionary_path, year, month)
Uploaded Files:
| File | S3 Location | Description |
|---|---|---|
| `bmf_YYYY_MM_intermediate.parquet` | `s3://nccsdata/intermediate/bmf/YYYY_MM/` | All columns (raw + transformed) |
| `bmf_YYYY_MM_quality_report.json` | `s3://nccsdata/intermediate/bmf/YYYY_MM/` | Quality metrics |
| `bmf_YYYY_MM_processed.csv` | `s3://nccsdata/processed/bmf/YYYY_MM/` | Transformed columns only |
| `bmf_YYYY_MM_data_dictionary.csv` | `s3://nccsdata/processed/bmf/YYYY_MM/` | Column metadata |
| `bmf_YYYY_MM_quality_report.json` | `s3://nccsdata/processed/bmf/YYYY_MM/` | Quality metrics (copy) |
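As an illustration, a one-off upload with the generic helper might look like this; the S3 key mirrors the processed layout in the table above, and the local path is illustrative:
# Illustrative ad hoc upload; key follows the processed/bmf/YYYY_MM/ layout
upload_to_s3(
  local_file = "data/processed/bmf_2025_01_processed.csv",
  s3_key = "processed/bmf/2025_01/bmf_2025_01_processed.csv"
)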
AWS Permissions Required: s3:PutObject on the nccsdata bucket.
8.3 Lookup Table Configuration
File: R/config.R
All lookup tables are stored in a single Excel workbook (bmf_code_lookup.xlsx) and loaded at pipeline startup. One CSV file (state_abbreviations.csv) is used for address validation.
8.3.1 Excel Lookup Workbook
lookup_path <- "data/lookup/bmf_code_lookup.xlsx"
lookup_ls <- openxlsx::getSheetNames(lookup_path) |>
purrr::set_names() |>
purrr::map(~ {
df <- openxlsx::read.xlsx(xlsxFile = lookup_path, sheet = .x)
data.table::setDT(df)
  })
Sheets in bmf_code_lookup.xlsx:
| Sheet Name | Used By | Key Column |
|---|---|---|
| `status_code` | `transform_bmf_status_code()` | `status_code` |
| `asset_code` | `transform_bmf_asset_code()` | `asset_code` |
| `income_code` | `transform_bmf_income_code()` | `income_code` |
| `filing_requirement_code` | `transform_bmf_filing_requirement_code()` | `filing_requirement_code` |
| `pf_filing_requirement_code` | `transform_bmf_filing_requirement_code()` | `pf_filing_requirement_code` |
| `ntee_code` | `transform_bmf_ntee_code()` | `ntee_code` |
| `ntee_code_major_group` | `transform_bmf_ntee_code()` | `ntee_code_major_group` |
| `ntee_common_code` | `transform_bmf_ntee_code()` | `ntee_common_code` |
| `affiliation_code` | `transform_bmf_affiliation_code()` | `affiliation_code` |
| `deductibility_code` | `transform_bmf_deductibility_code()` | `deductibility_code` |
| `foundation_code` | `transform_bmf_foundation_code()` | `foundation_code` |
| `organization_code` | `transform_bmf_organization_code()` | `organization_code` |
| `activity_code` | `transform_bmf_activity_code()` | `activity_code` |
| `subsection_classification_code` | `transform_bmf_subsection_classification_codes()` | `subsection_code`, `classification_code` |
| `parent_organization` | `transform_organization_name()` | `group_exemption_number` |
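Once loaded, each element of lookup_ls is a data.table keyed by the column shown above. A hypothetical update join against the BMF table (the status_description column name is illustrative; the real names come from the workbook):
# Hypothetical join of a lookup sheet onto the BMF data.table
# (status_description is an illustrative column name)
bmf[lookup_ls$status_code, on = "status_code",
    status_description := i.status_description]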
8.3.2 CSV Lookup Files
# State abbreviations for address validation
state_abbreviations_path <- "data/lookup/state_abbreviations.csv"
9 Lookup Table Schemas
All lookup table schemas and their complete values are documented in the Lookup Tables Reference appendix.
9.1 state_abbreviations.csv
The only CSV lookup file, used for address validation:
| Column | Type | Description |
|---|---|---|
| `state_abbr` | character | Two-letter state/territory abbreviation |
| `state_name` | character | Full state/territory name |
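A sketch of the validation this file supports, assuming the BMF table carries a state column (the column name `state` is an assumption):
# Flag state values absent from the lookup (column name `state` is assumed)
states <- data.table::fread("data/lookup/state_abbreviations.csv")
unknown_states <- setdiff(unique(bmf$state), states$state_abbr)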
10 Directory Structure
nccs-data-bmf/
├── R/ # Source code
│ ├── run_pipeline.R # Orchestrator (11 phases)
│ ├── config.R # S3 configuration and lookups
│ ├── checkpoints.R # Checkpoint save/load utilities
│ ├── input_validation.R # Validation functions
│ ├── utils/
│ │ ├── logging.R # Logging functions
│ │ └── transform_utils.R # Shared utilities
│ ├── quality/
│ │ ├── pre_checks.R # Pre-validation
│ │ ├── post_checks.R # Post-validation
│ │ └── quality_report_template.qmd # HTML/PDF report template
│ └── [transform files] # 24 transform functions
│
├── data/
│ ├── raw/ # Downloaded BMF files
│ │ └── bmf_2025_01.csv # Local copy from S3
│ │
│ ├── lookup/ # Reference tables
│ │ ├── bmf_code_lookup.xlsx # Multi-sheet workbook (15 sheets)
│ │ └── state_abbreviations.csv
│ │
│ ├── dictionaries/ # Column metadata
│ │ └── column_descriptions.csv # Static column descriptions
│ │
│ ├── checkpoints/ # Intermediate saves (7 checkpoints)
│ │ ├── bmf_2025_01_01_raw.parquet
│ │ ├── bmf_2025_01_02_identity.parquet
│ │ ├── bmf_2025_01_03_classification.parquet
│ │ ├── bmf_2025_01_04_activity.parquet
│ │ ├── bmf_2025_01_05_temporal.parquet
│ │ ├── bmf_2025_01_06_financial.parquet
│ │ └── bmf_2025_01_07_filing.parquet
│ │
│ ├── intermediate/ # Intermediate outputs (all columns)
│ │ └── bmf_2025_01_intermediate.parquet
│ │
│ ├── processed/ # Final outputs (transformed only)
│ │ ├── bmf_2025_01_processed.csv
│ │ └── bmf_2025_01_data_dictionary.csv
│ │
│ └── quality/ # Quality reports
│ └── bmf_2025_01_quality_report.json
│
└── docs/ # This guidebook
├── _quarto.yml
├── index.qmd
└── [chapter files]
11 Dependencies
11.1 Required R Packages
# Core data manipulation
library(data.table) # High-performance data.table operations
# File I/O
library(arrow) # Parquet read/write
library(openxlsx) # Excel file reading
library(aws.s3) # S3 bucket operations
# String operations
library(stringr) # String manipulation
# Date handling
library(lubridate) # Date parsing
# Utilities
library(here) # Project-relative paths
library(purrr) # Functional programming
library(jsonlite) # Quality report serialization
# Optional: Excel reading alternative
library(readxl) # For organization name lookup
# Quality report rendering (optional)
library(quarto) # Quarto document rendering
library(ggplot2) # Visualization for quality reports
library(scales) # Number formatting for charts
11.2 Installation
# Core packages
install.packages(c(
"data.table",
"arrow",
"openxlsx",
"aws.s3",
"stringr",
"lubridate",
"here",
"purrr",
"jsonlite",
"readxl"
))
# For quality report rendering (optional)
install.packages(c("quarto", "ggplot2", "scales"))11.3 Version Requirements
| Package | Minimum Version | Notes |
|---|---|---|
| R | 4.1.0 | For the native pipe `\|>` |
| data.table | 1.14.0 | For := and .SD operations |
| arrow | 10.0.0 | For Parquet support |
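A quick sanity check against these minimums (a sketch; extend with other packages as needed):
# Stop early if the environment is older than the documented minimums
stopifnot(
  getRversion() >= "4.1.0",
  packageVersion("data.table") >= "1.14.0",
  packageVersion("arrow") >= "10.0.0"
)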
12 Environment Variables
No environment variables are currently required. All configuration is done in R files.
12.1 Future Considerations
For production deployment, consider:
# Example .Renviron file
BMF_YEAR=2025
BMF_STRICT_MODE=TRUE
BMF_CHECKPOINT_DIR=/data/bmf/checkpoints
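If these variables were adopted, the pipeline could read them at startup; a sketch of that pattern, not part of the current code:
# Hypothetical: read future environment variables with fallbacks
BMF_YEAR <- as.integer(Sys.getenv("BMF_YEAR", unset = NA))
STRICT_QUALITY_GATES <- as.logical(Sys.getenv("BMF_STRICT_MODE", unset = "TRUE"))
CHECKPOINT_DIR <- Sys.getenv("BMF_CHECKPOINT_DIR", unset = "data/checkpoints")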
13 Updating for New Years
When processing a new BMF year/month:
1. Run the pipeline with the new data:

# Pipeline auto-detects the most recent file from S3
source("R/run_pipeline.R")

# Or specify year/month explicitly
BMF_YEAR <- 2026
BMF_MONTH <- 1
source("R/run_pipeline.R")

2. Verify the S3 file exists:
source("R/config.R") list_available_bmf_files() # Should show "2026-01" if Lambda has ingested itCheck for schema changes:
- If the pipeline fails on column validation, compare the incoming columns to `BMF_REQUIRED_COLUMNS` in pre_checks.R (see the sketch below)
- Update `BMF_REQUIRED_COLUMNS` if the IRS added or removed columns
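A sketch of that comparison, assuming the raw data (bmf_raw) is already loaded in the session:
# Compare incoming columns against the expected set (assumes bmf_raw is loaded)
source("R/quality/pre_checks.R")
added   <- setdiff(names(bmf_raw), BMF_REQUIRED_COLUMNS)
removed <- setdiff(BMF_REQUIRED_COLUMNS, names(bmf_raw))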
4. Update lookup tables:

- Check for new codes in IRS documentation
- Add any new values to the lookup workbook or CSV (see the sketch below)
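A sketch of appending new codes to one sheet of the workbook without disturbing the other sheets; the sheet's column names and the new values are placeholders, match them to the actual workbook:
# Hypothetical: append placeholder rows to one sheet of the lookup workbook
library(openxlsx)
wb <- loadWorkbook("data/lookup/bmf_code_lookup.xlsx")
existing <- read.xlsx(wb, sheet = "ntee_code")
new_rows <- data.frame(ntee_code = "Z99", description = "<new description>")
writeData(wb, sheet = "ntee_code", x = new_rows,
          startRow = nrow(existing) + 2, colNames = FALSE)
saveWorkbook(wb, "data/lookup/bmf_code_lookup.xlsx", overwrite = TRUE)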