8 Configuration Reference

This chapter documents all configuration files, lookup tables, and dependencies.

8.1 Pipeline Configuration

File: R/run_pipeline.R (top section)

# Enable checkpointing (saves intermediate results to parquet)
ENABLE_CHECKPOINTS <- TRUE
CHECKPOINT_DIR <- "data/checkpoints"

# Enable strict quality gates (stops on validation failure)
STRICT_QUALITY_GATES <- TRUE

# Enable S3 upload of processed data and quality report
ENABLE_S3_UPLOAD <- TRUE

# BMF source configuration - set before sourcing to override defaults
# If not set, the pipeline downloads the most recent BMF file from S3
if (!exists("BMF_YEAR")) BMF_YEAR <- NULL    # e.g., 2025
if (!exists("BMF_MONTH")) BMF_MONTH <- NULL  # e.g., 1

# Processing year/month - set automatically from downloaded file
PROCESSING_YEAR <- NULL
PROCESSING_MONTH <- NULL

Setting | Type | Default | Description
--------|------|---------|------------
ENABLE_CHECKPOINTS | logical | TRUE | Save intermediate results after each phase
CHECKPOINT_DIR | character | "data/checkpoints" | Directory for checkpoint files
STRICT_QUALITY_GATES | logical | TRUE | Stop pipeline on validation failure
ENABLE_S3_UPLOAD | logical | TRUE | Upload processed data and quality report to S3
BMF_YEAR | integer | NULL | Override to download a specific year (otherwise most recent)
BMF_MONTH | integer | NULL | Override to download a specific month (otherwise most recent)

8.2 Data Source Configuration

File: R/config.R

8.2.1 S3 Bucket Configuration

BMF files are downloaded from the nccsdata S3 bucket. A Lambda function ingests the monthly BMF file from the IRS and writes it to raw/bmf/YYYY-MM-BMF.csv.

# S3 bucket configuration
BMF_S3_BUCKET <- "nccsdata"
BMF_S3_PREFIX <- "raw/bmf/"
BMF_S3_INTERMEDIATE_PREFIX <- "intermediate/bmf/"
BMF_S3_PROCESSED_PREFIX <- "processed/bmf/"
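
The prefixes combine with a four-digit year and a zero-padded month to form full object keys. A minimal sketch of that construction, assuming the YYYY-MM-BMF.csv naming pattern described above:

# Illustrative only: build the raw-file key for January 2025
s3_key <- sprintf("%s%04d-%02d-BMF.csv", BMF_S3_PREFIX, 2025, 1)
s3_key
# [1] "raw/bmf/2025-01-BMF.csv"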

8.2.2 S3 Download Functions

# Download BMF from S3 (defaults to most recent file)
bmf_raw <- download_bmf_from_s3(
  bucket = BMF_S3_BUCKET,
  prefix = BMF_S3_PREFIX,
  year = NULL,   # Optional: specific year
  month = NULL   # Optional: specific month
)

# List available BMF files in S3
list_available_bmf_files(bucket = BMF_S3_BUCKET, prefix = BMF_S3_PREFIX)
# [1] "2025-01" "2024-12" "2024-11" ...

AWS Credentials: Requires AWS credentials configured via environment variables or ~/.aws/credentials.
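
The aws.s3 package reads the standard AWS environment variables, so one option for an interactive session is to set them directly (the values below are placeholders, and the region is an assumption):

# Placeholder credentials; prefer ~/.Renviron or ~/.aws/credentials
# over hard-coding real values in scripts
Sys.setenv(
  AWS_ACCESS_KEY_ID     = "AKIA...",
  AWS_SECRET_ACCESS_KEY = "...",
  AWS_DEFAULT_REGION    = "us-east-1"
)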

8.2.3 S3 Upload Configuration

Processed BMF files and quality reports can be automatically uploaded to S3 after processing. The pipeline uploads to two locations:

  1. Intermediate - All columns (raw + transformed) as parquet
  2. Processed - Transformed columns only as CSV, plus data dictionary

# S3 upload prefixes (in config.R)
BMF_S3_INTERMEDIATE_PREFIX <- "intermediate/bmf/"
BMF_S3_PROCESSED_PREFIX <- "processed/bmf/"

# Enable/disable upload in run_pipeline.R
ENABLE_S3_UPLOAD <- TRUE

Setting | Type | Default | Description
--------|------|---------|------------
BMF_S3_INTERMEDIATE_PREFIX | character | "intermediate/bmf/" | S3 prefix for intermediate outputs (all columns)
BMF_S3_PROCESSED_PREFIX | character | "processed/bmf/" | S3 prefix for processed outputs (transformed only)
ENABLE_S3_UPLOAD | logical | TRUE | Upload processed files to S3 after pipeline completes

8.2.3.1 Upload Functions

# Generic upload function
upload_to_s3(local_file, s3_key, bucket = BMF_S3_BUCKET)

# Upload intermediate BMF (all columns) and quality report
upload_bmf_results(parquet_path, quality_report_path, year, month)

# Upload processed BMF (transformed columns only), quality report, and data dictionary
upload_processed_bmf(csv_path, quality_report_path, dictionary_path, year, month)

Uploaded Files:

File | S3 Location | Description
-----|-------------|------------
bmf_YYYY_MM_intermediate.parquet | s3://nccsdata/intermediate/bmf/YYYY_MM/ | All columns (raw + transformed)
bmf_YYYY_MM_quality_report.json | s3://nccsdata/intermediate/bmf/YYYY_MM/ | Quality metrics
bmf_YYYY_MM_processed.csv | s3://nccsdata/processed/bmf/YYYY_MM/ | Transformed columns only
bmf_YYYY_MM_data_dictionary.csv | s3://nccsdata/processed/bmf/YYYY_MM/ | Column metadata
bmf_YYYY_MM_quality_report.json | s3://nccsdata/processed/bmf/YYYY_MM/ | Quality metrics (copy)

AWS Permissions Required: s3:PutObject on the nccsdata bucket.
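
The upload helpers are defined in R/config.R. As a rough sketch, the generic function presumably wraps aws.s3::put_object() along these lines (illustrative, not the actual definition):

# Hypothetical shape of upload_to_s3(); the real code may differ
upload_to_s3 <- function(local_file, s3_key, bucket = BMF_S3_BUCKET) {
  stopifnot(file.exists(local_file))  # fail fast on a bad local path
  aws.s3::put_object(
    file   = local_file,  # local artifact to upload
    object = s3_key,      # destination key within the bucket
    bucket = bucket
  )
}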


8.3 Lookup Table Configuration

File: R/config.R

All lookup tables are stored in a single Excel workbook (bmf_code_lookup.xlsx) and loaded at pipeline startup. One CSV file (state_abbreviations.csv) is used for address validation.

8.3.1 Excel Lookup Workbook

# Read every sheet of the lookup workbook into a named list of data.tables
lookup_path <- "data/lookup/bmf_code_lookup.xlsx"
lookup_ls <- openxlsx::getSheetNames(lookup_path) |>
  purrr::set_names() |>
  purrr::map(~ {
    df <- openxlsx::read.xlsx(xlsxFile = lookup_path, sheet = .x)
    data.table::setDT(df)  # convert by reference; returns the data.table
  })

Sheets in bmf_code_lookup.xlsx:

Sheet Name | Used By | Key Column
-----------|---------|-----------
status_code | transform_bmf_status_code() | status_code
asset_code | transform_bmf_asset_code() | asset_code
income_code | transform_bmf_income_code() | income_code
filing_requirement_code | transform_bmf_filing_requirement_code() | filing_requirement_code
pf_filing_requirement_code | transform_bmf_filing_requirement_code() | pf_filing_requirement_code
ntee_code | transform_bmf_ntee_code() | ntee_code
ntee_code_major_group | transform_bmf_ntee_code() | ntee_code_major_group
ntee_common_code | transform_bmf_ntee_code() | ntee_common_code
affiliation_code | transform_bmf_affiliation_code() | affiliation_code
deductibility_code | transform_bmf_deductibility_code() | deductibility_code
foundation_code | transform_bmf_foundation_code() | foundation_code
organization_code | transform_bmf_organization_code() | organization_code
activity_code | transform_bmf_activity_code() | activity_code
subsection_classification_code | transform_bmf_subsection_classification_codes() | subsection_code, classification_code
parent_organization | transform_organization_name() | group_exemption_number
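
Once loaded, each sheet is available by name in lookup_ls and can be joined onto the BMF by its key column. A minimal sketch, assuming lookup_ls from the loading snippet above (the EINs and codes below are made up):

library(data.table)

# Toy BMF fragment; real data comes from download_bmf_from_s3()
bmf <- data.table(ein = c("010000000", "020000000"),
                  status_code = c("01", "02"))

# Left-join the decoded status values by the sheet's key column
bmf <- merge(bmf, lookup_ls[["status_code"]], by = "status_code", all.x = TRUE)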

8.3.2 CSV Lookup Files

# State abbreviations for address validation
state_abbreviations_path <- "data/lookup/state_abbreviations.csv"

9 Lookup Table Schemas

All lookup table schemas and their complete values are documented in the Lookup Tables Reference appendix.

9.1 state_abbreviations.csv

The only CSV lookup file, used for address validation:

Column | Type | Description
-------|------|------------
state_abbr | character | Two-letter state/territory abbreviation
state_name | character | Full state/territory name
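
A small sketch of how this lookup can support address validation (the helper below is hypothetical, not pipeline code):

# Hypothetical validity check against the abbreviation lookup
state_abbr_dt <- data.table::fread("data/lookup/state_abbreviations.csv")
is_valid_state <- function(x) x %in% state_abbr_dt$state_abbr
is_valid_state(c("NY", "ZZ"))
# [1]  TRUE FALSE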

10 Directory Structure

nccs-data-bmf/
├── R/                              # Source code
│   ├── run_pipeline.R             # Orchestrator (11 phases)
│   ├── config.R                    # S3 configuration and lookups
│   ├── checkpoints.R               # Checkpoint save/load utilities
│   ├── input_validation.R          # Validation functions
│   ├── utils/
│   │   ├── logging.R              # Logging functions
│   │   └── transform_utils.R      # Shared utilities
│   ├── quality/
│   │   ├── pre_checks.R           # Pre-validation
│   │   ├── post_checks.R          # Post-validation
│   │   └── quality_report_template.qmd  # HTML/PDF report template
│   └── [transform files]           # 24 transform functions
│
├── data/
│   ├── raw/                        # Downloaded BMF files
│   │   └── bmf_2025_01.csv        # Local copy from S3
│   │
│   ├── lookup/                     # Reference tables
│   │   ├── bmf_code_lookup.xlsx   # Multi-sheet workbook (15 sheets)
│   │   └── state_abbreviations.csv
│   │
│   ├── dictionaries/               # Column metadata
│   │   └── column_descriptions.csv # Static column descriptions
│   │
│   ├── checkpoints/                # Intermediate saves (7 checkpoints)
│   │   ├── bmf_2025_01_01_raw.parquet
│   │   ├── bmf_2025_01_02_identity.parquet
│   │   ├── bmf_2025_01_03_classification.parquet
│   │   ├── bmf_2025_01_04_activity.parquet
│   │   ├── bmf_2025_01_05_temporal.parquet
│   │   ├── bmf_2025_01_06_financial.parquet
│   │   └── bmf_2025_01_07_filing.parquet
│   │
│   ├── intermediate/               # Intermediate outputs (all columns)
│   │   └── bmf_2025_01_intermediate.parquet
│   │
│   ├── processed/                  # Final outputs (transformed only)
│   │   ├── bmf_2025_01_processed.csv
│   │   └── bmf_2025_01_data_dictionary.csv
│   │
│   └── quality/                    # Quality reports
│       └── bmf_2025_01_quality_report.json
│
└── docs/                           # This guidebook
    ├── _quarto.yml
    ├── index.qmd
    └── [chapter files]

11 Dependencies

11.1 Required R Packages

# Core data manipulation
library(data.table)     # High-performance data.table operations

# File I/O
library(arrow)          # Parquet read/write
library(openxlsx)       # Excel file reading
library(aws.s3)         # S3 bucket operations

# String operations
library(stringr)        # String manipulation

# Date handling
library(lubridate)      # Date parsing

# Utilities
library(here)           # Project-relative paths
library(purrr)          # Functional programming
library(jsonlite)       # Quality report serialization

# Optional: Excel reading alternative
library(readxl)         # For organization name lookup

# Quality report rendering (optional)
library(quarto)         # Quarto document rendering
library(ggplot2)        # Visualization for quality reports
library(scales)         # Number formatting for charts

11.2 Installation

# Core packages
install.packages(c(
  "data.table",
  "arrow",
  "openxlsx",
  "aws.s3",
  "stringr",
  "lubridate",
  "here",
  "purrr",
  "jsonlite",
  "readxl"
))

# For quality report rendering (optional)
install.packages(c("quarto", "ggplot2", "scales"))

11.3 Version Requirements

Package | Minimum Version | Notes
--------|-----------------|------
R | 4.1.0 | For the native pipe \|>
data.table | 1.14.0 | For := and .SD operations
arrow | 10.0.0 | For Parquet support
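
These minimums can be checked interactively:

# Each comparison should return TRUE
getRversion() >= "4.1.0"
packageVersion("data.table") >= "1.14.0"
packageVersion("arrow") >= "10.0.0"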

12 Environment Variables

No environment variables are currently required. All configuration is done in R files.

12.1 Future Considerations

For production deployment, consider:

# Example .Renviron file
BMF_YEAR=2025
BMF_STRICT_MODE=TRUE
BMF_CHECKPOINT_DIR=/data/bmf/checkpoints
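
If adopted, run_pipeline.R could read these variables with Sys.getenv() instead of hard-coded assignments. A sketch of that pattern (not current behavior):

# Fall back to today's defaults when a variable is unset
BMF_YEAR <- as.integer(Sys.getenv("BMF_YEAR", unset = NA))
STRICT_QUALITY_GATES <- toupper(Sys.getenv("BMF_STRICT_MODE", "TRUE")) == "TRUE"
CHECKPOINT_DIR <- Sys.getenv("BMF_CHECKPOINT_DIR", "data/checkpoints")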

13 Updating for New Years

When processing a new BMF year/month:

  1. Run pipeline with new data:

    # Pipeline auto-detects most recent file from S3
    source("R/run_pipeline.R")
    
    # Or specify year/month explicitly
    BMF_YEAR <- 2026
    BMF_MONTH <- 1
    source("R/run_pipeline.R")
  2. Verify S3 file exists:

    source("R/config.R")
    list_available_bmf_files()
    # Should show "2026-01" if Lambda has ingested it
  3. Check for schema changes:

    • If the pipeline fails on column validation, the IRS schema has likely changed
    • Compare the incoming columns to BMF_REQUIRED_COLUMNS in pre_checks.R (see the sketch after this list)
    • Update BMF_REQUIRED_COLUMNS if the IRS added or removed columns
  4. Update lookup tables:

    • Check IRS documentation for newly introduced codes
    • Add any new values to bmf_code_lookup.xlsx (and state_abbreviations.csv if affected)
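
For step 3, a minimal comparison sketch (assumes bmf_raw from the download step, and that BMF_REQUIRED_COLUMNS is a character vector defined in pre_checks.R):

source("R/quality/pre_checks.R")

# Expected but absent: columns the IRS removed
setdiff(BMF_REQUIRED_COLUMNS, names(bmf_raw))

# Present but unexpected: columns the IRS added
setdiff(names(bmf_raw), BMF_REQUIRED_COLUMNS)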