11  EC2 Batch Processing

Note

TODO: Full EC2 bootstrap, IAM setup, and batch entry-point walkthrough.

11.1 Bootstrap

scripts/setup_ec2.sh installs R, system dependencies, and the required CRAN packages: data.table, arrow, aws.s3, paws, openxlsx, rio, here, purrr, stringr, lubridate, jsonlite, quarto, duckdb, DBI, log4r, tidyverse, data.validator, assertr.
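The bootstrap script itself is not reproduced here, but its core step can be sketched as a single install.packages() call built from the list above. This is a sketch under assumptions — the repo URL and the exact shape of setup_ec2.sh are illustrative, not the actual script:

```shell
#!/usr/bin/env bash
# Sketch only: build one install.packages() call from the package list above.
set -euo pipefail

PKGS=(data.table arrow aws.s3 paws openxlsx rio here purrr stringr
      lubridate jsonlite quarto duckdb DBI log4r tidyverse
      data.validator assertr)

# Quote each package name and join with commas (trailing comma stripped below).
quoted=$(printf "'%s'," "${PKGS[@]}")
cmd="install.packages(c(${quoted%,}), repos='https://cloud.r-project.org')"

# The real script would run: Rscript -e "$cmd"
echo "$cmd"
```

Installing everything in one call lets R resolve the dependency graph once instead of per-package.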

11.2 Entry point

scripts/run_pipeline.sh --years 2012-2024 --forms 990,990ez,990pf [--no-upload] [--strict]
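The flag handling inside run_pipeline.sh might look roughly like the following sketch. Variable and function names here are illustrative assumptions, not the script's actual internals:

```shell
# Hypothetical flag parsing for run_pipeline.sh (names are illustrative).
YEARS="" FORMS="" UPLOAD=1 STRICT=0

parse_args() {
  while [ $# -gt 0 ]; do
    case "$1" in
      --years)     YEARS="$2"; shift 2 ;;   # e.g. 2012-2024
      --forms)     FORMS="$2"; shift 2 ;;   # e.g. 990,990ez,990pf
      --no-upload) UPLOAD=0;   shift   ;;   # skip the S3 upload step
      --strict)    STRICT=1;   shift   ;;   # fail hard on validation errors
      *) echo "unknown flag: $1" >&2; return 1 ;;
    esac
  done
}

parse_args --years 2012-2024 --forms 990,990ez,990pf --no-upload
```

Note that --no-upload and --strict are boolean switches, while --years and --forms each consume a following value.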

11.3 Rehydrate intermediate state before running

When running on a fresh EC2 instance (or any environment where data/intermediate/unpacked/ does not already contain every prior processing year), pull state from S3 first. The harmonize step builds each output from the union of all unpacked sources on disk, so any missing prior years are dropped silently:

aws s3 sync s3://nccsdata/intermediate/core/unpacked/ data/intermediate/unpacked/
aws s3 sync s3://nccsdata/raw/core/soi-extracts/      data/raw/soi_extracts/
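Because missing years fail silently, a pre-flight check before launching the pipeline is a cheap safeguard. The sketch below assumes one subdirectory per processing year under the unpacked directory; the helper name and year range are illustrative assumptions:

```shell
# Hypothetical pre-flight check: report processing years missing from
# the local unpacked state, e.g. check_years data/intermediate/unpacked 2012 2024
check_years() {
  dir=$1
  missing=""
  for year in $(seq "$2" "$3"); do
    # Assumes one subdirectory per processing year.
    [ -d "$dir/$year" ] || missing="$missing $year"
  done
  # Unquoted echo normalizes the accumulated whitespace.
  echo $missing
}
```

If the check prints anything, rerun the aws s3 sync commands above before starting the pipeline.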

See the Developer Guide for the full SOP.

11.4 IAM

The EC2 instance role needs:

  • s3:GetObject on gt990datalake-rawdata (downstream BMF inputs).
  • s3:GetObject, s3:PutObject, s3:ListBucket on nccsdata.
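Expressed as an instance-profile policy document, the permissions above might look like the following. The ARNs are inferred from the bucket names given; the Sids are illustrative:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadRawBMF",
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::gt990datalake-rawdata/*"
    },
    {
      "Sid": "ReadWriteNccsdata",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::nccsdata/*"
    },
    {
      "Sid": "ListNccsdata",
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::nccsdata"
    }
  ]
}
```

Note that s3:ListBucket attaches to the bucket ARN itself, while the object-level actions attach to the bucket's /* object resource.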