11 EC2 Batch Processing
Note
TODO: Full EC2 bootstrap, IAM setup, and batch entry-point walkthrough.
11.1 Bootstrap
scripts/setup_ec2.sh installs R, system dependencies, and the required CRAN packages: data.table, arrow, aws.s3, paws, openxlsx, rio, here, purrr, stringr, lubridate, jsonlite, quarto, duckdb, DBI, log4r, tidyverse, data.validator, assertr.
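The package-install step inside scripts/setup_ec2.sh can be sketched as follows. This is an assumed form, not the script's actual contents; it installs only the packages that are missing so re-runs are cheap.

```shell
# Hypothetical sketch of the CRAN install step in scripts/setup_ec2.sh.
# Installs each required package only if it is not already present.
Rscript -e '
pkgs <- c("data.table", "arrow", "aws.s3", "paws", "openxlsx", "rio",
          "here", "purrr", "stringr", "lubridate", "jsonlite", "quarto",
          "duckdb", "DBI", "log4r", "tidyverse", "data.validator", "assertr")
new <- setdiff(pkgs, rownames(installed.packages()))
if (length(new)) install.packages(new, repos = "https://cloud.r-project.org")
'
```

Skipping already-installed packages matters on EC2, where a full tidyverse compile from source can dominate bootstrap time.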
11.2 Entry point
scripts/run_pipeline.sh --years 2012-2024 --forms 990,990ez,990pf [--no-upload] [--strict]
11.3 Rehydrate intermediate state before running
When running on a fresh EC2 instance (or any environment where data/intermediate/unpacked/ does not already contain every prior processing year), pull state from S3 first. The harmonize step builds each output from the union of all unpacked sources on disk, so any missing prior year is silently omitted from the output:
aws s3 sync s3://nccsdata/intermediate/core/unpacked/ data/intermediate/unpacked/
aws s3 sync s3://nccsdata/raw/core/soi-extracts/ data/raw/soi_extracts/
See the Developer Guide for the full SOP.
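Because missing years fail silently, it is worth guarding the run with an explicit check. The sketch below assumes one subdirectory per processing year under the unpacked directory; the check_years helper and that layout are illustrative assumptions, not part of the pipeline.

```shell
# Guard sketch: confirm every expected year is present locally before
# running harmonize. Assumes data/intermediate/unpacked/<year>/ layout.
check_years() {
  local base="$1"; shift
  local missing=0
  for year in "$@"; do
    if [ ! -d "$base/$year" ]; then
      echo "missing: $base/$year" >&2   # report each absent year
      missing=1
    fi
  done
  return "$missing"                     # non-zero if anything is absent
}

# Demo against a temporary layout with two years present:
base=$(mktemp -d)
mkdir -p "$base/2012" "$base/2013"
check_years "$base" 2012 2013 && echo "all years present"
```

If the check fails, run the aws s3 sync commands above before invoking run_pipeline.sh.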
11.4 IAM
EC2 instance role needs:
s3:GetObject on gt990datalake-rawdata (downstream BMF inputs).
s3:GetObject, s3:PutObject, s3:ListBucket on nccsdata.
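The permissions above can be expressed as an IAM policy attached to the instance role. This is a minimal sketch, not the deployed policy; note that object actions (GetObject, PutObject) apply to object ARNs (bucket/*) while ListBucket applies to the bucket ARN itself.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::gt990datalake-rawdata/*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::nccsdata/*"
    },
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::nccsdata"
    }
  ]
}
```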