11 EC2 Batch Processing
This chapter describes how to run the legacy BMF harmonization pipeline across all vintages in s3://nccsdata/legacy/bmf/ on an EC2 instance.
The local pipeline is fine for processing a single vintage, but running the full historical archive (1989–2016, currently 70+ vintages) on a laptop is impractical: each vintage takes several minutes and consumes 4–8 GB of RAM during transforms.
Two scripts in scripts/ automate the workflow:
| Script | Purpose |
|---|---|
| `scripts/setup_ec2.sh` | One-shot bootstrap of a fresh Ubuntu 22.04 EC2 box |
| `scripts/run_all_legacy.sh` | Iterate over every legacy vintage in S3 serially, one `Rscript` subprocess each |
11.1 Recommended EC2 configuration
| Setting | Recommended |
|---|---|
| AMI | Ubuntu 22.04 LTS (or Amazon Linux 2023 with adapted package commands) |
| Instance type | m6i.xlarge (16 GB RAM) minimum; m6i.2xlarge (32 GB) for headroom |
| EBS volume | 100 GB gp3 minimum (raw CSVs + 7 checkpoints + outputs per vintage ≈ 1–2 GB) |
| IAM role | Read/write on arn:aws:s3:::nccsdata and arn:aws:s3:::nccsdata/* |
| Region | Same as the nccsdata bucket (avoids cross-region transfer costs) |
Attaching an IAM role to the instance is the cleanest way to grant S3 access — both aws.s3 (R) and the AWS CLI pick it up automatically via the instance metadata service, with no keys to manage.
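A quick way to confirm the role is attached and being picked up (these are standard IMDSv2 and AWS CLI calls, not part of this repo's scripts):

```bash
# Ask the instance metadata service which role is attached (IMDSv2 token flow)
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  "http://169.254.169.254/latest/meta-data/iam/security-credentials/"
# Should print the role name; then confirm the CLI resolves its credentials:
aws sts get-caller-identity
```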
11.2 Step 1 — Spin up the instance
Launch the instance with the configuration above. SSH in:
```bash
ssh -i your-key.pem ubuntu@<ec2-public-dns>
```
11.3 Step 2 — Clone the repository
```bash
git clone https://github.com/UrbanInstitute/nccs-data-bmf.git
cd nccs-data-bmf
```
11.4 Step 3 — Bootstrap the environment
```bash
bash scripts/setup_ec2.sh
```
This installs:
- System libraries needed by R packages (curl/ssl/xml2/font stack, cmake)
- R and R development headers
- AWS CLI v2 (if not already present)
- Quarto CLI (for quality-report HTML rendering; default v1.6.40 — override with `QUARTO_VERSION=1.7.0 bash scripts/setup_ec2.sh`)
- The 10 R packages required by the pipeline
The script ends by verifying that all R packages load, that AWS credentials are detected, and that the bucket prefix is readable. If the AWS check fails, configure credentials (see next step).
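To repeat those checks by hand, something like the following works (the two packages named are just examples from the pipeline's dependency list):

```bash
# Spot-check that key R packages load and that the bucket prefix is readable
Rscript -e 'invisible(lapply(c("data.table", "aws.s3"), library, character.only = TRUE))' \
  && echo "R packages OK"
aws s3 ls s3://nccsdata/legacy/bmf/ | head -n 3 && echo "S3 read OK"
```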
11.5 Step 4 — Configure AWS credentials (only if no IAM role)
If you attached an IAM role in step 1, skip this. Otherwise pick one:
```bash
# Option A — persistent, interactive
aws configure

# Option B — environment variables (current shell only)
export AWS_ACCESS_KEY_ID="..."
export AWS_SECRET_ACCESS_KEY="..."
export AWS_DEFAULT_REGION="us-east-1"
```
Verify:
```bash
aws sts get-caller-identity
aws s3 ls s3://nccsdata/legacy/bmf/ | head
```
11.6 Step 5 — Run the batch
The batch driver runs every available legacy vintage, one fresh Rscript subprocess per vintage. Running each vintage in its own process — rather than looping inside a single R session — guarantees that memory and file connections are released between runs. (Stacking runs in one session has previously crashed machines and leaked connections.)
```bash
# Inside a tmux session so the batch survives SSH disconnects:
tmux new -s legacy-batch
bash scripts/run_all_legacy.sh 2>&1 | tee logs/legacy/_master.log
# detach: Ctrl-b d; reattach later: tmux attach -t legacy-batch
```
Per-vintage stdout and stderr go to `logs/legacy/bmf_legacy_YYYY_MM.log`. A roll-up `logs/legacy/run_summary.tsv` records vintage, status, runtime in seconds, and start time.
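To keep an eye on a running batch, you can tail the master log and filter the roll-up for failures (the exact status strings written to `run_summary.tsv` are an assumption; adjust to whatever the driver emits):

```bash
tail -f logs/legacy/_master.log   # live view of the current vintage
# Skip the header row; status column assumed to be the second field
awk -F'\t' 'NR > 1 && $2 != "OK"' logs/legacy/run_summary.tsv
```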
11.6.1 Options
```bash
bash scripts/run_all_legacy.sh                         # serial, oldest first
bash scripts/run_all_legacy.sh --newest-first          # serial, newest first
JOBS=8 bash scripts/run_all_legacy.sh                  # 8 vintages in parallel
SKIP_EXISTING=1 JOBS=8 bash scripts/run_all_legacy.sh  # parallel + resume
SKIP_VINTAGES="2017-09,2017-12,2018-12" \
  bash scripts/run_all_legacy.sh                       # exclude specific months
```
SKIP_VINTAGES defaults to "2017-09,2017-12,2018-12" — three upstream NCCS files with sequence-ID EINs and non-standard TAXPER encodings that are structurally incompatible with the pipeline. See chapter 9 for details.
SKIP_EXISTING=1 is the way to resume after a failure: re-running the full batch would redo work, but with this flag the driver checks S3 for s3://nccsdata/processed/bmf-legacy/YYYY_MM/bmf_legacy_YYYY_MM_processed.csv per vintage and skips those already uploaded.
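In other words, the check is a per-vintage existence test against S3, roughly equivalent to this sketch (illustrative only; the real logic lives in `scripts/run_all_legacy.sh`):

```bash
vintage="1995_07"   # hypothetical example vintage
key="s3://nccsdata/processed/bmf-legacy/${vintage}/bmf_legacy_${vintage}_processed.csv"
# aws s3 ls exits nonzero when no matching object exists
if aws s3 ls "$key" > /dev/null 2>&1; then
  echo "skip $vintage: processed CSV already in S3"
fi
```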
11.6.1.1 Choosing JOBS
The pipeline is largely single-threaded R (`data.table`'s internal threading uses ~4 cores during `fread` and keyed joins, but the rest of each vintage runs single-threaded). Each subprocess peaks at ~6–8 GB of RAM. A single vintage takes ~5 minutes, so a serial pass over the ~70-vintage archive takes ~6 hours.
Run multiple vintages concurrently with `JOBS=N`. Size N to keep total peak RAM under ~70% of host memory:
| Host | Suggested JOBS | Approx. full-archive runtime |
|---|---|---|
| 16 GB RAM laptop | 1 (serial) | ~6 hours |
| m6i.2xlarge (32 GB) | 3 | ~2 hours |
| m6i.4xlarge (64 GB) | 6 | ~1 hour |
| c5.18xlarge (144 GB) | 8–12 | ~30–45 min |
Above ~12 concurrent jobs you’ll likely hit S3 throughput as the bottleneck rather than CPU or RAM.
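A back-of-envelope way to derive JOBS from the 70% rule, assuming ~7 GB peak per subprocess (the midpoint of the 6–8 GB range above):

```bash
# Suggested JOBS = 70% of host RAM divided by ~7 GB peak per vintage
total_gb=$(awk '/MemTotal/ {printf "%d", $2 / 1048576}' /proc/meminfo)
jobs=$(( total_gb * 70 / 100 / 7 ))
echo "JOBS=$(( jobs > 0 ? jobs : 1 )) for a ${total_gb} GB host"
```

Cap the result around 12 regardless of RAM, per the S3-throughput note above.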
11.7 Step 6 — Verify outputs
Each successful vintage produces, locally and in S3:
| Local path | S3 path |
|---|---|
| `data/intermediate/bmf_legacy_YYYY_MM_intermediate.parquet` | `s3://nccsdata/intermediate/bmf-legacy/YYYY_MM/…parquet` |
| `data/processed/bmf_legacy_YYYY_MM_processed.csv` | `s3://nccsdata/processed/bmf-legacy/YYYY_MM/…csv` |
| `data/processed/bmf_legacy_YYYY_MM_data_dictionary.csv` | `s3://nccsdata/processed/bmf-legacy/YYYY_MM/…dictionary.csv` |
| `data/quality/bmf_legacy_YYYY_MM_quality_report.json` | both intermediate and processed prefixes |
| `docs/quality-reports/bmf_legacy_YYYY_MM_quality_report.html` | (not uploaded — local only) |
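A quick completeness check against the processed prefix (counting uploaded CSVs is a rough proxy; it assumes the filename pattern from the table above):

```bash
# Count processed CSVs in S3; should match the number of vintages run
aws s3 ls --recursive s3://nccsdata/processed/bmf-legacy/ \
  | grep -c '_processed\.csv$'
```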
After the batch completes, summarize results:
```bash
column -t -s $'\t' logs/legacy/run_summary.tsv | sort -k2
```
11.8 Why per-vintage subprocesses
Earlier runs of the legacy pipeline that loaded multiple vintages into the same R session caused two recurring problems:
- Memory exhaustion — the ~1.4M-row legacy `data.table`s and their intermediate copies were not released between runs, so each subsequent vintage started with less free RAM. With 70+ vintages this eventually OOMed the host.
- Leaked file connections — `aws.s3::put_object(multipart = TRUE)` does not always close its read connection on success, leaving orphaned `file()` handles that R's GC eventually warns about (`closing unused connection N`).
`scripts/run_all_legacy.sh` sidesteps both by launching a fresh `Rscript --vanilla` subprocess per vintage. When the subprocess exits, the OS reclaims everything.
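The pattern is simple to reproduce. A minimal sketch (the entrypoint name `R/run_vintage.R` and its argument are assumptions, not the repo's actual interface):

```bash
# One fresh R process per vintage; the OS reclaims memory and closes
# any leaked connections when each subprocess exits
for v in $(aws s3 ls s3://nccsdata/legacy/bmf/ | awk '/PRE/ {print $2}' | tr -d /); do
  Rscript --vanilla R/run_vintage.R "$v" > "logs/legacy/bmf_legacy_${v}.log" 2>&1
done
```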
11.9 Cost notes
A single legacy vintage takes ~3–6 minutes on m6i.xlarge. The full historical archive (~70 vintages) finishes in roughly 4–7 hours, which on an m6i.xlarge (≈ $0.20/hour on-demand in us-east-1 as of this writing) is a few dollars of compute. S3 transfer adds nothing if the bucket is in the same region as the instance, since intra-region uploads are free. Stop or terminate the instance after the batch completes.