8 Developer Guide
TODO: Local setup, common workflows, debugging tips.
8.1 Local quick-iterate
```bash
Rscript R/run_pipeline.R --years 2024 --forms 990ez --no-upload
```

990-EZ is the smallest extract; fastest end-to-end loop.
8.2 Editing a crosswalk
- Edit `data/crosswalks/soi_<form>_crosswalk_OVERRIDES.csv` in place. Save.
- Re-run `scripts/draft_<form>_crosswalk.R` to regenerate BASELINE and report drift.
- FINAL is rewritten from OVERRIDES verbatim.
Never delete OVERRIDES without explicit user permission.
8.3 Rolling out a new processing year (e.g. 2025+)
**Critical step before running the pipeline:** hydrate `data/intermediate/unpacked/` from S3 first.
The harmonize step rebuilds each (tax_year, form) output from the union of every `data/intermediate/unpacked/{processing_year}/{form}/` directory on disk. If a previous processing year's unpacked file is missing locally, its rows silently drop from the output. This isn't a bug; it's the intentional rebuild-from-current-state design. But it means you must restore the durable state before adding new data.
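A toy shell demonstration of that failure mode, using stand-in paths rather than the real pipeline layout:

```bash
# Two processing years unpacked on disk; harmonize unions whatever it finds.
root=$(mktemp -d)
mkdir -p "$root/2023/990" "$root/2024/990"
printf 'row1\nrow2\n' > "$root/2023/990/part.csv"
printf 'row3\n'       > "$root/2024/990/part.csv"
n_full=$(cat "$root"/*/990/*.csv | wc -l | tr -d ' ')
rm -r "$root/2023"                       # simulate a missing local year
n_partial=$(cat "$root"/*/990/*.csv | wc -l | tr -d ' ')
echo "full=$n_full partial=$n_partial"   # full=3 partial=1 -- no error raised
```

The 2023 rows vanish from the union with exit status 0; nothing in the toy (nor in the real harmonize step) reports the loss.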
SOP for adding a new processing-year extract (e.g. when 2025 lands):
```bash
# 1. Pull every prior processing year's unpacked source from S3 to local disk.
#    aws s3 sync only copies files that aren't already local, so this is cheap
#    on subsequent runs.
aws s3 sync s3://nccsdata/intermediate/core/unpacked/ data/intermediate/unpacked/

# 2. (Optional but recommended) Also rehydrate raw zips, in case unpack needs
#    to be re-run.
aws s3 sync s3://nccsdata/raw/core/soi-extracts/ data/raw/soi_extracts/

# 3. Bump LATEST_YEAR in R/config.R if appropriate, and add the new
#    (year, filename_stem) row to SOI_FILENAME_STEMS for each form
#    (URL pattern drifts year-to-year; verify against the IRS SOI page).

# 4. Run the pipeline. Phase 1 will fetch the new year's extract; phase 3
#    will rebuild every (tax_year, form) output as the union of all extracts.
Rscript R/run_pipeline.R --years 2012-2025 --forms 990,990ez,990pf --strict

# 5. (Optional) Refresh the blank-forms archive to pick up the new year's
#    IRS-published PDFs (forms + schedules + instructions). Idempotent; skips
#    files already on disk. See docs/02-data-lineage.qmd "Forms archive".
Rscript scripts/download_irs_forms.R
```

Why this matters: rerunning the pipeline against a partially hydrated `data/intermediate/unpacked/` will silently drop rows for the missing processing years. The output files get rewritten with fewer rows; no error, no warning. Always rehydrate first.
Quick sanity check after rehydration: confirm the row counts in the new harmonized output are at least as large as the prior run's for every (tax_year, form) pair. The phase-5 YoY check (against the `quality_*.prev.rds` snapshots) will warn if any partition shrank by more than 20%.
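As a back-of-the-envelope version of that 20% rule (the real check reads its counts from the `quality_*.prev.rds` snapshots; here they are plain numbers):

```bash
# Succeeds (exit 0) when NEW is more than 20% below PREV.
shrank_too_much() {   # usage: shrank_too_much PREV_ROWS NEW_ROWS
  [ $(( $2 * 100 )) -lt $(( $1 * 80 )) ]
}
shrank_too_much 1000 790 && echo "WARN: partition shrank by more than 20%"
shrank_too_much 1000 810 || echo "ok: within tolerance"
```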
8.4 Re-rendering a quality report without re-harmonizing
```bash
Rscript R/07_render_report.R --rds data/logs/quality_990_2023.rds
```

8.5 Tuning render parallelism
Phase 7 renders each (form, tax_year) Quarto template in a forked worker. The worker count resolves in this order:
- The `NCCS_RENDER_WORKERS` env var (if a positive integer), or
- The `workers` arg to `run_render_reports()` (if non-NULL), or
- Default: `min(parallel::detectCores() - 1, 8)`.
The env var takes precedence over the function arg so cron / CI can tune without code changes:
```bash
NCCS_RENDER_WORKERS=8 bash scripts/run_pipeline.sh --strict
```

The cap at 8 reflects diminishing returns past that point: Quarto wall time is dominated by per-render subprocess startup (~5–7 s each) rather than CPU, so additional workers mostly contend on fork/exec. On a small-RAM host, lower the cap to avoid OOM; each worker peaks at ~500 MB–1 GB.
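The resolution order reads as a small function; a sketch in shell (the real logic lives in the R render phase):

```bash
# Worker count: env var, then function arg, then min(cores - 1, 8).
resolve_workers() {   # usage: resolve_workers [ARG_WORKERS]
  local arg="$1" cores
  cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 4)
  if [[ "${NCCS_RENDER_WORKERS:-}" =~ ^[1-9][0-9]*$ ]]; then
    echo "$NCCS_RENDER_WORKERS"               # 1. env var wins
  elif [ -n "$arg" ]; then
    echo "$arg"                               # 2. then the workers arg
  else
    echo $(( cores - 1 < 8 ? cores - 1 : 8 )) # 3. capped default
  fi
}
export NCCS_RENDER_WORKERS=6
resolve_workers 2    # prints 6: env var beats the arg
unset NCCS_RENDER_WORKERS
resolve_workers 2    # prints 2
```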
Each render is isolated in `tempdir()`: the template is copied per-render so concurrent renders do not collide on Quarto's `.quarto/<template-stem>/` intermediate cache.
8.6 Gzip-on-upload for HTML reports
Phase 8 uploads the `processed/` tier to S3 in two passes when `ENABLE_GZIP_HTML_UPLOAD` is `TRUE` (the default):

- Non-HTML files sync normally.
- `*.html` files are gzipped into a tempdir mirror and uploaded with `--content-encoding gzip --content-type text/html --metadata-directive REPLACE`. Browsers decompress transparently on load.
Real-world compression on the embed-resources quality reports is ~3× (1.9 MB → 0.6 MB) because the bulk of each HTML is base64-encoded fonts and CSS that don’t compress as well as raw text. Cumulative transfer reduction across the 109 reports in a full sweep is ~37 MB → ~12 MB on the wire.
Trade-off: `aws s3 cp` and the S3 web console return the compressed bytes, so analysts pulling reports through those tools get a gzipped blob with a `.html` extension and have to gunzip manually. Set `ENABLE_GZIP_HTML_UPLOAD = FALSE` in `R/config.R` for the single-pass uncompressed upload if that UX matters more than the transfer savings for your use case.
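A minimal sketch of the gzip-mirror pass with a stand-in report file (the `aws` invocations are left as comments; the flags are the ones named above, but the bucket paths are placeholders):

```bash
SRC=$(mktemp -d); MIRROR=$(mktemp -d)
printf '<html><body>report</body></html>' > "$SRC/quality_990_2023.html"
# Pass 1 (non-HTML):  aws s3 sync "$SRC" s3://<bucket>/processed/ --exclude "*.html"
# Pass 2: gzip each HTML into a mirror tree, keeping the .html name.
find "$SRC" -name '*.html' | while read -r f; do
  rel=${f#"$SRC"/}
  mkdir -p "$MIRROR/$(dirname "$rel")"
  gzip -9 -c "$f" > "$MIRROR/$rel"
done
# Then upload the mirror with headers that make browsers decompress on load:
#   aws s3 cp "$MIRROR" s3://<bucket>/processed/ --recursive \
#     --content-encoding gzip --content-type text/html --metadata-directive REPLACE
gzip -dc "$MIRROR/quality_990_2023.html"   # prints the original markup back
```

The manual-gunzip trade-off above is visible here: fetching the mirrored file without a decompressing client hands you the gzip bytes under a `.html` name.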
8.7 Running the test suite
Unit tests live in `tests/`, one file per pipeline phase (or set of helpers). Each file uses a lightweight `check(label, expr)` framework with no testthat dependency. To run everything from the repo root:
```bash
Rscript tests/run_all.R
```

Or from inside RStudio / an R session:
```r
source("tests/run_all.R")
```

The harness prints per-file PASS / FAIL counts and a combined total. From a shell it exits with status 1 if any test failed (cron / CI-friendly); from an interactive session it does not call `quit()`.
Individual test files also run standalone:
```bash
Rscript tests/test_harmonize.R
```

Coverage is Tier 1 + Tier 2: phases 2.5, 3, 4, 5, 6 (pure data-shape and validator logic). Phases 1, 2, 7, 8 are subprocess-bound (network / fread / Quarto / aws CLI) and not unit-tested; they are smoke-tested via the full pipeline run.
8.8 Skipping phases
Set `ENABLE_DOWNLOAD = FALSE` (etc.) to re-use existing intermediate files.