8  Developer Guide

Note

TODO: Local setup, common workflows, debugging tips.

8.1 Local quick-iterate

Rscript R/run_pipeline.R --years 2024 --forms 990ez --no-upload

990-EZ is the smallest extract, so it gives the fastest end-to-end loop.

8.2 Editing a crosswalk

  1. Edit data/crosswalks/soi_<form>_crosswalk_OVERRIDES.csv in place. Save.
  2. Re-run scripts/draft_<form>_crosswalk.R to regenerate BASELINE and report drift.
  3. FINAL is rewritten from OVERRIDES verbatim.

Never delete OVERRIDES without explicit user permission.
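The BASELINE / OVERRIDES / FINAL flow above can be walked through with toy files. This is a sketch only: the directory, CSV contents, and the plain `diff` standing in for the drift report are all illustrative, not the real `draft_<form>_crosswalk.R` logic.

```shell
# Toy walkthrough of the crosswalk flow for one form (990ez as the example).
dir=$(mktemp -d)    # stands in for data/crosswalks/
printf 'field,label\nA,Total revenue\n'          > "$dir/soi_990ez_crosswalk_BASELINE.csv"
printf 'field,label\nA,Total revenue (edited)\n' > "$dir/soi_990ez_crosswalk_OVERRIDES.csv"
# Drift report, simplified here to a plain diff of BASELINE vs OVERRIDES.
diff "$dir/soi_990ez_crosswalk_BASELINE.csv" "$dir/soi_990ez_crosswalk_OVERRIDES.csv" || true
# Step 3: FINAL is rewritten from OVERRIDES verbatim.
cp "$dir/soi_990ez_crosswalk_OVERRIDES.csv" "$dir/soi_990ez_crosswalk_FINAL.csv"
cmp -s "$dir/soi_990ez_crosswalk_OVERRIDES.csv" "$dir/soi_990ez_crosswalk_FINAL.csv" && echo "FINAL == OVERRIDES"
```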

8.3 Rolling out a new processing year (e.g. 2025+)

Critical: hydrate data/intermediate/unpacked/ from S3 before running the pipeline.

The harmonize step rebuilds each (tax_year, form) output from the union of every data/intermediate/unpacked/{processing_year}/{form}/ directory on disk. If a previous processing year’s unpacked file is missing locally, its rows silently drop from the output. This isn’t a bug — it’s the intentional rebuild-from-current-state design — but it means you must restore the durable state before adding new data.
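A toy illustration of that behavior (the real harmonize step is R; this just shows the union-of-whatever-is-on-disk effect, with paths mimicking the `{processing_year}/{form}/` layout):

```shell
# The output is the union of the unpacked dirs that exist locally -- nothing more.
root=$(mktemp -d)   # stands in for data/intermediate/unpacked/
mkdir -p "$root/2023/990" "$root/2024/990"
printf 'row1\nrow2\n' > "$root/2023/990/part.csv"
printf 'row3\n'       > "$root/2024/990/part.csv"
echo "fully hydrated:  $(cat "$root"/*/990/*.csv | grep -c '') rows"   # 3
rm -r "$root/2023"  # simulate a processing year missing from local disk
echo "partial hydrate: $(cat "$root"/*/990/*.csv | grep -c '') rows"   # 1, no error
```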

SOP for adding a new processing-year extract (e.g. when 2025 lands):

# 1. Pull every prior processing year's unpacked source from S3 to local disk.
#    aws s3 sync only copies files that aren't already local, so this is cheap
#    on subsequent runs.
aws s3 sync s3://nccsdata/intermediate/core/unpacked/ data/intermediate/unpacked/

# 2. (Optional but recommended) Also rehydrate raw zips, in case unpack needs
#    to be re-run.
aws s3 sync s3://nccsdata/raw/core/soi-extracts/ data/raw/soi_extracts/

# 3. Bump LATEST_YEAR in R/config.R if appropriate, and add the new
#    (year, filename_stem) row to SOI_FILENAME_STEMS for each form
#    (URL pattern drifts year-to-year; verify against the IRS SOI page).

# 4. Run the pipeline. Phase 1 will fetch the new year's extract; phase 3
#    will rebuild every (tax_year, form) output as the union of all extracts.
Rscript R/run_pipeline.R --years 2012-2025 --forms 990,990ez,990pf --strict

# 5. (Optional) Refresh the blank-forms archive to pick up the new year's
#    IRS-published PDFs (forms + schedules + instructions). Idempotent; skips
#    files already on disk. See docs/02-data-lineage.qmd "Forms archive".
Rscript scripts/download_irs_forms.R

Why this matters: rerunning the pipeline against a partially-hydrated data/intermediate/unpacked/ will silently drop rows for the missing processing years. The output files get rewritten with fewer rows; no error, no warning. Always rehydrate first.

Quick sanity check after rehydration: confirm that, for every (tax_year, form) pair, the row counts in the new harmonized output are at least as large as in the prior run. The phase-5 YoY check (against quality_*.prev.rds snapshots) will warn if any partition shrank by more than 20%.
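That check is easy to script. A minimal sketch with toy stand-in files; the real harmonized outputs' location and format aren't assumed here, just line-countable per-partition files:

```shell
# Toy per-partition row-count comparison between a prior run and the current one.
prev=$(mktemp -d); curr=$(mktemp -d)
printf 'r1\nr2\nr3\n'     > "$prev/990_2023.csv"
printf 'r1\nr2\nr3\nr4\n' > "$curr/990_2023.csv"
shrank=0
for f in "$prev"/*.csv; do
  name=$(basename "$f")
  p=$(grep -c '' "$f"); c=$(grep -c '' "$curr/$name")
  if [ "$c" -lt "$p" ]; then
    echo "SHRANK $name: $p -> $c"; shrank=1   # rows were silently dropped
  else
    echo "ok     $name: $p -> $c"
  fi
done
[ "$shrank" -eq 0 ] && echo "no partition lost rows"
```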

8.4 Re-rendering a quality report without re-harmonizing

Rscript R/07_render_report.R --rds data/logs/quality_990_2023.rds

8.5 Tuning render parallelism

Phase 7 renders each (form, tax_year) Quarto template in a forked worker. The worker count resolves in this order:

  1. NCCS_RENDER_WORKERS env var (if a positive integer), or
  2. The workers arg to run_render_reports() (if non-NULL), or
  3. Default: min(parallel::detectCores() - 1, 8).

The env var takes precedence over the function arg so cron / CI can tune without code changes:

NCCS_RENDER_WORKERS=8 bash scripts/run_pipeline.sh --strict

The cap at 8 reflects diminishing returns past that point — Quarto wall time is dominated by per-render subprocess startup (~5–7s each) rather than CPU, so additional workers mostly contend on fork/exec. On a small-RAM host, lower the cap to avoid OOM; each worker peaks at ~500MB–1GB.
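The resolution order can be sketched as a shell function. The real logic lives in R inside run_render_reports(); resolve_workers, its argument, and the simplified positive-integer check are all illustrative:

```shell
# Sketch of the phase-7 worker-count resolution order.
resolve_workers() {
  arg_workers="$1"                      # analogue of the workers function arg
  case "${NCCS_RENDER_WORKERS:-}" in
    *[!0-9]*|''|0) : ;;                 # unset, non-numeric, or zero: fall through
    *) echo "$NCCS_RENDER_WORKERS"; return ;;   # 1. env var wins
  esac
  if [ -n "$arg_workers" ]; then
    echo "$arg_workers"; return         # 2. explicit argument
  fi
  n=$(( $(nproc) - 1 ))                 # 3. default: min(cores - 1, 8)
  [ "$n" -gt 8 ] && n=8
  [ "$n" -lt 1 ] && n=1
  echo "$n"
}
NCCS_RENDER_WORKERS=6 resolve_workers 3   # prints 6: env var beats the arg
```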

Each render is isolated in tempdir() — the template is copied per-render so concurrent renders do not collide on Quarto’s .quarto/<template-stem>/ intermediate cache.

8.6 Gzip-on-upload for HTML reports

Phase 8 uploads the processed/ tier to S3 in two passes when ENABLE_GZIP_HTML_UPLOAD is TRUE (the default):

  1. Non-HTML files sync normally.
  2. *.html files are gzipped into a tempdir mirror and uploaded with --content-encoding gzip --content-type text/html --metadata-directive REPLACE. Browsers decompress transparently on load.
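The second pass can be mimicked locally. In this sketch the sample file and directories are illustrative; the files keep their .html names so S3 keys are unchanged, and the upload command (flags from the text, bucket path elided) is left as a comment:

```shell
# Toy version of the phase-8 gzip mirror step.
src=$(mktemp -d); mirror=$(mktemp -d)
printf '<html>%s</html>\n' 'sample report body' > "$src/quality_990_2023.html"
for f in "$src"/*.html; do
  gzip -c "$f" > "$mirror/$(basename "$f")"   # gzipped bytes, .html name kept
done
gzip -t "$mirror/quality_990_2023.html" && echo "mirror holds valid gzip"
# Real upload pass, roughly:
#   aws s3 cp "$mirror/" s3://<bucket>/processed/ --recursive \
#     --content-encoding gzip --content-type text/html --metadata-directive REPLACE
```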

Real-world compression on the embed-resources quality reports is ~3× (1.9 MB → 0.6 MB); the ratio is modest because the bulk of each HTML is base64-encoded fonts and CSS, which don't compress as well as raw text. Cumulative transfer reduction across the 109 reports in a full sweep is ~37 MB → ~12 MB on the wire.

Trade-off: aws s3 cp and the S3 web console return the compressed bytes — analysts pulling reports through those tools get a gzipped blob with .html extension and have to gunzip manually. Set ENABLE_GZIP_HTML_UPLOAD = FALSE in R/config.R for the single-pass uncompressed upload if that UX is more important for your use case than the transfer savings.

8.7 Running the test suite

Unit tests live in tests/, one file per pipeline phase (or set of helpers). Each file uses a lightweight check(label, expr) framework — no testthat dependency. To run everything from the repo root:

Rscript tests/run_all.R

Or from inside RStudio / an R session:

source("tests/run_all.R")

The harness prints per-file PASS / FAIL counts and a combined total. From a shell it exits with status 1 if any test failed (cron / CI-friendly); from an interactive session it does not call quit().
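The real harness is R, but the check(label, expr) shape is easy to picture. A rough shell analogue (function and labels illustrative), showing the count-and-report pattern:

```shell
# Minimal check()-style harness: run each assertion, tally PASS/FAIL.
pass=0; fail=0
check() {                       # check LABEL CMD...: run CMD, record result
  label="$1"; shift
  if "$@" >/dev/null 2>&1; then
    pass=$((pass + 1)); echo "PASS  $label"
  else
    fail=$((fail + 1)); echo "FAIL  $label"
  fi
}
check "arithmetic works"     test $((2 + 1)) -eq 3
check "deliberately failing" test 1 -eq 2
echo "passed=$pass failed=$fail"
# A CI wrapper would then exit non-zero when fail > 0.
```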

Individual test files also run standalone:

Rscript tests/test_harmonize.R

Coverage is Tier 1 + Tier 2: phases 2.5, 3, 4, 5, 6 (pure data-shape and validator logic). Phases 1, 2, 7, 8 are subprocess-bound (network / fread / Quarto / aws CLI) and not unit-tested — they are smoke-tested via the full pipeline run.

8.8 Skipping phases

Each phase has an ENABLE_* toggle in R/config.R; set it to FALSE (e.g. ENABLE_DOWNLOAD = FALSE) to skip that phase and reuse the intermediate files already on disk.