7 Configuration
All pipeline configuration lives in two files. R/config.R holds runtime toggles, local paths, the S3 layout, and IRS URL templates. R/data.R holds form-related constants (crosswalk file paths, the SOI form list, the subsection code whitelist). Anything tunable across runs without code edits lives in one of these two places — or, for cron/CI use cases, in an environment variable.
7.1 Phase toggles (R/config.R::CONFIG)
Each pipeline phase has an ENABLE_* flag. Phase 8 (upload) has additional per-tier sub-toggles so you can sync processed/ to S3 without also uploading the raw zips on every run.
| Flag | Default | Effect when FALSE |
|---|---|---|
| ENABLE_DOWNLOAD | TRUE | Skip phase 1; assume zips are already at data/raw/soi_extracts/ |
| ENABLE_UNPACK | TRUE | Skip phase 2; assume unpacked files are at data/intermediate/unpacked/ |
| ENABLE_HARMONIZE | TRUE | Skip phase 3; no harmonized CSVs written |
| ENABLE_COMBINED | TRUE | Skip phase 4; no 990combined series produced |
| ENABLE_MERGE | TRUE | (used only by R/run_build_panel.R) Skip the legacy/SOI-current column-merge phase; no harmonized_merged/ written |
| ENABLE_QUALITY | TRUE | Skip phase 5; no quality_*.rds files |
| ENABLE_DICTIONARY | TRUE | Skip phase 6; no dictionary CSVs |
| ENABLE_RENDER_REPORT | TRUE | Skip phase 7; no HTML quality reports |
| ENABLE_S3_UPLOAD | FALSE | Skip all of phase 8’s S3 syncs; promotion to processed/ still happens |
| ENABLE_UPLOAD_RAW | FALSE | Don’t sync data/raw/soi_extracts/ to S3 |
| ENABLE_UPLOAD_FORMS | TRUE | Don’t sync data/raw/forms/ to S3 |
| ENABLE_UPLOAD_INTERMEDIATE | FALSE | Don’t sync data/intermediate/{unpacked,harmonized}/ to S3 |
| ENABLE_UPLOAD_PROCESSED | TRUE | Don’t sync data/processed/ to S3 |
| ENABLE_UPLOAD_LOGS | TRUE | Don’t sync data/logs/ (incl. quality RDS) to s3://.../logs/core/{run_timestamp}/ |
| ENABLE_GZIP_HTML_UPLOAD | TRUE | Skip the two-pass HTML-gzip on phase 8; HTMLs upload uncompressed |
| STRICT_QUALITY_GATES | TRUE | Quality hard-fails downgrade to warnings; the run continues |
| ENABLE_CHECKPOINTS | TRUE | Phase orchestrator does not honor per-phase resume markers |
CLI overrides on R/run_pipeline.R map to a subset of these (--strict, --no-strict, --upload, --no-upload, --no-{download,unpack,harmonize,combined,quality,dictionary,render}). Sub-toggles for the upload tiers don’t have CLI flags — edit R/config.R or override via env if you need them.
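As an illustration of how CLI switches could map onto the flags above, here is a minimal sketch. It is not the code in R/run_pipeline.R: the CLI_MAP table, the apply_cli_overrides() helper, and the specific flag each switch targets (for example, --upload setting ENABLE_S3_UPLOAD) are assumptions made for the example, and only a few of the documented switches are shown.

```r
# Hypothetical sketch: CLI_MAP and apply_cli_overrides() are illustrative
# names, and the switch-to-flag mapping is an assumption.
CONFIG <- list(
  ENABLE_DOWNLOAD      = TRUE,
  ENABLE_RENDER_REPORT = TRUE,
  ENABLE_S3_UPLOAD     = FALSE,
  STRICT_QUALITY_GATES = TRUE
)

# Each CLI switch names the flag it touches and the value it sets.
CLI_MAP <- list(
  "--strict"      = list(flag = "STRICT_QUALITY_GATES", value = TRUE),
  "--no-strict"   = list(flag = "STRICT_QUALITY_GATES", value = FALSE),
  "--upload"      = list(flag = "ENABLE_S3_UPLOAD",     value = TRUE),
  "--no-upload"   = list(flag = "ENABLE_S3_UPLOAD",     value = FALSE),
  "--no-download" = list(flag = "ENABLE_DOWNLOAD",      value = FALSE),
  "--no-render"   = list(flag = "ENABLE_RENDER_REPORT", value = FALSE)
)

# Return a copy of `config` with every recognized switch applied in order;
# an unrecognized switch is an error rather than being silently ignored.
apply_cli_overrides <- function(config, args) {
  for (arg in args) {
    override <- CLI_MAP[[arg]]
    if (is.null(override)) stop("Unknown option: ", arg)
    config[[override$flag]] <- override$value
  }
  config
}

cfg <- apply_cli_overrides(CONFIG, c("--no-download", "--upload"))
```

In a real script the args vector would typically come from commandArgs(trailingOnly = TRUE). Keeping the mapping in one table makes it easy to see which flags have no CLI switch, such as the upload-tier sub-toggles.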
7.2 Year window
| Constant | Default | Note |
|---|---|---|
| EARLIEST_YEAR | 2012L | First IRS SOI extract year |
| LATEST_YEAR | bump each release | Add new processing-year filename stems to SOI_FILENAME_STEMS at the same time |
7.3 Forms
| Constant | Value |
|---|---|
| FORMS | c("990", "990ez", "990pf") |
The fourth output series, 990combined, is derived in phase 4 from the 990 + 990-EZ harmonized outputs on their 53 shared columns. It is not listed in FORMS because it’s not separately downloaded / unpacked / harmonized.
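The shared-column derivation can be sketched as follows. This is a toy illustration rather than the phase 4 implementation: the data frames, their column names, the combine_shared() helper, and the source_form provenance column are all invented for the example, and the real outputs share 53 columns rather than three.

```r
# Toy stand-ins for the harmonized 990 and 990-EZ outputs (invented columns).
h990 <- data.frame(
  ein = c("001", "002"), tax_year = c(2020L, 2020L),
  total_revenue = c(100, 250), officer_comp = c(10, 20),
  stringsAsFactors = FALSE
)
h990ez <- data.frame(
  ein = "003", tax_year = 2020L,
  total_revenue = 40, ez_only_field = 1,
  stringsAsFactors = FALSE
)

# Keep only the columns present in both inputs, stack the rows, and record
# which form each row came from (the source_form column is an assumption).
combine_shared <- function(a, b, a_label, b_label) {
  shared <- intersect(names(a), names(b))
  out <- rbind(a[, shared, drop = FALSE], b[, shared, drop = FALSE])
  out$source_form <- c(rep(a_label, nrow(a)), rep(b_label, nrow(b)))
  rownames(out) <- NULL
  out
}

combined <- combine_shared(h990, h990ez, "990", "990ez")
```

Here combined has three rows and the columns ein, tax_year, total_revenue, and source_form; the form-specific columns officer_comp and ez_only_field are dropped.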
7.4 Local paths (PATHS)
All paths are relative to the repo root. The orchestrator and individual scripts both cd to the repo root before running, so absolute-vs-relative is not an operational concern.
| Key | Path |
|---|---|
| data | data |
| raw | data/raw |
| soi_extracts | data/raw/soi_extracts |
| soi_dictionaries | data/raw/soi_dictionaries |
| forms | data/raw/forms |
| intermediate | data/intermediate |
| unpacked | data/intermediate/unpacked |
| harmonized | data/intermediate/harmonized |
| legacy_raw | data/raw/legacy/core |
| harmonized_legacy | data/intermediate/harmonized_legacy |
| harmonized_merged | data/intermediate/harmonized_merged |
| processed | data/processed |
| processed_legacy | data/processed_legacy |
| processed_merged | data/processed_merged |
| logs | data/logs |
| logs_legacy | data/logs/legacy |
| logs_merged | data/logs/merged |
| quality_reports | docs/quality-reports |
| quality_reports_legacy | docs/quality-reports/legacy |
| quality_reports_merged | docs/quality-reports/merged |
| crosswalks | data/crosswalks |
| docs | docs |
7.5 S3 layout (S3)
All under bucket nccsdata.
| Key | Prefix | What’s there |
|---|---|---|
| raw_prefix | raw/core/soi-extracts | One zip per (processing_year, form) |
| forms_prefix | raw/core/forms | IRS form PDFs + text extractions (current + historical) |
| legacy_prefix | legacy/core | Pre-2012 NCCS legacy files; read-only, consumed by the future legacy pipeline |
| unpacked_prefix | intermediate/core/unpacked | Decompressed extracts |
| harmonized_prefix | intermediate/core/harmonized | Per-(tax_year, form) harmonized CSVs |
| processed_prefix | processed/core | The canonical SOI-current published artifact tier: CSV + dictionary CSV + quality HTML per (tax_year, form) |
| processed_merged_prefix | processed_merged/core | Merged-panel published tier (legacy ∪ SOI-current via Option D column-merge); separate prefix while merged output is in bake-in. Phase 8 upload of this tier is not yet wired. |
| logs_prefix | logs/core | Per-run timestamped logs + quality RDS snapshots; accumulates across runs |
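To show how a prefix key resolves to a full S3 location, here is a minimal sketch. The s3_uri() helper, the bucket element of the list, and the timestamp string are assumptions for illustration; the actual structure of the S3 object in R/config.R and the real run_timestamp format may differ.

```r
# Hypothetical subset of the S3 config; the `bucket` key is an assumption.
S3 <- list(
  bucket           = "nccsdata",
  processed_prefix = "processed/core",
  logs_prefix      = "logs/core"
)

# Join bucket, prefix, and any extra path parts into an s3:// URI with a
# trailing slash, failing loudly on an unknown prefix key.
s3_uri <- function(s3, prefix_key, ...) {
  prefix <- s3[[prefix_key]]
  if (is.null(prefix)) stop("Unknown S3 prefix key: ", prefix_key)
  parts <- c(s3$bucket, prefix, ...)
  paste0("s3://", paste(parts, collapse = "/"), "/")
}

run_timestamp <- "20240101T000000Z"  # placeholder value, not the real format
processed_uri <- s3_uri(S3, "processed_prefix")
logs_uri <- s3_uri(S3, "logs_prefix", run_timestamp)
```

With these inputs, processed_uri is s3://nccsdata/processed/core/ and logs_uri is s3://nccsdata/logs/core/20240101T000000Z/, which matches the per-run layout described for the logs tier.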
7.6 IRS URL template
build_soi_url(processing_year, form) returns the IRS SOI zip URL. The filename stem varies by year: the IRS rebranded the program from eofinextract to eoextract between py2017 and py2018, and the form tag varies in casing (990ez / EZ / ez / 990EZ). Per-(year, form) filename stems are tabulated in R/config.R::SOI_FILENAME_STEMS.
Pattern observations:
- py2012–py2017: https://www.irs.gov/pub/irs-soi/{YY}eofinextract{FORM_TAG}.zip (the eofinextract era).
- py2018+: https://www.irs.gov/pub/irs-soi/{YY}eoextract{FORM_TAG}.zip (the eoextract era).
- py2017–py2019 990-PF: not published; the corresponding URL returns 404 and build_soi_url returns NA_character_.
When a new processing year is released, bump LATEST_YEAR, add the new filename stems to SOI_FILENAME_STEMS, and run the pipeline. Phase 1 fetches the new zip; phase 3 rebuilds every (tax_year, form) output as the union of all extracts.
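A lookup-table approach to the URL pattern can be sketched as below. This is not the real build_soi_url: the function and table carry a _sketch suffix to mark them as hypothetical, and the stem strings follow the two documented patterns with placeholder form tags rather than verified IRS filenames.

```r
# Hypothetical stems table: entries follow the {YY}eofinextract / {YY}eoextract
# patterns, but the form tags are placeholders, not verified IRS filenames.
SOI_FILENAME_STEMS_SKETCH <- list(
  "2017" = c("990" = "17eofinextract990", "990ez" = "17eofinextractEZ"),
  "2018" = c("990" = "18eoextract990",    "990ez" = "18eoextractez")
)

# Look up the stem for (processing_year, form); return NA_character_ when the
# combination is not tabulated, e.g. a form that was not published that year.
build_soi_url_sketch <- function(processing_year, form) {
  stems <- SOI_FILENAME_STEMS_SKETCH[[as.character(processing_year)]]
  if (is.null(stems) || !(form %in% names(stems))) {
    return(NA_character_)
  }
  paste0("https://www.irs.gov/pub/irs-soi/", stems[[form]], ".zip")
}

url_2018 <- build_soi_url_sketch(2018, "990")
url_missing <- build_soi_url_sketch(2017, "990pf")
```

Tabulating whole stems, rather than deriving them from the year, keeps the irregular casing of the form tag out of the function body; adding a processing year then only means adding one entry to the table.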
7.7 Environment variables
Read at runtime; useful for cron / CI tuning without code edits.
| Variable | Effect | Default if unset |
|---|---|---|
| NCCS_RENDER_WORKERS | Worker count for phase 7 parallel Quarto rendering. Positive integer. Overrides the workers arg to run_render_reports(). | min(parallel::detectCores() - 1, 8) |
| AWS_* | Standard AWS SDK credential vars (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_DEFAULT_REGION, etc.) | None; falls back to the EC2 instance role or ~/.aws/credentials |
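The precedence described for NCCS_RENDER_WORKERS (environment variable first, then the workers argument, then the core-count default) might be resolved along these lines. The resolve_render_workers() helper is hypothetical, and the floor of one worker and the fallback when the core count is unknown are assumptions added to keep the sketch safe on small machines.

```r
# Hypothetical helper: resolve the render worker count.
# Precedence: NCCS_RENDER_WORKERS env var > `workers` argument > default.
resolve_render_workers <- function(workers = NULL) {
  env <- Sys.getenv("NCCS_RENDER_WORKERS", unset = "")
  if (nzchar(env)) {
    # Accept only a positive integer written in plain digits.
    if (!grepl("^[1-9][0-9]*$", env)) {
      stop("NCCS_RENDER_WORKERS must be a positive integer, got: ", env)
    }
    return(as.integer(env))
  }
  if (!is.null(workers)) return(as.integer(workers))
  cores <- parallel::detectCores()
  if (is.na(cores)) cores <- 2L  # assumption: fallback when count is unknown
  # Documented default, with an assumed floor of one worker.
  max(1L, min(cores - 1L, 8L))
}
```

For a cron or CI job, exporting NCCS_RENDER_WORKERS=2 before invoking the pipeline would then cap rendering at two workers without touching R/config.R.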