2 Architecture
nccs-data-core produces the NCCS CORE Series — harmonized panels of Form 990, 990-EZ, and 990-PF fields built from the IRS Statistics of Income (SOI) annual extracts. Outputs are one CSV per (tax_year, form), plus a companion data dictionary and a quality report. The pipeline targets data engineers and researchers who need a consistently shaped panel across years rather than year-specific raw extracts whose schemas drift annually.
Two design choices set the shape of everything downstream:
- Tax year, not filing year, is the partition key. A given output file’s rows all have a fiscal period ending in that calendar year, regardless of which IRS extract or processing year supplied them.
- Snake_case readable column names, derived through an explicitly edited crosswalk rather than passed through from the IRS source variables. The crosswalk does the rename, so downstream consumers don’t have to track per-vintage source-variable drift themselves. See docs/04-crosswalks.qmd.
The rest of this chapter describes how the pipeline gets from one to the other.
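The tax-year partitioning rule can be stated in a few lines. The sketch below is Python for illustration only (the pipeline itself is R), and the helper name is hypothetical; what it encodes comes straight from the text: the partition year is the first four characters of the IRS TAXPER (YYYYMM fiscal-period-end) field, regardless of which extract supplied the row.

```python
def tax_year_from_taxper(taxper: str) -> int:
    """Partition key: the calendar year in which the fiscal period ends.

    TAXPER is the IRS YYYYMM tax-period-end field; its first four
    characters give the output partition, independent of the
    processing year of the extract the row came from.
    """
    return int(taxper[:4])

# A June-2013 fiscal-year-end filing lands in the 2013 partition,
# even if it arrived in a later SOI extract.
assert tax_year_from_taxper("201306") == 2013
```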
2.1 Three orchestrators
nccs-data-core is split into two upstream pipelines plus a standalone merge orchestrator that combines them:
- R/run_pipeline.R — the SOI-current pipeline. Ingests IRS Statistics of Income annual extracts published at irs.gov/pub/irs-soi/ from 2012 onward. Writes to data/intermediate/harmonized/ and data/processed/.
- R/run_legacy_pipeline.R — pipeline for the raw legacy NCCS files (1989–2011 PZ and PF only; 2012+ files in s3://nccsdata/legacy/core/ are NCCS+SOI hybrids and are skipped). Writes to data/intermediate/harmonized_legacy/ and data/processed_legacy/. See docs/09-legacy-harmonization.qmd.
- R/run_build_panel.R — column-merge orchestrator that joins the two harmonized trees per (tax_year, form) on (ein, tax_period) with SOI precedence, producing data/intermediate/harmonized_merged/ and data/processed_merged/. Depends on both upstream pipelines having produced output; rerun whenever either side changes. See docs/09-legacy-harmonization.qmd (“Merge phase”).
Both upstream pipelines partition by the first 4 chars of TAXPER (not publication year), so overlap on (tax_year, form) exists in both directions across the 2011/2012 boundary — the merge phase resolves it via column-level coalesce with a per-(year, form) disagreement audit.
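The merge semantics described above reduce to a per-row coalesce. The following is an in-memory Python sketch under assumed names (the real merge runs in R over the harmonized trees): for one (ein, tax_period) key, each column takes the SOI value when present, falls back to legacy otherwise, and columns where both sides are populated but disagree feed the per-(year, form) audit.

```python
def coalesce_merge(soi_row, legacy_row):
    """Column-level coalesce with SOI precedence for one (ein, tax_period)
    key. Returns the merged row plus the columns where populated SOI and
    legacy values disagree (the disagreement-audit input)."""
    soi_row = soi_row or {}
    legacy_row = legacy_row or {}
    merged, disagreements = {}, []
    for col in sorted(soi_row.keys() | legacy_row.keys()):
        s, l = soi_row.get(col), legacy_row.get(col)
        merged[col] = s if s is not None else l   # SOI wins when present
        if s is not None and l is not None and s != l:
            disagreements.append(col)
    return merged, disagreements

merged, audit = coalesce_merge(
    {"total_revenue": 100, "officers": None},
    {"total_revenue": 90, "officers": 5},
)
# merged takes SOI's total_revenue and legacy's officers;
# audit records the total_revenue disagreement.
```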
2.2 SOI-current pipeline phases
R/run_pipeline.R runs nine phases in sequence. Each is a standalone, idempotent script under R/ that can also be invoked directly (Rscript R/01_download.R, etc.) for debugging or re-runs.
+-------------+
IRS SOI page -->| 1 download | data/raw/soi_extracts/{processing_year}/*.zip
+-------------+
|
v
+-------------+
| 2 unpack | data/intermediate/unpacked/{processing_year}/{form}/*.{csv,dat}
+-------------+
|
v
+-------------+
| 2.5 pre- | reads unpacked, validates file shape against the IRS dict's
| checks | per-vintage var matrix. Hard-fails the run on dup headers,
+-------------+ missing files, or column counts outside tolerance.
|
v
+-------------+
| 3 harmonize | applies FINAL crosswalk per form: rename, coalesce synonyms,
+-------------+ NA-pad vintage gaps, type-coerce, derive (tax_year, is_501c3,
| is_amendment, ...), partition by tax_year.
v Output: data/intermediate/harmonized/{tax_year}/{form}/
+-------------+
| 4 combined | stacks 990 + 990-EZ on their 53 shared harmonized columns.
+-------------+ Output: data/intermediate/harmonized/{tax_year}/990combined/
|
v
+-------------+
| 5 quality | per-(tax_year, form) post-checks: schema, EIN format,
+-------------+ tax_period range, subsection whitelist, type validation,
| YoY tripwire. Writes RDS to data/logs/.
v
+-------------+
| 6 dictionary| per-output data dictionary CSV.
+-------------+ data/processed/{tax_year}/{form}/core_*_dictionary.csv
|
v
+-------------+
| 7 render | per-output Quarto HTML report rendered from the RDS.
+-------------+ data/processed/{tax_year}/{form}/core_*_quality.html
|
v
+-------------+
| 8 upload | (a) promote harmonized CSV -> processed/ so all three
+-------------+ artifacts live in one dir, (b) per-tier aws s3 sync.
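Phase 4’s stacking step is simple in principle: keep only the columns present in every input, then concatenate rows. A Python sketch under assumed names (the pipeline does this in R over the 53 shared harmonized columns of 990 and 990-EZ; each table here is a non-empty list of row-dicts with uniform keys):

```python
def stack_shared(tables):
    """Stack row-dicts on the columns common to every input table,
    dropping form-specific columns (990 + 990-EZ -> 990combined)."""
    shared = set.intersection(*(set(t[0]) for t in tables))
    return [{k: row[k] for k in sorted(shared)} for t in tables for row in t]

f990 = [{"ein": "01", "total_revenue": 10, "sched_b": 1}]   # 990-only column
f990ez = [{"ein": "02", "total_revenue": 5, "ez_only": 2}]  # EZ-only column
combined = stack_shared([f990, f990ez])
# Only the shared columns (ein, total_revenue) survive the stack.
```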
2.3 Phase toggles
Every phase has an ENABLE_* flag in R/config.R::CONFIG, plus a --no-{phase} CLI override on R/run_pipeline.R. This makes phases freely skippable for iterative work:
Rscript R/run_pipeline.R --no-download --no-unpack --years 2012 --forms 990ez --no-upload

A phase that’s skipped via flag does not produce or check its output — downstream phases see whatever state was previously on disk. This is the intended pattern for re-running just the late phases after upstream output is already in place.
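The flag-resolution logic amounts to: start from the ENABLE_* defaults, then let each --no-{phase} CLI flag force that phase off. A minimal Python sketch (the real resolution lives in R/config.R and R/run_pipeline.R; the phase subset and helper name here are illustrative):

```python
import argparse

# Illustrative stand-in for R/config.R::CONFIG defaults (subset of phases).
CONFIG = {"ENABLE_DOWNLOAD": True, "ENABLE_UNPACK": True, "ENABLE_UPLOAD": True}

def resolve_phases(argv):
    """Apply --no-{phase} CLI overrides on top of the ENABLE_* defaults."""
    parser = argparse.ArgumentParser()
    for phase in ("download", "unpack", "upload"):
        parser.add_argument(f"--no-{phase}", action="store_true")
    args = parser.parse_args(argv)
    enabled = dict(CONFIG)
    for phase in ("download", "unpack", "upload"):
        if getattr(args, f"no_{phase}"):        # flag present -> phase off
            enabled[f"ENABLE_{phase.upper()}"] = False
    return enabled
```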
2.4 Idempotence and the “rebuild from union” semantics
The pipeline is deliberately re-runnable on the same inputs without producing different outputs:
- Phase 1 skips downloads when the zip is already at the target path.
- Phase 2 skips unzips when the unpacked file already exists.
- Phase 3 reads the union of every data/intermediate/unpacked/{processing_year}/{form}/ on disk and produces the harmonized output from scratch. Same input set → identical output, regardless of how many times you re-run.
- Phase 4 rebuilds 990combined from the union of all data/intermediate/harmonized/{tax_year}/{990,990ez}/.
- Phases 5–7 overwrite their per-(form, tax_year) outputs each run.
- Phase 8 uses aws s3 sync, which by default only transfers changed files (no --delete, so old S3 objects persist until manually removed).
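The skip-if-exists behavior of phases 1–2 is the classic idempotent-fetch pattern. A Python sketch (hypothetical helper name; the pipeline implements this in R):

```python
from pathlib import Path
import urllib.request

def download_if_missing(url: str, target: Path) -> bool:
    """Phase-1-style idempotent fetch: a zip already at the target path
    is never re-downloaded, so re-running the phase is a cheap no-op.
    Returns True only when a download actually happened."""
    if target.exists():
        return False                    # on-disk state is durable: skip
    target.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, target)
    return True
```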
Important consequence: Phase 3 reading the union of all unpacked sources means that if you delete data/intermediate/unpacked/2012/ between runs, the 2012 rows silently drop from the output. Treat data/intermediate/unpacked/ as durable state. The Developer Guide documents the S3 rehydration SOP for fresh-machine or new-year scenarios.
2.5 Quality gates and STRICT_QUALITY_GATES
The pre-check phase (2.5) and post-check phase (5) both honor the STRICT_QUALITY_GATES flag. In strict mode, hard-failure checks (schema integrity, EIN format, type correctness, subsection whitelist, tax_period range) call stop() and halt the run. Soft checks (row-count YoY delta, duplicate-EIN counts) always log a warning but never abort. See docs/05-quality-gates.qmd for the full check catalog.
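The strict/soft split can be sketched as two check wrappers. This is a Python analogy under assumed names (the pipeline uses R's stop() and warning-level logging; the flag name comes from the text, the function names are hypothetical):

```python
import warnings

STRICT_QUALITY_GATES = True  # mirrors the pipeline flag

def hard_check(ok: bool, msg: str) -> None:
    """Schema, EIN format, types, subsection whitelist, tax_period range:
    in strict mode a failure halts the run (analogous to R's stop())."""
    if not ok:
        if STRICT_QUALITY_GATES:
            raise RuntimeError(msg)
        warnings.warn(msg)              # non-strict: log and continue

def soft_check(ok: bool, msg: str) -> None:
    """YoY row-count delta, duplicate-EIN counts: warn, never abort."""
    if not ok:
        warnings.warn(msg)
```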
2.6 Logging
Every phase writes a per-phase log to data/logs/<step>_log.txt via log4r (configured in R/create_logger.R). The orchestrator additionally writes run_pipeline_log.txt with START / OK / SKIP / FAIL markers and elapsed seconds per phase. On EC2 these logs upload to s3://nccsdata/logs/core/{run_timestamp}/ for post-mortem.
2.7 Where the rest of the architecture lives
This chapter sketches the pipeline at the orchestrator level. Topic-specific detail is in sibling chapters:
| Topic | Chapter |
|---|---|
| Tier-by-tier file/S3 layout | docs/02-data-lineage.qmd |
| Per-field transform behavior | docs/03-transforms-reference.qmd |
| Crosswalk file set + BASELINE/OVERRIDES/FINAL workflow | docs/04-crosswalks.qmd |
| Quality validators (pre + post check catalog) | docs/05-quality-gates.qmd |
| Phase toggles, paths, env vars | docs/06-configuration.qmd |
| Day-to-day developer SOPs | docs/07-developer-guide.qmd |
| Output column conventions, NA semantics, year-column distinctions | docs/08-output-schema.qmd |
| Pre-2012 legacy pipeline and merge phase | docs/09-legacy-harmonization.qmd |
| EC2 batch processing + cron | docs/10-ec2-batch-processing.qmd |
| Known IRS source-file oddities the pipeline compensates for | docs/11-upstream-source-quirks.qmd |