6  Quality Gates

Built in Phase 6 (2026-05-11). Entry points:

  • R/quality/pre_checks.R::run_pre_checks_one(): runs against an unpacked source file.
  • R/quality/post_checks.R::run_post_checks(): runs against a harmonized data.table.
  • R/05_quality.R::run_quality(): orchestrates post-checks across every (form, tax_year) output, writes RDS per (form, tax_year) to the logs_dir parameter (default data/logs/). The legacy orchestrator and the merge orchestrator pass data/logs/legacy/ and data/logs/merged/ respectively, so each pipeline’s RDS files stay isolated at shared (form, tax_year) keys (e.g. 990combined/2011).
  • R/quality/stat_helpers.R: eight reusable type-specific stat calculators (completeness, numeric, character, code, boolean, date, plus column / category report builders).
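The logs_dir isolation described above can be illustrated with a minimal sketch. The helper names (report_path, write_report) and the file-name pattern are hypothetical and only serve the illustration; the actual naming used by run_quality() is not stated here.

```r
# Hypothetical helpers: the "quality_<form>_<tax_year>.rds" pattern is an
# assumption for illustration, not the pipeline's documented file name.
report_path <- function(form, tax_year, logs_dir = file.path("data", "logs")) {
  file.path(logs_dir, sprintf("quality_%s_%s.rds", form, tax_year))
}

write_report <- function(report, form, tax_year,
                         logs_dir = file.path("data", "logs")) {
  # Create the target directory if it does not exist yet.
  dir.create(logs_dir, recursive = TRUE, showWarnings = FALSE)
  path <- report_path(form, tax_year, logs_dir)
  saveRDS(report, path)
  invisible(path)
}

# Two pipelines writing the same (form, tax_year) key into different
# directories end up with separate files.
base_dir <- file.path(tempdir(), "logs_demo")
main_path <- write_report(list(pipeline = "main"), "990combined", 2011,
                          logs_dir = base_dir)
legacy_path <- write_report(list(pipeline = "legacy"), "990combined", 2011,
                            logs_dir = file.path(base_dir, "legacy"))
```

Because the directory is the only thing that differs, passing a distinct logs_dir per orchestrator is enough to keep the reports apart.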

6.1 Pre-checks (before harmonization)

Per (processing_year, form):

  • File exists and non-empty.
  • Header row present.
  • Column count within ±5% of expected per-vintage value.
  • Row count > 0.
  • No duplicate header column names.
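The checks listed above can be sketched roughly as follows. This is not the body of run_pre_checks_one(); it assumes a comma-delimited source file, and the function name, the expected_cols argument, and the result field names are all hypothetical.

```r
# Sketch of the five pre-checks for one unpacked file.
# Assumptions: comma-delimited text, header on the first line,
# `expected_cols` is the per-vintage expected column count.
pre_check_sketch <- function(path, expected_cols, tolerance = 0.05) {
  exists_nonempty <- file.exists(path) && file.size(path) > 0
  if (!exists_nonempty) {
    # Nothing else can be evaluated without a readable, non-empty file.
    return(list(exists_nonempty = FALSE, header_present = FALSE,
                col_count_ok = FALSE, rows_present = FALSE,
                headers_unique = FALSE))
  }
  lines <- readLines(path, warn = FALSE)
  header <- trimws(strsplit(lines[1], ",", fixed = TRUE)[[1]])
  list(
    exists_nonempty = TRUE,
    header_present  = length(header) > 0 && all(nzchar(header)),
    # Column count within +/- tolerance of the expected value.
    col_count_ok    = abs(length(header) - expected_cols) <=
                        tolerance * expected_cols,
    # At least one non-blank line after the header.
    rows_present    = sum(nzchar(lines[-1])) > 0,
    headers_unique  = anyDuplicated(header) == 0
  )
}

demo_file <- tempfile(fileext = ".csv")
writeLines(c("ein,tax_period,subsection", "123456789,201112,03"), demo_file)
demo_result <- pre_check_sketch(demo_file, expected_cols = 3)
```

Each element of the returned list is a single TRUE/FALSE, so a caller can report which specific pre-check failed.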

6.2 Post-checks (after harmonization)

Per (tax_year, form):

  • Schema: every harmonized column from FINAL crosswalk is present.
  • Type: numeric / logical / character as declared; tax_year in [1989, current_year + 1].
  • EIN format: 9 digits, no nulls. Duplicate EINs are reported but do not fail the check (see 6.5).
  • tax_period format: 6 chars, parseable.
  • Subsection codes in KNOWN_SUBSECTION_CODES.
  • Row-count plausibility: within ±20% of the same (tax_year, form) in a prior pipeline run (warn; see 6.6).
  • Null rates within historical bounds (warn).
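Three of the format checks above (EIN, tax_period, tax_year range) can be sketched on a plain data frame. The function name is hypothetical, EINs are assumed to be stored as character so leading zeros survive, and "parseable" is assumed to mean YYYYMM with a valid month, which the text does not spell out.

```r
# Sketch only; run_post_checks() itself operates on a data.table and covers
# more checks than shown here.
post_check_sketch <- function(df,
                              current_year = as.integer(format(Sys.Date(), "%Y"))) {
  # EIN: exactly nine digits, no NA.
  ein_ok <- !is.na(df$ein) & grepl("^[0-9]{9}$", df$ein)
  # tax_period: six characters, assumed YYYYMM with month 01..12.
  period_ok <- !is.na(df$tax_period) &
    grepl("^[0-9]{4}(0[1-9]|1[0-2])$", df$tax_period)
  # tax_year must fall in [1989, current_year + 1].
  year_ok <- !is.na(df$tax_year) &
    df$tax_year >= 1989 & df$tax_year <= current_year + 1
  list(
    ein_format        = all(ein_ok),
    tax_period_format = all(period_ok),
    tax_year_range    = all(year_ok)
  )
}

filings <- data.frame(
  ein        = c("012345678", "987654321"),
  tax_period = c("201112", "201206"),
  tax_year   = c(2011L, 2012L),
  stringsAsFactors = FALSE
)
filings_result <- post_check_sketch(filings, current_year = 2026L)
```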

6.3 Completeness metric

The report’s overall_completeness is vintage-aware: completeness is computed per cohort (extract_year, source_form) over the harmonized columns expected for that vintage (from the crosswalk’s years_present field), then averaged across cohorts weighted by row count.

A clean run produces 100% completeness across all outputs: because the IRS uses 0 rather than NA for financial fields, every expected column is populated. The metric therefore functions as a pipeline-integrity tripwire: any drop below 100% indicates the pipeline lost rows or columns for some vintage cohort.

The overall_completeness_raw field is preserved on the report for transparency. It computes the simple all-cols/all-rows fraction without vintage adjustment, which is useful for spotting NA-pad ratio drift across years.

Per-cohort breakdown is in summary_stats$completeness_by_cohort.
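The difference between the vintage-aware and raw figures can be seen in a small worked sketch. The function names and the expected_cols list (a cohort key mapped to the columns expected for that vintage) are hypothetical stand-ins for what the crosswalk's years_present field provides.

```r
# Sketch only: `expected_cols` is a hypothetical named list keyed by
# "extract_year/source_form".
completeness_by_cohort <- function(df, expected_cols) {
  cohort <- paste(df$extract_year, df$source_form, sep = "/")
  chunks <- split(df, cohort)
  rows <- lapply(names(chunks), function(key) {
    chunk <- chunks[[key]]
    cols <- expected_cols[[key]]
    data.frame(
      cohort = key,
      n_rows = nrow(chunk),
      # Share of non-NA cells among the columns expected for this cohort.
      completeness = 1 - mean(is.na(chunk[, cols, drop = FALSE])),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# Row-count-weighted average across cohorts.
overall_completeness <- function(by_cohort) {
  weighted.mean(by_cohort$completeness, by_cohort$n_rows)
}

# Raw figure: all value columns over all rows, no vintage adjustment.
raw_completeness <- function(df, value_cols) {
  1 - mean(is.na(df[, value_cols, drop = FALSE]))
}

# new_field only exists from the 2012 extract onward, so 2011 rows are NA-padded.
harmonized <- data.frame(
  extract_year = c(2011, 2011, 2011, 2012),
  source_form  = "990",
  revenue      = c(10, 0, 5, 7),
  assets       = c(1, 2, 0, 4),
  new_field    = c(NA, NA, NA, 3)
)
expected <- list(
  "2011/990" = c("revenue", "assets"),
  "2012/990" = c("revenue", "assets", "new_field")
)
by_cohort <- completeness_by_cohort(harmonized, expected)
overall   <- overall_completeness(by_cohort)
raw       <- raw_completeness(harmonized, c("revenue", "assets", "new_field"))
```

In this toy data the vintage-aware figure is 1 (every expected cell is populated), while the raw figure is 0.75 because the three NA-padded new_field cells count against it.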

6.4 Strict mode

STRICT_QUALITY_GATES = TRUE aborts the run on schema, EIN-format, and type failures. Soft checks (row-count delta, null rate, duplicate EINs) always warn and never abort.
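The hard/soft split can be sketched as a small gate function. The function name, the check names, and the shape of check_results (a named list of TRUE/FALSE) are hypothetical. The behavior with strict = FALSE is also an assumption: the sketch downgrades hard failures to warnings, which the text above does not specify.

```r
# Sketch only: `check_results` is a hypothetical named list where TRUE means
# the check passed.
apply_quality_gate <- function(check_results, strict = TRUE) {
  hard <- c("schema", "ein_format", "type")
  soft <- c("row_count_delta", "null_rate", "duplicate_eins")
  failed <- names(check_results)[!unlist(check_results)]

  # Soft checks only ever warn.
  for (name in intersect(failed, soft)) {
    warning("soft check failed: ", name, call. = FALSE)
  }

  hard_failed <- intersect(failed, hard)
  if (length(hard_failed) > 0) {
    msg <- paste("hard check failed:", paste(hard_failed, collapse = ", "))
    # Assumption: with strict = FALSE a hard failure is reported as a warning.
    if (strict) stop(msg, call. = FALSE) else warning(msg, call. = FALSE)
  }
  invisible(failed)
}

results <- list(schema = TRUE, ein_format = TRUE, type = TRUE,
                row_count_delta = TRUE, null_rate = TRUE,
                duplicate_eins = FALSE)
soft_only <- suppressWarnings(apply_quality_gate(results, strict = TRUE))
```

Here the only failure is a soft one, so the call returns normally even in strict mode.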

6.5 Duplicate EINs are soft, not hard

The same (ein, tax_period) filed twice in one extract is a legitimate within-year amendment and is preserved in the output. is_amendment (set in harmonize) flags cross-extract duplicates; within-extract duplicates surface in summary_stats$duplicate_eins but do not fail the gate.
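A minimal sketch of how within-extract duplicates could be surfaced without dropping any rows follows. The function name is hypothetical, and the extract_year column is assumed to identify the extract, as in the cohort definition of 6.3.

```r
# Sketch only: returns the distinct (extract_year, ein, tax_period) keys that
# occur more than once within the same extract. The input is not modified.
find_within_extract_duplicates <- function(df) {
  key <- paste(df$extract_year, df$ein, df$tax_period, sep = "|")
  # Mark every member of a duplicated group, not just the later occurrences.
  dup <- duplicated(key) | duplicated(key, fromLast = TRUE)
  unique(df[dup, c("extract_year", "ein", "tax_period")])
}

extract_rows <- data.frame(
  extract_year = c(2012, 2012, 2013, 2012),
  ein          = c("123456789", "123456789", "123456789", "987654321"),
  tax_period   = c("201112", "201112", "201112", "201112"),
  stringsAsFactors = FALSE
)
duplicate_keys <- find_within_extract_duplicates(extract_rows)
```

The first two rows share an extract and are reported as one duplicated key. The 2013 row repeats the same (ein, tax_period) in a different extract, so it is left to the is_amendment flag rather than reported here.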

6.6 Year-over-year delta is run-archive based, not tax-year shift

The plan’s “±20% vs prior year” check compares the same (form, tax_year) against the report from a prior pipeline run, not against a different tax_year. Until pipeline-run archive infrastructure exists, this check returns status = "no_baseline" and is a no-op.
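The no-op behavior can be sketched as follows. Only the "no_baseline" status comes from the text above; the function name, the "pass" and "warn" status values, and the way a baseline row count would be supplied are assumptions.

```r
# Sketch only: `baseline_rows` would come from a prior run's report for the
# same (form, tax_year); NULL stands for "no archive available".
row_count_delta_check <- function(current_rows, baseline_rows = NULL,
                                  tolerance = 0.20) {
  if (is.null(baseline_rows) || is.na(baseline_rows) || baseline_rows == 0) {
    return(list(status = "no_baseline", delta = NA_real_))
  }
  # Relative change versus the prior run.
  delta <- (current_rows - baseline_rows) / baseline_rows
  list(status = if (abs(delta) <= tolerance) "pass" else "warn",
       delta = delta)
}

without_archive <- row_count_delta_check(current_rows = 1000)
with_archive    <- row_count_delta_check(current_rows = 1300,
                                         baseline_rows = 1000)
```

Without a baseline the check reports "no_baseline" and changes nothing. With one, a 30% increase falls outside the ±20% band and yields a warning status.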