Master BMF Quality Report

Generated: 2026-05-06T20:46:11+0000
Quality gate: PASSED
Total rows (unique EINs): 3,672,933
Source files stacked: 117
Vintages contributing rows: 113

Summary

The Master BMF consolidates the current monthly BMF pipeline and the legacy 501CX-NONPROFIT-PX pipeline into one row per EIN. Each row is taken from the most recent vintage in which that EIN appears in either pipeline. See chapter 11 for the full design.
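
The "latest vintage wins" rule can be illustrated with a minimal Python sketch. This is not the master build itself (see chapter 11 for that); the function name consolidate_latest, the toy rows, and the tie-breaking rule (first row seen wins when two rows share a vintage) are assumptions made for illustration. The column names ein, bmf_vintage_ym, and bmf_source are taken from the column list in this report.

```python
from typing import Dict, Iterable, List


def consolidate_latest(rows: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep one row per EIN: the row from its most recent vintage."""
    latest: Dict[str, Dict[str, str]] = {}
    for row in rows:
        kept = latest.get(row["ein"])
        # "YYYY-MM" strings sort chronologically, so string comparison is enough.
        # Assumption: on a tie, the row seen first is kept.
        if kept is None or row["bmf_vintage_ym"] > kept["bmf_vintage_ym"]:
            latest[row["ein"]] = row
    return list(latest.values())


# Hypothetical stacked input: one EIN seen in both pipelines, one legacy-only EIN.
stacked = [
    {"ein": "000000001", "bmf_vintage_ym": "2010-11", "bmf_source": "legacy"},
    {"ein": "000000001", "bmf_vintage_ym": "2026-05", "bmf_source": "current"},
    {"ein": "000000002", "bmf_vintage_ym": "1989-06", "bmf_source": "legacy"},
]
master = consolidate_latest(stacked)
```

In this toy input the first EIN survives with its 2026-05 current row, and the second EIN survives with its only row, from the 1989-06 legacy vintage.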

Metric Value
Total rows 3,672,933
Distinct EINs 3,672,933
Rows == distinct EINs (uniqueness gate) ✅ yes
Input CSV files 117
Vintages with at least one surviving row 113
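
The uniqueness gate in the table above compares the row count with the distinct EIN count. A minimal sketch of such a check follows; the function name uniqueness_gate and the sample EINs are hypothetical, and the actual gate implementation may differ.

```python
from typing import Dict, Iterable, Union


def uniqueness_gate(eins: Iterable[str]) -> Dict[str, Union[int, bool]]:
    """Report total rows, distinct EINs, and whether the two counts match."""
    ein_list = list(eins)
    total_rows = len(ein_list)
    distinct_eins = len(set(ein_list))
    return {
        "total_rows": total_rows,
        "distinct_eins": distinct_eins,
        "passed": total_rows == distinct_eins,
    }


# Hypothetical inputs: one clean column, one with a duplicated EIN.
clean = uniqueness_gate(["000000001", "000000002", "000000003"])
dirty = uniqueness_gate(["000000001", "000000001", "000000002"])
```

Here clean passes because every EIN occurs once, while dirty fails because one EIN occupies two rows.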

Rows by source pipeline

Source Rows Percent
current 2,193,075 59.71
legacy 1,479,858 40.29
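
As a cross-check, the percentages above can be recomputed from the row counts in this table. The short sketch below uses only the figures reported here; the variable names are illustrative.

```python
# Row counts copied from the "Rows by source pipeline" table.
rows_by_source = {"current": 2_193_075, "legacy": 1_479_858}

total_rows = sum(rows_by_source.values())
percent_by_source = {
    source: round(100 * rows / total_rows, 2)
    for source, rows in rows_by_source.items()
}
```

The two counts sum to the reported total of 3,672,933 rows, and the rounded shares match the Percent column.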

First / last year coverage

Metric Year
Earliest first_year_in_bmf 1989
Latest first_year_in_bmf 2026
Earliest last_year_in_bmf 1989
Latest last_year_in_bmf 2026

Vintages observed per EIN

How many distinct BMF vintages did each EIN appear in?

Metric Value
Minimum 1
Average 49.8
Maximum 213
EINs that appear in only one vintage 263,399
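
The metrics above can be derived by counting distinct vintages per EIN in the stacked input, before consolidation. The sketch below is one way to do that under stated assumptions: the function name vintages_per_ein and the toy rows are hypothetical, and a repeated sighting of the same EIN in the same vintage is assumed to count once.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Set


def vintages_per_ein(rows: Iterable[Dict[str, str]]) -> Dict[str, int]:
    """Count the distinct vintages in which each EIN appears."""
    seen: Dict[str, Set[str]] = defaultdict(set)
    for row in rows:
        seen[row["ein"]].add(row["bmf_vintage_ym"])
    return {ein: len(vintages) for ein, vintages in seen.items()}


# Hypothetical stacked rows, including one duplicate sighting.
stacked = [
    {"ein": "000000001", "bmf_vintage_ym": "2010-11"},
    {"ein": "000000001", "bmf_vintage_ym": "2026-05"},
    {"ein": "000000001", "bmf_vintage_ym": "2026-05"},  # same vintage, counted once
    {"ein": "000000002", "bmf_vintage_ym": "1989-06"},
]
counts = vintages_per_ein(stacked)
summary = {
    "minimum": min(counts.values()),
    "average": round(mean(counts.values()), 1),
    "maximum": max(counts.values()),
    "single_vintage_eins": sum(1 for n in counts.values() if n == 1),
}
```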

Vintage histogram

Number of surviving master rows contributed by each vintage, sorted by row count in descending order. A vintage with a large count is the latest sighting for that many EINs.

Vintage (YYYY-MM) Source Rows contributed
2026-05 current 1,952,238
1989-06 legacy 193,984
2010-11 legacy 135,506
2016-08 legacy 97,834
2020-04 legacy 94,227
2022-01 legacy 54,788
2024-08 current 46,513
2019-08 legacy 42,746
1996-06 legacy 42,536
2011-06 legacy 42,426
2025-08 current 37,546
1998-09 legacy 34,539
2000-05 legacy 32,319
2023-08 current 28,365
2011-12 legacy 26,594
2022-08 legacy 25,109
1995-08 legacy 23,316
1997-10 legacy 22,987
2007-09 legacy 22,466
2014-06 legacy 20,218
2007-04 legacy 20,011
2012-12 legacy 19,608
2016-04 legacy 19,385
2013-07 legacy 18,737
2004-12 legacy 18,217
2003-11 legacy 18,166
2003-01 legacy 18,142
2004-04 legacy 16,696
2013-02 legacy 16,395
2018-12 legacy 16,307
1999-12 legacy 15,078
2015-07 legacy 13,439
2006-05 legacy 12,673
2023-07 current 12,497
2026-01 current 12,400
2010-08 legacy 11,730
2008-01 legacy 10,791
2002-07 legacy 10,687
2008-10 legacy 10,672
2009-10 legacy 10,533
2001-07 legacy 9,981
2026-03 current 9,820
2024-03 current 9,077
2010-07 legacy 8,990
2006-01 legacy 8,963
2010-05 legacy 8,946
2012-08 legacy 8,895
2008-04 legacy 8,791
2007-01 legacy 8,770
2002-01 legacy 8,570
2024-04 current 8,457
2024-02 current 8,300
2015-02 legacy 8,240
2025-03 current 8,048
2009-04 legacy 8,046
2012-06 legacy 7,988
2003-07 legacy 7,964
2014-02 legacy 7,694
2014-12 legacy 7,541
2023-12 current 6,984
2010-04 legacy 6,934
2015-09 legacy 6,894
2009-01 legacy 6,615
2008-06 legacy 6,585
2011-09 legacy 6,505
2025-02 current 6,419
2012-04 legacy 6,302
2015-05 legacy 6,224
2014-04 legacy 6,138
2015-12 legacy 6,060
2006-11 legacy 6,009
2009-07 legacy 5,812
2011-08 legacy 5,755
2005-07 legacy 5,634
2024-12 current 5,572
2005-11 legacy 5,423
2013-12 legacy 5,396
2016-03 legacy 5,026
2014-09 legacy 4,656
2011-07 legacy 4,608
2023-11 current 4,576
2013-05 legacy 4,529
2008-12 legacy 4,225
2025-06 current 4,131
2012-07 legacy 4,071
2013-09 legacy 3,745
2013-10 legacy 3,740
2025-12 current 3,656
2012-02 legacy 3,628
2010-01 legacy 3,578
2015-11 legacy 3,542
2024-06 current 3,537
2025-11 current 3,511
2013-06 legacy 3,408
2012-10 legacy 3,378
2025-04 current 3,370
2015-04 legacy 3,261
2025-07 current 3,172
2013-08 legacy 3,110
2025-05 current 3,108
2013-03 legacy 3,079
2011-10 legacy 3,052
2011-11 legacy 2,966
2012-03 legacy 2,874
2012-11 legacy 2,768
2023-10 current 2,679
2024-11 current 2,451
2023-09 current 2,436
2013-04 legacy 2,434
2025-09 current 2,143
2024-07 current 2,069
2014-11 legacy 2,058
2016-02 legacy 1,595
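
A histogram like the one above can be rebuilt from the master rows by counting each (vintage, source) pair. The following is a minimal sketch, assuming the master rows carry the bmf_vintage_ym and bmf_source columns listed under Column completeness; the function name vintage_histogram and the toy rows are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def vintage_histogram(
    master_rows: Iterable[Dict[str, str]],
) -> List[Tuple[Tuple[str, str], int]]:
    """Count surviving rows per (vintage, source), largest count first."""
    counts = Counter(
        (row["bmf_vintage_ym"], row["bmf_source"]) for row in master_rows
    )
    # Sort by count descending; break ties by vintage for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical master rows (one per EIN).
master_rows = [
    {"ein": "000000001", "bmf_vintage_ym": "2026-05", "bmf_source": "current"},
    {"ein": "000000002", "bmf_vintage_ym": "2026-05", "bmf_source": "current"},
    {"ein": "000000003", "bmf_vintage_ym": "1989-06", "bmf_source": "legacy"},
]
histogram = vintage_histogram(master_rows)
```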

Column completeness

Top 30 most-complete columns (highest non-null fraction):

Column Non-null %
ein_raw 100.00
ein 100.00
org_name_raw 100.00
org_name_join 100.00
org_name_display 100.00
org_addr_zip_raw 100.00
org_addr_state_invalid 100.00
ntee_code_clean 100.00
ntee_common_code 100.00
ntee_common_code_definition 100.00
ntee_code_major_group 100.00
naics_code 100.00
ntee_code_definition 100.00
nteev2_code 100.00
nteev2_subsector 100.00
nteev2_org_type 100.00
nteev2 100.00
bmf_source 100.00
combined_first_vintage_ym 100.00
combined_last_vintage_ym 100.00
bmf_vintages_observed 100.00
first_vintage_ym 100.00
last_vintage_ym 100.00
bmf_vintage_ym 100.00
first_year_in_bmf 100.00
last_year_in_bmf 100.00
org_addr_city_raw 99.92
org_addr_city 99.92
org_addr_state_raw 99.88
org_addr_state 99.81

Bottom 30 least-complete columns:

Column Non-null %
org_parent_name 3.08
activity_code_definitions 20.94
activity_code_categories 20.94
dba_name_raw 25.09
dba_name 25.09
org_legal_suffix 36.41
in_care_of_name_clean 38.45
in_care_of_name_raw 38.53
revenue_amount 52.28
all_classifications_string 59.68
in_care_of_name_provided 59.71
org_addr_street_raw 59.71
org_addr_has_special_chars 59.71
org_addr_is_po_box 59.71
org_addr_is_rural_route 59.71
org_addr_street 59.71
org_addr_zip4 59.71
org_addr_full 59.71
org_addr_is_missing 59.71
org_addr_missing_number 59.71
classification_code 59.71
affiliation_code 59.71
affiliation_code_definition 59.71
deductibility_code 59.71
deductibility_code_definition 59.71
organization_code 59.71
organization_code_definition 59.71
status_code 59.71
status_code_definition 59.71
activity_code 59.71
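
The non-null percentages in both tables can be reproduced with a per-column count. The sketch below assumes that both a missing value (None) and an empty string count as null; the report does not state which convention the build uses. The function name non_null_percent and the sample rows are hypothetical.

```python
from typing import Dict, List, Optional, Sequence


def non_null_percent(
    rows: List[Dict[str, Optional[str]]], columns: Sequence[str]
) -> Dict[str, float]:
    """Percent of rows with a non-null value, per column."""
    total = len(rows)
    result: Dict[str, float] = {}
    for col in columns:
        # Assumption: None and "" are both treated as null.
        filled = sum(1 for row in rows if row.get(col) not in (None, ""))
        result[col] = round(100 * filled / total, 2) if total else 0.0
    return result


# Hypothetical rows: every row has an EIN, only one has a DBA name.
sample_rows = [
    {"ein": "000000001", "dba_name": "EXAMPLE DBA"},
    {"ein": "000000002", "dba_name": None},
    {"ein": "000000003", "dba_name": ""},
    {"ein": "000000004"},
]
completeness = non_null_percent(sample_rows, ["ein", "dba_name"])
```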

Notes

  • All output columns are VARCHAR. The master build forces every column to string type so that DuckDB’s UNION ALL BY NAME can combine the legacy slim per-vintage schema with the current full schema without per-column type-mismatch errors. Cast numeric columns in your downstream tool, for example as.numeric(asset_amount) in R or pd.to_numeric(df['asset_amount'], errors='coerce') in pandas. The one exception is the geocoded master output, where latitude and longitude are explicitly cast to DOUBLE.
  • Legacy-only EINs have sparse columns. EINs whose latest appearance is in a pre-2014 legacy vintage carry only the legacy slim schema; current-pipeline-only columns (e.g. NTEEv2 fields, modern address quality flags) are NULL. The bmf_source column flags the source of each surviving row.
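
For readers working without R or pandas, the cast described in the first note can be sketched in plain Python. The helper to_float is hypothetical; it returns None for values that cannot be parsed, which is analogous to (not identical to) errors='coerce' in pandas, where unparseable values become NaN. The sample rows are invented for illustration.

```python
from typing import Dict, List, Optional


def to_float(value: Optional[str]) -> Optional[float]:
    """Parse a VARCHAR value as a float; return None when it cannot be parsed."""
    if value is None:
        return None
    try:
        return float(value.strip())
    except ValueError:
        # Empty strings and non-numeric text end up here.
        return None


# Hypothetical master rows, with every column stored as a string or NULL.
varchar_rows: List[Dict[str, Optional[str]]] = [
    {"ein": "000000001", "asset_amount": "12500"},
    {"ein": "000000002", "asset_amount": ""},
    {"ein": "000000003", "asset_amount": None},
    {"ein": "000000004", "asset_amount": "n/a"},
]
asset_amounts = [to_float(row["asset_amount"]) for row in varchar_rows]
```

The ein column is deliberately left as a string in this sketch, since casting an identifier to a number would drop any leading zeros.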