Pollution Analysis Methodology
Pollution Analyst is built around three commitments: plain English for readers, federal-only data with explicit attribution, and equity context paired with releases rather than hidden. The platform-wide rules are first; per-source caveats follow at the bottom.
Pollutant Taxonomy
Every page on this site uses the same top-level categorization, chosen so cross-source comparisons remain coherent. New sources extend the taxonomy rather than reshape it.
- Criteria air pollutants (NAAQS). PM2.5, PM10, ozone, NO₂, SO₂, CO, lead. The six pollutants for which the Clean Air Act sets ambient standards (particulate matter counts as one, split here into PM2.5 and PM10).
- Hazardous air pollutants (HAPs). ~187 chemicals listed under Clean Air Act §112. Rolled up by health endpoint (carcinogen, neurotoxin, respiratory irritant) on facility pages.
- Greenhouse gases. CO₂, methane, N₂O, HFCs. Reported under the GHGRP for large emitters.
- Drinking-water contaminants. Lead, nitrates, arsenic, disinfection byproducts, microbial. Reported via SDWIS as MCL or treatment-technique violations per public water system.
- Toxic releases (TRI). ~800 chemicals reported by industrial facilities under EPCRA §313. Rolled up by chemical group (carcinogens, persistent bioaccumulative toxins, dioxins).
Anomaly Engine
Pollution data has different shapes from crime data. Crime is point-incident, monthly, with weak seasonality; pollution is annual (TRI, GHGRP), event-driven (SDWIS), or sparse-monitor (AQS, when it lands). The flag taxonomy reflects that. Four flag types ship today; two more (smoke days, NAAQS exceedance days) defer until air-monitor ingest is in.
A flag is editorial attention, not a regulatory finding. Where EPA has issued an enforcement action — SDWIS Tier 1 violations, ECHO actions — we link to the federal record so readers can verify the actual compliance posture rather than infer it from our card.
Long-Arc Shift
Triggers when a geography's most-recent-year value differs from a baseline year (≥10 years prior) by ≥50%. Absolute floors keep the percent change meaningful: ≥50,000 lb baseline for TRI pathways, ≥100,000 mtCO₂e for GHG. Surfaced on facility, county, and state pages. Severity: improvement for declines, regression for rises.
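The long-arc rule above can be sketched as a single predicate. This is an illustrative sketch, not the pipeline's actual code; the function and parameter names are ours, and the 50,000 lb floor shown is the TRI-pathway value (the GHG pathway would substitute 100,000 mtCO₂e).

```python
def long_arc_flag(baseline, latest, years_apart, floor=50_000):
    """Long-arc shift check (sketch; names are ours, not the pipeline's).

    Flags when the latest value differs from a baseline >= 10 years prior
    by >= 50%, with an absolute floor on the baseline so the percent
    change stays meaningful on small bases."""
    if years_apart < 10 or baseline < floor:
        return None
    pct_change = (latest - baseline) / baseline
    if abs(pct_change) < 0.50:
        return None
    return "improvement" if pct_change < 0 else "regression"
```

A 60% decline from a 100,000 lb baseline over twelve years flags as an improvement; the same percent decline from a 40,000 lb baseline stays silent because it sits under the absolute floor.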
Release Shift
Year-over-year facility × chemical TRI shift. Three combined floors against tiny-base noise: ≥50% change AND ≥10,000 lb absolute delta AND ≥1,000 lb prior-year baseline. Surfaced on facility pages only — YoY at county/state aggregation is too noisy to be editorial. Severity: surge or drop.
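The three combined floors compose as a short conjunction. A minimal sketch, with hypothetical names; quantities are pounds, per facility × chemical:

```python
def release_shift_flag(prior_lb, current_lb):
    """Release-shift check (sketch; thresholds from the methodology text).

    All three floors must hold: >= 1,000 lb prior-year baseline,
    >= 10,000 lb absolute delta, and >= 50% relative change."""
    if prior_lb < 1_000:
        return None                       # tiny-base: percent change is noise
    delta = current_lb - prior_lb
    if abs(delta) < 10_000 or abs(delta) / prior_lb < 0.50:
        return None
    return "surge" if delta > 0 else "drop"
```

The AND of the three floors is what kills the classic failure mode: a facility going from 200 lb to 800 lb is a 300% change but never flags.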
Violation Event
SDWIS health-based or unresolved violation. Event-driven, not statistical — the violation itself is the signal. Surfaced on water-utility pages with a link to the EPA SDWIS record. Severity ordering: unresolved > health-based within 1 year > health-based within 5 years. Monitoring failures and returned-to-compliance violations don't flag.
GHG Step
County-level GHGRP year-over-year shift, ≥30% with both years ≥10,000 mtCO₂e. Typically reflects industrial commissioning, decommissioning, or fuel switching. Facility-level deferred until a TRI↔GHGRP facility-ID join is built. Severity: surge or drop.
Calibration Commitment
Average flag count is targeted at: ≤3 per facility, ≤5 per county, ≤8 per state. Water utilities are uncapped — violation events are events, not anomalies. The first emission cycle intentionally runs lax thresholds; the engine logs per-geography counts so over-cap entities can drive a threshold tune before re-publishing.
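The per-geography logging that drives the tune can be as simple as a counter against the caps. A sketch with assumed names; the real engine's log shape is not specified here. Water utilities are represented by having no cap entry:

```python
from collections import Counter

CAPS = {"facility": 3, "county": 5, "state": 8}  # utilities: uncapped

def over_cap_entities(flags):
    """Given an iterable of (geo_type, geo_id) pairs, one per emitted flag,
    return the entities whose flag count exceeds the cap for their type.
    Sketch only; these are the entities that drive a threshold re-tune."""
    counts = Counter(flags)
    return sorted(
        key for key, n in counts.items()
        if n > CAPS.get(key[0], float("inf"))  # uncapped types never trip
    )
```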
Deferred
Smoke days and NAAQS exceedance days require AQS air-monitor ingest, which is not yet in the pipeline. Sustained-shift and streak-break flags require a monthly cadence; TRI is annual. Cross-pathway flags (e.g., a high-TRI facility near a public water system with violations) wait until the individual lanes are calibrated. None of these are technical blockers — all are scope choices for v1.
Equity Overlay (The Deliberate Inversion From Our Crime Site)
Our companion site, Public Analyst.ai, refuses to overlay race or income with neighborhood crime. The reasoning there is specific to crime: it's reported by police, subject to enforcement-pattern bias, and the juxtaposition reads correlation as causation regardless of authorial intent.
For pollution, the opposite is the legitimate, well-documented framing. Environmental justice analysis explicitly correlates pollution exposure with race and income. EPA published and maintained EJScreen for exactly this purpose for over a decade. Refusing the overlay would undermine the value of the site.
The mechanism differs in two important ways:
- Pollution is measured at the receptor or the source. A TRI release is reported by the facility under federal mandate; an EJ index pairs that release with the demographics of the surrounding block groups. Neither measurement is filtered through enforcement priorities.
- Concentration is the object of inquiry. The question on a pollution page is whether a population bears disproportionate exposure — that's an empirical question with an empirical answer, computed at federally defined geographies.
Three Layers, Demographics-Leading
EPA retired the public-facing EJScreen tool in 2025. We're now a primary-source compositor for the equity overlay rather than a re-presenter of an EPA-blessed index. Every facility, county, city, and state page renders the overlay in three layers, in order of prominence:
- Demographic context (lead). Census ACS 2018–2022 (5-year): population total + share low-income / people of color / under age 5 / over age 64. Always rendered. This is the “who lives here” surface — the most legible to readers and the most defensible methodologically.
- National percentiles, per environmental indicator. Each indicator (PM2.5, ozone, NO₂, diesel particulate, RSEI toxic releases, lead-paint risk, NPL/RMP/TSDF/NPDES proximity) ranked against the national distribution of all US block groups, population-weighted. Rendered as “in the highest 10% nationally” — the framing EPA's original EJScreen used. Computed in-pipeline against the raw indicator block-group table from USEPA-clone/EJAM-open/data/blockgroupstats.rda; the same upstream EPA team maintains both this and the disparity-score file.
- EJ disparity scores (statistical detail). EPA's newer disparity-score metric, sourced from the USEPA-clone/ejamdata GitHub mirror that the open-source EJAM package consumes. Population-weighted to state, county, and city. Centered on 100 = the population-weighted reference burden; higher = greater disparate exposure. ~150 is widely considered notable; 200+ is severe. Surfaced as a table at the bottom of the equity section so readers who want the formal stat can read it directly.
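The population-weighted percentile in the second layer reduces to a weighted rank against the national block-group distribution. A minimal sketch with our own names (the real computation runs against the blockgroupstats table): the percentile is the share of the national population living in block groups at or below the geography's value.

```python
def national_percentile(value, bg_values, bg_pops):
    """Population-weighted national percentile (illustrative sketch).

    `bg_values` are per-block-group indicator values, `bg_pops` their
    populations. Returns the percent of the national population living
    in block groups whose value is <= `value`."""
    pairs = sorted(zip(bg_values, bg_pops))
    total = sum(pop for _, pop in pairs)
    at_or_below = sum(pop for v, pop in pairs if v <= value)
    return 100.0 * at_or_below / total
```

With this weighting, a geography can land “in the highest 10% nationally” even if most block groups have higher raw values, as long as those block groups are sparsely populated.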
Why Both Percentile And Disparity
Percentile and disparity score answer different questions. Percentile says “how does this place rank against the country?” — directly legible, easy to cite. Disparity score says “does the population at this place bear more burden than a population-weighted reference?” — a stronger statement about distributive equity, harder to compress into one phrase. Surfacing both lets the reader hold them up against each other.
Geography Preference (Per-Page)
Each page picks the tightest geography that has data available, in this preference order: Census Place (city) → containing County → State. Facility pages currently use the containing-county overlay as a proxy; a 3-mile-buffer aggregation is a future iteration.
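The fallback order is a straightforward first-match walk. A sketch under assumed names; the level keys and data shape are ours, not the pipeline's:

```python
def pick_overlay_geography(available):
    """Pick the tightest geography with overlay data (illustrative sketch).

    `available` maps a level name to its overlay data, or None when that
    level has no data. Preference: place -> county -> state."""
    for level in ("place", "county", "state"):
        data = available.get(level)
        if data is not None:
            return level, data
    return None  # no overlay renderable at any level
```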
The methodology page tracks every substitution explicitly. The pipeline reads USEPA-clone/ejamdata directly, so any reader can audit the raw block-group inputs against our computed rollups.
What We Deliberately Exclude
- Real-time alerting and AQI dashboards. AirNow already does that well. We are an analytics-and-narrative layer, not a hazard-of-the-hour service.
- Health diagnoses or medical advice. We reference exposure thresholds (NAAQS, MCLs, EJ disparity scores); we do not interpret them for individual readers.
- Per-individual exposure modeling. Our unit of analysis is aggregate (facility, utility, county, place). The block-group EJ data and NATA are tract / block-group rollups; we do not model personal exposure.
- Speculative attribution to specific facilities beyond what TRI / ECHO publishes directly. We do not assert that a given facility caused a given health outcome; that exceeds what the data can support.
- Real-estate value framing. Pollution data is not a property-listing feature. The framing risks turning environmental harm into a market signal.
- Commercial / non-federal sensors. PurpleAir and similar community sources are out of scope. The federal corpus is sufficient and licensing-clean.
Sources We Currently Use
Every dataset on this site is a federal public-domain work (17 USC §105). We attribute the originating agency on every page; we do not relabel federal observations as proprietary.
TRI · Toxics Release Inventory
Owner. EPA, under EPCRA §313 and PPA §6607.
What it is. Self-reported annual releases of ~800 listed chemicals by industrial facilities meeting NAICS-code and employee thresholds. Reported quantities are estimates (mass balance, emission factors, monitoring data) rather than continuous measurements.
Cadence. Annual. Reporting year T is published in the second half of year T+1; preliminary data lands earlier. We mark the reporting year on every facility and county page.
Caveats we surface.
- Self-reported. A facility's TRI total is only as good as its Form R submission. EPA does QC checks but the underlying record is the facility's. Anomalous year-over-year shifts often reflect reporting-method changes, not actual emission shifts.
- Threshold-coverage gaps. Below-threshold facilities don't report. A county with a TRI total of zero may still have meaningful smaller emitters.
- Pounds, not concentrations. TRI tells you how much was released — not where it ended up or what people inhaled. We pair TRI with the EJ disparity overlay for population-exposure context.
- Long-arc improvements are real. Multi-decade declines on the order of −30% to −60% reflect both Clean Air Act controls and the progressive electrification of heavy industry. We report the long arc explicitly because the year-over-year noise can hide it.
Reference. EPA TRI Program.
GHGRP · Greenhouse Gas Reporting Program
Owner. EPA, under 40 CFR Part 98.
What it is. Annual self-reported greenhouse-gas emissions (CO₂, methane, N₂O, HFCs, others) from large emitters — power plants, refineries, chemicals, cement, landfills, and other facilities above the 25,000 mtCO₂e threshold. Reported in metric tons of CO₂ equivalent per facility per gas per sector subpart.
Cadence. Annual. Reporting year T historically published in the second half of T+1.
Status — 2024 reporting year is unavailable. EPA's Envirofacts API returns zero rows nationally for GHGRP year 2024, for every state. The Trump administration halted the Greenhouse Gas Reporting Program in 2025; the year-2024 dataset that would normally have been published in late 2025 was not released. The Environmental Integrity Project subsequently obtained and released the 2024 industrial GHG data independently. We have not yet ingested the EIP release; for now, every state's GHG pathway tile shows 2023 as the most-recent available year. We will revisit when EPA either resumes publication or we wire EIP's release into the pipeline.
Caveats we surface.
- Self-reported, threshold-gated. Like TRI, GHGRP captures large emitters above a federal threshold. Below-threshold sources — small commercial boilers, distributed agriculture, on-road vehicles — are not in this dataset. We label GHG totals as “reported large-emitter CO₂e” rather than “total greenhouse gases.”
- CO₂e is a single number across many gases. Methane and N₂O are aggregated into the CO₂-equivalent total via 100-year global-warming potentials. Methane-heavy sectors (oil & gas, landfills) shift more under higher-GWP accounting; we use EPA's reported CO₂e as published.
- Federal-only. This is the reported federal figure, not a top-down inventory. State and tribal inventories may differ; we do not blend.
Reference. EPA GHGRP.
SDWIS · Safe Drinking Water Information System
Owner. EPA + state primacy agencies, under the Safe Drinking Water Act.
What it is. Compliance status for ~150,000 public water systems. Records each violation against an MCL (maximum contaminant level), treatment-technique requirement, or monitoring rule, with year, contaminant, and resolution status.
Cadence. Quarterly federal aggregation; states report as violations occur and resolve.
How we interpret it.
- Ownership type chip. SDWIS classifies every public water system by owner — local government, private, tribal, state, federal, or mixed. We surface that as a chip on every utility row (Municipal / Private / Tribal / State-owned / Federal / Mixed) so a reader can tell "City of Stockton" from a mobile-home park or HOA-run system without inferring from the name. Private systems are roughly 45% of the national CWS roster — many are small operators serving a single subdivision or trailer park; their compliance posture deserves a different read than a municipal utility's.
Caveats we surface.
- Health-based vs monitoring violations. A health-based violation means an MCL or TT was exceeded — actionable signal. A monitoring violation means the utility didn't collect or report a sample on time — concerning for transparency, but not a measured exceedance. We label these distinctly on every utility page.
- Active is not the same as crisis. Many utilities with health-based violations are mid-remediation under EPA-approved plans. We link to the EPA SDWIS record so readers can verify return-to-compliance status before forming a conclusion.
- State reporting variance. States vary in how aggressively they record monitoring violations. Cross-state comparisons of monitoring-violation counts are noisier than health-based counts.
- SDWIS is not the tap. SDWIS records utility-level compliance against federal standards. It does not measure what comes out of any individual tap, which depends on premise plumbing and corrosion conditions.
Reference. EPA SDWIS.
EJScreen · Environmental-Justice Screening (Post-2025 Substitution)
Owner of original EJScreen. EPA Office of Environmental Justice & External Civil Rights.
Status. EPA retired the public-facing EJScreen tool and its prebuilt CSV exports in 2025. The underlying block-group tables are still maintained by the same EPA team that built EJAM (the open-source successor), published via the USEPA-clone/ejamdata GitHub repo. We pull data/bgej.arrow directly and aggregate to state, county, and place by population-weighted mean.
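The aggregation step described above — block-group scores rolled up by population-weighted mean — can be sketched as follows. The row fields and function name are ours for illustration; the real input is the bgej.arrow table:

```python
from collections import defaultdict

def popweighted_rollup(rows, key):
    """Roll block-group disparity scores up to a coarser geography by
    population-weighted mean (illustrative sketch; field names are ours).

    `rows` are dicts with `key` (e.g. a hypothetical "county_fips"),
    "score", and "pop". Returns {geography_id: weighted mean score}."""
    weighted_sum, pop_sum = defaultdict(float), defaultdict(float)
    for row in rows:
        weighted_sum[row[key]] += row["score"] * row["pop"]
        pop_sum[row[key]] += row["pop"]
    return {g: weighted_sum[g] / pop_sum[g] for g in weighted_sum if pop_sum[g] > 0}
```

Population weighting matters here: a county whose one large block group scores 100 and whose three small ones score 150 should land near 100, not at the unweighted midpoint.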
What we currently render. Two-layer environmental burden, per indicator (PM2.5, ozone, NO₂, diesel particulate, RSEI toxic releases, lead-paint risk, NPL/RMP/TSDF/NPDES proximity, USTs, drinking-water non-compliance):
- National percentile — population-weighted mean of the indicator for the geography, ranked against the national CDF of all US block-group means. Computed in-pipeline from the raw indicator table at USEPA-clone/EJAM-open/data/blockgroupstats.rda. Mirrors the framing EPA's original EJScreen tool used.
- EJ disparity score — EPA's newer metric centered on 100 = population-weighted reference burden; higher = greater disparate exposure. Sourced from bgej.arrow in the same upstream repo.
Cadence. Underlying indicators update on their source cadence (ACS 5-year, NATA modeling cycle, AQS rollups). The USEPA-clone/ejamdata Arrow file is refreshed when EPA pushes a new compilation; we re-pull on demand.
Caveats we surface.
- Block-group rollup. Not a personal exposure model. The underlying data rolls modeled exposure to block groups and pairs that with ACS demographics. Inside a block group there is variation we do not capture.
- Disparity score, not absolute level. A score of 150 says “the population here bears notably more burden than the population-weighted reference,” not “X concentration above standard.” We always show the underlying indicator label and the “reference burden” framing on every page.
- Substitution is explicit. Every page that uses the overlay credits “Census ACS 2018–2022 + USEPA-clone EJ disparity mirror” in the source line, with a link back to the upstream Arrow file. Readers can verify our rollups against the raw block-group inputs without going through the pipeline.
- No source laundering. We do not relabel disparity scores as “our index.” The metric is EPA's; the population weighting and place/county/state aggregation are ours, and the pipeline code is open.
Reference. USEPA-clone/ejamdata (current source of truth) · EPA EJScreen historical landing page (deprecated 2025).
Co-Located Health Indicators · CDC PLACES
What this section is. County and city pages render five chronic-disease prevalence estimates from CDC's Population Level Analysis and Community Estimates (PLACES) program — adult asthma, COPD, coronary heart disease, diabetes, and frequent mental distress. These sit immediately after the equity overlay so that pollution, demographics, and health-outcome context can be read together.
Why it's here. The pollution-vs-health correlation is the question readers actually arrive with. Surfacing co-located prevalence at the same geography as the pollution data makes the structural pattern visible without forcing readers to cross-reference three different government tools. The pattern is striking: Kern County's COPD prevalence is 38% above the California mean; Palo Alto's is 42% below. That gradient mirrors the pollution gradient closely.
What the data is — and isn't.
- Modeled, not measured. CDC PLACES uses a multi-level small-area regression on BRFSS (Behavioral Risk Factor Surveillance System) responses to produce a synthetic prevalence estimate per county and per Census place. It is not a count of diagnosed cases at the geography. Confidence intervals widen with smaller populations, and rural geographies with thin BRFSS samples should be read with more care.
- Crude vs age-adjusted. The headline tile value is crude prevalence — the actual local rate as published. Both the “vs state mean” and “vs US mean” comparators use age-adjusted prevalence on both sides so geographies with different age structures stay apples-to-apples. PLACES publishes both; we render both, in different roles.
- State and national comparators. Each tile carries two pills — vs state mean and vs US mean. State answers “how does this place stack up against the rest of its state?”; US answers “is this place's health profile typical of America, or an outlier in either direction?” Both means are population-weighted across counties (PLACES has no published national or state aggregate row) — same methodology, different denominator. California averages slightly healthier than the US on most measures, so a tile that looks “flat vs CA” can still be “below the US mean”, and we render both so readers can see that.
- Ecological correlation, not causation. A higher pollution reading and a higher disease prevalence in the same county do not establish that the pollution caused the disease. Causal attribution requires individual-level data, exposure histories, and confounder controls that an area-level dataset cannot provide. We say this out loud on every section header so the framing carries through to anyone who lands directly on a county page.
- Vintage. The 2025 PLACES release is built on BRFSS 2022 and 2023 — different measures use different years depending on questionnaire rotation. Each tile labels its underlying data year. Re-pulled on each pipeline cycle.
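The comparator pills reduce to one computation: the local age-adjusted prevalence against a population-weighted mean of the constituent counties (since PLACES publishes no aggregate row). A hedged sketch with our own names; the rounding and sign convention are illustrative, not the site's exact rendering:

```python
def comparator_pill(local_age_adj, counties):
    """Percent above/below a population-weighted mean (illustrative sketch).

    `counties` is a list of (age_adjusted_prevalence, population) pairs for
    every county in the comparison set (a state, or the whole US).
    Positive result = local prevalence above the mean."""
    total_pop = sum(pop for _, pop in counties)
    mean = sum(prev * pop for prev, pop in counties) / total_pop
    return round(100.0 * (local_age_adj - mean) / mean, 1)
```

Because both sides use age-adjusted prevalence, a retirement-heavy county isn't penalized for its age structure; the crude rate still appears as the headline tile value.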
Why these five measures. They map most directly to the air-pollution surfaces already on the page: asthma and COPD are the canonical PM2.5 / ozone-adjacent respiratory outcomes; CHD is the canonical air-pollution cardiovascular outcome; diabetes co-varies with the same socioeconomic structure that predicts pollution exposure; frequent mental distress captures the broader psychosocial cost of living next to industrial sites. The full PLACES catalog has 40 measures — we deliberately picked the smallest editorially defensible set rather than a wall of tiles.
What we do NOT render.
- State-level health tiles. Aggregated to a whole state, chronic-disease prevalence is too smoothed out to be editorially interesting — every state lands within a narrow band of the national mean. State-level PLACES values appear only as the comparator on county and city tiles.
- Tract-level rollups onto county or city pages. PLACES does publish at tract level, but rolling those tract values up to a coarser geography forces a population-weighting decision that PLACES already made at the coarser-geography level — re-doing it would just introduce noise. We use PLACES-published county and place data directly.
- Mortality at city level. CDC WONDER's actual cancer / cardiovascular mortality is published at county only. A city-page tile would have to fall back to the containing county, which would surprise readers comparing two cities in the same county. Mortality is on the roadmap as a county-only addition; we do not force-fit it onto city pages.
- Causal lag analyses. “PM2.5 in 2010 predicts cancer in 2025” is the most-clicked-on chart on environmental-health sites and the most likely to be misread as causal. We do not render lag charts on per-place pages. If we ever do, it will be on a separate “correlations” surface with the methodology disclaimer at the top, not nested inside a county page.
Reference. CDC PLACES program landing · County dataset on chronicdata.cdc.gov · Place dataset · BRFSS source survey.
Data Rights And Attribution
Every dataset on this site is a federal public-domain work under 17 USC §105. There is no licensing fee, royalty, or commercial-use restriction on the underlying observations. The operational rules are about API civility and attribution norms, not licensing.
- We attribute every source on the methodology page and in per-page footers, with the originating agency named and a link to the canonical EPA dataset.
- We do not paywall federal data behind a login or subscription.
- We do not relabel federal observations as proprietary. What we add is normalization, anomaly detection, narrative, and presentation.
- We use bulk downloads where available rather than scraping APIs, to minimize impact on EPA infrastructure.