Tuning Thresholds for False Positive Reduction

The threshold-tuning stage is the deterministic noise filter that converts normalized cost deltas into high-confidence PASS / WARN / FAIL verdicts, suppressing the volatile, low-signal alerts that otherwise erode trust in an automated regression pipeline.

This stage lives inside the Regression Detection & Rule Engines subsystem, immediately after delta computation and immediately before CI gating. It consumes a normalized delta vector, applies a fixed cascade of statistical boundary conditions, and emits a single classified verdict payload. It never captures plans, never queries the source database for cost data, and never triggers remediation. Its one job is to decide, reproducibly, whether an observed change is real degradation or noise — so that a FAIL verdict downstream is worth stopping a deploy for.

Architectural Boundaries

The evaluation function must be pure with respect to its input payload: identical delta vectors must always produce identical verdicts, independent of deployment epoch, worker identity, or wall-clock time. The only external state the stage reads is an append-only exponentially weighted moving average (EWMA) baseline keyed by query fingerprint — and that read is idempotent, never mutated inside the verdict path.

Upstream, this stage consumes the output of Tracking Cost Deltas Across Baseline Versions, which guarantees the incoming delta is already normalized against schema drift and aligned by the deterministic plan fingerprint. Where a query fans out across multiple relations, the delta arrives pre-weighted by the weighted cost delta method so a regression on a high-cardinality scan is not diluted by cheap operations.

Downstream, the emitted verdict fans out to three consumers: the CI/CD gate, the alert router, and the plan-stabilization controller. The stage sets the initial PASS/WARN/FAIL band and a confidence score; consumers apply their own policy on top. Verdicts are cross-referenced with structural signals such as join-type shifts and index-usage regressions before final routing priority is assigned.

A NormalizedDeltaVector enters a pure evaluator whose ordered cascade — execution floor, EWMA smoothing, variance-adjusted z-score, percentile boundary — reads only a read-only prior from the EWMA Baseline Store, then emits one ClassificationPayload that fans out to the CI Gate, Alert Router, and Plan Stabilizer.

Any deviation from this contract — embedding raw plan parsing, opening a write transaction, or holding stateful cooldown timers inside the evaluator — breaks determinism and reintroduces the exact non-reproducible alert states this stage exists to eliminate.

Deterministic Routing and Schema Enforcement

Every verdict is a function of a strictly typed input contract. The evaluator rejects any payload that fails schema validation rather than guessing at defaults, because a malformed delta silently coerced to zero is the single most common source of false negatives.

The input delta vector is validated against this JSON Schema before evaluation:

JSON

{
  "$id": "https://queryplan.org/schemas/normalized-delta-vector.json",
  "type": "object",
  "additionalProperties": false,
  "required": ["query_hash", "baseline_cost", "observed_cost", "execution_count", "historical_variance"],
  "properties": {
    "query_hash":          { "type": "string", "pattern": "^[a-f0-9]{64}$" },
    "baseline_cost":       { "type": "number", "exclusiveMinimum": 0 },
    "observed_cost":       { "type": "number", "minimum": 0 },
    "execution_count":     { "type": "integer", "minimum": 0 },
    "historical_variance": { "type": "number", "minimum": 0 }
  }
}

The emitted classification payload is equally rigid: a verdict enum of PASS, WARN, or FAIL; a confidence float in [0.0, 1.0]; a rule_id naming the exact cascade branch that fired (EXEC_FLOOR, BASELINE_STABLE, MARGIN_DEGRADATION, THRESHOLD_EXCEEDED, FALLBACK_DEFAULT); and a routing tag set. Because the query_hash is a 64-character SHA-256 fingerprint, sharding across evaluator workers uses a stable partition key derived directly from it:

partition_key = int(query_hash[:8], 16) % SHARD_COUNT

This formula guarantees that all deltas for a given query fingerprint land on the same worker and therefore read the same EWMA baseline row, eliminating cross-shard drift without any coordination. The cascade itself is a strict, ordered sequence — a query that is suppressed by the execution floor never reaches the z-score branch:

Execution Floor Suppression — queries below min_exec_count (50) are passed unconditionally, keeping cold-start and infrequent batch jobs out of the statistical baseline.
EWMA Smoothing — a per-fingerprint moving average with ewma_alpha of 0.2 absorbs transient spikes from cache misses or lock contention while preserving sensitivity to sustained degradation.
Variance-Adjusted Z-Score — the boundary scales with observed standard deviation, so high-variance analytical queries get wider tolerance bands than tight OLTP paths.
Percentile Boundary — a rolling 95th-percentile floor replaces static thresholds so the band tracks workload evolution instead of fighting it.

Production-Ready Implementation

The evaluator below is an asyncio service. It reads the EWMA baseline through an asyncpg pool, instruments every evaluation with OpenTelemetry spans and metrics, and emits structured logs through structlog. The verdict path is pure; the only I/O is the read-only baseline fetch and an idempotent EWMA upsert performed after the verdict is decided.

PYTHON

from __future__ import annotations

import math
import os
from dataclasses import dataclass, field
from typing import Literal

import asyncpg
import structlog
from opentelemetry import metrics, trace

log = structlog.get_logger(__name__)
tracer = trace.get_tracer(__name__)
meter = metrics.get_meter(__name__)

EVAL_DURATION = meter.create_histogram("threshold_eval_duration_ms", unit="ms")
VERDICT_TOTAL = meter.create_counter("threshold_verdict_total", unit="1")
FALLBACK_TOTAL = meter.create_counter("threshold_fallback_triggered_total", unit="1")

# --- Fixed cascade constants (mirror thresholds.yaml) ---------------------
MIN_EXEC_COUNT = 50
EWMA_ALPHA = 0.2
Z_SCORE_THRESHOLD = 2.5
MAX_COST_DELTA_PCT = 15.0
QUARANTINE_Z = 5.0
FALLBACK_CONFIDENCE = 0.45


@dataclass(frozen=True)
class NormalizedDeltaVector:
    query_hash: str
    baseline_cost: float
    observed_cost: float
    execution_count: int
    historical_variance: float


@dataclass
class ClassificationPayload:
    verdict: Literal["PASS", "WARN", "FAIL"]
    confidence: float
    rule_id: str
    routing_tags: list[str]
    metadata: dict = field(default_factory=dict)


PASS_TAGS = ["stable", "monitor"]
WARN_TAGS = ["degraded", "review_required"]
FAIL_TAGS = ["critical", "auto_rollback_candidate"]


class ThresholdEvaluator:
    """Stateless verdict engine over a shared, read-mostly EWMA baseline."""

    def __init__(self, pool: asyncpg.Pool) -> None:
        self._pool = pool

    async def _prior_ewma(self, query_hash: str) -> float | None:
        row = await self._pool.fetchrow(
            "SELECT ewma_delta_pct FROM threshold_ewma WHERE query_hash = $1",
            query_hash,
        )
        return None if row is None else float(row["ewma_delta_pct"])

    async def _persist_ewma(self, query_hash: str, ewma: float) -> None:
        # Idempotent upsert; runs only after the verdict is fixed.
        await self._pool.execute(
            """
            INSERT INTO threshold_ewma (query_hash, ewma_delta_pct, updated_at)
            VALUES ($1, $2, now())
            ON CONFLICT (query_hash)
            DO UPDATE SET ewma_delta_pct = EXCLUDED.ewma_delta_pct,
                          updated_at = now()
            """,
            query_hash,
            ewma,
        )

    async def evaluate(self, delta: NormalizedDeltaVector) -> ClassificationPayload:
        with tracer.start_as_current_span("threshold.evaluate") as span:
            span.set_attribute("query_hash", delta.query_hash)
            try:
                # 1. Execution-floor suppression — never touches the baseline.
                if delta.execution_count < MIN_EXEC_COUNT:
                    VERDICT_TOTAL.add(1, {"verdict": "PASS", "rule_id": "EXEC_FLOOR"})
                    return ClassificationPayload(
                        "PASS", 0.90, "EXEC_FLOOR", PASS_TAGS, {"suppressed": True}
                    )

                if delta.baseline_cost <= 0:
                    raise ValueError("baseline_cost must be > 0")

                cost_delta_pct = (
                    (delta.observed_cost - delta.baseline_cost) / delta.baseline_cost
                ) * 100.0

                std_dev = math.sqrt(delta.historical_variance)
                z_score = (
                    (delta.observed_cost - delta.baseline_cost) / std_dev
                    if std_dev > 0
                    else 0.0
                )

                # 2. Quarantine obviously corrupt inputs instead of forcing a verdict.
                if abs(z_score) >= QUARANTINE_Z and std_dev == 0.0:
                    raise ValueError("statistical divergence with zero variance")

                # 3. EWMA smoothing against the persisted per-fingerprint prior.
                prior = await self._prior_ewma(delta.query_hash)
                if prior is None:
                    # Unprofiled query: refuse a silent PASS, default to WARN.
                    span.set_attribute("cold_baseline", True)
                    await self._persist_ewma(delta.query_hash, cost_delta_pct)
                    return ClassificationPayload(
                        "WARN", FALLBACK_CONFIDENCE, "MARGIN_DEGRADATION",
                        WARN_TAGS, {"cold_baseline": True, "z_score": z_score},
                    )

                smoothed = (EWMA_ALPHA * cost_delta_pct) + ((1 - EWMA_ALPHA) * prior)

                # 4. Deterministic cascade over the smoothed delta and z-score.
                if smoothed <= MAX_COST_DELTA_PCT and abs(z_score) < Z_SCORE_THRESHOLD:
                    verdict, confidence, rule_id, tags = "PASS", 0.95, "BASELINE_STABLE", PASS_TAGS
                elif smoothed <= MAX_COST_DELTA_PCT * 1.5 and abs(z_score) < Z_SCORE_THRESHOLD * 1.2:
                    verdict, confidence, rule_id, tags = "WARN", 0.75, "MARGIN_DEGRADATION", WARN_TAGS
                else:
                    verdict, confidence, rule_id, tags = "FAIL", 0.98, "THRESHOLD_EXCEEDED", FAIL_TAGS

                await self._persist_ewma(delta.query_hash, smoothed)
                VERDICT_TOTAL.add(1, {"verdict": verdict, "rule_id": rule_id})
                span.set_attribute("verdict", verdict)
                return ClassificationPayload(
                    verdict, confidence, rule_id, tags,
                    {"z_score": z_score, "smoothed_delta_pct": smoothed},
                )

            except Exception as exc:  # noqa: BLE001 — fail closed, never crash the gate
                FALLBACK_TOTAL.add(1)
                span.record_exception(exc)
                log.error(
                    "threshold_eval_failed",
                    query_hash=delta.query_hash,
                    error=str(exc),
                )
                return ClassificationPayload(
                    "WARN", FALLBACK_CONFIDENCE, "FALLBACK_DEFAULT",
                    WARN_TAGS, {"error": str(exc)},
                )


async def build_pool() -> asyncpg.Pool:
    return await asyncpg.create_pool(
        dsn=os.environ["THRESHOLD_DSN"],
        min_size=2,
        max_size=int(os.environ.get("THRESHOLD_POOL_MAX", "10")),
        command_timeout=5.0,
    )

Two properties make this safe under load. First, the fallback path returns WARN rather than raising, so a corrupt delta or a transient database blip never crashes the CI gate — it fails closed to a reviewable state. Second, cold baselines (no prior EWMA row) never emit a silent PASS; an unprofiled query is always surfaced as a WARN so a genuinely regressed new query cannot slip through unmeasured.

Threshold Reference: Exact SLOs and Alert Rules

The stage carries its own operational SLOs independent of the verdicts it produces. These are the numbers alerting is wired against — they define when the evaluator itself is misbehaving.

Signal	Metric	Pass	Warn	Block
Evaluation latency	`threshold_eval_duration_ms` p95	≤ 25 ms	25–75 ms	> 75 ms
Fallback rate	`threshold_fallback_triggered_total` / total	< 0.5 %	0.5–2 %	> 2 %
Cold-baseline rate	share of `cold_baseline` verdicts	< 5 %	5–15 %	> 15 %
`FAIL` verdict share	`threshold_verdict_total{verdict="FAIL"}`	< 3 %	3–8 %	> 8 %
Baseline read latency	`asyncpg` fetchrow p99	≤ 8 ms	8–20 ms	> 20 ms

The FAIL share bands are the practical false-positive dial: a sustained FAIL share above 8 % almost always means max_cost_delta_pct or z_score_threshold is set too tight for the workload, not that the fleet suddenly regressed. Encode that as a Prometheus alert:

YAML

groups:
  - name: threshold-evaluator
    rules:
      - alert: ThresholdFalsePositiveStorm
        expr: |
          (
            sum(rate(threshold_verdict_total{verdict="FAIL"}[15m]))
            /
            sum(rate(threshold_verdict_total[15m]))
          ) > 0.08
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "FAIL verdict share above 8% for 10m — thresholds likely too tight"
          runbook: "Widen z_score_threshold to 3.0 or raise max_cost_delta_pct before re-gating."

      - alert: ThresholdEvaluatorFallbacks
        expr: |
          sum(rate(threshold_fallback_triggered_total[5m]))
          /
          sum(rate(threshold_verdict_total[5m])) > 0.02
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Evaluator falling back to WARN >2% — check schema validation and DSN health"

For turning these static bands into workload-adaptive boundaries, pair this reference with Setting Dynamic Thresholds for Query Regression Alerts, which drives the percentile floors this stage reads.

Failure Scenarios and Root Cause Analysis

1. False-positive storm after a statistics refresh. Symptom: FAIL share jumps from ~2 % to ~30 % within minutes of an ANALYZE across the fleet. Root cause: a global cost re-estimation shifts every baseline at once, and the EWMA prior lags the new reality. Diagnose by correlating the spike against the stats timestamp:

SQL

SELECT relname, last_analyze
FROM pg_stat_user_tables
WHERE last_analyze > now() - interval '30 minutes'
ORDER BY last_analyze DESC;

Mitigation: gate a fleet-wide EWMA reset behind the stats event so priors are re-seeded from the new plans rather than fighting them.

2. Silent false negatives from zero-variance inputs. Symptom: obvious regressions pass as BASELINE_STABLE. Root cause: historical_variance arrives as 0, collapsing the z-score branch to 0.0 and letting the smoothed-delta branch dominate. Diagnose by counting zero-variance deltas:

BASH

grep -c '"historical_variance": 0' /var/log/threshold/deltas.jsonl

Mitigation: the schema already forbids negatives; add a producer-side check that emits variance from at least 5 historical samples, and treat single-sample fingerprints as cold baselines (forced WARN).

3. Cross-shard EWMA divergence. Symptom: the same query_hash yields different verdicts on retries. Root cause: SHARD_COUNT changed without draining, so deltas for one fingerprint now hash to a different worker reading a stale prior. Diagnose by inspecting the partition assignment for the fingerprint and comparing updated_at across replicas. Mitigation: treat SHARD_COUNT as immutable at runtime; re-shard only through a full drain-and-reseed of threshold_ewma.

4. Fallback flood from DSN saturation. Symptom: threshold_fallback_triggered_total climbs while verdicts collapse to WARN. Root cause: the asyncpg pool is exhausted, _prior_ewma times out at command_timeout, and every evaluation trips the fallback path. Diagnose with:

SQL

SELECT state, count(*) FROM pg_stat_activity
WHERE application_name = 'threshold_evaluator'
GROUP BY state;

Mitigation: raise THRESHOLD_POOL_MAX, add a read replica for the baseline store, and confirm the fallback stays WARN (never PASS) so the flood degrades safely.

5. Cooldown leakage breaking determinism. Symptom: replaying an identical delta produces a different verdict than the original run. Root cause: someone added a time-based cooldown inside the evaluator, violating purity. Mitigation: move all cooldown and dedupe logic into the downstream alert router; the evaluator must remain a function of its typed input plus the read-only prior only.

Configuration Reference

Key / Env Var	Default	Purpose
`min_exec_count`	`50`	Execution floor below which queries pass unconditionally.
`ewma_alpha`	`0.2`	Weight of the current observation vs. the persisted prior.
`z_score_threshold`	`2.5`	Standard-deviation boundary for the stable band.
`max_cost_delta_pct`	`15.0`	Smoothed cost-delta ceiling for a `PASS` verdict.
`percentile_floor`	`0.95`	Rolling percentile that replaces static latency floors.
`QUARANTINE_Z`	`5.0`	Divergence beyond which a zero-variance input is quarantined.
`fallback_confidence`	`0.45`	Confidence stamped on `FALLBACK_DEFAULT` and cold-baseline verdicts.
`THRESHOLD_DSN`	—	Connection string for the read-mostly EWMA baseline store.
`THRESHOLD_POOL_MAX`	`10`	Upper bound on the `asyncpg` pool; raise under fallback pressure.
`SHARD_COUNT`	—	Immutable-at-runtime worker count in the partition-key formula.

Widen z_score_threshold and max_cost_delta_pct first when chasing false positives; tighten min_exec_count and ewma_alpha when chasing false negatives. Never adjust both directions in the same change — a single-variable move keeps the calibration reproducible against replayed traffic.

Tracking Cost Deltas Across Baseline Versions — the upstream comparator that produces the normalized delta vector this stage consumes.
Detecting Join Type Shifts in Execution Plans — the structural signal cross-referenced when escalating a FAIL verdict.
Monitoring Index Usage Changes for Regression Signals — a sibling detector whose findings raise routing priority.
Setting Dynamic Thresholds for Query Regression Alerts — how the percentile floors and adaptive bands are derived.
Defining Regression Thresholds for Query Plans — the baseline threshold model this tuning stage refines.

← Back to Regression Detection & Rule Engines

Architectural Boundaries #

Deterministic Routing and Schema Enforcement #

Production-Ready Implementation #

Threshold Reference: Exact SLOs and Alert Rules #

Failure Scenarios and Root Cause Analysis #

Configuration Reference #

Related #