Defining Regression Thresholds for Query Plans

The threshold evaluation stage is the deterministic decision gate that turns normalized baseline metrics into a graded regression verdict, sitting strictly between plan capture and any operational response.

In an automated query performance system this stage owns exactly one responsibility: consume a fingerprinted plan plus its historical baseline, apply mathematically bounded evaluation functions, and emit a routable verdict. It does not capture plans, compute hashes, or recommend indexes. Keeping that boundary sharp is what guarantees idempotent computation, deterministic routing, and explicit failure modes — the three properties that stop a regression pipeline from decaying into alert fatigue or false-positive rollbacks. This page defines the input contract, the scoring model, a runnable async evaluator, the numeric service-level objectives it must hold, and the failure modes you will actually page on in production.

Architectural Boundaries

The evaluation stage consumes a strictly typed payload emitted by upstream capture and normalization: a stable plan signature, the historical baseline distribution for that signature, the current execution metrics, and the normalized cost vector. It emits a single graded verdict plus the original telemetry snapshot and a checksum, published to one of three routing destinations.

Everything the stage depends on is produced elsewhere. The plan signature arrives from the SHA-256 fingerprinting approach in Plan Hashing Algorithms for SQL Engines; the cost fields are already engine-normalized by Cost Estimation Mapping Across PostgreSQL and MySQL; and the raw execution plans themselves are collected by the Automated EXPLAIN Capture & Storage Workflows pipeline. Because normalization has already happened, the evaluator can remain stateless and horizontally scalable — no cache lookups, no network calls, no engine-specific branching inside the hot path.

Downstream, the verdict feeds the routing tiers described later on this page and, ultimately, the rule engines in Regression Detection Rule Engines that act on sustained signals. The stage never reaches back upstream, and it never performs the remediation itself — that separation is what makes historical telemetry safe to replay for threshold tuning.

Normalized inputs pass a schema gate — malformed payloads divert to quarantine — then latency, cost, and structural scores fold into one weighted composite that routes deterministically to P0_CRITICAL, P1_DEGRADED, or STABLE.

Deterministic Routing and Schema Enforcement

Input validation is the first line of determinism. A payload missing baseline percentiles, carrying an incomplete cost vector, or bearing an unverified execution timestamp is rejected before any arithmetic runs and routed to a quarantine queue for manual reconciliation — never silently defaulted. The contract is enforced as a JSON Schema at ingress:

JSON

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "ThresholdEvaluationPayload",
  "type": "object",
  "additionalProperties": false,
  "required": ["plan_hash", "baseline_window_id", "current_p99_ms",
               "baseline_p99_ms", "cost_deviation_pct", "structural_hash_match",
               "captured_at"],
  "properties": {
    "plan_hash":            { "type": "string", "pattern": "^[0-9a-f]{64}$" },
    "baseline_window_id":   { "type": "string", "minLength": 8 },
    "current_p99_ms":       { "type": "number", "exclusiveMinimum": 0 },
    "baseline_p99_ms":      { "type": "number", "exclusiveMinimum": 0 },
    "cost_deviation_pct":   { "type": "number", "minimum": 0 },
    "structural_hash_match":{ "type": "boolean" },
    "captured_at":          { "type": "string", "format": "date-time" }
  }
}
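
As a rough illustration of the reject-before-arithmetic rule, the sketch below mirrors the schema's constraints by hand using only the standard library. It is not a JSON Schema validator, and a real deployment would more likely hand the schema to a validator library. The names `validate_payload` and `REQUIRED`, and the error-list return shape, are assumptions made for this example.

```python
import re
from datetime import datetime

REQUIRED = (
    "plan_hash", "baseline_window_id", "current_p99_ms", "baseline_p99_ms",
    "cost_deviation_pct", "structural_hash_match", "captured_at",
)


def _is_number(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_payload(payload: dict) -> list[str]:
    """Return contract violations; an empty list means the payload is accepted."""
    missing = [key for key in REQUIRED if key not in payload]
    if missing:
        return [f"missing required field: {key}" for key in missing]

    # additionalProperties is false, so unknown keys are violations too.
    errors = [f"unexpected field: {key}" for key in sorted(set(payload) - set(REQUIRED))]

    plan_hash = payload["plan_hash"]
    if not (isinstance(plan_hash, str) and re.fullmatch(r"[0-9a-f]{64}", plan_hash)):
        errors.append("plan_hash must be 64 lowercase hex characters")

    window_id = payload["baseline_window_id"]
    if not (isinstance(window_id, str) and len(window_id) >= 8):
        errors.append("baseline_window_id must be a string of at least 8 characters")

    for key in ("current_p99_ms", "baseline_p99_ms"):
        if not (_is_number(payload[key]) and payload[key] > 0):
            errors.append(f"{key} must be a number greater than 0")

    cost = payload["cost_deviation_pct"]
    if not (_is_number(cost) and cost >= 0):
        errors.append("cost_deviation_pct must be a number >= 0")

    if not isinstance(payload["structural_hash_match"], bool):
        errors.append("structural_hash_match must be a boolean")

    try:
        # Approximates "format": "date-time"; a trailing Z is mapped to +00:00
        # because datetime.fromisoformat rejects it before Python 3.11.
        datetime.fromisoformat(payload["captured_at"].replace("Z", "+00:00"))
    except (AttributeError, TypeError, ValueError):
        errors.append("captured_at must be an ISO 8601 date-time string")

    return errors
```

A non-empty result would send the payload to quarantine together with the listed violations, so the scoring arithmetic never sees a partially valid input.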

Routing is a pure function of the composite score. Verdicts map to three destinations, and the partition key that fans work across evaluator shards is derived only from immutable fields so that identical payloads always land on the same shard and produce the same result:

partition_key = crc32(plan_hash[:8] + ":" + baseline_window_id) % SHARD_COUNT

The three destinations are a critical queue that can trigger an immediate block or rollback, a degraded queue that opens an investigation ticket, and a stable path that promotes the current metrics into the rolling baseline. Because routing derives solely from the payload and an immutable threshold configuration, the same telemetry replayed a year later yields the same queue assignment — the property that makes regression forensics and threshold back-testing trustworthy.
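
The partition formula above can be sketched in a few lines, assuming `zlib.crc32` as the CRC-32 implementation and UTF-8 encoding of the key material (the page does not specify either). The function name `partition_key` and the default shard count of 16 are chosen for this example.

```python
import zlib

SHARD_COUNT = 16  # assumed default for this sketch


def partition_key(plan_hash: str, baseline_window_id: str,
                  shard_count: int = SHARD_COUNT) -> int:
    """CRC-32 over the first 8 hash characters plus the window id, modulo shards."""
    material = f"{plan_hash[:8]}:{baseline_window_id}".encode("utf-8")
    # zlib.crc32 returns an unsigned 32-bit integer in Python 3.
    return zlib.crc32(material) % shard_count


shard = partition_key("ab" * 32, "win-2024-01")
assert shard == partition_key("ab" * 32, "win-2024-01")  # same input, same shard
```

A stable checksum matters here: Python's built-in `hash()` is salted per process for strings, so it would assign the same payload to different shards across restarts.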

Threshold Taxonomy and Deterministic Evaluation

Effective detection relies on multi-dimensional thresholds rather than a single latency comparison. The evaluator scores three orthogonal signal classes and folds them into one composite:

Execution latency — percentile bounds (P50, P90, P99) taken from the rolling baseline window. A regression accrues score as the current P99 exceeds the baseline, scaled by a configurable sigma multiplier so noisy queries do not trip on ordinary variance.
Cost estimation deviation — the relative gap between baseline and current normalized cost. Because the cost vector is already dimensionless when it reaches this stage, the threshold compares like with like rather than raw PostgreSQL planner units against MySQL cost_info values.
Plan structural stability — a boolean derived from the plan signature that captures node reordering, join-method substitution, or access-path changes. Structural drift is the strongest early signal of a genuine regression, so a mismatch applies a fixed penalty independent of the timing signals.

Each class yields a normalized deviation score; the stage aggregates them with fixed weights (0.5 latency, 0.3 cost, 0.2 structural) and applies hard cutoffs at 0.5 (degraded) and 1.0 (critical). This weighting keeps latency dominant while a structural break still contributes a fixed 0.16 (weight 0.2 × penalty 0.8) to the composite. That is not enough on its own to cross the 0.5 degraded floor, but it lowers the latency or cost drift needed to reach it. Calibrating the sigma value and window length against query volatility is its own discipline, documented in Setting Dynamic Thresholds for Query Regression Alerts.
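
To make the aggregation concrete, here is a minimal sketch of the composite arithmetic, assuming the default sigma (2.5), cost ceiling (15%), and structural penalty (0.8) used by the evaluator later on this page. The function name `composite_score` is introduced for this example only.

```python
def composite_score(latency_ratio: float, cost_deviation_pct: float,
                    structural_match: bool, sigma: float = 2.5,
                    cost_max_pct: float = 15.0, penalty: float = 0.8) -> float:
    """Weighted sum of the three signal scores (0.5 / 0.3 / 0.2)."""
    latency_score = max(0.0, (latency_ratio - 1.0) / sigma)   # unbounded above
    cost_score = min(1.0, cost_deviation_pct / cost_max_pct)  # clamped to [0, 1]
    structural = 0.0 if structural_match else penalty         # fixed on mismatch
    return 0.5 * latency_score + 0.3 * cost_score + 0.2 * structural


# Structural break only: 0.2 * 0.8
print(round(composite_score(1.0, 0.0, False), 2))   # 0.16
# Structural break plus a saturated cost score: 0.16 + 0.30
print(round(composite_score(1.0, 15.0, False), 2))  # 0.46
# Latency only, P99 at 3.5x baseline: 0.5 * 1.0
print(round(composite_score(3.5, 0.0, True), 2))    # 0.5
# Latency only, P99 at 6x baseline: 0.5 * 2.0
print(round(composite_score(6.0, 0.0, True), 2))    # 1.0
```

Because the cost score is clamped while the latency score is not, only latency can push the composite past 1.0 by itself under these defaults.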

Production-Ready Implementation

The evaluator below is stateless, async-safe, and instrumented. It runs under asyncio, emits structured logs through structlog, and traces every decision boundary through OpenTelemetry. The happy path derives all outputs from the payload alone; the exception path fails conservatively into a DEGRADED verdict rather than suppressing a possible regression.

PYTHON

import asyncio
from dataclasses import dataclass, asdict

import structlog
from opentelemetry import trace, metrics

log = structlog.get_logger("query.regression.evaluator")
tracer = trace.get_tracer("query.regression.evaluator")
meter = metrics.get_meter("query.regression.evaluator")

composite_score_gauge = meter.create_gauge("threshold.composite_score")
routing_decision_counter = meter.create_counter("threshold.routing_decision")
fallback_counter = meter.create_counter("threshold.fallback_invocation")

# Immutable production thresholds. Version these alongside the binary.
THRESHOLDS = {
    "latency_sigma": 2.5,
    "cost_deviation_max_pct": 15.0,
    "structural_mismatch_penalty": 0.8,
    "critical_composite_floor": 1.0,
    "degraded_composite_floor": 0.5,
    "weight_latency": 0.5,
    "weight_cost": 0.3,
    "weight_structural": 0.2,
}


@dataclass(frozen=True)
class EvaluationPayload:
    plan_hash: str
    baseline_window_id: str
    current_p99_ms: float
    baseline_p99_ms: float
    cost_deviation_pct: float
    structural_hash_match: bool


@dataclass(frozen=True)
class EvaluationResult:
    plan_hash: str
    status: str            # STABLE | DEGRADED | CRITICAL
    composite_score: float
    routing_queue: str     # P0_CRITICAL | P1_DEGRADED | STABLE
    fallback_triggered: bool = False


async def evaluate_regression(payload: EvaluationPayload) -> EvaluationResult:
    with tracer.start_as_current_span("evaluate_query_plan_thresholds") as span:
        span.set_attribute("plan.hash", payload.plan_hash)
        span.set_attribute("baseline.window_id", payload.baseline_window_id)
        try:
            if payload.baseline_p99_ms <= 0:
                raise ValueError("invalid baseline P99")

            # 1. Latency deviation, normalized to baseline and scaled by sigma.
            latency_ratio = payload.current_p99_ms / payload.baseline_p99_ms
            latency_score = max(0.0, (latency_ratio - 1.0) / THRESHOLDS["latency_sigma"])

            # 2. Cost deviation, clamped to [0, 1].
            cost_score = min(1.0, payload.cost_deviation_pct / THRESHOLDS["cost_deviation_max_pct"])

            # 3. Structural stability penalty (fixed on mismatch).
            structural_penalty = (
                0.0 if payload.structural_hash_match
                else THRESHOLDS["structural_mismatch_penalty"]
            )

            # 4. Weighted composite aggregation.
            composite = (
                THRESHOLDS["weight_latency"] * latency_score
                + THRESHOLDS["weight_cost"] * cost_score
                + THRESHOLDS["weight_structural"] * structural_penalty
            )

            # 5. Deterministic routing.
            if composite >= THRESHOLDS["critical_composite_floor"]:
                status, queue = "CRITICAL", "P0_CRITICAL"
            elif composite >= THRESHOLDS["degraded_composite_floor"]:
                status, queue = "DEGRADED", "P1_DEGRADED"
            else:
                status, queue = "STABLE", "STABLE"

            composite_score_gauge.set(composite, {"plan_hash": payload.plan_hash})
            routing_decision_counter.add(1, {"queue": queue})
            result = EvaluationResult(payload.plan_hash, status, composite, queue)
            await log.ainfo("evaluated", **asdict(result))
            return result

        except (ValueError, ZeroDivisionError) as exc:
            # Conservative fallback: never suppress a possible regression.
            fallback_counter.add(1, {"reason": type(exc).__name__})
            await log.awarning("fallback", plan_hash=payload.plan_hash, error=str(exc))
            span.record_exception(exc)
            return EvaluationResult(
                plan_hash=payload.plan_hash,
                status="DEGRADED",
                composite_score=THRESHOLDS["degraded_composite_floor"] + 0.1,
                routing_queue="P1_DEGRADED",
                fallback_triggered=True,
            )


async def evaluate_batch(payloads: list[EvaluationPayload]) -> list[EvaluationResult]:
    """Fan out a capture window across the async pool; order is preserved."""
    return await asyncio.gather(*(evaluate_regression(p) for p in payloads))

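With the default weights and floors in the code above, latency alone reaches DEGRADED when P99 hits 3.5x baseline (latency_score = 1.0, composite = 0.5) and CRITICAL at 6x baseline (latency_score = 2.0, composite = 1.0). A structural break (0.16) plus a saturated cost score (0.30) totals 0.46, so it needs only a small latency rise (about 1.2x baseline) to land in DEGRADED. Fallback verdicts carry fallback_triggered=True so downstream automation can skip irreversible actions such as automated query blocking while still opening a human-in-the-loop ticket.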
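
One way a downstream consumer might honor the fallback flag is sketched below. The `Verdict` class is a hypothetical stand-in for the fields a consumer reads from `EvaluationResult`, and the action names are illustrative labels, not part of any real API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    # Hypothetical mirror of the EvaluationResult fields used for routing.
    status: str
    routing_queue: str
    fallback_triggered: bool = False


def plan_actions(verdict: Verdict) -> list[str]:
    """Map a verdict to action names, never blocking on a fallback verdict."""
    if verdict.fallback_triggered:
        # The score is a placeholder, so only reversible actions are allowed.
        return ["open_ticket"]
    if verdict.routing_queue == "P0_CRITICAL":
        return ["block_query", "open_ticket"]
    if verdict.routing_queue == "P1_DEGRADED":
        return ["open_ticket"]
    return ["promote_baseline"]


print(plan_actions(Verdict("DEGRADED", "P1_DEGRADED", fallback_triggered=True)))
# ['open_ticket']
```

Checking the flag before the queue keeps the rule intact even if a future fallback path were to emit a different queue.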

Threshold Table and Alerting SLOs

The evaluator itself has operational SLOs, and the verdicts it produces map to fixed action bands. Both are numeric — nothing here is “tune to taste.”

Signal / SLO	Pass	Warn	Block
Composite score	`< 0.5`	`≥ 0.5 and < 1.0`	`≥ 1.0`
Latency ratio (current P99 / baseline P99)	`< 1.35×`	`1.35× – 2.5×`	`> 2.5×`
Normalized cost deviation	`< 7.5%`	`7.5% – 15%`	`> 15%`
Structural signature match	match	—	mismatch
Evaluator p95 decision latency	`< 15 ms`	`15 – 40 ms`	`> 40 ms`
Fallback invocation rate (5-min)	`< 0.1%`	`0.1% – 1%`	`> 1%`
Quarantine (schema-reject) rate	`< 0.05%`	`0.05% – 0.5%`	`> 0.5%`
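
The three operational SLO rows can be read as simple range checks. The sketch below assumes the boundaries are inclusive on the Warn side, as the table's `<` and `>` signs suggest, and expresses rates as fractions (0.1% = 0.001). The names `slo_band`, `SLO_BANDS`, and `classify` are made up for this example.

```python
def slo_band(value: float, warn_from: float, block_above: float) -> str:
    """Pass below warn_from, Block above block_above, Warn in between."""
    if value > block_above:
        return "Block"
    if value >= warn_from:
        return "Warn"
    return "Pass"


# (warn_from, block_above) pairs copied from the table's operational rows.
SLO_BANDS = {
    "evaluator_p95_ms": (15.0, 40.0),
    "fallback_rate": (0.001, 0.01),
    "quarantine_rate": (0.0005, 0.005),
}


def classify(metric: str, value: float) -> str:
    warn_from, block_above = SLO_BANDS[metric]
    return slo_band(value, warn_from, block_above)


print(classify("evaluator_p95_ms", 12.0))  # Pass
print(classify("fallback_rate", 0.004))    # Warn
print(classify("quarantine_rate", 0.006))  # Block
```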

The operational SLOs are enforced by alerts on the emitted metrics. A representative Prometheus rule set:

YAML

groups:
  - name: threshold-evaluator
    rules:
      - alert: EvaluatorFallbackSpike
        expr: |
          sum(rate(threshold_fallback_invocation_total[5m]))
            / sum(rate(threshold_routing_decision_total[5m])) > 0.01
        for: 10m
        labels: { severity: page }
        annotations:
          summary: "Threshold fallback rate >1% — baseline data likely stale"
      - alert: EvaluatorDecisionLatencyHigh
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(threshold_decision_latency_seconds_bucket[5m]))) > 0.040
        for: 15m
        labels: { severity: ticket }
        annotations:
          summary: "Evaluator p95 decision latency above 40ms SLO"
      - alert: QuarantineRateHigh
        expr: |
          sum(rate(threshold_schema_reject_total[5m]))
            / sum(rate(threshold_ingress_total[5m])) > 0.005
        for: 10m
        labels: { severity: ticket }
        annotations:
          summary: "Schema-reject rate >0.5% — upstream contract drift"

Failure Scenarios and Root Cause Analysis

Four failure modes account for nearly every real page against this stage.

1. Stale baseline window. Symptom: fallback rate climbs and a burst of CRITICAL verdicts appears with no corresponding production change. Cause: the baseline promotion job stalled, so baseline_p99_ms reflects an old, faster distribution. Diagnose: SELECT max(captured_at) FROM baseline_windows WHERE window_id = $1; and compare against now(). Mitigate: gate promotions on freshness, and let the conservative fallback hold verdicts at DEGRADED until the window refreshes rather than auto-blocking.
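
A freshness gate for the stale-baseline case could look like the sketch below. It assumes the newest `captured_at` for the window has already been fetched and parsed into a timezone-aware `datetime`, and it reuses the 3600-second ceiling from the configuration reference. The function name `baseline_is_fresh` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

BASELINE_MAX_AGE_S = 3600  # mirrors the QP_BASELINE_MAX_AGE_S default


def baseline_is_fresh(latest_captured_at: datetime, now: datetime,
                      max_age_s: int = BASELINE_MAX_AGE_S) -> bool:
    """True when the newest baseline sample is within the freshness ceiling."""
    if latest_captured_at.tzinfo is None or now.tzinfo is None:
        # Mixing naive and aware timestamps would raise or silently mislead.
        raise ValueError("timestamps must be timezone-aware")
    return now - latest_captured_at <= timedelta(seconds=max_age_s)


now = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
print(baseline_is_fresh(now - timedelta(minutes=30), now))  # True
print(baseline_is_fresh(now - timedelta(hours=2), now))     # False
```

Passing `now` in as an argument, rather than reading the clock inside the function, keeps the check replayable against historical telemetry.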

2. Upstream contract drift. Symptom: the quarantine rate breaches 0.5% right after an upstream deploy. Cause: the capture or normalization stage changed a field name or dropped the cost vector. Diagnose: inspect a rejected sample against the JSON Schema with check-jsonschema --schemafile payload.schema.json sample.json. Mitigate: version the payload schema and fail the upstream CI gate on contract changes before they reach the evaluator.

3. Cold-start metric gaps. Symptom: a newly captured plan signature has no baseline and every evaluation falls back. Cause: the plan hash is genuinely new — a fresh query shape from a deploy. Diagnose: count fallbacks grouped by plan_hash age. Mitigate: route unknown signatures to a “baseline-establishing” path that observes N windows before it is eligible for CRITICAL, so a new query cannot trigger a rollback on its first appearance.
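
The cold-start mitigation amounts to capping severity until enough history exists. A minimal sketch, assuming the caller knows how many baseline windows have been observed for the signature; the value 5 for N and the name `cap_for_cold_start` are placeholders, not values taken from this page.

```python
MIN_BASELINE_WINDOWS = 5  # hypothetical N; tune per workload


def cap_for_cold_start(status: str, windows_observed: int,
                       min_windows: int = MIN_BASELINE_WINDOWS) -> str:
    """Downgrade CRITICAL to DEGRADED while the baseline is still forming."""
    if status == "CRITICAL" and windows_observed < min_windows:
        return "DEGRADED"
    return status


print(cap_for_cold_start("CRITICAL", windows_observed=1))  # DEGRADED
print(cap_for_cold_start("CRITICAL", windows_observed=5))  # CRITICAL
print(cap_for_cold_start("STABLE", windows_observed=0))    # STABLE
```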

4. Sigma miscalibration on volatile queries. Symptom: a high false-positive rate concentrated on a handful of analytical queries. Cause: a global latency_sigma of 2.5 is too tight for inherently bursty workloads. Diagnose: replay the last week of telemetry through the evaluator and inspect the composite-score distribution per signature. Mitigate: apply per-signature sigma overrides, use the adaptive-windowing technique detailed in Setting Dynamic Thresholds for Query Regression Alerts, and cross-check the result against Tuning Thresholds for False-Positive Reduction.
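
A replay for the sigma diagnosis might summarize the latency ratio per signature and then look up an override. The sketch below assumes replayed samples arrive as `(plan_hash, current_p99_ms, baseline_p99_ms)` tuples. The helper names and the override table are hypothetical, and how an override value is chosen is left to the calibration guides.

```python
from collections import defaultdict
from statistics import quantiles


def ratio_p95_by_signature(samples) -> dict[str, float]:
    """95th percentile of current/baseline P99 ratio for each plan hash."""
    ratios = defaultdict(list)
    for plan_hash, current_ms, baseline_ms in samples:
        if baseline_ms > 0:
            ratios[plan_hash].append(current_ms / baseline_ms)
    # quantiles() needs at least two points; n=20 makes the last cut the p95.
    return {
        plan_hash: quantiles(values, n=20, method="inclusive")[-1]
        for plan_hash, values in ratios.items()
        if len(values) >= 2
    }


def sigma_for(plan_hash: str, overrides: dict[str, float],
              default: float = 2.5) -> float:
    """Per-signature sigma if one is configured, else the global default."""
    return overrides.get(plan_hash, default)


replay = [
    ("steady", 100, 100), ("steady", 105, 100), ("steady", 95, 100), ("steady", 102, 100),
    ("bursty", 100, 100), ("bursty", 300, 100), ("bursty", 80, 100), ("bursty", 400, 100),
]
spread = ratio_p95_by_signature(replay)
assert spread["bursty"] > spread["steady"]
```

Signatures whose p95 ratio sits far above the rest are the candidates for an entry in the override table.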

Configuration Reference

All tuning knobs are immutable at runtime and deployed alongside the evaluator binary so that a verdict is always reproducible from a known configuration version.

Key / Env var	Default	Purpose
`QP_LATENCY_SIGMA`	`2.5`	Divisor applied to the excess latency ratio (current P99 / baseline P99 − 1); higher tolerates more variance
`QP_COST_DEVIATION_MAX_PCT`	`15.0`	Cost deviation that maps to a full cost_score of 1.0
`QP_STRUCTURAL_PENALTY`	`0.8`	Fixed penalty added when the plan signature mismatches
`QP_CRITICAL_FLOOR`	`1.0`	Composite score at or above which a verdict is `CRITICAL`
`QP_DEGRADED_FLOOR`	`0.5`	Composite score at or above which a verdict is `DEGRADED`
`QP_SHARD_COUNT`	`16`	Partition count for the `partition_key` fan-out formula
`QP_BASELINE_MAX_AGE_S`	`3600`	Freshness ceiling; older windows force the fallback path
`QP_OTEL_EXPORTER_ENDPOINT`	—	OTLP collector endpoint for traces and metrics
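
One way to read these variables once at startup and freeze them is sketched below, using a frozen dataclass. The class and function names are assumptions for this example, and the sanity checks are suggestions rather than documented behavior.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class EvaluatorConfig:
    latency_sigma: float
    cost_deviation_max_pct: float
    structural_penalty: float
    critical_floor: float
    degraded_floor: float
    shard_count: int
    baseline_max_age_s: int


def load_config(env=None) -> EvaluatorConfig:
    """Build the config from environment variables, falling back to defaults."""
    env = os.environ if env is None else env
    cfg = EvaluatorConfig(
        latency_sigma=float(env.get("QP_LATENCY_SIGMA", 2.5)),
        cost_deviation_max_pct=float(env.get("QP_COST_DEVIATION_MAX_PCT", 15.0)),
        structural_penalty=float(env.get("QP_STRUCTURAL_PENALTY", 0.8)),
        critical_floor=float(env.get("QP_CRITICAL_FLOOR", 1.0)),
        degraded_floor=float(env.get("QP_DEGRADED_FLOOR", 0.5)),
        shard_count=int(env.get("QP_SHARD_COUNT", 16)),
        baseline_max_age_s=int(env.get("QP_BASELINE_MAX_AGE_S", 3600)),
    )
    # Both values are used as divisors or ceilings, so they must be positive.
    if cfg.latency_sigma <= 0 or cfg.cost_deviation_max_pct <= 0:
        raise ValueError("sigma and cost ceiling must be positive")
    if not cfg.degraded_floor < cfg.critical_floor:
        raise ValueError("degraded floor must be below critical floor")
    if cfg.shard_count < 1:
        raise ValueError("shard count must be at least 1")
    return cfg


config = load_config({"QP_LATENCY_SIGMA": "3.0"})
print(config.latency_sigma, config.shard_count)  # 3.0 16
```

Passing a plain dict as `env` makes the loader easy to exercise in tests without touching the process environment.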

Because every output is a pure function of the payload and these immutable values, the stage guarantees idempotency: no external state, no cache lookups, and no network calls occur during computation. Downstream consumers receive the verdict, the original telemetry snapshot, and a checksum of the input — enough to support audit trails, compliance reporting, and automated rollback verification without ever re-reading upstream state.
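
The page does not say how the input checksum is computed. As one plausible approach, the sketch below hashes a canonical JSON encoding of the payload with SHA-256; both the algorithm choice and the function name `payload_checksum` are assumptions.

```python
import hashlib
import json


def payload_checksum(payload: dict) -> str:
    """SHA-256 hex digest over a key-sorted, whitespace-free JSON encoding."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


a = payload_checksum({"plan_hash": "ab" * 32, "current_p99_ms": 120.0})
b = payload_checksum({"current_p99_ms": 120.0, "plan_hash": "ab" * 32})
assert a == b  # key order does not affect the digest
```

Sorting keys before hashing is what lets an auditor recompute the digest from a stored snapshot regardless of how the producer ordered the fields.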

Related

Plan Hashing Algorithms for SQL Engines: produces the structural signature this stage consumes.
Cost Estimation Mapping Across PostgreSQL and MySQL: normalizes the cost vector before evaluation.
Setting Dynamic Thresholds for Query Regression Alerts: calibrating sigma and window length for volatile queries.
Tuning Thresholds for False-Positive Reduction: sibling techniques for cutting noisy verdicts.
Tracking Cost Deltas Across Baseline Versions: how downstream engines act on sustained signals.
