Guide
Defining Regression Thresholds for Query Plans
In automated query performance pipelines, the regression evaluation stage operates as the deterministic decision gate between raw telemetry ingestion and operational response. Defining Regression Thresholds for Query Plans requires strict isolation from upstream capture mechanisms and downstream remediation workflows. This stage consumes normalized baseline metrics, applies mathematically bounded evaluation functions, and emits binary or graded regression signals. The architecture must guarantee idempotent threshold computation, deterministic routing, and explicit failure modes to prevent alert fatigue or false-positive rollbacks.
Stage Isolation & Input Contract
The threshold evaluation stage accepts a strictly typed payload: plan identifiers, historical baseline distributions, current execution metrics, and cost model outputs. It does not perform plan capture, hash computation, or index recommendation. Those responsibilities belong to upstream modules. The evaluation boundary begins only after a stable plan signature has been resolved and persisted. By decoupling threshold logic from Core Architecture & Baselining Fundamentals, the pipeline ensures that metric normalization and threshold computation remain stateless and horizontally scalable. Input validation rejects payloads missing baseline percentiles, incomplete cost vectors, or unverified execution timestamps, routing malformed records to a quarantine queue for manual reconciliation.
Threshold Taxonomy & Deterministic Evaluation
Effective regression detection relies on multi-dimensional thresholds rather than single-point latency comparisons. The evaluation engine applies three primary threshold classes:
- Execution Latency Thresholds: Percentile-based bounds (P50, P90, P99) derived from rolling baseline windows. A regression triggers when current metrics exceed the upper confidence interval by a configurable sigma multiplier.
- Cost Estimation Thresholds: Relative deviation limits between estimated and actual execution costs. Cross-engine normalization requires careful mapping, particularly when evaluating Cost Estimation Mapping Across PostgreSQL and MySQL. Thresholds here account for optimizer-specific cost scaling factors rather than raw numerical equality.
- Plan Stability Thresholds: Structural deviation metrics that measure node reordering, join method substitution, or access path changes. These thresholds rely on deterministic plan signatures, typically generated via Plan Hashing Algorithms for SQL Engines, to isolate structural regressions from transient performance variance.
Each threshold class produces a normalized deviation score. The evaluation stage aggregates these scores using a weighted composite function, applying hard cutoffs for critical regressions and soft thresholds for advisory signals. Routing logic maps composite scores to specific operational queues: P0_CRITICAL (immediate rollback/block), P1_DEGRADED (investigation ticket), and P0_STABLE (baseline update).
Implementation Blueprint
Production-grade threshold evaluation requires idempotent execution, explicit observability instrumentation, and deterministic fallback protocols. The following Python implementation demonstrates a stateless evaluator with OpenTelemetry tracing, metric emission, and circuit-breaker fallbacks.
import time
import logging
from dataclasses import dataclass
from typing import Optional, Dict
from opentelemetry import trace, metrics
tracer = trace.get_tracer("query.regression.evaluator")
meter = metrics.get_meter("query.regression.evaluator")
latency_deviation_counter = meter.create_counter("threshold.latency_deviation")
fallback_active_gauge = meter.create_up_down_counter("threshold.fallback_active")
@dataclass(frozen=True)
class EvaluationPayload:
plan_hash: str
current_p99_ms: float
baseline_p99_ms: float
cost_deviation_pct: float
structural_hash_match: bool
@dataclass(frozen=True)
class EvaluationResult:
status: str # STABLE, DEGRADED, CRITICAL
composite_score: float
routing_queue: str
fallback_triggered: bool = False
# Production thresholds (configurable via feature flags or config management)
THRESHOLDS = {
"latency_sigma": 2.5,
"cost_deviation_max_pct": 15.0,
"structural_mismatch_penalty": 0.8,
"critical_composite_floor": 1.0,
"degraded_composite_floor": 0.5
}
def evaluate_regression(payload: EvaluationPayload) -> EvaluationResult:
with tracer.start_as_current_span("evaluate_query_plan_thresholds") as span:
span.set_attribute("plan.hash", payload.plan_hash)
try:
# 1. Latency deviation (normalized to baseline)
if payload.baseline_p99_ms <= 0:
raise ValueError("Invalid baseline P99")
latency_ratio = payload.current_p99_ms / payload.baseline_p99_ms
latency_score = max(0.0, (latency_ratio - 1.0) / THRESHOLDS["latency_sigma"])
# 2. Cost estimation deviation
cost_score = min(1.0, payload.cost_deviation_pct / THRESHOLDS["cost_deviation_max_pct"])
# 3. Structural stability
structural_penalty = THRESHOLDS["structural_mismatch_penalty"] if not payload.structural_hash_match else 0.0
# 4. Composite aggregation (weighted)
composite = (0.5 * latency_score) + (0.3 * cost_score) + (0.2 * structural_penalty)
# 5. Deterministic routing
if composite >= THRESHOLDS["critical_composite_floor"]:
status, queue = "CRITICAL", "P0_CRITICAL"
elif composite >= THRESHOLDS["degraded_composite_floor"]:
status, queue = "DEGRADED", "P1_DEGRADED"
else:
status, queue = "STABLE", "P0_STABLE"
latency_deviation_counter.add(latency_score, {"plan_hash": payload.plan_hash})
return EvaluationResult(status=status, composite_score=composite, routing_queue=queue)
except Exception as e:
# Safe fallback: default to DEGRADED to prevent silent failures
logging.error(f"Threshold evaluation failed for {payload.plan_hash}: {e}")
fallback_active_gauge.add(1)
return EvaluationResult(
status="DEGRADED",
composite_score=0.6,
routing_queue="P1_DEGRADED",
fallback_triggered=True
)Observability Hooks & Safe Fallback Protocols
Threshold evaluation must emit structured telemetry at every decision boundary. The evaluator should publish threshold.composite_score as a gauge, threshold.routing_decision as a counter, and threshold.fallback_invocation for circuit-breaker tracking. Tracing spans must propagate the plan_hash and baseline_window_id to enable correlation with upstream ingestion and downstream remediation.
Safe fallback protocols activate when baseline data is stale, missing, or mathematically invalid. In these scenarios, the evaluator defaults to a conservative DEGRADED state rather than suppressing alerts. This prevents silent regressions during cold starts or metric pipeline outages. Fallback states are explicitly tagged in routing payloads, allowing downstream automation to bypass irreversible actions (e.g., automated query blocking) while still triggering human-in-the-loop review.
For advanced alerting configurations, teams should implement adaptive windowing that adjusts sigma multipliers based on query volatility. Detailed strategies for calibrating these parameters are documented in Setting Dynamic Thresholds for Query Regression Alerts. Additionally, all threshold definitions must be version-controlled and deployed alongside the evaluation binary to maintain auditability. When integrating with external monitoring stacks, align metric naming conventions with established standards such as the OpenTelemetry Semantic Conventions to ensure cross-platform compatibility.
Routing Determinism & Idempotency Guarantees
The evaluation stage guarantees idempotency by deriving all outputs solely from the input payload and immutable threshold configurations. No external state, cache lookups, or network calls are permitted during computation. Routing decisions are deterministic: identical payloads always yield identical composite scores and queue assignments. This property enables safe replay of historical telemetry for threshold tuning and regression forensics.
Downstream consumers receive payloads containing the evaluation result, original telemetry snapshot, and a cryptographic checksum of the input. This design supports audit trails, compliance reporting, and automated rollback verification. By enforcing strict boundaries around threshold computation, platform teams eliminate race conditions, reduce false-positive rates, and establish a reliable foundation for autonomous query performance management.
For deeper reference on optimizer behavior and cost model interpretation, consult the official PostgreSQL EXPLAIN documentation to align threshold expectations with actual execution engine semantics.