Runbook
Setting Dynamic Thresholds for Query Regression Alerts
Static latency or execution-time thresholds fail in production because query performance is inherently non-linear. Data volume shifts, parameter sniffing, index fragmentation, and optimizer statistics updates all introduce variance that fixed alerting rules cannot absorb. Setting Dynamic Thresholds for Query Regression Alerts requires a baseline-driven, statistically adaptive approach that correlates plan stability with execution metrics. This runbook operationalizes the principles established in Core Architecture & Baselining Fundamentals and provides deterministic workflows for detection, triage, and safe override.
Telemetry Capture & Multi-Signal Correlation
Production regressions rarely manifest as isolated metric spikes. They appear as correlated deviations across execution profiles. To eliminate false positives, telemetry must be captured at the query-plan fingerprint level, not the raw statement level. Implement the following signal collection pipeline using OpenTelemetry database instrumentation and eBPF-based wait-state tracers:
- Plan Hash Instability: Monitor
plan_hash_value(SQL Server) orqueryid/planid(PostgreSQL). Critical paths tolerate0drift. Any hash change triggers immediate baseline re-evaluation. - Logical Read Explosion: Track
buffer_gets(Oracle/SQL Server) orshared_blks_hit + shared_blks_read(PostgreSQL). Threshold:>35%deviation from baseline on identical parameter distributions. - Execution Time Variance: Monitor
p95_durationagainst a rolling 7-day median. Threshold: sustained crossing of1.35xfor consecutive execution windows (typically 5-minute intervals). - Row Examination Ratio: Calculate
rows_examined / rows_returned. Threshold:>10:1for OLTP point-lookups,>50:1for analytical range scans. - Wait Profile Shift: Detect sudden dominance of
PAGEIOLATCH,CPU, orLOCKwaits whereASYNC_NETWORK_IOorNETWORKpreviously dominated.
Correlation Rule: Alerts fire only when signals breach thresholds within a 15-minute sliding window. Single-metric deviations route to a low-priority telemetry bucket for trend analysis, not incident response.
Dynamic Threshold Calculation Engine
Fixed multipliers ignore seasonal traffic patterns and gradual data growth. Replace them with an Exponentially Weighted Moving Average (EWMA) baseline paired with Median Absolute Deviation (MAD) for variance normalization. The following Python engine computes adaptive upper/lower bounds:
import numpy as np
import pandas as pd
from scipy.stats import median_abs_deviation
def compute_dynamic_thresholds(metric_series: pd.Series,
ewma_alpha: float = 0.15,
mad_multiplier: float = 2.5) -> dict:
"""
Computes adaptive thresholds for query regression detection.
metric_series: Time-indexed series of execution metrics (e.g., p95_duration_ms)
"""
# 1. EWMA Baseline (smooths gradual data growth)
ewma_baseline = metric_series.ewm(alpha=ewma_alpha, adjust=False).mean()
# 2. Rolling MAD (robust to outliers, unlike standard deviation)
rolling_mad = metric_series.rolling(window=1008).apply(median_abs_deviation, raw=True)
# 3. Dynamic Upper Bound (adaptive ceiling)
upper_bound = ewma_baseline + (mad_multiplier * rolling_mad)
# 4. Hysteresis buffer prevents alert flapping during recovery
lower_bound = ewma_baseline - (0.5 * rolling_mad)
return {
"baseline": ewma_baseline.iloc[-1],
"upper_threshold": upper_bound.iloc[-1],
"lower_threshold": lower_bound.iloc[-1],
"current_variance_pct": ((metric_series.iloc[-1] - ewma_baseline.iloc[-1]) / ewma_baseline.iloc[-1]) * 100
}Deploy this engine as a sidecar or scheduled Lambda/Cloud Function that ingests metrics from your time-series database. Persist computed thresholds in a version-controlled configuration store to enable auditability and rollback.
Deterministic Root Cause Analysis
When dynamic thresholds trigger, execute the following triage sequence to isolate degradation vectors. Do not proceed to remediation until the root cause is classified.
Step 1: Verify Plan Fingerprint
Compare the current plan_hash against the registry baseline. A mismatch confirms optimizer re-evaluation (join order change, scan type flip, or parallelism adjustment).
Step 2: Validate Statistics Freshness
Stale column statistics on high-cardinality fields frequently force nested-loop joins over hash joins.
PostgreSQL:
SELECT relname, last_analyze, last_autoanalyze, n_mod_since_analyze
FROM pg_stat_user_tables
WHERE relname = 'target_table'
ORDER BY n_mod_since_analyze DESC;SQL Server:
SELECT name, last_updated, modification_counter, rows
FROM sys.stats s
JOIN sys.dm_db_stats_properties(s.object_id, s.stats_id) sp ON s.object_id = sp.object_id
WHERE object_id = OBJECT_ID('dbo.target_table');Mitigation: Trigger ANALYZE or UPDATE STATISTICS with FULLSCAN if n_mod_since_analyze > 10% of total rows.
Step 3: Evaluate Parameter Sniffing
Compare cached plan parameters against current execution parameters. A >20% deviation in filtered column selectivity invalidates cardinality estimates.
Mitigation: Implement OPTION (RECOMPILE) (SQL Server) or plan_cache_mode = force_generic_plan (PostgreSQL) for highly skewed parameter sets. Alternatively, use query store plan forcing.
Step 4: Correlate Infrastructure Contention
Cross-reference CPU_ready, I/O latency, and memory pressure with the execution window. Use Prometheus Recording Rules to pre-aggregate node-level resource saturation. If infrastructure metrics breach 85% utilization, classify the regression as environmental, not query-level. Route to platform capacity alerts instead of query optimization queues.
Observability Integration & Alert Routing
Wire the threshold engine into your alerting pipeline using multi-signal correlation and hysteresis to prevent alert fatigue.
Prometheus Alertmanager Configuration:
groups:
- name: query_regression_dynamic
rules:
- record: job:query_p95_duration:ewma
expr: ewma_over_time(query_p95_duration_ms[7d], 0.15)
- alert: QueryPlanRegressionDetected
expr: |
(query_logical_reads > job:query_logical_reads:ewma * 1.35)
and
(query_p95_duration_ms > job:query_p95_duration:ewma * 1.35)
and
(count_over_time(query_plan_hash_changes[15m]) > 0)
for: 5m
labels:
severity: critical
team: db-sre
annotations:
summary: "Dynamic threshold breach: "
description: "Plan regression detected. Execute RCA runbook. Baseline: ms"Route alerts to PagerDuty/Opsgenie with a 15-minute cooldown and require explicit acknowledgment before auto-escalation. Tag alerts with query_id, plan_hash, and threshold_type to enable automated runbook attachment.
Safe Fallback Chains & CI/CD Automation
Never allow automated threshold detection to trigger untested schema changes or aggressive query rewrites. Implement a tiered fallback chain that prioritizes plan stability over aggressive optimization.
Fallback Hierarchy
- Plan Guide / SPM Forcing: Lock the known-good execution plan using SQL Server Query Store or PostgreSQL
pg_hint_plan. - Parameterized Query Rewrite: Force generic plans or add
OPTIMIZE FOR UNKNOWNto neutralize sniffing. - Targeted Index Maintenance: Rebuild fragmented indexes (
>30%fragmentation) or add covering indexes for highrows_examinedqueries. - Compute Scale-Out: Last resort. Add read replicas or increase memory allocation only after plan-level mitigations fail.
CI/CD Gate Integration
Integrate threshold validation into your deployment pipeline. Block merges that introduce plan regressions beyond 1.2x baseline in staging.
# ci_regression_gate.py
import requests
import sys
def validate_plan_regression(query_id: str, threshold_url: str) -> bool:
"""Blocks CI/CD if staging execution exceeds dynamic thresholds."""
resp = requests.get(f"{threshold_url}/api/v1/thresholds/{query_id}")
data = resp.json()
if data["current_variance_pct"] > 20.0:
print(f"BLOCKED: Query {query_id} exceeds dynamic threshold by {data['current_variance_pct']:.1f}%")
print(f"Action: Attach plan guide or optimize join predicates before merge.")
return False
return True
if __name__ == "__main__":
if not validate_plan_regression(sys.argv[1], sys.argv[2]):
sys.exit(1)For production emergencies, maintain a plan_override_registry with time-bound TTLs. Auto-expire forced plans after 72 hours and re-evaluate against fresh statistics. Document every override in the incident post-mortem to refine baseline calculations.