Setting Dynamic Thresholds for Query Regression Alerts

Static latency ceilings fail in production because query performance is non-linear: data volume shifts, parameter sniffing, index fragmentation, and optimizer statistics refreshes all inject variance that a fixed rule cannot absorb. Setting dynamic thresholds means computing a per-fingerprint baseline that adapts to seasonality and drift, then alerting only when several correlated signals breach it at once. This runbook is the operational layer beneath the threshold evaluation stage in Defining Regression Thresholds for Query Plans: where that page defines the scoring contract, this one shows how to derive the numeric bounds, wire them into alerting, and gate deploys against them. It sits within the broader Core Architecture & Baselining Fundamentals reference architecture and consumes the fingerprints emitted by the SHA-256 plan hashing approach.

Symptom Identification and Production Thresholds

Dynamic-threshold alerts must key on the plan fingerprint, not the raw statement, and fire only on correlated deviation. The following breach conditions are the exact numeric bands to encode. A single band crossing is a trend signal, not an incident.

Plan hash instability: any change to queryid/planid (PostgreSQL pg_stat_statements) or query_hash/query_plan_hash (SQL Server) on a critical-path fingerprint. Tolerance is zero drift — one hash change triggers immediate baseline re-evaluation.
Logical read explosion: shared_blks_hit + shared_blks_read (PostgreSQL) or buffer_gets (Oracle) exceeding $1.35\times$ the rolling baseline on an identical parameter distribution.
Execution-time variance: p95_duration_ms sustaining a crossing of $1.35\times$ the EWMA baseline for $\ge 3$ consecutive 5-minute windows.
Row examination ratio: rows_examined / rows_returned above 10:1 for OLTP point-lookups or above 50:1 for analytical range scans.
Wait-profile shift: sudden dominance of PAGEIOLATCH, CPU, or LOCK waits where ASYNC_NETWORK_IO/NETWORK previously dominated.
Correlation gate: an incident fires only when $\ge 2$ of signals 1–5 breach within a 15-minute sliding window. Single-signal deviations route to a low-priority telemetry bucket. This gate is the single most effective lever for reducing false positives.

Capturing these signals cheaply matters: sample plans through the low-overhead path described in capturing EXPLAIN plans without impacting production performance so the telemetry collection does not itself become the regression.

Root Cause Analysis

When the correlation gate fires, classify the degradation before remediating. Four failure domains account for nearly all dynamic-threshold breaches.

Fingerprint drift (optimizer re-plan)

A plan_hash mismatch against the registry baseline confirms the optimizer chose a new shape — a join-order flip, a scan-type change, or a parallelism adjustment. This is the domain covered in depth by detecting join-type shifts in execution plans.

SQL

-- PostgreSQL: confirm the live plan id no longer matches the stored baseline
SELECT queryid, planid, calls, mean_exec_time
FROM pg_stat_statements
WHERE queryid = $1
ORDER BY mean_exec_time DESC;

Stale statistics

Out-of-date column statistics on high-cardinality fields push the planner toward nested-loop joins over hash joins.

SQL

-- PostgreSQL: modifications since last analyze
SELECT relname, last_analyze, last_autoanalyze, n_mod_since_analyze
FROM pg_stat_user_tables
WHERE relname = 'target_table'
ORDER BY n_mod_since_analyze DESC;

SQL

-- SQL Server: staleness and modification counter
SELECT s.name, sp.last_updated, sp.modification_counter, sp.rows
FROM sys.stats s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) sp
WHERE s.object_id = OBJECT_ID('dbo.target_table');

Run ANALYZE (PostgreSQL) or UPDATE STATISTICS ... WITH FULLSCAN (SQL Server) when n_mod_since_analyze exceeds 10% of total rows.

Parameter sniffing

Compare the cached plan’s parameters against the live execution parameters. A > 20% deviation in filtered-column selectivity invalidates the cardinality estimate and is the classic cause of a bimodal latency distribution.

Infrastructure contention

Cross-reference CPU_ready, I/O latency, and memory pressure against the execution window. If node-level saturation breaches 85% utilization, the regression is environmental — route it to capacity alerts, not the query optimization queue, so the cost-delta signals stay clean.

Step-by-Step Remediation

1. Compute adaptive bounds

Replace fixed multipliers with an Exponentially Weighted Moving Average (EWMA) baseline paired with Median Absolute Deviation (MAD) for outlier-robust variance. The engine runs async, emits an OpenTelemetry span per fingerprint, and logs structured events so every computed bound is auditable.

PYTHON

import numpy as np
import pandas as pd
import structlog
from opentelemetry import trace
from scipy.stats import median_abs_deviation

log = structlog.get_logger(__name__)
tracer = trace.get_tracer("queryplan.dynamic_thresholds")


async def compute_dynamic_thresholds(
    query_id: str,
    metric_series: pd.Series,
    ewma_alpha: float = 0.15,
    mad_multiplier: float = 2.5,
    mad_window: int = 1008,  # 7 days of 10-minute samples
) -> dict:
    """Compute adaptive upper/lower bounds for one plan fingerprint.

    ewma_alpha 0.15 weights history heavily, suiting slowly drifting OLTP.
    mad_multiplier 2.5 sets the ceiling in MAD units above the EWMA baseline.
    """
    if metric_series.empty:
        raise ValueError("metric_series must not be empty")

    with tracer.start_as_current_span("compute_dynamic_thresholds") as span:
        span.set_attribute("query_id", query_id)

        ewma = metric_series.ewm(alpha=ewma_alpha, adjust=False).mean()
        rolling_mad = metric_series.rolling(
            window=mad_window, min_periods=30
        ).apply(median_abs_deviation, raw=True)

        upper = ewma + (mad_multiplier * rolling_mad)
        lower = ewma - (0.5 * rolling_mad)  # hysteresis buffer stops flapping

        current = float(metric_series.iloc[-1])
        base = float(ewma.iloc[-1])
        variance_pct = ((current - base) / base) * 100 if base else 0.0

        result = {
            "query_id": query_id,
            "baseline": base,
            "upper_threshold": float(upper.iloc[-1]),
            "lower_threshold": float(lower.iloc[-1]),
            "current_variance_pct": variance_pct,
        }
        span.set_attribute("variance_pct", variance_pct)
        await log.ainfo("thresholds.computed", **result)
        return result

Expected output for a healthy fingerprint (p95_duration_ms):

JSON

{"query_id": "q_8f21", "baseline": 42.7, "upper_threshold": 71.4,
 "lower_threshold": 35.1, "current_variance_pct": 4.3}

Persist each result to a version-controlled configuration store so bounds are diffable and revertable.

2. Wire the bounds into alerting

PromQL has no built-in ewma_over_time, so export the engine’s computed baseline as a gauge and reference it from a recording rule. The alert enforces the correlation gate from the symptom section.

YAML

groups:
  - name: query_regression_dynamic
    rules:
      - record: job:query_p95_duration:baseline
        expr: query_dynamic_baseline_ms   # gauge exported by the Python engine

      - alert: QueryPlanRegressionDetected
        expr: |
          (query_p95_duration_ms > job:query_p95_duration:baseline * 1.35)
          and
          (query_logical_reads > job:query_logical_reads:baseline * 1.35)
          and
          (increase(query_plan_hash_changes_total[15m]) > 0)
        for: 5m
        labels: { severity: critical, team: db-sre }
        annotations:
          summary: "Dynamic threshold breach: "
          description: "Two-signal correlation + plan drift. Run the RCA runbook."

Route to PagerDuty/Opsgenie with a 15-minute cooldown and require explicit acknowledgment before auto-escalation. Tag every alert with query_id, plan_hash, and threshold_type for automated runbook attachment.

3. Gate deploys and keep a bounded override path

Block merges whose staging execution exceeds $1.2\times$ baseline, and apply plan-stability fixes before any aggressive rewrite: force the known-good plan (Query Store plan forcing or pg_hint_plan), then neutralize sniffing (OPTIMIZE FOR UNKNOWN / force_generic_plan), then targeted index maintenance, and only last resort compute scale-out.

PYTHON

import httpx
import sys


async def validate_plan_regression(query_id: str, base_url: str) -> bool:
    """Fail the CI gate if staging variance exceeds the dynamic bound."""
    async with httpx.AsyncClient(timeout=10.0) as client:
        resp = await client.get(f"{base_url}/api/v1/thresholds/{query_id}")
        resp.raise_for_status()
        data = resp.json()

    if data["current_variance_pct"] > 20.0:
        print(
            f"BLOCKED: {query_id} exceeds dynamic threshold by "
            f"{data['current_variance_pct']:.1f}%. Attach a plan guide "
            f"or optimize join predicates before merge."
        )
        return False
    return True

For production emergencies, keep a plan_override_registry with time-bound TTLs: auto-expire forced plans after 72 hours and re-evaluate against fresh statistics, documenting every override in the post-mortem so the baseline math improves over time.

Verification Checklist

[ ] EWMA baseline and MAD band recompute on the same cadence as the metric scrape interval (default 10 minutes).
[ ] The exported query_dynamic_baseline_ms gauge is non-null for every critical-path query_id.
[ ] The alert rule requires $\ge 2$ correlated signals plus plan drift, not a single-metric crossing.
[ ] Alert cooldown is set to 15 minutes and acknowledgment is required before escalation.
[ ] The CI gate blocks at $1.2\times$ baseline and the endpoint returns within the 10s client timeout.
[ ] Every forced plan in plan_override_registry carries a TTL $\le 72 hours$ and a linked ticket.
[ ] Computed thresholds are written to a version-controlled store and diff cleanly against the prior run.

Compatibility and Engine-Specific Notes

The signal names differ per engine; the adaptive math does not.

Signal	PostgreSQL	MySQL	Distributed SQL
Fingerprint source	`pg_stat_statements.queryid` / `planid`	`events_statements_summary_by_digest.DIGEST`	CockroachDB `crdb_internal.node_statement_statistics.fingerprint_id`
Logical reads	`shared_blks_hit + shared_blks_read`	`SUM_ROWS_EXAMINED` (P_S)	per-node `rows read` in statement bundle
Plan capture	`EXPLAIN (FORMAT JSON)`	`EXPLAIN FORMAT=JSON`	`EXPLAIN ANALYZE (DISTSQL)`
Stats refresh	`ANALYZE`	`ANALYZE TABLE`	automatic; `CREATE STATISTICS` for manual
Plan pinning	`pg_hint_plan`	optimizer hints / `SET STATEMENT`	not generally available — rely on stats + baselines

MySQL exposes no stable per-statement plan hash, so derive a fingerprint from the normalized DIGEST plus a hash of the EXPLAIN FORMAT=JSON tree rather than trusting a single engine field. Distributed engines shard reads across nodes, so aggregate signals 2 and 4 across all participating nodes before comparing against the baseline, or a rebalancing event will read as a regression.

Defining Regression Thresholds for Query Plans — the parent stage that defines the scoring contract these bounds feed.
Tuning Thresholds for False-Positive Reduction — how the correlation gate and hysteresis buffer are tuned against real alert volume.
Tracking Cost Deltas Across Baseline Versions — where breached signals become durable regression scores.
Plan Hashing Algorithms for SQL Engines — the fingerprint source this runbook keys every threshold on.

← Back to Core Architecture & Baselining Fundamentals

Symptom Identification and Production Thresholds #

Root Cause Analysis #

Fingerprint drift (optimizer re-plan) #

Stale statistics #

Parameter sniffing #

Infrastructure contention #

Step-by-Step Remediation #

1. Compute adaptive bounds #

2. Wire the bounds into alerting #

3. Gate deploys and keep a bounded override path #

Verification Checklist #

Compatibility and Engine-Specific Notes #

Related #