Guide
Tuning Thresholds for False Positive Reduction
In production query plan tracking pipelines, the distance between actionable signal and alert fatigue is measured in threshold configuration. Tuning Thresholds for False Positive Reduction operates as a discrete, deterministic evaluation stage within the broader Regression Detection & Rule Engines architecture. This stage consumes normalized performance deltas, applies configurable boundary conditions, and emits classified regression verdicts. It does not capture execution plans, compute raw cost metrics, or trigger remediation workflows. Its sole responsibility is to translate continuous telemetry into discrete, high-confidence signals by eliminating noise through statistically grounded boundary definitions.
Stage Isolation and Data Flow Contract
Strict stage isolation is mandatory for deterministic behavior. The threshold evaluation stage accepts a strictly typed input payload: a normalized delta vector containing baseline cost, observed cost, query hash, execution count, and historical variance. It emits a classification payload containing verdict (PASS, WARN, FAIL), confidence score, triggered rule identifier, and routing tags.
Upstream dependencies are limited to the cost normalization layer, which must guarantee idempotent aggregation windows, timezone-aware sampling, and outlier clipping. Downstream consumers include alert routers, CI/CD gating pipelines, and plan stabilization controllers. Any deviation from this contract—such as embedding raw plan parsing, direct database query execution, or stateful cooldown logic within the threshold engine—breaks pipeline determinism and introduces non-reproducible alert states. The evaluation function must remain pure: identical inputs must always produce identical outputs, regardless of deployment epoch or infrastructure topology.
Statistical Foundations and Boundary Architecture
False positives in regression detection typically stem from three sources: volatile execution counts, seasonal workload shifts, and rigid static boundaries. Effective Tuning Thresholds for False Positive Reduction requires moving beyond fixed percentage deltas. Production-grade implementations layer multiple evaluation strategies into a deterministic cascade:
- Execution Floor Suppression: Queries below a configurable
min_exec_countare bypassed entirely, preventing cold-start artifacts or infrequent batch jobs from skewing statistical baselines. - Exponentially Weighted Moving Averages (EWMA): Transient spikes are smoothed using a configurable alpha coefficient (typically
0.15–0.30), preserving sensitivity to sustained degradation while ignoring ephemeral cache misses or lock contention. - Variance-Adjusted Z-Scores: Thresholds scale proportionally to observed standard deviation. High-variance OLAP queries receive wider tolerance bands, while OLTP workloads maintain tight boundaries.
- Dynamic Percentile Boundaries: Rolling 95th-percentile latency or cost baselines replace static floors, adapting organically to workload evolution.
These strategies operate in strict sequence. The upstream Tracking Cost Deltas Across Baseline Versions layer guarantees that the delta vector entering this stage is already normalized against schema drift and plan hash collisions, allowing the threshold engine to focus exclusively on statistical classification.
Implementation Blueprint: Deterministic Threshold Evaluation
The following configuration and implementation demonstrate a production-ready threshold engine. It prioritizes type safety, deterministic evaluation, and explicit fallback behavior.
Configuration Schema (thresholds.yaml)
evaluation:
min_exec_count: 50
ewma_alpha: 0.2
z_score_threshold: 2.5
percentile_floor: 0.95
max_cost_delta_pct: 15.0
routing:
pass_tags: ["stable", "monitor"]
warn_tags: ["degraded", "review_required"]
fail_tags: ["critical", "auto_rollback_candidate"]
fallback_verdict: "WARN"
fallback_confidence: 0.45Evaluation Engine (threshold_evaluator.py)
from __future__ import annotations
import math
import logging
from dataclasses import dataclass, field
from typing import Literal, Optional
from opentelemetry import metrics, trace
logger = logging.getLogger(__name__)
tracer = trace.get_tracer(__name__)
meter = metrics.get_meter(__name__)
eval_duration = meter.create_histogram("threshold_eval_duration_ms", unit="ms")
verdict_counter = meter.create_counter("threshold_verdict_total", unit="1")
fallback_counter = meter.create_counter("threshold_fallback_triggered_total", unit="1")
@dataclass(frozen=True)
class NormalizedDeltaVector:
query_hash: str
baseline_cost: float
observed_cost: float
execution_count: int
historical_variance: float
@dataclass
class ClassificationPayload:
verdict: Literal["PASS", "WARN", "FAIL"]
confidence: float
rule_id: str
routing_tags: list[str]
evaluation_metadata: dict = field(default_factory=dict)
class ThresholdEvaluator:
def __init__(self, config: dict):
self.cfg = config["evaluation"]
self.routing = config["routing"]
def evaluate(self, delta: NormalizedDeltaVector) -> ClassificationPayload:
with tracer.start_as_current_span("evaluate_threshold") as span:
span.set_attribute("query_hash", delta.query_hash)
try:
# 1. Execution floor suppression
if delta.execution_count < self.cfg["min_exec_count"]:
return ClassificationPayload(
verdict="PASS", confidence=0.9, rule_id="EXEC_FLOOR",
routing_tags=self.routing["pass_tags"],
evaluation_metadata={"suppressed": True}
)
# 2. Compute delta metrics
cost_delta_pct = ((delta.observed_cost - delta.baseline_cost) / delta.baseline_cost) * 100
std_dev = math.sqrt(delta.historical_variance)
z_score = (delta.observed_cost - delta.baseline_cost) / std_dev if std_dev > 0 else 0.0
# 3. EWMA smoothing (simulated for single-window evaluation)
smoothed_delta = cost_delta_pct * self.cfg["ewma_alpha"] + (1 - self.cfg["ewma_alpha"]) * 0.0
# 4. Deterministic cascade evaluation
if smoothed_delta <= self.cfg["max_cost_delta_pct"] and abs(z_score) < self.cfg["z_score_threshold"]:
verdict, confidence, rule_id = "PASS", 0.95, "BASELINE_STABLE"
tags = self.routing["pass_tags"]
elif smoothed_delta <= self.cfg["max_cost_delta_pct"] * 1.5 and abs(z_score) < self.cfg["z_score_threshold"] * 1.2:
verdict, confidence, rule_id = "WARN", 0.75, "MARGIN_DEGRADATION"
tags = self.routing["warn_tags"]
else:
verdict, confidence, rule_id = "FAIL", 0.98, "THRESHOLD_EXCEEDED"
tags = self.routing["fail_tags"]
verdict_counter.add(1, {"verdict": verdict, "rule_id": rule_id})
return ClassificationPayload(verdict, confidence, rule_id, tags, {"z_score": z_score, "smoothed_delta": smoothed_delta})
except Exception as e:
logger.error("Threshold evaluation failed for %s: %s", delta.query_hash, e, exc_info=True)
fallback_counter.add(1)
return ClassificationPayload(
self.routing["fallback_verdict"],
self.routing["fallback_confidence"],
"FALLBACK_DEFAULT",
self.routing["warn_tags"],
{"error": str(e)}
)Routing Logic, Observability, and Safe Fallback Protocols
The classification payload drives deterministic routing. PASS verdicts flow to baseline archival and telemetry aggregation. WARN verdicts trigger asynchronous review queues and attach to deployment dashboards. FAIL verdicts route to CI/CD gating pipelines and plan stabilization controllers, potentially halting automated rollouts.
When integrating with structural plan analysis, threshold verdicts should be cross-referenced with Detecting Join Type Shifts in Execution Plans. A FAIL threshold verdict paired with a confirmed join algorithm shift (e.g., Hash Join → Nested Loop) elevates routing priority to CRITICAL and bypasses standard cooldown windows.
Observability Hooks
- Metrics:
threshold_eval_duration_mstracks evaluation latency;threshold_verdict_totalprovides verdict distribution for alert tuning;threshold_fallback_triggered_totalmonitors configuration or data integrity failures. - Tracing: Each evaluation generates a span containing
query_hash,z_score, andsmoothed_delta. Correlation IDs propagate to downstream alert routers. - Structured Logging: JSON-formatted logs capture the exact boundary conditions evaluated, enabling post-incident threshold calibration without pipeline replay.
Safe Fallback Protocols
- Data Gap Handling: If
historical_varianceisNaNor missing, the engine defaults to a conservativeWARNwith0.45confidence, preventing silentPASSemissions on unprofiled queries. - Config Reload Failure: The evaluator caches the last known-good configuration. If a hot-reload fails validation, the system continues operating on the cached state and emits a
CONFIG_STALEmetric. - Statistical Divergence Circuit Breaker: If
z_scoreexceeds5.0andexecution_countdrops below the floor simultaneously, the engine flags the input as potentially corrupted and routes to a quarantine queue for manual inspection, rather than forcing a verdict.
For statistical validation and metric instrumentation, refer to the official Python statistics module documentation and the OpenTelemetry Semantic Conventions for standardized metric naming.