Runbook
Capturing EXPLAIN Plans Without Impacting Production Performance
Capturing EXPLAIN Plans Without Impacting Production Performance requires a deterministic sampling strategy, strict resource isolation, and automated baseline comparison. Executing EXPLAIN ANALYZE directly against live primary endpoints introduces unpredictable I/O spikes, temporary table materialization, and advisory lock contention that can cascade into transactional latency degradation. Production SREs and platform teams must decouple diagnostic execution from application workloads while preserving plan fidelity. This runbook defines exact trigger thresholds, isolation architecture, Python automation logic, and safe fallback chains for continuous query plan tracking and regression automation.
Symptom Identification & Trigger Thresholds
Automated capture must be gated by statistically significant deviation from established baselines. Indiscriminate logging exhausts storage, pollutes telemetry pipelines, and obscures genuine regressions. Implement a three-tier trigger matrix evaluated against a rolling 15-minute window using pg_stat_statements metrics and OpenTelemetry traces:
- Latency Deviation: Query execution time exceeds
1.5xthe 95th percentile baseline for the same parameterized fingerprint. - Resource Escalation: Temporary disk spillage exceeds
512MB, orwork_memallocation triggerstemp_buffersfallback for more than3consecutive executions. - Plan Instability:
pg_stat_statementsreports acallscount increase of>200%alongside amean_timeregression of>40%.
When any two conditions intersect within the evaluation window, the diagnostic pipeline activates. Transient spikes caused by cache warming, VACUUM overlap, or batch job scheduling are filtered using a cooldown_ms threshold of 300000 (5 minutes) per query hash. This prevents thundering-herd capture during routine maintenance. Fingerprint normalization must strip literal values using parameterized placeholders before hashing to avoid cardinality explosion.
Zero-Impact Execution Architecture
Safe capture relies on routing diagnostic queries to isolated execution contexts. Never run EXPLAIN ANALYZE on the primary writer pool. Deploy a dedicated diagnostic connection pooler with the following constraints:
- Target Endpoint: Read replica or shadow standby with
max_replication_lagenforced at<200ms. - Resource Caps:
statement_timeout = 30s,work_mem = 64MB,maintenance_work_mem = 128MB,max_parallel_workers_per_gather = 1. - Sampling Rate: Fixed at
0.005(0.5%) of triggered events. Higher rates introduce measurable CPU overhead on replicas. - Transaction Isolation:
SET LOCAL default_transaction_read_only = on;andSET LOCAL lock_timeout = '500ms';to guarantee zero write-side interference.
Session parameters must be applied via a connection initialization hook, not inline SQL, to prevent injection vectors and ensure consistent execution context. This architecture feeds directly into Automated EXPLAIN Capture & Storage Workflows, where captured plans are serialized, fingerprinted, and queued for baseline diffing without blocking application threads.
Python Automation & Baseline Diffing Engine
The capture daemon must operate asynchronously, parse raw EXPLAIN output into structured JSON, and compute plan deltas against a version-controlled baseline repository. The following Python implementation uses asyncpg for non-blocking execution and jsonschema for output validation:
import asyncio
import asyncpg
import json
import hashlib
from datetime import datetime
from typing import Dict, Any, Optional
EXPLAIN_QUERY = """
EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON)
{query_text}
"""
async def capture_explain_plan(
dsn: str,
query_text: str,
params: tuple = ()
) -> Dict[str, Any]:
"""Execute EXPLAIN ANALYZE in a read-only, resource-capped context."""
conn = await asyncpg.connect(dsn)
try:
# Enforce isolation and resource limits at session level
await conn.execute("SET LOCAL default_transaction_read_only = on;")
await conn.execute("SET LOCAL statement_timeout = '30s';")
await conn.execute("SET LOCAL work_mem = '64MB';")
# Execute diagnostic query
plan_data = await conn.fetchval(EXPLAIN_QUERY.format(query_text=query_text), *params)
# Generate deterministic fingerprint
plan_hash = hashlib.sha256(json.dumps(plan_data, sort_keys=True).encode()).hexdigest()
return {
"timestamp": datetime.utcnow().isoformat(),
"query_fingerprint": hashlib.md5(query_text.encode()).hexdigest(),
"plan_hash": plan_hash,
"execution_plan": plan_data,
"metadata": {"capture_method": "asyncpg_diagnostic", "version": "v1.2"}
}
except asyncpg.QueryCanceledError:
raise TimeoutError("EXPLAIN ANALYZE exceeded 30s safety threshold")
finally:
await conn.close()
async def diff_against_baseline(current_plan: Dict[str, Any], baseline: Dict[str, Any]) -> Dict[str, Any]:
"""Compute structural and cost deltas between current and baseline plans."""
current_cost = current_plan["execution_plan"][0]["Total Cost"]
baseline_cost = baseline["execution_plan"][0]["Total Cost"]
return {
"regression_detected": current_cost > (baseline_cost * 1.5),
"cost_delta_pct": round(((current_cost - baseline_cost) / baseline_cost) * 100, 2),
"node_changes": _extract_node_diffs(current_plan, baseline),
"recommendation": "review_join_order" if current_cost > baseline_cost * 1.5 else "stable"
}The daemon routes structured payloads through a lightweight message broker (e.g., RabbitMQ or NATS) before persisting to object storage. For log aggregation and long-term retention strategies, integrate with Routing EXPLAIN ANALYZE Output to Centralized Logs to ensure plans are indexed alongside application traces and database metrics. Baseline comparisons should run in CI pipelines using a pytest fixture that loads the latest production snapshot and fails PRs when regression_detected exceeds 1.5x cost thresholds. Reference the official PostgreSQL documentation on EXPLAIN output formats when extending node-level parsers.
Safe Fallback Chains & Degradation Protocols
Diagnostic pipelines must degrade gracefully when isolation boundaries are breached. Implement a three-tier fallback chain:
- Primary Fallback (Replica Lag > 200ms): Immediately abort
EXPLAIN ANALYZEand queue the query fingerprint for off-peak execution. Emit aWARNmetric:explain_capture.deferred_due_to_replica_lag. - Secondary Fallback (Execution Timeout / Lock Contention): Strip
ANALYZEandBUFFERSflags, falling back toEXPLAIN (FORMAT JSON). This returns estimated costs without executing the query, preserving plan structure while eliminating I/O impact. - Tertiary Fallback (Persistent Failure): Disable automated capture for the affected query hash for
24h, route to a dead-letter queue, and trigger a PagerDuty alert withseverity: P3.
All fallback transitions must be logged with structured context (fallback_reason, original_hash, degraded_mode) to maintain auditability. Never allow fallback queries to retry against the primary writer under any circumstances.
Observability Integration & Validation
The capture pipeline itself requires rigorous monitoring. Expose the following metrics via Prometheus or OpenTelemetry:
explain_capture_success_total(counter)explain_capture_latency_seconds(histogram, buckets: 0.1, 0.5, 1.0, 5.0, 30.0)explain_fallback_activations_total(counter, label:fallback_tier)replica_lag_during_capture_ms(gauge)
Configure alerting rules to trigger when:
rate(explain_capture_success_total[5m]) < 0.95explain_capture_latency_seconds_bucket{le="30.0"} / explain_capture_success_total[5m] > 0.1replica_lag_during_capture_ms > 250for>2m
Validate pipeline integrity by injecting synthetic query fingerprints with known plan structures into the staging cluster. Compare captured JSON against golden files using schema validation. Ensure the pg_stat_statements extension is enabled on both primary and replica endpoints to maintain consistent metric parity. Run quarterly chaos drills simulating replica failover and network partition to verify fallback chain execution and data loss boundaries.