Capturing EXPLAIN Plans Without Impacting Production Performance

Capturing EXPLAIN ANALYZE output on a live fleet without degrading transactional latency is a routing-and-isolation problem, not a logging problem: the diagnostic query must never share I/O, memory, or lock scope with application traffic. Running EXPLAIN ANALYZE directly against a primary writer executes the statement for real. Sorts and hashes that exceed work_mem spill to temp files, the scan competes for shared buffers, and the statement takes the same relation locks the application depends on; for INSERT, UPDATE, or DELETE it also applies the writes unless the transaction is rolled back. This runbook defines the exact trigger thresholds, the read-path isolation model, the async capture worker, and the graceful degradation chain that lets a database SRE stream fingerprinted plans into the routing stage that ships them to centralized logs with zero measurable effect on production.

Symptom Identification and Production Thresholds

Naive capture reveals itself as a latency correlation: p99 on the primary tracks your capture schedule. Gate every capture behind a statistically significant deviation instead of sampling blindly, and treat the following as hard breach conditions evaluated over a rolling 15-minute window from pg_stat_statements and OpenTelemetry spans:

Latency deviation: execution time for a parameterized fingerprint exceeds $1.5\times$ its 95th-percentile baseline. Below this multiplier, capture is noise.
Temp-file escalation: a statement spills more than 512 MB to temp, or triggers work_mem temp-file fallback on more than 3 consecutive executions.
Plan instability: a queryid shows a calls increase of >200% alongside a mean_exec_time regression of >40% versus its 7-day average.
Replica lag ceiling: the capture replica reports max_replication_lag above 200 ms — above this, defer, never fall back to the primary.

Capture activates only when any two conditions intersect inside the window. Apply a cooldown_ms of 300000 (5 minutes) per fingerprint so cache warming, autovacuum overlap, and batch windows do not trigger a thundering herd of EXPLAIN runs. Fingerprints must be literal-stripped and parameterized before hashing — the same discipline described in Normalizing Parameterized Queries for Consistent Plan Tracking — or cardinality explodes and the cooldown map never converges.
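
The gating rule above can be sketched as a small in-memory check. This is a minimal illustration, not the runbook's implementation: the FingerprintWindow shape, the condition names, and the CaptureGate class are hypothetical, and 512 MB is assumed to mean 512 MiB.

```python
from dataclasses import dataclass, field


@dataclass
class FingerprintWindow:
    """Rolling 15-minute aggregates for one fingerprint (hypothetical shape)."""
    exec_ms: float                   # observed execution time
    p95_baseline_ms: float           # 95th-percentile baseline
    temp_bytes: int                  # largest temp spill seen in the window
    consecutive_spills: int          # consecutive executions that spilled
    calls_increase_pct: float        # calls increase vs the 7-day average
    mean_exec_regression_pct: float  # mean_exec_time regression vs the 7-day average
    replica_lag_ms: float            # lag reported by the capture replica


def breached(w: FingerprintWindow) -> list[str]:
    """Return the names of the breach conditions that currently hold."""
    hits = []
    if w.exec_ms > 1.5 * w.p95_baseline_ms:
        hits.append("latency_deviation")
    if w.temp_bytes > 512 * 1024 * 1024 or w.consecutive_spills > 3:
        hits.append("temp_file_escalation")
    if w.calls_increase_pct > 200 and w.mean_exec_regression_pct > 40:
        hits.append("plan_instability")
    if w.replica_lag_ms > 200:
        hits.append("replica_lag_ceiling")
    return hits


@dataclass
class CaptureGate:
    """Fires when at least two conditions hold and the fingerprint is off cooldown."""
    cooldown_ms: int = 300_000
    _last_fired_ms: dict[str, int] = field(default_factory=dict)

    def should_capture(self, fingerprint: str, w: FingerprintWindow, now_ms: int) -> bool:
        if len(breached(w)) < 2:
            return False
        last = self._last_fired_ms.get(fingerprint)
        if last is not None and now_ms - last < self.cooldown_ms:
            return False
        self._last_fired_ms[fingerprint] = now_ms
        return True
```

Keying the cooldown map on the normalized fingerprint is what keeps it bounded; keyed on raw SQL text, every distinct literal would get its own entry and its own capture.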

Root Cause Analysis

Three failure domains account for nearly every capture-induced production incident. Each has a direct diagnostic.

1. Diagnostic execution on the write path. The capture worker is pointed at the primary DSN, so EXPLAIN ANALYZE runs the real query and contends for buffers and locks. Confirm which backends are running EXPLAIN and where:

SQL

SELECT pid, backend_start, wait_event_type, wait_event, left(query, 60) AS q
FROM pg_stat_activity
WHERE query ILIKE 'EXPLAIN%' AND backend_type = 'client backend';

Run this on the primary: pg_stat_activity only reports backends of the instance you are connected to, so any row it returns there is a live incident and means the capture DSN is misrouted.

2. Unbounded diagnostic sessions. A capture session inherits the application’s work_mem, statement_timeout = 0, and parallel worker settings, so a single heavy plan starves the replica. Inspect the effective session context before trusting a capture endpoint:

SQL

SHOW work_mem; SHOW statement_timeout; SHOW default_transaction_read_only;

If statement_timeout is 0 or default_transaction_read_only is off, the isolation contract is not in force.

3. Silent capture of production literals. EXPLAIN output embeds bound literals from the sampled statement, so plan text can carry PII into the log store. Grep a captured envelope before it leaves the host:

BASH

jq -r '.execution_plan[0].Plan | .. | objects | (.["Filter"], .["Index Cond"]) // empty' captured.json | head

Any real literal in a Filter or Index Cond means normalization ran after capture instead of before it — a data-residency exposure covered by Security Boundaries for Baseline Data Storage.
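
The same check can run inside the worker before an envelope is handed off. The sketch below is a heuristic under stated assumptions: the list of condition keys and the literal regex are illustrative choices, not an exhaustive detector, and numeric type modifiers such as numeric(10,2) would be flagged as false positives.

```python
import re
from typing import Any, Iterator

# Plan keys that carry predicate text (assumed list; extend as needed).
CONDITION_KEYS = ("Filter", "Index Cond", "Recheck Cond", "Hash Cond", "Join Filter")
# A single-quoted string, or a bare number that is not part of an identifier or a $n placeholder.
LITERAL = re.compile(r"'(?:[^']|'')*'|(?<![\w$.])\d+(?:\.\d+)?(?![\w.])")


def iter_conditions(node: Any) -> Iterator[tuple[str, str]]:
    """Yield (key, predicate text) pairs from every node of a FORMAT JSON plan."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key in CONDITION_KEYS and isinstance(value, str):
                yield key, value
            else:
                yield from iter_conditions(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_conditions(item)


def find_literal_leaks(execution_plan: list) -> list[tuple[str, str]]:
    """Return the predicates that still contain what looks like a raw literal."""
    return [(k, c) for k, c in iter_conditions(execution_plan) if LITERAL.search(c)]
```

A non-empty result is a reason to drop or quarantine the envelope rather than ship it.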

Step-by-Step Remediation

1. Pin the capture endpoint to an isolated read path

Never target the primary writer pool. Route capture to a read replica or shadow standby and enforce resource caps with SET LOCAL inside the capture transaction, not inline in user query strings:

SQL

-- Applied by the connection init hook, inside the capture transaction only
SET LOCAL default_transaction_read_only = on;
SET LOCAL statement_timeout = '30s';
SET LOCAL lock_timeout = '500ms';
SET LOCAL work_mem = '64MB';
SET LOCAL max_parallel_workers_per_gather = 1;

Hold the effective sampling rate at 0.005 (0.5%) of triggered events; higher rates put measurable CPU load on the replica.
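
One way to hold that rate is a deterministic, hash-based sampler, sketched below. The function name and the event_id argument are hypothetical; any stable per-event identifier would do. Hashing instead of calling a random generator means a retried event gets the same decision on every worker.

```python
import hashlib

SAMPLE_RATE = 0.005  # 0.5% of triggered events


def sampled(fingerprint: str, event_id: str, rate: float = SAMPLE_RATE) -> bool:
    """Decide whether a triggered event is captured, stably across retries and workers."""
    digest = hashlib.sha256(f"{fingerprint}:{event_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate
```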

2. Run capture in an async, instrumented worker

The worker executes off the request path, parses FORMAT JSON, fingerprints the structure, and emits structlog events plus OpenTelemetry spans so the capture pipeline is observable in the same backend the plans flow into:

PYTHON

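import asyncio
import hashlib
import json
from datetime import datetime, timezone
from typing import Any

import asyncpg
import structlog
from opentelemetry import metrics, trace

log = structlog.get_logger("explain_capture")
tracer = trace.get_tracer("explain_capture")
meter = metrics.get_meter("explain_capture")

CAPTURED = meter.create_counter("explain_capture_success_total")
DEFERRED = meter.create_counter("explain_capture_deferred_total")
CAPTURE_SECONDS = meter.create_histogram("explain_capture_latency_seconds", unit="s")

SAFETY_TIMEOUT_S = 30
MAX_REPLICA_LAG_MS = 200
# Substrings of plan keys that change from run to run under ANALYZE/BUFFERS.
# A heuristic, not an exhaustive list; extend it for the node types you see.
RUNTIME_KEY_MARKERS = ("Actual ", " Blocks", " Time", "Rows Removed", "Workers Launched")


def _strip_runtime_fields(node: Any) -> Any:
    """Copy the plan tree without runtime-only fields so plan_hash stays stable."""
    if isinstance(node, dict):
        return {
            key: _strip_runtime_fields(value)
            for key, value in node.items()
            if not any(marker in key for marker in RUNTIME_KEY_MARKERS)
        }
    if isinstance(node, list):
        return [_strip_runtime_fields(item) for item in node]
    return node


async def _replica_lag_ms(conn: asyncpg.Connection) -> float:
    # pg_last_xact_replay_timestamp() is NULL on a primary, so lag falls back to 0.0.
    lag = await conn.fetchval(
        "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) * 1000"
    )
    return float(lag or 0.0)


async def capture_explain_plan(dsn: str, query_text: str, params: tuple = ()) -> dict[str, Any]:
    """Execute EXPLAIN ANALYZE in a read-only, resource-capped replica session."""
    conn = await asyncpg.connect(dsn)
    start = asyncio.get_running_loop().time()
    with tracer.start_as_current_span("capture.explain") as span:
        try:
            lag = await _replica_lag_ms(conn)
            span.set_attribute("replica.lag_ms", lag)
            if lag > MAX_REPLICA_LAG_MS:
                DEFERRED.add(1, {"reason": "replica_lag"})
                log.warning("capture_deferred", replica_lag_ms=lag)
                raise RuntimeError("replica lag over threshold; deferring capture")

            # SET LOCAL is ignored outside a transaction block, so open one explicitly;
            # readonly=True makes this transaction itself read-only.
            async with conn.transaction(readonly=True):
                await conn.execute("SET LOCAL default_transaction_read_only = on")
                await conn.execute(f"SET LOCAL statement_timeout = '{SAFETY_TIMEOUT_S}s'")
                await conn.execute("SET LOCAL lock_timeout = '500ms'")
                await conn.execute("SET LOCAL work_mem = '64MB'")
                await conn.execute("SET LOCAL max_parallel_workers_per_gather = 1")
                rows = await conn.fetch(
                    f"EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON) {query_text}", *params
                )

            raw = rows[0][0]
            # asyncpg returns json columns as str unless a JSON codec is registered.
            plan_data = json.loads(raw) if isinstance(raw, str) else raw

            # Hash the structure only, so timing and buffer counts never reach the key.
            plan_hash = hashlib.sha256(
                json.dumps(_strip_runtime_fields(plan_data), sort_keys=True).encode()
            ).hexdigest()
            CAPTURED.add(1)
            return {
                "timestamp": datetime.now(timezone.utc).isoformat(),
                "query_fingerprint": hashlib.sha256(query_text.encode()).hexdigest(),
                "plan_hash": plan_hash,
                "execution_plan": plan_data,
                "metadata": {"capture_method": "asyncpg_replica", "version": "v1.3"},
            }
        except asyncpg.QueryCanceledError as exc:
            span.record_exception(exc)
            raise TimeoutError(
                f"EXPLAIN ANALYZE exceeded {SAFETY_TIMEOUT_S}s safety threshold"
            ) from exc
        finally:
            CAPTURE_SECONDS.record(asyncio.get_running_loop().time() - start)
            await conn.close()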

The returned envelope is the immutable object the routing stage consumes; its plan_hash uses the deterministic SHA-256 fingerprinting approach so identical plans collapse to one baseline key. Hand each envelope off over the async transport in Building Async Ingestion Pipelines for High-Throughput Queries rather than writing it synchronously.

3. Diff against the baseline off the hot path

Cost deltas run in the worker or CI, never on the capture connection. Interpret the delta with the cost-to-latency mapping from Mapping EXPLAIN Costs to Real-World Latency Metrics, and gate the block band with your regression thresholds:

PYTHON

def diff_against_baseline(current: dict[str, Any], baseline: dict[str, Any]) -> dict[str, Any]:
    cur = current["execution_plan"][0]["Plan"]      # FORMAT JSON root node
    base = baseline["execution_plan"][0]["Plan"]
    cur_cost, base_cost = cur.get("Total Cost", 0.0), base.get("Total Cost", 0.0)
    if base_cost == 0:
        return {"regression_detected": False, "cost_delta_pct": 0.0}
    delta = round(((cur_cost - base_cost) / base_cost) * 100, 2)
    return {
        "regression_detected": cur_cost > base_cost * 1.5,
        "cost_delta_pct": delta,
        "baseline_node_type": base.get("Node Type"),
        "current_node_type": cur.get("Node Type"),
    }

Note the structural fields: EXPLAIN (FORMAT JSON) returns an array whose root plan is at result[0]["Plan"], where Total Cost lives. Adding ANALYZE also populates Actual Rows, Actual Total Time, and Shared Hit Blocks per node — runtime fields that must be stripped before hashing so they never leak into the baseline key.

4. Wire the graceful degradation chain

When isolation cannot be guaranteed, degrade instead of touching the primary:

Replica lag > 200 ms: abort ANALYZE, queue the fingerprint for off-peak capture, emit explain_capture_deferred_total.
Timeout or lock contention: strip ANALYZE and BUFFERS, fall back to EXPLAIN (FORMAT JSON) — estimated costs, plan structure preserved, zero query execution.
Persistent failure: disable capture for that query_hash for 24h, route to a dead-letter queue, page at severity: P3.

Log every transition with fallback_reason, original_hash, and degraded_mode. No fallback path may ever retry against the primary writer.
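
The chain above can be expressed as a pure decision function that the worker calls before each attempt. This is a sketch under assumptions: the Mode names, the Transition record, and the threshold of three consecutive failures for "persistent" are illustrative, since the runbook does not define that number.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Mode(str, Enum):
    FULL = "explain_analyze"
    DEFERRED = "deferred_off_peak"
    ESTIMATE_ONLY = "explain_estimate_only"
    DISABLED = "disabled_24h"


MAX_REPLICA_LAG_MS = 200
MAX_CONSECUTIVE_FAILURES = 3  # assumption: the source does not define "persistent"


@dataclass(frozen=True)
class Transition:
    degraded_mode: Mode
    fallback_reason: Optional[str]
    original_hash: str
    statement: Optional[str]  # EXPLAIN prefix to run on the replica, if any


def next_mode(
    query_hash: str,
    replica_lag_ms: float,
    last_error: Optional[str],
    consecutive_failures: int,
) -> Transition:
    """Pick the capture mode; no branch ever selects the primary writer."""
    if consecutive_failures >= MAX_CONSECUTIVE_FAILURES:
        return Transition(Mode.DISABLED, "persistent_failure", query_hash, None)
    if replica_lag_ms > MAX_REPLICA_LAG_MS:
        return Transition(Mode.DEFERRED, "replica_lag", query_hash, None)
    if last_error in ("timeout", "lock_contention"):
        return Transition(Mode.ESTIMATE_ONLY, last_error, query_hash, "EXPLAIN (FORMAT JSON)")
    return Transition(Mode.FULL, None, query_hash, "EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON)")
```

Because the function returns a record carrying fallback_reason, original_hash, and degraded_mode, logging every transition is a single structured log call on the result.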

Verification Checklist

[ ] Capture DSN resolves to a read replica or shadow standby, confirmed via pg_stat_activity on the primary showing zero client-backend EXPLAIN rows.
[ ] default_transaction_read_only = on, statement_timeout = 30s, and work_mem = 64MB are in force in the capture session (SHOW returns the capped values).
[ ] Trigger fires only when 2 of 4 breach conditions intersect, and per-fingerprint cooldown_ms = 300000 is enforced.
[ ] Replica-lag guard defers (not falls back to primary) above 200 ms; explain_capture_deferred_total increments under induced lag.
[ ] Captured envelopes carry a 64-hex plan_hash with runtime fields stripped before hashing.
[ ] Plan text contains no raw literals — normalization runs before the envelope leaves the host.
[ ] Primary p99 latency shows no correlation with the capture schedule over a 24h soak.

Compatibility and Engine-Specific Notes

Concern	PostgreSQL	MySQL / MariaDB	Distributed SQL (CockroachDB / YugabyteDB)
Capture command	`EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON)`	`EXPLAIN ANALYZE` (MySQL 8.0.18+, TREE output); `ANALYZE FORMAT=JSON` (MariaDB)	`EXPLAIN (ANALYZE, FORMAT JSON)` (YugabyteDB) / `EXPLAIN ANALYZE` (CockroachDB)
Read-path isolation	Streaming replica + `default_transaction_read_only`	Read replica + `--read-only`; `EXPLAIN` alone is estimate-only	Follower reads; capture on any node, cost is cluster-wide
Structure-only fallback	`EXPLAIN (FORMAT JSON)` — no execution	`EXPLAIN FORMAT=JSON` — no execution	`EXPLAIN (FORMAT JSON)` — no execution
Root cost field	`Plan.Total Cost` at `result[0]`	`query_block.cost_info.query_cost`	`Total Cost` per operator
Lag signal	`pg_last_xact_replay_timestamp()`	`Seconds_Behind_Source`	raft follower staleness / `--max-staleness`

MySQL’s plain EXPLAIN never executes the query, so on MySQL the estimate-only path is the default-safe fallback; only EXPLAIN ANALYZE needs replica isolation. Distributed engines fan the plan across nodes, so a “read replica” is any follower — isolate by using follower reads and a bounded staleness window rather than a dedicated standby.

Related

Routing EXPLAIN ANALYZE Output to Centralized Logs — the downstream stage that ships the envelopes this worker produces.
Building Async Ingestion Pipelines for High-Throughput Queries — the backpressure-aware transport between capture and storage.
Defining Regression Thresholds for Query Plans — the numeric bands that turn a cost delta into a block decision.
