Automated EXPLAIN Capture & Storage Workflows

Automated EXPLAIN capture and storage workflows give Database SREs, query optimization engineers, and platform teams a deterministic control plane that turns ephemeral optimizer output into versioned, regression-ready artifacts instead of disposable diagnostic text. Query performance drift is rarely a sudden failure; it is cumulative divergence from established execution patterns. The objective is not to run diagnostic commands on demand — it is to institutionalize plan visibility as a first-class, auditable dataset, which demands an automation-first architecture with strict stage isolation, production reliability guarantees, and a durable storage layer that survives engine upgrades and schema churn.

This is the reference architecture for the capture-to-storage half of the platform. It defines a five-stage pipeline, the exact data contracts that flow between stages, the numeric thresholds that gate promotion, and the Python orchestration patterns that run it in production. Downstream regression scoring and rule evaluation live in Regression Detection & Rule Engines; the canonicalization and hashing primitives this pipeline depends on are specified in Core Architecture & Baselining Fundamentals.

Pipeline Overview

A resilient baseline tracking system operates as a directed, dependency-bound pipeline with five isolated stages: Capture, then Regression, then CI Gate, then Index Sync, then Debugging. Each stage consumes only the validated output of its predecessor, preventing cascading failures and ensuring that plan artifacts remain immutable once committed. The pipeline executes as a state machine where transitions are gated by explicit health checks and schema contracts, so a fault in scoring can never corrupt a stored baseline and a statistics refresh can never silently rewrite history.

The contract between stages is deliberately narrow. Capture emits a signed, normalized plan envelope; Regression emits a scored delta event; the CI Gate emits a pass/warn/block decision; Index Sync emits baseline-invalidation events; and Debugging is a read-only consumer of every artifact the other stages persist. Data never flows backward; the single exception is the invalidation signal, which Index Sync sends to the baseline store to mark artifacts stale without rewriting them. This unidirectional shape is what makes the system auditable: every stored plan can be traced to exactly one capture event, one score, and one gate decision.

Five isolated stages, one narrow contract per edge; nothing flows backward except the dashed invalidate signal into the immutable baseline store.

Stage-by-Stage Architecture

Capture

The Capture stage intercepts or triggers EXPLAIN and EXPLAIN ANALYZE execution without introducing measurable latency to production workloads. This is achieved through read-replica routing, connection-pool sampling, or query shadowing. Captured payloads are immediately parsed, stripped of volatile fields, serialized, and dispatched to a durable message bus, decoupling the database control plane from downstream processing.

Input: raw optimizer output (EXPLAIN (FORMAT JSON) on PostgreSQL, EXPLAIN FORMAT=JSON on MySQL) plus execution context (engine version, search_path, statistics snapshot timestamp).
Output: a normalized plan envelope keyed by a stable query fingerprint and a deterministic plan hash.
Failure isolation: the capture agent runs read-only and time-boxed; if the bus is unavailable it buffers to a local durable queue rather than blocking the database.
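The buffering behaviour in the failure-isolation note can be sketched with the standard library alone. This is a minimal illustration, not the pipeline's actual agent: `dispatch`, `replay`, and the spool-file layout are hypothetical names, and a full in-process `queue.Queue` stands in for an unavailable message bus.

```python
import base64
import os
import queue


def dispatch(envelope: bytes, bus: queue.Queue, spool_path: str) -> str:
    """Hand an envelope to the bus, or append it to a local spool file if the bus is full."""
    try:
        bus.put_nowait(envelope)
        return "bus"
    except queue.Full:
        # Assumption: a full queue stands in for "bus unavailable".
        with open(spool_path, "ab") as spool:
            spool.write(base64.b64encode(envelope) + b"\n")  # one envelope per line
            spool.flush()
            os.fsync(spool.fileno())  # make the write durable before returning
        return "spool"


def replay(spool_path: str, bus: queue.Queue) -> int:
    """Re-send spooled envelopes in order, then remove the spool file."""
    if not os.path.exists(spool_path):
        return 0
    with open(spool_path, "rb") as spool:
        envelopes = [base64.b64decode(line) for line in spool if line.strip()]
    for envelope in envelopes:
        bus.put(envelope)  # blocks until the bus has room again
    os.remove(spool_path)
    return len(envelopes)
```

The capture path never waits on the bus: `dispatch` either enqueues immediately or writes to disk, and `replay` is run separately once the bus recovers.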

Capturing safely under load is the central concern here. For high-throughput environments, Building Async Ingestion Pipelines for High-Throughput Queries details the backpressure mechanisms and consumer-group partitioning required to maintain pipeline stability under peak load, while Capturing EXPLAIN Plans Without Impacting Production Performance covers the sampling and timeout controls that keep capture invisible to user traffic.

Regression

The Regression stage ingests normalized plans and computes structural divergence against the committed baseline for that fingerprint. It evaluates node topology, join order, access-method selection, and cost-metric deltas, quantifying divergence with graph-diff algorithms and statistical distance measures to produce a deterministic regression score. This stage is stateless with respect to storage: it reads the current baseline, reads the candidate, and emits a score. It never mutates the baseline itself — promotion is a separate, gated decision.

The scoring logic and the rule engines that consume it are documented in Regression Detection & Rule Engines, including how a hash-join to nested-loop shift is detected and how weighted cost deltas are aggregated across multi-table plans.

Input: the normalized plan envelope plus the current baseline artifact for the same fingerprint.
Output: a scored delta event carrying cost_ratio, rows_estimated_variance, and node_topology_hash_mismatch.
Failure isolation: a missing or unparseable baseline yields an explicit no-baseline verdict, never a false regression.
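A scored delta event with the three fields named above might be produced as follows. This is a sketch under stated assumptions: the plan dictionaries follow PostgreSQL's JSON keys (`Node Type`, `Total Cost`, `Plan Rows`, `Plans`), `rows_estimated_variance` is taken to be the relative change in the root row estimate, and the topology hash covers operator types and tree shape only. `DeltaEvent`, `topology_hash`, and `score` are hypothetical names; the real scoring logic lives in Regression Detection & Rule Engines.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Optional


def topology_hash(plan: dict) -> str:
    """Hash only the operator tree shape: node types and child structure."""
    def shape(node: dict) -> list:
        return [node.get("Node Type"), [shape(child) for child in node.get("Plans", [])]]
    return hashlib.sha256(json.dumps(shape(plan)).encode()).hexdigest()


@dataclass(frozen=True)
class DeltaEvent:
    fingerprint: str
    verdict: str  # "scored" or "no-baseline"
    cost_ratio: Optional[float] = None
    rows_estimated_variance: Optional[float] = None
    node_topology_hash_mismatch: Optional[int] = None


def score(fingerprint: str, candidate: dict, baseline: Optional[dict]) -> DeltaEvent:
    # A missing or malformed baseline is reported explicitly, never as a regression.
    if not baseline or "Total Cost" not in baseline or "Plan Rows" not in baseline:
        return DeltaEvent(fingerprint, "no-baseline")
    if baseline["Total Cost"] <= 0:
        return DeltaEvent(fingerprint, "no-baseline")
    cost_ratio = candidate["Total Cost"] / baseline["Total Cost"]
    # Assumed definition: relative change in the root node's estimated rows.
    rows_variance = abs(candidate["Plan Rows"] - baseline["Plan Rows"]) / max(baseline["Plan Rows"], 1)
    mismatch = int(topology_hash(candidate) != topology_hash(baseline))
    return DeltaEvent(fingerprint, "scored", cost_ratio, rows_variance, mismatch)
```

The function reads both plans and returns a value; it has no write path to the baseline, which matches the stage's stateless contract.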

CI Gate

The CI Gate stage consumes regression signals to approve, warn, or block deployment pipelines. It maps scores to predefined tolerance bands, integrating directly with version control and orchestration platforms. Gate outcomes are emitted as structured events, enabling automated rollback triggers or baseline-promotion workflows. Crucially, the gate is the only stage authorized to promote a candidate plan to baseline, and it does so only after a stability window (see the threshold matrix below).

Input: scored delta events.
Output: a pass / warn / block decision keyed to a commit SHA or deployment ID, plus an optional promotion command.
Failure isolation: the gate fails closed for block-band scores and fails open (with a logged warn) for infrastructure errors, so a scoring outage never hard-stops every deploy.
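The fail-closed / fail-open split can be expressed in a few lines. In this sketch `gate_decision` and its `fetch_band` argument are hypothetical: `fetch_band` represents whatever call retrieves the band for a commit from the Regression stage, and blocking on an unrecognized band is an added assumption rather than something the text above specifies.

```python
from typing import Callable, Optional, Tuple


def gate_decision(fetch_band: Callable[[], str]) -> Tuple[str, Optional[str]]:
    """Return (decision, note) for one deployment."""
    try:
        band = fetch_band()
    except (ConnectionError, TimeoutError) as exc:
        # Infrastructure failure: fail open, but record it as a warn.
        return "warn", f"scoring unavailable: {type(exc).__name__}"
    if band not in {"pass", "warn", "block"}:
        # Assumption: an unrecognized band is treated conservatively.
        return "block", f"unrecognized band {band!r}"
    # A genuine block-band score always blocks (fail closed).
    return band, None
```

Only transport-level errors take the fail-open path, so a scoring outage degrades to logged warnings while a real block verdict still halts the deploy.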

Index Sync

The Index Sync stage manages statistics refreshes, histogram updates, and DDL propagation. It monitors pg_statistic and pg_stat_user_indexes (or the engine equivalent), invalidating baselines when underlying table distributions shift beyond acceptable bounds. This ensures that regression signals reflect genuine optimizer drift rather than transient statistics staleness. When an index is created, dropped, or falls out of use, the corresponding baselines are marked stale so the next capture re-establishes a reference point instead of alerting on a legitimate plan change — the signals it acts on are the same ones described in Monitoring Index Usage Changes for Regression Signals.

Input: catalog change feed (statistics timestamps, index usage counters, DDL events).
Output: baseline-invalidation events keyed by affected fingerprint set.
Failure isolation: invalidation is idempotent and additive; a duplicate or late event can only widen the stale set, never resurrect a retired baseline.
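Idempotent, additive invalidation amounts to a grow-only set. The sketch below assumes each invalidation event carries a unique ID; `StaleSet` and its methods are hypothetical names used for illustration only.

```python
from typing import Dict, Set


class StaleSet:
    """Tracks invalidated fingerprints; applying events is idempotent and additive."""

    def __init__(self) -> None:
        self._stale: Dict[str, str] = {}  # fingerprint -> first recorded cause
        self._seen: Set[str] = set()      # event IDs already applied

    def apply(self, event_id: str, fingerprints: Set[str], cause: str) -> int:
        """Apply one event; return how many fingerprints became newly stale."""
        if event_id in self._seen:
            return 0  # duplicate delivery changes nothing
        self._seen.add(event_id)
        added = 0
        for fp in fingerprints:
            if fp not in self._stale:
                self._stale[fp] = cause
                added += 1
        return added

    def is_stale(self, fingerprint: str) -> bool:
        return fingerprint in self._stale
```

There is deliberately no method that removes a fingerprint, so replaying or reordering the event feed can only widen the stale set.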

Debugging

The Debugging stage surfaces historical plan artifacts, execution context, and regression deltas. It provides a queryable, read-only interface for incident response, letting engineers reconstruct execution timelines and isolate root causes without running manual diagnostic queries against production. Because every upstream stage writes immutable, hashed artifacts, this stage can answer “what did this query’s plan look like on the day latency doubled” deterministically.

Input: all persisted artifacts (envelopes, scores, gate decisions, invalidation events).
Output: correlated timelines and diffs for human or automated consumers.
Failure isolation: strictly read-only; it holds no write path to any other stage’s storage.

Baselining Fundamentals and Deterministic Tracking

A baseline is only as reliable as its metadata contract. Raw query plan text is inherently volatile — it contains execution timestamps, memory addresses, transient buffer-pool states, and engine-specific formatting that change across identical executions. Deterministic tracking requires stripping these ephemeral signals while preserving the structural topology of the execution graph.

Effective baselining rests on three principles: parameterized query fingerprinting, versioned optimizer-context capture, and immutable artifact hashing. Every captured plan is associated with a normalized query signature (a SHA-256 of the parameterized AST), the exact engine version, the statistics-snapshot timestamp, and the active configuration flags. The hashing scheme itself follows the SHA-256 plan-hashing approach, and the field-presence and version-compatibility rules that keep envelopes machine-checkable are defined in Schema Validation for Baseline Metadata.

Plan normalization is a prerequisite for reliable comparison. Engine-specific output must be parsed into a canonical intermediate representation before hashing or diffing — stripping volatile fields, standardizing operator nomenclature, and aligning cost units across optimizer versions. Normalizing Query Plans for Cross-Engine Comparison details the AST transformation pipelines and canonicalization rules that keep a baseline stable across heterogeneous deployments, and Normalizing Parameterized Queries for Consistent Plan Tracking covers literal substitution so that WHERE id = 42 and WHERE id = 99 collapse to the same fingerprint.
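To make the literal-substitution idea concrete, here is a deliberately simplified fingerprint function. The text above describes hashing a parameterized AST; this sketch swaps in regular expressions as a stand-in, so it ignores comments, quoted identifiers, and dialect quirks that a real parser would handle. The function name `fingerprint` is hypothetical.

```python
import hashlib
import re

_STRING = re.compile(r"'(?:[^']|'')*'")     # single-quoted literal, '' as escape
_NUMBER = re.compile(r"\b\d+(?:\.\d+)?\b")  # standalone integer or decimal
_SPACE = re.compile(r"\s+")


def fingerprint(sql: str) -> str:
    """Replace literals with '?', collapse whitespace and case, then hash."""
    text = _STRING.sub("?", sql)
    text = _NUMBER.sub("?", text)
    text = _SPACE.sub(" ", text).strip().lower()
    return hashlib.sha256(text.encode()).hexdigest()
```

With this sketch, `SELECT * FROM t WHERE id = 42` and `SELECT * FROM t WHERE id = 99` produce the same digest, while a query filtering on a different column does not.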

Threshold Matrix

Automation requires explicit, quantifiable thresholds. Ambiguous regression signals lead to alert fatigue and pipeline paralysis. The matrix below defines the operational boundaries applied in the Regression and CI Gate stages; the reasoning behind each band, and how to make it adaptive, is expanded in Defining Regression Thresholds for Query Plans.

Metric	Tolerance Band	Action	Automation Trigger
`cost_ratio` (new/old)	`≤ 1.15`	Pass	Eligible for promotion
`cost_ratio` (new/old)	`> 1.15` and `≤ 1.40`	Warn	Log, notify on-call, allow deploy
`cost_ratio` (new/old)	`> 1.40`	Block	Halt CI, require manual review
`rows_estimated_variance`	`≤ 30%`	Pass	None
`rows_estimated_variance`	`> 30%`	Warn / Block	Context-dependent; block if paired with topology mismatch
`node_topology_hash_mismatch`	`0` (exact match)	Pass	None
`node_topology_hash_mismatch`	`> 0`	Block	Structural regression; open incident
Consecutive clean runs before promote	`N = 3–5`	Promote	Commit new baseline atomically

Threshold evaluation occurs within the Regression stage. When a metric breaches a band, the pipeline emits a structured event containing the baseline ID, regression score, and affected query fingerprint. The CI Gate consumes these events and maps them to deployment policy. Automated baseline promotion fires only after N consecutive successful executions (typically N = 3–5) with scores in the pass band, ensuring statistical stability before committing a new reference point. Tuning these bands to suppress noise without hiding real regressions is covered in Tuning Thresholds for False-Positive Reduction.
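The matrix and the promotion rule translate directly into code. The following is a minimal sketch: `classify` and `PromotionTracker` are hypothetical names, a cost ratio of exactly 1.15 is treated as pass and exactly 1.40 as warn, and a row-variance breach on its own is assumed to raise a pass to warn.

```python
from dataclasses import dataclass


def classify(cost_ratio: float, rows_variance: float, topology_mismatch: int) -> str:
    """Map the three metrics to a single band: 'pass', 'warn', or 'block'."""
    if topology_mismatch > 0:
        return "block"  # structural regression always blocks
    if cost_ratio > 1.40:
        return "block"
    band = "warn" if cost_ratio > 1.15 else "pass"
    if rows_variance > 0.30 and band == "pass":
        band = "warn"  # assumption: variance alone warns, it does not block
    return band


@dataclass
class PromotionTracker:
    """Counts consecutive pass-band runs for one fingerprint."""
    required_clean_runs: int = 3
    streak: int = 0

    def observe(self, band: str) -> bool:
        """Record one run; return True when the candidate may be promoted."""
        self.streak = self.streak + 1 if band == "pass" else 0
        if self.streak >= self.required_clean_runs:
            self.streak = 0  # start a fresh window after promotion
            return True
        return False
```

Any warn or block resets the streak, so promotion requires N uninterrupted clean runs, not N clean runs in total.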

Production Readiness Requirements

Running EXPLAIN capture in production requires strict isolation from user-facing workloads. Four controls are non-negotiable:

Connection-pool isolation: the capture agent uses a dedicated pool (e.g. min_size=2, max_size=10) that is separate from the application pool, so a burst of captures can never exhaust connections that serve users.
Read-replica routing: EXPLAIN ANALYZE (which executes the query) is pinned to a read replica or a shadow instance; only lightweight EXPLAIN (plan-only) is ever permitted against a primary.
Circuit breakers: ingestion halts if database response times exceed a latency percentile guard (p99 > 500ms) or if the replica lag exceeds 10s, resuming automatically once the guard clears.
Least-privilege model: the agent authenticates as a role with SELECT-only grants and no DDL/DML rights; storage writes go through a separate service identity that cannot read production tables.
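The circuit-breaker control reduces to a small state holder that is re-evaluated whenever fresh measurements arrive. This is a sketch only: `CaptureBreaker` is a hypothetical name, and how the p99 latency and replica lag are measured is left to the caller.

```python
from dataclasses import dataclass


@dataclass
class CaptureBreaker:
    """Pauses capture while either guard is breached; resumes when both clear."""
    p99_limit_ms: float = 500.0
    lag_limit_s: float = 10.0
    tripped: bool = False

    def update(self, p99_ms: float, replica_lag_s: float) -> bool:
        """Re-evaluate both guards; return True when capture is allowed."""
        self.tripped = p99_ms > self.p99_limit_ms or replica_lag_s > self.lag_limit_s
        return not self.tripped
```

Because the state is recomputed from the latest readings on every call, capture resumes automatically as soon as both guards clear, with no manual reset step.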

The storage tier that holds baselines needs its own guarantees — encryption in transit and at rest, and a clear tenancy boundary — as specified in Security Boundaries for Baseline Data Storage.

Observability Hooks

Observability must be embedded at every pipeline stage. Each EXPLAIN execution generates an OpenTelemetry span annotated with database semantic conventions, including db.statement, db.operation, and a plan.hash attribute. The structured-logging format, field-indexing strategy, and retention policy for that span stream are specified in Routing EXPLAIN ANALYZE Output to Centralized Logs.

Named metrics emitted by this pipeline, with their instrument types:

plan_capture_latency_ms — histogram tracking ingestion processing time per plan.
baseline_staleness_hours — gauge measuring time since the last successful baseline validation for a fingerprint.
regression_rate_per_hour — counter tracking threshold breaches, labelled by severity (warn, block).
ci_gate_decision_total — counter with a decision label recording pass, warn, and block outcomes.
capture_bus_lag_messages — gauge exposing the depth of the durable queue between Capture and Regression.
baseline_invalidations_total — counter labelled by cause (stats_refresh, ddl, index_drop).

These metrics feed Prometheus-compatible time-series storage and drive alerting rules. Alerts route to on-call only when sustained regression rates exceed operational baselines, preventing noise during transient optimizer fluctuations. The cost-to-latency mapping used when interpreting these figures is derived in Mapping EXPLAIN Costs to Real-World Latency Metrics.

Python Orchestration Patterns

Platform teams typically orchestrate the pipeline with a DAG scheduler (Airflow, Prefect, or Argo Workflows) paired with asyncio processing workers. Each stage is a task with an explicit dependency edge; the scheduler enforces the Capture → Regression → CI Gate ordering, while a long-running consumer pool drains the message bus between the scheduled DAG runs. A practical starting point is a worker pool sized to $2 \times \text{replica\_count}$ capture coroutines per node, backed by a bounded asyncio.Queue that provides natural backpressure.

The capture worker below leverages asyncpg with a dedicated pool, structlog for structured events, and OpenTelemetry for tracing. Results are serialized to JSON, compressed, and handed to the bus for downstream scoring.

PYTHON

import asyncio
import gzip
import hashlib
import json
import time
from typing import Any

import asyncpg
import structlog
from opentelemetry import trace
from prometheus_client import Counter, Gauge, Histogram

log = structlog.get_logger("explain_capture")
tracer = trace.get_tracer("explain_capture")

# Buckets are in milliseconds; the library defaults are tuned for seconds.
CAPTURE_LATENCY = Histogram(
    "plan_capture_latency_ms", "Capture-to-envelope latency (ms)",
    buckets=(1, 5, 10, 25, 50, 100, 250, 500, 1000, 5000),
)
BUS_LAG = Gauge("capture_bus_lag_messages", "Pending plan envelopes on the bus")
CAPTURE_ERRORS = Counter("plan_capture_errors_total", "Failed captures", ["reason"])

VOLATILE_KEYS = frozenset({
    "Actual Rows", "Actual Startup Time", "Actual Total Time", "Actual Loops",
    "Execution Time", "Planning Time", "Buffers", "Shared Hit Blocks", "Shared Read Blocks",
})


def normalize(node: Any) -> Any:
    """Strip volatile fields and order child plans deterministically for a stable hash."""
    if isinstance(node, dict):
        cleaned = {k: normalize(v) for k, v in node.items() if k not in VOLATILE_KEYS}
        if isinstance(cleaned.get("Plans"), list):
            # Each child keeps its own "Parent Relationship" key, so the
            # outer/inner role of a join input survives the reordering.
            cleaned["Plans"] = sorted(
                cleaned["Plans"], key=lambda x: json.dumps(x, sort_keys=True)
            )
        return cleaned
    if isinstance(node, list):
        return [normalize(x) for x in node]
    return node


def build_envelope(raw_plan: dict, fingerprint: str, engine: str, version: str) -> tuple[str, bytes]:
    """Return (plan_hash, gzip-compressed envelope) for one captured plan."""
    canonical = normalize(raw_plan)
    canonical_bytes = json.dumps(canonical, sort_keys=True, separators=(",", ":")).encode()
    plan_hash = hashlib.sha256(canonical_bytes).hexdigest()
    envelope = {
        "fingerprint": fingerprint,
        "plan_hash": plan_hash,
        "engine": engine,
        "engine_version": version,
        "plan": canonical,
    }
    payload = json.dumps(envelope, sort_keys=True, separators=(",", ":")).encode()
    # mtime=0 keeps the gzip header constant, so identical plans yield identical bytes.
    return plan_hash, gzip.compress(payload, mtime=0)


async def capture_one(pool: asyncpg.Pool, sql: str, fingerprint: str, bus: asyncio.Queue) -> None:
    with tracer.start_as_current_span("capture_one") as span:
        span.set_attribute("db.operation", "explain")
        span.set_attribute("plan.fingerprint", fingerprint)
        started = time.perf_counter()
        try:
            async with pool.acquire() as conn:
                # Plan-only EXPLAIN against a read replica; no query execution.
                # `sql` is interpolated verbatim, so it must come from a trusted source.
                row = await asyncio.wait_for(
                    conn.fetchval(f"EXPLAIN (FORMAT JSON) {sql}"), timeout=5.0
                )
                raw_plan = json.loads(row)[0]["Plan"]
                version = await conn.fetchval("SHOW server_version")
        except (asyncio.TimeoutError, asyncpg.PostgresError) as exc:
            CAPTURE_ERRORS.labels(reason=type(exc).__name__).inc()
            await log.awarning("capture_failed", fingerprint=fingerprint, error=str(exc))
            return
        plan_hash, envelope = build_envelope(raw_plan, fingerprint, "postgres", version)
        # Histogram.time() would record seconds; this metric is in ms, so observe explicitly.
        CAPTURE_LATENCY.observe((time.perf_counter() - started) * 1000.0)
        # Tag the span with the canonical plan hash, not a digest of the compressed bytes.
        span.set_attribute("plan.hash", plan_hash)
        await bus.put(envelope)  # suspends while the bounded queue is full
        BUS_LAG.set(bus.qsize())
        await log.ainfo("capture_ok", fingerprint=fingerprint, bytes=len(envelope))


async def run_capture_pool(
    dsn: str, work: list[tuple[str, str]], bus: asyncio.Queue, concurrency: int = 8
) -> None:
    """Capture every (sql, fingerprint) pair in `work`.

    The caller owns `bus` (for example asyncio.Queue(maxsize=1000)) and must
    drain it concurrently; otherwise producers stall once the queue fills.
    """
    pool = await asyncpg.create_pool(dsn, min_size=2, max_size=concurrency + 2)
    sem = asyncio.Semaphore(concurrency)

    async def guarded(sql: str, fp: str) -> None:
        async with sem:
            await capture_one(pool, sql, fp, bus)

    try:
        await asyncio.gather(*(guarded(sql, fp) for sql, fp in work))
    finally:
        await pool.close()
        await log.ainfo("capture_pool_drained", remaining=bus.qsize())

For durable storage, prioritize queryability. Parquet partitioned by query_fingerprint and capture_date enables efficient historical analysis for the Debugging stage, while the hot envelope stream lands in object storage (S3, GCS, or MinIO) with immutable versioning. GitOps integration commits baseline manifests to a version-controlled repository, where pull requests trigger the CI Gate evaluation; idempotent upserts keyed on plan_hash ensure duplicate captures never corrupt baseline state. The Kafka-based fan-out that carries envelopes at scale is detailed in Using Kafka for Async Query Plan Ingestion at Scale.
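The idempotent-upsert property can be demonstrated without any external service. The sketch below uses an in-memory SQLite table as a stand-in for the real storage tier (the text above describes Parquet and object storage); `open_store`, `store_envelope`, and the `baselines` schema are hypothetical.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS baselines (
               plan_hash    TEXT PRIMARY KEY,
               fingerprint  TEXT NOT NULL,
               capture_date TEXT NOT NULL,
               envelope     BLOB NOT NULL
           )"""
    )
    return conn


def store_envelope(conn: sqlite3.Connection, plan_hash: str, fingerprint: str,
                   capture_date: str, envelope: bytes) -> bool:
    """Write one envelope per plan_hash; return True only if a new row was stored."""
    before = conn.total_changes
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO baselines VALUES (?, ?, ?, ?)",
            (plan_hash, fingerprint, capture_date, envelope),
        )
    return conn.total_changes > before
```

A second capture with the same `plan_hash` is a no-op, so retries and duplicate deliveries leave the first stored artifact untouched.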

A semaphore caps concurrency; the bounded queue turns a full buffer into natural backpressure, and its depth is the capture_bus_lag_messages gauge that the circuit breaker watches.
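The interplay of semaphore and bounded queue can be observed in isolation, with no database involved. In this sketch `drain` stands in for the Regression consumer and `asyncio.sleep(0)` for the EXPLAIN round trip; both names and the tiny queue size are illustrative assumptions.

```python
import asyncio
from typing import List, Tuple


async def drain(bus: asyncio.Queue, sink: list) -> None:
    """Stand-in consumer: pull envelopes until cancelled."""
    while True:
        item = await bus.get()
        sink.append(item)
        bus.task_done()


async def run(items: list, maxsize: int = 2, concurrency: int = 3) -> Tuple[List, int]:
    bus: asyncio.Queue = asyncio.Queue(maxsize=maxsize)
    sem = asyncio.Semaphore(concurrency)
    sink: list = []
    peak_depth = 0

    async def capture(item) -> None:
        nonlocal peak_depth
        async with sem:                # at most `concurrency` captures in flight
            await asyncio.sleep(0)     # placeholder for the EXPLAIN round trip
            await bus.put(item)        # suspends here while the queue is full
            peak_depth = max(peak_depth, bus.qsize())

    consumer = asyncio.create_task(drain(bus, sink))
    await asyncio.gather(*(capture(i) for i in items))
    await bus.join()                   # wait until every envelope is processed
    consumer.cancel()
    await asyncio.gather(consumer, return_exceptions=True)
    return sink, peak_depth
```

Calling `asyncio.run(run(list(range(20))))` delivers all twenty items while the queue depth never exceeds `maxsize`; producers simply wait when the consumer falls behind.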

Common Failure Modes and Mitigations

A runbook entry per stage keeps incident response deterministic:

Capture — replica lag storm. Symptom: capture_bus_lag_messages climbs while plan_capture_latency_ms p99 spikes. Cause: replica falling behind under write load. Mitigation: the circuit breaker pauses capture at lag > 10s; captures buffer to the local durable queue and replay on recovery. No baselines are affected.
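Capture — EXPLAIN ANALYZE leaking onto the primary. Symptom: unexpected write-path latency on the primary. Cause: misrouted analyze traffic. Mitigation: PostgreSQL has no privilege that separates EXPLAIN from EXPLAIN ANALYZE, so a role grant alone cannot enforce this; have the capture agent reject any statement that requests ANALYZE when the target is a primary, and route all analyze traffic to replicas. Alert on any analyze span whose db.instance is a primary, and revoke the agent's access to that instance until the routing is fixed.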
Regression — false positive after ANALYZE. Symptom: a block-band rows_estimated_variance immediately after a statistics refresh. Cause: the baseline predates fresh histograms. Mitigation: Index Sync must invalidate affected baselines before Regression re-scores; treat any regression within the invalidation window as no-baseline, not a breach.
CI Gate — flapping promotions. Symptom: a fingerprint promotes and demotes across consecutive runs. Cause: N set too low for a genuinely bimodal plan. Mitigation: raise the consecutive-clean-run requirement toward N = 5 and require topology-hash equality, per Tuning Thresholds for False-Positive Reduction.
Index Sync — missed DDL event. Symptom: stale baseline alerting on a legitimate plan change after an index drop. Cause: a dropped catalog-change event. Mitigation: reconcile periodically by full-scanning pg_stat_user_indexes; invalidation is idempotent, so replaying the feed is always safe.
Debugging — artifact not found. Symptom: a timeline query returns no plan for a known incident window. Cause: retention expiry or a capture gap. Mitigation: confirm object-storage lifecycle rules against the required retention SLO; backfill from the compressed envelope archive if within window.

Conclusion

By treating query plans as deterministic, versioned artifacts rather than disposable diagnostic output, teams eliminate reactive performance firefighting. The five stages — Capture, Regression, CI Gate, Index Sync, and Debugging — enforce strict boundaries that keep each concern isolated, scalable, and auditable. When capture and storage are automated end-to-end, plan visibility shifts from a troubleshooting exercise into a proactive control plane that the regression and rule engines can build on with confidence.

Related

Building Async Ingestion Pipelines for High-Throughput Queries — backpressure and partitioning for the Capture stage.
Schema Validation for Baseline Metadata — field contracts that keep envelopes machine-checkable.
Normalizing Query Plans for Cross-Engine Comparison — canonicalization rules for stable baselines.
Routing EXPLAIN ANALYZE Output to Centralized Logs — structured logging and retention for observability.
Core Architecture & Baselining Fundamentals and Regression Detection & Rule Engines — the sibling topic areas this pipeline feeds and depends on.
