Mapping EXPLAIN Costs to Real-World Latency Metrics

This runbook documents how to continuously calibrate the correlation between an optimizer’s abstract EXPLAIN cost and the wall-clock latency a query actually incurs, so automated baselining stays trustworthy when hardware or workload shifts underneath it.

The optimizer’s cost metric is a dimensionless unit of estimated work, not a prediction of execution time. A total_cost of 4200 says nothing about milliseconds until you anchor it to observed telemetry on a specific instance class and storage tier. This mapping is the anchoring step. It lives immediately downstream of the cost normalization stage and feeds the regression threshold logic: if the cost-to-latency relationship drifts, every threshold expressed in cost units silently loses meaning. It is a component of the Core Architecture & Baselining Fundamentals reference architecture, and it is what lets a pipeline reason about “slower” without re-benchmarking every query on every deploy.
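As a minimal sketch of what anchoring means in practice, the snippet below fits one milliseconds-per-cost-unit slope from paired observations. The function name, the sample values, and the choice of a least-squares fit through the origin are illustrative assumptions, not part of the pipeline described here.

```python
from typing import Sequence


def fit_ms_per_cost(costs: Sequence[float], latencies_ms: Sequence[float]) -> float:
    """Least-squares slope through the origin: latency_ms ~= slope * cost.

    Hypothetical helper. Assumes both sequences come from one instance
    class and storage tier, since the slope is only meaningful per tier.
    """
    if len(costs) != len(latencies_ms) or not costs:
        raise ValueError("need equally sized, non-empty samples")
    denom = sum(c * c for c in costs)
    if denom == 0.0:
        raise ValueError("all costs are zero; slope is undefined")
    return sum(c * ms for c, ms in zip(costs, latencies_ms)) / denom


# Made-up observations, for illustration only.
slope = fit_ms_per_cost([1000.0, 2000.0, 4000.0], [300.0, 600.0, 1200.0])
predicted_ms = slope * 4200.0  # what total_cost = 4200 would mean on this tier
```

With these invented numbers the slope is 0.3 ms per cost unit. The same total_cost on a different tier would need its own fit.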

Symptom identification and production thresholds

Treat the mapping as unstable and page on it when any of these exact conditions are breached over a rolling 7-day window:

Correlation decay: the Pearson coefficient between estimated_cost and actual_latency_ms for a tracked queryid drops below 0.75. Below 0.60, freeze cost-unit thresholds entirely and fall back to raw latency comparison.
Cost-flat, latency-hot divergence: the estimated cost delta stays within $\pm 5\%$ while the P95 of mean_exec_time rises by more than 40%. This is the canonical signature of an I/O-weighting or cardinality error the planner cannot see.
Buffer ratio inversion: shared_blks_hit / (shared_blks_hit + shared_blks_read) falls below 0.90 on a query that historically ran above 0.99, indicating cache eviction or a cold storage tier.
Parallel worker shortfall: EXPLAIN (ANALYZE) reports Workers Planned: N (N ≥ 2) but Workers Launched: 0, inflating real latency by a factor the cost model never charged for.
Per-cost latency slope shift: the fitted actual_latency_ms / estimated_cost ratio moves more than $\pm 25\%$ from its trailing 14-day median, meaning the same cost now buys different time (typically after a storage or instance-class change).

These patterns cluster after ANALYZE/OPTIMIZE TABLE runs, storage tier migrations, connection-pool saturation, or cloud instance-class changes.
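The paging conditions above can be expressed as one pure function over pre-aggregated window statistics. This is a sketch under stated assumptions: the MappingWindow shape, its field names, and the condition labels are hypothetical, and only the numeric thresholds come from the list above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class MappingWindow:
    """Rolling 7-day aggregates for one tracked queryid (hypothetical shape)."""
    correlation: float            # Pearson r, estimated_cost vs actual_latency_ms
    cost_delta_pct: float         # change in estimated cost, in percent
    p95_latency_delta_pct: float  # change in P95 latency, in percent
    hit_ratio: float
    historical_hit_ratio: float
    workers_planned: int
    workers_launched: int
    slope_ratio: float            # current ms-per-cost slope / trailing 14-day median


def breached_conditions(w: MappingWindow) -> List[str]:
    """Return the names of the paging conditions this window breaches."""
    breached = []
    if w.correlation < 0.60:
        breached.append("correlation_freeze")  # fall back to raw latency
    elif w.correlation < 0.75:
        breached.append("correlation_decay")
    if abs(w.cost_delta_pct) <= 5.0 and w.p95_latency_delta_pct > 40.0:
        breached.append("cost_flat_latency_hot")
    if w.historical_hit_ratio > 0.99 and w.hit_ratio < 0.90:
        breached.append("buffer_ratio_inversion")
    if w.workers_planned >= 2 and w.workers_launched == 0:
        breached.append("parallel_worker_shortfall")
    if abs(w.slope_ratio - 1.0) > 0.25:
        breached.append("slope_shift")
    return breached
```

Keeping the check free of I/O means the thresholds can be unit-tested separately from whatever collects the aggregates.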

Root cause analysis

Failure domain 1: static cost constants versus dynamic I/O

EXPLAIN costs are computed from fixed constants (seq_page_cost, random_page_cost, cpu_tuple_cost, cpu_index_tuple_cost) that assume one static hardware profile. Real latency is governed by NVMe-versus-SATA IOPS variance, OS page-cache eviction, and cloud block-storage burst-credit exhaustion, none of which the planner can observe. Inspect the live constants alongside the actual read pressure per statement:

SQL

-- PostgreSQL: what the planner currently believes about I/O
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('random_page_cost', 'seq_page_cost',
               'effective_cache_size', 'cpu_tuple_cost');

-- Actual read pressure per statement
SELECT queryid, calls, mean_exec_time,
       shared_blks_hit, shared_blks_read,
       shared_blks_hit::float
         / NULLIF(shared_blks_hit + shared_blks_read, 0) AS hit_ratio
FROM pg_stat_statements
ORDER BY shared_blks_read DESC
LIMIT 20;

Failure domain 2: statistics drift and histogram staleness

Bulk inserts, n_distinct misestimation, and outdated histograms make the optimizer confident and wrong: costs drop while latency climbs. Compare planner beliefs against reality:

SQL

-- PostgreSQL: stale-statistics smell test
SELECT relname, n_live_tup, n_mod_since_analyze,
       last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE n_mod_since_analyze > 0.1 * GREATEST(n_live_tup, 1)
ORDER BY n_mod_since_analyze DESC;

BASH

# MySQL: histogram + cardinality currency for a suspect table
mysql -e "SELECT * FROM information_schema.column_statistics \
          WHERE table_name='orders'\G"
mysql -e "SHOW INDEX FROM orders" | awk '{print $3, $7}'

Failure domain 3: missing buffer visibility on MySQL

MySQL’s EXPLAIN has no equivalent to PostgreSQL’s EXPLAIN (BUFFERS), so the physical read profile has to be reconstructed from performance_schema before any mapping is possible:

SQL

-- MySQL: reconstruct the execution profile EXPLAIN omits
SELECT DIGEST_TEXT,
       COUNT_STAR,
       AVG_TIMER_WAIT/1e9      AS avg_ms,
       SUM_ROWS_EXAMINED,
       SUM_ROWS_SENT
FROM performance_schema.events_statements_summary_by_digest
ORDER BY AVG_TIMER_WAIT DESC
LIMIT 20;
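Once a digest row is fetched, it can be reduced to the two numbers the mapping needs. The sketch below assumes the row arrives as a mapping keyed by the column names selected above. The helper name is hypothetical, and treating rows examined per row sent as a rough stand-in for read amplification is an assumption, not something MySQL reports as a buffer metric.

```python
from typing import Mapping, Optional

PICOSECONDS_PER_MS = 1e9  # performance_schema timer columns are in picoseconds


def summarize_digest(row: Mapping[str, float]) -> dict:
    """Convert one digest row into avg latency (ms) and a scan-amplification proxy."""
    avg_ms = row["AVG_TIMER_WAIT"] / PICOSECONDS_PER_MS
    sent = row["SUM_ROWS_SENT"]
    # Rows examined per row sent; None when the statement returned no rows.
    examined_per_sent: Optional[float] = (
        row["SUM_ROWS_EXAMINED"] / sent if sent else None
    )
    return {"avg_ms": avg_ms, "examined_per_sent": examined_per_sent}


# Made-up row, for illustration only.
summary = summarize_digest(
    {"AVG_TIMER_WAIT": 2.5e9, "SUM_ROWS_EXAMINED": 100000, "SUM_ROWS_SENT": 100}
)
```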

Failure domain 4: parallel execution misestimation

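The planner charges for workers it assumed it would get. max_parallel_workers_per_gather caps how many workers a single Gather node may plan for, but the workers themselves are drawn at runtime from the shared pool bounded by max_parallel_workers and max_worker_processes. When that pool is exhausted, or memory pressure strips workers at runtime, the cost stays low but latency does not: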

SQL

-- Confirm the gap between planned and launched workers
EXPLAIN (ANALYZE, BUFFERS, VERBOSE, FORMAT JSON)
SELECT ... ;   -- inspect "Workers Planned" vs "Workers Launched"
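Reading the JSON by eye does not scale across many plans. A minimal sketch of automating the comparison follows. It assumes the standard EXPLAIN (ANALYZE, FORMAT JSON) layout, where each node may carry "Workers Planned" and "Workers Launched" and nests its children under "Plans". The function names and the sample plan are made up for illustration.

```python
import json
from typing import Iterator, List, Tuple


def _walk(node: dict) -> Iterator[dict]:
    """Yield a plan node and all of its descendants, depth first."""
    yield node
    for child in node.get("Plans", []):
        yield from _walk(child)


def worker_shortfalls(explain_json: str) -> List[Tuple[str, int, int]]:
    """Return (node type, planned, launched) for nodes that launched fewer workers than planned."""
    root = json.loads(explain_json)[0]["Plan"]
    found = []
    for node in _walk(root):
        planned = node.get("Workers Planned")
        launched = node.get("Workers Launched")
        if planned is not None and launched is not None and launched < planned:
            found.append((node["Node Type"], planned, launched))
    return found


# Hand-written sample plan, not captured from a real server.
SAMPLE = json.dumps([{
    "Plan": {
        "Node Type": "Gather",
        "Workers Planned": 2,
        "Workers Launched": 0,
        "Plans": [{"Node Type": "Parallel Seq Scan", "Relation Name": "orders"}],
    }
}])
shortfalls = worker_shortfalls(SAMPLE)
```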

Step-by-step remediation

The durable fix is a continuous calibration worker that pairs each captured EXPLAIN cost with live pg_stat_statements telemetry and tracks the rolling correlation. Capture the plan cost on a replica so you never touch the primary hot path — the same discipline described in capturing EXPLAIN plans without impacting production performance.

Step 1 — Stand up the async calibrator. It reads execution stats through asyncpg, emits structured events with structlog, and instruments the correlation path with OpenTelemetry so the mapping itself is observable.

PYTHON

import asyncio
import math
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, Optional

import asyncpg
import structlog
from opentelemetry import metrics, trace

log = structlog.get_logger("cost_latency_calibrator")
tracer = trace.get_tracer("queryplan.cost_latency")
meter = metrics.get_meter("queryplan.cost_latency")

CORRELATION_GAUGE = meter.create_gauge(
    "db_cost_latency_correlation",
    description="Rolling Pearson r between estimated cost and mean latency",
)
HIT_RATIO_GAUGE = meter.create_gauge(
    "db_buffer_hit_ratio",
    description="shared_blks_hit / (hit + read) for a tracked queryid",
)

CORRELATION_THRESHOLD = 0.75
MIN_SAMPLES = 30
WINDOW_SIZE = 1000


@dataclass(frozen=True)
class CostLatencySample:
    query_hash: str
    estimated_cost: float     # from EXPLAIN (FORMAT JSON), captured on a replica
    actual_latency_ms: float  # mean_exec_time from pg_stat_statements
    buffer_hit_ratio: float


class CostLatencyCalibrator:
    def __init__(self, pool: asyncpg.Pool) -> None:
        self._pool = pool
        self._windows: Dict[str, Deque[CostLatencySample]] = {}

    async def fetch_execution_stats(self, queryid: int) -> Optional[dict]:
        row = await self._pool.fetchrow(
            """
            SELECT mean_exec_time,
                   shared_blks_hit::float
                     / NULLIF(shared_blks_hit + shared_blks_read, 0) AS hit_ratio
            FROM pg_stat_statements
            WHERE queryid = $1
            """,
            queryid,
        )
        if row is None:
            return None
        return {
            "mean_exec_time_ms": float(row["mean_exec_time"]),
            "buffer_hit_ratio": float(row["hit_ratio"] or 0.0),
        }

    async def add_sample(
        self, query_hash: str, estimated_cost: float, queryid: int
    ) -> None:
        stats = await self.fetch_execution_stats(queryid)
        if stats is None:
            log.warning("no_stats_for_queryid", query_hash=query_hash, queryid=queryid)
            return
        window = self._windows.setdefault(query_hash, deque(maxlen=WINDOW_SIZE))
        window.append(
            CostLatencySample(
                query_hash=query_hash,
                estimated_cost=estimated_cost,
                actual_latency_ms=stats["mean_exec_time_ms"],
                buffer_hit_ratio=stats["buffer_hit_ratio"],
            )
        )
        HIT_RATIO_GAUGE.set(stats["buffer_hit_ratio"], {"query_hash": query_hash})

    def _pearson(self, window: Deque[CostLatencySample]) -> float:
        n = len(window)
        if n < MIN_SAMPLES:
            return 1.0  # insufficient data — assume stable
        xs = [s.estimated_cost for s in window]
        ys = [s.actual_latency_ms for s in window]
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        vx = math.sqrt(sum((x - mx) ** 2 for x in xs))
        vy = math.sqrt(sum((y - my) ** 2 for y in ys))
        if vx == 0.0 or vy == 0.0:
            return 1.0
        return cov / (vx * vy)

    async def evaluate_and_alert(self, query_hash: str) -> bool:
        with tracer.start_as_current_span("evaluate_correlation") as span:
            window = self._windows.get(query_hash, deque())
            corr = self._pearson(window)
            span.set_attribute("query_hash", query_hash)
            span.set_attribute("correlation", corr)
            CORRELATION_GAUGE.set(corr, {"query_hash": query_hash})
            if corr < CORRELATION_THRESHOLD:
                log.warning(
                    "cost_latency_regression",
                    query_hash=query_hash,
                    correlation=round(corr, 3),
                    threshold=CORRELATION_THRESHOLD,
                )
                return True
            return False


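async def run_once(
    pool: asyncpg.Pool, calibrator: CostLatencyCalibrator, tracked: Dict[str, int]
) -> None:
    # The caller creates the pool and the calibrator once and reuses them on
    # every scheduled run; a fresh calibrator per run would start with empty
    # windows, never reach MIN_SAMPLES, and always report a correlation of 1.0.
    for query_hash, queryid in tracked.items():
        # capture_explain_cost is not defined in this listing: it stands for the
        # replica-side EXPLAIN capture helper and must be supplied by the caller.
        cost = await capture_explain_cost(pool, queryid)
        await calibrator.add_sample(query_hash, cost, queryid)
        await calibrator.evaluate_and_alert(query_hash)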

Run run_once on a schedule, for example every 60 seconds against a read replica. A healthy query produces one evaluate_correlation span and no warning line in the structlog output; a breach emits cost_latency_regression query_hash=… correlation=0.71 threshold=0.75.
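The listing above calls capture_explain_cost without showing it. One piece such a helper would need is pulling the root cost out of EXPLAIN (FORMAT JSON) output, sketched below. The function name and sample plan are hypothetical, and how the helper obtains runnable query text for a given queryid is deliberately left out.

```python
import json


def extract_total_cost(explain_json: str) -> float:
    """Pull the root node's Total Cost out of EXPLAIN (FORMAT JSON) text.

    The output is a JSON array with one element whose "Plan" key holds
    the root plan node.
    """
    plans = json.loads(explain_json)
    return float(plans[0]["Plan"]["Total Cost"])


# Hand-written sample, not captured from a real server.
sample = json.dumps([{
    "Plan": {
        "Node Type": "Seq Scan",
        "Relation Name": "orders",
        "Startup Cost": 0.0,
        "Total Cost": 4200.0,
        "Plan Rows": 100000,
        "Plan Width": 48,
    }
}])
cost = extract_total_cost(sample)
```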

Step 2 — Emit the alert rule. Ship the gauge to Prometheus and gate on the same 0.75 floor the calibrator enforces.

YAML

groups:
  - name: query_cost_regression
    rules:
      - alert: CostLatencyMappingDegraded
        expr: db_cost_latency_correlation < 0.75
        for: 10m
        labels:
          severity: warning
          team: database-sre
        annotations:
          summary: "Optimizer cost model diverging from physical latency"
          description: "Query {{ $labels.query_hash }} correlation dropped to {{ $value }}; verify storage I/O and run ANALYZE."

Step 3 — Recalibrate the cost constants, scoped and reversible. When the divergence is I/O-driven, steer the planner with a session-local override rather than a global ALTER SYSTEM, then observe for at least 15 minutes before promoting:

SQL

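-- Steer toward sequential scans when random I/O is genuinely slow.
-- SET LOCAL only takes effect inside a transaction block and reverts when it ends.
BEGIN;
SET LOCAL random_page_cost = 12.0;
EXPLAIN SELECT ... ;   -- re-plan the suspect query under the override
ROLLBACK;
-- Refresh statistics that drove a bad estimate
ANALYZE VERBOSE orders (customer_id, status);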

For MySQL, refresh the equivalent statistics and inspect the recomputed cost:

SQL

ANALYZE TABLE orders;
EXPLAIN FORMAT=JSON SELECT ... ;   -- read query_cost under cost_info
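To feed the MySQL cost into the same calibration path, the query_cost field has to be parsed out of the JSON. The sketch below assumes the traditional EXPLAIN FORMAT=JSON layout with a top-level query_block, where query_cost is serialized as a string. The function name and sample document are illustrative only.

```python
import json


def extract_query_cost(explain_json: str) -> float:
    """Read cost_info.query_cost from MySQL EXPLAIN FORMAT=JSON text."""
    doc = json.loads(explain_json)
    # MySQL emits query_cost as a quoted string, so convert explicitly.
    return float(doc["query_block"]["cost_info"]["query_cost"])


# Hand-written sample, not captured from a real server.
sample = json.dumps({
    "query_block": {
        "select_id": 1,
        "cost_info": {"query_cost": "1250.40"},
        "table": {"table_name": "orders", "access_type": "ALL"},
    }
})
mysql_cost = extract_query_cost(sample)
```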

Step 4 — Persist the calibration. Track the accepted constants and the correlation floor per cluster tier in a version-controlled cost_baseline.yaml, so a recalibration is auditable and a bad change is a one-commit revert:

YAML

cluster_tier: gp3-16xlarge
random_page_cost: 12.0
seq_page_cost: 1.0
correlation_floor: 0.80
observed_slope_ms_per_cost: 0.34
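A persisted slope is only useful if something compares fresh observations against it. The sketch below checks recent samples against observed_slope_ms_per_cost using the 25% tolerance from the symptom list. The function name and sample data are assumptions, and the baseline is passed as a plain number so the example does not depend on a YAML parser.

```python
from statistics import median
from typing import Sequence, Tuple


def slope_drift(
    baseline_slope: float,
    recent_ms: Sequence[float],
    recent_costs: Sequence[float],
    tolerance: float = 0.25,
) -> Tuple[float, bool]:
    """Return (current median ms-per-cost, whether it drifted past tolerance)."""
    ratios = [ms / cost for ms, cost in zip(recent_ms, recent_costs) if cost > 0]
    if not ratios:
        raise ValueError("no samples with a positive cost")
    current = median(ratios)
    relative_change = (current - baseline_slope) / baseline_slope
    return current, abs(relative_change) > tolerance


# Made-up samples, for illustration only.
stable = slope_drift(0.34, [330.0, 700.0, 1440.0], [1000.0, 2000.0, 4000.0])
shifted = slope_drift(0.34, [500.0, 1040.0, 1920.0], [1000.0, 2000.0, 4000.0])
```

The median is used here rather than the mean so that a single slow outlier does not trigger a recalibration.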

Verification checklist

[ ] db_cost_latency_correlation for every tracked queryid is at or above 0.75 for a sustained 30 minutes after the change.
[ ] P95 mean_exec_time on the affected queries returned to within $\pm 10\%$ of its pre-incident baseline.
[ ] shared_blks_hit / (hit + read) recovered above 0.90 (or the query’s historical norm) with no new cold-read spikes.
[ ] Workers Launched now equals Workers Planned for any plan that previously fell short.
[ ] Any SET LOCAL override was either promoted through cost_baseline.yaml or reverted — no undocumented global ALTER SYSTEM remains.
[ ] The recalibration is committed with the instance tier, the new constants, and the observed cost-to-latency slope.
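The measurable items in this checklist can be evaluated mechanically before a human signs off. The following is a sketch only: the function name, argument names, and result keys are hypothetical, and the inputs are assumed to be collected elsewhere over the 30-minute observation window.

```python
from typing import Dict, Sequence


def recovery_checks(
    correlations_30m: Sequence[float],
    p95_now_ms: float,
    p95_baseline_ms: float,
    hit_ratio: float,
    workers_planned: int,
    workers_launched: int,
) -> Dict[str, bool]:
    """Evaluate the numeric checklist items; True means the item passes."""
    return {
        # Every sample in the window must sit at or above the 0.75 floor.
        "correlation_sustained": bool(correlations_30m)
        and min(correlations_30m) >= 0.75,
        # P95 latency within 10% of the pre-incident baseline.
        "p95_within_10pct": abs(p95_now_ms - p95_baseline_ms)
        <= 0.10 * p95_baseline_ms,
        "hit_ratio_recovered": hit_ratio > 0.90,
        "workers_match": workers_launched == workers_planned,
    }


# Made-up post-change readings, for illustration only.
checks = recovery_checks([0.81, 0.79, 0.83], 105.0, 100.0, 0.97, 2, 2)
```

The last two checklist items, on documenting overrides and committing the recalibration, are process checks and stay manual.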

Compatibility and engine-specific notes

| Capability | PostgreSQL | MySQL 8.x | Distributed SQL (CockroachDB / Citus) |
| --- | --- | --- | --- |
| Cost source | `EXPLAIN (FORMAT JSON)` → `Total Cost` | `EXPLAIN FORMAT=JSON` → `cost_info.query_cost` | Per-node cost fragments; sum spans network hops |
| Latency source | `pg_stat_statements.mean_exec_time` | `events_statements_summary_by_digest.AVG_TIMER_WAIT` | Cluster-wide statement stats; add inter-node RTT |
| Native buffer metrics | Yes, via `EXPLAIN (BUFFERS)` | No; reconstruct from `performance_schema` | Partial; storage layer is remote by design |
| Tunable I/O constants | `random_page_cost`, `seq_page_cost` | Optimizer cost constants via `mysql.engine_cost` | Locality/latency multipliers, not page costs |
| Parallelism signal | `Workers Planned` vs `Workers Launched` | Limited parallel query support | Distributed across nodes; measure per-fragment |

Because the emitted mapping is anchored per engine and instance class, an identical logical plan on PostgreSQL and MySQL can still produce comparable regression signals once each side is calibrated — the mechanism that lets downstream cost-delta tracking apply uniform multipliers without engine-specific branching.

Related

Cost Estimation Mapping Across PostgreSQL and MySQL — the upstream stage that normalizes raw optimizer costs into the dimensionless unit this page anchors to latency.
Defining Regression Thresholds for Query Plans — the downstream consumer that breaks the moment this correlation drifts.
Tracking Cost Deltas Across Baseline Versions — how calibrated costs turn into version-over-version regression signals.
Capturing EXPLAIN Plans Without Impacting Production Performance — the replica-side capture discipline behind Step 1.
