How to Generate Deterministic Query Plan Hashes in Python

Generating a deterministic query plan hash in Python means turning a raw EXPLAIN (FORMAT JSON) tree into one stable 64-character SHA-256 fingerprint that never changes unless the plan’s structure genuinely changes. This runbook is the concrete Python implementation of the plan hashing stage: it strips engine-specific volatility, canonicalizes the structural tree, and serializes to a fixed byte representation so downstream regression threshold logic can attach historical telemetry to the correct baseline. It operationalizes the reference models in Core Architecture & Baselining Fundamentals and consumes the unified plan vocabulary produced by normalizing query plans for cross-engine comparison.

Symptom Identification & Production Thresholds

Non-deterministic hashing surfaces as false-positive drift alerts in a pipeline where nothing structural actually changed. CI gating must treat these as hard breaches, not soft warnings. The following conditions each indicate the hasher itself is unstable and must be fixed before it can gate merges:

Phantom hash mismatch on redeploy: the same SQL against the same engine version yields plan_hash != baseline on one or more redeploys with zero schema or query changes. Any non-zero rate means volatile fields are leaking into the digest.
Hash flip after a patch-level upgrade: plan_hash changes across a patch-level release while the engine major version is held constant (e.g. $16.4 \to 16.5$). Structural fingerprints must survive patch releases; a flip means engine metadata (JSON key order, statistics timestamps) is unstripped.
Baseline attach failure: more than 5% of captured plans fail to join to a historical baseline row because the fingerprint drifted, breaking p99 latency continuity for those query_fingerprint keys.
Serialization jitter: two hashes of the same in-memory plan object differ across processes. This is a non-canonical serialization defect such as unsorted keys, ensure_ascii divergence, or float precision beyond 4 decimal places.
Hash latency breach: compute_hash p95 exceeds 5 ms per plan on the ingestion hot path, indicating the canonicalizer is re-parsing or deep-copying instead of streaming a single pass.
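
The conditions above can be folded into one health check. What follows is a minimal sketch under stated assumptions: `HasherHealth`, its field names, and the breach labels are hypothetical, and only the numeric limits (any phantom mismatch, any patch-level flip, more than 5% attach failures, any cross-process mismatch, p95 above 5 ms) come from the list.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class HasherHealth:
    """Hypothetical counters collected over one evaluation window."""
    phantom_mismatches: int        # redeploys with a hash change but no schema/query change
    patch_upgrade_flips: int       # hash changes across a patch-level engine upgrade
    plans_captured: int
    plans_unattached: int          # captures that failed to join a baseline row
    cross_process_mismatches: int  # same in-memory plan, different digest
    hash_p95_ms: float


def hasher_breaches(h: HasherHealth) -> List[str]:
    """Return a label for every threshold in the list that is breached."""
    breaches = []
    if h.phantom_mismatches > 0:
        breaches.append("phantom_hash_mismatch")
    if h.patch_upgrade_flips > 0:
        breaches.append("patch_upgrade_flip")
    if h.plans_captured and h.plans_unattached / h.plans_captured > 0.05:
        breaches.append("baseline_attach_failure")
    if h.cross_process_mismatches > 0:
        breaches.append("serialization_jitter")
    if h.hash_p95_ms > 5.0:
        breaches.append("hash_latency_breach")
    return breaches
```

An empty list means the hasher is stable enough to gate merges; any entry should fail the CI job rather than log a warning.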

Root Cause Analysis

Instability almost always traces to one of three failure domains. Each carries a command that confirms or eliminates it.

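Volatile-field leakage. Runtime counters, timings, and session-scoped identifiers embedded in the plan change on every run. Dump the keys of the root Plan node your engine actually emits so you can prove which are volatile; runtime fields such as timings and buffer counters appear only when ANALYZE (and BUFFERS) is added to the EXPLAIN options: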

BASH

psql "$DSN" -qtAc "EXPLAIN (FORMAT JSON) SELECT 1" \
  | python3 -c "import sys,json; print('\n'.join(sorted(json.load(sys.stdin)[0]['Plan'])))"

Any key matching a timing (Actual Total Time), a counter (Shared Hit Blocks), or planning metadata (Planning Time, which sits beside Plan at the top level of the output rather than inside it) must be in VOLATILE_KEYS before hashing.
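
One way to find such keys empirically is to diff two captures of the same unchanged query and report every key whose value moved. The helper below is a sketch, not part of the runbook's hasher: `volatile_key_candidates` is a hypothetical name, and it assumes both captures share the same tree shape.

```python
from typing import Any, Set


def volatile_key_candidates(a: Any, b: Any) -> Set[str]:
    """Collect dict keys whose values differ between two captures of one plan."""
    found: Set[str] = set()
    if isinstance(a, dict) and isinstance(b, dict):
        for key in a.keys() | b.keys():
            if key not in a or key not in b:
                found.add(key)  # present in only one capture
                continue
            va, vb = a[key], b[key]
            if isinstance(va, (dict, list)) and isinstance(vb, (dict, list)):
                nested = volatile_key_candidates(va, vb)
                # A container that differs without a differing nested key
                # (e.g. a list of scalars) is attributed to the key itself.
                found |= nested if nested else ({key} if va != vb else set())
            elif va != vb:
                found.add(key)
    elif isinstance(a, list) and isinstance(b, list):
        for item_a, item_b in zip(a, b):
            found |= volatile_key_candidates(item_a, item_b)
    return found


run_1 = [{"Plan": {"Node Type": "Seq Scan", "Relation Name": "orders",
                   "Actual Total Time": 0.41}, "Execution Time": 0.52}]
run_2 = [{"Plan": {"Node Type": "Seq Scan", "Relation Name": "orders",
                   "Actual Total Time": 0.37}, "Execution Time": 0.49}]
print(sorted(volatile_key_candidates(run_1, run_2)))
# ['Actual Total Time', 'Execution Time']
```

Every key this reports for a structurally unchanged query is a candidate for VOLATILE_KEYS; keys that differ because the plan really changed are not.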

Non-canonical serialization. Key ordering drift is the most common byte-level cause. Confirm two serializations of the same object are identical before trusting the digest:

BASH

python3 -c "import json; d={'b':1,'a':2}; \
print(json.dumps(d,sort_keys=True,separators=(',',':')) == json.dumps({'a':2,'b':1},sort_keys=True,separators=(',',':')))"

Expected output: True. If your pipeline omits sort_keys=True, dict insertion order leaks into the hash.
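
To see how serialization options reach the digest itself, the sketch below hashes the same logical node with and without `sort_keys`, and serializes a non-ASCII value with both `ensure_ascii` settings. The sample node and the `digest` helper are illustrative assumptions, not part of the runbook's hasher.

```python
import hashlib
import json
from typing import Any


def digest(obj: Any, *, sort_keys: bool) -> str:
    """SHA-256 of a compact JSON serialization."""
    payload = json.dumps(obj, sort_keys=sort_keys, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Same logical node, different key insertion order.
first = {"Node Type": "Seq Scan", "Relation Name": "orders"}
second = {"Relation Name": "orders", "Node Type": "Seq Scan"}

unsorted_equal = digest(first, sort_keys=False) == digest(second, sort_keys=False)
sorted_equal = digest(first, sort_keys=True) == digest(second, sort_keys=True)
print(unsorted_equal, sorted_equal)  # False True

# ensure_ascii changes the bytes for any non-ASCII identifier.
name = {"Relation Name": "café"}
ascii_bytes = json.dumps(name, ensure_ascii=True).encode("utf-8")
utf8_bytes = json.dumps(name, ensure_ascii=False).encode("utf-8")
print(ascii_bytes == utf8_bytes)  # False
```

Either setting of `ensure_ascii` is deterministic on its own; the defect appears when two producers of the same hash disagree on it.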

Commutative child reordering. The optimizer may emit the two inputs of a hash join in either order without changing semantics. Compare the child arrays of a symmetric join across runs; if Plans[0] and Plans[1] swap, an unsorted child list produces two different hashes for one logical plan. Because children are ordered by their serialized subtrees, cost fields normalized by cost estimation mapping across PostgreSQL and MySQL must already be canonical before the sort runs.
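
The effect of sorting only commutative children can be shown in isolation. This is a minimal sketch: `sort_commutative_children` is a hypothetical helper, and the join set mirrors the one the runbook's canonicalizer uses.

```python
import json
from typing import Any, Dict

COMMUTATIVE = {"Hash Join", "Merge Join"}  # same set as the runbook's canonicalizer


def sort_commutative_children(node: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy with Plans[] ordered canonically under commutative joins."""
    children = [sort_commutative_children(c) for c in node.get("Plans", [])]
    if node.get("Node Type") in COMMUTATIVE:
        # Order children by their canonical serialization.
        children.sort(key=lambda c: json.dumps(c, sort_keys=True))
    out = {k: v for k, v in node.items() if k != "Plans"}
    if children:
        out["Plans"] = children
    return out


scan_a = {"Node Type": "Seq Scan", "Relation Name": "orders"}
scan_b = {"Node Type": "Seq Scan", "Relation Name": "customers"}

hash_join = {"Node Type": "Hash Join", "Plans": [scan_a, scan_b]}
hash_join_swapped = {"Node Type": "Hash Join", "Plans": [scan_b, scan_a]}
nested_loop = {"Node Type": "Nested Loop", "Plans": [scan_a, scan_b]}
nested_loop_swapped = {"Node Type": "Nested Loop", "Plans": [scan_b, scan_a]}

same_hash_join = (sort_commutative_children(hash_join)
                  == sort_commutative_children(hash_join_swapped))
same_nested_loop = (sort_commutative_children(nested_loop)
                    == sort_commutative_children(nested_loop_swapped))
print(same_hash_join, same_nested_loop)  # True False
```

The swapped hash join collapses to one canonical tree, while the nested loop keeps its inner/outer order and therefore stays distinguishable.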

Step-by-Step Remediation

1. Build the canonicalization core

The canonicalizer parses the raw plan, strips volatile fields, enforces deterministic key ordering, rounds numeric precision, and serializes to a fixed byte representation. Structural normalization must precede the cryptographic operation — never mask differences after hashing.

PYTHON

import hashlib
import json
import re
import structlog
from typing import Any, Dict, FrozenSet, Union
from opentelemetry import trace

log = structlog.get_logger("plan_hasher")
tracer = trace.get_tracer("plan_hasher")


class PlanCanonicalizer:
    # Engine-agnostic volatile fields that cause hash drift.
    VOLATILE_KEYS: FrozenSet[str] = frozenset({
        "execution_time", "start_time", "end_time", "node_id", "memory_address",
        "stats_last_update", "temp_table_name", "session_id",
        "actual_rows", "actual_loops", "planning_time", "triggers", "workers_planned",
        # PostgreSQL EXPLAIN (ANALYZE) runtime fields
        "Actual Total Time", "Actual Startup Time", "Actual Rows", "Actual Loops",
        "Shared Hit Blocks", "Shared Read Blocks", "Temp Written Blocks",
        "Planning Time", "Execution Time", "Workers Planned", "Workers Launched",
    })
    # Child arrays whose order is non-semantic and must be sorted for stability.
    COMMUTATIVE_JOINS: FrozenSet[str] = frozenset({"Hash Join", "Merge Join"})

    @staticmethod
    def _normalize_value(val: Any) -> Any:
        """Normalize scalars to prevent IEEE 754 and whitespace drift."""
        if isinstance(val, float):
            return round(val, 4)
        if isinstance(val, str):
            return re.sub(r"\s+", " ", val.strip())
        return val

    @classmethod
    def _canonicalize_node(cls, node: Any, parent_type: str = "") -> Any:
        """Recursively canonicalize a plan tree node."""
        if isinstance(node, dict):
            node_type = node.get("Node Type", parent_type)
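            filtered = {
                # Pass the join type down only through "Plans" so that other
                # lists on a join node (e.g. "Output") keep their order.
                k: cls._canonicalize_node(v, node_type if k == "Plans" else "")
                for k, v in node.items()
                if k not in cls.VOLATILE_KEYS
            }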
            return dict(sorted(filtered.items()))
        if isinstance(node, list):
            children = [cls._canonicalize_node(item, parent_type) for item in node]
            # Only sort children when the parent join is commutative; nested-loop
            # inner/outer order is semantic and must be preserved.
            if parent_type in cls.COMMUTATIVE_JOINS:
                return sorted(children, key=lambda c: json.dumps(c, sort_keys=True))
            return children
        return cls._normalize_value(node)

    @classmethod
    def compute_hash(cls, raw_plan: Union[str, Dict[str, Any]]) -> str:
        """Generate a deterministic SHA-256 fingerprint from a plan."""
        with tracer.start_as_current_span("compute_hash") as span:
            if isinstance(raw_plan, str):
                try:
                    plan_obj = json.loads(raw_plan)
                except json.JSONDecodeError as exc:
                    log.error("plan_parse_failed", error=str(exc))
                    raise ValueError("ERR_MALFORMED_JSON") from exc
            else:
                plan_obj = raw_plan

            canonical = cls._canonicalize_node(plan_obj)
            serialized = json.dumps(
                canonical, sort_keys=True, separators=(",", ":"), ensure_ascii=False
            ).encode("utf-8")
            digest = hashlib.sha256(serialized).hexdigest()
            span.set_attribute("plan.hash", digest)
            return digest

Expected: compute_hash returns the same 64-character hex digest for two EXPLAIN captures of one query whose only differences are timings and buffer counters.
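
The expected property can be exercised without structlog or OpenTelemetry installed. The block below is a dependency-free miniature of the same strip, round, serialize, digest pipeline, offered as a sketch for unit tests: `strip`, `mini_hash`, and the sample captures are assumptions, and `VOLATILE` is only a subset of the class's VOLATILE_KEYS.

```python
import hashlib
import json
from typing import Any

# Subset of the runbook's VOLATILE_KEYS, enough for this illustration.
VOLATILE = frozenset({"Actual Total Time", "Actual Rows",
                      "Shared Hit Blocks", "Execution Time"})


def strip(node: Any) -> Any:
    """Drop volatile keys and round floats to 4 decimals, recursively."""
    if isinstance(node, dict):
        return {k: strip(v) for k, v in node.items() if k not in VOLATILE}
    if isinstance(node, list):
        return [strip(item) for item in node]
    return round(node, 4) if isinstance(node, float) else node


def mini_hash(plan: Any) -> str:
    """SHA-256 over the stripped plan, serialized with sorted keys."""
    payload = json.dumps(strip(plan), sort_keys=True,
                         separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


capture_1 = [{"Plan": {"Node Type": "Index Scan", "Index Name": "orders_pkey",
                       "Total Cost": 8.29, "Actual Total Time": 0.031,
                       "Shared Hit Blocks": 3}, "Execution Time": 0.05}]
capture_2 = [{"Plan": {"Node Type": "Index Scan", "Index Name": "orders_pkey",
                       "Total Cost": 8.29, "Actual Total Time": 0.027,
                       "Shared Hit Blocks": 4}, "Execution Time": 0.04}]
seq_scan = [{"Plan": {"Node Type": "Seq Scan", "Relation Name": "orders",
                      "Total Cost": 35.5}}]

assert mini_hash(capture_1) == mini_hash(capture_2)  # only volatile fields differ
assert mini_hash(capture_1) != mini_hash(seq_scan)   # structural change
```

The same two assertions, pointed at PlanCanonicalizer.compute_hash, make a reasonable first regression test for the real class.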

2. Wire it to a live capture path

Pull the plan on a read replica to keep hashing off the production hot path — the same isolation principle used when capturing EXPLAIN plans without impacting production performance. See the PostgreSQL EXPLAIN documentation for output semantics.

PYTHON

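import hashlib

import asyncpg


async def hash_from_live_query(dsn: str, sql: str) -> str:
    """EXPLAIN a statement on the given DSN and return its plan hash."""
    conn = await asyncpg.connect(dsn)
    try:
        # EXPLAIN cannot take the statement as a bind parameter, so `sql` is
        # interpolated; pass only trusted, pre-vetted statements.
        rows = await conn.fetch(f"EXPLAIN (FORMAT JSON) {sql}")
        raw_plan = rows[0][0]  # one row, one column holding the JSON document
        digest = PlanCanonicalizer.compute_hash(raw_plan)
        # Built-in hash() is salted per process, so derive a stable id instead.
        sql_id = hashlib.sha256(sql.encode("utf-8")).hexdigest()[:12]
        log.info("plan_hashed", digest=digest[:12], sql_id=sql_id)
        return digest
    finally:
        await conn.close()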

3. Gate merges against the stored baseline

Compare the new fingerprint against the version-tagged baseline. Source the numeric bands from the shared regression thresholds engine so CI gating and runtime alerting evaluate divergence identically.

YAML

# .github/workflows/plan-baseline-check.yml (excerpt)
- name: Validate query plan baseline
  run: |
    python -c "
    from plan_hasher import PlanCanonicalizer
    import sys, json
    raw = json.load(open('explain_output.json'))
    new_hash = PlanCanonicalizer.compute_hash(raw)
    baseline = open('.plan-baseline').read().strip()
    if new_hash != baseline:
        print('PLAN_DRIFT_DETECTED', new_hash)
        sys.exit(1)
    "

Block the merge on any mismatch unless the query latency improves by >= 10%, the plan node count decreases, or a senior DBA signs off a drift exception with a two-role approval (@platform-sre and @db-lead).
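
The exception rule can be written as a single predicate. This is a sketch under assumptions: `DriftEvidence`, `count_nodes`, and `should_block_merge` are hypothetical names, "improves by >= 10%" is read as relative improvement against the baseline latency, and the sign-off is modelled as both named roles appearing in an approver set.

```python
from dataclasses import dataclass
from typing import Any, FrozenSet

REQUIRED_APPROVERS: FrozenSet[str] = frozenset({"@platform-sre", "@db-lead"})


def count_nodes(plan: Any) -> int:
    """Count plan nodes, i.e. dicts that carry a "Node Type" key."""
    if isinstance(plan, dict):
        own = 1 if "Node Type" in plan else 0
        return own + sum(count_nodes(v) for v in plan.values())
    if isinstance(plan, list):
        return sum(count_nodes(item) for item in plan)
    return 0


@dataclass(frozen=True)
class DriftEvidence:
    """Hypothetical inputs gathered by the CI job for one drifted query."""
    baseline_latency_ms: float
    candidate_latency_ms: float
    baseline_nodes: int
    candidate_nodes: int
    approvers: FrozenSet[str] = frozenset()


def should_block_merge(baseline_hash: str, new_hash: str, ev: DriftEvidence) -> bool:
    """True when the hash drifted and no exception condition applies."""
    if new_hash == baseline_hash:
        return False
    improvement = 0.0
    if ev.baseline_latency_ms > 0:
        improvement = (ev.baseline_latency_ms - ev.candidate_latency_ms) / ev.baseline_latency_ms
    if improvement >= 0.10:
        return False  # latency improved by at least 10%
    if ev.candidate_nodes < ev.baseline_nodes:
        return False  # plan got structurally smaller
    if REQUIRED_APPROVERS <= ev.approvers:
        return False  # both roles approved the drift exception
    return True
```

Keeping the three escape hatches in one function makes it easier to have CI gating and runtime alerting reach the same verdict.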

4. Maintain a safe override manifest

Legitimate plan changes must not break CI indefinitely. Keep a .plan-overrides.json mapping query_fingerprint to allowed hashes with expiration dates, and store the previous hash alongside the new one so an automated rollback fires if post-deploy p99 latency degrades > 15%. Accepted baselines persist under the rules in security boundaries for baseline data storage.
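
As an illustration of what such a manifest and its two checks might look like, consider the sketch below. The JSON layout, the field names (`allowed_hashes`, `previous_hash`, `expires`), and both function names are assumptions; the runbook specifies only that the manifest maps query_fingerprint to allowed hashes with expiration dates and keeps the previous hash for rollback.

```python
import json
from datetime import date
from typing import Any, Dict, Optional

# Hypothetical manifest layout, shown inline instead of read from disk.
EXAMPLE_MANIFEST: Dict[str, Any] = json.loads("""
{
  "q_orders_by_customer": {
    "allowed_hashes": ["aaa111"],
    "previous_hash": "bbb222",
    "expires": "2030-01-31"
  }
}
""")


def override_allows(manifest: Dict[str, Any], query_fingerprint: str,
                    new_hash: str, today: date) -> bool:
    """True when an unexpired override lists new_hash for this query."""
    entry = manifest.get(query_fingerprint)
    if entry is None:
        return False
    # Assumption: the override remains valid through its expiry date.
    if date.fromisoformat(entry["expires"]) < today:
        return False
    return new_hash in entry["allowed_hashes"]


def rollback_target(manifest: Dict[str, Any], query_fingerprint: str,
                    baseline_p99_ms: float, post_deploy_p99_ms: float) -> Optional[str]:
    """Return the previous hash to restore when p99 degrades by more than 15%."""
    entry = manifest.get(query_fingerprint)
    if entry is None or baseline_p99_ms <= 0:
        return None
    degradation = (post_deploy_p99_ms - baseline_p99_ms) / baseline_p99_ms
    return entry["previous_hash"] if degradation > 0.15 else None
```

An expired entry falls back to the normal gate, so an override cannot silently become permanent.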

Verification Checklist

Run these after building or changing the canonicalizer, before it gates any merge:

[ ] compute_hash returns an identical digest for two EXPLAIN (ANALYZE) captures of one unchanged query.
[ ] Every timing, buffer counter, and Workers * field appears in VOLATILE_KEYS.
[ ] json.dumps uses sort_keys=True and separators=(",", ":"), and float rounding is fixed at 4 decimals.
[ ] A commutative Hash Join with swapped child order produces one stable hash.
[ ] A nested-loop inner/outer swap produces a different hash (semantic order preserved).
[ ] plan_hash survives a patch-level engine upgrade within the same major version.
[ ] compute_hash p95 stays <= 5 ms per plan under production ingestion load.
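
The latency item can be approximated offline before measuring under real load. The following is a rough bench sketch, not a substitute for production telemetry: `p95_latency_ms` and `stand_in_hash` are hypothetical names, and the stand-in exists only so the block runs without the runbook's dependencies.

```python
import hashlib
import json
import statistics
import time
from typing import Any, Callable, Iterable


def p95_latency_ms(hash_fn: Callable[[Any], str], plans: Iterable[Any]) -> float:
    """Time hash_fn once per plan and return the 95th-percentile latency in ms."""
    samples = []
    for plan in plans:
        start = time.perf_counter()
        hash_fn(plan)
        samples.append((time.perf_counter() - start) * 1000.0)
    # quantiles(n=20) returns 19 cut points; the last is the 95th percentile.
    return statistics.quantiles(samples, n=20)[-1]


def stand_in_hash(plan: Any) -> str:
    """Placeholder hasher; pass PlanCanonicalizer.compute_hash in real use."""
    payload = json.dumps(plan, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


sample_plans = [[{"Plan": {"Node Type": "Seq Scan", "Relation Name": f"t{i}"}}]
                for i in range(200)]
p95 = p95_latency_ms(stand_in_hash, sample_plans)
print(f"p95 = {p95:.3f} ms, within 5 ms budget: {p95 <= 5.0}")
```

Feeding it a sample of real captured plans gives a more honest number than synthetic single-node plans.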

Compatibility & Engine-Specific Notes

None of the engines below exposes a plan hash that is stable and comparable across engines, so this canonicalize-then-digest step is mandatory for every dialect. Normalize the differences below before hashing so a cross-engine baseline never false-positives.

| Concern | PostgreSQL | MySQL / MariaDB | Distributed SQL (CockroachDB / Yugabyte) |
| --- | --- | --- | --- |
| Plan format | `EXPLAIN (FORMAT JSON)` | `EXPLAIN FORMAT=JSON` | `EXPLAIN (FORMAT JSON)` (PG-compatible) |
| Tree root | list → `Plan` node | `query_block` object | list → `Plan` node |
| Volatile timings | `Actual Total Time`, `*Blocks` | `r_total_time_ms`, `r_rows` | per-node `execution latency` |
| Commutative children | `Plans[]` under `Hash Join` | `nested_loop[]` reorders | `Plans[]` under `hash-join` |
| Built-in plan hash | none; SHA-256 the normalized tree | none; normalize `query_block` first | partial (`plan gist`); still normalize |

For MySQL, flatten query_block into the same Node Type/Plans shape before calling compute_hash; the flattening rules live in normalizing query plans for cross-engine comparison. Once the fingerprint is stable, join-shape changes it detects feed directly into detecting join-type shifts in execution plans.
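
To make the target shape concrete, here is a deliberately small sketch of such a flattening. It is not the rule set from the linked page: the access-type mapping, the function names, and the choice of PostgreSQL-style node names are all assumptions, and it handles only a single-table `query_block` or one with a `nested_loop` array.

```python
from typing import Any, Dict

# Hypothetical mapping from MySQL access_type to a PostgreSQL-style node name.
ACCESS_TYPE_TO_NODE = {
    "ALL": "Seq Scan",
    "ref": "Index Scan",
    "eq_ref": "Index Scan",
    "range": "Index Scan",
    "const": "Index Scan",
}


def table_to_node(table: Dict[str, Any]) -> Dict[str, Any]:
    """Map one MySQL "table" object to a Node Type style leaf."""
    node = {
        "Node Type": ACCESS_TYPE_TO_NODE.get(table.get("access_type"), "Unknown Access"),
        "Relation Name": table.get("table_name"),
    }
    if "key" in table:
        node["Index Name"] = table["key"]
    return node


def flatten_query_block(explain: Dict[str, Any]) -> Dict[str, Any]:
    """Convert a MySQL EXPLAIN FORMAT=JSON document into a Node Type/Plans tree."""
    block = explain["query_block"]
    if "nested_loop" in block:
        return {
            "Node Type": "Nested Loop",
            "Plans": [table_to_node(item["table"]) for item in block["nested_loop"]],
        }
    if "table" in block:
        return table_to_node(block["table"])
    raise ValueError("unsupported query_block shape")


mysql_plan = {"query_block": {"select_id": 1, "nested_loop": [
    {"table": {"table_name": "c", "access_type": "ALL"}},
    {"table": {"table_name": "o", "access_type": "ref", "key": "idx_customer"}},
]}}
flattened = flatten_query_block(mysql_plan)
```

Sorting, grouping, and subquery wrappers in real MySQL output need their own rules, which is why the sketch raises on shapes it does not recognize instead of guessing.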

Related

Core Architecture & Baselining Fundamentals — the reference architecture this stage belongs to
Defining Regression Thresholds for Query Plans — source the numeric gate bands
Normalizing Query Plans for Cross-Engine Comparison — produces the unified tree this hasher consumes
Validating Schema Changes Against Baseline Metadata — where the plan_hash divergence signal gates DDL
