Runbook
Validating Schema Changes Against Baseline Metadata
Validating Schema Changes Against Baseline Metadata is a deterministic, CI-driven control plane requirement for any production database operating under strict SLOs. When DDL executes, the optimizer’s cost model recalculates cardinality estimates, access paths, and join orders. Without automated validation, silent regressions manifest as sudden latency spikes, elevated CPU/IO wait states, and plan cache thrashing. This runbook establishes a repeatable, threshold-gated workflow to intercept optimizer divergence before DDL reaches production. It operates as a core component of the Schema Validation for Baseline Metadata cluster and integrates directly with broader Automated EXPLAIN Capture & Storage Workflows to ensure plan stability across schema evolution.
Symptom Identification & Root Cause Analysis
Production incidents triggered by unvalidated DDL typically surface through three observable telemetry patterns. SREs must correlate these signals with deployment timestamps to isolate regression vectors.
- Plan Hash Divergence: The
plan_hashorquery_idfor a critical workload shifts post-migration. This indicates the optimizer abandoned a previously stable execution strategy, often due to altered index visibility, constraint drops, or partition boundary changes. - Cardinality Estimation Drift: Row estimates deviate by >30% from historical baselines. This forces nested loops to replace hash joins, triggers unnecessary materializations, or causes spill-to-disk operations under memory pressure.
- Cost Model Inflation: The estimated execution cost multiplies beyond acceptable bounds. Correlates strongly with missing statistics, altered column types, or implicit type casting introduced by schema modifications.
Root cause isolation follows a strict diagnostic hierarchy:
- Layer 1: Verify DDL impact on indexed columns, partition schemes, and foreign key constraints.
- Layer 2: Confirm
ANALYZE/UPDATE STATISTICSexecution post-migration. Stale histograms invalidate selectivity calculations. - Layer 3: Compare pre- and post-apply execution trees. If nullability, data types, or index visibility changed, the optimizer’s cost functions will shift, invalidating previously pinned or baseline-captured plans.
Deterministic Regression Thresholds
CI gating must enforce hard thresholds. Soft warnings are insufficient for SLO-bound systems. The following metrics define automatic pipeline failure:
| Metric | Threshold | Action |
|---|---|---|
plan_hash mismatch | != baseline | Block merge; trigger manual review |
| Cardinality drift | per node | Block merge; require UPDATE STATISTICS validation |
| Estimated cost delta | > +15% | Block merge; require optimizer hint review |
| Buffer hit ratio drop | < -10% vs baseline | Block merge; flag for index/IO review |
| Execution time (ANALYZE) | > +20% p95 | Block merge; require query rewrite or schema rollback |
Step-by-Step Validation Pipeline
The automation pipeline executes in four isolated stages. All operations run against a staging replica with identical hardware profiles, configuration parameters, and statistically representative data distributions.
1. Baseline Metadata & Plan Extraction
Capture the current schema state and associated execution plans for the top 50 critical queries. Store metadata as versioned JSON artifacts containing table definitions, index structures, constraint states, and optimizer statistics. Use a deterministic query set tagged via pg_stat_statements or equivalent telemetry.
import asyncpg
import json
from pathlib import Path
BASELINE_DIR = Path("baselines")
BASELINE_DIR.mkdir(exist_ok=True)
async def extract_baseline(dsn: str, query_ids: list[str]) -> dict:
conn = await asyncpg.connect(dsn)
baseline = {"queries": {}, "schema_version": await conn.fetchval("SELECT current_setting('schema_version')")}
for qid in query_ids:
plan_json = await conn.fetchval(f"EXPLAIN (FORMAT JSON) SELECT * FROM pg_stat_statements WHERE queryid = $1", qid)
baseline["queries"][qid] = plan_json[0]
await conn.close()
BASELINE_DIR.joinpath(f"baseline_{baseline['schema_version']}.json").write_text(json.dumps(baseline, indent=2))
return baseline2. Sandbox DDL Application
Apply the proposed migration script to a cloned schema. Ensure work_mem, shared_buffers, and parallelism settings match production. Disable auto-vacuum and background statistics collection during the test window to eliminate noise.
-- Pre-migration isolation
ALTER SYSTEM SET autovacuum = off;
ALTER SYSTEM SET track_counts = off;
SELECT pg_reload_conf();
-- Apply DDL in transaction
BEGIN;
\i /migrations/2024_05_12_add_column_nullable.sql
COMMIT;
-- Post-apply stats refresh (mandatory for cost model accuracy)
ANALYZE VERBOSE;3. EXPLAIN Capture & Comparison
Execute EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON) for each baseline query. Capture actual execution metrics, not just estimates. Reference the official PostgreSQL EXPLAIN documentation for buffer and timing semantics.
async def capture_explain(dsn: str, query_sql: str) -> dict:
conn = await asyncpg.connect(dsn)
plan = await conn.fetchval(f"EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON) {query_sql}")
await conn.close()
return plan[0]4. Diff Evaluation & CI Gating
Compare captured plans against baseline artifacts using structural diffing. Focus on node type changes, cost inflation, and buffer utilization.
from deepdiff import DeepDiff
import sys
def evaluate_regression(baseline_plan: dict, new_plan: dict) -> bool:
# Extract critical nodes
b_root = baseline_plan.get("Plan", {})
n_root = new_plan.get("Plan", {})
# Threshold checks
cost_ratio = n_root.get("Total Cost", 1) / b_root.get("Total Cost", 1)
if cost_ratio > 1.15:
print(f"[FAIL] Cost inflation: {cost_ratio:.2f}x")
return True
if b_root.get("Plan Hash") != n_root.get("Plan Hash"):
print("[FAIL] Plan hash divergence detected")
return True
# Structural diff for node type changes
diff = DeepDiff(b_root, n_root, ignore_order=True, exclude_paths=["root['Execution Time']", "root['Planning Time']"])
if diff.get("values_changed") or diff.get("iterable_item_added"):
print(f"[FAIL] Structural regression: {diff}")
return True
return FalseDebugging Scenarios & Safe Fallback Chains
When validation fails, execute the following mitigation paths in strict order. Do not bypass CI gates without documented SRE approval.
| Scenario | Diagnostic Signal | Mitigation Path |
|---|---|---|
| Index Invalidation | Seq Scan replaces Index Scan post-DDL | Rebuild index concurrently; verify pg_index.indisvalid = true; re-run ANALYZE |
| Type Coercion Overhead | Filter node with implicit cast appears | Add explicit cast in query; adjust column type with USING clause; update query templates |
| Statistics Staleness | Cardinality drift >30% despite stable schema | Run ANALYZE with increased default_statistics_target; verify sample coverage; trigger manual stats refresh |
| Join Order Flip | Nested loop replaces hash join; work_mem spill detected | Increase work_mem temporarily; add SET enable_nestloop = off for validation; evaluate partition pruning |
Fallback Chain Protocol:
- Immediate Rollback:
REVERT DDLvia migration framework. Preserve schema version lock. - Plan Pinning: If rollback is blocked by business requirements, inject optimizer hints (
/*+ LEADING(...) */or equivalent) to force baseline plan selection. - Statistical Correction: Execute targeted
ANALYZEwithdefault_statistics_targetscaled to 2x baseline. Re-validate. - Escalation: If regression persists >15% after steps 1-3, escalate to query optimization engineers with full
EXPLAIN (ANALYZE, BUFFERS)artifacts.
Observability Integration & Telemetry Routing
Validation pipelines must emit structured telemetry to enable post-deployment correlation. Route metrics through your existing observability stack using OpenTelemetry semantic conventions for database spans.
Required Metrics:
schema_validation.plan_hash_mismatch(Counter)schema_validation.cardinality_drift_pct(Gauge)schema_validation.cost_inflation_ratio(Gauge)schema_validation.validation_duration_ms(Histogram)
CI/CD Integration:
# .gitlab-ci.yml / GitHub Actions equivalent
validate_schema:
stage: test
services:
- postgres:15
script:
- python -m schema_validator --dsn $STAGING_DSN --baseline-dir ./baselines --threshold-config ./thresholds.json
rules:
- if: $CI_PIPELINE_SOURCE == "merge_request_event"
artifacts:
reports:
junit: validation-results.xmlAlerting Rules:
schema_validation.plan_hash_mismatch > 0→ Page SRE on-callschema_validation.cardinality_drift_pct > 30→ Slack warning + Jira ticket auto-creationschema_validation.validation_duration_ms > 300000→ Timeout alert; investigate sandbox resource contention
Operational Runbook Summary
Validating Schema Changes Against Baseline Metadata transforms DDL deployment from a probabilistic risk into a deterministic control plane operation. By enforcing strict regression thresholds, capturing EXPLAIN (ANALYZE, BUFFERS) artifacts, and implementing automated fallback chains, platform teams eliminate silent optimizer regressions. Integrate this pipeline into your CI/CD gate, route telemetry to centralized dashboards, and maintain a versioned baseline repository. When schema evolution is predictable, query performance remains stable, and SLO violations become preventable rather than reactive.