Guide
Monitoring Index Usage Changes for Regression Signals
Monitoring Index Usage Changes for Regression Signals is a discrete pipeline stage engineered to detect structural shifts in index utilization before they cascade into latency degradation or throughput collapse. Operating strictly between plan extraction and cost evaluation, this stage isolates index access pattern telemetry from broader execution metrics. By treating index utilization as a first-class regression signal, platform teams can intercept optimizer missteps during deployment windows rather than reacting to post-incident alerts. The workflow is designed for deterministic parsing, strict boundary enforcement, and automated routing to downstream evaluation engines.
Stage Isolation and Data Contract
Strict stage isolation requires that this component consumes only normalized plan fragments and index access counters. It explicitly rejects raw query text, full execution trees, and runtime latency measurements. Input arrives as a structured payload containing plan_hash, index_name, access_type (e.g., seek, scan, key_lookup), and execution_frequency. The stage outputs a delta report containing usage_shift_score, affected_queries, and routing_decision.
This isolation prevents cross-contamination with adjacent stages such as Tracking Cost Deltas Across Baseline Versions, which operates exclusively on estimated versus actual cost vectors. The data contract enforces strict schema validation at ingestion using Pydantic or equivalent validation layers. Payloads missing mandatory index telemetry fields are rejected at the gateway, and schema violations are logged to a dedicated audit queue with ERROR severity. This boundary guarantees that downstream rule engines receive only structurally sound, index-specific telemetry.
# schema_contract.yaml
index_usage_snapshot:
required_fields:
- plan_hash
- index_name
- access_type
- execution_frequency
- snapshot_timestamp
validation_rules:
access_type: enum[seek, scan, key_lookup, index_scan, table_scan, unused]
execution_frequency: min=0, type=integer
snapshot_timestamp: iso8601, timezone=UTC
rejection_action: route_to_audit_queueDeterministic Extraction and Normalization
Reliable signal detection depends on deterministic extraction logic that bypasses optimizer volatility and plan cache churn. In PostgreSQL and SQL Server environments, index usage telemetry is sourced from engine-specific system views. The automation layer must normalize these views into a canonical schema before comparison. A Python-based extractor implements idempotent snapshotting with explicit transaction boundaries to prevent mid-capture plan cache flushes.
import hashlib
import json
import logging
from contextlib import contextmanager
from dataclasses import dataclass, asdict
from typing import List, Dict, Any
import psycopg2
from psycopg2.extras import RealDictCursor
logger = logging.getLogger(__name__)
@dataclass(frozen=True)
class IndexAccessRecord:
plan_hash: str
index_name: str
access_type: str
execution_frequency: int
snapshot_ts: str
def compute_stable_snapshot_hash(records: List[Dict[str, Any]]) -> str:
"""Deterministic SHA-256 hash over sorted record tuples to detect drift."""
sorted_tuples = sorted(
(r["plan_hash"], r["index_name"], r["access_type"], r["execution_frequency"])
for r in records
)
payload = json.dumps(sorted_tuples, sort_keys=True).encode("utf-8")
return hashlib.sha256(payload).hexdigest()
@contextmanager
def idempotent_extractor(dsn: str):
"""Context manager for transactional, repeatable-read snapshot extraction."""
conn = psycopg2.connect(dsn)
conn.set_session(autocommit=False, isolation_level=psycopg2.extensions.ISOLATION_LEVEL_REPEATABLE_READ)
try:
yield conn
finally:
conn.rollback()
conn.close()
def extract_index_usage(dsn: str, target_tables: List[str]) -> List[IndexAccessRecord]:
with idempotent_extractor(dsn) as conn:
with conn.cursor(cursor_factory=RealDictCursor) as cur:
# PostgreSQL: pg_stat_user_indexes + pg_stat_statements correlation
query = """
SELECT
ps.relname AS table_name,
i.indexrelname AS index_name,
CASE
WHEN idx_scan > 0 AND idx_tup_read > 0 THEN 'seek'
WHEN idx_scan > 0 AND idx_tup_read = 0 THEN 'scan'
ELSE 'unused'
END AS access_type,
idx_scan AS execution_frequency,
CURRENT_TIMESTAMP AT TIME ZONE 'UTC' AS snapshot_ts
FROM pg_stat_user_indexes i
JOIN pg_stat_user_tables ps ON i.relid = ps.relid
WHERE ps.relname = ANY(%s)
ORDER BY ps.relname, i.indexrelname;
"""
cur.execute(query, (target_tables,))
rows = cur.fetchall()
# Apply fallback correlation if plan_hash is missing (see fallback section)
records = []
for row in rows:
records.append(IndexAccessRecord(
plan_hash=compute_stable_snapshot_hash([row]),
index_name=row["index_name"],
access_type=row["access_type"],
execution_frequency=int(row["execution_frequency"]),
snapshot_ts=row["snapshot_ts"].isoformat()
))
return recordsThe extraction routine queries the statistics tables, applies a stable hash to the resulting row set, and correlates each index entry with cached execution plans using dependency graphs rather than transient query fingerprints. For SQL Server deployments, the equivalent telemetry resides in sys.dm_db_index_usage_stats (Microsoft Docs). PostgreSQL operators should reference the official monitoring statistics documentation (PostgreSQL Docs) to align pg_stat_* view semantics with the canonical schema.
Delta Computation and Routing Thresholds
Once snapshots are ingested, the stage computes a usage delta matrix by comparing the current snapshot against the most recent stable baseline. The routing engine evaluates three deterministic conditions:
- Access Type Degradation: A shift from
seektoscanorkey_lookupindicates optimizer misestimation or missing statistics. - Frequency Anomaly: Execution frequency deltas exceeding ±20% relative to baseline suggest query pattern shifts or parameter sniffing.
- Index Abandonment: Zero execution frequency for previously active indexes indicates plan cache eviction or query rewrite.
The usage_shift_score is computed as a weighted composite:
usage_shift_score = (w_type * type_penalty) + (w_freq * freq_delta_pct) + (w_abandon * abandon_flag)
Where w_type=0.5, w_freq=0.3, w_abandon=0.2. The routing engine applies precise thresholds:
usage_shift_score | Routing Decision | Action |
|---|---|---|
< 10.0 | PASS | Forward to cost evaluation baseline |
10.0 – 24.9 | WARN | Queue for Detecting Join Type Shifts in Execution Plans correlation |
BLOCK | Halt deployment, trigger regression ticket, notify SRE on-call |
The routing configuration is declarative and hot-reloadable:
routing_engine:
thresholds:
warn_min: 10.0
block_min: 25.0
weights:
type_penalty: 0.5
freq_delta_pct: 0.3
abandon_flag: 0.2
fallback_on_missing_baseline: defer_to_rolling_windowSafe Fallback Protocols and Observability Hooks
Plan hash rotation, automatic statistics resets, or transient telemetry gaps can break deterministic comparison. The stage implements a three-tier fallback protocol:
- Dependency Graph Correlation: When
plan_hashchanges but index-to-table relationships remain constant, the engine traces foreign key and constraint metadata to maintain continuity across baseline versions. - Rolling Window Baseline: If the primary baseline is missing or corrupted, the stage falls back to a 7-day exponential moving average (EMA) of index access counters. This prevents false-positive regression signals during maintenance windows.
- Graceful Degradation: If telemetry loss exceeds 15% of expected snapshots, the stage transitions to
OBSERVEmode, bypassing routing decisions while emitting high-cardinality diagnostic traces.
Observability is embedded at every boundary. The stage exports OpenTelemetry-compliant metrics following database semantic conventions (OpenTelemetry Spec):
from opentelemetry import metrics
meter = metrics.get_meter("index_regression_monitor")
shift_score_counter = meter.create_histogram(
"index.usage_shift_score", unit="score", description="Composite index regression score"
)
routing_decision_counter = meter.create_counter(
"index.routing_decision", unit="1", description="Count of routing decisions by type"
)
def emit_observability(score: float, decision: str, affected_count: int):
shift_score_counter.record(score)
routing_decision_counter.add(1, {"decision": decision, "affected_queries": affected_count})Structured logs include trace_id, span_id, and baseline_version attributes to enable distributed tracing across the Regression Detection & Rule Engines pipeline. Alerting thresholds are decoupled from routing logic; PagerDuty or equivalent integrations trigger only on sustained BLOCK states across three consecutive pipeline cycles, eliminating noise from transient optimizer recalculations.
By enforcing strict stage boundaries, deterministic extraction, and explicit fallback paths, this pipeline component transforms index telemetry into a reliable, automated regression signal. Platform teams gain deterministic visibility into optimizer behavior, enabling proactive intervention before structural shifts impact production latency or throughput.