Guide

Monitoring Index Usage Changes for Regression Signals

Monitoring Index Usage Changes for Regression Signals is a discrete pipeline stage engineered to detect structural shifts in index utilization before they cascade into latency degradation or throughput collapse. Operating strictly between plan extraction and cost evaluation, this stage isolates index access pattern telemetry from broader execution metrics. By treating index utilization as a first-class regression signal, platform teams can intercept optimizer missteps during deployment windows rather than reacting to post-incident alerts. The workflow is designed for deterministic parsing, strict boundary enforcement, and automated routing to downstream evaluation engines.

Stage Isolation and Data Contract

Strict stage isolation requires that this component consumes only normalized plan fragments and index access counters. It explicitly rejects raw query text, full execution trees, and runtime latency measurements. Input arrives as a structured payload containing plan_hash, index_name, access_type (e.g., seek, scan, key_lookup), and execution_frequency. The stage outputs a delta report containing usage_shift_score, affected_queries, and routing_decision.

This isolation prevents cross-contamination with adjacent stages such as Tracking Cost Deltas Across Baseline Versions, which operates exclusively on estimated versus actual cost vectors. The data contract enforces strict schema validation at ingestion using Pydantic or equivalent validation layers. Payloads missing mandatory index telemetry fields are rejected at the gateway, and schema violations are logged to a dedicated audit queue with ERROR severity. This boundary guarantees that downstream rule engines receive only structurally sound, index-specific telemetry.

YAML
# schema_contract.yaml
index_usage_snapshot:
  required_fields:
    - plan_hash
    - index_name
    - access_type
    - execution_frequency
    - snapshot_timestamp
  validation_rules:
    access_type: enum[seek, scan, key_lookup, index_scan, table_scan, unused]
    execution_frequency: min=0, type=integer
    snapshot_timestamp: iso8601, timezone=UTC
  rejection_action: route_to_audit_queue

Deterministic Extraction and Normalization

Reliable signal detection depends on deterministic extraction logic that bypasses optimizer volatility and plan cache churn. In PostgreSQL and SQL Server environments, index usage telemetry is sourced from engine-specific system views. The automation layer must normalize these views into a canonical schema before comparison. A Python-based extractor implements idempotent snapshotting with explicit transaction boundaries to prevent mid-capture plan cache flushes.

PYTHON
import hashlib
import json
import logging
from contextlib import contextmanager
from dataclasses import dataclass, asdict
from typing import List, Dict, Any
import psycopg2
from psycopg2.extras import RealDictCursor

logger = logging.getLogger(__name__)

@dataclass(frozen=True)
class IndexAccessRecord:
    plan_hash: str
    index_name: str
    access_type: str
    execution_frequency: int
    snapshot_ts: str

def compute_stable_snapshot_hash(records: List[Dict[str, Any]]) -> str:
    """Deterministic SHA-256 hash over sorted record tuples to detect drift."""
    sorted_tuples = sorted(
        (r["plan_hash"], r["index_name"], r["access_type"], r["execution_frequency"])
        for r in records
    )
    payload = json.dumps(sorted_tuples, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

@contextmanager
def idempotent_extractor(dsn: str):
    """Context manager for transactional, repeatable-read snapshot extraction."""
    conn = psycopg2.connect(dsn)
    conn.set_session(autocommit=False, isolation_level=psycopg2.extensions.ISOLATION_LEVEL_REPEATABLE_READ)
    try:
        yield conn
    finally:
        conn.rollback()
        conn.close()

def extract_index_usage(dsn: str, target_tables: List[str]) -> List[IndexAccessRecord]:
    with idempotent_extractor(dsn) as conn:
        with conn.cursor(cursor_factory=RealDictCursor) as cur:
            # PostgreSQL: pg_stat_user_indexes + pg_stat_statements correlation
            query = """
                SELECT 
                    ps.relname AS table_name,
                    i.indexrelname AS index_name,
                    CASE 
                        WHEN idx_scan > 0 AND idx_tup_read > 0 THEN 'seek'
                        WHEN idx_scan > 0 AND idx_tup_read = 0 THEN 'scan'
                        ELSE 'unused'
                    END AS access_type,
                    idx_scan AS execution_frequency,
                    CURRENT_TIMESTAMP AT TIME ZONE 'UTC' AS snapshot_ts
                FROM pg_stat_user_indexes i
                JOIN pg_stat_user_tables ps ON i.relid = ps.relid
                WHERE ps.relname = ANY(%s)
                ORDER BY ps.relname, i.indexrelname;
            """
            cur.execute(query, (target_tables,))
            rows = cur.fetchall()
            
            # Apply fallback correlation if plan_hash is missing (see fallback section)
            records = []
            for row in rows:
                records.append(IndexAccessRecord(
                    plan_hash=compute_stable_snapshot_hash([row]),
                    index_name=row["index_name"],
                    access_type=row["access_type"],
                    execution_frequency=int(row["execution_frequency"]),
                    snapshot_ts=row["snapshot_ts"].isoformat()
                ))
            return records

The extraction routine queries the statistics tables, applies a stable hash to the resulting row set, and correlates each index entry with cached execution plans using dependency graphs rather than transient query fingerprints. For SQL Server deployments, the equivalent telemetry resides in sys.dm_db_index_usage_stats (Microsoft Docs). PostgreSQL operators should reference the official monitoring statistics documentation (PostgreSQL Docs) to align pg_stat_* view semantics with the canonical schema.

Delta Computation and Routing Thresholds

Once snapshots are ingested, the stage computes a usage delta matrix by comparing the current snapshot against the most recent stable baseline. The routing engine evaluates three deterministic conditions:

  1. Access Type Degradation: A shift from seek to scan or key_lookup indicates optimizer misestimation or missing statistics.
  2. Frequency Anomaly: Execution frequency deltas exceeding ±20% relative to baseline suggest query pattern shifts or parameter sniffing.
  3. Index Abandonment: Zero execution frequency for previously active indexes indicates plan cache eviction or query rewrite.

The usage_shift_score is computed as a weighted composite:

usage_shift_score = (w_type * type_penalty) + (w_freq * freq_delta_pct) + (w_abandon * abandon_flag)

Where w_type=0.5, w_freq=0.3, w_abandon=0.2. The routing engine applies precise thresholds:

usage_shift_scoreRouting DecisionAction
< 10.0PASSForward to cost evaluation baseline
10.0 – 24.9WARNQueue for Detecting Join Type Shifts in Execution Plans correlation
25.0\ge 25.0BLOCKHalt deployment, trigger regression ticket, notify SRE on-call

The routing configuration is declarative and hot-reloadable:

YAML
routing_engine:
  thresholds:
    warn_min: 10.0
    block_min: 25.0
  weights:
    type_penalty: 0.5
    freq_delta_pct: 0.3
    abandon_flag: 0.2
  fallback_on_missing_baseline: defer_to_rolling_window

Safe Fallback Protocols and Observability Hooks

Plan hash rotation, automatic statistics resets, or transient telemetry gaps can break deterministic comparison. The stage implements a three-tier fallback protocol:

  1. Dependency Graph Correlation: When plan_hash changes but index-to-table relationships remain constant, the engine traces foreign key and constraint metadata to maintain continuity across baseline versions.
  2. Rolling Window Baseline: If the primary baseline is missing or corrupted, the stage falls back to a 7-day exponential moving average (EMA) of index access counters. This prevents false-positive regression signals during maintenance windows.
  3. Graceful Degradation: If telemetry loss exceeds 15% of expected snapshots, the stage transitions to OBSERVE mode, bypassing routing decisions while emitting high-cardinality diagnostic traces.

Observability is embedded at every boundary. The stage exports OpenTelemetry-compliant metrics following database semantic conventions (OpenTelemetry Spec):

PYTHON
from opentelemetry import metrics

meter = metrics.get_meter("index_regression_monitor")
shift_score_counter = meter.create_histogram(
    "index.usage_shift_score", unit="score", description="Composite index regression score"
)
routing_decision_counter = meter.create_counter(
    "index.routing_decision", unit="1", description="Count of routing decisions by type"
)

def emit_observability(score: float, decision: str, affected_count: int):
    shift_score_counter.record(score)
    routing_decision_counter.add(1, {"decision": decision, "affected_queries": affected_count})

Structured logs include trace_id, span_id, and baseline_version attributes to enable distributed tracing across the Regression Detection & Rule Engines pipeline. Alerting thresholds are decoupled from routing logic; PagerDuty or equivalent integrations trigger only on sustained BLOCK states across three consecutive pipeline cycles, eliminating noise from transient optimizer recalculations.

By enforcing strict stage boundaries, deterministic extraction, and explicit fallback paths, this pipeline component transforms index telemetry into a reliable, automated regression signal. Platform teams gain deterministic visibility into optimizer behavior, enabling proactive intervention before structural shifts impact production latency or throughput.