Monitoring Index Usage Changes for Regression Signals

This stage owns exactly one responsibility inside the regression pipeline: comparing index access telemetry from a candidate deployment against its anchored baseline and emitting a deterministic index-usage-shift signal — a seek→scan degradation, a frequency swing, or an outright index abandonment — before that shift cascades into latency degradation or throughput collapse.

Operating strictly between plan capture and cost evaluation, it isolates index access-pattern telemetry from broader execution metrics. By treating index utilization as a first-class regression signal, platform teams intercept optimizer missteps during deployment windows rather than reacting to post-incident alerts. It is a component of the Regression Detection & Rule Engines subsystem, and it fires the earliest structural warning that the Index Sync stage — or a fresh statistics refresh — has quietly changed which access paths the optimizer prefers. This page defines the stage’s input and output contracts, a runnable async implementation, the exact numeric thresholds it enforces, and the failure modes you will actually page on.

Architectural boundaries

Strict isolation is what makes an index-usage signal trustworthy. This stage consumes normalized plan fragments and index access counters and emits a single structured delta report to the rule engine. It sits after normalization and before cost evaluation, and it holds no state between payloads beyond the baseline it reads.

Upstream (consumes): a structured index-usage snapshot carrying plan_hash, index_name, access_type (one of seek, scan, key_lookup, index_scan, table_scan, unused), execution_frequency, and snapshot_timestamp. The plan_hash is the same canonical fingerprint produced by the SHA-256 plan hashing approach, and the counters are collected by the Automated EXPLAIN Capture & Storage Workflows pipeline. The stage explicitly rejects raw query text, full execution trees, and runtime latency measurements — those cross adjacent boundaries and would couple this component to signals it must not evaluate.

Downstream (emits): a delta report containing usage_shift_score, affected_queries, and routing_decision (PASS, WARN, or BLOCK). That report is published to the rule engine, which correlates it with the cost signal from Tracking Cost Deltas Across Baseline Versions and the access-path signal from Detecting Join Type Shifts in Execution Plans before any final verdict is reached. This stage never blocks a deploy, opens a ticket, or rewrites a query on its own.

This isolation is load-bearing. Because the stage evaluates only index-access topology, it is idempotent, safe to run in parallel across thousands of query fingerprints, and engine-agnostic. Structural detection and policy enforcement stay decoupled: this node answers “did index utilization shift, and by how much?” and nothing else.

Deterministic routing and schema enforcement

Every snapshot is validated against a strict field contract before it is admitted. The canonical schema pins the accepted access_type values, forbids negative frequencies, and requires a UTC-normalized timestamp so baseline windows align across replicas:

JSON

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "IndexUsageSnapshot",
  "type": "object",
  "additionalProperties": false,
  "required": ["plan_hash", "index_name", "access_type", "execution_frequency", "snapshot_timestamp"],
  "properties": {
    "plan_hash": { "type": "string", "pattern": "^[0-9a-f]{64}$" },
    "index_name": { "type": "string", "minLength": 1 },
    "access_type": { "enum": ["seek", "scan", "key_lookup", "index_scan", "table_scan", "unused"] },
    "execution_frequency": { "type": "integer", "minimum": 0 },
    "snapshot_timestamp": { "type": "string", "format": "date-time" }
  }
}
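A minimal sketch of how a gateway might enforce this contract using only the Python standard library. The function name `validate_snapshot` and its error strings are illustrative and not part of the pipeline; a production gateway would more likely run a JSON Schema validator against the schema above. The zero-offset check is an assumption drawn from the UTC-normalization requirement, since `format: date-time` alone does not enforce it.

```python
import re
from datetime import datetime, timedelta

ACCESS_TYPES = {"seek", "scan", "key_lookup", "index_scan", "table_scan", "unused"}
REQUIRED = {"plan_hash", "index_name", "access_type",
            "execution_frequency", "snapshot_timestamp"}
PLAN_HASH_RE = re.compile(r"[0-9a-f]{64}")


def validate_snapshot(payload: dict) -> list[str]:
    """Return contract violations; an empty list means the payload is admitted."""
    errors = []
    missing = REQUIRED - set(payload)
    extra = set(payload) - REQUIRED
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
    if extra:  # mirrors additionalProperties: false
        errors.append(f"unexpected fields: {sorted(extra)}")
    if missing:
        return errors  # field-level checks need every required key

    plan_hash = payload["plan_hash"]
    if not isinstance(plan_hash, str) or not PLAN_HASH_RE.fullmatch(plan_hash):
        errors.append("plan_hash must be 64 lowercase hex characters")
    name = payload["index_name"]
    if not isinstance(name, str) or not name:
        errors.append("index_name must be a non-empty string")
    access = payload["access_type"]
    if not isinstance(access, str) or access not in ACCESS_TYPES:
        errors.append("access_type is not in the accepted enum")
    freq = payload["execution_frequency"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(freq, bool) or not isinstance(freq, int) or freq < 0:
        errors.append("execution_frequency must be a non-negative integer")

    ts = payload["snapshot_timestamp"]
    try:
        # Older Python versions do not accept a trailing "Z" in fromisoformat.
        parsed = datetime.fromisoformat(ts.replace("Z", "+00:00")) if isinstance(ts, str) else None
    except ValueError:
        parsed = None
    if parsed is None or parsed.utcoffset() != timedelta(0):
        errors.append("snapshot_timestamp must be an ISO 8601 date-time in UTC")
    return errors
```

Returning a list of violations rather than raising on the first one lets the audit-queue entry record everything that was wrong with a rejected payload.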

Payloads missing mandatory index telemetry fields are rejected at the gateway and routed to a dedicated audit queue with ERROR severity — never a silent fallback that would poison baseline history. Routing of the resulting delta report is formula-driven, not ad hoc. The comparison partition is derived from the fingerprint itself so writes distribute uniformly and idempotently:

Partition key: partition = int(plan_hash[:4], 16) % INDEX_SIGNAL_SHARD_COUNT — the first 16 bits of the digest fan out across a fixed ring (default 256), independent of table or index name.
Baseline key: baseline = f"{plan_hash}:{index_name}" — each index carries its own access-pattern history, so a table with a new covering index opens a new baseline rather than colliding with the old one.
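The two key formulas above can be sketched directly in Python. The helper names `partition_for` and `baseline_key` are illustrative; the expressions inside them are the ones quoted in the text.

```python
INDEX_SIGNAL_SHARD_COUNT = 256  # default ring size


def partition_for(plan_hash: str, shard_count: int = INDEX_SIGNAL_SHARD_COUNT) -> int:
    """Map the first 16 bits (4 hex characters) of the digest onto the shard ring."""
    return int(plan_hash[:4], 16) % shard_count


def baseline_key(plan_hash: str, index_name: str) -> str:
    """Composite key: one access-pattern history per (fingerprint, index) pair."""
    return f"{plan_hash}:{index_name}"


# Synthetic 64-character digest, used only to make the arithmetic easy to follow.
example_hash = "01ff" + "0" * 60
shard = partition_for(example_hash)            # 0x01ff = 511, and 511 % 256 = 255
key = baseline_key(example_hash, "idx_orders_customer")
```

Because the partition depends only on the digest, replaying the same snapshot always lands on the same shard, which is what makes the write idempotent.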

Once snapshots are ingested, the stage computes a usage delta matrix against the most recent stable baseline and evaluates three deterministic conditions:

Access-type degradation. A shift from seek to scan or unused indicates optimizer misestimation or stale statistics.
Frequency anomaly. Execution-frequency deltas exceeding ±20% relative to baseline suggest query-pattern shifts or parameter sniffing.
Index abandonment. Zero execution frequency for a previously active index indicates plan-cache eviction or a query rewrite.
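As a rough illustration, the three conditions can be written as one pure function over a baseline and a current observation. The function name, the condition labels, and the choice to skip the frequency checks when the baseline frequency is zero are assumptions made for this sketch.

```python
FREQ_BAND_PCT = 20.0                     # ± band from the frequency-anomaly rule
DEGRADED_TARGETS = {"scan", "unused"}    # downgrade targets named in the text


def detect_conditions(
    prior_type: str,
    prior_freq: int,
    current_type: str,
    current_freq: int,
    band_pct: float = FREQ_BAND_PCT,
) -> set[str]:
    """Return the names of the regression conditions that hold for one index."""
    found: set[str] = set()
    if prior_type == "seek" and current_type in DEGRADED_TARGETS:
        found.add("access_type_degradation")
    if prior_freq > 0:
        # Multiply before dividing to avoid float noise right at the band edge.
        change_pct = abs(current_freq - prior_freq) * 100.0 / prior_freq
        if change_pct > band_pct:
            found.add("frequency_anomaly")
        if current_freq == 0:
            found.add("index_abandonment")
    return found
```

An index that drops from 1000 executions to 0 while moving from seek to unused trips all three conditions at once, which is why the composite score treats them as additive penalties.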

The usage_shift_score is a weighted composite of those three penalties:

usage_shift_score = (0.5 * type_penalty) + (0.3 * freq_delta_pct) + (0.2 * abandon_flag)

where type_penalty scales the severity of an access-type downgrade (0 for none, up to 100 for seek→unused), freq_delta_pct is the absolute percentage frequency change, and abandon_flag is a fixed penalty (100 in the reference implementation, 0 otherwise) applied when a previously active index goes cold. The type and abandonment terms are bounded at 100, while freq_delta_pct can exceed 100 when frequency more than doubles. The weights and thresholds are declarative and hot-reloadable, so operators can retune sensitivity without a redeploy; this is the same discipline documented in Tuning Thresholds for False-Positive Reduction.
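A short worked example of the formula, using the default weights. The helper name is illustrative, and the seek→scan penalty of 40 is taken from the TYPE_PENALTY table in the reference implementation.

```python
# Default weights from the formula above.
W_TYPE, W_FREQ, W_ABANDON = 0.5, 0.3, 0.2


def usage_shift_score(type_penalty: float, freq_delta_pct: float, abandon_flag: float) -> float:
    """Weighted composite of the three penalties, rounded like the delta report."""
    raw = (W_TYPE * type_penalty) + (W_FREQ * freq_delta_pct) + (W_ABANDON * abandon_flag)
    return round(raw, 2)


# seek -> scan (penalty 40) with a 10% frequency change, index still active:
downgrade = usage_shift_score(40.0, 10.0, 0.0)       # 0.5*40 + 0.3*10 = 23.0

# No access-type change, 40% frequency swing:
swing = usage_shift_score(0.0, 40.0, 0.0)            # 0.3*40 = 12.0

# seek -> unused, frequency drops to zero, abandonment penalty applies:
abandoned = usage_shift_score(100.0, 100.0, 100.0)   # 50 + 30 + 20 = 100.0
```

Under these defaults a frequency change on its own has to exceed roughly 33% before it reaches a score of 10.0, because 0.3 × 33.4 ≈ 10; a swing right at the ±20% anomaly band contributes only 6 points.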

Production-ready implementation

The following implementation is a stateless async extractor and scorer built on asyncpg, structlog, and OpenTelemetry. It snapshots index usage under a repeatable read transaction so a mid-capture statistics reset or plan-cache flush cannot tear the read, normalizes the engine views into the canonical schema, computes the delta against a supplied baseline, and emits a routed delta report. It performs no threshold math outside the declared weights and never mutates the baseline store.

PYTHON

import asyncio
import hashlib
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

import asyncpg
import structlog
from opentelemetry import metrics, trace

log = structlog.get_logger("index_regression_monitor")
tracer = trace.get_tracer("index_regression_monitor")
meter = metrics.get_meter("index_regression_monitor")

shift_score_histogram = meter.create_histogram(
    "index.usage_shift_score", unit="score",
    description="Composite index regression score per fingerprint/index",
)
routing_decision_counter = meter.create_counter(
    "index.routing_decision", unit="1",
    description="Count of routing decisions by verdict",
)

TYPE_PENALTY = {
    ("seek", "scan"): 40.0,
    ("seek", "unused"): 100.0,
    ("index_scan", "table_scan"): 80.0,
    ("key_lookup", "scan"): 30.0,
}
W_TYPE, W_FREQ, W_ABANDON = 0.5, 0.3, 0.2
WARN_MIN, BLOCK_MIN = 10.0, 25.0

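# Correlate pg_stat_user_indexes with pg_stat_user_tables. idx_scan counts
# scans started on the index; idx_tup_fetch counts live heap rows fetched by
# simple index scans, so it stays 0 for bitmap scans and for scans that match
# no rows. The seek/scan split below is a coarse heuristic, not an exact
# classification of the access path.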
SNAPSHOT_SQL = """
    SELECT ps.relname AS table_name,
           i.indexrelname AS index_name,
           CASE
               WHEN i.idx_scan > 0 AND i.idx_tup_fetch > 0 THEN 'seek'
               WHEN i.idx_scan > 0 AND i.idx_tup_fetch = 0 THEN 'scan'
               ELSE 'unused'
           END AS access_type,
           i.idx_scan AS execution_frequency
      FROM pg_stat_user_indexes i
      JOIN pg_stat_user_tables ps ON i.relid = ps.relid
     WHERE ps.relname = ANY($1::text[])
     ORDER BY ps.relname, i.indexrelname;
"""


@dataclass(frozen=True)
class IndexAccessRecord:
    # Field names match the IndexUsageSnapshot schema one-to-one.
    plan_hash: str
    index_name: str
    access_type: str
    execution_frequency: int
    snapshot_timestamp: str


@dataclass(frozen=True)
class DeltaReport:
    usage_shift_score: float
    affected_queries: int
    routing_decision: str


def _record_hash(table: str, index: str, access_type: str) -> str:
    # Stand-in fingerprint for this example: a SHA-256 over the
    # (table, index, access_type) triple. In the full pipeline the plan_hash
    # arrives from the upstream plan-hashing stage.
    payload = json.dumps(
        {"table": table, "index": index, "access_type": access_type},
        sort_keys=True,
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


async def extract_index_usage(
    pool: asyncpg.Pool, target_tables: list[str]
) -> list[IndexAccessRecord]:
    """Idempotent, transactional snapshot of index-access telemetry."""
    now = datetime.now(timezone.utc).isoformat()
    async with pool.acquire() as conn:
        async with conn.transaction(isolation="repeatable_read", readonly=True):
            rows = await conn.fetch(SNAPSHOT_SQL, target_tables)
    records = [
        IndexAccessRecord(
            plan_hash=_record_hash(r["table_name"], r["index_name"], r["access_type"]),
            index_name=r["index_name"],
            access_type=r["access_type"],
            execution_frequency=int(r["execution_frequency"]),
            snapshot_timestamp=now,
        )
        for r in rows
    ]
    log.info("snapshot.extracted", tables=len(target_tables), records=len(records))
    return records


def _freq_delta_pct(current: int, baseline: int) -> float:
    if baseline == 0:
        return 100.0 if current > 0 else 0.0
    return abs(current - baseline) / baseline * 100.0


def score_delta(
    current: list[IndexAccessRecord],
    baseline: dict[str, IndexAccessRecord],
) -> DeltaReport:
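    """Pure, side-effect-free scoring of a snapshot against its baseline.

    For brevity the baseline here is keyed by index_name alone; the baseline
    store itself uses the composite f"{plan_hash}:{index_name}" key.
    """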
    max_score, affected = 0.0, 0
    for rec in current:
        prior = baseline.get(rec.index_name)
        if prior is None:
            continue  # new index — no baseline to regress against
        type_penalty = TYPE_PENALTY.get((prior.access_type, rec.access_type), 0.0)
        freq_pct = _freq_delta_pct(rec.execution_frequency, prior.execution_frequency)
        abandon = 100.0 if (prior.execution_frequency > 0 and rec.execution_frequency == 0) else 0.0
        score = (W_TYPE * type_penalty) + (W_FREQ * freq_pct) + (W_ABANDON * abandon)
        if score >= WARN_MIN:
            affected += 1
        max_score = max(max_score, score)

    if max_score >= BLOCK_MIN:
        decision = "BLOCK"
    elif max_score >= WARN_MIN:
        decision = "WARN"
    else:
        decision = "PASS"
    return DeltaReport(round(max_score, 2), affected, decision)


async def evaluate(
    pool: asyncpg.Pool,
    target_tables: list[str],
    baseline: dict[str, IndexAccessRecord],
) -> DeltaReport:
    with tracer.start_as_current_span("index_usage.evaluate") as span:
        current = await extract_index_usage(pool, target_tables)
        report = score_delta(current, baseline)
        shift_score_histogram.record(report.usage_shift_score)
        routing_decision_counter.add(1, {"decision": report.routing_decision})
        span.set_attribute("index.usage_shift_score", report.usage_shift_score)
        span.set_attribute("index.routing_decision", report.routing_decision)
        span.set_attribute("index.affected_queries", report.affected_queries)
        log.info("delta.scored", **asdict(report))
        return report

For SQL Server deployments the same contract is populated from sys.dm_db_index_usage_stats — user_seeks, user_scans, and user_lookups map onto the access_type enum — while PostgreSQL operators should align pg_stat_* semantics with the canonical schema using the monitoring statistics documentation. The extractor is deliberately engine-agnostic downstream of the SQL: once records reach score_delta, no dialect-specific logic remains.
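To make the SQL Server side concrete, here is a hedged sketch of one way the three counters could be folded into a single access_type. The columns user_seeks, user_scans, and user_lookups exist in sys.dm_db_index_usage_stats, but the dominant-counter rule, the tie-break order, and the function name are assumptions for illustration; the exact mapping is not specified here.

```python
def classify_sqlserver_usage(
    user_seeks: int, user_scans: int, user_lookups: int
) -> tuple[str, int]:
    """Collapse SQL Server usage counters into (access_type, execution_frequency).

    Assumed rule: the largest counter decides the access_type, and the sum of
    all three becomes the frequency. Ties resolve in the order seek, scan,
    key_lookup because max() keeps the first maximum it encounters.
    """
    total = user_seeks + user_scans + user_lookups
    if total == 0:
        return "unused", 0
    counts = {"seek": user_seeks, "scan": user_scans, "key_lookup": user_lookups}
    access_type = max(counts, key=counts.get)
    return access_type, total
```

Whatever rule a deployment settles on, it should be applied identically to baseline and candidate snapshots so that a change in classification reflects a change in usage rather than a change in the mapping.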

Threshold reference

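Routing is governed by exact numeric bands, not adjectives. The score is the maximum per-index composite across the evaluated fingerprint, so a single degraded index is enough to trip the gate. The Action column lists what the rule engine does once it has correlated the verdict with the other signals; this stage itself only emits the routing_decision.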

`usage_shift_score`	`routing_decision`	Action
`< 10.0`	`PASS`	Forward to cost evaluation baseline; no operator action
`10.0 – 24.99`	`WARN`	Queue for join-shift and cost-delta correlation; annotate the merge request
`≥ 25.0`	`BLOCK`	Halt deployment, open a regression ticket, notify the on-call SRE

Alerting is decoupled from routing so transient optimizer recalculations do not page anyone. A pager fires only on a sustained BLOCK across three consecutive pipeline cycles:

YAML

# alert-rules.yaml
groups:
  - name: index-usage-regression
    rules:
      - alert: IndexUsageRegressionBlock
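        # The decision metric is a monotonic counter, so compare its growth over
        # the window rather than its raw value. Assumes one pipeline cycle per
        # minute and the exporter's conventional _total suffix for counters.
        expr: increase(index_routing_decision_total{decision="BLOCK"}[3m]) > 2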
        for: 3m
        labels: { severity: page, stage: index-usage-monitor }
        annotations:
          summary: "Sustained index-usage BLOCK across 3 pipeline cycles"
          runbook: "https://queryplan.org/regression-detection-rule-engines/monitoring-index-usage-changes-for-regression-signals/"
      - alert: IndexUsageShiftScoreP95
        # 10.0 is the lower edge of the WARN band; aggregate buckets by le first.
        expr: histogram_quantile(0.95, sum by (le) (rate(index_usage_shift_score_bucket[10m]))) >= 10.0
        for: 10m
        labels: { severity: ticket, stage: index-usage-monitor }
        annotations:
          summary: "p95 index usage_shift_score entered WARN band"
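The three-cycle dwell can also be tracked in-process, which is easier to reason about than a time window when cycle length varies. The sketch below is one possible shape for that logic; the class name `BlockDwell` is hypothetical and not part of the stage.

```python
class BlockDwell:
    """Track consecutive BLOCK verdicts and report when the dwell is reached."""

    def __init__(self, required_cycles: int = 3) -> None:
        if required_cycles < 1:
            raise ValueError("required_cycles must be at least 1")
        self.required_cycles = required_cycles
        self.streak = 0

    def observe(self, routing_decision: str) -> bool:
        """Record one cycle's verdict; return True when a page should fire."""
        # Any non-BLOCK verdict breaks the streak, so transient spikes never page.
        self.streak = self.streak + 1 if routing_decision == "BLOCK" else 0
        return self.streak >= self.required_cycles


dwell = BlockDwell(required_cycles=3)
verdicts = ["BLOCK", "BLOCK", "WARN", "BLOCK", "BLOCK", "BLOCK"]
pages = [dwell.observe(v) for v in verdicts]
# Only the final cycle completes three BLOCKs in a row.
```

Resetting the streak on any non-BLOCK verdict is what separates a sustained regression from a transient optimizer recalculation.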

The stage also holds its own operational SLOs: p95 evaluation latency < 150 ms per fingerprint, snapshot extraction error rate < 0.1%, and baseline-lookup miss rate < 2% (a higher miss rate signals baseline-store drift, not a real regression).

Failure scenarios and root cause analysis

Deterministic comparison breaks in predictable ways. Each mode below lists its symptom, a diagnostic command, and the mitigation the stage applies.

1. Plan-hash rotation with stable topology. Symptom: a wave of no_baseline misses immediately after a deploy, even though the underlying indexes are unchanged. Diagnostic: correlate index-to-table relationships across the rotation —

SQL

SELECT i.indexrelname, ps.relname, i.idx_scan
  FROM pg_stat_user_indexes i
  JOIN pg_stat_user_tables ps ON i.relid = ps.relid
 WHERE ps.relname = ANY($1) ORDER BY i.idx_scan DESC;

Mitigation — dependency-graph correlation: when plan_hash changes but foreign-key and constraint metadata are constant, the engine traces those relationships to maintain baseline continuity across versions rather than emitting false regressions.

2. Statistics reset zeroing counters. Symptom: every index reports unused with execution_frequency = 0 simultaneously. Diagnostic: check when the stats were last reset with SELECT stats_reset FROM pg_stat_database WHERE datname = current_database();. Mitigation: a global zeroing is treated as a reset event, not an abandonment; the abandonment penalty is suppressed for one cycle while the rolling window rebuilds.

3. Missing or corrupted primary baseline. Symptom: baseline-lookup miss rate exceeds 2% and usage_shift_score swings wildly. Diagnostic: verify the baseline key f"{plan_hash}:{index_name}" resolves in the baseline store. Mitigation — rolling-window fallback: the stage falls back to a 7-day exponential moving average of index access counters, preventing false-positive regressions during maintenance windows.

4. Telemetry gap during collection. Symptom: fewer snapshots arrive than the expected fingerprint set. Diagnostic: compare emitted snapshot count against the fingerprint registry per cycle. Mitigation — graceful degradation: when telemetry loss exceeds 15% of expected snapshots, the stage transitions to OBSERVE mode, bypassing routing decisions while emitting high-cardinality diagnostic traces so nothing is blocked on partial data.

5. Parameter-sniffing frequency swings. Symptom: repeated WARN on freq_delta_pct for a hot index with no access-type change. Diagnostic: inspect plan variance for the fingerprint against the defined regression thresholds. Mitigation: frequency-only shifts route to WARN, never BLOCK, and are correlated with cost deltas before any deploy is halted.
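The mitigations for modes 2 through 5 can each be sketched as a small pure guard that runs before or after scoring. All four function names are hypothetical. The smoothing factor alpha = 2 / (span + 1) is a common EMA convention assumed here, not something the stage specifies, and the reset check cannot tell a genuine abandonment from a reset when only one index was previously active.

```python
def is_global_reset(prior_freqs: dict[str, int], current_freqs: dict[str, int]) -> bool:
    """Mode 2: every previously active index reports zero in the same cycle."""
    active = [name for name, freq in prior_freqs.items() if freq > 0]
    return bool(active) and all(current_freqs.get(name, 0) == 0 for name in active)


def ema_baseline(daily_counts: list[float], span_days: int = 7) -> float:
    """Mode 3: exponential moving average over daily counters, oldest first."""
    if not daily_counts:
        raise ValueError("need at least one daily count")
    alpha = 2.0 / (span_days + 1)  # assumed smoothing convention
    ema = float(daily_counts[0])
    for value in daily_counts[1:]:
        ema = alpha * value + (1.0 - alpha) * ema
    return ema


def should_observe(received: int, expected: int, cutoff: float = 0.15) -> bool:
    """Mode 4: switch to OBSERVE when the lost fraction of snapshots exceeds the cutoff."""
    if expected <= 0:
        return False
    loss = (expected - received) / expected
    return loss > cutoff


def cap_frequency_only(decision: str, type_penalty: float, abandon_penalty: float) -> str:
    """Mode 5: a shift driven by frequency alone is downgraded from BLOCK to WARN."""
    if decision == "BLOCK" and type_penalty == 0 and abandon_penalty == 0:
        return "WARN"
    return decision
```

Keeping each guard free of I/O means it can be unit-tested against recorded snapshots and composed in whatever order the rule engine requires.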

Configuration reference

All tuning knobs are environment-driven and hot-reloadable; changing a threshold never requires a redeploy of the stage.

Setting	Env var	Default	Purpose
Warn threshold	`INDEX_WARN_MIN`	`10.0`	Minimum `usage_shift_score` for a `WARN` verdict
Block threshold	`INDEX_BLOCK_MIN`	`25.0`	Minimum `usage_shift_score` for a `BLOCK` verdict
Type weight	`INDEX_W_TYPE`	`0.5`	Weight on access-type degradation penalty
Frequency weight	`INDEX_W_FREQ`	`0.3`	Weight on percentage frequency delta
Abandonment weight	`INDEX_W_ABANDON`	`0.2`	Weight on the index-abandonment flag
Frequency anomaly band	`INDEX_FREQ_BAND_PCT`	`20.0`	± percentage change that counts as an anomaly
Shard ring size	`INDEX_SIGNAL_SHARD_COUNT`	`256`	Partition ring for delta-report routing
Rolling window	`INDEX_ROLLING_WINDOW_DAYS`	`7`	EMA span used when the primary baseline is missing
Degradation cutoff	`INDEX_TELEMETRY_LOSS_CUTOFF`	`0.15`	Snapshot-loss fraction that trips `OBSERVE` mode
Alert dwell	`INDEX_BLOCK_DWELL_CYCLES`	`3`	Consecutive `BLOCK` cycles before paging
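A minimal sketch of how these settings might be read, assuming plain environment variables with the names and defaults from the table. The dataclass, the `load_config` function, and the warn-below-block sanity check are illustrative additions; calling the loader at the start of each cycle is one simple way to pick up changes without a redeploy.

```python
import os
from dataclasses import dataclass
from typing import Callable, Mapping, Optional


@dataclass(frozen=True)
class IndexMonitorConfig:
    warn_min: float
    block_min: float
    w_type: float
    w_freq: float
    w_abandon: float
    freq_band_pct: float
    shard_count: int
    rolling_window_days: int
    telemetry_loss_cutoff: float
    block_dwell_cycles: int


def _get(env: Mapping[str, str], name: str, cast: Callable, default):
    """Read one variable, falling back to the documented default when unset."""
    raw = env.get(name)
    return default if raw is None else cast(raw)


def load_config(env: Optional[Mapping[str, str]] = None) -> IndexMonitorConfig:
    env = os.environ if env is None else env
    cfg = IndexMonitorConfig(
        warn_min=_get(env, "INDEX_WARN_MIN", float, 10.0),
        block_min=_get(env, "INDEX_BLOCK_MIN", float, 25.0),
        w_type=_get(env, "INDEX_W_TYPE", float, 0.5),
        w_freq=_get(env, "INDEX_W_FREQ", float, 0.3),
        w_abandon=_get(env, "INDEX_W_ABANDON", float, 0.2),
        freq_band_pct=_get(env, "INDEX_FREQ_BAND_PCT", float, 20.0),
        shard_count=_get(env, "INDEX_SIGNAL_SHARD_COUNT", int, 256),
        rolling_window_days=_get(env, "INDEX_ROLLING_WINDOW_DAYS", int, 7),
        telemetry_loss_cutoff=_get(env, "INDEX_TELEMETRY_LOSS_CUTOFF", float, 0.15),
        block_dwell_cycles=_get(env, "INDEX_BLOCK_DWELL_CYCLES", int, 3),
    )
    # Sanity check (an assumption of this sketch): the bands must not invert.
    if not cfg.warn_min < cfg.block_min:
        raise ValueError("INDEX_WARN_MIN must be below INDEX_BLOCK_MIN")
    return cfg
```

Passing a mapping instead of reading `os.environ` directly keeps the loader testable and lets a reload be validated before it replaces the active configuration.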

Related

Tracking Cost Deltas Across Baseline Versions: the sibling stage this signal is correlated with before a verdict.
Detecting Join Type Shifts in Execution Plans: access-path structural detection that shares the same rule engine.
Tuning Thresholds for False-Positive Reduction: how to retune the weights and bands above without redeploying.
Plan Hashing Algorithms for SQL Engines: the fingerprint that keys every index baseline.