Using Kafka for Async Query Plan Ingestion at Scale

This runbook documents how to move EXPLAIN capture off the synchronous request path and onto an Apache Kafka topic so that a query plan baseline pipeline can absorb 50,000+ plans per second without exhausting database connection pools or stalling regression alerting.

Synchronous EXPLAIN execution couples capture latency directly to query latency: every plan you record steals a connection and a few milliseconds from the workload you are trying to observe. A decoupled message bus breaks that coupling. Capture agents publish plans and return immediately; a consumer pool drains the topic and commits to the baseline store on its own schedule. This page is the operational counterpart to the async ingestion architecture — it assumes those stage boundaries already hold and focuses on the failure modes, thresholds, and remediation steps specific to running Kafka as the transport at scale. It is one component of the broader Automated EXPLAIN Capture & Storage Workflows reference architecture.

Symptom identification and production thresholds

Baseline drift is silent: when the ingestion path degrades, plans stop arriving at the store and regression detection goes blind until a production query visibly slows down. Wire automated alerts to these exact breach conditions so degradation pages you before it costs you a missed regression.

Consumer group lag > 15,000 messages per partition, sustained for 3 minutes. Indicates backpressure or a deserialization bottleneck on the query-plan-consumers group; at 48 partitions this is roughly 720,000 unprocessed plans.
End-to-end p95 ingestion latency > 1,200 ms (producer publish timestamp to baseline-store commit). Beyond this, the delay between a plan changing and the alert firing violates the regression-alerting SLA.
Dead-letter routing rate > 0.15% of consumed records. A DLQ rate above 15 in 10,000 signals schema drift or malformed JSON/Protobuf payloads rather than isolated corruption.
Broker disk iowait > 65% on brokers hosting query-plans partitions. Sustained I/O wait pushes produce requests past request.timeout.ms and forces retries that amplify load.
Baseline-store row-lock wait > 800 ms during upsert. In PostgreSQL or ClickHouse this points to partition skew or a missing composite index on (query_hash, plan_version).

When any threshold breaches, the pipeline throttles non-critical captures while preserving high-priority regression candidates — the routing logic for that prioritization lives in the async ingestion pipeline that owns baseline persistence.
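
As a rough illustration, the five thresholds above can be captured in one place and evaluated against a metrics snapshot. This is a minimal sketch: the numeric limits come from the list above, while the snapshot field names are hypothetical and the "sustained for N minutes" conditions are left to the alerting layer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IngestionSnapshot:
    """Hypothetical point-in-time view of the ingestion path."""
    max_partition_lag: int      # messages behind, worst partition
    p95_latency_ms: float       # producer publish to baseline-store commit
    dlq_records: int            # records routed to the DLQ in the window
    consumed_records: int       # records consumed in the same window
    broker_iowait_pct: float    # worst broker hosting the topic
    store_lock_wait_ms: float   # row-lock wait during upsert


def breached_thresholds(s: IngestionSnapshot) -> list[str]:
    """Return the name of every threshold the snapshot exceeds."""
    dlq_rate = s.dlq_records / s.consumed_records if s.consumed_records else 0.0
    checks = {
        "consumer_lag": s.max_partition_lag > 15_000,
        "p95_latency": s.p95_latency_ms > 1_200,
        "dlq_rate": dlq_rate > 0.0015,
        "broker_iowait": s.broker_iowait_pct > 65,
        "store_lock_wait": s.store_lock_wait_ms > 800,
    }
    return [name for name, breached in checks.items() if breached]
```

An empty list means no threshold is breached; any non-empty result is the signal to start throttling non-critical captures.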

Map each threshold to a Prometheus rule and propagate an OpenTelemetry trace_id from the database proxy through producer, broker, consumer, and store so you can separate an ingestion bottleneck from a real workload regression:

YAML

# prometheus/alerts/kafka_ingestion.yml
groups:
  - name: kafka_query_plan_ingestion
    rules:
      - alert: HighConsumerLag
        expr: sum by (partition) (kafka_consumer_group_lag{group="query-plan-consumers", topic="query-plans"}) > 15000
        for: 3m
        labels: { severity: warning }
        annotations:
          summary: "Query plan consumer lag exceeds safe threshold"
      - alert: DLQFailureSpike
        expr: sum(rate(kafka_consumer_dlq_total[5m])) / sum(rate(kafka_consumer_records_consumed_total[5m])) > 0.0015
        for: 2m
        labels: { severity: critical }
        annotations:
          summary: "Deserialization failure rate breaching 0.15% SLA"
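
Carrying the trace_id across the broker usually means putting it in the Kafka record headers. The sketch below builds and parses a W3C `traceparent` value by hand to show what travels with each record; the helper names are hypothetical, and in a real deployment the OpenTelemetry propagation API would normally generate and extract this header for you.

```python
import re
import secrets
from typing import Iterable, List, Optional, Tuple

# W3C Trace Context layout: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def make_traceparent(trace_id: Optional[str] = None, sampled: bool = True) -> str:
    """Build a traceparent value, reusing an upstream trace_id when given."""
    trace_id = trace_id or secrets.token_hex(16)   # 32 hex characters
    span_id = secrets.token_hex(8)                 # 16 hex characters
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def to_kafka_headers(traceparent: str) -> List[Tuple[str, bytes]]:
    """Kafka headers are (str, bytes) pairs attached to the record."""
    return [("traceparent", traceparent.encode("ascii"))]


def trace_id_from_headers(headers: Optional[Iterable[Tuple[str, bytes]]]) -> Optional[str]:
    """Consumer side: recover the trace_id, or None if absent or malformed."""
    for name, value in headers or ():
        if name == "traceparent":
            match = _TRACEPARENT.match(value.decode("ascii", errors="replace"))
            return match.group(1) if match else None
    return None
```

With the same trace_id visible at the proxy, producer, consumer, and store, a slow span at the consumer or store points to the ingestion path, while a slow span at the proxy points to the workload itself.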

Root cause analysis

Async ingestion rarely fails at a single component. Triage these four domains in order, each with a diagnostic you can run before touching config.

1. Deserialization overhead and GIL contention

Synchronous json.loads() on multi-megabyte execution plans blocks the event loop; when consumer-pod CPU exceeds 85% while network I/O stays low, unbatched deserialization is the bottleneck. Confirm which consumers are hot and how far behind they are:

BASH

kafka-consumer-groups.sh --bootstrap-server kafka-broker-01:9092 \
  --describe --group query-plan-consumers | sort -k6 -n -r | head
# Sorted by LAG, the sixth column when GROUP is printed first (use -k5 on
# older releases that omit GROUP). Lag concentrated on partitions owned by
# CPU-saturated consumers points to CPU-bound per-record parsing.

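Mitigation: replace the standard-library parser with orjson or msgspec, native-code parsers that are substantially faster than json.loads(). Both still hold the GIL while they build Python objects, so they shorten the blocking window rather than remove it. Deserialize in micro-batches rather than one record at a time, and move parsing into worker processes if a single core stays saturated.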
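
A micro-batch decoder might look like the following. It is a sketch under stated assumptions: `decode_batch` and `decode_off_loop` are hypothetical helper names, and orjson is used only if it is installed, with the standard library as a fallback.

```python
import asyncio
import json

try:  # orjson is optional in this sketch
    import orjson
    _loads = orjson.loads
except ImportError:
    _loads = json.loads


def decode_batch(raw_values):
    """Decode a whole micro-batch; return (decoded, failures) without raising."""
    decoded, failures = [], []
    for index, raw in enumerate(raw_values):
        try:
            decoded.append(_loads(raw))
        except ValueError as exc:  # JSON decode errors subclass ValueError
            failures.append((index, str(exc)))
    return decoded, failures


async def decode_off_loop(raw_values, executor=None):
    """Run the batch decode in an executor so the event loop is not blocked."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, decode_batch, list(raw_values))
```

Passing a `concurrent.futures.ProcessPoolExecutor` as `executor` gives real parallelism across cores; the default thread pool only lets the event loop interleave with parsing, because the parser holds the GIL.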

2. Consumer rebalancing storms

An aggressive session.timeout.ms below 10,000 ms combined with GC pauses triggers needless partition reassignment; every rebalance pauses ingestion for 2–5 seconds and spikes lag. Count how often the group is rebalancing:

BASH

kafka-consumer-groups.sh --bootstrap-server kafka-broker-01:9092 \
  --describe --group query-plan-consumers --state
# Re-run every few seconds. A STATE other than "Stable" on repeated runs, or
# a #MEMBERS count that keeps changing, indicates a rebalance storm rather
# than steady processing. Use --members to see CONSUMER-ID churn.

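Mitigation: set session.timeout.ms=45000, heartbeat.interval.ms=15000, and max.poll.interval.ms=300000, and adopt cooperative-sticky assignment where the client implements it, so scale events revoke only the partitions that actually move. The Java client and librdkafka-based clients offer it; aiokafka takes assignor classes and ships an eager StickyPartitionAssignor instead. The exact parameter semantics are in the Apache Kafka consumer configuration reference.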
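
The relationships between these timeouts are easy to get wrong during tuning, so a small pre-deploy check can help. The sketch below is illustrative: the function name and the `worst_batch_ms` input (your measured worst-case time to process one poll batch) are assumptions, not part of any Kafka client.

```python
def check_consumer_timeouts(
    session_ms: int, heartbeat_ms: int, max_poll_ms: int, worst_batch_ms: int
) -> list[str]:
    """Return human-readable problems with a consumer timeout configuration."""
    problems = []
    if session_ms < 10_000:
        problems.append("session.timeout.ms below 10000 invites rebalances on short pauses")
    # Kafka's guidance: heartbeat no higher than one third of the session timeout.
    if heartbeat_ms * 3 > session_ms:
        problems.append("heartbeat.interval.ms exceeds one third of session.timeout.ms")
    # A consumer that does not poll again within max.poll.interval.ms is evicted.
    if worst_batch_ms >= max_poll_ms:
        problems.append("batch processing can outlast max.poll.interval.ms")
    return problems
```

The recommended values above pass cleanly as long as one micro-batch is processed in well under five minutes.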

3. Schema registry version mismatch

Producers publishing Avro/Protobuf payloads with backward-incompatible changes (for example, adding a required field with no default) raise SchemaRegistryError, and consumers that lack explicit handling drop those messages silently. This is the same envelope contract enforced upstream by schema validation for baseline metadata, so a drift here usually traces to a producer deployed ahead of the registry. Check the registered compatibility level:

BASH

curl -s http://schema-registry:8081/config/query-plans-value | jq .
# Expect {"compatibilityLevel":"BACKWARD_TRANSITIVE"}; anything weaker lets
# an incompatible producer poison the topic.

Mitigation: enforce BACKWARD_TRANSITIVE at the registry and add producer-edge validation middleware that rejects non-conforming payloads before they reach the topic.
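
Producer-edge validation can be as simple as a function called before every publish. The following is a minimal sketch: the envelope fields in `REQUIRED_FIELDS` are hypothetical (only query_hash and plan_version are named elsewhere in this runbook), and a real deployment would validate against the registered schema rather than a hand-written table.

```python
# Hypothetical envelope contract: field name -> expected Python type.
REQUIRED_FIELDS = {"query_hash": str, "plan_version": int, "engine": str, "plan": dict}


class EnvelopeError(ValueError):
    """Raised when a payload does not match the expected envelope."""


def validate_envelope(payload: dict) -> dict:
    """Return the payload unchanged, or raise EnvelopeError listing every problem."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in payload:
            problems.append(f"missing {field}")
        # bool is a subclass of int, so reject it explicitly.
        elif isinstance(payload[field], bool) or not isinstance(payload[field], expected):
            problems.append(f"{field} should be {expected.__name__}")
    if problems:
        raise EnvelopeError("; ".join(problems))
    return payload
```

Rejecting at the producer keeps a bad deploy from poisoning the topic and turns a silent consumer-side drop into a loud producer-side error.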

4. Broker disk saturation and ISR shrinkage

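High log.segment.bytes or aggressive flush intervals build a disk queue. Followers then fall behind and drop out of the in-sync replica set (UnderReplicatedPartitions > 0); acks=all producers wait on every remaining in-sync replica, and once the ISR falls below min.insync.replicas their writes fail with NotEnoughReplicas errors. On the store side, a slow upsert holds row locks. Confirm whether the stall is broker-side or store-side: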

SQL

-- PostgreSQL baseline store: which sessions wait on a lock, and who blocks them?
SELECT pid, pg_blocking_pids(pid) AS blocked_by, wait_event_type, wait_event,
       now() - query_start AS waited, query
FROM pg_stat_activity
WHERE state = 'active' AND wait_event_type = 'Lock'
ORDER BY waited DESC;

Mitigation: watch kafka_server_ReplicaManager_UnderReplicatedPartitions. If flush intervals have been tightened, relax them to log.flush.interval.messages=100000 and log.flush.interval.ms=10000, or remove the overrides so the OS page cache schedules flushes, which is the Kafka default. On the store side, confirm a composite index on (query_hash, plan_version) exists so upserts stop serializing on a sequential scan.

Step-by-step remediation

Apply these steps in order; each is idempotent and safe to run against a live cluster.

Step 1 — Pin the topic for high-throughput durability

PROPERTIES

# topic-config.properties
query-plans.partitions=48
query-plans.replication.factor=3
query-plans.retention.bytes=107374182400
query-plans.retention.ms=604800000
query-plans.compression.type=zstd
query-plans.min.insync.replicas=2
query-plans.cleanup.policy=delete

BASH

kafka-topics.sh --bootstrap-server kafka-broker-01:9092 --describe --topic query-plans
# Expect: PartitionCount: 48, ReplicationFactor: 3, and configs showing
# min.insync.replicas=2 and compression.type=zstd.
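
It helps to translate the raw topic settings into capacity figures before applying them. The arithmetic below uses only values from this runbook; note that retention.bytes applies per partition, so the cluster-wide ceiling is a rough upper bound that assumes every partition fills its quota.

```python
PARTITIONS = 48
REPLICATION_FACTOR = 3
RETENTION_BYTES = 107_374_182_400   # retention.bytes, enforced per partition
RETENTION_MS = 604_800_000          # retention.ms
TARGET_PLANS_PER_SEC = 50_000       # throughput target from the introduction

GIB = 1024 ** 3
MS_PER_DAY = 24 * 60 * 60 * 1000

retention_gib_per_partition = RETENTION_BYTES / GIB
retention_days = RETENTION_MS / MS_PER_DAY
# Worst case: every partition full, counted once per replica.
max_cluster_disk_gib = retention_gib_per_partition * PARTITIONS * REPLICATION_FACTOR
plans_per_partition_per_sec = TARGET_PLANS_PER_SEC / PARTITIONS

print(f"{retention_gib_per_partition:.0f} GiB per partition, kept up to {retention_days:.0f} days")
print(f"up to {max_cluster_disk_gib:.0f} GiB across the cluster")
print(f"about {plans_per_partition_per_sec:.0f} plans/s per partition at target load")
```

Whichever retention limit is reached first triggers deletion, so a partition under heavy load may hold fewer than seven days of plans.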

Step 2 — Publish with an idempotent, batched async producer

The producer keys each plan by priority:query_hash so a topic-side circuit breaker can shed low-priority traffic first, and so plans that share a priority and a fingerprint land on the same partition. The query_hash should be the fingerprint produced by the plan hashing algorithm used everywhere in the baseline system; hashing the serialized payload instead yields a content hash that changes whenever any plan detail changes.

PYTHON

import hashlib

import orjson
import structlog
from aiokafka import AIOKafkaProducer
from opentelemetry import trace

log = structlog.get_logger(__name__)
tracer = trace.get_tracer("query-plan.producer")


async def build_producer() -> AIOKafkaProducer:
    producer = AIOKafkaProducer(
        bootstrap_servers="kafka-broker-01:9092,kafka-broker-02:9092",
        compression_type="zstd",
        acks="all",
        enable_idempotence=True,
        max_batch_size=1_048_576,
        linger_ms=15,
    )
    await producer.start()
    return producer


async def publish_query_plan(
    producer: AIOKafkaProducer, plan_payload: dict, priority: str = "standard"
) -> None:
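    # Sort keys so identical plans serialize (and hash) identically.
    plan_json = orjson.dumps(plan_payload, option=orjson.OPT_SORT_KEYS)
    # Assumption: the plan-hashing stage sets "query_hash" on the payload. The
    # SHA-256 fallback is a content hash, not a normalized query fingerprint.
    query_hash = plan_payload.get("query_hash") or hashlib.sha256(plan_json).hexdigest()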
    key = f"{priority}:{query_hash}".encode()
    with tracer.start_as_current_span("publish_query_plan") as span:
        span.set_attribute("query_plan.priority", priority)
        span.set_attribute("query_plan.hash", query_hash)
        try:
            await producer.send_and_wait("query-plans", key=key, value=plan_json)
        except Exception:
            log.error("plan_publish_failed", query_hash=query_hash, priority=priority)
            raise

Step 3 — Drain deterministically with DLQ routing

The consumer disables auto-commit, deserializes with orjson, and commits only after the whole micro-batch is durably upserted, so a crash mid-batch replays rather than skips. Payloads that fail parsing or the store write are routed to query-plans.dlq with a reason, never dropped.

PYTHON

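import asyncio

import orjson
import structlog
from aiokafka import AIOKafkaConsumer, AIOKafkaProducer

# Import path assumes a recent aiokafka release; adjust it for your version.
from aiokafka.coordinator.assignors.sticky.sticky_assignor import (
    StickyPartitionAssignor,
)

log = structlog.get_logger(__name__)


async def run_consumer(dlq: AIOKafkaProducer, upsert) -> None:
    consumer = AIOKafkaConsumer(
        "query-plans",
        bootstrap_servers="kafka-broker-01:9092",
        group_id="query-plan-consumers",
        auto_offset_reset="earliest",
        enable_auto_commit=False,
        # aiokafka expects assignor classes rather than strategy-name strings and
        # does not ship a cooperative assignor; eager sticky is the closest fit.
        partition_assignment_strategy=[StickyPartitionAssignor],
        session_timeout_ms=45_000,
        heartbeat_interval_ms=15_000,
        max_poll_interval_ms=300_000,
    )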
    await consumer.start()
    try:
        while True:
            batches = await consumer.getmany(timeout_ms=1000, max_records=500)
            for _tp, records in batches.items():
                for msg in records:
                    try:
                        await upsert(orjson.loads(msg.value))
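                    except Exception as exc:
                        # Attach the failure reason as a header so DLQ replay can triage it.
                        await dlq.send_and_wait(
                            "query-plans.dlq",
                            key=msg.key,
                            value=msg.value,
                            headers=[("reason", str(exc).encode())],
                        )
                        log.warning("routed_to_dlq", reason=str(exc), offset=msg.offset)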
            if batches:
                await consumer.commit()
    finally:
        await consumer.stop()


# Entry point, shown as a comment because both arguments come from the caller:
# dlq is a started AIOKafkaProducer and upsert is an async store writer.
#     asyncio.run(run_consumer(dlq, upsert))

Step 4 — Wire the tiered fallback chain

Production must degrade gracefully, not stop. Layer four fallbacks so regression candidates keep flowing even while infrastructure is impaired:

Circuit breaker. When p95 ingestion latency exceeds 1,200 ms for 60 seconds, trip the breaker and drop priority=low immediately while continuing priority=critical and priority=regression_candidate.
Local buffering. If the baseline store rejects writes, buffer to a local write-ahead log file and resume only when store lock wait drops below 400 ms.
Synchronous last resort. If cluster health degrades (UnderReplicatedPartitions > 5 for 2 minutes), route EXPLAIN output to a lightweight Redis queue with TTL eviction — this accepts higher DB connection overhead to preserve capture continuity, and pairs with the low-impact capture techniques that keep the fallback from harming the workload.
DLQ replay. Inspect query-plans.dlq with a schema-validation worker, correct malformed payloads, republish to the main topic, and advance consumer offsets only after a successful baseline commit.
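
The first fallback in the chain above, the latency circuit breaker, could be sketched as follows. The class name and the injected clock are hypothetical; the 1,200 ms and 60 second values come from the list above. This sketch assumes only priority=low is shed and that every other priority, including the producer's default "standard", keeps flowing.

```python
import time

SHED_WHEN_OPEN = {"low"}  # assumption: only low-priority captures are dropped


class LatencyCircuitBreaker:
    """Opens after p95 latency stays above the threshold for a sustained period."""

    def __init__(self, threshold_ms=1_200.0, sustain_s=60.0, clock=time.monotonic):
        self._threshold_ms = threshold_ms
        self._sustain_s = sustain_s
        self._clock = clock
        self._breach_started = None  # time p95 first exceeded the threshold

    def observe_p95(self, p95_ms: float) -> None:
        """Feed the latest p95 reading; a healthy reading closes the breaker."""
        if p95_ms > self._threshold_ms:
            if self._breach_started is None:
                self._breach_started = self._clock()
        else:
            self._breach_started = None

    @property
    def is_open(self) -> bool:
        return (
            self._breach_started is not None
            and self._clock() - self._breach_started >= self._sustain_s
        )

    def should_publish(self, priority: str) -> bool:
        """False only for shed priorities while the breaker is open."""
        return not (self.is_open and priority in SHED_WHEN_OPEN)
```

A capture agent would call `should_publish(priority)` before each publish and skip the send when it returns False, so critical and regression_candidate plans are never affected by the breaker.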

Verification checklist

Run these after any config or code change to the ingestion path:

[ ] kafka-consumer-groups.sh --describe shows per-partition lag below 15,000 and holding steady
[ ] End-to-end p95 latency (producer publish to store commit) is under 1,200 ms on a 1% traced sample
[ ] DLQ routing rate is below 0.15% of consumed records over a rolling 5-minute window
[ ] min.insync.replicas=2 and UnderReplicatedPartitions=0 confirmed on the query-plans topic
[ ] Schema registry reports BACKWARD_TRANSITIVE compatibility for query-plans-value
[ ] A composite index on (query_hash, plan_version) exists and store row-lock wait stays under 800 ms
[ ] A staging replay at 2× production throughput triggers consumer autoscaling and activates fallbacks with zero data loss
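
The latency and DLQ items in the checklist above reduce to two small calculations over sampled data. The sketch below uses the nearest-rank definition of p95, which is one common convention and may differ slightly from what your metrics backend reports; the function names are hypothetical.

```python
def p95_nearest_rank(samples_ms):
    """95th percentile by nearest rank: the smallest value covering 95% of samples."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]


def latency_check_passes(samples_ms, limit_ms=1_200):
    """Checklist item: end-to-end p95 under the limit on the traced sample."""
    return p95_nearest_rank(samples_ms) < limit_ms


def dlq_check_passes(dlq_records, consumed_records, limit=0.0015):
    """Checklist item: DLQ routing rate below 0.15% of consumed records."""
    return consumed_records > 0 and dlq_records / consumed_records < limit
```

Feed `latency_check_passes` the publish-to-commit durations from the 1% traced sample, and `dlq_check_passes` the two counters over the same rolling 5-minute window.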

Compatibility and engine-specific notes

The transport is engine-agnostic, but the payload shape and the store-side upsert differ by source engine. Producers must normalize before publishing so downstream stages stay uniform, which is the job of the cross-engine normalization stage.

Concern	PostgreSQL	MySQL	Distributed SQL (CockroachDB / Vitess)
Plan source	`EXPLAIN (FORMAT JSON, ANALYZE)`	`EXPLAIN FORMAT=JSON`	`EXPLAIN (FORMAT JSON)` / vendor JSON
Payload size	Large; nested `Plans` arrays, verbose	Compact; flatter `query_block`	Large; per-node distribution metadata inflates size
Partition key	`query_hash` from normalized tree	Same, after operator canonicalization	Same, but include gateway/region to avoid cross-node skew
Store upsert	`INSERT ... ON CONFLICT (query_hash, plan_version)`	`INSERT ... ON DUPLICATE KEY UPDATE`	Prefer `UPSERT`; watch for retryable serialization errors
Compression payoff	High (verbose JSON, zstd ~5×)	Moderate	High, but larger absolute sizes raise `linger.ms` sensitivity

For distributed engines, size a few extra partitions and raise max_batch_size so the larger per-node plan payloads do not stall a partition. The downstream comparison of those normalized plans against history is governed by the regression thresholds defined in the core architecture.
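
The partition-key row of the table could translate into a helper like the one below. It is a sketch: the function name, the engine identifiers, and the choice to append the region as a third key component are assumptions layered on the priority:query_hash key used by the producer.

```python
from typing import Optional

# Assumption: engines whose plans should carry a gateway/region component.
DISTRIBUTED_ENGINES = {"cockroachdb", "vitess"}


def build_partition_key(
    priority: str, query_hash: str, engine: str, region: Optional[str] = None
) -> bytes:
    """Build the Kafka record key, adding the region for distributed engines."""
    parts = [priority, query_hash]
    if engine.lower() in DISTRIBUTED_ENGINES:
        if not region:
            raise ValueError("distributed engines need a gateway/region component")
        parts.append(region)
    return ":".join(parts).encode()
```

Including the region spreads one hot fingerprint across partitions per region, at the cost of no longer co-locating every variant of that fingerprint on a single partition.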

Related

Building Async Ingestion Pipelines for High-Throughput Queries (parent page)
Automated EXPLAIN Capture & Storage Workflows — the full capture-to-storage reference architecture
Normalizing Query Plans for Cross-Engine Comparison — the consumer that drains this topic
Schema Validation for Baseline Metadata — envelope contracts enforced before persistence
Plan Hashing Algorithms for SQL Engines — the fingerprint used as the partition key
