Runbook
Encrypting Baseline Query Plans at Rest and in Transit: Production Runbook & Automation Guide
Implementing robust cryptographic controls around execution plan artifacts is a non-negotiable requirement for modern query optimization pipelines. When baseline plans leak or become inaccessible due to cipher misalignment, regression detection breaks, and performance degradation goes uncaught. This guide details the operational implementation of Encrypting Baseline Query Plans at Rest and in Transit, providing exact thresholds, Python automation logic, and incident resolution patterns for platform teams managing baseline tracking infrastructure. All controls align with the isolation and key management principles established in Core Architecture & Baselining Fundamentals.
Symptom Identification & Threshold Mapping
Baseline encryption failures rarely manifest as explicit DECRYPTION_FAILED errors in production. Instead, they surface as silent degradation in CI/CD regression gates and anomalous latency in plan diff engines. SREs should monitor for the following exact symptom thresholds:
- Baseline Fetch Latency:
p99 > 2.5son plan retrieval endpoints indicates TLS handshake retries or KMS decryption queue saturation. - CI Pipeline Stalls: Regression jobs hanging at
fetch_baseline_planfor> 120stypically signal cipher suite mismatch between the collector and object storage. - Plan Diff Nullification: When the regression engine receives
0-byteorcorrupted_base64payloads, envelope decryption has failed silently due to key version drift. - TLS Alert Frequency:
SSLV3_ALERT_HANDSHAKE_FAILUREorTLS1_ALERT_PROTOCOL_VERSIONappearing at> 5/minin collector logs confirms in-transit encryption negotiation breakdown. - Key Age Threshold: KMS/CloudHSM keys older than
90 dayswithout automated rotation trigger compliance blocks and increase the probability of stale ciphertext.
Root Cause Analysis & Cryptographic Architecture
Most encryption regressions trace back to three operational failure modes. Understanding the underlying envelope encryption model is critical for rapid triage.
- Envelope Key Drift: Baseline plans are encrypted using a symmetric Data Encryption Key (DEK), which is itself wrapped by a Key Encryption Key (KEK) managed by a centralized KMS. If the KEK rotates but the DEK metadata (
key_id,algorithm,iv) isn’t atomically updated in the baseline manifest, the regression engine attempts decryption with stale parameters. Refer to AWS KMS documentation on envelope encryption for the standard cryptographic workflow. - In-Transit Cipher Downgrade: Legacy CI runners or outdated database collectors default to
TLS_RSA_WITH_AES_128_CBC_SHA. Modern baseline storage rejects these suites, causing connection resets before payload transfer. Python’ssslmodule defaults to secure contexts, but explicitminimum_versionenforcement is required in custom collectors (Python SSL Context Docs). - Storage Boundary Misconfiguration: When baseline artifacts cross availability zones or cloud accounts, IAM/KMS policies often lack explicit
kms:Decryptorkms:GenerateDataKeypermissions for the regression service account. This aligns with the isolation requirements documented in Security Boundaries for Baseline Data Storage, where cross-boundary key access must be explicitly scoped to prevent lateral decryption and maintain least-privilege enforcement.
Step-by-Step Mitigation & Python/CI Automation Logic
Remediation requires deterministic envelope encryption/decryption routines, strict TLS configuration, and automated key rotation validation. The following Python implementation demonstrates a production-ready envelope handler for baseline artifacts.
The envelope-encryption model — a per-record data key (DEK) encrypts the plan, and a KMS-managed key-encryption key (KEK) wraps the DEK:
flowchart LR PT["Plaintext plan JSON"] --> ENC["AES-256-GCM encrypt"] DEK["Data key — DEK"] --> ENC IV["Random 12-byte IV"] --> ENC ENC --> CT["Ciphertext"] DEK --> WRAP["KMS wrap with KEK"] WRAP --> WDEK["Wrapped DEK"] CT --> ENV["Envelope: ciphertext + wrapped_dek + iv"] WDEK --> ENV IV --> ENV
1. Envelope Encryption/Decryption Handler
import os
import base64
import json
import struct
from cryptography.hazmat.primitives.ciphers.aead import AESGCM
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.hkdf import HKDF
from typing import Tuple
class BaselinePlanCrypto:
"""Handles DEK generation, KMS wrapping, and AES-GCM envelope operations."""
def __init__(self, kms_client, key_id: str):
self.kms = kms_client
self.key_id = key_id
def encrypt_plan(self, plan_json: str) -> dict:
# 1. Generate 256-bit DEK
dek = os.urandom(32)
aesgcm = AESGCM(dek)
# 2. Generate IV (12 bytes for AES-GCM)
iv = os.urandom(12)
# 3. Encrypt payload
ciphertext = aesgcm.encrypt(iv, plan_json.encode('utf-8'), None)
# 4. Wrap DEK via KMS (KEK)
wrapped_dek = self.kms.encrypt(KeyId=self.key_id, Plaintext=dek)
return {
"ciphertext": base64.b64encode(ciphertext).decode(),
"wrapped_dek": base64.b64encode(wrapped_dek["CiphertextBlob"]).decode(),
"iv": base64.b64encode(iv).decode(),
"key_id": self.key_id,
"algorithm": "AES_256_GCM"
}
def decrypt_plan(self, envelope: dict) -> str:
# 1. Unwrap DEK
wrapped_dek_blob = base64.b64decode(envelope["wrapped_dek"])
unwrap_resp = self.kms.decrypt(CiphertextBlob=wrapped_dek_blob)
dek = unwrap_resp["Plaintext"]
# 2. Decrypt payload
iv = base64.b64decode(envelope["iv"])
ciphertext = base64.b64decode(envelope["ciphertext"])
aesgcm = AESGCM(dek)
plaintext = aesgcm.decrypt(iv, ciphertext, None)
return plaintext.decode('utf-8')2. CI/CD Pipeline Integration (GitLab CI / GitHub Actions)
Embed cryptographic validation directly into regression gates. The following YAML snippet enforces TLS 1.3 and validates envelope integrity before diff execution.
stages:
- fetch_baseline
- validate_crypto
- regression_diff
fetch_baseline:
stage: fetch_baseline
image: python:3.11-slim
script:
- pip install boto3 cryptography requests
- python -c "
import ssl, requests
ctx = ssl.create_default_context()
ctx.minimum_version = ssl.TLSVersion.TLSv1_3
resp = requests.get('${BASELINE_STORAGE_URL}', verify=True, timeout=30)
resp.raise_for_status()
with open('/tmp/plan_envelope.json', 'w') as f: f.write(resp.text)
"
artifacts:
paths: ["/tmp/plan_envelope.json"]
expire_in: 1h
validate_crypto:
stage: validate_crypto
script:
- python -c "
import json, sys
env = json.load(open('/tmp/plan_envelope.json'))
required = {'ciphertext', 'wrapped_dek', 'iv', 'key_id', 'algorithm'}
missing = required - env.keys()
if missing:
print(f'FATAL: Missing envelope fields: {missing}')
sys.exit(1)
print('Envelope structure validated successfully.')
"Observability & Telemetry Integration
Silent failures in cryptographic pipelines require explicit instrumentation. Deploy the following OpenTelemetry/Prometheus configuration to surface degradation before it impacts regression SLAs.
Metric Definitions
| Metric | Type | Labels | Alert Threshold |
|---|---|---|---|
baseline_kms_decrypt_duration_seconds | Histogram | key_id, status | p95 > 800ms |
baseline_tls_handshake_failures_total | Counter | cipher_suite, endpoint | > 5/min |
baseline_envelope_decryption_errors_total | Counter | error_type | > 0 (Page) |
baseline_key_age_days | Gauge | key_id | > 85 (Warning) |
Prometheus Alert Rules
groups:
- name: baseline_crypto_alerts
rules:
- alert: BaselineDecryptionLatencyHigh
expr: histogram_quantile(0.95, rate(baseline_kms_decrypt_duration_seconds_bucket[5m])) > 0.8
for: 3m
labels:
severity: warning
annotations:
summary: "KMS decryption latency exceeding 800ms p95"
- alert: BaselineEnvelopeCorruption
expr: increase(baseline_envelope_decryption_errors_total[10m]) > 0
for: 1m
labels:
severity: critical
annotations:
summary: "Baseline plan decryption failures detected. Check KEK rotation status."Safe Fallback Chains & Incident Response
When cryptographic dependencies fail, the regression pipeline must degrade gracefully without compromising security posture. Implement the following fallback chain in your orchestration layer:
- Circuit Breaker Activation: If
kms:Decrypterror rate exceeds10%over a 2-minute window, trip the circuit breaker. Route subsequent fetch requests to a read-only cache. - Stale Baseline Fallback: Serve the most recently validated plaintext baseline from an encrypted, short-TTL Redis cache (
TTL=15m). Tag the payload withx-baseline-state: cached_fallbackto prevent false-positive regression alerts. - Manual Key Override (Break-Glass): In the event of KMS regional outage, authorized SREs can inject a pre-approved, hardware-backed recovery key via environment variable
BASELINE_RECOVERY_KEY_ARN. This key is strictly audited and auto-revoked after24h. - Pipeline Bypass Protocol: If decryption fails and no fallback exists, the CI pipeline must skip the regression diff stage but must not auto-approve the deployment. Instead, it should block the release with a
CRYPTO_DEPENDENCY_DEGRADEDstatus and notify the platform on-call.
Incident Runbook Quick Reference
- Verify KMS health dashboard and cross-region replication status.
- Check
baseline_key_age_daysmetric. If> 90, trigger emergency rotation viaaws kms schedule-key-deletion. - Validate TLS configuration on CI runners:
openssl s_client -connect <storage_endpoint>:443 -tls1_3. - If envelope corruption is isolated to a single plan, re-run the collector with
--force-recollectto regenerate the DEK/KEK binding. - Post-incident: Audit IAM policies against the least-privilege matrix and update the cryptographic baseline manifest schema version.