Encrypting Baseline Query Plans at Rest and in Transit: Production Runbook & Automation Guide

Cryptographic controls around execution plan artifacts are a hard requirement for modern query optimization pipelines. When baseline plans leak, or become unreadable because of cipher or key misalignment, regression detection breaks and performance degradation goes uncaught. This guide covers the operational side of encrypting baseline query plans at rest and in transit: exact thresholds, Python automation logic, and incident resolution patterns for platform teams managing baseline tracking infrastructure. All controls align with the isolation and key management principles established in Core Architecture & Baselining Fundamentals.

Symptom Identification & Threshold Mapping

Baseline encryption failures rarely manifest as explicit DECRYPTION_FAILED errors in production. Instead, they surface as silent degradation in CI/CD regression gates and anomalous latency in plan diff engines. SREs should monitor for the following exact symptom thresholds:

  • Baseline Fetch Latency: p99 > 2.5s on plan retrieval endpoints indicates TLS handshake retries or KMS decryption queue saturation.
  • CI Pipeline Stalls: Regression jobs hanging at fetch_baseline_plan for > 120s typically signal cipher suite mismatch between the collector and object storage.
  • Plan Diff Nullification: When the regression engine receives 0-byte or corrupted_base64 payloads, envelope decryption has failed silently due to key version drift.
  • TLS Alert Frequency: SSLV3_ALERT_HANDSHAKE_FAILURE or TLS1_ALERT_PROTOCOL_VERSION appearing at > 5/min in collector logs confirms in-transit encryption negotiation breakdown.
  • Key Age Threshold: KMS/CloudHSM keys older than 90 days without automated rotation trigger compliance blocks and increase the probability of stale ciphertext.
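
The numeric thresholds above can be kept in one lookup so that alerting code and this runbook stay in agreement. The following is a minimal sketch; the metric names and the `breached` helper are hypothetical and not part of any existing tooling.

```python
# Hypothetical metric names; the limits mirror the symptom thresholds listed above.
THRESHOLDS = {
    "fetch_latency_p99_s": 2.5,   # baseline fetch latency, seconds
    "ci_fetch_stall_s": 120,      # time spent in fetch_baseline_plan, seconds
    "tls_alerts_per_min": 5,      # handshake/protocol alerts per minute
    "key_age_days": 90,           # key age without automated rotation, days
}


def breached(observed: dict) -> list:
    """Return the sorted names of metrics whose observed value exceeds its limit.

    Metrics missing from `observed` are treated as healthy (assumption).
    """
    return sorted(
        name
        for name, limit in THRESHOLDS.items()
        if observed.get(name, 0) > limit
    )
```

For example, `breached({"fetch_latency_p99_s": 3.1, "key_age_days": 42})` returns `['fetch_latency_p99_s']`. Plan diff nullification is not numeric and needs a separate payload check.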

Root Cause Analysis & Cryptographic Architecture

Most encryption regressions trace back to three operational failure modes. Understanding the underlying envelope encryption model is critical for rapid triage.

  1. Envelope Key Drift: Baseline plans are encrypted using a symmetric Data Encryption Key (DEK), which is itself wrapped by a Key Encryption Key (KEK) managed by a centralized KMS. If the KEK rotates but the DEK metadata (key_id, algorithm, iv) isn’t atomically updated in the baseline manifest, the regression engine attempts decryption with stale parameters. Refer to AWS KMS documentation on envelope encryption for the standard cryptographic workflow.
  2. In-Transit Cipher Downgrade: Legacy CI runners or outdated database collectors default to TLS_RSA_WITH_AES_128_CBC_SHA. Modern baseline storage rejects these suites, causing connection resets before payload transfer. Python’s ssl.create_default_context() returns a context with secure defaults, but custom collectors must still set minimum_version explicitly (Python SSL Context Docs).
  3. Storage Boundary Misconfiguration: When baseline artifacts cross availability zones or cloud accounts, IAM/KMS policies often lack explicit kms:Decrypt or kms:GenerateDataKey permissions for the regression service account. This aligns with the isolation requirements documented in Security Boundaries for Baseline Data Storage, where cross-boundary key access must be explicitly scoped to prevent lateral decryption and maintain least-privilege enforcement.
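
For the second failure mode, a custom collector can pin its protocol floor when it builds the TLS context. The sketch below uses only the standard library; the function name `build_collector_context` is hypothetical, and the TLS 1.3 default is an assumption chosen to match the CI configuration later in this guide.

```python
import ssl


def build_collector_context(
    min_version: ssl.TLSVersion = ssl.TLSVersion.TLSv1_3,
) -> ssl.SSLContext:
    """Build a client context that verifies certificates and enforces a protocol floor."""
    # create_default_context() enables certificate and hostname verification.
    ctx = ssl.create_default_context()
    # Refuse to negotiate anything older than min_version.
    ctx.minimum_version = min_version
    return ctx
```

TLS 1.3 defines only AEAD cipher suites, so a context with this floor cannot negotiate a CBC suite such as TLS_RSA_WITH_AES_128_CBC_SHA. The context only takes effect if it is passed to the call that opens the connection, for example `urllib.request.urlopen(url, context=ctx)`.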

Step-by-Step Mitigation & Python/CI Automation Logic

Remediation requires deterministic envelope encryption/decryption routines, strict TLS configuration, and automated key rotation validation. The following Python implementation demonstrates a production-ready envelope handler for baseline artifacts.

The envelope-encryption model — a per-record data key (DEK) encrypts the plan, and a KMS-managed key-encryption key (KEK) wraps the DEK:

flowchart LR
  PT["Plaintext plan JSON"] --> ENC["AES-256-GCM encrypt"]
  DEK["Data key — DEK"] --> ENC
  IV["Random 12-byte IV"] --> ENC
  ENC --> CT["Ciphertext"]
  DEK --> WRAP["KMS wrap with KEK"]
  WRAP --> WDEK["Wrapped DEK"]
  CT --> ENV["Envelope: ciphertext + wrapped_dek + iv"]
  WDEK --> ENV
  IV --> ENV

1. Envelope Encryption/Decryption Handler

PYTHON
import base64
import os

from cryptography.hazmat.primitives.ciphers.aead import AESGCM

class BaselinePlanCrypto:
    """Handles DEK generation, KMS wrapping, and AES-GCM envelope operations."""
    
    def __init__(self, kms_client, key_id: str):
        self.kms = kms_client
        self.key_id = key_id

    def encrypt_plan(self, plan_json: str) -> dict:
        # 1. Generate 256-bit DEK
        dek = os.urandom(32)
        aesgcm = AESGCM(dek)
        
        # 2. Generate IV (12 bytes for AES-GCM)
        iv = os.urandom(12)
        
        # 3. Encrypt payload
        ciphertext = aesgcm.encrypt(iv, plan_json.encode('utf-8'), None)
        
        # 4. Wrap DEK via KMS (KEK)
        wrapped_dek = self.kms.encrypt(KeyId=self.key_id, Plaintext=dek)
        
        return {
            "ciphertext": base64.b64encode(ciphertext).decode(),
            "wrapped_dek": base64.b64encode(wrapped_dek["CiphertextBlob"]).decode(),
            "iv": base64.b64encode(iv).decode(),
            "key_id": self.key_id,
            "algorithm": "AES_256_GCM"
        }

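    def decrypt_plan(self, envelope: dict) -> str:
        # Refuse envelopes written with an algorithm this handler does not implement
        if envelope.get("algorithm") != "AES_256_GCM":
            raise ValueError(f"Unsupported algorithm: {envelope.get('algorithm')}")

        # 1. Unwrap DEK, pinning the KEK recorded in the envelope
        wrapped_dek_blob = base64.b64decode(envelope["wrapped_dek"])
        unwrap_resp = self.kms.decrypt(
            CiphertextBlob=wrapped_dek_blob, KeyId=envelope["key_id"]
        )
        dek = unwrap_resp["Plaintext"]

        # 2. Decrypt payload; AESGCM.decrypt raises InvalidTag on tampering or a wrong key
        iv = base64.b64decode(envelope["iv"])
        ciphertext = base64.b64decode(envelope["ciphertext"])
        aesgcm = AESGCM(dek)

        plaintext = aesgcm.decrypt(iv, ciphertext, None)
        return plaintext.decode('utf-8')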

2. CI/CD Pipeline Integration (GitLab CI / GitHub Actions)

Embed cryptographic validation directly into regression gates. The following snippet, written in GitLab CI syntax, enforces TLS 1.3 on the baseline fetch and validates the envelope structure before diff execution.

YAML
stages:
  - fetch_baseline
  - validate_crypto
  - regression_diff

fetch_baseline:
  stage: fetch_baseline
  image: python:3.11-slim
  script:
    - pip install boto3 cryptography
    - |
      python - <<'EOF'
      import os, ssl, urllib.request
      # The context must be passed to the request; requests.get() would ignore it
      ctx = ssl.create_default_context()
      ctx.minimum_version = ssl.TLSVersion.TLSv1_3
      url = os.environ["BASELINE_STORAGE_URL"]
      # urlopen raises HTTPError on 4xx/5xx responses
      with urllib.request.urlopen(url, context=ctx, timeout=30) as resp:
          body = resp.read()
      # GitLab only collects artifacts from inside the project directory
      with open("plan_envelope.json", "wb") as f:
          f.write(body)
      EOF
  artifacts:
    paths: ["plan_envelope.json"]
    expire_in: 1h

validate_crypto:
  stage: validate_crypto
  image: python:3.11-slim
  script:
    - |
      python - <<'EOF'
      import json, sys
      with open("plan_envelope.json") as f:
          env = json.load(f)
      required = {"ciphertext", "wrapped_dek", "iv", "key_id", "algorithm"}
      missing = required - env.keys()
      if missing:
          print(f"FATAL: Missing envelope fields: {missing}")
          sys.exit(1)
      print("Envelope structure validated successfully.")
      EOF
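
The structural check in the pipeline only confirms that the fields exist. A slightly stricter pre-decryption check can also catch the 0-byte and malformed base64 payloads described in the symptom list. The sketch below is one possible shape; `envelope_problems` is a hypothetical helper, and the 12-byte IV and `AES_256_GCM` label are taken from the handler in this guide.

```python
import base64
import binascii

REQUIRED_FIELDS = {"ciphertext", "wrapped_dek", "iv", "key_id", "algorithm"}


def envelope_problems(envelope: dict) -> list:
    """Return human-readable problems; an empty list means the envelope looks well-formed."""
    missing = REQUIRED_FIELDS - envelope.keys()
    if missing:
        return [f"missing fields: {sorted(missing)}"]

    problems = []
    if envelope["algorithm"] != "AES_256_GCM":
        problems.append(f"unexpected algorithm: {envelope['algorithm']}")

    decoded = {}
    for field in ("ciphertext", "wrapped_dek", "iv"):
        try:
            # validate=True rejects characters outside the base64 alphabet
            decoded[field] = base64.b64decode(envelope[field], validate=True)
        except (binascii.Error, ValueError, TypeError):
            problems.append(f"{field} is not valid base64")

    if "iv" in decoded and len(decoded["iv"]) != 12:
        problems.append("iv must decode to 12 bytes")
    # AES-GCM output carries a 16-byte tag, so anything shorter was truncated.
    if "ciphertext" in decoded and len(decoded["ciphertext"]) < 16:
        problems.append("ciphertext shorter than the 16-byte GCM tag")
    return problems
```

This does not prove the envelope will decrypt; only the AES-GCM tag check during decryption establishes integrity. It does let the gate fail fast with a specific message before any KMS call is made.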

Observability & Telemetry Integration

Silent failures in cryptographic pipelines require explicit instrumentation. Deploy the following OpenTelemetry/Prometheus configuration to surface degradation before it impacts regression SLAs.

Metric Definitions

Metric                                    | Type      | Labels                 | Alert Threshold
baseline_kms_decrypt_duration_seconds     | Histogram | key_id, status         | p95 > 800ms
baseline_tls_handshake_failures_total     | Counter   | cipher_suite, endpoint | > 5/min
baseline_envelope_decryption_errors_total | Counter   | error_type             | > 0 (Page)
baseline_key_age_days                     | Gauge     | key_id                 | > 85 (Warning)

Prometheus Alert Rules

YAML
groups:
  - name: baseline_crypto_alerts
    rules:
      - alert: BaselineDecryptionLatencyHigh
        expr: histogram_quantile(0.95, sum by (le) (rate(baseline_kms_decrypt_duration_seconds_bucket[5m]))) > 0.8
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "KMS decryption latency exceeding 800ms p95"
          
      - alert: BaselineEnvelopeCorruption
        expr: increase(baseline_envelope_decryption_errors_total[10m]) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Baseline plan decryption failures detected. Check KEK rotation status."

Safe Fallback Chains & Incident Response

When cryptographic dependencies fail, the regression pipeline must degrade gracefully without compromising security posture. Implement the following fallback chain in your orchestration layer:

  1. Circuit Breaker Activation: If kms:Decrypt error rate exceeds 10% over a 2-minute window, trip the circuit breaker. Route subsequent fetch requests to a read-only cache.
  2. Stale Baseline Fallback: Serve the most recently validated baseline from a short-TTL Redis cache (TTL=15m) that is itself encrypted at rest. Tag the payload with x-baseline-state: cached_fallback to prevent false-positive regression alerts.
  3. Manual Key Override (Break-Glass): In the event of KMS regional outage, authorized SREs can inject a pre-approved, hardware-backed recovery key via environment variable BASELINE_RECOVERY_KEY_ARN. This key is strictly audited and auto-revoked after 24h.
  4. Pipeline Bypass Protocol: If decryption fails and no fallback exists, the CI pipeline must skip the regression diff stage but must not auto-approve the deployment. Instead, it should block the release with a CRYPTO_DEPENDENCY_DEGRADED status and notify the platform on-call.
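
The trip condition in step 1 can be sketched as a sliding-window error-rate check. This is a minimal illustration, not a full breaker: the class name, the `min_calls` guard, and the injectable `clock` are assumptions, and half-open recovery probing is omitted.

```python
import time
from collections import deque


class DecryptCircuitBreaker:
    """Opens when the failure ratio over a sliding time window exceeds a limit."""

    def __init__(self, max_error_rate=0.10, window_s=120.0, min_calls=10,
                 clock=time.monotonic):
        self.max_error_rate = max_error_rate
        self.window_s = window_s
        self.min_calls = min_calls   # assumption: do not trip on a handful of calls
        self._clock = clock          # injectable so the window can be tested
        self._events = deque()       # (timestamp, succeeded) pairs, oldest first

    def record(self, succeeded: bool) -> None:
        """Record the outcome of one kms:Decrypt call."""
        self._events.append((self._clock(), succeeded))

    def is_open(self) -> bool:
        """Return True when requests should be routed to the read-only cache."""
        cutoff = self._clock() - self.window_s
        # Drop outcomes that have aged out of the window.
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()
        total = len(self._events)
        if total < self.min_calls:
            return False
        failures = sum(1 for _, ok in self._events if not ok)
        return failures / total > self.max_error_rate
```

The orchestration layer would call `record()` after every decrypt attempt and consult `is_open()` before each fetch. Because old outcomes expire from the window, the breaker closes again once failures stop; a production breaker would usually add an explicit half-open state instead of relying on expiry alone.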

Incident Runbook Quick Reference

  1. Verify KMS health dashboard and cross-region replication status.
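  2. Check baseline_key_age_days metric. If > 90, trigger an on-demand rotation via aws kms rotate-key-on-demand --key-id <key_id> and confirm automatic rotation is enabled with aws kms enable-key-rotation. Do not use aws kms schedule-key-deletion: it schedules the KEK for deletion, after which every DEK wrapped by it becomes permanently undecryptable.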
  3. Validate TLS configuration on CI runners: openssl s_client -connect <storage_endpoint>:443 -tls1_3.
  4. If envelope corruption is isolated to a single plan, re-run the collector with --force-recollect to regenerate the DEK/KEK binding.
  5. Post-incident: Audit IAM policies against the least-privilege matrix and update the cryptographic baseline manifest schema version.