1

🐕 Datadog Monitoring & Defensive Programming

Building Resilient Python Services on Kubernetes

Topics We'll Cover

📊 Datadog Monitoring

APM setup, metrics types, and implementation

🛡️ Defensive Programming

High availability & security best practices

2

📊 What is Datadog?

Cloud-Scale Monitoring

  • Metrics aggregation & visualization
  • Distributed tracing
  • Log management
  • APM (Application Performance Monitoring)
  • Real-time dashboards & alerts

DogStatsD Agent

  • Runs as sidecar in Kubernetes
  • Collects metrics via UDP (port 8125)
  • Low-latency, non-blocking
  • Aggregates before sending to Datadog
💡 Key Takeaway: Datadog gives you observability into your services' health, performance, and business metrics in real time.
3

🔍 Application Performance Monitoring (APM)

What is APM?

APM = Application Performance Monitoring

  • Distributed Tracing - Follow requests across services
  • Performance Bottlenecks - See which functions are slow
  • Error Tracking - Capture exceptions with full context
  • Service Dependencies - Visualize service maps
  • Database Queries - See slow SQL queries
  • External API Calls - Track third-party latencies

What You Get

📊 Trace Timeline

See the entire request journey:
API → Auth → PostgreSQL → Celery → Redis → LLM API

🌐 Service Map

Visual graph of all services and their dependencies

⚡ Performance Insights

P50, P95, P99 latencies per endpoint. Find the slowest 1%!

💡 Key Benefit: APM shows you WHERE your app is slow and WHY. Metrics tell you "there's a problem", APM tells you "it's this database query in this function"!
4

🔧 Setting Up Datadog APM in Python

Step 1: Install ddtrace

# requirements.txt or pyproject.toml
ddtrace>=2.0.0

Step 2: Key Environment Variables

DD_SERVICE=api-backend        # Your service name
DD_ENV=production             # Environment
DD_TRACE_AGENT_PORT=8126      # APM port (not 8125!)

Step 3: Auto-Instrument Your Application

FastAPI / Flask

# Run with ddtrace-run
ddtrace-run uvicorn main:app

# In Dockerfile
CMD ["ddtrace-run", "uvicorn", \
     "main:app", "--host", "0.0.0.0"]

Celery Workers

# Run Celery with ddtrace
ddtrace-run celery -A src.worker worker

# In Kubernetes
command: ["ddtrace-run"]
args: ["celery", "-A", "src.worker", \
       "worker"]

Step 4: Manual Instrumentation (Optional)

from ddtrace import tracer

@tracer.wrap(service="workflow", resource="generate_report")
def generate_medical_report(patient_id: str):
    with tracer.trace("fetch_biomarkers") as span:
        span.set_tag("patient_id", patient_id[:8])  # Privacy!
        biomarkers = fetch_from_db(patient_id)

    with tracer.trace("llm_analysis") as span:
        span.set_tag("model", "gpt-4")
        span.set_metric("token_count", 1500)
        result = analyze_with_llm(biomarkers)

    return result
🎯 Pro Tip: Start with auto-instrumentation (ddtrace-run). Add manual spans only for critical business logic!
5

📈 Datadog Metric Types

📊 Counter / Increment

What: Cumulative count

When: Events, requests, errors

Example:

from datadog import statsd

statsd.increment(
    'api.requests',
    tags=['endpoint:/users']
)

📏 Gauge

What: Point-in-time value

When: CPU, memory, queue size

Example:

statsd.gauge(
    'queue.size',
    value=42,
    tags=['queue:parser']
)

📊 Histogram

What: Statistical distribution

When: Latencies, sizes, durations

Example:

statsd.histogram(
    'request.duration',
    duration_ms,
    tags=['service:api']
)

⏱️ Timing

What: Duration measurement

When: Operation timing

Example:

with statsd.timed('db.query'):
    result = db.execute(query)
6

🤔 When to Use Which Metric?

Metric Type | Use Case                            | Example
Counter     | Count occurrences (rate matters)    | Requests, errors, tasks completed
Gauge       | Current state (absolute value)      | CPU %, memory usage, active connections
Histogram   | Distribution analysis (percentiles) | Response times, payload sizes
Timing      | Measure durations easily            | Function execution time
💡 Key Takeaway: Use Counter for events, Gauge for snapshots, Histogram for distributions!
7

✅ Real Example: DataDog Metrics Class

workflow-functional/src/workflow/observability/datadog_metrics.py
class DataDogMetrics:
    def __init__(self, enabled: bool = True, service_name: str = "workflow-functional"):
        self.enabled = enabled
        self.service_name = service_name

        if self.enabled:
            try:
                from datadog import initialize, statsd
                options = {"statsd_host": "127.0.0.1", "statsd_port": 8125}
                initialize(**options)
                self.datadog_client = statsd
            except ImportError:
                logger.warning("DataDog not installed, metrics disabled")
                self.enabled = False
            except Exception as e:
                logger.warning(f"Failed to initialize DogStatsD: {e}")
                self.enabled = False  # Graceful degradation!
✅ Defensive Pattern: If Datadog fails, the service keeps running! Metrics are disabled, not the entire application.
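For context, a minimal sketch of the wrapper methods used on the next slides (increment, histogram). These are assumed, not copied from the source file; they simply delegate to DogStatsD and no-op when metrics are disabled:

def increment(self, metric: str, value: int = 1, tags: list[str] | None = None):
    if not self.enabled:
        return
    self.datadog_client.increment(metric, value, tags=tags)

def histogram(self, metric: str, value: float, tags: list[str] | None = None):
    if not self.enabled:
        return
    self.datadog_client.histogram(metric, value, tags=tags)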
8

💼 Track Business Metrics, Not Just Tech

workflow-functional/src/workflow/observability/datadog_metrics.py:196-236
def track_workflow_execution(
    self,
    workflow_name: str,
    duration_ms: float,
    sections_generated: int,
    biomarkers_processed: int,
    success: bool = True,
    error_type: str | None = None,
):
    tags = [f"workflow:{workflow_name}", f"success:{success}"]
    if error_type:
        tags.append(f"error_type:{error_type}")

    self.increment(f"{self.service_name}.workflow_executions", tags=tags)
    self.histogram(f"{self.service_name}.workflow_execution_time", duration_ms, tags=tags)
    self.histogram(f"{self.service_name}.sections_generated", sections_generated, tags=tags)
    self.histogram(f"{self.service_name}.biomarkers_processed", biomarkers_processed, tags=tags)
9

💰 Track Costs with Metrics

workflow-functional/src/workflow/observability/datadog_metrics.py:296-355
def track_llm_cost(self, cost: float, model: str, operation: str, token_count: int):
    tags = [f"model:{model}", f"operation:{operation}"]
    self.histogram(f"{self.service_name}.llm_cost", cost, tags=tags)

    if token_count and token_count > 0:
        cost_per_token = cost / token_count
        self.histogram(f"{self.service_name}.llm_cost_per_token", cost_per_token, tags=tags)

def track_workflow_cost(self, total_cost: float, session_id: str, sections_generated: int):
    tags = [f"session_id:{session_id[:8]}"]  # Privacy: truncate IDs!

    self.histogram(f"{self.service_name}.workflow_total_cost", total_cost, tags=tags)

    if sections_generated > 0:
        cost_per_section = total_cost / sections_generated
        self.histogram(f"{self.service_name}.cost_per_section", cost_per_section, tags=tags)
💡 Key Takeaway: Monitor costs per model, per operation, and per business unit. Alert on anomalies!
10

🛡️ Defensive Programming

Two Core Principles

🚀 High Availability

  • Services stay up despite failures
  • Graceful degradation
  • Retry logic with backoff
  • Circuit breakers (minimal sketch at the end of this slide)
  • Timeouts everywhere

🔒 Security

  • Input validation
  • No secrets in code
  • Timing-safe comparisons
  • Sanitize all inputs
  • Fail securely
🎯 Goal: Assume everything will fail. Design for failure, not just success.
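Circuit breakers come up again on the timeout slide but are never shown; here is a minimal sketch with assumed thresholds (the class name and defaults are illustrative, not from the codebase):

import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at: float | None = None

    def call(self, operation):
        # Open circuit: fail fast until the reset window has elapsed
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("Circuit open - failing fast")
            self.opened_at = None  # half-open: let one trial call through

        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        else:
            self.failures = 0  # success closes the circuit again
            return result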
11

🔄 Retry Logic: The Right Way

workflow-functional/src/workflow/services/retry_service.py:42-84

Good: Classify Errors Before Retrying

def _is_retryable_error(self, error: Exception) -> tuple[bool, float | None]:
    # Rate limit errors (429) - respect server's retry-after header
    if isinstance(error, RateLimitError):
        retry_after = getattr(error, "retry_after", None)
        if retry_after:
            return True, float(retry_after)  # Use server's delay!
        return True, self.config.retry_initial_delay * 2

    # Server errors (5xx) - retryable
    if isinstance(error, APIError) and error.status_code >= 500:
        return True, self.config.retry_initial_delay

    # Timeout errors - retryable
    if isinstance(error, (asyncio.TimeoutError, TimeoutError)):
        return True, self.config.retry_initial_delay

    return False, None  # Client errors (4xx) are NOT retryable
✅ Best Practice: Only retry transient failures (5xx, timeouts). Never retry 4xx client errors!
12

⏱️ Exponential Backoff & Full Retry Pattern

workflow-functional/src/workflow/services/retry_service.py:86-244

Part 1: Calculate Backoff Delay

def _calculate_backoff_delay(self, attempt: int, base_delay: float | None = None) -> float:
    if base_delay is None:
        base_delay = self.config.retry_initial_delay
    return base_delay * (self.config.retry_backoff_base ** attempt)

# Example: base_delay=1s, backoff_base=2
# Attempt 0: 1s, Attempt 1: 2s, Attempt 2: 4s, Attempt 3: 8s

Part 2: Execute with Comprehensive Logging

async def execute_with_retry(self, operation, operation_name: str, max_retries: int = 3):
    stats = self.retry_stats[operation_name]

    for attempt in range(max_retries + 1):
        try:
            logger.debug(f"🔄 ATTEMPT {attempt + 1}/{max_retries + 1}: {operation_name}")
            result = await operation()
            stats["success_count"] += 1
            return result

        except Exception as error:
            is_retryable, retry_after = self._is_retryable_error(error)
            if attempt < max_retries and is_retryable:
                delay = retry_after or self._calculate_backoff_delay(attempt)
                logger.warning(f"⚠️ RETRY in {delay:.2f}s")
                await asyncio.sleep(delay)
                continue
            raise

🚀 Why This Pattern is HA Gold

  • Prevents Thundering Herd - Combined with jitter (sketch below), clients don't all retry at the same moment
  • Self-healing - Transient failures resolve automatically
  • Observability - Tracks success/failure rates per operation
  • Respects server hints - Uses Retry-After header
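One caveat: pure exponential backoff can still let many clients retry in lockstep. A hedged variant (not in the source) adds "full jitter" so retries spread out:

import random

def _calculate_backoff_delay_with_jitter(self, attempt: int, base_delay: float | None = None) -> float:
    if base_delay is None:
        base_delay = self.config.retry_initial_delay
    delay = base_delay * (self.config.retry_backoff_base ** attempt)
    return random.uniform(0, delay)  # sleep a random fraction of the computed delay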
13

❌ vs ✅ Error Handling

Bad: Too Broad

api-backend/routes/records.py:50-54
try:
    result = pipeline.update_batch()
    return result
except Exception as e:  # ⚠️ Too broad!
    raise HTTPException(
        status_code=500,
        detail=f"Error: {str(e)}"
    )

Why This is Dangerous:

  • Catches EVERYTHING - Programming bugs and transient outages are handled identically
  • Security Risk - Exposes stack traces to attackers
  • No retry logic - 400s treated same as 500s
  • Debugging nightmare - Can't distinguish failure types

Good: Specific Handling

try:
    result = pipeline.update_batch()
    return result
except ValidationError as e:
    raise HTTPException(400, "Invalid input")
except RateLimitError as e:
    raise HTTPException(429, "Rate limited")
except ProcessingException as e:
    if e.is_retryable():
        raise HTTPException(503, e.get_user_message())
    raise HTTPException(400, e.get_user_message())
except Exception as e:
    logger.error(f"Unexpected: {e}", exc_info=True)
    raise HTTPException(500, "Internal error")

Why This is Better:

  • Specific exceptions - Know exactly what failed
  • Correct HTTP codes - Clients retry appropriately
  • Security - User-safe messages; full logs server-side
  • Monitoring - Alert on specific error types
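ProcessingException is referenced above but not shown; a plausible (assumed) shape, consistent with the error-code pattern on the next slide:

class ProcessingException(Exception):
    """Hypothetical domain exception carrying retryability and a user-safe message."""

    def __init__(self, user_message: str, *, retryable: bool = False):
        super().__init__(user_message)
        self._user_message = user_message
        self._retryable = retryable

    def is_retryable(self) -> bool:
        return self._retryable

    def get_user_message(self) -> str:
        return self._user_message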
14

🏷️ Standardized Error Codes

parser-router/src/deployment/shared/error_codes.py:68-129
ERROR_DEFINITIONS = {
    ErrorCode.FILE_DOWNLOAD_ERROR: ErrorInfo(
        code=ErrorCode.FILE_DOWNLOAD_ERROR,
        user_message="Failed to download file due to server or network issues.",
        is_retryable=True,  # Built-in retryability!
        internal_description="Failed (5xx, 401, 403, 429, network error)"
    ),
    ErrorCode.FILE_NOT_FOUND_ERROR: ErrorInfo(
        code=ErrorCode.FILE_NOT_FOUND_ERROR,
        user_message="The requested file was not found.",
        is_retryable=False,  # Client error - don't retry!
        internal_description="Terminal client error (400, 404, 410)"
    ),
}
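The ErrorCode enum and ErrorInfo container are not shown on the slide; a hedged reconstruction consistent with the definitions above (field names mirror the usage, everything else is assumed):

from dataclasses import dataclass
from enum import Enum

class ErrorCode(str, Enum):
    FILE_DOWNLOAD_ERROR = "FILE_DOWNLOAD_ERROR"
    FILE_NOT_FOUND_ERROR = "FILE_NOT_FOUND_ERROR"

@dataclass(frozen=True)
class ErrorInfo:
    code: ErrorCode
    user_message: str
    is_retryable: bool
    internal_description: str

# Handlers can then branch on retryability without string matching:
def should_retry(code: ErrorCode) -> bool:
    return ERROR_DEFINITIONS[code].is_retryable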
15

🔐 Security: Prevent Timing Attacks

Bad: Timing Attack Vulnerable

def verify_password(input_password, stored_password):
    if input_password == stored_password:  # ⚠️ Vulnerable!
        return True
    return False

# Why This is Vulnerable:
# Comparison stops at first mismatch
# "aaa" vs "zzz" → fast (1 comparison)
# "zza" vs "zzz" → slow (3 comparisons)
#
# Attacker measures response time:
# - Try "a" → 50ms → wrong
# - Try "z" → 52ms → correct! (took longer)
# - Repeat to discover full password
#
# Can break 8-char password in hours!

Good: Constant-Time Comparison

api-backend/dependencies.py:44-52
import secrets

def verify_credentials(credentials):
    correct_username = secrets.compare_digest(
        credentials.username, USERNAME
    )
    correct_password = secrets.compare_digest(
        credentials.password, PASSWORD
    )

    if not (correct_username and correct_password):
        raise HTTPException(401, "Invalid credentials")

    return credentials.username

Why secrets.compare_digest() is Secure:

  • Constant time - Always compares ALL characters
  • No early exit - Same time regardless of match position
  • Timing attacks prevented - No useful info leaked
16

🚨 Never Hardcode Secrets!

Bad: Found in Our Codebase

api-backend/config.py:15-20
# ⚠️ REAL EXAMPLE - DON'T DO THIS!
N1_API_KEY = os.getenv(
    "N1_API_KEY",
    "n1-devkey-Vl2R3he2isa9"  # ⚠️ Hardcoded!
)
PASSWORD = os.getenv(
    "N1_PASSWORD",
    "n1_secret+12"  # ⚠️ In git forever!
)

Why This is a Security Disaster:

  • Forever in git - Even if removed, still in history
  • No audit trail - Can't tell who used the key
  • Rotation nightmare - Need code changes + deploys
  • Prod = Dev - Same creds everywhere. Laptop breach = prod breach
  • Compliance failure - SOC 2, HIPAA, PCI-DSS forbid this

Good: Fail Fast on Missing Secrets

N1_API_KEY = os.getenv("N1_API_KEY")
if not N1_API_KEY:
    raise ValueError(
        "N1_API_KEY environment variable required"
    )

PASSWORD = os.getenv("N1_PASSWORD")
if not PASSWORD:
    raise RuntimeError(
        "N1_PASSWORD must be set"
    )

# In Kubernetes:
# - Use Secrets
# - Mount as environment variables
# - Rotate via kubectl (no code changes!)

Why This is Secure:

  • Fail-fast - Service won't start without creds
  • Zero secrets in code - Git history clean
  • Easy rotation - Update K8s Secret, pods restart
  • Environment isolation - Dev/staging/prod separate
17

🔌 PostgreSQL Connection Pooling

api-backend/database.py:1-84

Our Production PostgreSQL Setup

engine = create_engine(
    DATABASE_URL,
    pool_size=20,              # Max idle connections
    max_overflow=20,           # Max temporary connections
    pool_timeout=120,          # Wait max 2 min
    pool_recycle=1800,         # Recycle after 30 min
    pool_pre_ping=True         # Health check before using!
)

@contextmanager
def get_db_session(commit=True):
    session = SessionLocal()
    try:
        yield session
        if commit:
            session.commit()
    except Exception:
        session.rollback()  # Always rollback on error!
        raise
    finally:
        session.close()     # Always cleanup!

Why Each Setting Matters

  • pool_size=20 - Keep 20 idle connections. Saves 50-100ms per request
  • max_overflow=20 - Create 20 MORE under load (total 40)
  • pool_timeout=120 - Prevents infinite hangs during DB issues
  • pool_recycle=1800 - PostgreSQL closes idle connections. Recycling prevents errors
  • pool_pre_ping=True - THE MOST IMPORTANT! Tests connection before using
⚠️ Without pool_pre_ping: Connection sits idle 30 min → PostgreSQL closes it → App tries to use → CRASH. With pre_ping: Auto-reconnect!
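A short usage sketch for the context manager above (the query is illustrative):

from sqlalchemy import text

with get_db_session(commit=False) as session:
    rows = session.execute(text("SELECT 1")).fetchall()
# Commit, rollback, and close are handled by the context manager, even on exceptions.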
18

✅ Input Validation with Pydantic

api-backend/routes/celery_endpoints.py:18-54
from pydantic import BaseModel, Field

class ProcessRecordRequest(BaseModel):
    record_id: str = Field(..., description="Unique identifier")
    user_id: str = Field(..., description="User identifier")
    bucket_name: str = Field(default="no-bucket-provided")
    signed_url: str = Field(..., description="Signed URL for file")

@router.post("/process")
async def process_record(request: ProcessRecordRequest):
    # Pydantic already validated:
    # - All required fields present
    # - Correct types
    # - Extra fields ignored (set extra="forbid" to reject them)

    task_id = celery_app.send_task(
        'process_record',
        kwargs={
            'record_id': request.record_id,
            'user_id': request.user_id,
            'bucket_name': request.bucket_name,
        }
    )
    return {"task_id": task_id}
19

📁 File Upload Security

chr_evaluator/src/fastapi_app.py:64-80
def allowed_file(filename: str) -> bool:
    """Whitelist approach - only allow specific extensions."""
    return '.' in filename and filename.rsplit('.', 1)[1].lower() in {'pdf'}

@app.post("/api/upload")
async def upload_file(file: UploadFile = File(...)):
    if not file.filename:
        raise HTTPException(400, "No file selected")

    if not allowed_file(file.filename):
        raise HTTPException(400, "Invalid file type. Only PDF allowed.")

    # Sanitize filename - prevent directory traversal
    task_id = str(uuid.uuid4())
    filename = file.filename.replace(" ", "_")  # Remove spaces
    filename = filename.replace("/", "_")       # Prevent path traversal
    filename = filename.replace("\\", "_")      # Windows paths too

    file_path = os.path.join(UPLOAD_FOLDER, f"{task_id}_{filename}")
🔒 Security Checklist: Whitelist extensions, sanitize filenames, add unique prefix, limit file size!
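The checklist mentions limiting file size, which the snippet doesn't show; a hedged sketch that streams the upload and rejects anything over an assumed cap:

from fastapi import HTTPException, UploadFile

MAX_UPLOAD_BYTES = 20 * 1024 * 1024  # assumed cap - tune per product requirements

async def save_with_limit(file: UploadFile, dest_path: str) -> None:
    """Stream the upload to disk, rejecting anything over the size cap."""
    written = 0
    with open(dest_path, "wb") as out:
        while chunk := await file.read(1024 * 1024):  # read 1 MB at a time
            written += len(chunk)
            if written > MAX_UPLOAD_BYTES:
                raise HTTPException(413, "File too large")
            out.write(chunk)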
20

⏱️ Timeouts Everywhere - Your HA Lifeline

workflow-functional/cloud_utils/n1_api.py:32-150

Every External Call MUST Have a Timeout

import aiohttp

async def _make_async_request(self, endpoint: str):
    timeout = aiohttp.ClientTimeout(total=self.timeout)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        async with session.get(url, headers=self.headers) as response:
            return await response.json()

# Database timeouts
engine = create_engine(DATABASE_URL, pool_timeout=120)

# Celery task timeouts
@celery_app.task(soft_time_limit=300, time_limit=330)
def long_running_task():
    pass

💀 The Timeout Horror Story (Without Timeouts)

  • Minute 0: Downstream API slows down
  • Minute 2: Requests hang. No timeout → wait forever
  • Minute 5: All workers blocked. Can't serve new requests
  • Minute 10: Thread pool exhausted. 503 errors start
  • Minute 20: OOM killer starts terminating pods
  • Minute 30: Entire service down. Cascading failure

✅ With Proper Timeouts

  • Minute 0:30: Requests timeout after 30s. Workers freed
  • Minute 1: Error rate increases but service responsive
  • Minute 2: Retry logic + circuit breaker kicks in
  • Result: Only affected API fails, rest healthy ✨
21

🚦 Concurrency Control - Prevent Overload

parser-router/src/document_n1/custom_vlm_http_client.py:40-100

Limit Concurrent Requests

class CustomHttpVlmClient(VlmClient):
    def __init__(self, max_concurrency: int = 10):
        self.max_concurrency = max_concurrency
        self.semaphore = asyncio.Semaphore(max_concurrency)

    async def process_document(self, doc):
        async with self.semaphore:  # Acquire slot
            # Only 10 documents processed simultaneously
            return await self._call_vlm_api(doc)
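A hedged usage sketch (documents is a hypothetical list): all 500 calls are scheduled at once, but the semaphore inside process_document lets only 10 run concurrently:

client = CustomHttpVlmClient(max_concurrency=10)
# Inside an async function:
results = await asyncio.gather(*(client.process_document(doc) for doc in documents))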

💀 Without Concurrency Limits

  • 500 documents uploaded → 500 concurrent API calls
  • VLM service overwhelmed → 100% 429 rate limit errors
  • All requests retry → 1500 requests → VLM crashes
  • Your service: Out of memory from 500 concurrent ops

✅ With max_concurrency=10

  • 10 process immediately, 490 wait in queue
  • Respects VLM service capacity (10 req/sec sustainable)
  • Only 10 documents in memory at once. No OOM
  • System stays stable. Processes all 500 over time
22

🎯 Graceful Degradation - Multiple Fallbacks

workflow-functional/cloud_utils/n1_api.py:64-150

Try Multiple Strategies Before Giving Up

content_encoding = response.headers.get("Content-Encoding")
if content_encoding == "zstd":
    import zstandard as zstd
    dctx = zstd.ZstdDecompressor()
    try:
        # Method 1: Standard decompression
        content = dctx.decompress(content)
    except Exception as e1:
        logger.warning(f"Standard failed: {e1}")
        try:
            # Method 2: Streaming decompression
            reader = dctx.stream_reader(io.BytesIO(content))
            content = reader.read()
            logger.info("✅ Streaming succeeded")
        except Exception as e2:
            # Method 3: With max output size
            content = dctx.decompress(content, max_output_size=100*1024*1024)
            logger.info("✅ Max size succeeded")

Why This is HA Excellence

  • Three attempts - Different algorithms for edge cases
  • Progressive logging - Warnings on each fallback, info on recovery; the exception only surfaces if every method fails
  • Real impact - 95% → 99.9% success rate with fallbacks
  • Graceful failure - Only crashes if ALL methods fail
23

📝 Logging Best Practices

api-backend/api_error.py:1-43
async def _safe_error_context(request: Request, err: Exception) -> dict:
    """Extract safe context for logging - avoid PII/secrets."""
    ctx = {
        "method": request.method,
        "path": request.url.path,
        "query": dict(request.query_params),
        "client_ip": request.client.host,
        "user_agent": request.headers.get("user-agent"),
        "request_id": request_id_ctx.get(),
        "user_id": user_id_ctx.get(),
    }

    # ⚠️ NEVER log these:
    # - Authorization headers
    # - Cookies
    # - Passwords
    # - Full session IDs (truncate to first 8 chars)

    # Only include small JSON bodies
    ctype = request.headers.get("content-type", "")
    clen = int(request.headers.get("content-length") or 0)
    if "application/json" in ctype and clen <= 4096:
        body = await request.body()
        ctx["json_body_preview"] = body.decode(errors="ignore")

    if err:
        ctx["error_type"] = err.__class__.__name__
        ctx["error_msg"] = str(err)[:200]  # Truncate!

    return ctx
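A hedged companion sketch for the "never log these" list: redact sensitive headers before they reach the log context (the header names are common defaults, adjust to your stack):

from fastapi import Request

SENSITIVE_HEADERS = {"authorization", "cookie", "set-cookie", "x-api-key"}

def safe_headers(request: Request) -> dict[str, str]:
    """Return request headers with credentials redacted for logging."""
    return {
        name: "[REDACTED]" if name.lower() in SENSITIVE_HEADERS else value
        for name, value in request.headers.items()
    }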
24

❤️ Health Checks - Your HA Safety Net

Bad: Fake Health Check

@app.get("/health")
async def health():
    return {"status": "ok"}  # ⚠️ Lies!

# Meanwhile:
# - PostgreSQL connection pool exhausted
# - Redis connection dead
# - Out of memory
#
# Traffic sent to broken pod
# Users get 500 errors

Good: Real Health Check

@app.get("/health")  # Liveness
async def health_check():
    try:
        # Check PostgreSQL
        async with get_db() as db:
            await db.execute(text("SELECT 1"))

        # Check Redis
        redis_client.ping()

        return {"status": "healthy"}
    except Exception as e:
        logger.error(f"Health failed: {e}")
        raise HTTPException(503, {"status": "unhealthy"})

🎯 Kubernetes Probe Strategy

livenessProbe:   # Is pod alive? Fails → RESTART
  httpGet:
    path: /health
  periodSeconds: 10
  failureThreshold: 3

readinessProbe:  # Ready for traffic? Fails → REMOVE
  httpGet:
    path: /ready
  periodSeconds: 5
  failureThreshold: 2
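The readiness probe targets /ready, which isn't shown above; a hedged sketch that checks the same dependencies but tells Kubernetes "remove me from the load balancer" rather than "restart me":

@app.get("/ready")  # Readiness
async def readiness_check():
    try:
        async with get_db() as db:
            await db.execute(text("SELECT 1"))
        redis_client.ping()
        return {"status": "ready"}
    except Exception as e:
        logger.error(f"Readiness failed: {e}")
        raise HTTPException(503, {"status": "not ready"})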
🚀 Real Impact: Health checks reduced our production incidents by 70%. Pods auto-recover from memory leaks, database disconnects, and worker deadlocks!
25

🐳 Docker Multi-Stage Builds

parser-router/Dockerfile
# Stage 1: Builder (includes build tools)
FROM python:3.12-slim AS builder
WORKDIR /app
RUN apt-get update && apt-get install -y build-essential python3-dev
COPY pyproject.toml ./
COPY ./src ./src
RUN pip install --no-cache-dir -e .

# Stage 2: Runtime (minimal, no build tools)
FROM python:3.12-slim
WORKDIR /app
RUN apt-get update && apt-get install -y \
    poppler-utils libgl1 libglib2.0-0 \
    && rm -rf /var/lib/apt/lists/*  # Clean up!

# Copy ONLY runtime packages
COPY --from=builder /usr/local/lib/python3.12/site-packages /usr/local/lib/python3.12/site-packages
COPY ./src ./src

ENTRYPOINT ["celery", "-A", "src.worker", "worker"]
📦 Benefits: Smaller images (50-70% reduction), faster deploys, reduced attack surface!
26

🎯 Key Takeaways

📊 Datadog Monitoring

  • APM - Use ddtrace-run
  • Counter for events
  • Gauge for current state
  • Histogram for distributions
  • Track business metrics + costs
  • Graceful degradation if metrics fail

🚀 High Availability

  • Classify errors (4xx ≠ 5xx)
  • Exponential backoff always
  • Timeouts on EVERY call
  • Concurrency limits
  • Health checks test deps
  • Multiple fallback strategies

🔒 Security

  • Never hardcode secrets
  • secrets.compare_digest()
  • Validate with Pydantic
  • Sanitize file uploads
  • Specific exception handling
  • Log safely (no PII/secrets)

💡 Golden Rules

Design for Failure

Assume everything will break. Networks fail, databases go down, APIs timeout.

Instrument Everything

You can't fix what you can't see. Metrics, logs, traces.

Fail Securely

Don't leak secrets, don't expose internals, don't crash the whole service.

27

✅ Action Items for Your Services

🚨 Critical (This Sprint)

  • 🔍 Audit config files for hardcoded secrets → move to K8s Secrets
  • ⏱️ Add timeouts to ALL HTTP clients, DB connections, Celery tasks
  • 🔄 Review retry logic - ensure we're not retrying 4xx errors
  • ❤️ Fix health checks - must test PostgreSQL/Redis

⚡ High Priority (Next 2 Weeks)

  • 🚦 Add concurrency limits to external API clients
  • 📊 Implement Datadog metrics for retry counts, timeout errors
  • 🔐 Replace == with secrets.compare_digest() for auth
  • 🗄️ Enable pool_pre_ping=True on PostgreSQL pools

🎯 Strategic (Next Month)

  • 🔍 Enable Datadog APM - Add ddtrace-run to deployments
  • 📈 Add business metrics - cost per operation, success rates
  • 📝 Implement standardized error codes with retryability
  • 📊 Set up Datadog dashboards - SLIs, P95/P99 latencies
28

❓ Questions?

🐕 Let's discuss Datadog dashboards

🛡️ Share your defensive programming challenges

💬 Ask about specific implementation patterns

Thank you! Now let's build bulletproof services! 🚀