Rollout & Observability

Feature flags, monitoring strategy, and gradual deployment plan

Feature Flag Architecture

Personalization is controlled via a multi-level feature flag system that allows fine-grained control over rollout and instant kill switches.

Flag Hierarchy

L1: Global Kill Switch

PERSONALIZATION_ENABLED=true/false
Environment variable. Disables all personalization system-wide. Used for emergencies.

L2: Per-Tenant Flag

tenant_configs.personalization_enabled
Database configuration. Allows selective enablement per tenant. Can be toggled via CMS.

L3: Percentage Rollout

tenant_configs.rollout_percentage=0-100
Gradual rollout within a tenant. Hash user_id to determine eligibility. Allows A/B testing.

Flag Evaluation Logic

Note: The code below reflects the prototype implementation using synchronous in-memory data. In production, getTenantConfig() would be an async database call.

Helper Function: Consistent hashing for percentage-based rollout

const crypto = require('crypto');

function hashUserId(userId) {
  // Use SHA-256 for consistent, deterministic hashing
  const hash = crypto.createHash('sha256').update(userId).digest('hex');
  // Convert the first 8 hex characters to an integer and map to a percentile (0-99)
  return parseInt(hash.substring(0, 8), 16) % 100;
}

Main Decision Function: Multi-level feature flag evaluation

function shouldPersonalize(tenantId, userId, tenantConfig) {
  // L1: Global kill switch (PERSONALIZATION_ENABLED is a module-level boolean in the prototype)
  if (!PERSONALIZATION_ENABLED) {
    return { 
      enabled: false, 
      reason: 'Global personalization disabled' 
    };
  }
  
  // L2: Tenant-specific flag
  if (!tenantConfig.personalization_enabled) {
    return { 
      enabled: false, 
      reason: 'Personalization disabled for tenant' 
    };
  }
  
  // L3: Percentage rollout
  const userHash = hashUserId(userId);
  if (userHash >= tenantConfig.rollout_percentage) {
    return { 
      enabled: false, 
      reason: 'User not in rollout (percentile: ' + userHash + ')' 
    };
  }
  
  return { 
    enabled: true, 
    reason: 'All checks passed' 
  };
}
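
For illustration, a hypothetical prototype call site might look like the following (the tenantConfigs in-memory table and its values are illustrative, not the actual data):

// Prototype module-level flag and in-memory tenant configuration
const PERSONALIZATION_ENABLED = true;
const tenantConfigs = {
  tenant1: { personalization_enabled: true, rollout_percentage: 50 }
};

const decision = shouldPersonalize('tenant1', 'a3f7b2c9', tenantConfigs['tenant1']);
console.log(decision);
// => { enabled: true, reason: 'All checks passed' } when the user hashes below 50,
//    otherwise { enabled: false, reason: 'User not in rollout (percentile: ...)' }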

Prototype vs Production Implementation

Return Value:

  • Prototype: Returns object with { enabled, reason } for better debugging and metadata
  • Production: Could return simple boolean, or keep the object for observability

Tenant Config Retrieval:

  • Prototype: Synchronous in-memory object lookup (tenantConfigs[tenantId])
  • Production: Async database query (await getTenantConfig(tenantId)) with caching

Parameter Passing:

  • Prototype: Passes tenantConfig as parameter (already loaded)
  • Production: Could fetch config inside function or cache at request middleware level

Consistency Guarantee:

  • SHA-256 hashing ensures: Same user always gets same experience (deterministic)
  • Rollout is gradual: Changing from 50% → 75% adds users, never removes them
  • Distribution is uniform: Users evenly spread across percentiles
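
These properties can be sanity-checked with a small script against the hashUserId helper above (a sketch; assumes hashUserId is in scope):

// Determinism: the same user always maps to the same percentile
const p = hashUserId('user-42');
console.assert(p === hashUserId('user-42'), 'hash must be deterministic');

// Monotonic rollout: every user included at 50% is still included at 75%
const users = Array.from({ length: 1000 }, (_, i) => `user-${i}`);
const at50 = users.filter((u) => hashUserId(u) < 50);
const at75 = new Set(users.filter((u) => hashUserId(u) < 75));
console.assert(at50.every((u) => at75.has(u)), 'raising the percentage never removes users');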

Production Implementation Considerations

The prototype demonstrates the core logic with in-memory data. Production deployments require additional considerations for database access, caching, and error handling.

Async/Await Pattern

In production with database calls, the function would look like this:

async function shouldPersonalize(tenantId, userId) {
  // L1: Global kill switch (env var, still synchronous).
  // Env vars are strings, so compare explicitly; anything other than 'true' disables.
  if (process.env.PERSONALIZATION_ENABLED !== 'true') {
    return { enabled: false, reason: 'Global kill switch' };
  }
  
  // L2: Fetch tenant config from database (with caching)
  const tenantConfig = await getTenantConfig(tenantId);
  if (!tenantConfig?.personalization_enabled) {
    return { enabled: false, reason: 'Tenant disabled' };
  }
  
  // L3: Percentage rollout (same logic)
  const userHash = hashUserId(userId);
  if (userHash >= tenantConfig.rollout_percentage) {
    return { enabled: false, reason: 'Not in rollout' };
  }
  
  return { enabled: true, reason: 'Personalization enabled' };
}

Caching Strategy

  • Cache tenant configs in Redis (TTL: 5 minutes)
  • Fallback to database if cache miss
  • Invalidate cache when tenant settings change via CMS
  • Consider in-memory LRU cache (process-level) for hot tenants
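
A minimal sketch of this lookup path, assuming an ioredis client and a db.query helper (both illustrative names, not the actual implementation):

const Redis = require('ioredis');
const redis = new Redis(); // shared application Redis client (illustrative)

const TENANT_CONFIG_TTL_SECONDS = 300; // 5 minutes

async function getTenantConfig(tenantId) {
  const cacheKey = `tenant_config:${tenantId}`;

  // 1. Try Redis first
  const cached = await redis.get(cacheKey);
  if (cached) {
    return JSON.parse(cached);
  }

  // 2. Fall back to the database on a cache miss (db is an assumed query helper)
  const rows = await db.query(
    'SELECT personalization_enabled, rollout_percentage FROM tenant_configs WHERE tenant_id = $1',
    [tenantId]
  );
  const config = rows[0] || null;

  // 3. Populate the cache for subsequent requests
  if (config) {
    await redis.set(cacheKey, JSON.stringify(config), 'EX', TENANT_CONFIG_TTL_SECONDS);
  }
  return config;
}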

Error Handling

  • If tenant config fetch fails → default to enabled: false (fail safe)
  • Log all feature flag decisions for audit trail
  • Include request_id in logs for traceability
  • Monitor cache hit rates and database query latency
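
One way to express the fail-safe default and the decision audit trail is a thin wrapper around the production shouldPersonalize above (a sketch; the pino logger is one option, not a mandated choice):

const logger = require('pino')();

async function shouldPersonalizeSafe(tenantId, userId, requestId) {
  try {
    const decision = await shouldPersonalize(tenantId, userId);
    // Audit trail: log every flag decision with the request_id
    logger.info({ request_id: requestId, tenant_id: tenantId, ...decision }, 'feature_flag_decision');
    return decision;
  } catch (err) {
    // Fail safe: any error during config lookup disables personalization for this request
    logger.error({ request_id: requestId, tenant_id: tenantId, err: err.message }, 'feature_flag_error');
    return { enabled: false, reason: 'Config fetch failed (fail safe)' };
  }
}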

Config Refresh

  • Global kill switch: Read from environment, restart not required (use config service)
  • Tenant flags: Updated via CMS, propagated via Redis pub/sub
  • Rollout percentage: Gradual changes (e.g., 25% → 50% → 75% → 100%)
  • Maximum propagation delay: ~30 seconds across all instances
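
The Redis pub/sub propagation could look roughly like this (a sketch reusing the redis client from the caching sketch above; the channel name is illustrative):

// Each API instance subscribes to config-change notifications on startup
const subscriber = redis.duplicate();
subscriber.subscribe('tenant_config_changed');

subscriber.on('message', async (channel, message) => {
  const { tenantId } = JSON.parse(message);
  // Drop the cached entry so the next request re-reads fresh settings from the database
  await redis.del(`tenant_config:${tenantId}`);
});

// The CMS side publishes after saving a tenant's settings
function notifyTenantConfigChanged(tenantId) {
  return redis.publish('tenant_config_changed', JSON.stringify({ tenantId }));
}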

Gradual Rollout Plan

Phase 1: Internal Testing (Week 1)

  • Enable for 1-2 internal test tenants
  • rollout_percentage=100 (all users)
  • Monitor latency, error rates, cache performance
  • Validate data pipeline and signal accuracy
  • Goal: Identify critical bugs, performance issues

Phase 2: Beta Tenants (Week 2-3)

  • Enable for 5-10 volunteer beta tenants
  • rollout_percentage=50 (A/B test)
  • Compare engagement metrics: CTR, completion rate, session length
  • Collect qualitative feedback from tenant partners
  • Goal: Validate business value, gather UX feedback

Phase 3: Gradual Expansion (Week 4-6)

  • Enable for 25% of tenants (30 tenants)
  • rollout_percentage=100 for enabled tenants
  • Monitor system load, database performance
  • Scale infrastructure as needed (add Redis nodes, DB replicas)
  • Goal: Stress test at moderate scale

Phase 4: Full Rollout (Week 7-8)

  • Enable for all 120 tenants
  • rollout_percentage=100
  • Monitor closely for first 48 hours
  • Keep global kill switch ready
  • Goal: Achieve stable 3k RPS with p95 < 250ms

Kill Switch Procedure

When to Activate

  • p95 latency exceeds 225ms for 3 minutes (approaching 250ms non-negotiable)
  • OR p99 latency exceeds 550ms for 3 minutes (approaching 600ms non-negotiable)
  • Error rate exceeds 1% for any endpoint
  • Database or Redis cluster experiencing failures
  • Significant drop in engagement metrics (>20%)
  • Security incident requiring immediate isolation

Activation Steps

1. Set PERSONALIZATION_ENABLED=false in environment config
2. Restart API instances (or wait for config hot-reload ~30 seconds)
3. Verify all traffic routes to non-personalized feed endpoint
4. Investigate root cause while system operates in safe mode
5. Re-enable gradually starting with single test tenant

Fallback Behavior

When personalization is disabled, the system automatically falls back to editorial curation:

  • Videos sorted by editorial_boost DESC, created_at DESC
  • No user signals queried (reduces database load)
  • Cache TTL extended to 5 minutes (higher reuse)
  • Response time drops to ~30ms (vs ~150ms personalized)
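
In the feed handler, this fallback can be a simple branch on the flag decision (a sketch; getEditorialFeed and getPersonalizedFeed are illustrative names for the two ranking paths):

async function getFeed(tenantId, userId, requestId) {
  const decision = await shouldPersonalizeSafe(tenantId, userId, requestId);

  if (!decision.enabled) {
    // Editorial curation: editorial_boost DESC, created_at DESC, with a longer cache TTL
    return getEditorialFeed(tenantId, { cacheTtlSeconds: 300 });
  }

  return getPersonalizedFeed(tenantId, userId);
}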

Latency Threshold Design Rationale

Our latency thresholds are designed with a tiered approach that protects the non-negotiable requirements (p95 < 250ms, p99 < 600ms) through early detection and progressive intervention.

Design Philosophy

  • Non-negotiables are cliff edges — We must never breach p95 < 250ms or p99 < 600ms. These are hard limits, not targets.
  • Design with headroom — Normal operation targets are p95 < 150ms and p99 < 400ms, providing 100ms and 200ms safety margins respectively.
  • Progressive intervention — Alerts escalate from warnings to automatic mitigation to kill switch activation, all before reaching the non-negotiables.
  • Sustained violations only — All thresholds require sustained degradation (1-3 minutes) to avoid false positives from transient spikes.

Threshold Progression

Info: p95 > 150ms or p99 > 400ms

Above design targets but still safe. Dashboard tracking only. Margins: 100ms (p95), 200ms (p99).

Warning: p95 > 180ms or p99 > 450ms (1 min)

Degradation detected. Slack notification sent. Margins: 70ms (p95), 150ms (p99).

Critical: p95 > 200ms or p99 > 500ms (2 min)

Serious degradation. Page on-call immediately. Margins: 50ms (p95), 100ms (p99).

Auto-Mitigate: p95 > 200ms or p99 > 500ms (3 min, same thresholds as Critical)

Automatic rollout reduction by 50%. Reduces load to restore performance.

Kill Switch: p95 > 225ms or p99 > 550ms (3 min)

Catastrophic failure. Full personalization disable. Final safety net with 25ms (p95) and 50ms (p99) margins.

Why These Specific Values?

  • 150ms/400ms design targets: Provides comfortable operational headroom under normal load
  • 180ms/450ms warnings: Early detection while still ~25% below limits
  • 200ms/500ms critical: Clear signal that intervention is needed (~20% below limits)
  • 225ms/550ms kill switch: Last resort protection (~10% below limits)
  • Progressive timing: 1min → 2min → 3min allows for human response before automation
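
As a sketch, these tiers could be encoded as data for the alerting and auto-mitigation layer (values copied from the progression above; tier and action names are illustrative):

const LATENCY_THRESHOLDS = [
  { tier: 'info',     p95Ms: 150, p99Ms: 400, sustainedMin: 0, action: 'dashboard' },
  { tier: 'warning',  p95Ms: 180, p99Ms: 450, sustainedMin: 1, action: 'slack' },
  { tier: 'critical', p95Ms: 200, p99Ms: 500, sustainedMin: 2, action: 'page_oncall' },
  { tier: 'mitigate', p95Ms: 200, p99Ms: 500, sustainedMin: 3, action: 'reduce_rollout_50' },
  { tier: 'kill',     p95Ms: 225, p99Ms: 550, sustainedMin: 3, action: 'kill_switch' }
];

// Given current sustained p95/p99 values, find the most severe breached tier.
// (Tracking how long a breach has been sustained is left to the alerting system.)
function highestBreachedTier(p95, p99) {
  return [...LATENCY_THRESHOLDS].reverse().find((t) => p95 > t.p95Ms || p99 > t.p99Ms) || null;
}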

Key Metrics & Dashboards

Performance Metrics

Latency

  • Design targets: p95 < 150ms, p99 < 400ms
  • Non-negotiables: p95 < 250ms, p99 < 600ms
  • Warning: p95 > 180ms or p99 > 450ms (1 min)
  • Critical: p95 > 200ms or p99 > 500ms (2 min)
  • Auto-mitigate: Reduce rollout 50% (3 min)
  • Kill switch: p95 > 225ms or p99 > 550ms (3 min)

Throughput

  • Requests per second
  • Target: Support 3k RPS peak
  • Alert: RPS > 3.5k (capacity risk)

Error Rate

  • 5xx errors per minute
  • Target: < 0.1%
  • Alert: > 1% for 2 min
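
These can be recorded with any metrics client; a sketch using the prom-client library (metric and label names are illustrative):

const promClient = require('prom-client');

// Request latency histogram, bucketed around the design targets and non-negotiables
const feedLatency = new promClient.Histogram({
  name: 'feed_request_duration_ms',
  help: 'Feed endpoint latency in milliseconds',
  labelNames: ['tenant_id', 'personalized'],
  buckets: [50, 100, 150, 180, 200, 225, 250, 400, 600]
});

// 5xx counter for error-rate alerting
const feedErrors = new promClient.Counter({
  name: 'feed_request_errors_total',
  help: 'Feed endpoint 5xx responses',
  labelNames: ['tenant_id']
});

// In the request handler:
//   feedLatency.labels(tenantId, String(personalized)).observe(responseTimeMs);
//   feedErrors.labels(tenantId).inc();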

System Health Metrics

Cache Performance

  • Cache hit rate
  • Target: > 90%
  • Redis latency (p99)
  • Memory usage

Database Health

  • Query latency (p95)
  • Connection pool usage
  • Slow query log
  • Replication lag

Event Pipeline

  • Kafka lag per partition
  • Events processed/sec
  • Processing errors
  • Signal freshness

Business Metrics

Adoption

  • % tenants enabled
  • % users receiving personalized feeds
  • Feature usage trend

Engagement

  • Click-through rate (CTR)
  • Video completion rate
  • Session length
  • Personalized vs. non-personalized

Cold Start

  • New user engagement
  • Time to first personalized feed
  • Default recommendations CTR

Alerting Strategy

Critical

Page On-Call Immediately

  • p95 latency > 200ms or p99 > 500ms for 2 minutes
  • Error rate > 5% for 2 minutes
  • Redis cluster down or >90% memory usage
  • Database replication lag > 60 seconds
  • Event pipeline stopped (no events for 10 minutes)

Warning

Slack Notification

  • p95 latency > 180ms or p99 > 450ms for 1 minute
  • Cache hit rate < 80% for 10 minutes
  • Error rate > 1% for 5 minutes
  • Kafka lag > 10 minutes
  • Database connection pool > 80% used

Info

Dashboard Only

  • p95 latency > 150ms or p99 > 400ms (above design target)
  • Unusual traffic patterns (sudden 2x spike)
  • New slow queries detected
  • Feature flag changes

Logging & Debugging

Structured Logging

All logs are JSON-formatted with consistent fields for easy querying and analysis.

{
  "timestamp": "2026-01-12T10:30:45.123Z",
  "level": "info",
  "service": "feed-api",
  "request_id": "req_abc123",
  "user_id_hash": "a3f7b2c9...",
  "tenant_id": "tenant1",
  "endpoint": "/v1/feed",
  "response_time_ms": 142,
  "cache_hit": false,
  "personalized": true,
  "video_count": 20,
  "ranking_time_ms": 45
}

Tracing

Distributed tracing with unique request_id propagated through all services:

  • API Gateway → Feed Service → Cache → Database
  • Identify bottlenecks in request pipeline
  • Debug timeouts and cascading failures
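
A minimal request_id propagation sketch for an Express-style service (assumes Node 18+ for the global fetch; FEED_SERVICE_URL and the middleware shape are illustrative, and a real deployment would typically lean on OpenTelemetry or similar):

const { randomUUID } = require('crypto');

const FEED_SERVICE_URL = process.env.FEED_SERVICE_URL; // illustrative downstream service URL

// Attach a request_id to every incoming request, reusing one from upstream if present
function requestIdMiddleware(req, res, next) {
  req.requestId = req.headers['x-request-id'] || `req_${randomUUID()}`;
  res.setHeader('x-request-id', req.requestId);
  next();
}

// Downstream calls forward the same id so traces can be stitched together
async function fetchFromFeedService(req, path) {
  return fetch(`${FEED_SERVICE_URL}${path}`, {
    headers: { 'x-request-id': req.requestId }
  });
}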

Debug Mode

Optional query parameter for detailed debugging information:

GET /v1/feed?user_id=a3f7b2c9&tenant_id=tenant1&debug=true

Response includes:
- Detailed scoring breakdown per video
- Cache hit/miss per data source
- Query execution times
- Applied filters and weights
- Candidate pool size before/after ranking
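
A hypothetical debug payload shape, purely to illustrate the fields listed above (field names and values are illustrative, not the actual contract):

{
  "videos": ["..."],
  "debug": {
    "candidate_pool": { "before_ranking": 500, "after_ranking": 20 },
    "cache": { "tenant_config": "hit", "user_signals": "miss" },
    "query_times_ms": { "candidates": 18, "user_signals": 12 },
    "applied_weights": { "engagement": 0.5, "recency": 0.3, "editorial_boost": 0.2 },
    "scoring": [
      { "video_id": "vid_001", "score": 0.87, "components": { "engagement": 0.9, "recency": 0.8, "editorial_boost": 0.5 } }
    ]
  }
}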