Rollout & Observability

Feature flags, monitoring strategy, and gradual deployment plan

Feature Flag Architecture

Personalization is controlled via a multi-level feature flag system that allows fine-grained control over rollout and instant kill switches.

Flag Hierarchy

L1: Global Kill Switch

PERSONALIZATION_ENABLED=true/false
Environment variable. Disables all personalization system-wide. Used for emergencies.

L2: Per-Tenant Flag

tenant_configs.personalization_enabled
Database configuration. Allows selective enablement per tenant. Can be toggled via CMS.

L3: Percentage Rollout

tenant_configs.rollout_percentage=0-100
Gradual rollout within a tenant. Hash user_id to determine eligibility. Allows A/B testing.

Flag Evaluation Logic

Note: The code below reflects the prototype implementation using synchronous in-memory data. In production, getTenantConfig() would be an async database call.

Helper Function: Consistent hashing for percentage-based rollout

const crypto = require('crypto');

function hashUserId(userId) {
  // Use SHA-256 for consistent, deterministic hashing
  const hash = crypto.createHash('sha256').update(userId).digest('hex');
  // Convert the first 8 hex characters to an integer and map to a percentile (0-99)
  return parseInt(hash.substring(0, 8), 16) % 100;
}

Main Decision Function: Multi-level feature flag evaluation

function shouldPersonalize(tenantId, userId, tenantConfig) {
  // L1: Global kill switch (PERSONALIZATION_ENABLED is a module-level boolean in the prototype)
  if (!PERSONALIZATION_ENABLED) {
    return { 
      enabled: false, 
      reason: 'Global personalization disabled' 
    };
  }
  
  // L2: Tenant-specific flag
  if (!tenantConfig.personalization_enabled) {
    return { 
      enabled: false, 
      reason: 'Personalization disabled for tenant' 
    };
  }
  
  // L3: Percentage rollout
  const userHash = hashUserId(userId);
  if (userHash >= tenantConfig.rollout_percentage) {
    return { 
      enabled: false, 
      reason: 'User not in rollout (percentile: ' + userHash + ')' 
    };
  }
  
  return { 
    enabled: true, 
    reason: 'All checks passed' 
  };
}
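
For illustration, a hypothetical prototype call site might look like the following (the tenantConfigs in-memory table and its values are illustrative, not the actual data):

// Prototype module-level flag and in-memory tenant configuration
const PERSONALIZATION_ENABLED = true;
const tenantConfigs = {
  tenant1: { personalization_enabled: true, rollout_percentage: 50 }
};

const decision = shouldPersonalize('tenant1', 'a3f7b2c9', tenantConfigs['tenant1']);
console.log(decision);
// => { enabled: true, reason: 'All checks passed' } when the user hashes below 50,
//    otherwise { enabled: false, reason: 'User not in rollout (percentile: ...)' }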

Prototype vs Production Implementation

Return Value:

  • Prototype: Returns object with { enabled, reason } for better debugging and metadata
  • Production: Could return simple boolean, or keep the object for observability

Tenant Config Retrieval:

  • Prototype: Synchronous in-memory object lookup (tenantConfigs[tenantId])
  • Production: Async database query (await getTenantConfig(tenantId)) with caching

Parameter Passing:

  • Prototype: Passes tenantConfig as parameter (already loaded)
  • Production: Could fetch config inside function or cache at request middleware level

Consistency Guarantee:

  • SHA-256 hashing ensures: Same user always gets same experience (deterministic)
  • Rollout is gradual: Changing from 50% → 75% adds users, never removes them
  • Distribution is uniform: Users evenly spread across percentiles
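
These properties can be sanity-checked with a small script against the hashUserId helper above (a sketch; assumes hashUserId is in scope):

// Determinism: the same user always maps to the same percentile
const p = hashUserId('user-42');
console.assert(p === hashUserId('user-42'), 'hash must be deterministic');

// Monotonic rollout: every user included at 50% is still included at 75%
const users = Array.from({ length: 1000 }, (_, i) => `user-${i}`);
const at50 = users.filter((u) => hashUserId(u) < 50);
const at75 = new Set(users.filter((u) => hashUserId(u) < 75));
console.assert(at50.every((u) => at75.has(u)), 'raising the percentage never removes users');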

Production Implementation Considerations

The prototype demonstrates the core logic with in-memory data. Production deployments require additional considerations for database access, caching, and error handling.

Async/Await Pattern

In production with database calls, the function would look like this:

async function shouldPersonalize(tenantId, userId) {
  // L1: Global kill switch (env var, still synchronous).
  // Env vars are strings, so compare explicitly; anything other than 'true' disables.
  if (process.env.PERSONALIZATION_ENABLED !== 'true') {
    return { enabled: false, reason: 'Global kill switch' };
  }
  
  // L2: Fetch tenant config from database (with caching)
  const tenantConfig = await getTenantConfig(tenantId);
  if (!tenantConfig?.personalization_enabled) {
    return { enabled: false, reason: 'Tenant disabled' };
  }
  
  // L3: Percentage rollout (same logic)
  const userHash = hashUserId(userId);
  if (userHash >= tenantConfig.rollout_percentage) {
    return { enabled: false, reason: 'Not in rollout' };
  }
  
  return { enabled: true, reason: 'Personalization enabled' };
}

Caching Strategy

  • Cache tenant configs in Redis (TTL: 5 minutes)
  • Fallback to database if cache miss
  • Invalidate cache when tenant settings change via CMS
  • Consider in-memory LRU cache (process-level) for hot tenants
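
A minimal sketch of this lookup path, assuming an ioredis client and a db.query helper (both illustrative names, not the actual implementation):

const Redis = require('ioredis');
const redis = new Redis(); // shared application Redis client (illustrative)

const TENANT_CONFIG_TTL_SECONDS = 300; // 5 minutes

async function getTenantConfig(tenantId) {
  const cacheKey = `tenant_config:${tenantId}`;

  // 1. Try Redis first
  const cached = await redis.get(cacheKey);
  if (cached) {
    return JSON.parse(cached);
  }

  // 2. Fall back to the database on a cache miss (db is an assumed query helper)
  const rows = await db.query(
    'SELECT personalization_enabled, rollout_percentage FROM tenant_configs WHERE tenant_id = $1',
    [tenantId]
  );
  const config = rows[0] || null;

  // 3. Populate the cache for subsequent requests
  if (config) {
    await redis.set(cacheKey, JSON.stringify(config), 'EX', TENANT_CONFIG_TTL_SECONDS);
  }
  return config;
}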

Error Handling

  • If tenant config fetch fails → default to enabled: false (fail safe)
  • Log all feature flag decisions for audit trail
  • Include request_id in logs for traceability
  • Monitor cache hit rates and database query latency
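
One way to express the fail-safe default and the decision audit trail is a thin wrapper around the production shouldPersonalize above (a sketch; the pino logger is one option, not a mandated choice):

const logger = require('pino')();

async function shouldPersonalizeSafe(tenantId, userId, requestId) {
  try {
    const decision = await shouldPersonalize(tenantId, userId);
    // Audit trail: log every flag decision with the request_id
    logger.info({ request_id: requestId, tenant_id: tenantId, ...decision }, 'feature_flag_decision');
    return decision;
  } catch (err) {
    // Fail safe: any error during config lookup disables personalization for this request
    logger.error({ request_id: requestId, tenant_id: tenantId, err: err.message }, 'feature_flag_error');
    return { enabled: false, reason: 'Config fetch failed (fail safe)' };
  }
}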

Config Refresh

  • Global kill switch: Read from environment, restart not required (use config service)
  • Tenant flags: Updated via CMS, propagated via Redis pub/sub
  • Rollout percentage: Gradual changes (e.g., 25% → 50% → 75% → 100%)
  • Maximum propagation delay: ~30 seconds across all instances
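
The Redis pub/sub propagation could look roughly like this (a sketch reusing the redis client from the caching sketch above; the channel name is illustrative):

// Each API instance subscribes to config-change notifications on startup
const subscriber = redis.duplicate();
subscriber.subscribe('tenant_config_changed');

subscriber.on('message', async (channel, message) => {
  const { tenantId } = JSON.parse(message);
  // Drop the cached entry so the next request re-reads fresh settings from the database
  await redis.del(`tenant_config:${tenantId}`);
});

// The CMS side publishes after saving a tenant's settings
function notifyTenantConfigChanged(tenantId) {
  return redis.publish('tenant_config_changed', JSON.stringify({ tenantId }));
}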

Gradual Rollout Plan

Phase 1: Internal Testing (Week 1)

  • Enable for 1-2 internal test tenants
  • rollout_percentage=100 (all users)
  • Monitor latency, error rates, cache performance
  • Validate data pipeline and signal accuracy
  • Goal: Identify critical bugs, performance issues

Phase 2: Beta Tenants (Week 2-3)

  • Enable for 5-10 volunteer beta tenants
  • rollout_percentage=50 (A/B test)
  • Compare engagement metrics: CTR, completion rate, session length
  • Collect qualitative feedback from tenant partners
  • Goal: Validate business value, gather UX feedback

Phase 3: Gradual Expansion (Week 4-6)

  • Enable for 25% of tenants (30 tenants)
  • rollout_percentage=100 for enabled tenants
  • Monitor system load, database performance
  • Scale infrastructure as needed (add Redis nodes, DB replicas)
  • Goal: Stress test at moderate scale

Phase 4: Full Rollout (Week 7-8)

  • Enable for all 120 tenants
  • rollout_percentage=100
  • Monitor closely for first 48 hours
  • Keep global kill switch ready
  • Goal: Achieve stable 3k RPS with p95 < 250ms

Kill Switch Procedure

When to Activate

  • p95 latency exceeds 225ms for 3 minutes (approaching 250ms non-negotiable)
  • OR p99 latency exceeds 550ms for 3 minutes (approaching 600ms non-negotiable)
  • Error rate exceeds 1% for any endpoint
  • Database or Redis cluster experiencing failures
  • Significant drop in engagement metrics (>20%)
  • Security incident requiring immediate isolation

Activation Steps

1. Set PERSONALIZATION_ENABLED=false in environment config
2. Restart API instances (or wait for config hot-reload ~30 seconds)
3. Verify all traffic routes to non-personalized feed endpoint
4. Investigate root cause while system operates in safe mode
5. Re-enable gradually starting with single test tenant

Fallback Behavior

When personalization is disabled, the system automatically falls back to editorial curation:

  • Videos sorted by editorial_boost DESC, created_at DESC
  • No user signals queried (reduces database load)
  • Cache TTL extended to 5 minutes (higher reuse)
  • Response time drops to ~30ms (vs ~150ms personalized)
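
In the feed handler, this fallback can be a simple branch on the flag decision (a sketch; getEditorialFeed and getPersonalizedFeed are illustrative names for the two ranking paths):

async function getFeed(tenantId, userId, requestId) {
  const decision = await shouldPersonalizeSafe(tenantId, userId, requestId);

  if (!decision.enabled) {
    // Editorial curation: editorial_boost DESC, created_at DESC, with a longer cache TTL
    return getEditorialFeed(tenantId, { cacheTtlSeconds: 300 });
  }

  return getPersonalizedFeed(tenantId, userId);
}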

Latency Threshold Design Rationale

Our latency thresholds are designed with a tiered approach that protects the non-negotiable requirements (p95 < 250ms, p99 < 600ms) through early detection and progressive intervention.

Design Philosophy

  • Non-negotiables are cliff edges — We must never breach p95 < 250ms or p99 < 600ms. These are hard limits, not targets.
  • Design with headroom — Normal operation targets are p95 < 150ms and p99 < 400ms, providing 100ms and 200ms safety margins respectively.
  • Progressive intervention — Alerts escalate from warnings to automatic mitigation to kill switch activation, all before reaching the non-negotiables.
  • Sustained violations only — All thresholds require sustained degradation (1-3 minutes) to avoid false positives from transient spikes.

Threshold Progression

Info: p95 > 150ms or p99 > 400ms

Above design targets but still safe. Dashboard tracking only. Margins: 100ms (p95), 200ms (p99).

Warning: p95 > 180ms or p99 > 450ms (1 min)

Degradation detected. Slack notification sent. Margins: 70ms (p95), 150ms (p99).

Critical: p95 > 200ms or p99 > 500ms (2 min)

Serious degradation. Page on-call immediately. Margins: 50ms (p95), 100ms (p99).

Auto-Mitigate: p95 > 200ms or p99 > 500ms (3 min, same thresholds as Critical)

Automatic rollout reduction by 50%. Reduces load to restore performance.

Kill Switch: p95 > 225ms or p99 > 550ms (3 min)

Catastrophic failure. Full personalization disable. Final safety net with 25ms (p95) and 50ms (p99) margins.

Why These Specific Values?

  • 150ms/400ms design targets: Provides comfortable operational headroom under normal load
  • 180ms/450ms warnings: Early detection while still ~25% below limits
  • 200ms/500ms critical: Clear signal that intervention is needed (~20% below limits)
  • 225ms/550ms kill switch: Last resort protection (~10% below limits)
  • Progressive timing: 1min → 2min → 3min allows for human response before automation
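
As a sketch, these tiers could be encoded as data for the alerting and auto-mitigation layer (values copied from the progression above; tier and action names are illustrative):

const LATENCY_THRESHOLDS = [
  { tier: 'info',     p95Ms: 150, p99Ms: 400, sustainedMin: 0, action: 'dashboard' },
  { tier: 'warning',  p95Ms: 180, p99Ms: 450, sustainedMin: 1, action: 'slack' },
  { tier: 'critical', p95Ms: 200, p99Ms: 500, sustainedMin: 2, action: 'page_oncall' },
  { tier: 'mitigate', p95Ms: 200, p99Ms: 500, sustainedMin: 3, action: 'reduce_rollout_50' },
  { tier: 'kill',     p95Ms: 225, p99Ms: 550, sustainedMin: 3, action: 'kill_switch' }
];

// Given current sustained p95/p99 values, find the most severe breached tier.
// (Tracking how long a breach has been sustained is left to the alerting system.)
function highestBreachedTier(p95, p99) {
  return [...LATENCY_THRESHOLDS].reverse().find((t) => p95 > t.p95Ms || p99 > t.p99Ms) || null;
}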

Key Metrics & Dashboards

Performance Metrics

Latency

  • Design targets: p95 < 150ms, p99 < 400ms
  • Non-negotiables: p95 < 250ms, p99 < 600ms
  • Warning: p95 > 180ms or p99 > 450ms (1 min)
  • Critical: p95 > 200ms or p99 > 500ms (2 min)
  • Auto-mitigate: Reduce rollout 50% (3 min)
  • Kill switch: p95 > 225ms or p99 > 550ms (3 min)

Throughput

  • Requests per second
  • Target: Support 3k RPS peak
  • Alert: RPS > 3.5k (capacity risk)

Error Rate

  • 5xx errors per minute
  • Target: < 0.1%
  • Alert: > 1% for 2 min
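
These can be recorded with any metrics client; a sketch using the prom-client library (metric and label names are illustrative):

const promClient = require('prom-client');

// Request latency histogram, bucketed around the design targets and non-negotiables
const feedLatency = new promClient.Histogram({
  name: 'feed_request_duration_ms',
  help: 'Feed endpoint latency in milliseconds',
  labelNames: ['tenant_id', 'personalized'],
  buckets: [50, 100, 150, 180, 200, 225, 250, 400, 600]
});

// 5xx counter for error-rate alerting
const feedErrors = new promClient.Counter({
  name: 'feed_request_errors_total',
  help: 'Feed endpoint 5xx responses',
  labelNames: ['tenant_id']
});

// In the request handler:
//   feedLatency.labels(tenantId, String(personalized)).observe(responseTimeMs);
//   feedErrors.labels(tenantId).inc();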

System Health Metrics

Cache Performance

  • Cache hit rate
  • Target: > 90%
  • Redis latency (p99)
  • Memory usage

Database Health

  • Query latency (p95)
  • Connection pool usage
  • Slow query log
  • Replication lag

Event Pipeline

  • Kafka lag per partition
  • Events processed/sec
  • Processing errors
  • Signal freshness

Business Metrics

Adoption

  • % tenants enabled
  • % users receiving personalized feeds
  • Feature usage trend

Engagement

  • Click-through rate (CTR)
  • Video completion rate
  • Session length
  • Personalized vs. non-personalized

Cold Start

  • New user engagement
  • Time to first personalized feed
  • Default recommendations CTR

Alerting Strategy

Critical

Page On-Call Immediately

  • p95 latency > 200ms or p99 > 500ms for 2 minutes
  • Error rate > 5% for 2 minutes
  • Redis cluster down or >90% memory usage
  • Database replication lag > 60 seconds
  • Event pipeline stopped (no events for 10 minutes)

Warning

Slack Notification

  • p95 latency > 180ms or p99 > 450ms for 1 minute
  • Cache hit rate < 80% for 10 minutes
  • Error rate > 1% for 5 minutes
  • Kafka lag > 10 minutes
  • Database connection pool > 80% used

Info

Dashboard Only

  • p95 latency > 150ms or p99 > 400ms (above design target)
  • Unusual traffic patterns (sudden 2x spike)
  • New slow queries detected
  • Feature flag changes

Logging & Debugging

Structured Logging

All logs are JSON-formatted with consistent fields for easy querying and analysis.

{
  "timestamp": "2026-01-12T10:30:45.123Z",
  "level": "info",
  "service": "feed-api",
  "request_id": "req_abc123",
  "user_id_hash": "a3f7b2c9...",
  "tenant_id": "tenant1",
  "endpoint": "/v1/feed",
  "response_time_ms": 142,
  "cache_hit": false,
  "personalized": true,
  "video_count": 20,
  "ranking_time_ms": 45
}

Tracing

Distributed tracing with unique request_id propagated through all services:

  • API Gateway → Feed Service → Cache → Database
  • Identify bottlenecks in request pipeline
  • Debug timeouts and cascading failures
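
A minimal request_id propagation sketch for an Express-style service (assumes Node 18+ for the global fetch; FEED_SERVICE_URL and the middleware shape are illustrative, and a real deployment would typically lean on OpenTelemetry or similar):

const { randomUUID } = require('crypto');

const FEED_SERVICE_URL = process.env.FEED_SERVICE_URL; // illustrative downstream service URL

// Attach a request_id to every incoming request, reusing one from upstream if present
function requestIdMiddleware(req, res, next) {
  req.requestId = req.headers['x-request-id'] || `req_${randomUUID()}`;
  res.setHeader('x-request-id', req.requestId);
  next();
}

// Downstream calls forward the same id so traces can be stitched together
async function fetchFromFeedService(req, path) {
  return fetch(`${FEED_SERVICE_URL}${path}`, {
    headers: { 'x-request-id': req.requestId }
  });
}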

Debug Mode

Optional query parameter for detailed debugging information:

GET /v1/feed?user_id=a3f7b2c9&tenant_id=tenant1&debug=true

Response includes:
- Detailed scoring breakdown per video
- Cache hit/miss per data source
- Query execution times
- Applied filters and weights
- Candidate pool size before/after ranking
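
A hypothetical debug payload shape, purely to illustrate the fields listed above (field names and values are illustrative, not the actual contract):

{
  "videos": ["..."],
  "debug": {
    "candidate_pool": { "before_ranking": 500, "after_ranking": 20 },
    "cache": { "tenant_config": "hit", "user_signals": "miss" },
    "query_times_ms": { "candidates": 18, "user_signals": 12 },
    "applied_weights": { "engagement": 0.5, "recency": 0.3, "editorial_boost": 0.2 },
    "scoring": [
      { "video_id": "vid_001", "score": 0.87, "components": { "engagement": 0.9, "recency": 0.8, "editorial_boost": 0.5 } }
    ]
  }
}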