Rollout & Observability
Feature flags, monitoring strategy, and gradual deployment plan
Feature Flag Architecture
Personalization is controlled via a multi-level feature flag system that allows fine-grained control over rollout and instant kill switches.
Flag Hierarchy
Global Kill Switch
PERSONALIZATION_ENABLED=true/false
Environment variable. Disables all personalization system-wide. Used for emergencies.
Per-Tenant Flag
tenant_configs.personalization_enabled
Database configuration. Allows selective enablement per tenant. Can be toggled via CMS.
Percentage Rollout
tenant_configs.rollout_percentage=0-100
Gradual rollout within a tenant. Hash user_id to determine eligibility. Allows A/B testing.
Flag Evaluation Logic
Note: The code below reflects the prototype implementation using synchronous in-memory data. In production, getTenantConfig() would be an async database call.
Helper Function: Consistent hashing for percentage-based rollout
const crypto = require('crypto'); // Node.js built-in module, required for hashing

function hashUserId(userId) {
  // Use SHA-256 for consistent, deterministic hashing
  const hash = crypto.createHash('sha256').update(userId).digest('hex');
  // Convert the first 8 hex characters to an integer and take the percentile (0-99)
  return parseInt(hash.substring(0, 8), 16) % 100;
}
Main Decision Function: Multi-level feature flag evaluation
function shouldPersonalize(tenantId, userId, tenantConfig) {
  // L1: Global kill switch
  if (!PERSONALIZATION_ENABLED) {
    return {
      enabled: false,
      reason: 'Global personalization disabled'
    };
  }

  // L2: Tenant-specific flag
  if (!tenantConfig.personalization_enabled) {
    return {
      enabled: false,
      reason: 'Personalization disabled for tenant'
    };
  }

  // L3: Percentage rollout
  const userHash = hashUserId(userId);
  if (userHash >= tenantConfig.rollout_percentage) {
    return {
      enabled: false,
      reason: 'User not in rollout (percentile: ' + userHash + ')'
    };
  }

  return {
    enabled: true,
    reason: 'All checks passed'
  };
}
Prototype vs Production Implementation
Return Value:
- • Prototype: Returns an object with { enabled, reason } for better debugging and metadata
- • Production: Could return a simple boolean, or keep the object for observability
Tenant Config Retrieval:
- • Prototype: Synchronous in-memory object lookup (tenantConfigs[tenantId])
- • Production: Async database query (await getTenantConfig(tenantId)) with caching
Parameter Passing:
- • Prototype: Passes tenantConfig as a parameter (already loaded)
- • Production: Could fetch config inside the function or cache it at request middleware level
Consistency Guarantee:
- • SHA-256 hashing ensures: Same user always gets same experience (deterministic)
- • Rollout is gradual: Changing from 50% → 75% adds users, never removes them
- • Distribution is uniform: Users evenly spread across percentiles
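These properties are easy to sanity-check with a small simulation against the hashUserId helper above. The following sketch (plain Node.js, hypothetical user IDs) confirms that raising rollout_percentage from 50 to 75 only ever adds users:
const crypto = require('crypto');

// Same helper as the prototype above
function hashUserId(userId) {
  const hash = crypto.createHash('sha256').update(userId).digest('hex');
  return parseInt(hash.substring(0, 8), 16) % 100;
}

// Simulate 10,000 hypothetical users at two rollout percentages
const users = Array.from({ length: 10000 }, (_, i) => `user_${i}`);
const at50 = new Set(users.filter((u) => hashUserId(u) < 50));
const at75 = new Set(users.filter((u) => hashUserId(u) < 75));

// Every user included at 50% is still included at 75% (monotonic rollout)
const removed = [...at50].filter((u) => !at75.has(u));
console.log(`at 50%: ${at50.size}, at 75%: ${at75.size}, removed by increase: ${removed.length}`); // removed: 0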
Production Implementation Considerations
The prototype demonstrates the core logic with in-memory data. Production deployments require additional considerations for database access, caching, and error handling.
Async/Await Pattern
In production with database calls, the function signature would be:
async function shouldPersonalize(tenantId, userId) {
  // L1: Global kill switch (env var, still synchronous).
  // Environment variables are strings, so compare explicitly rather than relying
  // on truthiness ("false" would otherwise be truthy).
  if (process.env.PERSONALIZATION_ENABLED !== 'true') {
    return { enabled: false, reason: 'Global kill switch' };
  }

  // L2: Fetch tenant config from database (with caching)
  const tenantConfig = await getTenantConfig(tenantId);
  if (!tenantConfig?.personalization_enabled) {
    return { enabled: false, reason: 'Tenant disabled' };
  }

  // L3: Percentage rollout (same logic)
  const userHash = hashUserId(userId);
  if (userHash >= tenantConfig.rollout_percentage) {
    return { enabled: false, reason: 'Not in rollout' };
  }

  return { enabled: true, reason: 'Personalization enabled' };
}
Caching Strategy
- • Cache tenant configs in Redis (TTL: 5 minutes)
- • Fallback to database if cache miss
- • Invalidate cache when tenant settings change via CMS
- • Consider in-memory LRU cache (process-level) for hot tenants
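As an illustration of how getTenantConfig could combine these layers, here is a minimal sketch assuming an ioredis client; queryTenantConfigFromDb is a hypothetical database helper, not an existing function:
const Redis = require('ioredis');          // assumed Redis client
const redis = new Redis(process.env.REDIS_URL);

const TENANT_CONFIG_TTL_SECONDS = 300;     // 5-minute TTL from the bullets above

async function getTenantConfig(tenantId) {
  const cacheKey = `tenant_config:${tenantId}`;

  // 1. Try Redis first
  const cached = await redis.get(cacheKey);
  if (cached) {
    return JSON.parse(cached);
  }

  // 2. Cache miss: fall back to the database (queryTenantConfigFromDb is hypothetical)
  const config = await queryTenantConfigFromDb(tenantId);

  // 3. Repopulate the cache for subsequent requests
  if (config) {
    await redis.set(cacheKey, JSON.stringify(config), 'EX', TENANT_CONFIG_TTL_SECONDS);
  }
  return config;
}
A process-level LRU for hot tenants would sit in front of the Redis lookup and be invalidated via the pub/sub path described under Config Refresh.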
Error Handling
- • If tenant config fetch fails → default to enabled: false (fail safe)
- • Log all feature flag decisions for audit trail
- • Include request_id in logs for traceability
- • Monitor cache hit rates and database query latency
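One way to express the fail-safe default and decision logging is to wrap the async shouldPersonalize from above; the pino-style logger and field names here are assumptions:
async function shouldPersonalizeSafe(tenantId, userId, requestId) {
  try {
    const decision = await shouldPersonalize(tenantId, userId);
    // Audit trail: log every flag decision, tagged with the request_id
    logger.info({ request_id: requestId, tenant_id: tenantId, ...decision }, 'feature_flag_decision');
    return decision;
  } catch (err) {
    // Fail safe: if the tenant config fetch (or anything else) fails, disable personalization
    logger.error({ request_id: requestId, tenant_id: tenantId, error: err.message }, 'feature_flag_error');
    return { enabled: false, reason: 'Config fetch failed (fail safe)' };
  }
}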
Config Refresh
- • Global kill switch: Read from environment, restart not required (use config service)
- • Tenant flags: Updated via CMS, propagated via Redis pub/sub
- • Rollout percentage: Gradual changes (e.g., 25% → 50% → 75% → 100%)
- • Maximum propagation delay: ~30 seconds across all instances
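A minimal sketch of the pub/sub propagation path, assuming ioredis and the tenant_config:&lt;id&gt; cache key from the caching sketch above (the channel name is illustrative):
const Redis = require('ioredis');

const redis = new Redis(process.env.REDIS_URL);      // shared cache connection
const subscriber = new Redis(process.env.REDIS_URL); // dedicated connection for pub/sub

// CMS side: announce that a tenant's settings changed
async function publishTenantConfigChange(tenantId) {
  await redis.publish('tenant_config_changed', tenantId);
}

// API instance side: drop the cached config so the next request re-reads the database.
// Any process-level LRU entry for the tenant would also be cleared here.
subscriber.subscribe('tenant_config_changed');
subscriber.on('message', async (channel, tenantId) => {
  if (channel === 'tenant_config_changed') {
    await redis.del(`tenant_config:${tenantId}`);
  }
});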
Gradual Rollout Plan
Phase 1: Internal Testing (Week 1)
- • Enable for 1-2 internal test tenants
- • rollout_percentage=100 (all users)
- • Monitor latency, error rates, cache performance
- • Validate data pipeline and signal accuracy
- • Goal: Identify critical bugs, performance issues
Phase 2: Beta Tenants (Weeks 2-3)
- • Enable for 5-10 volunteer beta tenants
- • rollout_percentage=50 (A/B test)
- • Compare engagement metrics: CTR, completion rate, session length
- • Collect qualitative feedback from tenant partners
- • Goal: Validate business value, gather UX feedback
Phase 3: Gradual Expansion (Weeks 4-6)
- • Enable for 25% of tenants (30 tenants)
- • rollout_percentage=100 for enabled tenants
- • Monitor system load, database performance
- • Scale infrastructure as needed (add Redis nodes, DB replicas)
- • Goal: Stress test at moderate scale
Phase 4: Full Rollout (Weeks 7-8)
- • Enable for all 120 tenants
- • rollout_percentage=100
- • Monitor closely for first 48 hours
- • Keep global kill switch ready
- • Goal: Achieve stable 3k RPS with p95 < 250ms
Kill Switch Procedure
When to Activate
- • p95 latency exceeds 225ms for 3 minutes (approaching the 250ms non-negotiable limit)
- • OR p99 latency exceeds 550ms for 3 minutes (approaching the 600ms non-negotiable limit)
- • Error rate exceeds 1% for any endpoint
- • Database or Redis cluster experiencing failures
- • Significant drop in engagement metrics (>20%)
- • Security incident requiring immediate isolation
Activation Steps
1. Set PERSONALIZATION_ENABLED=false in environment config
2. Restart API instances (or wait for config hot-reload, ~30 seconds)
3. Verify all traffic routes to the non-personalized feed endpoint
4. Investigate root cause while the system operates in safe mode
5. Re-enable gradually, starting with a single test tenant
Fallback Behavior
When personalization is disabled, the system automatically falls back to editorial curation:
- • Videos sorted by editorial_boost DESC, created_at DESC
- • No user signals queried (reduces database load)
- • Cache TTL extended to 5 minutes (higher reuse)
- • Response time drops to ~30ms (vs ~150ms personalized)
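A rough sketch of the fallback branch in the feed handler, using the sort order from the bullets above; the db helper, getPersonalizedFeed, and the LIMIT value are assumptions:
async function getFeed(tenantId, userId, requestId) {
  const decision = await shouldPersonalizeSafe(tenantId, userId, requestId);

  if (!decision.enabled) {
    // Editorial fallback: no user signals queried, longer cache TTL, simple sort
    return db.query(
      `SELECT * FROM videos
       WHERE tenant_id = $1
       ORDER BY editorial_boost DESC, created_at DESC
       LIMIT 20`,
      [tenantId]
    );
  }

  return getPersonalizedFeed(tenantId, userId);
}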
Latency Threshold Design Rationale
Our latency thresholds are designed with a tiered approach that protects the non-negotiable requirements (p95 < 250ms, p99 < 600ms) through early detection and progressive intervention.
Design Philosophy
- Non-negotiables are cliff edges — We must never breach p95 < 250ms or p99 < 600ms. These are hard limits, not targets.
- Design with headroom — Normal operation targets are p95 < 150ms and p99 < 400ms, providing 100ms and 200ms safety margins respectively.
- Progressive intervention — Alerts escalate from warnings to automatic mitigation to kill switch activation, all before reaching the non-negotiables.
- Sustained violations only — All thresholds require sustained degradation (1-3 minutes) to avoid false positives from transient spikes.
Threshold Progression
Info: p95 > 150ms or p99 > 400ms
Above design targets but still safe. Dashboard tracking only. Margins: 100ms (p95), 200ms (p99).
Warning: p95 > 180ms or p99 > 450ms (1 min)
Degradation detected. Slack notification sent. Margins: 70ms (p95), 150ms (p99).
Critical: p95 > 200ms or p99 > 500ms (2 min)
Serious degradation. Page on-call immediately. Margins: 50ms (p95), 100ms (p99).
Auto-Mitigate: p95 > 200ms or p99 > 500ms (3 min, same thresholds as Critical)
Automatic rollout reduction by 50%. Reduces load to restore performance (see the sketch after this list).
Kill Switch: p95 > 225ms or p99 > 550ms (3 min)
Catastrophic failure. Full personalization disable. Final safety net with 25ms (p95) and 50ms (p99) margins.
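The auto-mitigation step could be as simple as halving each enabled tenant's rollout_percentage once the critical thresholds have been sustained for three minutes. A hedged sketch, where getEnabledTenantConfigs and updateTenantRollout are hypothetical helpers:
// Triggered by alerting when p95 > 200ms or p99 > 500ms has been sustained for 3 minutes
async function autoMitigate() {
  const tenants = await getEnabledTenantConfigs(); // hypothetical helper

  for (const tenant of tenants) {
    const reduced = Math.floor(tenant.rollout_percentage / 2);
    // Halving the percentage sheds roughly half of the personalized traffic;
    // consistent hashing means the users who remain keep the same experience.
    await updateTenantRollout(tenant.tenant_id, reduced); // hypothetical helper
  }
}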
Why These Specific Values?
- • 150ms/400ms design targets: Provides comfortable operational headroom under normal load
- • 180ms/450ms warnings: Early detection while still ~25% below limits
- • 200ms/500ms critical: Clear signal that intervention is needed (~20% below limits)
- • 225ms/550ms kill switch: Last resort protection (~10% below limits)
- • Progressive timing: 1min → 2min → 3min allows for human response before automation
Key Metrics & Dashboards
Performance Metrics
Latency
- • Design targets: p95 < 150ms, p99 < 400ms
- • Non-negotiables: p95 < 250ms, p99 < 600ms
- • Warning: p95 > 180ms or p99 > 450ms (1 min)
- • Critical: p95 > 200ms or p99 > 500ms (2 min)
- • Auto-mitigate: Reduce rollout 50% (3 min)
- • Kill switch: p95 > 225ms or p99 > 550ms (3 min)
Throughput
- • Requests per second
- • Target: Support 3k RPS peak
- • Alert: RPS > 3.5k (capacity risk)
Error Rate
- • 5xx errors per minute
- • Target: < 0.1%
- • Alert: > 1% for 2 min
System Health Metrics
Cache Performance
- • Cache hit rate
- • Target: > 90%
- • Redis latency (p99)
- • Memory usage
Database Health
- • Query latency (p95)
- • Connection pool usage
- • Slow query log
- • Replication lag
Event Pipeline
- • Kafka lag per partition
- • Events processed/sec
- • Processing errors
- • Signal freshness
Business Metrics
Adoption
- • % tenants enabled
- • % users receiving personalized feeds
- • Feature usage trend
Engagement
- • Click-through rate (CTR)
- • Video completion rate
- • Session length
- • Personalized vs. non-personalized
Cold Start
- • New user engagement
- • Time to first personalized feed
- • Default recommendations CTR
Alerting Strategy
Page On-Call Immediately
- • p95 latency > 200ms or p99 > 500ms for 2 minutes
- • Error rate > 5% for 2 minutes
- • Redis cluster down or >90% memory usage
- • Database replication lag > 60 seconds
- • Event pipeline stopped (no events for 10 minutes)
Slack Notification
- • p95 latency > 180ms or p99 > 450ms for 1 minute
- • Cache hit rate < 80% for 10 minutes
- • Error rate > 1% for 5 minutes
- • Kafka lag > 10 minutes
- • Database connection pool > 80% used
Dashboard Only
- • p95 latency > 150ms or p99 > 400ms (above design target)
- • Unusual traffic patterns (sudden 2x spike)
- • New slow queries detected
- • Feature flag changes
Logging & Debugging
Structured Logging
All logs are JSON-formatted with consistent fields for easy querying and analysis.
{
  "timestamp": "2026-01-12T10:30:45.123Z",
  "level": "info",
  "service": "feed-api",
  "request_id": "req_abc123",
  "user_id_hash": "a3f7b2c9...",
  "tenant_id": "tenant1",
  "endpoint": "/v1/feed",
  "response_time_ms": 142,
  "cache_hit": false,
  "personalized": true,
  "video_count": 20,
  "ranking_time_ms": 45
}
Tracing
Distributed tracing with unique request_id propagated through all services:
- • API Gateway → Feed Service → Cache → Database
- • Identify bottlenecks in request pipeline
- • Debug timeouts and cascading failures
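A minimal sketch of request_id propagation, assuming an Express-style gateway and downstream calls via Node's built-in fetch; the x-request-id header name is illustrative:
const { randomUUID } = require('crypto');

// Gateway middleware: reuse an incoming request id or mint a new one
function requestIdMiddleware(req, res, next) {
  req.requestId = req.headers['x-request-id'] || `req_${randomUUID()}`;
  res.setHeader('x-request-id', req.requestId);
  next();
}

// Downstream call: forward the same id so Feed Service, cache, and database
// logs can all be joined on request_id
async function callFeedService(req, path) {
  return fetch(`${process.env.FEED_SERVICE_URL}${path}`, {
    headers: { 'x-request-id': req.requestId }
  });
}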
Debug Mode
Optional query parameter for detailed debugging information:
GET /v1/feed?user_id=a3f7b2c9&tenant_id=tenant1&debug=true
Response includes:
- Detailed scoring breakdown per video
- Cache hit/miss per data source
- Query execution times
- Applied filters and weights
- Candidate pool size before/after ranking