Implementation
Technical decisions, trade-offs, and future improvements
Key Technical Decisions
Decision 1: Weighted Scoring vs. Machine Learning
Choice: Implemented a weighted scoring algorithm rather than ML-based recommendations.
Pros
- ✓ Predictable, debuggable behavior
- ✓ No training data requirements
- ✓ Sub-50ms computation time
- ✓ Easy to tune per tenant
- ✓ Works well with cold start
- ✓ Small team can maintain
Cons
- ✗ Less sophisticated than ML
- ✗ Manual weight tuning needed
- ✗ Limited pattern discovery
- ✗ No collaborative filtering
Rationale: For a small team with 120 tenants, simplicity and maintainability outweigh the marginal accuracy gains of ML. The system meets all performance requirements and is reversible—we can add ML later without architectural changes.
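The weighted approach can be sketched in a few lines. Signal names and default weights below are illustrative assumptions, not the production configuration; the real weights are tenant-tunable.

```javascript
// Hypothetical signal names and weights; real values are tenant-configurable.
const DEFAULT_WEIGHTS = {
  recency: 0.3,      // newer videos score higher
  popularity: 0.25,  // normalized views / completions
  affinity: 0.25,    // overlap with the session's watched categories
  editorial: 0.2,    // tenant-curated boost
};

// Each signal is assumed pre-normalized to [0, 1] before weighting.
function scoreVideo(signals, weights = DEFAULT_WEIGHTS) {
  return Object.keys(weights).reduce(
    (sum, key) => sum + weights[key] * (signals[key] ?? 0),
    0
  );
}

function rankFeed(videos, weights = DEFAULT_WEIGHTS) {
  return [...videos]
    .map((v) => ({ ...v, score: scoreVideo(v.signals, weights) }))
    .sort((a, b) => b.score - a.score);
}

const feed = rankFeed([
  { id: "a", signals: { recency: 0.9, popularity: 0.2, affinity: 0.1, editorial: 0 } },
  { id: "b", signals: { recency: 0.1, popularity: 0.9, affinity: 0.8, editorial: 0.5 } },
]);
console.log(feed.map((v) => v.id)); // ["b", "a"]
```

Because the whole model is a dot product over named signals, a misbehaving recommendation can be explained by printing each term, which is the debuggability advantage listed above.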
Decision 2: Asynchronous Event Processing
Choice: User events processed asynchronously with up to 5-minute lag.
Pros
- ✓ Decouples writes from reads
- ✓ Batch processing efficiency
- ✓ Handles traffic spikes
- ✓ Reduced database load
- ✓ Easier to scale writes
Cons
- ✗ Not real-time personalization
- ✗ Eventual consistency complexity
- ✗ Additional infrastructure (Kafka/SQS)
Rationale: The 5-minute lag is acceptable per requirements. Async processing is critical for scale—synchronous writes would bottleneck at 3k RPS. Kafka provides reliable event delivery and natural backpressure handling.
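A minimal sketch of the write-path decoupling, assuming an in-memory batch buffer as a stand-in for the Kafka producer: events are accepted immediately and flushed in batches, so the read path never blocks on event writes. The class and parameter names are illustrative.

```javascript
// Stand-in for the async pipeline: `sink` represents the Kafka topic.
class EventBuffer {
  constructor(sink, { maxBatch = 100, flushMs = 50 } = {}) {
    this.sink = sink;
    this.maxBatch = maxBatch;
    this.queue = [];
    this.timer = setInterval(() => this.flush(), flushMs);
  }
  publish(event) {
    this.queue.push({ ...event, ts: Date.now() });
    if (this.queue.length >= this.maxBatch) this.flush();
  }
  flush() {
    if (this.queue.length === 0) return;
    const batch = this.queue.splice(0, this.queue.length);
    this.sink(batch); // one downstream write per batch, not per event
  }
  close() { clearInterval(this.timer); this.flush(); }
}

const batches = [];
const buffer = new EventBuffer((b) => batches.push(b), { flushMs: 10 });
for (let i = 0; i < 250; i++) buffer.publish({ type: "view", videoId: String(i) });
buffer.close();
console.log(batches.length); // 3 batches of 100, 100, 50
```

The batching is what turns 3k RPS of individual event writes into a far smaller number of bulk inserts downstream, at the cost of the lag the requirements already permit.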
Decision 3: Redis for Caching vs. In-Memory
Choice: Redis cluster for distributed caching rather than application-level caching.
Pros
- ✓ Shared cache across instances
- ✓ Sub-ms lookup times
- ✓ Built-in TTL support
- ✓ Persistence options
- ✓ Scales horizontally
Cons
- ✗ Network hop overhead (~1-2ms)
- ✗ Additional infrastructure cost
- ✗ Single point of failure (mitigated by clustering)
Rationale: Distributed caching is essential when running multiple API instances. In-memory caching would cause cache misses on different instances, negating benefits. Redis cluster provides HA and sufficient performance for our latency targets.
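The cache sits behind a small interface so the backing store stays swappable (this is the CacheService boundary referenced under reversible decisions). A sketch with an in-memory implementation standing in for the Redis cluster; method names and the `feed:` key scheme are assumptions.

```javascript
// In-memory stand-in for the Redis-backed CacheService: same get/set-with-TTL
// contract the production implementation would expose.
class InMemoryCache {
  constructor() { this.store = new Map(); }
  async get(key) {
    const entry = this.store.get(key);
    if (!entry) return null;
    if (Date.now() > entry.expiresAt) { this.store.delete(key); return null; }
    return entry.value;
  }
  async set(key, value, ttlSeconds) {
    this.store.set(key, { value, expiresAt: Date.now() + ttlSeconds * 1000 });
  }
}

// Feed reads go through the cache first; the TTL comes from tenant config.
async function getFeed(cache, tenantId, computeFeed, ttlSeconds = 60) {
  const key = `feed:${tenantId}`;
  const cached = await cache.get(key);
  if (cached) return cached;
  const feed = await computeFeed(tenantId);
  await cache.set(key, feed, ttlSeconds);
  return feed;
}
```

Because callers only see `get`/`set` with a TTL, cutting over to Memcached or CDN edge caching is a new implementation of the same two methods rather than a change to the read path.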
Decision 4: PostgreSQL vs. NoSQL for User Signals
Choice: PostgreSQL with monthly partitioning for time-series event data.
Pros
- ✓ ACID guarantees
- ✓ Powerful JOIN queries
- ✓ Mature ecosystem
- ✓ Built-in partitioning
- ✓ Team expertise
Cons
- ✗ Slower than specialized time-series DBs
- ✗ Vertical scaling limits
- ✗ Partition maintenance overhead
Rationale: PostgreSQL handles our query patterns efficiently and the team has deep expertise. Monthly partitioning solves the retention problem elegantly. We can migrate to Cassandra or TimescaleDB later if write volume becomes problematic.
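The partition lifecycle can be sketched as DDL generation. Table and partition names below are assumptions about the schema; the point is that 90-day retention becomes an O(1) `DROP TABLE` on a whole monthly partition instead of a `DELETE` over millions of rows.

```javascript
// Hypothetical naming scheme: user_signals_YYYY_MM for monthly partitions.
function partitionName(date) {
  const y = date.getUTCFullYear();
  const m = String(date.getUTCMonth() + 1).padStart(2, "0");
  return `user_signals_${y}_${m}`;
}

function createPartitionSql(monthStart) {
  const next = new Date(Date.UTC(monthStart.getUTCFullYear(), monthStart.getUTCMonth() + 1, 1));
  const fmt = (d) => d.toISOString().slice(0, 10);
  return `CREATE TABLE ${partitionName(monthStart)} PARTITION OF user_signals ` +
         `FOR VALUES FROM ('${fmt(monthStart)}') TO ('${fmt(next)}');`;
}

// A partition is safely droppable once its entire month is older than the
// retention cutoff, i.e. the month before the cutoff's own month.
function dropExpiredSql(now, retentionDays = 90) {
  const cutoff = new Date(now.getTime() - retentionDays * 24 * 3600 * 1000);
  const expired = new Date(Date.UTC(cutoff.getUTCFullYear(), cutoff.getUTCMonth() - 1, 1));
  return `DROP TABLE IF EXISTS ${partitionName(expired)};`;
}
```

A scheduled job would run `createPartitionSql` ahead of each month boundary and `dropExpiredSql` after it; both are plain statements against stock PostgreSQL declarative partitioning, which is why no specialized time-series database is needed yet.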
Decision 5: JavaScript for Prototype Implementation
Choice: Implemented the prototype in JavaScript rather than TypeScript.
Pros
- ✓ Faster iteration speed
- ✓ No compilation step
- ✓ Simpler setup for prototype
- ✓ Direct demonstration focus
Cons
- ✗ No compile-time type safety
- ✗ Potential runtime errors
- ✗ Less IDE autocomplete
- ✗ Manual API contract validation
Rationale: For a demonstration prototype, JavaScript provides faster iteration without compilation overhead. For production, TypeScript would be recommended to prevent configuration errors (invalid weights, percentages) and ensure API contract safety across 120 tenants.
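Absent compile-time types, the configuration errors mentioned above have to be caught at runtime. A sketch of the kind of validation the prototype would need; the field names (`weights`, `cache_ttl_seconds`) are assumptions about the tenant config shape.

```javascript
// Runtime guard for tenant configs: the checks TypeScript's type system
// would not cover anyway (value ranges, weights summing to 1).
function validateTenantConfig(config) {
  const errors = [];
  const weights = config.weights ?? {};
  const values = Object.values(weights);
  if (values.some((w) => typeof w !== "number" || w < 0 || w > 1)) {
    errors.push("each weight must be a number in [0, 1]");
  }
  const sum = values.reduce((a, b) => a + b, 0);
  if (Math.abs(sum - 1) > 1e-6) {
    errors.push(`weights must sum to 1 (got ${sum})`);
  }
  if (!Number.isInteger(config.cache_ttl_seconds) || config.cache_ttl_seconds < 0) {
    errors.push("cache_ttl_seconds must be a non-negative integer");
  }
  return errors;
}
```

Note that the range and sum checks are value-level constraints TypeScript would not enforce either, so a validator like this is needed in production regardless of the language choice; TypeScript adds contract safety on top of it.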
What Was Prioritized
Meeting Non-Negotiables
Every architectural decision was evaluated against the six core constraints. Latency, scale, and privacy requirements drove technology choices and caching strategy.
Operational Simplicity
Chose proven technologies (PostgreSQL, Redis, Kafka) over cutting-edge solutions. Prioritized debuggability and team expertise to enable fast iteration.
Safe Rollout
Feature flags and graceful degradation were first-class concerns. Every component has a fallback path to ensure system stability during deployment.
Tenant Flexibility
Per-tenant configuration allows experimentation. Tenants can adjust weights, test personalization, and maintain editorial control without code changes.
Reversible vs. Irreversible Decisions
Key Reversible Decisions
Weighted scoring algorithm: Abstracted behind interface. Add ML as parallel path via feature flag.
Trigger: 6+ months signal data accumulated OR CTR lift <5% vs. baseline (3-month ML implementation)
Redis cache layer: CacheService interface allows swap to Memcached, Varnish, or CDN edge caching via dual-write cutover.
Trigger: Redis costs >$5k/month OR p95 latency >5ms OR geo-distribution needed (3-4 weeks)
PostgreSQL for events: Monthly partitions enable incremental migration to TimescaleDB, ClickHouse, or S3 data lake.
Trigger: Writes >10k/sec OR query p95 >50ms OR storage costs >$10k/month (6-8 weeks)
60-second cache TTL: Tenant-configurable (tenant_configs.cache_ttl_seconds). Adjust per tenant without deploy.
Trigger: Stale content complaints OR memory >80% OR hit rate <90% (immediate config change)
Kafka event pipeline: Producer interface abstracted. Migrate to AWS SQS or RabbitMQ via dual-write without data model changes.
Trigger: Operational burden too high (incidents >2/month) OR cost >$3k/month (4-6 weeks)
Key Irreversible / High-Cost Decisions
Event-driven architecture (no user profiles): Adding profiles = 6+ month rearchitecture. Requires schema migration (120 tenants), GDPR consent flows, RTBF APIs, privacy audit.
Blocks: Collaborative filtering, cross-device sync, demographic targeting, user-initiated features (email notifications).
Mitigation: Avoids 80% of GDPR compliance work. Session-based CF remains possible. Revisit if 10+ tenants request cross-device sync.
Tenant-scoped data isolation: Hard-partitioned by tenant_id (FK, indexes, queries). Cross-tenant features require data lake + privacy review.
Blocks: Shared ML models, platform benchmarking, cross-tenant abuse detection.
Mitigation: Privacy-first design. Per-tenant compliance (EU vs. US rules). Horizontal scale without bottlenecks.
API response schema (pre-computed scores): 120 tenants with mobile apps. Schema changes = breaking change (App Store review + user adoption = weeks).
Blocks: Client-side ranking, per-user scoring experiments, explanation features ("why this video?").
Mitigation: Centralized ranking simplifies compliance. API versioning (/v2) provides migration path (forces app updates).
90-day retention (monthly partition drops): Data permanently deleted. Cannot retroactively extend without changing policy before expiration.
Blocks: Long-term behavior trends, ML training on historical patterns, "year in review" features.
Mitigation: Data minimization (privacy best practice). Reduces storage costs. Archive to S3 before drops if needed.
Hashed user IDs (SHA-256, one-way): Cannot recover raw IDs. Features requiring raw IDs need SDK changes + schema migration + ID mapping table.
Blocks: Push notifications, cross-device deduplication, CRM/analytics linking, email-based debugging.
Mitigation: Data breach exposes hashes not PII. Meets GDPR/CCPA without opt-in. Contractual requirement for some tenants.
Top Risks Snapshot
| Risk | Impact | Mitigation | Reversible? |
|---|---|---|---|
| PostgreSQL write bottleneck (>10k/sec) | High | Shard by tenant_id or migrate to TimescaleDB. Monitor CPU >70%, latency >50ms p95. | Yes |
| Tenant requests cross-device sync (needs profiles) | High | 6-month rearchitecture. Defer until >10 tenant requests. Consent flows required. | No |
| API schema breaking change (120 tenants) | High | Versioned API (/v2). Force updates via minimum version check (3-month migration). | No |
| Weighted scoring underperforms (<5% CTR lift) | Med | Feature flag enables ML path (3 months). A/B test to validate improvement. | Yes |
| Kafka operational burden (small team) | Med | Migrate to AWS SQS if incidents >2/month (4 weeks). Interface abstracted. | Yes |
| Data breach exposes hashed behavior data | Critical | 90-day retention limits exposure. Encryption at rest. Row-level tenant_id security. | N/A |
Future Improvements
Immediate Follow-Ups (If Given 1–2 More Weeks)
Cost analysis dashboard: Per-service breakdown (Redis, PostgreSQL, Kafka), per-tenant attribution, cost-per-request. Effort: S
Value: Identify optimization targets (unused indexes, oversized instances, cache tuning).
Deferred: Requires production traffic for meaningful baseline. Premature optimization would misallocate effort. (Trigger: 1 month post-launch OR costs >$10k/month)
Observability dashboard: Per-tenant CTR, completion rates, personalization lift, cache hit rates, scoring latency. Effort: S
Value: Enable data-driven optimization and tenant support. Replace manual log queries.
Deferred: Pre-launch metrics would be synthetic. Logging exists, manual queries sufficient for MVP. (Trigger: 2 weeks post-launch)
Query profiling + indexing: pg_stat_statements monitoring, identify slow queries (>50ms p95), add composite indexes. Effort: S
Value: Reduce query latency, lower database CPU, extend headroom before needing read replicas.
Deferred: Current indexes theory-driven. Real access patterns emerge under production load with diverse tenants. (Trigger: DB CPU >50% OR query p95 >50ms)
Cache warming: Pre-compute feeds for high-traffic tenants during off-peak hours to prevent cold cache latency spikes. Effort: M
Value: Eliminate morning traffic surge latency spikes (cache misses cause 10x slower response).
Deferred: Adds complexity (cron jobs, priority queues) without ROI until traffic patterns known. (Trigger: Cache miss latency spikes correlate with business hours)
Cold start optimization: Boost trending content (view velocity), aggressive demographic hints, slower editorial decay for new users. Effort: M
Value: Improve first-session engagement for new users (currently rely on editorial + recency).
Deferred: Requires A/B testing framework. Risk of over-optimizing before understanding engagement drivers. (Trigger: New user churn >40% in first session OR 20%+ retention gap)
Circuit breaker: If DB queries fail >5 times in 10s, serve stale cache for 60s. Prevent cascading failures. Effort: S
Value: Maintain availability during DB incidents. Better than fallback to non-personalized feed (stale > generic).
Deferred: Adds retry complexity. Current graceful degradation sufficient for MVP. Stale cache needs SLA definition. (Trigger: First DB incident with API error rate >1%)
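The proposed breaker is small enough to sketch: after `threshold` failures inside a rolling window, calls short-circuit to the stale-cache fallback for the cooldown period. Parameter defaults mirror the numbers above; the class shape is an assumption.

```javascript
// Failure-counting breaker: 5 failures in 10s opens the circuit for 60s.
class CircuitBreaker {
  constructor({ threshold = 5, windowMs = 10_000, cooldownMs = 60_000 } = {}) {
    this.threshold = threshold;
    this.windowMs = windowMs;
    this.cooldownMs = cooldownMs;
    this.failures = [];   // timestamps of recent failures
    this.openUntil = 0;   // epoch ms until which the circuit stays open
  }
  async call(primary, fallback) {
    const now = Date.now();
    if (now < this.openUntil) return fallback(); // open: serve stale cache
    try {
      return await primary();
    } catch (err) {
      this.failures = this.failures.filter((t) => now - t < this.windowMs);
      this.failures.push(now);
      if (this.failures.length >= this.threshold) {
        this.openUntil = now + this.cooldownMs;
        this.failures = [];
      }
      return fallback();
    }
  }
}
```

While open, the breaker stops hammering the failing database entirely, which is the cascading-failure protection; the fallback serving stale cache is what makes "stale > generic" the operative trade-off.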
Scale-Triggered Improvements
Read replicas: Distribute query load when writes OK but reads cause CPU contention. Effort: M
Trigger: Primary DB CPU >70% sustained OR read p95 >100ms (expected at 5-7k RPS). Unnecessary today—95% cache hit rate means few DB reads.
Kafka consumer scaling: Partition-based parallelism (15+ partitions) when single consumer bottlenecks ingestion. Effort: M
Trigger: Consumer lag >5 min OR writes >8k/sec sustained. Unnecessary today—partition key design critical (tenant_id vs. user_id_hash). Wait for traffic patterns.
Finer partitioning: Weekly or daily user_signals partitions when monthly partitions >100GB (drop operations lock for minutes). Effort: S
Trigger: Monthly partition >100GB OR partition drops cause >10s write latency. Unnecessary today—more partitions = query planning overhead.
CDN edge caching: Geo-distribute cache when distant regions (Asia, LATAM) exceed 250ms latency due to RTT. Effort: L
Trigger: >20% traffic outside US/EU AND p95 >300ms for those requests. Unnecessary today—initial 120 tenants US/EU-based.
Tenant sharding: Separate DBs for largest tenants when single tenant dominates resources (noisy neighbor). Effort: L
Trigger: Any tenant >30% total DB load OR user_signals >200GB. Unnecessary today—adds multi-cluster complexity.
Deliberately Postponed Bets
ML-based scoring: TensorFlow Recommenders, embeddings, collaborative filtering. 3-month project
Evidence to revisit: 6+ months of signal data accumulated AND weighted scoring delivers <10% CTR lift, OR a tenant funds the 3-month implementation. No training data on day 1; team lacks ML ops experience; debuggability would suffer.
Real-time streaming: Flink pipeline reducing signal lag from 5min to <10sec. 2-3 month project
Evidence to revisit: an A/B test proves sub-minute lag increases engagement >15%, OR a premium tenant pays the 3x cost. Requirements allow a 5-min lag; costs run 3-5x higher; ROI is unproven for video (unlike e-commerce).
User profiles table: Persistent state enabling cross-device sync, preference UI, "continue watching". 6+ month rearchitecture
Evidence to revisit: >30% of tenants request cross-device sync OR session-based collaborative filtering fails (sparse data). Requires a 6+ month rearchitecture: GDPR consent flows, schema migration, legal review. Adds back the ~80% of GDPR compliance work the current design avoids.
Microservices decomposition: Split into ingestion, scoring, and serving services for independent scaling. 4-6 month project
Evidence to revisit: Team >15 engineers (org boundaries) OR scoring needs 10x scaling vs. serving. Small team can't maintain 3+ services; monolith scales fine; service boundaries unclear pre-bottleneck.