🧨 Fintech Traffic Failure War-Game – 10 Deep Scenarios
🔥 Scenario 1 – Users getting random 502 from CloudFront, origin is ALB
What the interviewer wants: edge vs origin debugging flow.
Answer (how you respond): First I check CloudFront metrics: edge error rate vs origin error rate. If errors come only from specific paths, I check the cache behavior routing and origin mapping. Then I check ALB target health and 5xx metrics. If the ALB is healthy, I look for a timeout mismatch, e.g. the CloudFront origin keep-alive timeout exceeding the ALB idle timeout, which produces random 502s on reused connections. I also check header forwarding rules: a missing Host header can break backend routing. Lesson: an edge 502 doesn't always mean the origin is down; it's often a config mismatch.
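A minimal boto3 sketch of that first comparison, assuming placeholder distribution ID, ALB dimension, and region: if the edge shows 5xx but the ALB target 5xx count stays flat, the problem is at CloudFront, not the origin.

```python
import boto3
from datetime import datetime, timedelta, timezone

DIST_ID = "E1234EXAMPLE"                  # assumption: CloudFront distribution ID
ALB_DIM = "app/my-alb/0123456789abcdef"   # assumption: ALB CloudWatch dimension

end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)

# CloudFront metrics live in us-east-1 under the Global region dimension.
cf_cw = boto3.client("cloudwatch", region_name="us-east-1")
edge_5xx = cf_cw.get_metric_statistics(
    Namespace="AWS/CloudFront", MetricName="5xxErrorRate",
    Dimensions=[{"Name": "DistributionId", "Value": DIST_ID},
                {"Name": "Region", "Value": "Global"}],
    StartTime=start, EndTime=end, Period=300, Statistics=["Average"],
)

# ALB target 5xx in the ALB's own region (placeholder region).
alb_cw = boto3.client("cloudwatch", region_name="eu-west-1")
origin_5xx = alb_cw.get_metric_statistics(
    Namespace="AWS/ApplicationELB", MetricName="HTTPCode_Target_5XX_Count",
    Dimensions=[{"Name": "LoadBalancer", "Value": ALB_DIM}],
    StartTime=start, EndTime=end, Period=300, Statistics=["Sum"],
)

# Edge 5xx without matching origin 5xx points at CloudFront config
# (origin timeouts, Host header forwarding) rather than the backend.
print("edge:", edge_5xx["Datapoints"])
print("origin:", origin_5xx["Datapoints"])
```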
🔥 Scenario 2 – API Gateway suddenly returning throttling errors for one partner only
Answer: I check the usage plan and API key limits first, then stage-level throttling settings, then CloudWatch metrics per API key. If only one partner is affected, it's usually their usage plan burst or rate limit being exceeded. I would confirm whether their traffic pattern changed. Lesson: API Gateway throttling is often policy, not outage.
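A quick boto3 sketch of that usage-plan check, assuming a placeholder API key ID for the affected partner:

```python
import boto3
from datetime import date

apigw = boto3.client("apigateway")
PARTNER_KEY_ID = "abc123def456"  # assumption: the affected partner's API key ID

# List the usage plans attached to this key, with their throttle and quota.
for plan in apigw.get_usage_plans(keyId=PARTNER_KEY_ID)["items"]:
    print(plan["name"], plan.get("throttle"), plan.get("quota"))
    # Today's consumption for this key: per-day [used, remaining] pairs.
    usage = apigw.get_usage(
        usagePlanId=plan["id"], keyId=PARTNER_KEY_ID,
        startDate=str(date.today()), endDate=str(date.today()),
    )
    print(usage["items"])
```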
🔥 Scenario 3 – ALB shows healthy targets but users see timeouts
Answer: Health check endpoints may be shallow. I verify the readiness endpoint actually checks DB/queue dependencies. Then I look at the ALB TargetResponseTime metric and backend latency, and compare the ALB idle timeout with backend response times. I also check connection saturation on the pods. Lesson: Healthy ≠ ready for real traffic.
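A stdlib-only sketch of what a deep readiness endpoint can look like, assuming placeholder DB/queue hosts: the ALB health check should hit something like this instead of a bare 200.

```python
import socket
from http.server import BaseHTTPRequestHandler, HTTPServer

# Assumption: these hosts/ports stand in for the real dependencies.
DEPENDENCIES = {"db": ("db.internal", 5432), "queue": ("mq.internal", 5672)}

def dependency_ok(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to the dependency succeeds quickly."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

class ReadyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/ready":
            self.send_response(404); self.end_headers(); return
        healthy = all(dependency_ok(h, p) for h, p in DEPENDENCIES.values())
        # 503 here pulls the target out of rotation before users see timeouts.
        self.send_response(200 if healthy else 503)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), ReadyHandler).serve_forever()
```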
🔥 Scenario 4 – After deploy, traffic drops only for large file uploads
Answer: I suspect ALB buffering + timeout. Check the ALB idle timeout and the max body size at the ingress/app. Also check the CloudFront behavior if it is in front: large payload forwarding rules. I'd recommend switching large uploads to S3 pre-signed URLs. Lesson: LB buffering + upload size is a classic failure point.
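A short boto3 sketch of the pre-signed URL approach (bucket and key are placeholders): the client PUTs the file straight to S3, so the large body never passes through the ALB.

```python
import boto3

s3 = boto3.client("s3")

# Assumption: bucket and key are illustrative names.
url = s3.generate_presigned_url(
    "put_object",
    Params={"Bucket": "fintech-uploads", "Key": "statements/2024/batch.csv"},
    ExpiresIn=900,  # URL valid for 15 minutes
)
print(url)  # hand this to the client; the upload bypasses LB body limits and timeouts
```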
🔥 Scenario 5 – gRPC service behind ALB failing intermittently
Answer: ALB gRPC support is version/feature dependent and sensitive to configuration. I check the target group protocol version, the health checks, and whether HTTP/2 is preserved end-to-end. Often an NLB is more stable for gRPC pass-through. Lesson: The wrong LB type for the protocol causes flaky behavior.
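A boto3 sketch of the protocol check, assuming a placeholder target group name; a ProtocolVersion other than GRPC means the ALB is not treating the traffic as gRPC end-to-end.

```python
import boto3

elbv2 = boto3.client("elbv2")

# Assumption: "payments-grpc" is an illustrative target group name.
tg = elbv2.describe_target_groups(Names=["payments-grpc"])["TargetGroups"][0]

# For gRPC the target group should report ProtocolVersion GRPC,
# and the listener in front of it must be HTTPS (HTTP/2).
print(tg["Protocol"], tg.get("ProtocolVersion"))
print(tg["HealthCheckProtocol"], tg.get("HealthCheckPath"))
```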
🔥 Scenario 6 – Sudden spike causes widespread 503 at ALB
Answer: Check ALB 5xx, rejected connection, and target connection error metrics. Then pod HPA and cluster autoscaler events. It is likely scaling lag, so I'd check pending pods and node capacity. Short term: increase replicas manually. Long term: tune autoscaling and pre-scale for peak windows. Lesson: The LB exposes scaling lag instantly.
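A sketch with the official kubernetes Python client to make the scaling lag visible (pending pods and FailedScheduling events), assuming kubeconfig access to the cluster:

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

# Pods the scheduler cannot place yet = load the ALB is already receiving.
pending = v1.list_pod_for_all_namespaces(field_selector="status.phase=Pending")
for pod in pending.items:
    print("PENDING", pod.metadata.namespace, pod.metadata.name)

# FailedScheduling events usually mean the cluster autoscaler is still adding nodes.
events = v1.list_event_for_all_namespaces(field_selector="reason=FailedScheduling")
for ev in events.items[-10:]:
    print(ev.last_timestamp, ev.involved_object.name, ev.message)
```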
🔥 Scenario 7 – Users in one AZ failing, others fine
Answer: Check target distribution by AZ and the cross-zone load balancing setting. If cross-zone is disabled and that AZ lost its targets, traffic routed there fails. Also check for a subnet/NACL issue in that AZ. Lesson: AZ imbalance + cross-zone off = partial outage.
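A boto3 sketch of the per-AZ check, assuming a placeholder target group ARN; the cross-zone attribute shown is the target-group-level setting, so treat that key as an assumption if your setup controls it at the load balancer instead.

```python
import boto3
from collections import Counter

elbv2 = boto3.client("elbv2")
TG_ARN = "arn:aws:elasticloadbalancing:eu-west-1:123456789012:targetgroup/api/abc"  # assumption

# Count healthy targets per AZ: a zone with zero explains the partial outage.
health = elbv2.describe_target_health(TargetGroupArn=TG_ARN)
per_az = Counter(
    d["Target"].get("AvailabilityZone", "unknown")
    for d in health["TargetHealthDescriptions"]
    if d["TargetHealth"]["State"] == "healthy"
)
print(per_az)

# Cross-zone setting: if it's off and an AZ has no healthy targets, that AZ fails.
attrs = elbv2.describe_target_group_attributes(TargetGroupArn=TG_ARN)
for a in attrs["Attributes"]:
    if a["Key"] == "load_balancing.cross_zone.enabled":
        print("cross-zone:", a["Value"])
```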
🔥 Scenario 8 – CloudFront serving stale fintech rates data after update
Answer: Cache TTL too high, or the cache key is missing a query/version parameter. I'd verify the cache policy and invalidation status. I recommend versioned object keys instead of invalidation for dynamic-but-cacheable data. Lesson: Cache design is a data correctness risk in fintech.
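A boto3 sketch contrasting the two options, with a placeholder distribution ID and path:

```python
import time
import boto3

cf = boto3.client("cloudfront")
DIST_ID = "E1234EXAMPLE"  # assumption: distribution ID

# Option 1: explicit invalidation after a rates update (takes minutes, billed per path).
inv = cf.create_invalidation(
    DistributionId=DIST_ID,
    InvalidationBatch={
        "Paths": {"Quantity": 1, "Items": ["/rates/*"]},
        "CallerReference": str(time.time()),
    },
)
print(inv["Invalidation"]["Status"])  # "InProgress" until the edges are purged

# Option 2 (preferred for rates): publish to a versioned key such as
# /rates/v2024-06-07T10-00/fx.json so stale objects are never served at all.
```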
🔥 Scenario 9 – API Gateway + Lambda: intermittent 504 errors
Answer: Check Lambda duration vs API Gateway integration timeout. Look for cold start spikes or downstream DB slowness. Verify Lambda concurrency throttling. Also check VPC-attached Lambda ENI exhaustion. Lesson: Gateway timeout < backend timeout = false failure.
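A boto3 sketch of the timeout comparison for a REST API integration, assuming placeholder function name, API ID, and resource ID:

```python
import boto3

lam = boto3.client("lambda")
apigw = boto3.client("apigateway")

# Assumptions: function name and REST API / resource IDs are illustrative.
fn = lam.get_function_configuration(FunctionName="rates-handler")
integ = apigw.get_integration(
    restApiId="a1b2c3d4e5", resourceId="abc123", httpMethod="GET"
)

lambda_timeout_ms = fn["Timeout"] * 1000
gateway_timeout_ms = integ["timeoutInMillis"]  # REST API default is 29000 ms

if gateway_timeout_ms < lambda_timeout_ms:
    # The gateway gives up first and returns 504 while the Lambda keeps running.
    print(f"504 risk: gateway {gateway_timeout_ms} ms < lambda {lambda_timeout_ms} ms")
```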
🔥 Scenario 10 – Partner says your API IP changed and they are blocked
Answer: If using an ALB, the IPs are not static; that's expected. For allowlist partners, the design should use an NLB with Elastic IPs or Global Accelerator. Lesson: IP allowlist requirements drive LB choice.
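A boto3 sketch of the static-IP design, assuming a placeholder subnet; a production NLB would map one Elastic IP per AZ, this shows a single mapping for brevity.

```python
import boto3

ec2 = boto3.client("ec2")
elbv2 = boto3.client("elbv2")

# Allocate an Elastic IP and pin it to the NLB's subnet mapping.
eip = ec2.allocate_address(Domain="vpc")
nlb = elbv2.create_load_balancer(
    Name="partner-api-nlb",
    Type="network",
    Scheme="internet-facing",
    SubnetMappings=[{"SubnetId": "subnet-0abc123",            # assumption: subnet ID
                     "AllocationId": eip["AllocationId"]}],
)
print(nlb["LoadBalancers"][0]["DNSName"], eip["PublicIp"])
# The Elastic IP never changes, so partners can safely allowlist it.
```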