🧨 Fintech Traffic Failure War-Game – 10 Deep Scenarios
🔥 Scenario 1 – Users getting random 502 from CloudFront, origin is ALB
What the interviewer wants: edge vs origin debugging flow.
Answer (how you respond): First I check CloudFront metrics: edge error rate vs origin error rate. If errors come only from specific paths, I check the cache behavior routing and origin mapping. Then I check ALB target health and 5xx metrics. If the ALB is healthy, I look for a timeout mismatch, e.g. the CloudFront origin keep-alive timeout exceeding the ALB idle timeout, which produces random 502s on reused connections. I also check header forwarding rules: a missing Host header can break backend routing. Lesson: an edge 502 doesn't always mean the origin is down; it's often a config mismatch.
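A minimal boto3 sketch of that first comparison, assuming placeholder distribution ID, ALB dimension, and region: if the edge shows 5xx but the ALB target 5xx count stays flat, the problem is at CloudFront, not the origin.

```python
import boto3
from datetime import datetime, timedelta, timezone

DIST_ID = "E1234EXAMPLE"                  # assumption: CloudFront distribution ID
ALB_DIM = "app/my-alb/0123456789abcdef"   # assumption: ALB CloudWatch dimension

end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)

# CloudFront metrics live in us-east-1 under the Global region dimension.
cf_cw = boto3.client("cloudwatch", region_name="us-east-1")
edge_5xx = cf_cw.get_metric_statistics(
    Namespace="AWS/CloudFront", MetricName="5xxErrorRate",
    Dimensions=[{"Name": "DistributionId", "Value": DIST_ID},
                {"Name": "Region", "Value": "Global"}],
    StartTime=start, EndTime=end, Period=300, Statistics=["Average"],
)

# ALB target 5xx in the ALB's own region (placeholder region).
alb_cw = boto3.client("cloudwatch", region_name="eu-west-1")
origin_5xx = alb_cw.get_metric_statistics(
    Namespace="AWS/ApplicationELB", MetricName="HTTPCode_Target_5XX_Count",
    Dimensions=[{"Name": "LoadBalancer", "Value": ALB_DIM}],
    StartTime=start, EndTime=end, Period=300, Statistics=["Sum"],
)

# Edge 5xx without matching origin 5xx points at CloudFront config
# (origin timeouts, Host header forwarding) rather than the backend.
print("edge:", edge_5xx["Datapoints"])
print("origin:", origin_5xx["Datapoints"])
```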
🔥 Scenario 2 – API Gateway suddenly returning throttling errors for one partner only
Answer: I check the usage plan and API key limits first, then stage-level throttling settings, then CloudWatch metrics per API key. If only one partner is affected, it's usually their usage plan burst or rate limit being exceeded. I would confirm whether their traffic pattern changed. Lesson: API Gateway throttling is often policy, not outage.
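A quick boto3 sketch of that usage-plan check, assuming a placeholder API key ID for the affected partner:

```python
import boto3
from datetime import date

apigw = boto3.client("apigateway")
PARTNER_KEY_ID = "abc123def456"  # assumption: the affected partner's API key ID

# List the usage plans attached to this key, with their throttle and quota.
for plan in apigw.get_usage_plans(keyId=PARTNER_KEY_ID)["items"]:
    print(plan["name"], plan.get("throttle"), plan.get("quota"))
    # Today's consumption for this key: per-day [used, remaining] pairs.
    usage = apigw.get_usage(
        usagePlanId=plan["id"], keyId=PARTNER_KEY_ID,
        startDate=str(date.today()), endDate=str(date.today()),
    )
    print(usage["items"])
```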
🔥 Scenario 3 – ALB shows healthy targets but users see timeouts
Answer: Health check endpoints may be shallow. I verify the readiness endpoint actually checks DB/queue dependencies. Then I look at the ALB TargetResponseTime metric and backend latency, and compare the ALB idle timeout with backend response times. I also check connection saturation on the pods. Lesson: Healthy ≠ ready for real traffic.
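A stdlib-only sketch of what a deep readiness endpoint can look like, assuming placeholder DB/queue hosts: the ALB health check should hit something like this instead of a bare 200.

```python
import socket
from http.server import BaseHTTPRequestHandler, HTTPServer

# Assumption: these hosts/ports stand in for the real dependencies.
DEPENDENCIES = {"db": ("db.internal", 5432), "queue": ("mq.internal", 5672)}

def dependency_ok(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to the dependency succeeds quickly."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

class ReadyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/ready":
            self.send_response(404); self.end_headers(); return
        healthy = all(dependency_ok(h, p) for h, p in DEPENDENCIES.values())
        # 503 here pulls the target out of rotation before users see timeouts.
        self.send_response(200 if healthy else 503)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), ReadyHandler).serve_forever()
```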
🔥 Scenario 4 – After deploy, traffic drops only for large file uploads
Answer: I suspect ALB buffering + timeout. Check the ALB idle timeout and the max body size at the ingress/app. Also check the CloudFront behavior if it is in front: large payload forwarding rules. I'd recommend switching large uploads to S3 pre-signed URLs. Lesson: LB buffering + upload size is a classic failure point.
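A short boto3 sketch of the pre-signed URL approach (bucket and key are placeholders): the client PUTs the file straight to S3, so the large body never passes through the ALB.

```python
import boto3

s3 = boto3.client("s3")

# Assumption: bucket and key are illustrative names.
url = s3.generate_presigned_url(
    "put_object",
    Params={"Bucket": "fintech-uploads", "Key": "statements/2024/batch.csv"},
    ExpiresIn=900,  # URL valid for 15 minutes
)
print(url)  # hand this to the client; the upload bypasses LB body limits and timeouts
```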
🔥 Scenario 5 – gRPC service behind ALB failing intermittently
Answer: ALB gRPC support is version/feature dependent and sensitive to configuration. I check the target group protocol version, the health checks, and whether HTTP/2 is preserved end-to-end. Often an NLB is more stable for gRPC pass-through. Lesson: The wrong LB type for the protocol causes flaky behavior.
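A boto3 sketch of the protocol check, assuming a placeholder target group name; a ProtocolVersion other than GRPC means the ALB is not treating the traffic as gRPC end-to-end.

```python
import boto3

elbv2 = boto3.client("elbv2")

# Assumption: "payments-grpc" is an illustrative target group name.
tg = elbv2.describe_target_groups(Names=["payments-grpc"])["TargetGroups"][0]

# For gRPC the target group should report ProtocolVersion GRPC,
# and the listener in front of it must be HTTPS (HTTP/2).
print(tg["Protocol"], tg.get("ProtocolVersion"))
print(tg["HealthCheckProtocol"], tg.get("HealthCheckPath"))
```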
🔥 Scenario 6 – Sudden spike causes widespread 503 at ALB
Answer: Check ALB 5xx, rejected connection, and target connection error metrics. Then pod HPA and cluster autoscaler events. It is likely scaling lag, so I'd check pending pods and node capacity. Short term: increase replicas manually. Long term: tune autoscaling and pre-scale for peak windows. Lesson: The LB exposes scaling lag instantly.
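A sketch with the official kubernetes Python client to make the scaling lag visible (pending pods and FailedScheduling events), assuming kubeconfig access to the cluster:

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

# Pods the scheduler cannot place yet = load the ALB is already receiving.
pending = v1.list_pod_for_all_namespaces(field_selector="status.phase=Pending")
for pod in pending.items:
    print("PENDING", pod.metadata.namespace, pod.metadata.name)

# FailedScheduling events usually mean the cluster autoscaler is still adding nodes.
events = v1.list_event_for_all_namespaces(field_selector="reason=FailedScheduling")
for ev in events.items[-10:]:
    print(ev.last_timestamp, ev.involved_object.name, ev.message)
```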
🔥 Scenario 7 – Users in one AZ failing, others fine
Answer: Check target distribution by AZ and the cross-zone load balancing setting. If cross-zone is disabled and that AZ lost its targets, traffic routed there fails. Also check for a subnet/NACL issue in that AZ. Lesson: AZ imbalance + cross-zone off = partial outage.
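A boto3 sketch of the per-AZ check, assuming a placeholder target group ARN; the cross-zone attribute shown is the target-group-level setting, so treat that key as an assumption if your setup controls it at the load balancer instead.

```python
import boto3
from collections import Counter

elbv2 = boto3.client("elbv2")
TG_ARN = "arn:aws:elasticloadbalancing:eu-west-1:123456789012:targetgroup/api/abc"  # assumption

# Count healthy targets per AZ: a zone with zero explains the partial outage.
health = elbv2.describe_target_health(TargetGroupArn=TG_ARN)
per_az = Counter(
    d["Target"].get("AvailabilityZone", "unknown")
    for d in health["TargetHealthDescriptions"]
    if d["TargetHealth"]["State"] == "healthy"
)
print(per_az)

# Cross-zone setting: if it's off and an AZ has no healthy targets, that AZ fails.
attrs = elbv2.describe_target_group_attributes(TargetGroupArn=TG_ARN)
for a in attrs["Attributes"]:
    if a["Key"] == "load_balancing.cross_zone.enabled":
        print("cross-zone:", a["Value"])
```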
🔥 Scenario 8 – CloudFront serving stale fintech rates data after update
Answer: Cache TTL too high, or the cache key is missing a query/version parameter. I'd verify the cache policy and invalidation status. I recommend versioned object keys instead of invalidation for dynamic-but-cacheable data. Lesson: Cache design is a data correctness risk in fintech.
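A boto3 sketch contrasting the two options, with a placeholder distribution ID and path:

```python
import time
import boto3

cf = boto3.client("cloudfront")
DIST_ID = "E1234EXAMPLE"  # assumption: distribution ID

# Option 1: explicit invalidation after a rates update (takes minutes, billed per path).
inv = cf.create_invalidation(
    DistributionId=DIST_ID,
    InvalidationBatch={
        "Paths": {"Quantity": 1, "Items": ["/rates/*"]},
        "CallerReference": str(time.time()),
    },
)
print(inv["Invalidation"]["Status"])  # "InProgress" until the edges are purged

# Option 2 (preferred for rates): publish to a versioned key such as
# /rates/v2024-06-07T10-00/fx.json so stale objects are never served at all.
```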
🔥 Scenario 9 – API Gateway + Lambda: intermittent 504 errors
Answer: Check Lambda duration vs API Gateway integration timeout. Look for cold start spikes or downstream DB slowness. Verify Lambda concurrency throttling. Also check VPC-attached Lambda ENI exhaustion. Lesson: Gateway timeout < backend timeout = false failure.
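A boto3 sketch of the timeout comparison for a REST API integration, assuming placeholder function name, API ID, and resource ID:

```python
import boto3

lam = boto3.client("lambda")
apigw = boto3.client("apigateway")

# Assumptions: function name and REST API / resource IDs are illustrative.
fn = lam.get_function_configuration(FunctionName="rates-handler")
integ = apigw.get_integration(
    restApiId="a1b2c3d4e5", resourceId="abc123", httpMethod="GET"
)

lambda_timeout_ms = fn["Timeout"] * 1000
gateway_timeout_ms = integ["timeoutInMillis"]  # REST API default is 29000 ms

if gateway_timeout_ms < lambda_timeout_ms:
    # The gateway gives up first and returns 504 while the Lambda keeps running.
    print(f"504 risk: gateway {gateway_timeout_ms} ms < lambda {lambda_timeout_ms} ms")
```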
🔥 Scenario 10 – Partner says your API IP changed and they are blocked
Answer: If using an ALB, the IPs are not static; that's expected. For allowlist partners, the design should use an NLB with Elastic IPs or Global Accelerator. Lesson: IP allowlist requirements drive LB choice.
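A boto3 sketch of the static-IP design, assuming a placeholder subnet; a production NLB would map one Elastic IP per AZ, this shows a single mapping for brevity.

```python
import boto3

ec2 = boto3.client("ec2")
elbv2 = boto3.client("elbv2")

# Allocate an Elastic IP and pin it to the NLB's subnet mapping.
eip = ec2.allocate_address(Domain="vpc")
nlb = elbv2.create_load_balancer(
    Name="partner-api-nlb",
    Type="network",
    Scheme="internet-facing",
    SubnetMappings=[{"SubnetId": "subnet-0abc123",            # assumption: subnet ID
                     "AllocationId": eip["AllocationId"]}],
)
print(nlb["LoadBalancers"][0]["DNSName"], eip["PublicIp"])
# The Elastic IP never changes, so partners can safely allowlist it.
```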