
🧨 Fintech Traffic Failure War-Game: 10 Deep Scenarios


🔥 Scenario 1: Users getting random 502s from CloudFront, origin is ALB

What the interviewer wants: an edge-vs-origin debugging flow.

Answer (how you respond): First I check CloudFront metrics: edge error rate vs origin error rate. If errors come only from specific paths, I check behavior routing and origin mapping. Then I check ALB target health and 5xx metrics. If the ALB is healthy, I check for a mismatch between the backend response time and the CloudFront origin response timeout. I also check header forwarding rules; a missing Host header can break backend routing. Lesson: an edge 502 doesn't always mean the origin is down; it's often a config mismatch.
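
A minimal boto3 sketch of that first split (edge view vs origin view), assuming you have the distribution ID and target group ARN at hand; the IDs and ARNs below are placeholders. Note that CloudFront metrics are published in us-east-1.

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # CloudFront metrics live here
elbv2 = boto3.client("elbv2")

end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)

# 1) Edge-side view: CloudFront 5xx error rate for the distribution
edge_5xx = cloudwatch.get_metric_statistics(
    Namespace="AWS/CloudFront",
    MetricName="5xxErrorRate",
    Dimensions=[
        {"Name": "DistributionId", "Value": "E1EXAMPLE"},   # placeholder
        {"Name": "Region", "Value": "Global"},
    ],
    StartTime=start, EndTime=end, Period=300, Statistics=["Average"],
)
print("CloudFront 5xx rate:", [p["Average"] for p in edge_5xx["Datapoints"]])

# 2) Origin-side view: are the ALB targets actually healthy?
health = elbv2.describe_target_health(
    TargetGroupArn="arn:aws:elasticloadbalancing:...:targetgroup/app/abc"  # placeholder
)
for t in health["TargetHealthDescriptions"]:
    print(t["Target"]["Id"], t["TargetHealth"]["State"], t["TargetHealth"].get("Reason"))
```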


🔥 Scenario 2: API Gateway suddenly returning throttling errors for one partner only

Answer: I check the usage plan and API key limits first, then stage-level throttling settings, then CloudWatch metrics per API key. If only one partner is affected, it's usually a usage-plan burst or rate limit being exceeded. I'd confirm whether their traffic pattern changed. Lesson: API Gateway throttling is often policy, not outage.
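
A quick boto3 sketch of that usage-plan check, with placeholder plan and key IDs; get_usage returns per-day [used, remaining] counts for the partner's key.

```python
import boto3

apigw = boto3.client("apigateway")

# Throttle (rate/burst) and quota settings per usage plan
for plan in apigw.get_usage_plans()["items"]:
    print(plan["name"], "throttle:", plan.get("throttle"), "quota:", plan.get("quota"))

# Daily consumption for the affected partner's API key
usage = apigw.get_usage(
    usagePlanId="abc123",          # placeholder usage plan ID
    keyId="partner-key-id",        # placeholder API key ID
    startDate="2024-06-01",
    endDate="2024-06-07",
)
# items maps API key ID -> per-day [used, remaining] pairs
print(usage.get("items", {}))
```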


🔥 Scenario 3: ALB shows healthy targets but users see timeouts

Answer: The health check endpoint may be shallow. I verify that the readiness endpoint actually checks DB/queue dependencies, then look at target response time metrics, then compare the ALB idle timeout against backend response time. I also check connection saturation on the pods. Lesson: healthy ≠ ready for real traffic.
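
To make "readiness actually checks dependencies" concrete, here is an illustrative deep readiness probe, not the service's real code; the Flask framework, hostnames, ports, and the tcp_check helper are all assumptions for the sketch.

```python
import socket
from flask import Flask, jsonify

app = Flask(__name__)

def tcp_check(host: str, port: int, timeout: float = 0.5) -> bool:
    """Cheap dependency probe: can we open a TCP connection in time?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

@app.route("/ready")
def ready():
    checks = {
        "db": tcp_check("db.internal", 5432),       # placeholder DB host
        "queue": tcp_check("queue.internal", 6379), # placeholder queue host
    }
    ok = all(checks.values())
    # Point the ALB health check here, not at a handler that always returns 200
    return jsonify(status="ready" if ok else "not-ready", checks=checks), (200 if ok else 503)
```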


🔥 Scenario 4: After deploy, traffic drops only for large file uploads

Answer: I suspect ALB buffering plus timeout. I check the ALB idle timeout and the max body size at the ingress/app. If CloudFront is in front, I also check its behavior settings for large payload forwarding. I'd recommend switching large uploads to S3 pre-signed URLs. Lesson: LB buffering plus upload size is a classic failure point.
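
A minimal boto3 sketch of the pre-signed URL approach, so large uploads bypass the ALB entirely; bucket and key names are placeholders.

```python
import boto3

s3 = boto3.client("s3")

upload_url = s3.generate_presigned_url(
    "put_object",
    Params={"Bucket": "fintech-uploads", "Key": "statements/2024/upload-123.pdf"},  # placeholders
    ExpiresIn=900,  # URL valid for 15 minutes
)
print(upload_url)
# The client PUTs the file body directly to this URL; the app only gets
# involved again once the object lands (e.g. via an S3 event notification).
```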


🔥 Scenario 5: gRPC service behind ALB failing intermittently

Answer: ALB gRPC support is feature-dependent and sensitive to configuration. I check the target group protocol version, health checks, and whether HTTP/2 is preserved end-to-end. Often an NLB is more stable for gRPC. Lesson: the wrong LB type for the protocol causes flaky behavior.
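
A small boto3 check I'd run first, assuming the service sits behind ALB target groups; the load balancer ARN is a placeholder. For gRPC, the target group's ProtocolVersion must be GRPC and the health check should hit a gRPC health method.

```python
import boto3

elbv2 = boto3.client("elbv2")

tgs = elbv2.describe_target_groups(
    LoadBalancerArn="arn:aws:elasticloadbalancing:...:loadbalancer/app/api/abc"  # placeholder
)
for tg in tgs["TargetGroups"]:
    # ProtocolVersion is GRPC, HTTP2, or HTTP1 for ALB target groups;
    # a gRPC backend behind an HTTP1 target group will fail intermittently.
    print(tg["TargetGroupName"], tg.get("ProtocolVersion"), tg.get("HealthCheckPath"))
```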


🔥 Scenario 6: Sudden spike causes widespread 503s at the ALB

Answer: Check ALB 503 and rejected connection metrics, plus target connection errors. Then check pod HPA and cluster autoscaler events; the likely cause is scaling lag. I'd check pending pods and node capacity. Short term: increase replicas manually. Long term: tune autoscaling and pre-scale for peak windows. Lesson: the LB exposes scaling lag instantly.
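
A boto3 sketch for quantifying the LB-generated 503s before digging into HPA and autoscaler events; the load balancer dimension value is a placeholder.

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")
end = datetime.now(timezone.utc)
start = end - timedelta(minutes=30)

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/ApplicationELB",
    MetricName="HTTPCode_ELB_503_Count",  # 503s generated by the ALB itself
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/fintech-api/50dc6c495c0c9188"}],  # placeholder
    StartTime=start, EndTime=end, Period=60, Statistics=["Sum"],
)
for p in sorted(resp["Datapoints"], key=lambda d: d["Timestamp"]):
    print(p["Timestamp"], int(p["Sum"]))
# If the spike lines up with pending pods and no spare nodes,
# it's scaling lag, not an application bug.
```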


🔥 Scenario 7: Users in one AZ failing, others fine

Answer: Check target distribution by AZ and the cross-zone load balancing setting. If cross-zone is disabled and that AZ lost its targets, traffic routed there fails. Also check for a subnet/NACL issue in that AZ. Lesson: AZ imbalance plus cross-zone off equals a partial outage.
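
A boto3 sketch of both checks; the ARNs are placeholders, and the cross-zone attribute shown is the load-balancer-level one used by NLB (for ALB the equivalent setting lives on the target group).

```python
import boto3
from collections import Counter

elbv2 = boto3.client("elbv2")

attrs = elbv2.describe_load_balancer_attributes(
    LoadBalancerArn="arn:aws:elasticloadbalancing:...:loadbalancer/net/api/abc"  # placeholder
)
for a in attrs["Attributes"]:
    if a["Key"] == "load_balancing.cross_zone.enabled":
        print("cross-zone enabled:", a["Value"])

health = elbv2.describe_target_health(
    TargetGroupArn="arn:aws:elasticloadbalancing:...:targetgroup/api/abc"  # placeholder
)
per_az = Counter(
    t["Target"].get("AvailabilityZone", "unknown")
    for t in health["TargetHealthDescriptions"]
    if t["TargetHealth"]["State"] == "healthy"
)
print("healthy targets per AZ:", dict(per_az))  # an empty AZ + cross-zone off = partial outage
```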


🔥 Scenario 8: CloudFront serving stale fintech rates data after update

Answer: The cache TTL is too high or the cache key is missing a query/version parameter. I'd verify the cache policy and invalidation status. I recommend versioned object keys instead of invalidations for dynamic-but-cacheable data. Lesson: cache design is a data-correctness risk in fintech.
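
A boto3 sketch of both the stopgap (invalidation) and the recommended pattern (versioned keys); the distribution ID, bucket, and object keys are placeholders.

```python
import time
import boto3

cloudfront = boto3.client("cloudfront")
s3 = boto3.client("s3")

# Stopgap: force the stale object out of the edge caches
cloudfront.create_invalidation(
    DistributionId="E1EXAMPLE",  # placeholder
    InvalidationBatch={
        "Paths": {"Quantity": 1, "Items": ["/rates/latest.json"]},
        "CallerReference": str(time.time()),
    },
)

# Durable fix: publish each rates snapshot under a new key, so the URL itself
# changes and there is nothing stale to invalidate
version = int(time.time())
s3.put_object(
    Bucket="fintech-rates",                 # placeholder bucket
    Key=f"rates/{version}/rates.json",
    Body=b'{"usd_eur": 0.92}',
    ContentType="application/json",
)
```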


🔥 Scenario 9: API Gateway + Lambda, intermittent 504 errors

Answer: Check Lambda duration against the API Gateway integration timeout. Look for cold start spikes or downstream DB slowness. Verify Lambda concurrency throttling. For VPC-attached Lambdas, also check ENI/IP exhaustion. Lesson: a gateway timeout shorter than the backend timeout produces false failures.
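
A boto3 sketch of that timeout comparison; the function name, REST API ID, and resource ID are placeholders.

```python
import boto3

lam = boto3.client("lambda")
apigw = boto3.client("apigateway")

# Lambda's own timeout, in seconds
fn = lam.get_function_configuration(FunctionName="quotes-handler")  # placeholder
lambda_timeout_s = fn["Timeout"]

# API Gateway integration timeout for one method, in milliseconds
integration = apigw.get_integration(
    restApiId="a1b2c3",        # placeholder
    resourceId="resource-id",  # placeholder
    httpMethod="GET",
)
gw_timeout_s = integration["timeoutInMillis"] / 1000

print(f"Lambda timeout: {lambda_timeout_s}s, integration timeout: {gw_timeout_s}s")
if lambda_timeout_s > gw_timeout_s:
    print("Gateway will return 504 while the Lambda is still running.")
```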


🔥 Scenario 10: Partner says your API IP changed and they are blocked

Answer: If they're hitting an ALB, the IPs are not static; that's expected. For partners who need an allowlist, the design should use an NLB with Elastic IPs or Global Accelerator. Lesson: IP allowlist requirements drive the LB choice.
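
A boto3 sketch of the NLB-with-Elastic-IP option; the subnet and allocation IDs are placeholders.

```python
import boto3

elbv2 = boto3.client("elbv2")

# Internet-facing NLB with one Elastic IP per subnet: partners get stable
# addresses they can allowlist, unlike the rotating IPs behind an ALB DNS name.
nlb = elbv2.create_load_balancer(
    Name="partner-api-nlb",
    Type="network",
    Scheme="internet-facing",
    SubnetMappings=[
        {"SubnetId": "subnet-aaa111", "AllocationId": "eipalloc-111111"},  # placeholders
        {"SubnetId": "subnet-bbb222", "AllocationId": "eipalloc-222222"},
    ],
)
print(nlb["LoadBalancers"][0]["DNSName"])
# Alternative: keep the ALB and put AWS Global Accelerator in front of it,
# which also gives two static anycast IPs partners can allowlist.
```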


