🔴 Round 3 — Scenario-Based EKS / Kubernetes (Advanced–Production Critical)
✅ Q1 — Sudden Traffic Spike, Cluster Meltdown
Question: Traffic spiked 5×. HPA scaled pods, Cluster Autoscaler added nodes, but the app is still timing out. Why can this still fail?
Answer: The bottleneck may be the database, a downstream API, or connection pool limits rather than pod capacity. Pod startup time and image pull delays also add lag before new replicas can serve traffic. Check service saturation, DB connections, and queue depth; scaling compute doesn't fix external bottlenecks. If the real constraint is a queue or downstream dependency, scale on that signal instead of CPU (see the sketch below).
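A minimal sketch of scaling on the bottleneck signal instead of CPU, assuming an external metrics adapter (e.g. KEDA or a custom adapter) already exposes queue depth; the metric name sqs_queue_depth and the Deployment name worker are placeholders:

```yaml
# HPA v2: scale the worker Deployment on queue depth rather than CPU.
# Assumes a metrics adapter exposes the external metric "sqs_queue_depth".
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: worker-queue-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: worker               # hypothetical Deployment name
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: External
      external:
        metric:
          name: sqs_queue_depth
        target:
          type: AverageValue
          averageValue: "30"   # aim for ~30 queued messages per pod
```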
✅ Q2 — Multi-AZ Cluster, App Still Went Down When One AZ Failed
Question: Nodes were spread across 3 AZs, but the outage still happened. What is the likely design mistake?
Answer: Pods were likely scheduled into a single AZ because no topology spread or anti-affinity rules were defined. EBS volumes are also AZ-bound, so a StatefulSet can be pinned to the failed AZ. You need topologySpreadConstraints plus a multi-AZ storage strategy (sketch below).
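A minimal sketch of zone-aware spreading; the app: web labels and image are placeholders:

```yaml
# Spread replicas evenly across availability zones.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule   # hard rule: no AZ pile-up
          labelSelector:
            matchLabels:
              app: web
      containers:
        - name: web
          image: nginx:1.25                  # placeholder image
```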
✅ Q3 — Cluster Autoscaler Causing Cost Explosion
Question: The autoscaler keeps adding nodes every night because of batch jobs. How do you control this?
Answer: Put batch work on a separate node group with taints/tolerations so it can't trigger scale-up of the critical node groups. Set a max size on that node group and get pod resource requests right. Scheduled scaling or Karpenter with constraints also helps. Don't mix batch and critical workloads (example below).
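A rough sketch, assuming the batch node group was created with the taint workload=batch:NoSchedule and the label node-group=batch (both names are made up for illustration):

```yaml
# Toleration + node selector so only batch Jobs land on the tainted
# batch node group, and its scale-up never touches critical node groups.
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-report
spec:
  template:
    spec:
      nodeSelector:
        node-group: batch            # placeholder node-group label
      tolerations:
        - key: workload
          operator: Equal
          value: batch
          effect: NoSchedule
      containers:
        - name: report
          image: my-registry/report:latest   # placeholder image
          resources:
            requests:
              cpu: "500m"
              memory: 1Gi
      restartPolicy: Never
```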
✅ Q4 — Pods Frequently Evicted Under Load
Question: Pods are getting evicted, not crashing. Why?
Answer: Node memory pressure or ephemeral-storage pressure. Check node conditions with kubectl describe node. Under pressure, the kubelet evicts pods whose usage exceeds their requests first, which is exactly what happens when requests are set too low and limits too high. Fix it with realistic requests and proper node sizing (example below).
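A sketch of realistic requests/limits; the numbers are illustrative only, not a sizing recommendation:

```yaml
# Requests close to real usage, so the scheduler packs nodes honestly
# and this pod is not first in line for eviction under node pressure.
apiVersion: v1
kind: Pod
metadata:
  name: api
spec:
  containers:
    - name: api
      image: my-registry/api:1.0     # placeholder image
      resources:
        requests:
          cpu: "500m"
          memory: 512Mi
          ephemeral-storage: 1Gi
        limits:
          cpu: "1"
          memory: 768Mi              # limit close to request = less overcommit
          ephemeral-storage: 2Gi
```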
✅ Q5 — Need Pod-Level Network Security Like EC2 SG
Question: How do you implement that in EKS?
Answer: Use the Security Groups for Pods feature: a SecurityGroupPolicy resource selects pods and attaches security groups to them through branch ENIs (ENI trunking). Useful for database or regulated access patterns. Requires advanced VPC CNI configuration (ENABLE_POD_ENI) and supported Nitro instance types (example below).
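A hedged sketch of the SecurityGroupPolicy resource this feature uses; the namespace, labels, and security group ID are placeholders:

```yaml
# Attach a dedicated security group to matching pods via branch ENIs.
# Requires the VPC CNI with ENABLE_POD_ENI=true on Nitro-based nodes.
apiVersion: vpcresources.k8s.aws/v1beta1
kind: SecurityGroupPolicy
metadata:
  name: db-access
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payments-api
  securityGroups:
    groupIds:
      - sg-0123456789abcdef0   # placeholder SG allowing access to the DB
```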
✅ Q6 — Large Cluster, Scheduling Slow
Question: Pod scheduling is taking minutes. Why?
Answer: Too many complex affinity/anti-affinity rules slow the scheduler; required podAntiAffinity in particular forces expensive pairwise checks across the cluster. A very large cluster with an under-resourced scheduler shows the same symptom. Reduce rule complexity and prefer topologySpreadConstraints over heavy anti-affinity (see the snippet below).
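A lighter-weight alternative to required anti-affinity, shown as a pod-template snippet; the app: web label is a placeholder:

```yaml
# Soft spread: the scheduler balances pods across nodes on a best-effort
# basis instead of enforcing a hard anti-affinity rule, which is cheaper.
topologySpreadConstraints:
  - maxSkew: 2
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway   # best effort, never blocks a pod
    labelSelector:
      matchLabels:
        app: web                        # placeholder label
```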
✅ Q7 — Production Outage After ConfigMap Change
Question: A ConfigMap was updated, all pods restarted, and the app broke. What is the safer pattern?
Answer: Roll out config as a new, versioned ConfigMap name and point the Deployment at it, so the change goes through a normal (or canary) rollout and can be rolled back. Never mutate a shared ConfigMap live for critical apps; version configs like code (example below).
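A sketch of the versioned-ConfigMap pattern; names and values are placeholders:

```yaml
# Versioned, immutable ConfigMap. A config change means creating
# app-config-v43 and pointing the Deployment at it, which triggers a
# normal rolling (or canary) rollout instead of an in-place mutation.
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config-v42        # version in the name
immutable: true               # cluster rejects later edits to this object
data:
  FEATURE_FLAG: "off"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: app
  template:
    metadata:
      labels:
        app: app
    spec:
      containers:
        - name: app
          image: my-registry/app:1.4     # placeholder image
          envFrom:
            - configMapRef:
                name: app-config-v42     # bump this name to roll out config
```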
✅ Q8 — Need Guaranteed Capacity for Critical App
Question: In a mixed Spot + On-Demand cluster, how do you ensure critical pods never land on Spot?
Answer: Taint the Spot nodes and add tolerations only to non-critical workloads. Pin critical workloads to the On-Demand node group with node affinity, and give them a PriorityClass so they win when capacity is tight (example below).
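A sketch combining node affinity and PriorityClass, assuming the Spot node group is tainted and relying on the standard EKS capacityType node label; names and values are placeholders:

```yaml
# Critical workload pinned to On-Demand capacity.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-service
value: 100000
globalDefault: false
description: "Business-critical workloads; preempt lower-priority pods."
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payments
  template:
    metadata:
      labels:
        app: payments
    spec:
      priorityClassName: critical-service
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: eks.amazonaws.com/capacityType
                    operator: In
                    values: ["ON_DEMAND"]   # never schedule onto SPOT nodes
      containers:
        - name: payments
          image: my-registry/payments:2.1   # placeholder image
```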
✅ Q9 — API Server Rate Limiting During Incident
Question: kubectl and controllers are timing out because the API server is throttling. What is the cause?
Answer: Too many controllers/operators, or badly written reconcile loops hammering the API server, plus massive watch/list traffic during scale storms. Reduce the controller count, fix hot loops, and tune client QPS/burst settings (sketch below).
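One concrete place this knob exists is the kubelet's own API client; controllers built on client-go expose the same idea through the rest.Config QPS and Burst fields. The values below are illustrative only:

```yaml
# Kubelet-side example of the client QPS/burst knob.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
kubeAPIQPS: 50      # steady-state requests per second to the API server
kubeAPIBurst: 100   # short burst allowance above the steady rate
```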
✅ Q10 — Need Safe Multi-Cluster Strategy
Question: When do you choose multi-cluster over multi-namespace?
Answer: Use multi-cluster for blast-radius isolation, compliance boundaries, or environment separation. Namespaces are not hard isolation: they share the control plane, nodes, and cluster-scoped resources like CRDs. In most real organizations, production and staging should be separate clusters.