🔴 Round 3 — Scenario-Based EKS / Kubernetes (Advanced–Production Critical)
✅ Q1 — Sudden Traffic Spike, Cluster Meltdown
Question: Traffic spiked 5×. HPA scaled pods, Cluster Autoscaler added nodes, but the app is still timing out. Why can this still fail?
Answer: The bottleneck may be the database, a downstream API, or connection pool limits rather than pod capacity. Pod startup time and image pull delays also add lag before new replicas can serve traffic. Check service saturation, DB connections, and queue depth; scaling compute doesn't fix external bottlenecks. If the real constraint is a queue or downstream dependency, scale on that signal instead of CPU (see the sketch below).
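A minimal sketch of scaling on the bottleneck signal instead of CPU, assuming an external metrics adapter (e.g. KEDA or a custom adapter) already exposes queue depth; the metric name sqs_queue_depth and the Deployment name worker are placeholders:

```yaml
# HPA v2: scale the worker Deployment on queue depth rather than CPU.
# Assumes a metrics adapter exposes the external metric "sqs_queue_depth".
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: worker-queue-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: worker               # hypothetical Deployment name
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: External
      external:
        metric:
          name: sqs_queue_depth
        target:
          type: AverageValue
          averageValue: "30"   # aim for ~30 queued messages per pod
```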
✅ Q2 — Multi-AZ Cluster, App Still Went Down When One AZ Failed
Question: Nodes were spread across 3 AZs, but the outage still happened. What is the likely design mistake?
Answer: Pods were likely scheduled into a single AZ because no topology spread or anti-affinity rules were defined. EBS volumes are also AZ-bound, so a StatefulSet can be pinned to the failed AZ. You need topologySpreadConstraints plus a multi-AZ storage strategy (sketch below).
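A minimal sketch of zone-aware spreading; the app: web labels and image are placeholders:

```yaml
# Spread replicas evenly across availability zones.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule   # hard rule: no AZ pile-up
          labelSelector:
            matchLabels:
              app: web
      containers:
        - name: web
          image: nginx:1.25                  # placeholder image
```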
✅ Q3 — Cluster Autoscaler Causing Cost Explosion
Question: The autoscaler keeps adding nodes every night because of batch jobs. How do you control this?
Answer: Put batch work on a separate node group with taints/tolerations so it can't trigger scale-up of the critical node groups. Set a max size on that node group and get pod resource requests right. Scheduled scaling or Karpenter with constraints also helps. Don't mix batch and critical workloads (example below).
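A rough sketch, assuming the batch node group was created with the taint workload=batch:NoSchedule and the label node-group=batch (both names are made up for illustration):

```yaml
# Toleration + node selector so only batch Jobs land on the tainted
# batch node group, and its scale-up never touches critical node groups.
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-report
spec:
  template:
    spec:
      nodeSelector:
        node-group: batch            # placeholder node-group label
      tolerations:
        - key: workload
          operator: Equal
          value: batch
          effect: NoSchedule
      containers:
        - name: report
          image: my-registry/report:latest   # placeholder image
          resources:
            requests:
              cpu: "500m"
              memory: 1Gi
      restartPolicy: Never
```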
✅ Q4 — Pods Frequently Evicted Under Load
Question: Pods are getting evicted, not crashing. Why?
Answer: Node memory pressure or ephemeral-storage pressure. Check node conditions with kubectl describe node. Under pressure, the kubelet evicts pods whose usage exceeds their requests first, which is exactly what happens when requests are set too low and limits too high. Fix it with realistic requests and proper node sizing (example below).
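A sketch of realistic requests/limits; the numbers are illustrative only, not a sizing recommendation:

```yaml
# Requests close to real usage, so the scheduler packs nodes honestly
# and this pod is not first in line for eviction under node pressure.
apiVersion: v1
kind: Pod
metadata:
  name: api
spec:
  containers:
    - name: api
      image: my-registry/api:1.0     # placeholder image
      resources:
        requests:
          cpu: "500m"
          memory: 512Mi
          ephemeral-storage: 1Gi
        limits:
          cpu: "1"
          memory: 768Mi              # limit close to request = less overcommit
          ephemeral-storage: 2Gi
```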
✅ Q5 — Need Pod-Level Network Security Like EC2 SG
Question: How do you implement that in EKS?
Answer: Use the Security Groups for Pods feature: a SecurityGroupPolicy resource selects pods and attaches security groups to them through branch ENIs (ENI trunking). Useful for database or regulated access patterns. Requires advanced VPC CNI configuration (ENABLE_POD_ENI) and supported Nitro instance types (example below).
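A hedged sketch of the SecurityGroupPolicy resource this feature uses; the namespace, labels, and security group ID are placeholders:

```yaml
# Attach a dedicated security group to matching pods via branch ENIs.
# Requires the VPC CNI with ENABLE_POD_ENI=true on Nitro-based nodes.
apiVersion: vpcresources.k8s.aws/v1beta1
kind: SecurityGroupPolicy
metadata:
  name: db-access
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payments-api
  securityGroups:
    groupIds:
      - sg-0123456789abcdef0   # placeholder SG allowing access to the DB
```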
✅ Q6 — Large Cluster, Scheduling Slow
Question: Pod scheduling is taking minutes. Why?
Answer: Too many complex affinity/anti-affinity rules slow the scheduler; required podAntiAffinity in particular forces expensive pairwise checks across the cluster. A very large cluster with an under-resourced scheduler shows the same symptom. Reduce rule complexity and prefer topologySpreadConstraints over heavy anti-affinity (see the snippet below).
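A lighter-weight alternative to required anti-affinity, shown as a pod-template snippet; the app: web label is a placeholder:

```yaml
# Soft spread: the scheduler balances pods across nodes on a best-effort
# basis instead of enforcing a hard anti-affinity rule, which is cheaper.
topologySpreadConstraints:
  - maxSkew: 2
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway   # best effort, never blocks a pod
    labelSelector:
      matchLabels:
        app: web                        # placeholder label
```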
✅ Q7 — Production Outage After ConfigMap Change
Question: A ConfigMap was updated, all pods restarted, and the app broke. What is the safer pattern?
Answer: Roll out config as a new, versioned ConfigMap name and point the Deployment at it, so the change goes through a normal (or canary) rollout and can be rolled back. Never mutate a shared ConfigMap live for critical apps; version configs like code (example below).
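A sketch of the versioned-ConfigMap pattern; names and values are placeholders:

```yaml
# Versioned, immutable ConfigMap. A config change means creating
# app-config-v43 and pointing the Deployment at it, which triggers a
# normal rolling (or canary) rollout instead of an in-place mutation.
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config-v42        # version in the name
immutable: true               # cluster rejects later edits to this object
data:
  FEATURE_FLAG: "off"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: app
  template:
    metadata:
      labels:
        app: app
    spec:
      containers:
        - name: app
          image: my-registry/app:1.4     # placeholder image
          envFrom:
            - configMapRef:
                name: app-config-v42     # bump this name to roll out config
```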
✅ Q8 — Need Guaranteed Capacity for Critical App
Question: In a mixed Spot + On-Demand cluster, how do you ensure critical pods never land on Spot?
Answer: Taint the Spot nodes and add tolerations only to non-critical workloads. Pin critical workloads to the On-Demand node group with node affinity, and give them a PriorityClass so they win when capacity is tight (example below).
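A sketch combining node affinity and PriorityClass, assuming the Spot node group is tainted and relying on the standard EKS capacityType node label; names and values are placeholders:

```yaml
# Critical workload pinned to On-Demand capacity.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-service
value: 100000
globalDefault: false
description: "Business-critical workloads; preempt lower-priority pods."
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payments
  template:
    metadata:
      labels:
        app: payments
    spec:
      priorityClassName: critical-service
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: eks.amazonaws.com/capacityType
                    operator: In
                    values: ["ON_DEMAND"]   # never schedule onto SPOT nodes
      containers:
        - name: payments
          image: my-registry/payments:2.1   # placeholder image
```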
✅ Q9 — API Server Rate Limiting During Incident
Question: kubectl and controllers are timing out because the API server is throttling. What is the cause?
Answer: Too many controllers/operators, or badly written reconcile loops hammering the API server, plus massive watch/list traffic during scale storms. Reduce the controller count, fix hot loops, and tune client QPS/burst settings (sketch below).
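One concrete place this knob exists is the kubelet's own API client; controllers built on client-go expose the same idea through the rest.Config QPS and Burst fields. The values below are illustrative only:

```yaml
# Kubelet-side example of the client QPS/burst knob.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
kubeAPIQPS: 50      # steady-state requests per second to the API server
kubeAPIBurst: 100   # short burst allowance above the steady rate
```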
✅ Q10 — Need Safe Multi-Cluster Strategy
Question: When do you choose multi-cluster over multi-namespace?
Answer: Use multi-cluster for blast-radius isolation, compliance boundaries, or environment separation. Namespaces are not hard isolation: they share the control plane, nodes, and cluster-scoped resources like CRDs. In most real organizations, production and staging should be separate clusters.