🟡 Round 2 — Scenario-Based EKS / Kubernetes (Intermediate–Production)
✅ Q1 — Cluster Autoscaler didn’t scale nodes
Question: Pods are Pending due to insufficient CPU, but Cluster Autoscaler didn’t add nodes. Why?
Answer:
Check whether the Pending pods have a nodeSelector, affinity, or required toleration that no node group can satisfy; CA only scales a group if its scheduling simulation shows the pod would fit there. Also check the CA logs and the ASG auto-discovery tags (k8s.io/cluster-autoscaler/enabled plus the per-cluster tag), and confirm the node group hasn't already reached its max size, because at max CA won't add nodes.
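A quick triage sketch, assuming CA is installed as deploy/cluster-autoscaler in kube-system (adjust names to your install):

```bash
# Why did CA skip scale-up? NotTriggerScaleUp events name the blocking predicate
kubectl -n kube-system logs deploy/cluster-autoscaler --tail=300 | grep -i "NotTriggerScaleUp\|max size"
# And why is the pod itself unschedulable? (pod name is a placeholder)
kubectl describe pod <pending-pod>   # read the Events section at the bottom
```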
✅ Q2 — Pods can’t get IPs
Question: New pods fail with CNI IP allocation errors.
Answer:
Likely VPC CNI IP exhaustion: each instance type caps ENIs and IPs per ENI, so nodes (or the pod subnets) can run out of assignable addresses. Check the aws-node DaemonSet logs with kubectl logs -n kube-system -l k8s-app=aws-node. Fix by enabling prefix delegation, using larger instance types, or adding nodes, and make sure the subnets still have free IPs.
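A minimal check-and-fix sketch, assuming the stock aws-node DaemonSet labels:

```bash
# Tail CNI logs for "failed to assign an IP address" style errors
kubectl logs -n kube-system -l k8s-app=aws-node --tail=100
# Enable prefix delegation (Nitro instance types only; recycle nodes afterwards to benefit)
kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true
```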
✅ Q3 — Pod has S3 access denied in EKS but works locally
Question: Using IRSA — still AccessDenied. What to verify?
Answer: Check the service account annotation for the IAM role ARN. Verify the IAM role trust policy references the cluster's OIDC provider and the exact namespace/service-account subject. Exec into the pod and check that the projected token file and the AWS_ROLE_ARN / AWS_WEB_IDENTITY_TOKEN_FILE env vars exist. Also confirm the SDK isn't silently falling back to the node IAM role (e.g., the pod started before the annotation was added).
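A minimal IRSA wiring sketch; account ID, role name, and namespace are placeholders:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-sa
  namespace: default
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/app-s3-role
# The role's trust policy must allow sts:AssumeRoleWithWebIdentity with the condition
#   <oidc-provider-url>:sub == system:serviceaccount:default:app-sa
```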
✅ Q4 — Ingress created but wrong target group health
Question: ALB shows targets unhealthy.
Answer: Check for a mismatch between the readiness probe path and the ALB health check path. Verify the Service's targetPort mapping and confirm the pod actually listens on that port. Check that the security groups allow ALB → node/pod traffic on the health check port.
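Sketch of the relevant AWS Load Balancer Controller annotations, assuming a /healthz readiness endpoint and a Service named app:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app
  annotations:
    alb.ingress.kubernetes.io/healthcheck-path: /healthz   # must match the readiness probe path
    alb.ingress.kubernetes.io/healthcheck-port: traffic-port
    alb.ingress.kubernetes.io/success-codes: "200"
spec:
  ingressClassName: alb
  rules:
  - http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: app
            port:
              number: 80
```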
✅ Q5 — Rolling update causes downtime despite replicas=3
Question: Why?
Answer:
No readiness probe, or a misconfigured one, means traffic is sent to pods that aren't ready yet. Also check whether maxUnavailable is set too high. Without a PDB and proper probes, a rolling update can still drop traffic.
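A surge-only rollout sketch; the image, port, and probe path are assumptions:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 3
  selector:
    matchLabels: {app: app}
  strategy:
    rollingUpdate:
      maxUnavailable: 0    # never remove a ready pod before its replacement is ready
      maxSurge: 1
  template:
    metadata:
      labels: {app: app}
    spec:
      containers:
      - name: app
        image: myorg/app:v2
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet: {path: /healthz, port: 8080}
          initialDelaySeconds: 5
          periodSeconds: 5
```

Pair this with a PodDisruptionBudget (e.g., minAvailable: 2) so voluntary disruptions can't drop below quorum either.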
✅ Q6 — Node drain stuck forever
Question: kubectl drain hangs.
Answer:
Usually a PodDisruptionBudget is blocking eviction; check kubectl get pdb -A. DaemonSet pods, and pods with local storage, also block the drain unless handled explicitly. Use --ignore-daemonsets (plus --delete-emptydir-data where appropriate) or relax the PDB.
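A quick sketch of the usual sequence (node name is a placeholder):

```bash
# List PDBs cluster-wide; ALLOWED DISRUPTIONS = 0 is the usual culprit
kubectl get pdb -A
# Drain with the standard guards and a timeout so it fails loudly instead of hanging
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --timeout=5m
```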
✅ Q7 — After EKS upgrade, workloads unstable
Question: What addon mismatch commonly breaks clusters?
Answer: VPC CNI, CoreDNS, and kube-proxy version mismatches. Always upgrade the addons after the control plane upgrade; many outages happen because teams forget the CNI compatibility matrix.
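A sketch for comparing what's supported against what's running (1.30 is an example target version):

```bash
# Which vpc-cni versions does the target control-plane version support?
aws eks describe-addon-versions --kubernetes-version 1.30 --addon-name vpc-cni \
  --query 'addons[].addonVersions[].addonVersion'
# What is actually running on the cluster?
kubectl -n kube-system get ds aws-node -o jsonpath='{.spec.template.spec.containers[0].image}'
```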
✅ Q8 — One service slower than others in same node
Question: Same node, different performance — why?
Answer:
Resource requests/limits differ: a pod with a low CPU limit gets CFS-throttled even when the node has spare capacity. Check the container's cgroup CPU stats or the container_cpu_cfs_throttled_* metrics rather than kubectl describe, which doesn't show throttling. Also consider the noisy-neighbor effect when pods run without limits.
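A quick way to read the throttling counters from inside the container; the path shown is for cgroup v1, while cgroup v2 exposes the same counters at /sys/fs/cgroup/cpu.stat:

```bash
# nr_throttled / throttled_time rising => the CPU limit, not node capacity, is the bottleneck
kubectl exec <pod-name> -- cat /sys/fs/cgroup/cpu/cpu.stat
```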
✅ Q9 — Need pod-to-pod network restriction
Question: How do you block traffic between namespaces?
Answer: Use NetworkPolicies: default deny plus explicit allow rules per namespace/app. Enforcement needs a capable CNI; recent AWS VPC CNI versions can enforce NetworkPolicy natively, while older setups pair it with Calico or Cilium.
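A default-deny sketch for one namespace (namespace name is a placeholder); layer allow policies per app on top of it:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: team-a
spec:
  podSelector: {}      # selects every pod in the namespace
  policyTypes:
  - Ingress            # no ingress rules follow, so all inbound traffic is denied
```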
✅ Q10 — Want zero-downtime node group upgrade
Question: How do you upgrade nodes safely?
Answer: Create a new node group with the new AMI, cordon and drain the old nodes gradually so workloads shift over, then delete the old group. The blue/green node-group strategy is the safest.
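A blue/green rotation sketch; cluster and node-group names are placeholders, and the node label assumes managed node groups:

```bash
# Bring up the replacement group first
eksctl create nodegroup --cluster prod --name ng-new
# Cordon, then drain, the old group's nodes one at a time
for n in $(kubectl get nodes -l eks.amazonaws.com/nodegroup=ng-old -o name); do
  kubectl cordon "$n"
  kubectl drain "$n" --ignore-daemonsets --delete-emptydir-data
done
# Remove the old group once everything is rescheduled and healthy
eksctl delete nodegroup --cluster prod --name ng-old
```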