🟡 Round 2 — Scenario-Based EKS / Kubernetes (Intermediate–Production)
✅ Q1 — Cluster Autoscaler didn’t scale nodes
Question: Pods are Pending due to insufficient CPU, but Cluster Autoscaler didn’t add nodes. Why?
Answer:
Check whether the Pending pods have a nodeSelector, affinity, or required toleration that no node group can satisfy; CA only scales a group if its scheduling simulation shows the pod would fit there. Also check the CA logs and the ASG auto-discovery tags (k8s.io/cluster-autoscaler/enabled plus the per-cluster tag), and confirm the node group hasn't already reached its max size, because at max CA won't add nodes.
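A quick triage sketch, assuming CA is installed as deploy/cluster-autoscaler in kube-system (adjust names to your install):

```bash
# Why did CA skip scale-up? NotTriggerScaleUp events name the blocking predicate
kubectl -n kube-system logs deploy/cluster-autoscaler --tail=300 | grep -i "NotTriggerScaleUp\|max size"
# And why is the pod itself unschedulable? (pod name is a placeholder)
kubectl describe pod <pending-pod>   # read the Events section at the bottom
```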
✅ Q2 — Pods can’t get IPs
Question: New pods fail with CNI IP allocation errors.
Answer:
Likely VPC CNI IP exhaustion: each instance type caps ENIs and IPs per ENI, so nodes (or the pod subnets) can run out of assignable addresses. Check the aws-node DaemonSet logs with kubectl logs -n kube-system -l k8s-app=aws-node. Fix by enabling prefix delegation, using larger instance types, or adding nodes, and make sure the subnets still have free IPs.
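A minimal check-and-fix sketch, assuming the stock aws-node DaemonSet labels:

```bash
# Tail CNI logs for "failed to assign an IP address" style errors
kubectl logs -n kube-system -l k8s-app=aws-node --tail=100
# Enable prefix delegation (Nitro instance types only; recycle nodes afterwards to benefit)
kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true
```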
✅ Q3 — Pod has S3 access denied in EKS but works locally
Question: Using IRSA — still AccessDenied. What to verify?
Answer: Check the service account annotation for the IAM role ARN. Verify the IAM role trust policy references the cluster's OIDC provider and the exact namespace/service-account subject. Exec into the pod and check that the projected token file and the AWS_ROLE_ARN / AWS_WEB_IDENTITY_TOKEN_FILE env vars exist. Also confirm the SDK isn't silently falling back to the node IAM role (e.g., the pod started before the annotation was added).
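A minimal IRSA wiring sketch; account ID, role name, and namespace are placeholders:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-sa
  namespace: default
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/app-s3-role
# The role's trust policy must allow sts:AssumeRoleWithWebIdentity with the condition
#   <oidc-provider-url>:sub == system:serviceaccount:default:app-sa
```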
✅ Q4 — Ingress created but wrong target group health
Question: ALB shows targets unhealthy.
Answer: Check for a mismatch between the readiness probe path and the ALB health check path. Verify the Service's targetPort mapping and confirm the pod actually listens on that port. Check that the security groups allow ALB → node/pod traffic on the health check port.
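Sketch of the relevant AWS Load Balancer Controller annotations, assuming a /healthz readiness endpoint and a Service named app:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app
  annotations:
    alb.ingress.kubernetes.io/healthcheck-path: /healthz   # must match the readiness probe path
    alb.ingress.kubernetes.io/healthcheck-port: traffic-port
    alb.ingress.kubernetes.io/success-codes: "200"
spec:
  ingressClassName: alb
  rules:
  - http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: app
            port:
              number: 80
```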
✅ Q5 — Rolling update causes downtime despite replicas=3
Question: Why?
Answer:
No readiness probe, or a misconfigured one, means traffic is sent to pods that aren't ready yet. Also check whether maxUnavailable is set too high. Without a PDB and proper probes, a rolling update can still drop traffic.
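A surge-only rollout sketch; the image, port, and probe path are assumptions:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 3
  selector:
    matchLabels: {app: app}
  strategy:
    rollingUpdate:
      maxUnavailable: 0    # never remove a ready pod before its replacement is ready
      maxSurge: 1
  template:
    metadata:
      labels: {app: app}
    spec:
      containers:
      - name: app
        image: myorg/app:v2
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet: {path: /healthz, port: 8080}
          initialDelaySeconds: 5
          periodSeconds: 5
```

Pair this with a PodDisruptionBudget (e.g., minAvailable: 2) so voluntary disruptions can't drop below quorum either.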
✅ Q6 — Node drain stuck forever
Question: kubectl drain hangs.
Answer:
Usually a PodDisruptionBudget is blocking eviction; check kubectl get pdb -A. DaemonSet pods, and pods with local storage, also block the drain unless handled explicitly. Use --ignore-daemonsets (plus --delete-emptydir-data where appropriate) or relax the PDB.
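A quick sketch of the usual sequence (node name is a placeholder):

```bash
# List PDBs cluster-wide; ALLOWED DISRUPTIONS = 0 is the usual culprit
kubectl get pdb -A
# Drain with the standard guards and a timeout so it fails loudly instead of hanging
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --timeout=5m
```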
✅ Q7 — After EKS upgrade, workloads unstable
Question: What addon mismatch commonly breaks clusters?
Answer: VPC CNI, CoreDNS, and kube-proxy version mismatches. Always upgrade the addons after the control plane upgrade; many outages happen because teams forget the CNI compatibility matrix.
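A sketch for comparing what's supported against what's running (1.30 is an example target version):

```bash
# Which vpc-cni versions does the target control-plane version support?
aws eks describe-addon-versions --kubernetes-version 1.30 --addon-name vpc-cni \
  --query 'addons[].addonVersions[].addonVersion'
# What is actually running on the cluster?
kubectl -n kube-system get ds aws-node -o jsonpath='{.spec.template.spec.containers[0].image}'
```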
✅ Q8 — One service slower than others in same node
Question: Same node, different performance — why?
Answer:
Resource requests/limits differ: a pod with a low CPU limit gets CFS-throttled even when the node has spare capacity. Check the container's cgroup CPU stats or the container_cpu_cfs_throttled_* metrics rather than kubectl describe, which doesn't show throttling. Also consider the noisy-neighbor effect when pods run without limits.
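A quick way to read the throttling counters from inside the container; the path shown is for cgroup v1, while cgroup v2 exposes the same counters at /sys/fs/cgroup/cpu.stat:

```bash
# nr_throttled / throttled_time rising => the CPU limit, not node capacity, is the bottleneck
kubectl exec <pod-name> -- cat /sys/fs/cgroup/cpu/cpu.stat
```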
✅ Q9 — Need pod-to-pod network restriction
Question: How do you block traffic between namespaces?
Answer: Use NetworkPolicies: default deny plus explicit allow rules per namespace/app. Enforcement needs a capable CNI; recent AWS VPC CNI versions can enforce NetworkPolicy natively, while older setups pair it with Calico or Cilium.
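A default-deny sketch for one namespace (namespace name is a placeholder); layer allow policies per app on top of it:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: team-a
spec:
  podSelector: {}      # selects every pod in the namespace
  policyTypes:
  - Ingress            # no ingress rules follow, so all inbound traffic is denied
```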
✅ Q10 — Want zero-downtime node group upgrade
Question: How do you upgrade nodes safely?
Answer: Create a new node group with the new AMI, cordon and drain the old nodes gradually so workloads shift over, then delete the old group. The blue/green node-group strategy is the safest.
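A blue/green rotation sketch; cluster and node-group names are placeholders, and the node label assumes managed node groups:

```bash
# Bring up the replacement group first
eksctl create nodegroup --cluster prod --name ng-new
# Cordon, then drain, the old group's nodes one at a time
for n in $(kubectl get nodes -l eks.amazonaws.com/nodegroup=ng-old -o name); do
  kubectl cordon "$n"
  kubectl drain "$n" --ignore-daemonsets --delete-emptydir-data
done
# Remove the old group once everything is rescheduled and healthy
eksctl delete nodegroup --cluster prod --name ng-old
```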