EKS
EKS Upgrade

🚀 Part 1: EKS Upgrade Process & Strategy (Production Grade)

Interviewers don’t want “click upgrade.” They want a risk-managed upgrade strategy.

Start your answer like this:

I treat EKS upgrades as a staged, low-risk rollout across control plane, add-ons, and node groups, never a one-step upgrade.

Strong opener.


✅ EKS Upgrade Has 3 Layers (Say This Clearly)

EKS upgrades are done in order:

1️⃣ Control Plane
2️⃣ Cluster Add-ons
3️⃣ Worker Nodes

If you mix up the order, you risk breakage.


🧠 Step 1: Pre-Upgrade Assessment

Before upgrading:

  • check Kubernetes version skew policy
  • read EKS release notes
  • check deprecated APIs (very important)
  • scan manifests for removed APIs
  • check CRD compatibility
  • check add-on compatibility matrix
  • verify CNI / CoreDNS / kube-proxy versions
  • check Helm chart compatibility

Senior signal: mention API deprecation scan.

Tools you can mention:

  • kubent (the CLI of the kube-no-trouble project)
  • pluto
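To make the API deprecation scan concrete, here is a minimal Python sketch of what such a scanner does. It is not how kubent or pluto are implemented; the `REMOVED_APIS` table is a small illustrative subset, and the function names and sample manifests are hypothetical. Always confirm removals against the Kubernetes deprecation guide for your target version.

```python
# Illustrative subset of API removals: (apiVersion, kind) -> version that removed it.
REMOVED_APIS = {
    ("extensions/v1beta1", "Ingress"): (1, 22),
    ("networking.k8s.io/v1beta1", "Ingress"): (1, 22),
    ("batch/v1beta1", "CronJob"): (1, 25),
    ("policy/v1beta1", "PodDisruptionBudget"): (1, 25),
    ("policy/v1beta1", "PodSecurityPolicy"): (1, 25),
    ("autoscaling/v2beta2", "HorizontalPodAutoscaler"): (1, 26),
}


def parse_minor(version):
    """Turn a string like '1.29' into the tuple (1, 29)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def find_removed_apis(manifests, target_version):
    """Return (name, apiVersion, kind) for objects whose API is gone in target_version."""
    target = parse_minor(target_version)
    hits = []
    for manifest in manifests:
        key = (manifest.get("apiVersion"), manifest.get("kind"))
        removed_in = REMOVED_APIS.get(key)
        if removed_in is not None and target >= removed_in:
            name = manifest.get("metadata", {}).get("name", "<unnamed>")
            hits.append((name, key[0], key[1]))
    return hits


# Hypothetical manifests, already parsed into dicts.
manifests = [
    {"apiVersion": "policy/v1beta1", "kind": "PodDisruptionBudget",
     "metadata": {"name": "api-pdb"}},
    {"apiVersion": "apps/v1", "kind": "Deployment", "metadata": {"name": "api"}},
]

for hit in find_removed_apis(manifests, "1.25"):
    print("removed API in use:", hit)
```

In an interview, the point to land is that this check runs against both stored manifests and live cluster objects before the control plane moves.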

🛠 Step 2: Upgrade Control Plane

EKS control plane upgrade is managed by AWS.

Process:

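  • upgrade via console/CLI, one minor version at a time
  • no node restart yet
  • control plane becomes new version
  • worker nodes can stay on the older version for now (the Kubernetes skew policy lets kubelet trail the API server by up to two minor versions, three from 1.28 onward)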
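The skew rule is easy to state and easy to get wrong, so a small check helps. This sketch assumes the upstream Kubernetes policy: kubelet may be up to two minor versions older than the API server, three when the API server is 1.28 or newer, and never newer than it. The function names are hypothetical.

```python
def parse_minor(version):
    """Turn a string like '1.29' into the tuple (1, 29)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def allowed_kubelet_skew(control_plane):
    """Maximum number of minor versions kubelet may lag behind the API server."""
    _, minor = parse_minor(control_plane)
    return 3 if minor >= 28 else 2


def kubelet_within_skew(control_plane, kubelet):
    """True if a kubelet on this version is supported against this control plane."""
    cp_major, cp_minor = parse_minor(control_plane)
    kl_major, kl_minor = parse_minor(kubelet)
    if cp_major != kl_major or kl_minor > cp_minor:
        return False  # kubelet must never be newer than the API server
    return cp_minor - kl_minor <= allowed_kubelet_skew(control_plane)


print(kubelet_within_skew("1.29", "1.27"))
print(kubelet_within_skew("1.27", "1.24"))
```

Staying well inside the limit is still the safer habit: the skew window is there to make the upgrade possible, not to let nodes lag indefinitely.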

Risk is low but:

  • webhook / admission controllers can break
  • API removal can break controllers

🔌 Step 3: Upgrade Add-ons

Critical add-ons:

  • VPC CNI
  • CoreDNS
  • kube-proxy
  • EBS/EFS CSI drivers
  • Load balancer controller

Each add-on must run a version that is compatible with the cluster’s Kubernetes version.

Many outages happen here, not in the control plane.
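One way to check add-on compatibility is to ask EKS which add-on version is the default for a given cluster version. The sketch below only parses a response; it assumes the shape returned by the boto3 EKS client's `describe_addon_versions` call (`addons` → `addonVersions` → `compatibilities`), and the version strings in the sample are made up for illustration.

```python
def default_addon_version(response, addon_name, cluster_version):
    """Return the add-on version flagged as default for cluster_version, or None."""
    for addon in response.get("addons", []):
        if addon.get("addonName") != addon_name:
            continue
        for version in addon.get("addonVersions", []):
            for compat in version.get("compatibilities", []):
                if (compat.get("clusterVersion") == cluster_version
                        and compat.get("defaultVersion")):
                    return version["addonVersion"]
    return None


# Hypothetical response; real data would come from something like
# boto3.client("eks").describe_addon_versions(addonName="vpc-cni")
sample_response = {
    "addons": [
        {
            "addonName": "vpc-cni",
            "addonVersions": [
                {
                    "addonVersion": "v9.9.2-example",
                    "compatibilities": [
                        {"clusterVersion": "1.29", "defaultVersion": True},
                    ],
                },
                {
                    "addonVersion": "v9.9.1-example",
                    "compatibilities": [
                        {"clusterVersion": "1.28", "defaultVersion": True},
                        {"clusterVersion": "1.29", "defaultVersion": False},
                    ],
                },
            ],
        }
    ]
}

print(default_addon_version(sample_response, "vpc-cni", "1.29"))
```

Self-managed add-ons (for example a Helm-installed load balancer controller) are not covered by this lookup and need their own compatibility check.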


🖥 Step 4: Upgrade Node Groups (Safest Pattern)

Best practice = blue/green node group upgrade

Pattern:

Create new node group with:

  • new AMI
  • new kubelet version
  • new CNI version

Then:

  • cordon old nodes
  • drain pods respecting PDB
  • shift workloads
  • delete old node group

Never in-place patch all nodes blindly.
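To show why a PDB-aware drain takes several passes, here is a toy simulation. It is a sketch under simplifying assumptions: every evicted pod is assumed to become Ready on the new (green) nodes before the next round, and the pod names, app names, and `plan_drain` function are hypothetical. A real drain goes through the Kubernetes eviction API.

```python
def plan_drain(old_pods, healthy, min_available):
    """Group evictions into rounds so no app drops below its PDB minAvailable.

    old_pods: list of (pod_name, app) still running on the old (blue) nodes
    healthy: dict of app -> currently healthy replica count
    min_available: dict of app -> PDB minAvailable (apps without a PDB default to 0)
    """
    remaining = list(old_pods)
    rounds = []
    while remaining:
        # Disruptions allowed right now for each app.
        budget = {app: healthy[app] - min_available.get(app, 0) for app in healthy}
        this_round, deferred = [], []
        for pod, app in remaining:
            if budget[app] > 0:
                budget[app] -= 1
                this_round.append(pod)
            else:
                deferred.append((pod, app))
        if not this_round:
            raise RuntimeError("PDB blocks every remaining eviction")
        rounds.append(this_round)
        # Assumption: evicted pods are Ready on green nodes before the next round.
        remaining = deferred
    return rounds


old_pods = [("api-1", "api"), ("api-2", "api"), ("api-3", "api"), ("worker-1", "worker")]
rounds = plan_drain(old_pods, healthy={"api": 3, "worker": 1}, min_available={"api": 2})
print(rounds)  # [['api-1', 'worker-1'], ['api-2'], ['api-3']]
```

If an app runs exactly `minAvailable` replicas, the plan raises instead of looping forever, which mirrors a drain that hangs until someone adds replicas or relaxes the budget.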


🔄 Step 5: Workload Safety Controls

Must mention:

  • PodDisruptionBudgets
  • readiness probes
  • rolling deployments
  • maxUnavailable tuning
  • surge capacity available

Without PDBs, an upgrade can cause an outage.
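For the maxUnavailable and surge points, it helps to know the arithmetic. This sketch assumes the standard Deployment rolling-update rule: a percentage `maxSurge` rounds up and a percentage `maxUnavailable` rounds down. The function names are hypothetical.

```python
def resolve(value, replicas, round_up):
    """Resolve an int or a percentage string like '25%' against the replica count."""
    if isinstance(value, str) and value.endswith("%"):
        scaled = replicas * int(value[:-1])
        # Integer ceiling or floor of scaled / 100, avoiding float rounding.
        return -(-scaled // 100) if round_up else scaled // 100
    return int(value)


def rolling_update_bounds(replicas, max_surge="25%", max_unavailable="25%"):
    """Return the pod-count floor and ceiling during a rolling update."""
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    return {"min_available": replicas - unavailable, "max_total": replicas + surge}


print(rolling_update_bounds(10))  # {'min_available': 8, 'max_total': 13}
print(rolling_update_bounds(2))   # {'min_available': 2, 'max_total': 3}
```

The `max_total` figure is the surge capacity the cluster must be able to schedule; if the node group cannot fit those extra pods, the rollout stalls.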


🧪 Step 6: Pre-Prod Upgrade First

Senior answer must include:

I always upgrade the staging cluster first and run smoke + load tests before production.


⏱ Upgrade Frequency Strategy

Good interview answer:

  • stay no more than 1–2 versions behind the latest
  • avoid big jumps (the EKS control plane upgrades one minor version at a time)
  • schedule quarterly upgrades
  • treat as routine, not a rare event
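Falling far behind is costly because each minor version is its own upgrade. A minimal sketch of the hop list, assuming the EKS rule that the control plane moves one minor version at a time and cannot be downgraded (the function name is hypothetical):

```python
def upgrade_path(current, target):
    """List every control plane version you must pass through, in order."""
    cur_major, cur_minor = (int(part) for part in current.split(".")[:2])
    tgt_major, tgt_minor = (int(part) for part in target.split(".")[:2])
    if cur_major != tgt_major or tgt_minor < cur_minor:
        raise ValueError("downgrades and major-version changes are not supported")
    return [f"{cur_major}.{minor}" for minor in range(cur_minor + 1, tgt_minor + 1)]


print(upgrade_path("1.26", "1.29"))  # ['1.27', '1.28', '1.29']
```

Each entry in that list means a full pass through the control plane, add-on, and node steps, which is the practical argument for a quarterly cadence.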

🌐 Part 2: EKS Networking & CNI Models (Deep Interview Topic)

This is where many candidates get confused. Let’s make it clean.


🧩 First: What CNI Does

CNI decides:

  • pod IP allocation
  • pod routing
  • pod ↔ pod communication
  • pod ↔ VPC communication
  • network policy support
  • IP scaling limits

🟒 AWS VPC CNI (Default)


βœ… How AWS CNI Works

Pods get real VPC IPs from subnet.

Pod = VPC-native IP No overlay network.


βœ… Strengths

  • native VPC routing
  • no encapsulation overhead
  • high performance
  • security groups integration
  • works well with AWS LBs
  • simplest for AWS-native workloads

⚠️ Weakness

Consumes VPC IPs fast. Subnet exhaustion is common at scale.
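The per-node pod ceiling in the default (secondary IP) mode follows a simple formula: ENIs × (IPv4 addresses per ENI − 1) + 2. A small sketch, with ENI limits for a few instance types hard-coded here for illustration; check the EC2 documentation for the instance types you actually run.

```python
# (max ENIs, IPv4 addresses per ENI) for a few instance types, for illustration.
ENI_LIMITS = {
    "t3.medium": (3, 6),
    "m5.large": (3, 10),
    "m5.xlarge": (4, 15),
}


def max_pods_secondary_ip(enis, ipv4_per_eni):
    """Pod ceiling when each pod takes one secondary IP on an ENI.

    One address per ENI is the ENI's own primary IP; the +2 accounts for
    host-network pods that do not consume a pod IP.
    """
    return enis * (ipv4_per_eni - 1) + 2


for instance_type, (enis, ips) in ENI_LIMITS.items():
    print(instance_type, max_pods_secondary_ip(enis, ips))
```

Every one of those pod slots is a real subnet address, which is why busy clusters on small subnets run out of IPs.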


🚀 Prefix Delegation (AWS CNI Enhancement)


✅ What It Does

Instead of attaching many individual secondary IPs, attach IP prefixes (/28 blocks of 16 IPv4 addresses) to each ENI.

One prefix = block of pod IPs.


✅ Benefits

  • massive pod density increase
  • fewer ENIs needed
  • faster pod startup
  • eases per-node IP limits (subnets still need free contiguous /28 blocks)
  • best for high-scale clusters
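The density gain can be sketched with the same style of arithmetic. This assumes each address slot on an ENI holds a /28 prefix (16 IPs) and applies the caps AWS recommends for max pods: 110 for instances with fewer than 30 vCPUs and 250 otherwise. Treat the caps and the function name as assumptions to verify against current AWS guidance.

```python
IPS_PER_PREFIX = 16  # one /28 IPv4 prefix


def max_pods_prefix_delegation(enis, ipv4_per_eni, vcpus):
    """Pod ceiling with prefix delegation, capped at the recommended maximum."""
    raw = enis * (ipv4_per_eni - 1) * IPS_PER_PREFIX + 2
    recommended_cap = 110 if vcpus < 30 else 250
    return min(raw, recommended_cap)


# m5.large: 3 ENIs, 10 addresses per ENI, 2 vCPUs
print(max_pods_prefix_delegation(3, 10, 2))    # 110 (raw limit would be 434)
# m5.16xlarge: 15 ENIs, 50 addresses per ENI, 64 vCPUs
print(max_pods_prefix_delegation(15, 50, 64))  # 250
```

So on small instances the network stops being the bottleneck, and CPU and memory decide how many pods fit.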

🧠 Interview Position

Say:

Prefix delegation is the preferred scaling model for high pod density on AWS CNI.


🌉 VPC CNI Custom Networking


✅ What It Solves

Use separate subnets for pod IPs instead of node subnets.

Node subnet ≠ pod subnet.


✅ When Used

  • node subnet IP exhausted
  • want pod IP segmentation
  • network isolation
  • multi-subnet strategy

⚠️ Tradeoff

More routing complexity. Harder troubleshooting. Must design route tables correctly.
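When planning pod subnets, two quick checks catch most design mistakes: how many addresses a subnet really offers, and whether pod and node ranges overlap. The sketch below uses Python's `ipaddress` module and assumes the AWS rule that five addresses per subnet are reserved. The CIDRs are hypothetical examples.

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # first four addresses and the last one


def usable_ips(cidr):
    """Addresses in a subnet that AWS can actually hand out."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


def overlapping(node_cidrs, pod_cidrs):
    """Return (node_cidr, pod_cidr) pairs whose ranges overlap."""
    pairs = []
    for node_cidr in node_cidrs:
        for pod_cidr in pod_cidrs:
            if ipaddress.ip_network(node_cidr).overlaps(ipaddress.ip_network(pod_cidr)):
                pairs.append((node_cidr, pod_cidr))
    return pairs


node_subnets = ["10.0.1.0/24", "10.0.2.0/24"]      # hypothetical node subnets
pod_subnets = ["100.64.0.0/18", "100.64.64.0/18"]  # hypothetical secondary-CIDR pod subnets

print(usable_ips("10.0.1.0/24"))    # 251
print(usable_ips("100.64.0.0/18"))  # 16379
print(overlapping(node_subnets, pod_subnets))  # []
```

The size gap between the two ranges is the whole point: nodes keep their small routable subnets while pods draw from a much larger block.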


🛡 Calico


✅ What Calico Adds

  • strong NetworkPolicy engine
  • fine-grained policy control
  • can run with AWS CNI (policy-only mode)
  • can run full overlay mode

✅ When to Choose Calico

  • strict microsegmentation needed
  • fintech compliance network isolation
  • zero-trust east-west controls
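Microsegmentation usually starts with a default-deny policy plus explicit allows. The sketch below builds standard `networking.k8s.io/v1` NetworkPolicy objects as Python dicts and prints them as JSON, which `kubectl apply -f` accepts. The namespace, app labels, and port are hypothetical, and these are plain Kubernetes policies rather than Calico-specific resources.

```python
import json


def default_deny_ingress(namespace):
    """Deny all ingress to every pod in the namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-ingress", "namespace": namespace},
        "spec": {"podSelector": {}, "policyTypes": ["Ingress"]},
    }


def allow_from_app(namespace, target_app, source_app, port):
    """Allow TCP traffic to target_app pods only from source_app pods."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": f"allow-{source_app}-to-{target_app}", "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": {"app": target_app}},
            "policyTypes": ["Ingress"],
            "ingress": [
                {
                    "from": [{"podSelector": {"matchLabels": {"app": source_app}}}],
                    "ports": [{"protocol": "TCP", "port": port}],
                }
            ],
        },
    }


policies = [
    default_deny_ingress("payments"),
    allow_from_app("payments", "ledger", "api", 8443),
]
print(json.dumps(policies[1], indent=2))
```

A policy engine has to be running for these objects to have any effect; without one, the API server stores them and nothing is enforced.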

⚠️ Tradeoff

More operational complexity.


🕸 Weave Net


Characteristics

  • overlay network
  • simple setup
  • not AWS-native
  • more latency
  • less used in EKS now

Interview answer: rarely chosen today for EKS.


🧡 Flannel


Characteristics

  • simple overlay CNI
  • basic networking
  • no strong policy engine
  • good for small clusters
  • not common in EKS production

🧠 Interview Decision Matrix (Say This)


✅ Default Production EKS

AWS VPC CNI + Prefix Delegation


✅ Need Network Policy

AWS CNI + Calico policy mode


✅ Extreme Pod Density

AWS CNI + Prefix Delegation + custom networking


✅ Hybrid / On-prem style cluster

Calico full mode


❌ Avoid for EKS Prod

Weave / Flannel (unless special case)
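The matrix above can be written as a small function, which is a handy way to rehearse it. This is only a restatement of these notes, not an official recommendation engine; the function and parameter names are hypothetical.

```python
def choose_cni(needs_network_policy=False, extreme_pod_density=False, hybrid_on_prem=False):
    """Return the networking stack suggested by the decision matrix."""
    if hybrid_on_prem:
        return ["Calico (full mode)"]
    stack = ["AWS VPC CNI", "prefix delegation"]
    if extreme_pod_density:
        stack.append("custom networking")
    if needs_network_policy:
        stack.append("Calico (policy-only mode)")
    return stack


print(choose_cni())
print(choose_cni(needs_network_policy=True, extreme_pod_density=True))
```

Weave and Flannel never appear in the output, which matches the "avoid unless special case" rule.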


🧨 Interview Trap: Network Policy Support

AWS CNI alone historically lacked NetworkPolicy enforcement, so clusters needed Calico. Newer VPC CNI releases can enforce Kubernetes NetworkPolicy natively, but Calico’s policy model is still richer.

Mention this nuance = senior signal.


🧠 Senior One-Shot Summary Answer

If asked:

“What networking model would you choose for large fintech EKS?”

Answer:

I’d use AWS VPC CNI with prefix delegation for scale and native VPC routing, add Calico for network policy enforcement, and use custom networking only if subnet IP pressure requires separation. Overlay CNIs like Weave or Flannel I avoid for production EKS due to performance and operability tradeoffs.

