EKS
Kubernetes Production Interview Questions

πŸ”₯ Question 1

Explain full Kubernetes control plane architecture and request flow from kubectl to pod creation.

βœ… Real Production Answer

The Kubernetes control plane mainly consists of:

  • kube-apiserver
  • etcd
  • kube-scheduler
  • kube-controller-manager
  • (cloud-controller-manager if cloud provider integrated)

Now request flow:

  1. When I run kubectl apply -f deployment.yaml, kubectl sends REST API request to kube-apiserver.

  2. kube-apiserver:

    • Authenticates (cert/token)
    • Authorizes (RBAC)
    • Validates object schema
    • Writes object state into etcd
  3. Now the object is stored as desired state in etcd.

  4. Deployment controller (inside controller-manager) sees new Deployment object.

    • It creates a ReplicaSet.
  5. ReplicaSet controller sees desired replicas and creates Pod objects.

  6. Pods are now in Pending state.

  7. kube-scheduler watches for unscheduled pods.

    • Applies filtering (resource availability, taints, affinity)
    • Scores nodes
    • Assigns pod to a node
  8. kubelet on that node:

    • Watches API server
    • Pulls image
    • Creates container via container runtime (containerd / CRI-O)
    • Reports status back
  9. Pod becomes Running.

Production insight:

  • API server is the only component talking to etcd.
  • Everything else works via watch mechanism.
  • If scheduler is down β†’ pods stay Pending.
  • If controller-manager down β†’ state reconciliation stops.
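The flow above begins with a manifest. A minimal sketch of the deployment.yaml referenced in step 1 (name, labels, and image are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web              # illustrative name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web         # must match the selector above
    spec:
      containers:
        - name: web
          image: nginx:1.25
          ports:
            - containerPort: 80
```

Applying this triggers exactly the chain described: Deployment β†’ ReplicaSet β†’ 3 Pods β†’ scheduler β†’ kubelet.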

πŸ”₯ Question 2

What happens internally when you create a Deployment?

βœ… Real Production Answer

Deployment is a higher-level abstraction.

When I create a Deployment:

  1. API server stores Deployment object.
  2. Deployment controller creates a ReplicaSet.
  3. ReplicaSet ensures desired replica count.
  4. Pods get created.

Deployment does not directly manage pods. It manages ReplicaSets.

During update:

  • A new ReplicaSet is created.
  • Old ReplicaSet scaled down gradually.
  • Controlled by maxSurge and maxUnavailable.

Production insight:

  • Rollbacks happen by scaling older ReplicaSet.
  • If rollout fails, check ReplicaSet events.
  • If readiness probe fails β†’ rollout stalls.

πŸ”₯ Question 3

Difference between Deployment, StatefulSet, DaemonSet β€” with production use cases.

βœ… Deployment

  • Stateless apps
  • Web servers
  • APIs
  • Horizontally scalable

Pods are interchangeable.

Example: Frontend app behind LoadBalancer.


βœ… StatefulSet

  • Stable pod identity
  • Ordered startup/shutdown
  • Stable persistent storage
  • Predictable DNS

Used for:

  • Databases (MySQL, MongoDB)
  • Kafka
  • Elasticsearch

Each pod gets:

  • pod-0, pod-1 naming
  • Dedicated PVC

Production insight: Don’t use StatefulSet unless you need stable identity or storage.
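A minimal StatefulSet sketch showing the stable identity and per-pod storage described above (names, image, and sizes are illustrative):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mysql
spec:
  serviceName: mysql-headless   # headless Service provides the stable DNS
  replicas: 3
  selector:
    matchLabels:
      app: mysql
  template:
    metadata:
      labels:
        app: mysql
    spec:
      containers:
        - name: mysql
          image: mysql:8.0
          volumeMounts:
            - name: data
              mountPath: /var/lib/mysql
  volumeClaimTemplates:         # one dedicated PVC per pod (data-mysql-0, data-mysql-1, ...)
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
```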


βœ… DaemonSet

  • One pod per node
  • Runs on every node

Used for:

  • Logging agents (Fluent Bit)
  • Monitoring (Node Exporter)
  • Security agents

If new node joins β†’ DaemonSet pod auto-created.


πŸ”₯ Question 4

When should you use StatefulSet over Deployment β€” and why not always?

βœ… Use StatefulSet when:

  • Application needs stable hostname
  • Persistent storage tied to instance
  • Ordered scaling
  • Clustered systems (Kafka, DB)

Example: Database cluster where each node has its own disk.


❌ Why not always?

  • StatefulSets are slower to scale
  • More complex
  • Harder rolling updates
  • Storage management complexity
  • Cannot freely replace pods

If app is stateless β†’ Deployment is simpler and safer.

Interview insight: If someone says β€œI use StatefulSet for everything” β†’ red flag.


πŸ”₯ Question 5

How does kube-scheduler make scheduling decisions?

βœ… Real Production Answer

Scheduler works in two phases:

1️⃣ Filtering (Predicate phase)

Eliminates nodes that cannot run the pod:

  • Not enough CPU/memory
  • Taints not tolerated
  • NodeSelector mismatch
  • Affinity rules fail
  • Volume binding constraints

2️⃣ Scoring phase

Ranks remaining nodes based on:

  • Resource availability
  • Spread
  • Affinity preferences
  • Topology

Best score wins.

Scheduler then binds pod to node.

Production insight:

  • If pod stuck Pending β†’ scheduler logs are key.
  • If requests not defined β†’ scheduling becomes unpredictable.
  • Resource requests are critical.

πŸ”₯ Question 6

What are scheduler predicates and priorities (or scheduling framework plugins)?

βœ… Real Production Answer

Older Kubernetes versions used:

  • Predicates β†’ filtering phase
  • Priorities β†’ scoring phase

In modern Kubernetes this is handled through Scheduling Framework plugins, but it is conceptually the same idea.

πŸ”Ή Filtering (Predicates Equivalent)

Scheduler removes nodes that don’t satisfy:

  • Insufficient CPU/memory
  • Taints not tolerated
  • NodeSelector mismatch
  • Node affinity required rules
  • Volume binding constraints
  • Node not Ready

If no node passes β†’ Pod stays Pending.


πŸ”Ή Scoring (Priorities Equivalent)

Among eligible nodes, scheduler scores based on:

  • Least requested resources
  • Balanced resource allocation
  • Pod affinity preferences
  • Topology spread constraints

Highest score wins.


πŸ”₯ Production Insight

If a pod is stuck Pending:

First check:

kubectl describe pod <pod>

Look at:

  • Events section
  • β€œ0/5 nodes available”

That tells you which predicate failed.

At 12 LPA level, you must know:

Scheduling mostly depends on resource requests, not limits.


πŸ”₯ Question 7

How does kube-controller-manager work? Name key controllers.

βœ… Real Production Answer

kube-controller-manager runs multiple controllers that reconcile desired state with actual state.

It constantly:

  • Watches API server
  • Compares desired vs current
  • Takes action to fix drift

This is the reconciliation loop.


πŸ”Ή Important Controllers

  1. Deployment Controller
  2. ReplicaSet Controller
  3. Node Controller
  4. Job Controller
  5. Endpoint Controller
  6. ServiceAccount Controller
  7. Namespace Controller
  8. PersistentVolume Controller

Example:

If a pod crashes β†’ ReplicaSet controller detects fewer replicas β†’ creates new pod.


πŸ”₯ Production Insight

If controller-manager is down:

  • No new pods created
  • No node health checks
  • No scaling actions
  • Cluster state drifts

But existing running pods continue working.


πŸ”₯ Question 8

What happens if kube-controller-manager goes down?

βœ… Real Production Answer

Existing workloads continue running because kubelet works independently.

But:

  • No self-healing
  • No scaling
  • No ReplicaSet enforcement
  • No Job completion
  • No node failure handling

Example:

If a node dies:

  • Node controller won't mark it NotReady.
  • Pods won't get rescheduled.

Cluster slowly degrades.


πŸ”₯ Real Production Fix

  • Control plane should run in HA mode.
  • Controller-manager usually runs as static pod on master nodes.
  • If one instance fails, another takes leadership.

πŸ”₯ Question 9

How does etcd store data β€” and why does quorum matter?

βœ… Real Production Answer

etcd is a distributed key-value store.

It stores:

  • All cluster state
  • Pods
  • Deployments
  • ConfigMaps
  • Secrets
  • Node info

Everything in Kubernetes = object in etcd.


πŸ”Ή How It Works

  • Uses Raft consensus algorithm
  • Requires majority to agree before commit
  • Strong consistency

If you have 3 etcd nodes:

  • Minimum 2 required for quorum.

If you have 5:

  • Minimum 3 required.

πŸ”₯ Why Quorum Matters

If quorum lost:

  • Cluster becomes read-only
  • API server cannot write
  • No new objects created
  • Cluster effectively dead

That’s why: Never run single-node etcd in production.


πŸ”₯ Production Best Practice

  • Always odd number of etcd nodes (3 or 5)
  • Regular snapshots
  • Separate etcd from worker load

πŸ”₯ Question 10

How do you design HA control plane?

βœ… Real Production Answer

For production:

πŸ”Ή Control Plane HA Components

  1. Multiple control plane nodes (minimum 3)
  2. etcd cluster with quorum
  3. Load Balancer in front of API servers

Flow:

kubectl β†’ LoadBalancer β†’ multiple kube-apiserver instances

Each API server:

  • Talks to etcd cluster
  • Leader election used for controllers

πŸ”Ή Options

Managed (EKS, GKE, AKS):

  • HA handled by cloud provider.

Self-managed:

  • kubeadm with stacked etcd
  • External etcd cluster

πŸ”₯ Production Insight

Common mistakes:

  • Single master
  • Single etcd
  • No LB in front of API server
  • No etcd backup

πŸ”₯ Interview Upgrade Answer

Mention:

  • Use 3 control plane nodes
  • Separate etcd disks (SSD)
  • Enable API server audit logging
  • Regular etcd snapshots
  • Test restore process

If you mention restore testing β†’ interviewer knows you’ve done real work.


Now we’re entering the area where most 2–3 year DevOps engineers collapse β€” networking.

If you master this section, your interview confidence will jump massively.


πŸ”₯ Question 11

How does pod-to-pod communication work across nodes?

βœ… Real Production Answer

In Kubernetes, every pod gets:

  • Its own IP
  • Flat networking model
  • No NAT between pods

Kubernetes follows:

Every pod can talk to every other pod directly via IP.


πŸ”Ή Same Node Communication

  • Pods connected via Linux bridge (like cni0)
  • Traffic stays local

πŸ”Ή Cross-Node Communication

This is where CNI plugin matters.

Example (AWS EKS with VPC CNI):

  • Pods get real VPC IPs
  • ENIs attached to worker nodes
  • Traffic routed via VPC

Example (Calico):

  • Uses overlay networking (VXLAN/IPIP)
  • Encapsulates traffic

Flow: Pod A β†’ Node network β†’ CNI routing β†’ Node B β†’ Pod B


πŸ”₯ Production Insight

If cross-node traffic fails:

Check:

  • CNI plugin logs
  • Node routes (ip route)
  • Security groups (in cloud)
  • NetworkPolicy

Networking issues are 80% of real cluster debugging.


πŸ”₯ Question 12

What is CNI β€” and what breaks if CNI fails?

βœ… Real Production Answer

CNI = Container Network Interface.

It is the plugin responsible for:

  • Assigning pod IP
  • Configuring networking
  • Managing routes

Without CNI:

  • Pods won’t get IP
  • Pods stuck in ContainerCreating
  • Cross-node communication fails

πŸ”Ή Common CNIs

  • AWS VPC CNI
  • Calico
  • Cilium
  • Flannel
  • Weave

πŸ”₯ Production Insight

If CNI pods crash:

  • Entire cluster networking unstable
  • New pods fail to start
  • Services may break

Always monitor:

  • CNI DaemonSet health
  • IP exhaustion (very common in AWS)

IP exhaustion is a classic production issue.


πŸ”₯ Question 13

Difference between ClusterIP, NodePort, LoadBalancer in real usage.

βœ… ClusterIP (Default)

  • Internal-only
  • Accessible inside cluster
  • Used for microservices communication

Example: Backend service accessed by frontend.


βœ… NodePort

  • Exposes service on every node’s IP + static port (30000–32767)
  • Mostly used for testing
  • Not ideal for production

βœ… LoadBalancer

  • Cloud provider provisions external LB
  • Exposes service publicly
  • Used for production traffic

Example: Public API service.


πŸ”₯ Production Insight

Best practice: LoadBalancer β†’ Ingress Controller β†’ ClusterIP services

Avoid exposing every service with LoadBalancer (cost issue).


πŸ”₯ Question 14

How does kube-proxy work (iptables vs IPVS mode)?

βœ… Real Production Answer

kube-proxy manages service routing.

When you create a Service:

  • kube-proxy sets up rules on nodes.

πŸ”Ή iptables Mode

  • Uses Linux iptables rules
  • Simple
  • Slower at scale (large clusters)

πŸ”Ή IPVS Mode

  • Uses Linux IP Virtual Server
  • More efficient
  • Better performance
  • Recommended for large clusters

πŸ”₯ Production Insight

If service routing fails:

Check:

kubectl get svc
kubectl describe svc
iptables -t nat -L -n

In large clusters, IPVS performs better.


πŸ”₯ Question 15

What is a headless service, and when is it used?

βœ… Real Production Answer

Headless Service = Service without ClusterIP.

Defined as:

clusterIP: None

No load balancing.

Instead:

  • DNS returns all pod IPs.

πŸ”Ή Used In:

  • StatefulSets
  • Databases
  • Direct pod-to-pod communication
  • Kafka clusters

Example: mysql-0.mysql-headless.default.svc.cluster.local

Each pod gets stable DNS.
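A headless Service is just a normal Service with clusterIP set to None β€” a minimal sketch (names and port are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: mysql-headless
spec:
  clusterIP: None      # headless: DNS returns pod IPs directly, no load balancing
  selector:
    app: mysql
  ports:
    - port: 3306
```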


πŸ”₯ Production Insight

If you need:

  • Direct communication between cluster members
  • Stable identity
  • Peer discovery

Use headless service.


πŸ”₯ Question 16

How does DNS resolution work inside the cluster?

βœ… Real Production Answer

Kubernetes uses CoreDNS for internal DNS.

When a pod starts:

  1. kubelet injects DNS config into /etc/resolv.conf
  2. Nameserver usually points to CoreDNS service IP
  3. Pod queries CoreDNS
  4. CoreDNS checks Kubernetes API for service/pod records
  5. Returns IP

πŸ”Ή Service DNS Format

<service-name>.<namespace>.svc.cluster.local

Example:

backend.default.svc.cluster.local

Short names work because of search domains.


πŸ”Ή For Headless Services

DNS returns:

  • Multiple A records (one per pod)

πŸ”₯ Production Debugging

If DNS fails:

Check:

kubectl get pods -n kube-system

(CoreDNS running?)

Test inside pod:

nslookup service-name

Common issues:

  • CoreDNS crash
  • NetworkPolicy blocking DNS (UDP 53)
  • CNI issues

DNS failure = full microservice meltdown.


πŸ”₯ Question 17

How would you debug if one pod cannot reach another pod?

βœ… Real Production Approach

I follow structured debugging:


Step 1: Basic Connectivity

From source pod:

ping target-ip
curl target-service

If IP works but service name fails β†’ DNS issue.


Step 2: Check Service

kubectl get svc
kubectl describe svc
kubectl get endpoints

If endpoints empty β†’ selector mismatch.


Step 3: Check NetworkPolicy

Very common mistake.

kubectl get networkpolicy

If policy exists β†’ verify ingress/egress rules.


Step 4: CNI & Node Level

  • Check CNI pod health
  • Check node routes
  • Security groups (cloud)

πŸ”₯ Production Insight

80% of inter-pod issues are:

  • Wrong label selector
  • NetworkPolicy blocking
  • Port mismatch

πŸ”₯ Question 18

How is NetworkPolicy enforced and common mistakes?

βœ… Real Production Answer

NetworkPolicy defines allowed traffic at pod level.

But important:

NetworkPolicy only works if CNI supports it.

Example:

  • Calico supports
  • AWS VPC CNI historically didn’t enforce it on its own (Calico was added for enforcement); newer VPC CNI versions support NetworkPolicy natively

πŸ”Ή Enforcement

NetworkPolicy:

  • Applied at pod level
  • Uses labels
  • Default deny model if policy exists

Once any NetworkPolicy selects a pod: β†’ Traffic to or from that pod that is not explicitly allowed is denied.


πŸ”₯ Common Mistakes

  • Forgetting egress rules
  • Forgetting DNS (UDP 53)
  • Wrong pod labels
  • Applying policy but CNI doesn’t support it

Production issue example: App can't call external API because egress blocked.
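Since forgetting DNS is the classic mistake, here is a hedged sketch of the rule to add once a default-deny egress policy exists (policy name is illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
spec:
  podSelector: {}          # all pods in this namespace
  policyTypes:
    - Egress
  egress:
    - ports:
        - protocol: UDP    # DNS lookups
          port: 53
        - protocol: TCP    # large responses fall back to TCP
          port: 53
```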


πŸ”₯ Question 19

Difference between Ingress and Gateway API?

βœ… Ingress

  • Older abstraction
  • Layer 7 HTTP routing
  • Requires Ingress Controller (Nginx, ALB, Traefik)

Supports:

  • Host-based routing
  • Path-based routing
  • TLS termination

βœ… Gateway API (Newer & More Powerful)

  • More flexible
  • Role-based separation
  • Better traffic control
  • Supports advanced routing

Gateway API separates:

  • Gateway (infra)
  • HTTPRoute (app routing)

πŸ”₯ Production Insight

Ingress still widely used.

Gateway API is future direction.

If you say:

"Gateway API gives better separation between infra and app teams"

Interviewer will be impressed.


πŸ”₯ Question 20

How TLS termination works with Ingress controller?

βœ… Real Production Flow

  1. User hits HTTPS endpoint.

  2. LoadBalancer forwards traffic to Ingress Controller.

  3. Ingress Controller:

    • Uses TLS secret
    • Terminates TLS
    • Forwards HTTP to backend service

πŸ”Ή TLS Secret

Stored as:

type: kubernetes.io/tls

Contains:

  • tls.crt
  • tls.key

πŸ”Ή Production Best Practice

Use cert-manager:

  • Automatically issues certificates (Let's Encrypt)
  • Auto-renewal
  • Reduces manual errors

πŸ”₯ Real Production Failure Cases

  • Expired certificate
  • Secret missing
  • Wrong host in Ingress rule
  • Port mismatch
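A minimal Ingress sketch of the wiring above, useful to check against those failure cases (host, service, and secret names are illustrative; the Secret must be of type kubernetes.io/tls):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web
spec:
  tls:
    - hosts:
        - app.example.com
      secretName: web-tls       # kubernetes.io/tls Secret (tls.crt + tls.key)
  rules:
    - host: app.example.com     # must match the TLS host, or cert errors follow
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web
                port:
                  number: 80
```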

πŸ”₯ Question 21

Difference between resource requests and limits β€” real impact on scheduling?

βœ… Real Production Answer

In Kubernetes:

  • Requests β†’ used for scheduling
  • Limits β†’ enforced at runtime

πŸ”Ή Requests

When a pod is scheduled, kube-scheduler checks:

  • CPU request
  • Memory request

Scheduler ensures the node has at least that much available.

If no node satisfies β†’ Pod stays Pending.


πŸ”Ή Limits

Enforced by container runtime using cgroups.

  • CPU limit β†’ throttling
  • Memory limit β†’ OOMKill

πŸ”₯ Real Production Impact

If you don’t define requests:

  • Scheduler may overcommit node
  • Many pods land on same node
  • Node pressure increases
  • Random OOMs later

If you don’t define limits:

  • One bad pod can consume entire node memory
  • Node becomes unstable

Best practice: Always define both.
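A container spec sketch with both defined (values are illustrative starting points, not recommendations):

```yaml
containers:
  - name: app                # illustrative
    image: example/app:1.0   # illustrative
    resources:
      requests:              # scheduler places the pod based on these
        cpu: "250m"
        memory: "256Mi"
      limits:                # enforced at runtime via cgroups
        cpu: "500m"          # exceeding β†’ throttling
        memory: "512Mi"      # exceeding β†’ OOMKill
```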


πŸ”₯ Question 22

What happens if limits are not defined?

βœ… Real Production Answer

If limits not defined:

  • CPU β†’ unlimited usage (can starve others)
  • Memory β†’ can consume entire node
  • Node may enter MemoryPressure
  • Kernel OOM killer may kill random pods

In worst case:

  • Node crashes
  • Multiple services affected

πŸ”₯ Production Insight

In shared clusters: Never allow workloads without limits.

Use:

  • LimitRange
  • ResourceQuota

To enforce guardrails at namespace level.
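Hypothetical guardrails for one namespace (names, namespace, and values are illustrative):

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: defaults
  namespace: team-a
spec:
  limits:
    - type: Container
      default:              # applied as limits when none are set
        cpu: "500m"
        memory: "512Mi"
      defaultRequest:       # applied as requests when none are set
        cpu: "100m"
        memory: "128Mi"
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:                     # namespace-wide ceilings
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
```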


πŸ”₯ Question 23

What is OOMKilled and how to prevent it?

βœ… Real Production Answer

OOMKilled happens when:

  • Container exceeds memory limit
  • Linux kernel kills it

Pod status shows:

Reason: OOMKilled

πŸ”Ή Root Causes

  • Memory leak in app
  • Too low memory limit
  • Traffic spike
  • Poor request/limit tuning

πŸ”₯ Debugging Approach

  1. Check pod describe output
  2. Check previous container logs:
kubectl logs pod-name --previous
  3. Compare usage vs limits (Prometheus/Grafana)

πŸ”₯ Prevention

  • Set realistic memory requests & limits
  • Use HPA
  • Profile application memory
  • Avoid equal request=limit unless needed

πŸ”₯ Question 24

How does HPA calculate scaling decisions?

βœ… Real Production Answer

HPA works based on:

desiredReplicas = ceil(currentReplicas Γ— currentMetric / targetMetric)

Example:

If CPU target = 60% and current average CPU = 90%:

desiredReplicas = ceil(currentReplicas Γ— 90 / 60) β€” a 1.5Γ— scale-up.


πŸ”Ή Requirements

  • Metrics Server installed
  • CPU requests defined

If requests missing β†’ HPA won’t work properly.


πŸ”Ή Scaling Cycle

  • HPA checks metrics periodically (default 15s)
  • Calculates desired replicas
  • Updates Deployment
  • ReplicaSet creates new pods

πŸ”₯ Production Insight

Common issue: HPA scales up fast, scales down slowly (stabilization window).

You must tune:

  • minReplicas
  • maxReplicas
  • scaleDown behavior
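An autoscaling/v2 HPA sketch covering all three tuning points (names and numbers are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # default; tune to scale down faster/slower
```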

πŸ”₯ Question 25

Metrics Server vs Prometheus for HPA β€” difference?

βœ… Metrics Server

  • Lightweight
  • Provides CPU & memory metrics only
  • Used by HPA
  • Not long-term storage

βœ… Prometheus

  • Full monitoring system
  • Stores historical metrics
  • Custom metrics support
  • Can integrate with HPA via adapter

πŸ”₯ Production Insight

Default HPA uses Metrics Server.

For advanced scaling (like requests per second): Use:

  • Prometheus Adapter
  • Custom metrics API

Example: Scale based on:

  • Queue length
  • HTTP requests/sec
  • Kafka lag

That’s more production-grade scaling.


πŸ”₯ Question 26

When HPA fails to scale β€” what are your debugging steps?

βœ… Real Production Answer

If HPA is not scaling, I check systematically:


Step 1: Check HPA Status

kubectl get hpa
kubectl describe hpa <name>

Look for:

  • Current metrics
  • Target metrics
  • Events
  • Conditions

If it says:

failed to get CPU utilization

β†’ Metrics Server issue.


Step 2: Verify Metrics Server

kubectl get pods -n kube-system

Check metrics-server is running.

Test:

kubectl top pods

If this fails β†’ HPA won't work.


Step 3: Check Resource Requests

HPA calculates based on CPU requests.

If CPU request not defined:

  • Scaling won’t behave correctly.

Step 4: Check min/maxReplicas

Sometimes HPA not scaling because:

  • Already at maxReplicas
  • Current replicas equal calculated replicas

Step 5: Stabilization Window

Scale down might not happen due to:

  • Stabilization window (default 300s)

πŸ”₯ Production Insight

Most common causes:

  • Missing CPU requests
  • Metrics Server misconfigured
  • Target utilization unrealistic

πŸ”₯ Question 27

Difference between HPA, VPA, and Cluster Autoscaler?

βœ… HPA (Horizontal Pod Autoscaler)

  • Scales number of pods
  • Based on CPU/memory/custom metrics

Used for:

  • Web apps
  • APIs

βœ… VPA (Vertical Pod Autoscaler)

  • Adjusts CPU/memory requests & limits
  • Does NOT scale pod count
  • Often restarts pods to apply new values

Used for:

  • Stateful workloads
  • Apps needing tuning

⚠️ Important

Do NOT run HPA and VPA on same resource for CPU β€” conflict risk.


βœ… Cluster Autoscaler

  • Scales nodes
  • Adds/removes worker nodes
  • Works when pods are Pending due to lack of resources

Flow: HPA scales pods β†’ No space β†’ Cluster Autoscaler adds nodes.


πŸ”₯ Production Insight

Scaling hierarchy:

  1. HPA tries first
  2. If node capacity full β†’ Cluster Autoscaler triggers
  3. Node joins β†’ Pending pods scheduled

πŸ”₯ Question 28

PodDisruptionBudget β€” real production use case?

βœ… Real Production Answer

PDB ensures minimum pods stay available during voluntary disruptions.

Voluntary disruptions:

  • Node drain
  • Cluster upgrade
  • Manual eviction

Example:

3 replicas running.

PDB:

minAvailable: 2

During node upgrade: Only 1 pod can be evicted at a time.


πŸ”₯ Why Important?

Without PDB: During node drain β†’ all pods might go down β†’ outage.


πŸ”₯ Production Scenario

While upgrading EKS:

  • Node draining respects PDB
  • Ensures zero downtime
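The PDB from the example above, as a full manifest (name and labels are illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app-pdb
spec:
  minAvailable: 2          # with 3 replicas, at most 1 pod evicted at a time
  selector:
    matchLabels:
      app: app             # illustrative label
```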

πŸ”₯ Question 29

Taints & tolerations β€” when have you used them?

βœ… Real Production Answer

Taints repel pods.

Tolerations allow pods to run on tainted nodes.


Real Use Cases

  1. Dedicated GPU nodes

    • Taint GPU nodes
    • Only ML workloads tolerate
  2. Infra nodes

    • Taint monitoring/logging nodes
    • Prevent regular apps from scheduling
  3. Spot instances

    • Taint spot nodes
    • Only fault-tolerant workloads run there

πŸ”₯ Production Insight

If pod Pending with:

node(s) had taint that pod didn't tolerate

Add toleration in spec.
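A sketch of the GPU-node case: taint the node once, then give only GPU workloads a matching toleration (key, value, and node name are illustrative):

```yaml
# Node side (one-time): kubectl taint nodes gpu-node-1 gpu=true:NoSchedule
# Pod side β€” toleration in the pod spec:
tolerations:
  - key: "gpu"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
```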


πŸ”₯ Question 30

Node affinity vs Pod affinity vs Anti-affinity β€” real scenario usage?

βœ… Node Affinity

Controls which nodes pod can schedule on.

Example: Schedule only on:

  • SSD nodes
  • GPU nodes
  • Specific AZ

βœ… Pod Affinity

Schedule pod close to another pod.

Example: App + cache in same zone for latency reduction.


βœ… Pod Anti-Affinity

Ensure pods are NOT on same node.

Example: 3 replicas of API β†’ spread across 3 nodes.
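A pod anti-affinity sketch for the 3-replica spread example (the label is illustrative):

```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: api                          # illustrative
        topologyKey: kubernetes.io/hostname   # no two api pods share a node
```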


πŸ”₯ Question 31

Rolling update β€” what parameters control its behavior?

βœ… Real Production Answer

Rolling update is default strategy in Deployment.

Controlled by:

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 1
    maxSurge: 1

πŸ”Ή maxUnavailable

How many pods can be unavailable during update.

Example: Replicas = 4 maxUnavailable = 1

At least 3 pods always running.


πŸ”Ή maxSurge

How many extra pods can be created above desired replicas.

Example: Replicas = 4 maxSurge = 1

Kubernetes can temporarily run 5 pods.


πŸ”₯ Production Insight

For high-traffic apps:

  • maxUnavailable = 0
  • maxSurge = 1 or 25%

Ensures zero downtime.


πŸ”₯ Question 32

How do maxUnavailable and maxSurge affect rollout?

βœ… Real Production Example

Replicas = 10, maxUnavailable = 2, maxSurge = 3

During rollout:

  • Up to 3 new pods created
  • Up to 2 old pods taken down

So total pods can go up to 13 temporarily.


πŸ”₯ Impact

If maxUnavailable too high: β†’ Risk downtime

If maxSurge too high: β†’ Resource pressure


Production Balance

Low-traffic app β†’ aggressive rollout OK. Critical production β†’ conservative rollout.


πŸ”₯ Question 33

How to implement Blue-Green deployment in Kubernetes?

βœ… Real Production Answer

Blue-Green = two identical environments.

Approach:

  1. Deploy:

    • deployment-blue
    • deployment-green
  2. Service points to one of them.

Switch by changing:

  • Service selector OR
  • Ingress route

Flow:

Current: Service β†’ blue

Deploy green β†’ test green β†’ switch Service selector to green β†’ remove blue later.


πŸ”₯ Production Insight

Benefits:

  • Instant rollback (just switch back)
  • Safe for big schema changes

Downside:

  • Double resource usage
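The switch itself is just a selector change on the Service (labels and names are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: app
spec:
  selector:
    app: myapp
    version: green     # was "blue" β€” flipping this re-points all traffic
  ports:
    - port: 80
```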

πŸ”₯ Question 34

How to implement Canary deployment in Kubernetes?

βœ… Real Production Answer

Canary = gradual traffic shift.


Basic Method (Simple)

Deploy:

  • app-v1 (stable)
  • app-v2 (canary, fewer replicas)

Traffic automatically distributed by Service.

Example: 10 replicas of v1 + 1 replica of v2 β†’ roughly 1 in 11 requests (β‰ˆ9–10%) hit v2.


Advanced Method (Ingress Based)

Using:

  • NGINX Ingress annotations
  • Istio / service mesh

Example: Route:

  • 90% β†’ v1
  • 10% β†’ v2

Gradually increase.


πŸ”₯ Production Insight

True canary requires:

  • Monitoring
  • Automated rollback
  • Metrics comparison

Without metrics β†’ it's blind rollout.


πŸ”₯ Question 35

How to rollback a bad deployment safely?

βœ… Real Production Answer

First: Check rollout status:

kubectl rollout status deployment app

If broken:

kubectl rollout undo deployment app

This restores previous ReplicaSet.


πŸ”₯ What Actually Happens?

Deployment scales down new ReplicaSet. Scales up old ReplicaSet.


Production-Level Rollback Strategy

Better approach:

  • Use readiness probes properly
  • Monitor error rate
  • Use automated rollback (Argo Rollouts / Flagger)

πŸ”₯ Interview Upgrade Answer

Mention:

  • Don’t rely only on manual rollback

  • Monitor:

    • HTTP 5xx
    • Latency
    • CPU spike
  • Use progressive delivery tools

That signals maturity.


πŸ”₯ Question 36

How does readiness probe affect rollout?

βœ… Real Production Answer

Readiness probe determines whether a pod is ready to receive traffic.

During rollout:

  • New pod is created
  • Kubernetes waits until readiness probe passes
  • Only then does it send traffic
  • Only then old pod is terminated (based on rollout strategy)

πŸ”₯ What If Readiness Probe Fails?

  • Pod remains in NotReady
  • Service does NOT route traffic to it
  • Rollout may get stuck

If maxUnavailable = 0 and new pods never become Ready β†’ rollout blocks completely.


πŸ”₯ Production Insight

Bad readiness configuration can cause:

  • Stuck deployment
  • Traffic imbalance
  • Partial outages

Best practice: Readiness should check:

  • App health
  • DB connectivity (if critical)
  • Dependencies ready

πŸ”₯ Question 37

Liveness vs Readiness vs Startup probe β€” failure impact?

βœ… Liveness Probe

Checks:

Should this container be restarted?

If liveness fails:

  • Container restarted

Used to detect:

  • Deadlocks
  • Stuck processes

βœ… Readiness Probe

Checks:

Should this pod receive traffic?

If fails:

  • Traffic stops
  • Pod not restarted

βœ… Startup Probe

Used for:

  • Slow starting apps

Liveness (and readiness) checks are held off until the startup probe succeeds.


πŸ”₯ Production Mistake

Common error: Using liveness probe for DB connection check.

Result: Temporary DB issue β†’ pod restarts continuously β†’ worse outage.

Correct approach:

  • DB check in readiness, not liveness.
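All three probes side by side β€” note the dependency check lives only in readiness (paths, port, and timings are illustrative):

```yaml
containers:
  - name: app
    image: example/app:1.0            # illustrative
    startupProbe:                     # protects slow-starting apps
      httpGet:
        path: /healthz
        port: 8080
      failureThreshold: 30            # up to 30 Γ— 5s = 150s to start
      periodSeconds: 5
    livenessProbe:                    # restart on deadlock β€” keep dependency-free
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
    readinessProbe:                   # gate traffic β€” may check dependencies
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 5
```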

πŸ”₯ Question 38

How to achieve zero downtime deployment?

βœ… Real Production Strategy

  1. Use RollingUpdate

    • maxUnavailable: 0
    • maxSurge: 1
  2. Proper readiness probe

  3. Multiple replicas

  4. PodDisruptionBudget

  5. Graceful shutdown handling in app


πŸ”₯ Critical Element

App must handle:

  • SIGTERM signal
  • Stop accepting traffic
  • Finish ongoing requests
  • Exit cleanly

If app ignores SIGTERM: β†’ Rolling update causes dropped requests.


Production Add-ons

  • Use preStop hook
  • Increase terminationGracePeriodSeconds
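A pod spec sketch for graceful shutdown (image and sleep duration are illustrative; the sleep gives endpoint removal time to propagate before SIGTERM reaches the app):

```yaml
spec:
  terminationGracePeriodSeconds: 60            # default is 30s
  containers:
    - name: app
      image: example/app:1.0                   # illustrative
      lifecycle:
        preStop:
          exec:
            command: ["sh", "-c", "sleep 10"]  # drain window before SIGTERM
```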

πŸ”₯ Question 39

What breaks zero downtime deploy most often?

βœ… Real Production Failures

  1. Single replica app
  2. No readiness probe
  3. DB migration blocking
  4. App not handling SIGTERM
  5. Wrong resource limits causing crash
  6. HPA scaling too slow
  7. Sticky sessions not handled

πŸ”₯ Real Example

Deploy new version. Readiness passes. But new version has memory leak. After traffic shift β†’ OOMKilled β†’ outage.

Lesson: Deployment success β‰  production success.

Monitoring is mandatory.


πŸ”₯ Question 40

How do you manage config changes without rebuilding the image?

βœ… Real Production Answer

Use:

  • ConfigMap (non-sensitive config)
  • Secret (sensitive data)

Mounted as:

  • Environment variables
  • Files

πŸ”Ή Config Update Without Rebuild

Update ConfigMap:

kubectl apply -f config.yaml

But important:

Pods DO NOT restart automatically. Env-var values never refresh without a restart; volume-mounted ConfigMap files do refresh eventually, but most apps won’t re-read them.

Options:

  1. Manually restart deployment
  2. Use hash annotation in Deployment
  3. Use Reloader controller
  4. Use Helm upgrade

πŸ”₯ Production Insight

For zero downtime config update:

  • Update ConfigMap
  • Rolling restart deployment

Never bake config into image in production.


πŸ”₯ Question 41

PV vs PVC vs StorageClass β€” full lifecycle explanation

βœ… Real Production Answer

πŸ”Ή PersistentVolume (PV)

  • Actual storage resource

  • Could be:

    • EBS
    • NFS
    • EFS
    • Ceph
    • Local disk

Cluster-level object.


πŸ”Ή PersistentVolumeClaim (PVC)

  • Request for storage by pod

  • Namespace-level object

  • Specifies:

    • Size
    • Access mode
    • StorageClass

πŸ”Ή StorageClass

Defines:

  • Provisioner
  • Parameters
  • Reclaim policy
  • Volume binding mode

Used for dynamic provisioning.


πŸ”„ Full Lifecycle (Dynamic Provisioning Example)

  1. Pod creates PVC.

  2. PVC references StorageClass.

  3. StorageClass provisioner creates actual volume (like EBS).

  4. PV created and bound to PVC.

  5. Pod mounts PVC.

  6. Pod writes data.

  7. If PVC deleted:

    • Reclaim policy decides:

      • Delete
      • Retain

πŸ”₯ Production Insight

Always check:

kubectl get pvc
kubectl describe pvc

If PVC stuck in Pending β†’ StorageClass issue.
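A dynamic-provisioning sketch for AWS (StorageClass name and sizes are illustrative; assumes the EBS CSI driver is installed):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: ebs.csi.aws.com              # AWS EBS CSI driver
volumeBindingMode: WaitForFirstConsumer   # create volume in the pod's AZ
reclaimPolicy: Delete
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: gp3
  resources:
    requests:
      storage: 20Gi
```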


πŸ”₯ Question 42

Static vs Dynamic provisioning

βœ… Static Provisioning

Admin manually creates PV. PVC binds to matching PV.

Used when:

  • Pre-existing storage
  • Special compliance cases

Hard to scale.


βœ… Dynamic Provisioning

Most common.

PVC β†’ StorageClass β†’ Auto create volume.

Example in AWS:

  • PVC triggers EBS creation.

πŸ”₯ Production Best Practice

Always prefer dynamic provisioning unless special need.

Reduces manual mistakes.


πŸ”₯ Question 43

How does volume binding work?

βœ… Real Production Answer

Binding process:

  1. PVC created.

  2. Kubernetes searches for:

    • Matching PV OR
    • Uses StorageClass to provision new PV.

Matching based on:

  • Access mode
  • Storage size
  • StorageClass name

Once matched: PVC status β†’ Bound


πŸ”₯ VolumeBindingMode

Important field in StorageClass:

volumeBindingMode: WaitForFirstConsumer

This delays volume creation until pod scheduled.

Why important?

For:

  • Multi-AZ clusters
  • Ensures volume created in same zone as pod

Without this: Volume may create in wrong AZ β†’ scheduling failure.


πŸ”₯ Question 44

When PVC stays Pending β€” root causes?

βœ… Real Production Debug Flow

If PVC Pending:

Check:

kubectl describe pvc <name>

Common causes:

  1. No matching StorageClass
  2. Wrong StorageClass name
  3. Insufficient quota
  4. Provisioner not running
  5. VolumeBindingMode conflict
  6. Cloud permission issue (IAM)

πŸ”₯ Real Production Case

In AWS:

EBS provisioner fails because:

  • Worker node IAM role missing permission
  • Subnet not tagged properly

PVC remains Pending.


πŸ”₯ Question 45

Stateful app storage best practices

βœ… Real Production Best Practices

  1. Use StatefulSet
  2. Use dynamic provisioning
  3. Use WaitForFirstConsumer
  4. Ensure backups enabled
  5. Avoid deleting PVC blindly
  6. Use appropriate access mode

πŸ”₯ Important

For database:

  • One PVC per replica
  • Never share RWO volume across pods
  • Always test restore process

πŸ”₯ Production Risk

Deleting StatefulSet does NOT delete PVC by default.

Good: Prevents accidental data loss.

Bad: Leftover storage cost if not cleaned.


πŸ”₯ Question 46

RWX vs RWO β€” production implications?

βœ… RWO (ReadWriteOnce)

  • Volume can be mounted by one node at a time
  • Most common (EBS in AWS)
  • Safe for databases

Example: MySQL pod using EBS volume β†’ RWO.


βœ… RWX (ReadWriteMany)

  • Volume can be mounted by multiple nodes simultaneously

  • Requires shared filesystem:

    • EFS
    • NFS
    • CephFS

Used for:

  • Shared content
  • File uploads
  • ML shared datasets

πŸ”₯ Production Implications

RWO:

  • Better performance
  • Lower complexity
  • Zone-bound

RWX:

  • More flexible
  • Higher latency (network FS)
  • Needs careful permission handling

⚠️ Common Mistake

Trying to use EBS (RWO) with multiple replicas β†’ fails.

Know your backend storage limitations.


πŸ”₯ Question 47

How do you design RBAC with least privilege?

βœ… Real Production Answer

RBAC has:

  • Role / ClusterRole
  • RoleBinding / ClusterRoleBinding
  • ServiceAccount

Principle: Grant only required permissions.


πŸ”Ή Example

If app only needs to:

  • Read ConfigMaps

Create:

Role:

verbs: ["get", "list"]
resources: ["configmaps"]

Bind to ServiceAccount.
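The full version of that least-privilege setup could look like this (all names and the namespace are illustrative):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: configmap-reader
  namespace: app-ns
rules:
  - apiGroups: [""]              # core API group
    resources: ["configmaps"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: app-configmap-reader
  namespace: app-ns
subjects:
  - kind: ServiceAccount
    name: app-sa
    namespace: app-ns
roleRef:
  kind: Role
  name: configmap-reader
  apiGroup: rbac.authorization.k8s.io
```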


πŸ”₯ Production Best Practices

  • Never use cluster-admin for apps
  • Separate infra vs app roles
  • Audit API server logs
  • Use namespace isolation

πŸ”₯ Red Flag in Interview

If someone says:

β€œI give cluster-admin to simplify things.”

That’s a security risk.


πŸ”₯ Question 48

Difference between Role and ClusterRole?

βœ… Role

  • Namespace-scoped
  • Limited to one namespace

Used for:

  • App-specific permissions

βœ… ClusterRole

  • Cluster-wide

  • Can:

    • Access all namespaces
    • Access non-namespaced resources (nodes, PV)

πŸ”₯ Important

ClusterRole can still be bound to a single namespace using RoleBinding.


Production Use Case

Monitoring tool: Needs to read pods in all namespaces β†’ ClusterRole.

App: Needs access only in its namespace β†’ Role.


πŸ”₯ Question 49

How ServiceAccount is used by pods?

βœ… Real Production Answer

Every pod runs with a ServiceAccount.

If not specified: β†’ default ServiceAccount.


πŸ”Ή What It Does

  • Provides identity to pod
  • Used for API access
  • Mounts token inside pod

Token location:

/var/run/secrets/kubernetes.io/serviceaccount/

πŸ”₯ Production Best Practice

  • Create custom ServiceAccount per app
  • Attach minimal RBAC
  • Disable auto-mount token if not needed
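A sketch of those practices together (names and image are illustrative):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-sa                        # illustrative
automountServiceAccountToken: false   # opt out unless the app needs API access
---
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  serviceAccountName: app-sa
  containers:
    - name: app
      image: example/app:1.0          # illustrative
```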

πŸ”₯ In Cloud (Example: EKS)

ServiceAccount can be linked with IAM role (IRSA).

Pod β†’ IAM role β†’ AWS API securely.

Very important for production AWS setups.


πŸ”₯ Question 50

How are Secrets stored β€” and why is base64 not encryption?

βœ… Real Production Answer

By default:

Secrets are stored in etcd base64-encoded.

Base64 β‰  encryption.

Anyone with etcd access can decode.


πŸ”₯ Secure Production Setup

Enable:

Encryption at Rest

Using:

  • KMS provider
  • EncryptionConfiguration

πŸ”₯ Best Practices

  • Never commit secrets in Git

  • Use external secret managers:

    • AWS Secrets Manager
    • HashiCorp Vault
  • Use sealed secrets or External Secrets Operator


πŸ”₯ Interview Upgrade Answer

If you mention:

  • etcd encryption
  • KMS integration
  • IRSA
  • Secret rotation strategy

You’re signaling production maturity.

