🧠 DevOps System Design — 20 Realistic Scenario Interview Q&A

✅ Q1 — Design CI/CD for a monolith app deployed on VMs.

I’d build the pipeline as stages: build → test → package (VM image or OS package) → publish versioned artifact → deploy. The artifact is versioned and stored in an artifact repository. Deployment uses a rolling or blue-green strategy across the VM group. Config is externalized. Rollback = redeploy the previous artifact version.
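
The rollback rule above can be sketched as a lookup over the deployment history. This is a minimal sketch under assumptions: `rollback_target` and the history list are hypothetical names for illustration, not part of any specific CD tool.

```python
def rollback_target(history: list[str]) -> str:
    """Return the artifact version that was live before the current one.

    `history` is ordered oldest to newest; the last entry is the live version.
    """
    if len(history) < 2:
        raise ValueError("no earlier version to roll back to")
    current = history[-1]
    # Walk backwards and skip repeated deploys of the current version.
    for version in reversed(history[:-1]):
        if version != current:
            return version
    raise ValueError("no earlier version to roll back to")
```

Because the artifact is immutable and versioned, the rollback is just another deploy of a known version rather than a rebuild.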

✅ Q2 — Now migrate same monolith to containers — what changes?

CI builds a container image instead of a package. The registry becomes the artifact store. Deployment becomes container-based (ECS/EKS/K8s). Liveness and readiness checks are added. Infrastructure becomes immutable: changes ship as new images, not in-place edits.

✅ Q3 — Design CI/CD for 200 microservices.

Each service has its own pipeline, built from shared templates in a pipeline library. Build + scan + image push are standardized. GitOps handles deploy. Promotion uses the same artifact across envs. The platform team owns the pipeline framework so individual teams don’t reinvent it.

✅ Q4 — How do you standardize pipelines across org?

Create shared pipeline libraries and golden templates. Enforce via repo scaffolding. Provide reusable steps for build/scan/deploy. Version the pipeline library. This prevents pipeline drift.

✅ Q5 — Design Kubernetes platform for many teams.

Separate namespaces per team. Resource quotas and limit ranges enforced. Network policies isolate. Shared ingress and observability stack. RBAC per team. Platform addons centrally managed.
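
A per-team baseline like the one described above could be generated as plain manifest dictionaries. This is a sketch only: the `team-<name>` namespace convention, the quota sizes, the group name, and the function name `namespace_baseline` are assumptions for illustration. The resource kinds, API versions, and the built-in `edit` ClusterRole are standard Kubernetes.

```python
import json


def namespace_baseline(team: str, cpu: str = "20", memory: str = "64Gi") -> list[dict]:
    """Build the baseline objects for one team's namespace."""
    ns = f"team-{team}"  # assumed naming convention
    return [
        {"apiVersion": "v1", "kind": "Namespace",
         "metadata": {"name": ns, "labels": {"team": team}}},
        # Cap the total resource requests in the namespace.
        {"apiVersion": "v1", "kind": "ResourceQuota",
         "metadata": {"name": "team-quota", "namespace": ns},
         "spec": {"hard": {"requests.cpu": cpu, "requests.memory": memory}}},
        # Default-deny ingress: selects all pods, allows no inbound traffic.
        {"apiVersion": "networking.k8s.io/v1", "kind": "NetworkPolicy",
         "metadata": {"name": "default-deny-ingress", "namespace": ns},
         "spec": {"podSelector": {}, "policyTypes": ["Ingress"]}},
        # Give the team's group edit rights inside its own namespace only.
        {"apiVersion": "rbac.authorization.k8s.io/v1", "kind": "RoleBinding",
         "metadata": {"name": "team-edit", "namespace": ns},
         "subjects": [{"kind": "Group", "name": f"{team}-devs",
                       "apiGroup": "rbac.authorization.k8s.io"}],
         "roleRef": {"kind": "ClusterRole", "name": "edit",
                     "apiGroup": "rbac.authorization.k8s.io"}},
    ]


if __name__ == "__main__":
    print(json.dumps(namespace_baseline("payments"), indent=2))
```

Generating these from one function keeps every team's namespace identical in shape, which is what makes quotas and isolation enforceable at scale.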

✅ Q6 — How do you design multi-environment (dev/stage/prod)?

Ideally, separate clusters for prod and non-prod; at minimum, separate namespaces and cloud accounts. Separate state backends and secrets per environment. Promotion is artifact-based, not rebuild-based: the same build moves through every environment.
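
Artifact-based promotion can be sketched as a rule that an image digest may only enter an environment after it has been deployed to the previous one. The environment names, the `Promotions` class, and the digest strings are hypothetical; a real setup would record this in the CD tool or a GitOps repo.

```python
from dataclasses import dataclass, field

PROMOTION_ORDER = ["dev", "stage", "prod"]  # assumed environment names


@dataclass
class Promotions:
    deployed: dict[str, str] = field(default_factory=dict)  # env -> image digest

    def promote(self, digest: str, env: str) -> None:
        """Deploy `digest` to `env` only if the previous env runs the same digest."""
        idx = PROMOTION_ORDER.index(env)
        if idx > 0:
            prev_env = PROMOTION_ORDER[idx - 1]
            if self.deployed.get(prev_env) != digest:
                raise ValueError(f"{digest} has not been deployed to {prev_env}")
        self.deployed[env] = digest
```

Keying on the digest rather than a mutable tag is what guarantees that prod runs exactly the bytes that were tested in stage.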

✅ Q7 — How do you design zero-downtime deployments?

Use rolling or canary deployments with readiness probes. Run more than one replica. Add a PodDisruptionBudget (PDB). Use connection draining at the LB. Database changes must be backward compatible.
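
To make the disruption-budget point concrete: with a `minAvailable` budget, the number of pods that may be voluntarily evicted at once is the healthy count minus the minimum. The function below is a simplified sketch of that arithmetic (the function name is hypothetical, and percentage-based budgets are ignored).

```python
def disruptions_allowed(healthy_pods: int, min_available: int) -> int:
    """How many pods can be evicted right now without breaking the budget."""
    return max(0, healthy_pods - min_available)
```

With 3 healthy replicas and `minAvailable: 2`, one pod can be drained at a time; with a single replica, any budget of 1 blocks every voluntary eviction, which is why more than one replica is a prerequisite.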

✅ Q8 — Design logging + metrics for microservices platform.

Metrics via Prometheus, logs via Fluent Bit → log store, traces via OpenTelemetry (OTel). Standard labels across services. Correlation IDs required. Alert on SLO violations (user-facing symptoms), not on raw CPU.
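
One common way to alert on an SLO is the error-budget burn rate: the observed error ratio divided by the allowed error ratio. The sketch below assumes request counts for a single window; the function names and the default paging threshold of 14.4 (a commonly cited fast-burn value for a one-hour window) are assumptions to tune, not fixed rules.

```python
def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """Ratio of the observed error rate to the error budget (1 - SLO target)."""
    if total == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (errors / total) / error_budget


def should_page(errors: int, total: int, slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page only when the budget is burning much faster than sustainable."""
    return burn_rate(errors, total, slo_target) > threshold
```

A burn rate of 1 means the budget lasts exactly the SLO period; a high CPU reading with a burn rate near 0 pages nobody.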

✅ Q9 — Design secrets management across platform.

Use a central secret manager (Vault or a cloud secrets manager). Apps fetch secrets at runtime using their workload identity. No secrets in Git or images. Rotation is automated where possible.

✅ Q10 — How do you design cost-efficient K8s platform?

Right-size requests. Use HPA + cluster autoscaler/Karpenter. Spot nodes for interruption-tolerant workloads. Tiered storage. Monitor cost metrics. Idle envs auto-scale down.
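
The HPA scaling rule is worth knowing by heart: desired replicas = ceil(current replicas × current metric / target metric), clamped to the configured bounds. The sketch below implements only that core formula; the function name is hypothetical, and the real controller also applies a tolerance band and stabilization windows that are left out here.

```python
import math


def desired_replicas(current: int, current_metric: float, target_metric: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Core HPA formula, clamped to [min_replicas, max_replicas]."""
    raw = math.ceil(current * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))
```

For example, 4 replicas at 90% average CPU against a 60% target scale to 6. This is also why right-sized requests matter: utilization is measured relative to requests, so inflated requests keep the HPA from ever scaling up.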

✅ Q11 — Design HA for Kubernetes control plane (cloud).

Use managed control plane (EKS/GKE). Multi-AZ nodes. Multiple node groups. PDB enforced. Critical addons replicated.

✅ Q12 — How do you design platform observability for scale?

Central metrics with long-term store (Thanos/Cortex). Log tiering. Alert routing by team. Meta-monitoring enabled. Cardinality control enforced.

✅ Q13 — Multi-region active-passive app — design deploy flow.

Primary region live, secondary warm. CI/CD deploys to both. DB replication enabled. DNS failover with health checks. Regular DR drills.
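
The failover decision behind health-checked DNS can be sketched as: switch to the secondary only after several consecutive failed checks on the primary, and only if the secondary itself is healthy. The function name, region labels, and the threshold of 3 are assumptions for illustration; managed DNS services implement their own version of this logic.

```python
def pick_region(primary_checks: list[bool], secondary_healthy: bool,
                failures_to_trip: int = 3) -> str:
    """Return which region should receive traffic.

    `primary_checks` holds recent health-check results, oldest first.
    """
    recent = primary_checks[-failures_to_trip:]
    # Require a full run of consecutive failures to avoid flapping on one blip.
    primary_down = len(recent) == failures_to_trip and not any(recent)
    if primary_down and secondary_healthy:
        return "secondary"
    return "primary"
```

Failing over to an unhealthy secondary makes things worse, which is one reason regular DR drills matter: they verify the warm region can actually take traffic.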

✅ Q14 — How do you handle schema migrations in microservices?

Versioned migrations per service. Make backward-compatible (additive) changes first. Deploy an app version that supports both schemas. Then migrate the data. Then remove the old fields once nothing reads them.
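
The sequence above is often called expand/contract. Here is a small runnable sketch using SQLite from the Python standard library; the `users` table and the rename of `fullname` to `display_name` are a made-up example, and a production migration would run each step as a separate versioned migration with deploys in between.

```python
import sqlite3


def expand(conn: sqlite3.Connection) -> None:
    # Step 1: additive change. The new column is nullable, so old writers still work.
    conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")


def read_name(conn: sqlite3.Connection, user_id: int) -> str:
    # Step 2: app version that supports both schemas (new column, old fallback).
    row = conn.execute(
        "SELECT COALESCE(display_name, fullname) FROM users WHERE id = ?",
        (user_id,),
    ).fetchone()
    return row[0]


def backfill(conn: sqlite3.Connection) -> None:
    # Step 3: migrate existing data into the new column.
    conn.execute("UPDATE users SET display_name = fullname WHERE display_name IS NULL")


def contract(conn: sqlite3.Connection) -> None:
    # Step 4: drop the old column by rebuilding the table, only after
    # no deployed app version reads `fullname` any more.
    conn.executescript("""
        CREATE TABLE users_new (id INTEGER PRIMARY KEY, display_name TEXT);
        INSERT INTO users_new (id, display_name) SELECT id, display_name FROM users;
        DROP TABLE users;
        ALTER TABLE users_new RENAME TO users;
    """)
```

At every step the currently deployed app version can read and write successfully, so each step can be rolled back on its own.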

✅ Q15 — Design internal developer platform (IDP).

Self-service templates, golden paths, pipeline templates, namespace provisioning automation. Guardrails via policy. Platform team provides paved road.

✅ Q16 — How do you design safe production deploy approvals?

PR approval + pipeline checks + manual gate before prod. Change record stored. Auto deploy allowed only for low-risk services.
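
The gate can be expressed as a small policy function. This is a sketch under assumptions: the two risk tiers, the `Change` fields, and the function name are hypothetical, and a real pipeline would evaluate this in its approval step or a policy engine.

```python
from dataclasses import dataclass


@dataclass
class Change:
    service_risk: str          # "low" or "high" (assumed risk tiers)
    pr_approved: bool
    checks_passed: bool
    manual_approval: bool = False


def may_deploy_to_prod(change: Change) -> bool:
    """PR approval and green checks are always required; the manual gate
    is skipped only for low-risk services."""
    if not (change.pr_approved and change.checks_passed):
        return False
    if change.service_risk == "low":
        return True
    return change.manual_approval
```

Persisting each evaluated `Change` alongside the decision gives the stored change record for free.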

✅ Q17 — How do you design artifact traceability?

Every artifact tagged with commit SHA + build ID. Deployment records artifact version. Logs include version label. Rollback becomes deterministic.
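
One possible tagging scheme, shown as a sketch: a shortened commit SHA plus the build ID, parseable back into its parts. The `<sha12>-b<build>` format and both function names are assumptions; any scheme works as long as it is unique and reversible.

```python
def artifact_tag(commit_sha: str, build_id: int) -> str:
    """Build a tag such as 'a1b2c3d4e5f6-b142' from a commit SHA and build ID."""
    return f"{commit_sha[:12]}-b{build_id}"


def parse_tag(tag: str) -> tuple[str, int]:
    """Recover (short SHA, build ID) from a tag made by artifact_tag."""
    # Hex SHAs never contain '-', so splitting on the last '-b' is unambiguous.
    sha, build = tag.rsplit("-b", 1)
    return sha, int(build)
```

If the deployment record and the log labels carry the same tag, any production log line can be traced back to an exact commit and build.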

✅ Q18 — Platform for GPU/ML workloads — design notes?

Separate GPU node pools. Taints/tolerations keep non-GPU workloads off them. High-throughput storage. Job-based scheduling. Strict cost controls.

✅ Q19 — How do you prevent noisy neighbor in shared cluster?

Resource quotas, limits, priority classes. Node pools per workload class. Monitoring per namespace.
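
The effect of a namespace quota can be illustrated with a toy admission check: a pod is rejected when its request would push the namespace past its quota. This only mimics the idea; the class and field names are hypothetical, and Kubernetes enforces real quotas at admission time through ResourceQuota objects.

```python
from dataclasses import dataclass


@dataclass
class NamespaceQuota:
    cpu_limit_m: int       # quota in millicores
    cpu_used_m: int = 0    # sum of admitted requests so far

    def admit(self, request_m: int) -> bool:
        """Admit the pod only if the namespace stays within its quota."""
        if self.cpu_used_m + request_m > self.cpu_limit_m:
            return False
        self.cpu_used_m += request_m
        return True
```

A team that exhausts its own quota gets rejections in its own namespace instead of starving its neighbors.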

✅ Q20 — Biggest DevOps platform design mistake?

Tool-first design instead of workflow-first design. Buying tools doesn’t build a platform; workflows, guardrails, and standards do.
