💾 DR & Reliability — 15 Real-World DevOps Interview Questions

✅ Q1 — What is the difference between RPO and RTO?

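RPO (Recovery Point Objective) is the maximum acceptable data loss, measured in time: an RPO of 15 minutes means losing at most the last 15 minutes of writes. RTO (Recovery Time Objective) is the maximum acceptable downtime before service is restored. RPO drives backup and replication frequency. RTO drives how much of the recovery must be automated.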
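
The arithmetic behind those two targets can be sketched as follows. This is a minimal illustration, not a standard formula: it assumes each backup captures the state at the moment it starts and becomes usable only once it finishes, and the function and parameter names are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta,
                         backup_duration: timedelta) -> timedelta:
    """Worst case: failure hits just before the next backup finishes,
    so the newest usable backup holds state from interval + duration ago."""
    return backup_interval + backup_duration


def meets_rpo(backup_interval: timedelta,
              backup_duration: timedelta,
              rpo: timedelta) -> bool:
    """True if the backup schedule keeps worst-case loss within the RPO."""
    return worst_case_data_loss(backup_interval, backup_duration) <= rpo


def meets_rto(detect: timedelta,
              restore: timedelta,
              validate: timedelta,
              rto: timedelta) -> bool:
    """True if detection + restore + validation fits inside the RTO."""
    return detect + restore + validate <= rto
```

Under these assumptions, hourly backups that take 10 minutes do not satisfy a 1-hour RPO, which is why a tight RPO usually pushes teams toward continuous replication.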

✅ Q2 — How do you design backup strategy for production databases?

Use automated, scheduled backups plus continuous replication where possible. Store backup copies in a second region. Test restores regularly: a backup that has never been restored is false safety. Encrypt and version backups.

✅ Q3 — How often should DR drills be done?

At least quarterly for critical systems. Include both a restore test and a traffic-switch test. DR plans that are never tested usually fail in real incidents. Runbooks must be validated by drills.

✅ Q4 — App is stateless but depends on DB — what is your DR focus?

Compute can be rebuilt; data cannot. The priority is DB replication, backups, and failover. Stateless layers are redeployable via IaC. The data layer defines DR complexity.

✅ Q5 — How do you design DR for Kubernetes cluster?

Define infrastructure with Terraform + GitOps so the cluster is reproducible. Back persistent volumes with snapshot-capable storage. Back up cluster state (etcd or manifests). Store container images in a remote registry.

✅ Q6 — Multi-AZ vs Multi-Region — when choose which?

Multi-AZ protects against a datacenter failure, with lower cost and latency. Multi-region protects against a full region outage, with higher cost and complexity. Most apps start with multi-AZ; critical apps go multi-region.

✅ Q7 — How do you design database failover?

Use a managed DB with automatic failover, or replication with promotion of a replica. Health checks trigger the role switch. The app must connect through an endpoint or cluster DNS name, not a fixed IP. Test failover behavior under load.
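
A health-check-driven role switch can be sketched like this. It is a toy model only: the class, the node names db-a and db-b, and the threshold of three consecutive failures are assumptions for illustration, and a real setup would rely on the managed service's or cluster manager's own failover logic.

```python
from dataclasses import dataclass


@dataclass
class FailoverMonitor:
    """Promotes the standby after N consecutive failed health checks."""
    primary: str = "db-a"          # hypothetical node names
    standby: str = "db-b"
    failure_threshold: int = 3     # assumed value; tune to avoid flapping
    consecutive_failures: int = 0

    def record_check(self, healthy: bool) -> bool:
        """Record one health check; return True if a failover happened."""
        if healthy:
            # Any success resets the counter, so one blip does not fail over.
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            # Role switch: the standby becomes the new primary.
            self.primary, self.standby = self.standby, self.primary
            self.consecutive_failures = 0
            return True
        return False
```

Requiring several consecutive failures is a deliberate choice here: failing over on a single missed check tends to cause needless role switches during short network blips.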

✅ Q8 — What is backup vs snapshot — difference?

A backup is a logical or physical copy stored separately from the source. A snapshot is a point-in-time image of storage state, often incremental. Snapshots are fast but usually tied to the same platform. Backups are more portable.

✅ Q9 — How do you ensure backups are not corrupted?

Use checksum validation and test restores. Store backups in versioned storage. Monitor backup job success metrics. Silent backup failures are a common risk.
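
Checksum validation can be as simple as recording a SHA-256 digest when the backup is written and recomputing it before a restore. The sketch below assumes the backup is a single file on disk and that the expected digest was stored somewhere trustworthy; the function names are hypothetical.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Stream the file in chunks so large backups do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(path: str, expected_hex: str) -> bool:
    """True if the file on disk still matches the recorded digest."""
    return sha256_of(path) == expected_hex.lower()
```

A matching checksum only shows the file has not changed since it was written. It does not show the backup is restorable, which is why test restores are still needed.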

✅ Q10 — How do you design zero-data-loss systems?

Use synchronous replication with a write quorum: a write is acknowledged only after enough replicas confirm it. Accept the higher write latency. Not all systems need this; it is a cost vs durability tradeoff. Financial systems often require it.
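
The write-quorum idea can be illustrated with a small sketch. Each replica is modeled as a plain callable that returns True on acknowledgement, which is an assumption made for illustration; real databases implement this inside their replication protocol.

```python
from typing import Callable, Sequence


def majority(replica_count: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replica_count // 2 + 1


def quorum_write(replicas: Sequence[Callable[[bytes], bool]],
                 payload: bytes,
                 quorum: int) -> bool:
    """Send the write to every replica synchronously and report success
    only if at least `quorum` of them acknowledged it."""
    if not 1 <= quorum <= len(replicas):
        raise ValueError("quorum must be between 1 and the replica count")
    acks = 0
    for write in replicas:
        try:
            if write(payload):
                acks += 1
        except Exception:
            # In this sketch an unreachable replica simply counts as no ack.
            pass
    return acks >= quorum
```

The caller waits for the replicas before confirming the write, which is where the extra latency comes from.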

✅ Q11 — What is active-active vs active-passive DR?

Active-active serves traffic from multiple regions simultaneously. Active-passive keeps a standby region idle or warm. Active-active fails over faster but makes data consistency harder. Active-passive is simpler and cheaper.
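
The routing difference between the two modes can be sketched as a single function. This is a simplified model with hypothetical names: in practice the decision lives in DNS or a global load balancer, not in application code.

```python
from typing import Dict, List


def routable_regions(mode: str,
                     regions: List[str],
                     healthy: Dict[str, bool]) -> List[str]:
    """Return the regions that should receive traffic.

    `regions` is ordered by priority; regions missing from `healthy`
    are treated as down.
    """
    up = [r for r in regions if healthy.get(r, False)]
    if mode == "active-active":
        return up        # every healthy region serves traffic
    if mode == "active-passive":
        return up[:1]    # only the highest-priority healthy region serves
    raise ValueError(f"unknown mode: {mode!r}")
```

In active-passive mode the standby only starts receiving traffic once the primary is marked unhealthy, so failover time includes detection plus the traffic switch.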

✅ Q12 — How do you protect against accidental deletion?

Enable soft delete, versioning, and retention locks. Use least-privilege IAM. Add prevent_destroy to the lifecycle block of critical Terraform resources. Human error is a leading cause of outages.
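
Soft delete with a retention window can be sketched with an in-memory store. The class below is purely illustrative (its name and methods are hypothetical); object stores and managed databases expose the same idea as a configuration setting.

```python
from datetime import datetime, timedelta
from typing import Any, Dict, Tuple


class SoftDeleteStore:
    """Deleted items move to a trash area and stay restorable
    until the retention window expires."""

    def __init__(self, retention: timedelta) -> None:
        self.retention = retention
        self._live: Dict[str, Any] = {}
        self._trash: Dict[str, Tuple[Any, datetime]] = {}

    def put(self, key: str, value: Any) -> None:
        self._live[key] = value

    def get(self, key: str) -> Any:
        return self._live[key]

    def delete(self, key: str, now: datetime) -> None:
        # Nothing is destroyed here; the item is only moved to trash.
        self._trash[key] = (self._live.pop(key), now)

    def restore(self, key: str, now: datetime) -> None:
        value, deleted_at = self._trash[key]
        if now - deleted_at > self.retention:
            raise KeyError(f"{key!r} is past its retention window")
        del self._trash[key]
        self._live[key] = value

    def purge_expired(self, now: datetime) -> int:
        """Permanently remove trashed items older than the retention window."""
        expired = [k for k, (_, t) in self._trash.items()
                   if now - t > self.retention]
        for k in expired:
            del self._trash[k]
        return len(expired)
```

The point of the design is that a mistaken delete costs nothing until the purge runs, which gives people time to notice and undo it.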

✅ Q13 — What should a good DR runbook contain?

Step-by-step recovery procedures, exact commands, where to obtain credentials, a decision tree, rollback steps, and a contact chain. It should be written for execution under stress and assume no prior knowledge.

✅ Q14 — How do you design artifact & image DR?

Use multi-region registry replication. Keep builds reproducible via Dockerfile + lockfiles. Store artifacts in a durable repository with versioning. CI should be able to rebuild images if the registry is lost.

✅ Q15 — Biggest real-world DR mistake teams make?

Having backups but no restore procedure or restore test. DR is not backup; DR is the verified ability to recover. Untested DR plans fail during real outages.
