💾 DR & Reliability - 15 Real-World DevOps Interview Questions


✅ Q1 - What is the difference between RPO and RTO?

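RPO (Recovery Point Objective) is how much data loss is acceptable, measured in time. RTO (Recovery Time Objective) is how long the system can be down. RPO drives backup frequency; RTO drives the level of recovery automation. For example, an RPO of 15 minutes means changes must be captured by backup or replication at least every 15 minutes.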
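
The link between RPO and backup frequency can be checked with simple arithmetic. This is a minimal sketch, assuming each backup captures the data as of its start time and only becomes usable once it finishes; the function names are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta, backup_duration: timedelta) -> timedelta:
    """Worst case: failure hits just before the running backup completes,
    so the newest usable backup started one full interval earlier."""
    return backup_interval + backup_duration


def meets_rpo(backup_interval: timedelta, backup_duration: timedelta, rpo: timedelta) -> bool:
    """True when the worst-case loss window fits inside the RPO."""
    return worst_case_data_loss(backup_interval, backup_duration) <= rpo
```

With hourly backups that take 10 minutes, the worst-case loss window is 70 minutes, so a 1-hour RPO is not met even though backups run "every hour".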


✅ Q2 - How do you design a backup strategy for production databases?

Use automated scheduled backups plus continuous replication if possible. Store backups cross-region. Test restores regularly: a backup that has never been restored is false safety. Encrypt and version backups.


✅ Q3 - How often should DR drills be done?

At least quarterly for critical systems. Include restore + traffic switch tests. DR plans that are not tested usually fail in real incidents. Runbooks must be validated by drills.
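
A drill cadence is easy to let slip, so it helps to check it automatically. This is a small sketch, assuming "quarterly" is approximated as 90 days and that the last drill date per system is tracked somewhere; the system names and the function are hypothetical.

```python
from datetime import date, timedelta

QUARTER = timedelta(days=90)  # assumption: "quarterly" approximated as 90 days


def overdue_drills(last_drill: dict[str, date], today: date,
                   max_age: timedelta = QUARTER) -> list[str]:
    """Return systems whose most recent DR drill is older than max_age, sorted by name."""
    return sorted(name for name, when in last_drill.items() if today - when > max_age)
```

A check like this could run in CI or a scheduled job and open a ticket for every system it returns.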


✅ Q4 - The app is stateless but depends on a DB: what is your DR focus?

Compute can be rebuilt; data cannot. The priority is DB replication, backups, and failover. Stateless layers are redeployable via IaC. The data layer defines DR complexity.


✅ Q5 - How do you design DR for a Kubernetes cluster?

Define infrastructure via Terraform + GitOps so the cluster is reproducible. Back persistent volumes with snapshot-capable storage. Back up cluster state (etcd or manifests). Store container images in a remote registry.


✅ Q6 - Multi-AZ vs Multi-Region: when do you choose which?

Multi-AZ protects against a datacenter failure, at lower cost and latency. Multi-region protects against a region outage, at higher cost and complexity. Most apps start with multi-AZ; critical apps go multi-region.


✅ Q7 - How do you design database failover?

Use managed DB with automatic failover or replication with promotion. Health checks trigger role switch. App must use endpoint/cluster DNS, not fixed IP. Test failover behavior under load.
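
The "health checks trigger role switch" step can be sketched as a small state machine. This is an illustration only, assuming promotion after a fixed number of consecutive failed checks; the class, node names, and threshold are hypothetical, and a managed database would handle this internally.

```python
from dataclasses import dataclass, field


@dataclass
class FailoverController:
    """Promotes the replica after `threshold` consecutive failed health checks."""
    primary: str
    replica: str
    threshold: int = 3
    failures: int = field(default=0, init=False)

    def record_health_check(self, healthy: bool) -> str:
        """Feed one health-check result; return the node currently holding the primary role."""
        if healthy:
            self.failures = 0  # a single success resets the failure streak
            return self.primary
        self.failures += 1
        if self.failures >= self.threshold:
            # Role switch: the replica is promoted, the old primary becomes the replica.
            self.primary, self.replica = self.replica, self.primary
            self.failures = 0
        return self.primary
```

The application never hardcodes either node. It connects to a stable endpoint that resolves to whatever the controller reports as primary, which is why a fixed IP breaks failover.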


✅ Q8 - Backup vs snapshot: what is the difference?

A backup is a logical or physical copy stored separately. A snapshot is a point-in-time storage state, often incremental. Snapshots are fast but usually tied to the same platform. Backups are more portable.


✅ Q9 - How do you ensure backups are not corrupted?

Use checksum validation and test restores. Store backups in versioned storage. Monitor backup job success metrics. Silent backup failures are a common risk.
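
Checksum validation can be as simple as recording a SHA-256 digest when the backup is written and comparing it before a restore. This is a minimal sketch using only the Python standard library; the function names are hypothetical, and it assumes the expected digest is stored somewhere other than next to the backup file.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(path: Path, expected_sha256: str) -> bool:
    """True when the file on disk still matches the digest recorded at backup time."""
    return sha256_of(path) == expected_sha256
```

A matching checksum only proves the bytes did not change. It does not prove the backup can be restored, so it complements test restores rather than replacing them.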


✅ Q10 - How do you design zero-data-loss systems?

Use synchronous replication and a write quorum. Accept higher latency. Not all systems need this: it's a cost vs durability tradeoff. Financial systems often require it.
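
The write-quorum rule can be written down in a few lines. This is a sketch, assuming a majority quorum (one common choice, not the only one) and hypothetical function names: a write is acknowledged to the client only after a majority of replicas has persisted it, and reads overlap with writes when W + R > N.

```python
def quorum_size(replicas: int) -> int:
    """Smallest majority of `replicas` nodes."""
    return replicas // 2 + 1


def write_committed(acks: int, replicas: int) -> bool:
    """Acknowledge the write to the client only once a majority has persisted it."""
    return acks >= quorum_size(replicas)


def reads_see_latest_write(n: int, w: int, r: int) -> bool:
    """Read and write quorums share at least one replica when W + R > N."""
    return w + r > n
```

The latency cost mentioned above comes from waiting for those acknowledgements: with 3 replicas, every write waits for the second-fastest node.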


✅ Q11 - What is active-active vs active-passive DR?

Active-active serves traffic from multiple regions simultaneously. Active-passive keeps a standby region idle or warm. Active-active fails over faster but makes consistency harder. Active-passive is simpler and cheaper.


✅ Q12 - How do you protect against accidental deletion?

Enable soft delete, versioning, and retention locks. Use least-privilege IAM. Set prevent_destroy = true in the Terraform lifecycle block of critical resources. Human error is a leading cause of outages.
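
Soft delete with a retention window is what cloud storage services provide as a feature; the idea itself fits in a short in-memory sketch. The class and method names below are hypothetical and only illustrate the behavior: delete marks an item, restore works inside the retention window, and the data is purged only after the window has passed.

```python
from datetime import datetime, timedelta
from typing import Optional


class SoftDeleteStore:
    """In-memory store where delete only marks an item; purge happens after retention."""

    def __init__(self, retention: timedelta):
        self.retention = retention
        self._items: dict[str, bytes] = {}
        self._deleted_at: dict[str, datetime] = {}

    def put(self, key: str, value: bytes) -> None:
        self._items[key] = value
        self._deleted_at.pop(key, None)

    def delete(self, key: str, now: datetime) -> None:
        if key in self._items:
            self._deleted_at[key] = now  # mark only; the data is kept

    def get(self, key: str) -> Optional[bytes]:
        return None if key in self._deleted_at else self._items.get(key)

    def restore(self, key: str, now: datetime) -> bool:
        """Undo a delete if it is still inside the retention window."""
        deleted = self._deleted_at.get(key)
        if deleted is None or now - deleted > self.retention:
            return False
        del self._deleted_at[key]
        return True

    def purge_expired(self, now: datetime) -> None:
        """Permanently remove items whose retention window has passed."""
        for key, deleted in list(self._deleted_at.items()):
            if now - deleted > self.retention:
                del self._items[key]
                del self._deleted_at[key]
```

The retention window is the time a human has to notice the mistake, so it should be longer than the slowest realistic detection time.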


✅ Q13 - What should a good DR runbook contain?

Step-by-step recovery instructions, commands, credentials source, decision tree, rollback steps, and contact chain. Written for execution under stress. No assumptions.
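
If runbooks are kept as structured data (YAML or JSON loaded into a dictionary, which is an assumption here), the checklist above can be enforced mechanically. The section keys and the function name are hypothetical.

```python
REQUIRED_SECTIONS = (
    "recovery_steps",
    "commands",
    "credentials_source",
    "decision_tree",
    "rollback_steps",
    "contact_chain",
)


def missing_sections(runbook: dict[str, object]) -> list[str]:
    """Return required sections that are absent or empty, in checklist order."""
    return [name for name in REQUIRED_SECTIONS if not runbook.get(name)]
```

A check like this catches an incomplete runbook at review time. It cannot tell whether the steps actually work; only a drill does that.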


✅ Q14 - How do you design artifact & image DR?

Use multi-region registry replication. Keep builds reproducible via Dockerfile + lockfiles. Store artifacts in a durable repository with versioning. CI should be able to rebuild images if the registry is lost.


✅ Q15 - What is the biggest real-world DR mistake teams make?

Having backups but no restore procedure or test. DR is not backup: DR is a verified ability to recover. Untested DR plans fail during real outages.


