DR & Reliability: 15 Real-World DevOps Interview Questions
Q1. What is the difference between RPO and RTO?
RPO (Recovery Point Objective) is how much data loss is acceptable, measured in time. RTO (Recovery Time Objective) is how long the system can be down. RPO drives backup frequency. RTO drives recovery automation level.
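To make the link between RPO and backup frequency concrete, the sketch below computes the longest gap between consecutive backups and compares it to a target RPO. The function names, the timestamps, and the 1-hour RPO are illustrative assumptions, not values from any real system.

```python
from datetime import datetime, timedelta


def worst_case_data_loss(backup_times, now):
    """Return the longest window in which a failure would lose data.

    The gap between two consecutive backups (or between the last backup
    and `now`) is what would be lost if the system failed at the end of
    that gap.
    """
    if not backup_times:
        raise ValueError("no backups recorded")
    points = sorted(backup_times) + [now]
    gaps = [later - earlier for earlier, later in zip(points, points[1:])]
    return max(gaps)


def meets_rpo(backup_times, now, rpo):
    """True if the worst-case loss window does not exceed the RPO."""
    return worst_case_data_loss(backup_times, now) <= rpo


# Hypothetical hourly schedule in which the 02:00 backup was skipped.
backups = [
    datetime(2024, 1, 1, 0, 0),
    datetime(2024, 1, 1, 1, 0),
    datetime(2024, 1, 1, 3, 0),
]
now = datetime(2024, 1, 1, 3, 30)

loss = worst_case_data_loss(backups, now)
rpo_ok = meets_rpo(backups, now, rpo=timedelta(hours=1))
print(loss, rpo_ok)
```

A single skipped backup doubles the exposure window here, which is why the backup interval has to sit comfortably inside the RPO rather than exactly at it.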
Q2. How do you design a backup strategy for production databases?
Use automated scheduled backups plus continuous replication if possible. Store backups cross-region. Test restores regularly: a backup without a restore test is false safety. Encrypt and version backups.
Q3. How often should DR drills be done?
At least quarterly for critical systems. Include restore + traffic switch tests. DR plans that are not tested usually fail in real incidents. Runbooks must be validated by drills.
Q4. The app is stateless but depends on a DB: what is your DR focus?
Compute can be rebuilt; data cannot. Priority is DB replication, backups, and failover. Stateless layers are redeployable via IaC. The data layer defines DR complexity.
Q5. How do you design DR for a Kubernetes cluster?
Define infra via Terraform + GitOps so the cluster is reproducible. Back persistent volumes with snapshot-capable storage. Back up cluster state (etcd or manifests). Store container images in a remote registry.
Q6. Multi-AZ vs Multi-Region: when do you choose which?
Multi-AZ protects against datacenter failure, with lower cost and latency. Multi-region protects against a region outage, at higher cost and complexity. Most apps start with multi-AZ; critical apps go multi-region.
Q7. How do you design database failover?
Use a managed DB with automatic failover, or replication with promotion. Health checks trigger the role switch. The app must use an endpoint/cluster DNS name, not a fixed IP. Test failover behavior under load.
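The health-check-driven role switch can be sketched as a small state machine. This is a simplified illustration, not how any particular managed database implements failover; the class name, the node names `db-a` and `db-b`, and the threshold of three consecutive failures are all assumptions.

```python
class FailoverController:
    """Promote the replica after N consecutive failed health checks."""

    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.primary = "db-a"   # hypothetical node names
        self.replica = "db-b"

    def record_health_check(self, healthy):
        """Feed one health-check result; return the current primary."""
        if healthy:
            # Any success resets the counter, so a single blip is ignored.
            self.consecutive_failures = 0
            return self.primary
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            # Role switch: the replica becomes the new primary.
            self.primary, self.replica = self.replica, self.primary
            self.consecutive_failures = 0
        return self.primary


controller = FailoverController(failure_threshold=3)
# One isolated failure does not trigger failover; three in a row do.
checks = [True, False, True, False, False, False]
results = [controller.record_health_check(ok) for ok in checks]
print(results)
```

The application would ask for the current primary on each connection instead of caching an address, which plays the same role as a cluster DNS endpoint.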
Q8. Backup vs snapshot: what is the difference?
A backup is a logical or physical copy stored separately. A snapshot is a point-in-time storage state, often incremental. Snapshots are fast but usually tied to the same platform. Backups are more portable.
Q9. How do you ensure backups are not corrupted?
Use checksum validation and test restores. Store in versioned storage. Monitor backup job success metrics. Silent backup failures are a common risk.
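Checksum validation can be sketched with Python's standard `hashlib`: record a SHA-256 digest when the backup is written, then recompute and compare it before trusting the file. The helper names and the throwaway `backup.dump` file are assumptions made for the demo.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1024 * 1024):
    """Stream the file so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(path, expected_checksum):
    """True if the file on disk still matches the recorded checksum."""
    return sha256_of(path) == expected_checksum


# Demo with a throwaway file standing in for a real backup archive.
with tempfile.TemporaryDirectory() as workdir:
    backup_path = os.path.join(workdir, "backup.dump")
    with open(backup_path, "wb") as handle:
        handle.write(b"pretend database dump")
    recorded = sha256_of(backup_path)  # stored alongside the backup

    intact = verify_backup(backup_path, recorded)

    with open(backup_path, "ab") as handle:  # simulate corruption
        handle.write(b"\x00")
    corrupted = verify_backup(backup_path, recorded)

print(intact, corrupted)
```

A matching checksum only shows the file has not changed since it was written. It does not show the dump was valid in the first place, which is why test restores are still needed.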
Q10. How do you design zero-data-loss systems?
Use synchronous replication and write quorum. Accept higher latency. Not all systems need this; it's a cost vs durability tradeoff. Financial systems often require it.
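A write quorum can be illustrated with a toy in-memory model: the write is reported as committed only once enough replicas have acknowledged it. The `Replica` class, `quorum_write` function, and the 3-replica, quorum-of-2 setup are assumptions for this sketch, and it omits what a real system needs, such as rolling back or repairing partial writes.

```python
class Replica:
    """Toy replica that stores writes in memory."""

    def __init__(self, name, available=True):
        self.name = name
        self.available = available
        self.data = {}

    def write(self, key, value):
        """Apply the write and acknowledge, or fail if unavailable."""
        if not self.available:
            return False
        self.data[key] = value
        return True


def quorum_write(replicas, key, value, write_quorum):
    """Report success only if at least `write_quorum` replicas ack.

    The caller waits for the acks before returning, which is where
    the extra latency of synchronous replication comes from.
    """
    acks = sum(1 for replica in replicas if replica.write(key, value))
    return acks >= write_quorum


replicas = [Replica("r1"), Replica("r2"), Replica("r3", available=False)]

# Two of three replicas are up, so a quorum of 2 is reachable.
first_ok = quorum_write(replicas, "order-42", "paid", write_quorum=2)

# A second replica goes down; the write must now be rejected.
replicas[1].available = False
second_ok = quorum_write(replicas, "order-43", "paid", write_quorum=2)

print(first_ok, second_ok)
```

Rejecting the second write is the durability side of the tradeoff: the system prefers refusing a write to acknowledging one that lives on a single node.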
Q11. What is active-active vs active-passive DR?
Active-active runs traffic in multiple regions simultaneously. Active-passive keeps a standby region idle or warm. Active-active fails over faster but makes consistency harder. Active-passive is simpler and cheaper.
Q12. How do you protect against accidental deletion?
Enable soft delete, versioning, and retention locks. Use least-privilege IAM. Add prevent_destroy in Terraform for critical resources. Human error is a leading cause of outages.
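Soft delete combined with versioning can be sketched as a store where deleting only appends a marker, so the previous version stays recoverable. This is a toy in-memory model loosely inspired by how versioned object stores behave; the `VersionedStore` class and its method names are hypothetical.

```python
class VersionedStore:
    """Keeps every version of a key; delete only adds a marker."""

    _DELETED = object()  # sentinel used as the delete marker

    def __init__(self):
        self._versions = {}

    def put(self, key, value):
        """Append a new version instead of overwriting the old one."""
        self._versions.setdefault(key, []).append(value)

    def get(self, key):
        """Return the latest version, or raise KeyError if absent/deleted."""
        history = self._versions.get(key, [])
        if not history or history[-1] is self._DELETED:
            raise KeyError(key)
        return history[-1]

    def delete(self, key):
        """Soft delete: history is kept, the key just appears gone."""
        self.get(key)  # raises KeyError if missing or already deleted
        self._versions[key].append(self._DELETED)

    def undelete(self, key):
        """Drop the delete marker so the previous version is visible."""
        history = self._versions.get(key, [])
        if history and history[-1] is self._DELETED:
            history.pop()


store = VersionedStore()
store.put("config", "v1")
store.put("config", "v2")
store.delete("config")  # the accidental deletion

try:
    store.get("config")
    gone = False
except KeyError:
    gone = True

store.undelete("config")
recovered = store.get("config")
print(gone, recovered)
```

A retention lock would go one step further by refusing to purge old versions for a fixed period, which this sketch does not model.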
Q13. What should a good DR runbook contain?
Step-by-step recovery instructions, commands, credentials source, decision tree, rollback steps, contact chain. Written for execution under stress. No assumptions.
Q14. How do you design artifact & image DR?
Use multi-region registry replication. Keep builds reproducible via Dockerfile + lockfiles. Store artifacts in a durable repository with versioning. CI should be able to rebuild images if the registry is lost.
Q15. What is the biggest real-world DR mistake teams make?
Having backups but no restore procedure or test. DR is not backup; DR is verified recovery ability. Untested DR plans fail during real outages.