
🧠 DevOps Monitoring + Logging – End-to-End Architecture Interview Round (15 Q&A)


✅ Q1 – Design a full observability stack for a Kubernetes production platform.

I design around the three pillars: metrics, logs, and traces. Metrics via Prometheus + Grafana, logs via Fluent Bit → Elasticsearch/Loki, traces via OpenTelemetry + Jaeger/Tempo. Alerts go through Alertmanager → PagerDuty/Slack. Everything runs HA and is namespace-isolated. Data retention and cost controls are defined from day one.


✅ Q2 – How do you decide what to monitor at the infra vs. app level?

Infra layer: node CPU, memory, disk, network, kubelet, API server. Platform layer: pod restarts, scheduling failures, autoscaler events. App layer: request rate, latency, error %, queue depth. Business layer: SLO/SLA metrics. Monitoring must map to user impact, not just system stats.


✅ Q3 – How do you prevent alert fatigue in large systems?

Use severity levels and alert grouping. Alert on symptoms at the service level, not on every pod. Add inhibition rules so root-cause alerts suppress child alerts. Every alert must have a runbook. If an alert is not actionable, delete it.
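
A minimal sketch of the grouping + inhibition idea, written as a Python dict rendered to Alertmanager YAML (needs PyYAML). The alert and label names (NodeDown, severity, node) are illustrative assumptions, not a real config.

```python
# Sketch: grouping plus an inhibition rule so a root-cause alert (NodeDown)
# suppresses symptom alerts from the same node. Names are illustrative.
import yaml  # PyYAML

alertmanager_config = {
    "route": {
        "group_by": ["alertname", "namespace"],  # batch related alerts into one notification
        "group_wait": "30s",
        "repeat_interval": "4h",
    },
    "inhibit_rules": [
        {
            "source_matchers": ['alertname="NodeDown"'],  # the root-cause alert
            "target_matchers": ['severity="warning"'],    # symptom/child alerts
            "equal": ["node"],                            # only suppress alerts sharing the node label
        }
    ],
}

print(yaml.safe_dump(alertmanager_config, sort_keys=False))
```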


✅ Q4 – What is your alert design philosophy?

Alert on user impact and SLO breaches, not raw resource usage. CPU at 90% alone is not an alert; an error-rate spike is. Use multi-signal alerts (rate + latency). Alerts must be few, meaningful, and actionable.
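
A minimal sketch of a multi-signal check against the Prometheus HTTP API: flag the service only when error rate and p99 latency breach together. The Prometheus URL and metric names are conventional assumptions for illustration.

```python
# Sketch: multi-signal check via the Prometheus HTTP API.
# Only treat the service as unhealthy when error rate AND latency breach together.
import requests

PROM_URL = "http://prometheus:9090"  # assumption for illustration

def instant_query(promql: str) -> float:
    """Run an instant PromQL query and return the first scalar sample (0.0 if empty)."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

error_ratio = instant_query(
    'sum(rate(http_requests_total{job="checkout",status=~"5.."}[5m]))'
    ' / sum(rate(http_requests_total{job="checkout"}[5m]))'
)
p99_latency = instant_query(
    'histogram_quantile(0.99,'
    ' sum by (le) (rate(http_request_duration_seconds_bucket{job="checkout"}[5m])))'
)

# Fire on combined user-impacting symptoms, not on a single raw metric.
if error_ratio > 0.01 and p99_latency > 0.5:
    print(f"ALERT: checkout degraded (errors={error_ratio:.2%}, p99={p99_latency:.2f}s)")
```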


✅ Q5 – How do you design a log pipeline for high-volume microservices?

Use node-level agents (a Fluent Bit DaemonSet). Parse minimally at the edge, enrich centrally. Buffer through Kafka if volume is high. Store structured JSON logs. Apply index lifecycle and retention rules early.
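
A minimal sketch of the "structured JSON at the source" part, using only the Python standard library: one JSON object per log line so the node-level agent tailing stdout needs no edge parsing. Field names (service, order_id, ...) are illustrative conventions.

```python
# Sketch: emit one JSON object per log line so the node agent needs no edge parsing.
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "service": "payments-api",          # illustrative service name
            "message": record.getMessage(),
        }
        # Carry structured context passed via `extra=...` (e.g. order_id, trace_id).
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payments")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("payment authorized", extra={"context": {"order_id": "o-123", "latency_ms": 87}})
```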


✅ Q6 – Logs vs. Metrics vs. Traces – when to use which?

Metrics = trends & alerting (cheap, numeric, long-term). Logs = detailed event records (debugging & audit). Traces = the request path across services (latency root cause). You need all three; they answer different questions.


✅ Q7 – How do you correlate metrics and logs during an incident?

Use common labels like service, pod, and trace_id. Dashboards link to log queries. Traces include request IDs logged by services. Correlation fields must be standardized across the stack.
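
A minimal sketch of the trace_id side of this: a logging filter that stamps every log line with the active OpenTelemetry trace ID. It assumes opentelemetry-api is installed and a tracer/SDK is configured elsewhere.

```python
# Sketch: stamp every log line with the active OpenTelemetry trace_id
# so logs, traces and dashboards can be joined on one field.
import logging

from opentelemetry import trace

class TraceIdFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        span_ctx = trace.get_current_span().get_span_context()
        # 32-char hex trace id, or "-" when no span is active.
        record.trace_id = f"{span_ctx.trace_id:032x}" if span_ctx.is_valid else "-"
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s trace_id=%(trace_id)s %(message)s"))
handler.addFilter(TraceIdFilter())

logger = logging.getLogger("orders")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("order placed")   # -> ... trace_id=<32-hex-chars> order placed
```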


✅ Q8 – How do you monitor ephemeral workloads like Kubernetes jobs?

Use kube-state-metrics for job status and completion metrics. Push job metrics via the Pushgateway if needed. Logs are critical since metric lifetimes are short. Alerts focus on failure count and duration.
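
A minimal sketch of the Pushgateway path using the official prometheus_client library: a short-lived job pushes its outcome so Prometheus can still scrape it after the pod is gone. The gateway address, metric names, and run_nightly_export are assumptions for illustration.

```python
# Sketch: a short-lived batch job pushes its outcome to a Pushgateway.
import time

from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

def run_nightly_export() -> None:
    """Placeholder for the real batch work."""
    time.sleep(1)

registry = CollectorRegistry()
duration = Gauge("batch_job_duration_seconds", "Wall-clock runtime of the job", registry=registry)
last_success = Gauge("batch_job_last_success_timestamp", "Unix time of the last successful run", registry=registry)

start = time.time()
try:
    run_nightly_export()
    last_success.set_to_current_time()
finally:
    duration.set(time.time() - start)
    # `job` becomes the Pushgateway grouping key; alert on failure count and duration from these series.
    push_to_gateway("pushgateway:9091", job="nightly-export", registry=registry)
```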


✅ Q9 – What retention strategy do you design for metrics and logs?

Metrics: short-term high-resolution storage + long-term downsampled remote storage. Logs: hot (7–14 days, searchable) + warm/archive storage. Retention is cost-driven and compliance-driven. Never keep everything forever.
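
A minimal sketch of the log side, creating an Elasticsearch ILM policy (hot → warm → delete) via the standard _ilm/policy REST endpoint. The cluster URL, policy name, and day counts are example assumptions.

```python
# Sketch: define an Elasticsearch ILM policy so log indices move
# hot -> warm -> delete instead of living forever. Day counts are examples.
import requests

ES_URL = "http://elasticsearch:9200"   # assumption for illustration

policy = {
    "policy": {
        "phases": {
            "hot":    {"actions": {"rollover": {"max_age": "1d", "max_primary_shard_size": "50gb"}}},
            "warm":   {"min_age": "14d", "actions": {"shrink": {"number_of_shards": 1}}},
            "delete": {"min_age": "30d", "actions": {"delete": {}}},
        }
    }
}

resp = requests.put(f"{ES_URL}/_ilm/policy/app-logs", json=policy, timeout=10)
resp.raise_for_status()
print(resp.json())   # {"acknowledged": true}
```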


✅ Q10 – How do you design monitoring for a multi-cluster setup?

Each cluster runs local collectors. Metrics remote-write to a central long-term store (Thanos/Cortex). Logs are aggregated centrally. Dashboards are multi-cluster aware via cluster labels. Avoid single-cluster blind spots.
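
A minimal sketch of the "multi-cluster aware via labels" point: one query against a central Thanos Query endpoint (Prometheus-compatible API), grouped by an external cluster label. The URL, metric, and label name are assumptions.

```python
# Sketch: one query against a central Thanos Query endpoint, grouped by the
# external `cluster` label each Prometheus attaches on remote write.
import requests

THANOS_QUERY = "http://thanos-query:9090"   # assumption for illustration

resp = requests.get(
    f"{THANOS_QUERY}/api/v1/query",
    params={"query": "sum by (cluster) (rate(http_requests_total[5m]))"},
    timeout=10,
)
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    cluster = series["metric"].get("cluster", "unknown")
    rps = float(series["value"][1])
    print(f"{cluster}: {rps:.1f} req/s")   # a missing cluster here is a blind spot to investigate
```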


✅ Q11 – What breaks first when the observability stack is under-sized?

Usually Elasticsearch or Prometheus memory goes first. High cardinality kills Prometheus. High shard counts kill Elasticsearch. Capacity planning must be done for the observability stack itself.
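
A minimal sketch for the cardinality side: list the metrics with the most time series via a topk/count-by-__name__ query against the Prometheus HTTP API. The URL is an assumption; note the query itself is expensive, so run it deliberately.

```python
# Sketch: list the metrics with the most time series (highest cardinality),
# i.e. the usual suspects when Prometheus memory blows up.
import requests

PROM_URL = "http://prometheus:9090"   # assumption for illustration

# Count series per metric name and keep the top 10.
query = 'topk(10, count by (__name__) ({__name__=~".+"}))'

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}, timeout=30)
resp.raise_for_status()

for sample in resp.json()["data"]["result"]:
    print(f'{sample["metric"]["__name__"]}: {sample["value"][1]} series')
```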


✅ Q12 – How do you monitor the monitoring system?

Meta-monitoring. Prometheus scrapes itself. Alert on scrape failures, ingestion-rate drops, and rule-evaluation errors. The logging pipeline has its own health metrics. If monitoring dies silently, you're blind.
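
A minimal sketch of an external watchdog in the dead-man's-switch spirit: it probes Prometheus's own health signals and pages through a channel that does not depend on the monitoring stack. The queries use Prometheus's built-in self-metrics; the URL and page_oncall are assumptions.

```python
# Sketch: an external watchdog that checks Prometheus itself. If these queries
# fail, page a human via a path that does NOT depend on the monitoring stack.
import requests

PROM_URL = "http://prometheus:9090"   # assumption for illustration

CHECKS = {
    # targets Prometheus failed to scrape
    "failed_scrapes": "count(up == 0)",
    # samples/sec being ingested; near zero means ingestion has stalled
    "ingest_rate": "rate(prometheus_tsdb_head_samples_appended_total[5m])",
    # recording/alerting rules that error out
    "rule_eval_failures": "sum(rate(prometheus_rule_evaluation_failures_total[5m]))",
}

def page_oncall(msg: str) -> None:
    print(f"PAGE: {msg}")   # placeholder for SMS/phone/secondary channel

for name, promql in CHECKS.items():
    try:
        r = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql}, timeout=10)
        r.raise_for_status()
    except requests.RequestException as exc:
        page_oncall(f"Prometheus unreachable while checking {name}: {exc}")
        break
```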


✅ Q13 – How do you design SLO-based monitoring?

Define SLI metrics like success rate and latency percentile. Create SLO targets (e.g., 99.9%). Use burn-rate alerts instead of raw thresholds. This ties alerts to user experience.
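
A minimal sketch of the burn-rate math for a 99.9% availability SLO. The 14.4x threshold and the 1h/5m window pair are commonly used multi-window values; the observed error ratios are made-up example numbers that would normally come from PromQL.

```python
# Sketch of burn-rate math for a 99.9% availability SLO.
# burn_rate = observed error ratio / error budget; a fast-burn alert commonly
# fires when both a long and a short window burn at >= 14.4x
# (i.e. the whole 30-day budget would be gone in ~2 days).
SLO_TARGET = 0.999
ERROR_BUDGET = 1 - SLO_TARGET          # 0.1% of requests may fail over the window

def burn_rate(error_ratio: float) -> float:
    return error_ratio / ERROR_BUDGET

# Example observed error ratios (would come from PromQL over 1h and 5m windows).
error_ratio_1h = 0.0020                # 0.2% errors in the last hour
error_ratio_5m = 0.0030                # 0.3% errors in the last 5 minutes

fast_burn = burn_rate(error_ratio_1h) >= 14.4 and burn_rate(error_ratio_5m) >= 14.4
print(f"1h burn={burn_rate(error_ratio_1h):.1f}x, 5m burn={burn_rate(error_ratio_5m):.1f}x, page={fast_burn}")
# -> 1h burn=2.0x, 5m burn=3.0x, page=False
```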


✅ Q14 – Security considerations in logging architecture?

Mask PII and secrets at the source. Restrict log access via RBAC. Encrypt in transit and at rest. Logs often contain sensitive data; treat them as a high-risk asset.
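
A minimal sketch of masking at the source: a logging filter that redacts emails and bearer tokens before a line ever leaves the process. The regexes are illustrative and deliberately not exhaustive; real setups usually also mask at the collector.

```python
# Sketch: redact obvious PII/secrets at the source, before the log line
# ever reaches the collector. Regexes are illustrative, not exhaustive.
import logging
import re

PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<email>"),
    (re.compile(r"(?i)bearer\s+[a-z0-9._-]+"), "Bearer <redacted>"),
]

class RedactionFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        msg = record.getMessage()
        for pattern, replacement in PATTERNS:
            msg = pattern.sub(replacement, msg)
        record.msg, record.args = msg, None   # replace the formatted message
        return True

logger = logging.getLogger("api")
logger.addHandler(logging.StreamHandler())
logger.setLevel(logging.INFO)
logger.addFilter(RedactionFilter())

logger.info("login ok for jane@example.com, auth=Bearer eyJabc.def")
# -> login ok for <email>, auth=Bearer <redacted>
```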


✅ Q15 – Biggest real-world observability mistake teams make?

Collecting everything but defining nothing. No SLOs, no alert philosophy, no retention control. Result = huge cost + alert noise + zero signal. Observability must be designed, not accumulated.

