DevOps Monitoring + Logging: End-to-End Architecture Interview Round (15 Q&A)
Q1. Design a full observability stack for a Kubernetes production platform.
I design around the three pillars: metrics, logs, and traces. Metrics via Prometheus + Grafana, logs via Fluent Bit → Elasticsearch/Loki, traces via OpenTelemetry + Jaeger/Tempo. Alerts go through Alertmanager → PagerDuty/Slack. Everything runs HA and is namespace-isolated. Data retention and cost controls are defined from day one.
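A minimal sketch of how one application service plugs into the tracing pillar of this stack, assuming an OTLP-compatible collector (Tempo or a Jaeger/OTel collector) is reachable at the placeholder address tempo-gateway:4317; the service name and span attribute are purely illustrative.

```python
# Minimal OpenTelemetry tracing setup for one service in the stack above.
# Requires: pip install opentelemetry-sdk opentelemetry-exporter-otlp
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Identify the service so traces can be filtered per team/namespace.
resource = Resource.create({"service.name": "checkout-api"})  # illustrative name

provider = TracerProvider(resource=resource)
# "tempo-gateway:4317" is a placeholder for your OTLP endpoint.
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="tempo-gateway:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("charge-card") as span:
    span.set_attribute("order.id", "12345")  # hypothetical attribute
```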
Q2. How do you decide what to monitor at the infra vs. app level?
Infra layer: node CPU, memory, disk, network, kubelet, API server. Platform layer: pod restarts, scheduling failures, autoscaler events. App layer: request rate, latency, error %, queue depth. Business layer: SLO/SLA metrics. Monitoring must map to user impact, not just system stats.
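A sketch of the app-layer signals (request rate, latency, errors) exposed with the prometheus_client library; the metric names, route, and simulated error rate are assumptions for illustration.

```python
# Sketch: app-layer RED metrics (rate, errors, duration) with prometheus_client.
# Requires: pip install prometheus-client
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Counter is exported as http_requests_total; Histogram uses default buckets.
REQUESTS = Counter("http_requests", "Total requests", ["route", "status"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency", ["route"])

def handle_request(route: str) -> None:
    start = time.perf_counter()
    status = "200" if random.random() > 0.05 else "500"  # simulated outcome
    LATENCY.labels(route=route).observe(time.perf_counter() - start)
    REQUESTS.labels(route=route, status=status).inc()

if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes /metrics on this port
    while True:
        handle_request("/checkout")
        time.sleep(0.1)
```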
Q3. How do you prevent alert fatigue in large systems?
Use severity levels and alert grouping. Alert on symptoms at the service level, not on every pod. Add inhibition rules so root-cause alerts suppress child alerts. Every alert must have a runbook. If an alert is not actionable, delete it.
Q4. What is your alert design philosophy?
Alert on user impact and SLO breach, not on raw resource usage. CPU at 90% alone is not an alert; an error-rate spike is. Use multi-signal alerts (rate + latency). Alerts must be few, meaningful, and actionable.
Q5. How do you design a log pipeline for high-volume microservices?
Use node-level agents (Fluent Bit DaemonSet). Parse minimally at edge, enrich centrally. Buffer using Kafka if volume is high. Store structured JSON logs. Apply index lifecycle and retention rules early.
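A sketch of the "structured JSON logs" point: the application writes JSON to stdout so the Fluent Bit DaemonSet can forward records with minimal edge parsing. The service field and timestamp format are assumptions.

```python
# Sketch: structured JSON logging to stdout (container log -> Fluent Bit).
import json
import logging
import time

class JsonFormatter(logging.Formatter):
    """Render every log record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "service": "checkout-api",  # illustrative static enrichment field
        }
        return json.dumps(payload)

handler = logging.StreamHandler()  # stdout
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger("orders").info("payment accepted")
```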
Q6. Logs vs. metrics vs. traces: when do you use which?
Metrics = trends & alerting (cheap, numeric, long-term). Logs = detailed event records (debugging & audit). Traces = the request path across services (latency root cause). You need all three; they answer different questions.
Q7. How do you correlate metrics and logs during an incident?
Use common labels like service, pod, and trace_id. Dashboards link to log queries. Traces include request IDs that services also log. Correlation fields must be standardized across the stack.
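One way to standardize the trace_id correlation field, sketched with a logging filter that stamps the active OpenTelemetry trace ID onto every record; it assumes a tracing setup like the one in Q1 and a log format of your choosing.

```python
# Sketch: attach the current OpenTelemetry trace_id to every log record so logs
# and traces can be joined on a single field.
import logging

from opentelemetry import trace

class TraceIdFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        ctx = trace.get_current_span().get_span_context()
        # 32-char hex trace id, or "-" when no span is active.
        record.trace_id = f"{ctx.trace_id:032x}" if ctx.is_valid else "-"
        return True

handler = logging.StreamHandler()
handler.addFilter(TraceIdFilter())
handler.setFormatter(logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s"))
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.info("charging card")  # emits: INFO trace_id=<id or -> charging card
```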
Q8. How do you monitor ephemeral workloads like Kubernetes Jobs?
Use kube-state-metrics for Job status and completion metrics. Push job metrics via the Pushgateway if needed. Logs are critical because the metrics are short-lived. Alerts focus on failure count and duration.
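A sketch of the Pushgateway option for a short-lived job, using prometheus_client; the gateway address, job name, and metric names are placeholders.

```python
# Sketch: a batch job pushes its outcome to a Pushgateway so Prometheus can
# still see it after the pod is gone.
# Requires: pip install prometheus-client
import time

from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
duration = Gauge("batch_job_duration_seconds", "Job runtime", registry=registry)
last_success = Gauge("batch_job_last_success_timestamp",
                     "Unix time of last successful run", registry=registry)

start = time.time()
# ... actual batch work would run here ...
duration.set(time.time() - start)
last_success.set_to_current_time()

# "pushgateway.monitoring:9091" and the job name are placeholders.
push_to_gateway("pushgateway.monitoring:9091", job="nightly-report", registry=registry)
```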
Q9. What retention strategy do you design for metrics and logs?
Metrics: short-term high-resolution retention plus long-term downsampled remote storage. Logs: hot (7–14 days, searchable) plus warm/archive storage. Retention is cost-driven and compliance-driven. Never keep everything forever.
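For the log side, a hedged sketch of creating an index lifecycle (ILM) policy over the Elasticsearch REST API so indices roll over daily and are deleted after 14 days; the cluster URL, policy name, and durations are assumptions, and authentication/TLS settings are omitted.

```python
# Sketch: hot/delete ILM policy via Elasticsearch's _ilm API (auth omitted).
import requests

policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    "rollover": {"max_age": "1d", "max_primary_shard_size": "50gb"}
                }
            },
            "delete": {"min_age": "14d", "actions": {"delete": {}}},
        }
    }
}

resp = requests.put(
    "https://elasticsearch.logging:9200/_ilm/policy/app-logs",  # placeholder URL/name
    json=policy,
    timeout=10,
)
resp.raise_for_status()
```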
Q10. How do you design monitoring for a multi-cluster setup?
Each cluster runs local collectors. Metrics are remote-written to a central long-term store (Thanos/Cortex). Logs are aggregated centrally. Dashboards are multi-cluster aware via labels. Avoid single-cluster blind spots.
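A sketch of a multi-cluster query against the central store's Prometheus-compatible HTTP API, grouped by an external "cluster" label added at remote-write time; the endpoint URL, job name, and label name are assumptions.

```python
# Sketch: one query, all clusters, via a central Thanos/Cortex query endpoint.
import requests

QUERY = 'sum by (cluster) (rate(http_requests_total{job="checkout-api"}[5m]))'

resp = requests.get(
    "https://thanos-query.example.com/api/v1/query",  # placeholder endpoint
    params={"query": QUERY},
    timeout=10,
)
resp.raise_for_status()

# Prometheus API shape: data.result is a list of {metric: {...}, value: [ts, val]}.
for series in resp.json()["data"]["result"]:
    print(series["metric"].get("cluster", "unknown"), series["value"][1])
```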
Q11. What breaks first when the observability stack is under-sized?
Usually Elasticsearch or Prometheus memory goes first. High cardinality kills Prometheus; a high shard count kills Elasticsearch. Capacity planning must be done for the observability stack itself.
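A quick cardinality check can be sketched against Prometheus' TSDB status endpoint (available on recent Prometheus versions) to find the metrics driving series growth; the Prometheus URL is a placeholder.

```python
# Sketch: list the biggest contributors to series cardinality.
import requests

resp = requests.get("http://prometheus.monitoring:9090/api/v1/status/tsdb", timeout=10)
resp.raise_for_status()
stats = resp.json()["data"]

print("Head series:", stats["headStats"]["numSeries"])
for entry in stats["seriesCountByMetricName"][:10]:
    print(f'{entry["name"]}: {entry["value"]} series')
```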
Q12. How do you monitor the monitoring system?
Meta-monitoring. Prometheus scrapes itself. Alert on scrape failures, ingestion rate drops, and rule evaluation errors. The logging pipeline exposes its own health metrics. If monitoring dies silently, you're blind.
Q13. How do you design SLO-based monitoring?
Define SLI metrics like success rate and latency percentile. Create SLO targets (e.g., 99.9%). Use burn-rate alerts instead of raw thresholds. This ties alerts to user experience.
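The burn-rate arithmetic behind such alerts, as a small worked example: burn rate = observed error rate / error budget, where the error budget is 1 − SLO. The window sizes, measured error rates, and the 14.4× paging threshold (a common multi-window pattern popularized by the Google SRE workbook) are assumptions here, not fixed rules.

```python
# Sketch: burn-rate math for a multi-window SLO alert.
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than 'sustainable' the error budget is burning."""
    return error_rate / (1.0 - slo)

slo = 0.999                                    # 99.9% target -> 0.1% error budget
fast = burn_rate(error_rate=0.020, slo=slo)    # e.g. measured over a 1h window -> 20x
slow = burn_rate(error_rate=0.018, slo=slo)    # e.g. measured over a 6h window -> 18x

# At a sustained 14.4x burn, a 30-day budget is gone in roughly 2 days,
# so both windows exceeding that threshold is a reasonable paging condition.
if fast >= 14.4 and slow >= 14.4:
    print("page: error budget burning too fast")
```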
Q14. What are the security considerations in a logging architecture?
Mask PII and secrets at the source. Restrict log access via RBAC. Encrypt in transit and at rest. Logs often contain sensitive data; treat them as a high-risk asset.
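A sketch of masking at the source with a logging filter that redacts obvious patterns before anything leaves the pod; the two regexes are illustrative only and not an exhaustive PII/secret catalogue.

```python
# Sketch: redact emails and bearer tokens in log messages at the source.
import logging
import re

PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<email>"),           # email addresses
    (re.compile(r"(?i)bearer\s+[a-z0-9._-]+"), "Bearer <token>"),  # bearer tokens
]

class RedactFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        msg = record.getMessage()
        for pattern, replacement in PATTERNS:
            msg = pattern.sub(replacement, msg)
        record.msg, record.args = msg, None  # freeze the redacted message
        return True

handler = logging.StreamHandler()
handler.addFilter(RedactFilter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.info("login for alice@example.com with Bearer abc.def.ghi")
# -> login for <email> with Bearer <token>
```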
Q15. Biggest real-world observability mistake teams make?
Collecting everything but defining nothing. No SLOs, no alert philosophy, no retention control. Result = huge cost + alert noise + zero signal. Observability must be designed, not accumulated.