Q1. Prometheus stopped showing new metrics - what do you check first?
First check the Prometheus targets page (/targets) for scrape status. If targets are down, verify service discovery and endpoint reachability. Then check the Prometheus pod logs for scrape errors. Most failures come down to networking, TLS, or a wrong scrape config.
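As a sanity check on the scrape side, here is a minimal sketch of a Kubernetes scrape job (the job name, namespace, and port name are assumptions, not from the post); a wrong scheme, port, or metrics path here is a common reason targets show as down on /targets.

```yaml
scrape_configs:
  - job_name: "example-app"            # hypothetical job name
    scheme: http                       # wrong scheme (http vs https) is a frequent scrape failure
    metrics_path: /metrics             # must match what the app actually exposes
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names: ["example"]           # assumed namespace
    relabel_configs:
      # keep only endpoints whose port is named "http-metrics" (assumed port name)
      - source_labels: [__meta_kubernetes_endpoint_port_name]
        regex: http-metrics
        action: keep
```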
Q2. Metrics missing for one Kubernetes service only - why?
Most likely a ServiceMonitor/PodMonitor selector mismatch. I compare the labels on the Service and Pods against the monitor's selector, then confirm the metrics endpoint port and path are correct. I also curl the endpoint manually from the Prometheus pod to rule out network issues.
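A minimal ServiceMonitor sketch for comparison (names, labels, and port are assumptions): the selector must match the Service's labels exactly, and the port must reference a named port on that Service.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app            # hypothetical name
  namespace: monitoring
  labels:
    release: prometheus        # assumed label; must match the Operator's serviceMonitorSelector
spec:
  namespaceSelector:
    matchNames: ["example"]    # namespace of the target Service (assumed)
  selector:
    matchLabels:
      app: example-app         # must match the Service's labels exactly
  endpoints:
    - port: http-metrics       # must be a *named* port on the Service
      path: /metrics
      interval: 30s
```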
Q3. Prometheus memory usage keeps growing - root causes?
High-cardinality metrics are the most common cause. Labels like user_id or request_id explode the series count. I check the top series on the TSDB Status page in the Prometheus UI and fix the problem by reducing label cardinality at the exporter level.
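When the exporter cannot be fixed immediately, dropping the offending label or metric at scrape time is a common stop-gap. This is a sketch using metric_relabel_configs; the metric name, label name, and target are assumptions.

```yaml
scrape_configs:
  - job_name: "example-app"                  # hypothetical job
    static_configs:
      - targets: ["example-app:8080"]        # assumed target
    metric_relabel_configs:
      # strip a high-cardinality label so series aggregate instead of multiplying
      - action: labeldrop
        regex: request_id
      # or drop the whole offending metric
      - source_labels: [__name__]
        regex: http_request_debug_info       # hypothetical metric name
        action: drop
```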
Q4. What is high cardinality and why is it dangerous?
High cardinality means too many unique label combinations. Each combination becomes a separate time series, so memory and query cost grow massively. For example, a metric labeled with user_id across 10,000 users and 50 endpoints produces 500,000 series on its own. Bad label design can crash Prometheus.
Q5. Difference between an exporter and instrumentation?
Instrumentation is when the application exposes its own metrics endpoint via a client library. An exporter is a separate component that translates metrics from a system you cannot instrument directly. Example: node-exporter for host metrics vs an app's own /metrics endpoint.
Q6. Alerts are firing too frequently (flapping). How do you fix it?
Add a for: duration to the alert rule so the condition must persist before firing. Tune thresholds against a measured baseline. Alert on aggregates instead of single instances. Flapping alerts kill trust in monitoring.
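A sketch of an aggregated alert with a for: duration (the metric, threshold, and service label are assumptions); the rule only fires if the error ratio stays above the threshold for the full window.

```yaml
groups:
  - name: example-alerts                    # hypothetical group name
    rules:
      - alert: HighErrorRate
        # aggregate across instances so one noisy pod does not flap the alert
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum by (service) (rate(http_requests_total[5m])) > 0.05
        for: 10m                            # condition must hold for 10 minutes before firing
        labels:
          severity: warning
        annotations:
          summary: "Error rate above 5% for {{ $labels.service }}"
```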
β Q7 β Prometheus is down β do you lose all metrics?
Yes, Prometheus stores metrics locally unless using remote write. No backfill by default. For critical systems, use remote storage like Thanos or Cortex. HA Prometheus pairs are also used.
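A minimal remote_write sketch (the URL is a placeholder): with this in place, samples are shipped to long-term storage as they are scraped, so losing the local disk does not mean losing history.

```yaml
remote_write:
  - url: "https://thanos-receive.example.com/api/v1/receive"   # placeholder endpoint
    queue_config:
      max_samples_per_send: 5000      # batch size per request
      capacity: 10000                 # in-memory queue capacity per shard
    write_relabel_configs:
      # optionally drop noisy metrics before they leave Prometheus
      - source_labels: [__name__]
        regex: go_gc_.*               # assumed example of metrics to exclude
        action: drop
```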
Q8. How do you monitor Kubernetes cluster health with Prometheus?
Use kube-state-metrics + node-exporter + cAdvisor metrics. Track node status, pod restarts, resource usage, and API server latency. Dashboards should include cluster, node, workload, and app layers.
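Two example rules on top of kube-state-metrics (the thresholds, durations, and group name are assumptions) covering the pod-restart and node-health signals mentioned above.

```yaml
groups:
  - name: kubernetes-health            # hypothetical group name
    rules:
      - alert: PodRestartingFrequently
        # kube-state-metrics exposes kube_pod_container_status_restarts_total
        expr: increase(kube_pod_container_status_restarts_total[1h]) > 3
        for: 15m
        labels:
          severity: warning
      - alert: NodeNotReady
        # the Ready condition series is 0 when the node is not ready
        expr: kube_node_status_condition{condition="Ready", status="true"} == 0
        for: 10m
        labels:
          severity: critical
```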
Q9. Grafana dashboard shows "No data" but Prometheus has metrics - why?
The datasource or query may be wrong, or the time range may not cover the data. Run the query directly in the Prometheus UI and verify label names and metric spelling. Grafana template-variable filters often hide data silently.
Q10. How do you design alert severity levels?
Define severities such as info, warning, and critical. Critical = user impact or data-loss risk; warning = a degradation trend that needs attention soon. Severity must map to response urgency and the escalation path.
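One common pattern is two rules over the same signal with different thresholds and severity labels (the metric and threshold values are assumptions), so warning gives early notice and critical pages someone.

```yaml
groups:
  - name: latency-severity                  # hypothetical group
    rules:
      - alert: HighLatencyWarning
        expr: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 0.5
        for: 10m
        labels:
          severity: warning                 # degradation trend, no page
      - alert: HighLatencyCritical
        expr: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 2
        for: 5m
        labels:
          severity: critical                # user impact, pages on-call
```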
Q11. What is Alertmanager's role vs Prometheus alerts?
Prometheus evaluates alert rules. Alertmanager handles routing, grouping, deduplication, and notifications; it decides who gets notified and how. Prometheus does detection; Alertmanager does delivery control.
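A sketch of that split in configuration terms: Prometheus ships firing alerts, and Alertmanager's route tree decides grouping and receivers. Receiver names, channels, and label values here are assumptions.

```yaml
route:
  receiver: default-slack            # hypothetical default receiver
  group_by: ["alertname", "cluster", "service"]
  group_wait: 30s                    # wait to batch alerts of the same group
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity="critical"
      receiver: pagerduty-oncall     # hypothetical receiver: critical pages a human

receivers:
  - name: default-slack
    slack_configs:
      - api_url: "https://hooks.slack.com/services/placeholder"   # placeholder webhook
        channel: "#alerts"                                        # assumed channel
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: "<secret>"      # placeholder
```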
Q12. Too many alerts during an outage - how do you control the noise?
Use alert grouping in Alertmanager, grouped by service or cluster labels. Add inhibition rules so the root-cause alert suppresses symptom alerts. Noise reduction is essential for incident response.
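An inhibition sketch (the alert name and label values are assumptions): while a cluster-down alert is firing, lower-severity symptom alerts from the same cluster are muted.

```yaml
inhibit_rules:
  - source_matchers:
      - alertname="ClusterDown"          # hypothetical root-cause alert
      - severity="critical"
    target_matchers:
      - severity="warning"               # mute lower-severity symptoms
    equal: ["cluster"]                   # only inhibit alerts from the same cluster
```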
Q13. How do you monitor application SLOs with Prometheus?
Use RED metrics (rate, errors, duration) for services and USE metrics (utilization, saturation, errors) for resources. Define SLO queries with recording rules. Alert on error rate and latency percentiles. SLO monitoring is better than raw CPU alerts.
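A minimal SLO sketch (metric names, rule names, SLO target, and burn factor are all assumptions): recording rules precompute the error ratio, and the alert fires when the ratio burns through the error budget much faster than a 99.9% target allows.

```yaml
groups:
  - name: slo-recording
    rules:
      # precompute the 5m error ratio per service
      - record: service:http_errors:ratio_rate5m
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum by (service) (rate(http_requests_total[5m]))
  - name: slo-alerts
    rules:
      - alert: ErrorBudgetBurn
        # 99.9% SLO -> 0.1% error budget; 14x is an assumed fast-burn factor
        expr: service:http_errors:ratio_rate5m > 14 * 0.001
        for: 5m
        labels:
          severity: critical
```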
Q14. Prometheus queries are slow - optimization steps?
Use recording rules to precompute heavy queries. Narrow the time range and add label filters so fewer series are selected. Avoid regex-heavy selectors. Optimize cardinality first, then tune queries.
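A sketch of the recording-rule pattern (the metric and rule names are assumptions): the expensive aggregation runs once per evaluation interval, and dashboards query the cheap precomputed series instead.

```yaml
groups:
  - name: dashboard-precompute           # hypothetical group
    interval: 1m                         # evaluate once a minute instead of on every dashboard load
    rules:
      - record: job:http_request_duration_seconds:p99_5m
        expr: |
          histogram_quantile(0.99,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
```

Dashboards then chart job:http_request_duration_seconds:p99_5m directly instead of re-running the histogram_quantile over raw buckets.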
Q15. Production best practices for Prometheus retention and storage?
Keep short retention locally (7-15 days). Use remote storage for long-term metrics. Monitor TSDB size and compaction. Never let the Prometheus disk fill; it will crash.
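With the Prometheus Operator, retention looks roughly like the sketch below (the values and instance name are assumptions); without the Operator, the equivalent flags are --storage.tsdb.retention.time and --storage.tsdb.retention.size.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s                              # hypothetical instance name
  namespace: monitoring
spec:
  retention: 15d                         # time-based retention
  retentionSize: 45GB                    # size-based cap as a safety net below disk capacity
  storage:
    volumeClaimTemplate:
      spec:
        resources:
          requests:
            storage: 50Gi                # assumed PVC size; keep headroom above retentionSize
```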