
🐧 Linux Debugging: 15 Real-World DevOps Interview Q&A


✅ Q1: Server CPU is at 100%. How do you debug?

First I run top or htop to identify the process consuming CPU. Then I rank processes with ps -eo pid,cmd,%cpu --sort=-%cpu | head. If one process stands out, I check its per-thread usage with top -H -p PID. Finally I decide whether it's expected load, a busy-loop bug, or a runaway job.
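A minimal sketch of the ranking step. One caveat: the %cpu that procps ps reports is CPU time divided by the process lifetime, not an instantaneous reading, which is why top comes first. The helper name cpu_hogs and its 50% default are illustrative assumptions, not part of any tool.

```shell
# Keep the header plus rows whose %CPU (column 2) meets a threshold.
# cpu_hogs is a hypothetical helper; it reads ps-style output on stdin.
cpu_hogs() {
  awk -v min="${1:-50}" 'NR == 1 || $2 + 0 >= min + 0'
}

# Typical use on a live host:
#   ps -eo pid,%cpu,comm --sort=-%cpu | cpu_hogs 80
#   top -H -b -n 1 -p PID     # then inspect the threads of the worst PID
```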


✅ Q2: Memory usage is high. How do you find the root cause?

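I start with free -m and read the available column, not just used. Then I sort top by memory (Shift+M) or run ps aux --sort=-%mem. I separate page cache from real usage, because Linux uses free memory for cache aggressively and gives it back under pressure. Growing swap usage is the key signal of real memory pressure.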
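The cache-versus-real-usage check can be sketched by reading /proc/meminfo directly, which is where free gets its numbers. The function names are hypothetical, and the optional file argument exists only so the parsing can be tried on a saved copy.

```shell
# Print MemAvailable as a whole-number percentage of MemTotal.
# Reads /proc/meminfo by default; pass another file to parse a saved copy.
mem_available_pct() {
  awk '/^MemTotal:/     { total = $2 }
       /^MemAvailable:/ { avail = $2 }
       END { if (total > 0) printf "%d\n", avail * 100 / total }' "${1:-/proc/meminfo}"
}

# Swap currently in use, in kB (SwapTotal minus SwapFree).
swap_used_kb() {
  awk '/^SwapTotal:/ { t = $2 }
       /^SwapFree:/  { f = $2 }
       END { print t - f }' "${1:-/proc/meminfo}"
}
```

A low available percentage together with rising swap usage suggests real pressure; a high "used" figure with plenty available is usually just cache.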


✅ Q3: System is slow but CPU is low. What next?

Check IO wait with top (%wa) and iostat -x. High IO wait together with high device utilization (%util) points to a disk bottleneck. Then find the processes doing the IO with iotop. Slow systems are often IO-bound, not CPU-bound.
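If iostat is not installed, the same counter is available in /proc/stat. The sketch below reports the iowait share since boot, so it is a long-run average rather than the interval figure top shows; iowait_pct is a hypothetical name.

```shell
# Share of CPU time spent in iowait since boot, from the aggregate "cpu" line.
# Fields after the label: user nice system idle iowait irq softirq steal ...
# Guest columns are skipped because they are already counted in user/nice.
iowait_pct() {
  awk '$1 == "cpu" {
         total = 0
         for (i = 2; i <= 9 && i <= NF; i++) total += $i
         if (total > 0) printf "%.1f\n", $6 * 100 / total
       }' "${1:-/proc/stat}"
}
```

For a live reading, sample twice a few seconds apart and compare the deltas, or simply use iostat -x 1.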


✅ Q4: Disk is full, but du shows less usage. Why?

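Most likely deleted files are still held open by running processes. df keeps counting their blocks until the last file descriptor closes, but du cannot see them because they no longer have a directory entry. Check with lsof | grep deleted or lsof +L1. Restarting that process releases the space. This is common with log files.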
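When lsof is not available, the same information can be read from /proc, where the symlink for an unlinked file ends in " (deleted)". A minimal sketch, with deleted_open_files as a hypothetical helper name:

```shell
# List file descriptors of one PID that point at deleted files.
# Output: "<fd number> <original path> (deleted)"
deleted_open_files() {
  local pid=$1 fd target
  for fd in /proc/"$pid"/fd/*; do
    target=$(readlink "$fd" 2>/dev/null) || continue
    case $target in
      *' (deleted)') printf '%s %s\n' "${fd##*/}" "$target" ;;
    esac
  done
}

# Example: deleted_open_files 4242   (run as root for other users' processes)
```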


✅ Q5: Which process is using a port?

Use ss -tulpn or netstat -tulpn, as root if you need the PIDs of other users' processes. It shows the PID and program bound to each port. Useful for port conflicts and security checks.
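To narrow the listing to one port without matching lookalikes (8080 versus 18080), a small filter can anchor on the local address column. This is a sketch that assumes the usual ss -tulpn column layout, with the local address in column 5; port_owner is a hypothetical name.

```shell
# Keep the header and any row whose local address (column 5) ends in :PORT.
# Reads `ss -tulpn` output on stdin.
port_owner() {
  awk -v port="$1" 'NR == 1 || $5 ~ (":" port "$")'
}

# Example: sudo ss -tulpn | port_owner 8080
```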


✅ Q6: App cannot connect to a remote host. Debug steps?

Test DNS with dig or nslookup. Test reachability with ping and traceroute, keeping in mind that a failed ping is not conclusive where ICMP is filtered. Test the port with nc -zv host port or telnet host port. This separates DNS, routing, and firewall issues.


✅ Q7: High load average. What does it actually mean?

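Load average is the average number of tasks that are runnable or in uninterruptible sleep (usually waiting on IO), so it is not just CPU usage. I compare it with the core count: load above the number of cores means tasks are queuing. High load with low CPU often means IO wait or lock contention. I correlate it with CPU and IO stats.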
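Normalizing by core count makes the number easier to judge across machines. A minimal sketch, with load_per_core as a hypothetical helper; the optional arguments let it read a saved copy of /proc/loadavg and an explicit core count.

```shell
# 1-minute load average divided by the number of CPUs.
# Usage: load_per_core [loadavg-file] [cores]
load_per_core() {
  local cores
  cores=${2:-$(nproc)}
  awk -v n="$cores" '{ printf "%.2f\n", $1 / n }' "${1:-/proc/loadavg}"
}

# Tasks in uninterruptible sleep also count toward load:
#   ps -eo stat | grep -c '^D'
```

A value well above 1.00 with idle CPUs is the pattern that points at IO or locking rather than compute.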


✅ Q8: Process keeps crashing. How do you investigate?

Check the application logs first, then journalctl -u SERVICE if it runs under systemd. Check the exit code and any core dumps. Run the process manually in the foreground if possible. Look for OOM kills in dmesg.
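The exit code step is worth a concrete example: a shell reports 128 plus the signal number when a child is killed by a signal, so 137 means SIGKILL (often the OOM killer) and 139 means SIGSEGV. The helper below is a hypothetical sketch of that decoding; a program can also choose such a code itself, so treat the result as a hint.

```shell
# Translate a shell-style exit status into a hint.
# Statuses 129..192 are read as "terminated by signal (status - 128)".
explain_exit() {
  local status=$1
  if [ "$status" -gt 128 ] && [ "$status" -le 192 ]; then
    echo "killed by signal $(kill -l $((status - 128)))"
  elif [ "$status" -eq 0 ]; then
    echo "clean exit"
  else
    echo "exited with error code $status"
  fi
}

# Example: ./myapp; explain_exit $?     (./myapp is a placeholder)
```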


✅ Q9: How do you detect OOM killer events?

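Run dmesg | grep -i oom, or journalctl -k | grep -i oom if the kernel ring buffer has already rotated. The kernel log names the killed process. The same events are visible in syslog or the journal. An OOM kill means the host ran out of memory or a cgroup memory limit was breached.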
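To pull just the victims out of a noisy kernel log, a filter like the one below can help. It assumes the log line contains the wording "Killed process PID (name)", which is what current kernels print; older or vendor kernels may word it differently. oom_victims is a hypothetical name.

```shell
# Print "<pid> <name>" for every OOM kill line read on stdin.
oom_victims() {
  sed -n 's/.*Killed process \([0-9][0-9]*\) (\([^)]*\)).*/\1 \2/p'
}

# Examples:
#   dmesg -T | oom_victims
#   journalctl -k | oom_victims
```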


✅ Q10: File descriptor limit reached. Symptoms and fix?

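The app logs errors like “Too many open files”. The limit is per process, so I check that process with lsof -p PID | wc -l and compare it with ulimit -n or /proc/PID/limits; lsof | wc -l alone counts the whole system. Increase limits in /etc/security/limits.conf for login sessions and with LimitNOFILE= in the unit file for systemd services, which do not read limits.conf. Restart the service.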
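A minimal sketch of the per-process check using only /proc, so it works where lsof is missing. fd_usage is a hypothetical helper name.

```shell
# Print "<open fds> <soft limit>" for one PID, read straight from /proc.
fd_usage() {
  local pid=$1 open soft
  open=$(ls "/proc/$pid/fd" | wc -l)
  soft=$(awk '/^Max open files/ { print $4 }' "/proc/$pid/limits")
  echo "$open $soft"
}

# Example: fd_usage 4242     (run as root for other users' processes)
```

If the first number keeps climbing toward the second, the app is probably leaking descriptors, and raising the limit only buys time.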


✅ Q11: Zombie processes. What are they, and how do you fix them?

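A zombie is a process that has exited but has not been reaped by its parent. It shows as “Z” in the STAT column of ps. A zombie cannot be killed; fix the parent process or restart it so the children are re-parented and reaped. Zombies hold only a process-table entry, but they indicate a bug in the parent.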
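Since the fix lives in the parent, it helps to group zombies by parent PID. A sketch that reads ps output on stdin; zombie_parents is a hypothetical name, and it assumes the column order pid, ppid, stat.

```shell
# Count zombies per parent PID, busiest parent first.
# Expects `ps -eo pid,ppid,stat,comm` output on stdin.
zombie_parents() {
  awk 'NR > 1 && $3 ~ /^Z/ { print $2 }' | sort | uniq -c | sort -rn
}

# Example: ps -eo pid,ppid,stat,comm | zombie_parents
```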


✅ Q12: Service not starting. systemd debug steps?

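Use systemctl status SERVICE, then journalctl -u SERVICE for that unit's log (journalctl -xe shows the tail of the whole journal). Check the ExecStart path and permissions. Run the command manually as the service user. Most failures are path or environment issues.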


✅ Q13: Network connections leaking. How do you detect it?

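Use ss -s for a summary and lsof -i for per-process sockets. Check for a TIME_WAIT flood, which points to connection churn, or a growing CLOSE_WAIT count, which means the app is not closing its sockets. Often caused by connection pool misconfiguration.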
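A quick way to see which state is piling up is to tally the first column of ss -tan. This sketch assumes that column layout (state first); socket_states is a hypothetical name. Note that ss spells the states with hyphens, for example TIME-WAIT.

```shell
# Count TCP sockets per state, largest count first.
# Expects `ss -tan` output on stdin.
socket_states() {
  awk 'NR > 1 { count[$1]++ } END { for (s in count) print count[s], s }' | sort -rn
}

# Example: ss -tan | socket_states
```

Running it a few minutes apart shows whether a state is growing or just high.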


✅ Q14: Cron job not running. How do you debug?

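Check the crontab entry, in both the user crontab (crontab -l) and the system ones (/etc/crontab, /etc/cron.d). Check /var/log/cron, /var/log/syslog, or the journal, depending on the distro. Verify PATH inside cron; it's minimal. Use full paths in cron commands.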
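The PATH problem is easy to reproduce outside cron by running the command with an emptied environment. The sketch below assumes a cron default of PATH=/usr/bin:/bin and /bin/sh as the shell, which is typical but varies by cron implementation; run_like_cron is a hypothetical helper.

```shell
# Run a command line with a cron-like, near-empty environment.
# Assumption: cron's default PATH is /usr/bin:/bin and its shell is /bin/sh.
run_like_cron() {
  env -i HOME="${HOME:-/}" LOGNAME="$(id -un 2>/dev/null)" SHELL=/bin/sh \
      PATH=/usr/bin:/bin /bin/sh -c "$1"
}

# Example: run_like_cron '/opt/scripts/backup.sh'   (placeholder path)
```

If the job fails here with "command not found" but works in your login shell, the fix is a full path or an explicit PATH= line in the crontab.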


✅ Q15: How do you quickly inspect what changed recently on a system?

Check logins with last and lastlog, shell history, and file mtimes with ls -lt or find -mmin. Diff config directories against a backup. Check package updates in the package manager logs, for example /var/log/dpkg.log or /var/log/dnf.log.
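For the file mtime check, find scales better than ls -lt across a whole tree. A minimal sketch, with recently_changed as a hypothetical helper and a 60-minute default chosen only for illustration:

```shell
# List regular files under DIR modified within the last N minutes (default 60).
# -xdev keeps the search on one filesystem.
recently_changed() {
  find "$1" -xdev -type f -mmin "-${2:-60}" 2>/dev/null
}

# Example: recently_changed /etc 120
```

Keep in mind that mtimes can be reset with touch, so this shows ordinary changes, not deliberate tampering.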

