🐧 Linux Debugging: 15 Real-World DevOps Interview Q&A
Q1: Server CPU is at 100%. How do you debug?
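First I run top or htop to identify the process consuming CPU, then confirm with ps -eo pid,cmd,%cpu --sort=-%cpu | head. Note that ps reports CPU averaged over the process lifetime, so top is the better view of what is happening right now. If one process stands out, I check per-thread usage with top -H -p PID. Then I decide whether it is expected load, a bug such as a busy loop, or a runaway job.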
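A minimal sketch of that first pass, assuming procps ps is installed. The function names top_cpu and thread_cpu are hypothetical wrappers, not standard tools.

```shell
# Print the N busiest processes by CPU (default 5), plus the header line.
# Note: ps %cpu is averaged over each process's lifetime.
top_cpu() {
    ps -eo pid,comm,%cpu --sort=-%cpu | head -n "$(( ${1:-5} + 1 ))"
}

# Per-thread CPU for a single PID, to spot one hot thread.
thread_cpu() {
    ps -L -o tid,pcpu,comm -p "$1"
}
```

Typical use: top_cpu 10, then thread_cpu PID for the process at the top of the list.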
Q2: Memory usage is high. How do you find the root cause?
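I start with free -m and read the available column rather than free: Linux aggressively uses spare memory for page cache, and that cache is reclaimable. Then I sort top by memory (Shift+M) or run ps aux --sort=-%mem | head to find the largest processes. Growing swap usage is the key signal of real memory pressure.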
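The cache-versus-real distinction can be sketched by reading /proc/meminfo directly. The function name mem_summary and its output format are assumptions for illustration; MemAvailable requires a reasonably recent kernel.

```shell
# Summarize reclaimable-aware memory headroom and swap use.
# Reads a meminfo-format file; defaults to the live /proc/meminfo.
mem_summary() {
    awk '
        /^MemTotal:/     { total = $2 }
        /^MemAvailable:/ { avail = $2 }
        /^SwapTotal:/    { stotal = $2 }
        /^SwapFree:/     { sfree = $2 }
        END {
            if (total > 0)
                printf "available_pct=%d swap_used_kb=%d\n", avail * 100 / total, stotal - sfree
        }
    ' "${1:-/proc/meminfo}"
}
```

A low available_pct together with non-zero and growing swap_used_kb suggests real pressure rather than cache.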
Q3: System is slow but CPU is low. What next?
I check IO wait in top (%wa) and with iostat -x. High IO wait together with high device utilization (%util) points to a disk bottleneck. Then I find the processes doing the IO with iotop. Slow systems are often IO-bound, not CPU-bound.
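When iostat is not installed, the same IO wait figure can be approximated from two samples of the aggregate cpu line in /proc/stat. This is a sketch; iowait_pct is a hypothetical helper that assumes the documented field order (user, nice, system, idle, iowait, irq, softirq, steal).

```shell
# Read two or more "cpu ..." lines on stdin and print the IO wait
# percentage between each consecutive pair of samples.
iowait_pct() {
    awk '
        $1 == "cpu" {
            total = 0
            for (i = 2; i <= 9; i++) total += $i
            if (seen) {
                dt = total - prev_total
                if (dt > 0) printf "%.1f\n", ($6 - prev_wait) * 100 / dt
            }
            prev_total = total; prev_wait = $6; seen = 1
        }
    '
}
```

Usage: { grep '^cpu ' /proc/stat; sleep 1; grep '^cpu ' /proc/stat; } | iowait_pct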
Q4: Disk is full, but du shows less usage. Why?
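Most likely deleted files are still held open by running processes, so df still counts their blocks while du cannot see them. I check with lsof +L1 or lsof | grep deleted. Restarting or reloading that process releases the space. This is common with log files that were removed by hand instead of being rotated.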
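If lsof is unavailable, the same check can be done per process through /proc. A minimal sketch, assuming a Linux /proc filesystem; deleted_fds is a hypothetical name.

```shell
# List file descriptors of one PID whose target file has been deleted.
# The kernel appends " (deleted)" to the link target in /proc/PID/fd.
deleted_fds() {
    for fd in /proc/"$1"/fd/*; do
        target=$(readlink "$fd" 2>/dev/null) || continue
        case $target in
            *' (deleted)') printf '%s -> %s\n' "$fd" "$target" ;;
        esac
    done
}
```

If a restart is not possible, truncating through the descriptor (for example : > /proc/PID/fd/N) frees the blocks while the process keeps running.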
Q5: Which process is using a port?
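I use ss -tulpn, or netstat -tulpn on older systems. It shows the PID and program bound to each listening port; running it as root is needed to see processes owned by other users. Useful for port conflicts and security checks.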
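To narrow the output to one port, a small filter can help. This sketch assumes the column layout of ss -tulpn, where the local address is the fifth field; port_owner is a hypothetical helper.

```shell
# Filter "ss -tulpn" output (on stdin) down to sockets bound to one local port.
# Splitting on ":" and taking the last piece also handles IPv6 such as [::]:8080.
port_owner() {
    awk -v port="$1" '
        NR > 1 {
            n = split($5, parts, ":")
            if (parts[n] == port) print
        }
    '
}
```

Usage: sudo ss -tulpn | port_owner 8080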
Q6: App cannot connect to a remote host. What are the debug steps?
I test DNS with dig or nslookup, reachability with ping and traceroute, and the port itself with nc -zv host port or telnet. This separates DNS, routing, and firewall issues. A failed ping alone proves little, since ICMP is often blocked.
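The layered check can be sketched as one function that stops at the first failing layer. The name diagnose_conn and its messages are hypothetical; it assumes getent, ping, and a netcat that supports -z and -w are installed.

```shell
# Walk the layers in order: name resolution, ICMP, then the TCP port.
diagnose_conn() {
    host=$1
    port=$2
    if ! getent hosts "$host" >/dev/null 2>&1; then
        echo "dns: cannot resolve $host"
        return 1
    fi
    if ! ping -c 1 -W 2 "$host" >/dev/null 2>&1; then
        # ICMP is often filtered, so this is a hint, not a verdict.
        echo "icmp: no reply from $host (may be filtered)"
    fi
    if ! nc -z -w 3 "$host" "$port" >/dev/null 2>&1; then
        echo "tcp: cannot connect to $host:$port"
        return 2
    fi
    echo "ok: $host:$port is reachable"
}
```

Usage: diagnose_conn db.internal 5432, where the host name is a placeholder.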
Q7: High load average. What does it actually mean?
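On Linux, load average counts tasks that are runnable plus tasks in uninterruptible sleep (usually waiting on IO), averaged over 1, 5, and 15 minutes. It is not just CPU usage. I compare it to the number of cores: high load with low CPU often means IO wait or lock contention, so I correlate it with CPU and IO stats.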
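A quick way to put the number in context is to normalize the 1-minute load by core count. A sketch under the assumption that /proc/loadavg and nproc are available; load_per_core is a hypothetical name.

```shell
# Print the 1-minute load average divided by the number of cores.
# Arguments: loadavg-format file (default /proc/loadavg), core count (default nproc).
load_per_core() {
    cores=${2:-$(nproc)}
    awk -v cores="$cores" '{ printf "%.2f\n", $1 / cores }' "${1:-/proc/loadavg}"
}
```

A value well above 1.00 means more tasks are waiting than there are cores. To see whether they are blocked on IO, list uninterruptible tasks with ps -eo state,pid,comm | awk '$1 == "D"'.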
Q8: Process keeps crashing. How do you investigate?
I check the application logs first, then journalctl -u service if it runs under systemd. I look at the exit code or terminating signal and at any core dumps, and run the process manually in the foreground if possible. I also look for OOM kills in dmesg.
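Reading the exit code is easier with the shell convention that a status above 128 means the process was killed by signal (status minus 128). A small sketch; explain_exit is a hypothetical helper, and it assumes the status was reported by a shell.

```shell
# Translate a shell exit status into "exited" or "killed by signal N".
explain_exit() {
    code=$1
    if [ "$code" -gt 128 ]; then
        sig=$((code - 128))
        echo "killed by signal $sig ($(kill -l "$sig"))"
    else
        echo "exited with status $code"
    fi
}
```

For example, a status of 137 maps to signal 9 (SIGKILL), which is what the OOM killer sends.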
Q9: How do you detect OOM killer events?
I run dmesg -T | grep -i oom. The kernel log names the killed process and its PID. The same messages are visible in syslog or with journalctl -k. An OOM kill means the system, or the process's cgroup, ran out of memory.
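To pull just the victims out of the kernel log, a sed filter is enough. This sketch assumes the message contains "Killed process PID (name)", which is the wording on the kernels I am aware of but may differ between versions; oom_victims is a hypothetical name.

```shell
# Extract "PID name" pairs from kernel OOM messages read on stdin.
oom_victims() {
    sed -n 's/.*Killed process \([0-9][0-9]*\) (\([^)]*\)).*/\1 \2/p'
}
```

Usage: dmesg | oom_victims, or journalctl -k | oom_victims to include earlier log history.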
Q10: File descriptor limit reached. Symptoms and fix?
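The app logs errors like "Too many open files". lsof | wc -l only gives a rough system-wide count, so for the affected process I compare ls /proc/PID/fd | wc -l against /proc/PID/limits; ulimit -n shows only the current shell's limit. I raise limits in /etc/security/limits.conf for login sessions, or with LimitNOFILE= in the unit file for systemd services, then restart the service. If the count keeps growing, the real fix is the descriptor leak in the app.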
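The per-process comparison can be sketched with /proc alone. The helper name fd_usage and its output format are assumptions; reading another user's process usually needs root.

```shell
# Show how many descriptors a PID has open versus its soft limit.
fd_usage() {
    pid=$1
    used=$(ls "/proc/$pid/fd" | wc -l | tr -d ' ')
    # /proc/PID/limits: "Max open files  <soft>  <hard>  files"
    limit=$(awk '/^Max open files/ { print $4 }' "/proc/$pid/limits")
    echo "pid=$pid open=$used soft_limit=$limit"
}
```

Running fd_usage PID a few times in a row shows whether the open count is stable or climbing toward the limit.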
Q11: Zombie processes. What are they and how do you fix them?
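A zombie is a process that has exited but has not been reaped (waited on) by its parent. It shows as Z in the STAT column of ps. It cannot be killed; the fix is to fix or restart the parent, after which init adopts and reaps the zombie. Zombies hold only a process-table entry, but many of them indicate a bug in the parent.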
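Since the fix targets the parent, it helps to list each zombie next to its parent PID. A minimal sketch; zombies is a hypothetical filter over ps output with the columns stat, pid, ppid, comm.

```shell
# Read "ps -eo stat,pid,ppid,comm" output on stdin and report zombies
# together with the parent that has not reaped them.
zombies() {
    awk 'NR > 1 && $1 ~ /^Z/ { print "zombie pid=" $2 " parent=" $3 " cmd=" $4 }'
}
```

Usage: ps -eo stat,pid,ppid,comm | zombies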
Q12: Service not starting. What are the systemd debug steps?
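I run systemctl status service, then journalctl -xe (or journalctl -u service to filter to that unit). I check the ExecStart path, file permissions, and the user the unit runs as, then run the command manually as that user. Most failures are path or environment issues.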
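The ExecStart check can be partly automated. This is a rough sketch, not a unit-file parser: check_execstart is a hypothetical helper that looks only at the first ExecStart= line, strips systemd's special prefix characters, and tests the binary as the current user rather than the unit's user.

```shell
# Verify that the binary named in a unit file's ExecStart= exists and is executable.
check_execstart() {
    bin=$(sed -n 's/^ExecStart=[-@+!:]*\([^ ]*\).*/\1/p' "$1" | head -n 1)
    if [ -z "$bin" ]; then
        echo "no ExecStart= line found in $1"
        return 2
    fi
    if [ -x "$bin" ]; then
        echo "ok: $bin is executable"
    else
        echo "missing or not executable: $bin"
        return 1
    fi
}
```

Usage: check_execstart /etc/systemd/system/myapp.service, where myapp.service is a placeholder unit name.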
Q13: Network connections leaking. How do you detect it?
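I use ss -s for a summary and lsof -i (or ss -tanp) for per-process sockets. A steadily growing count of ESTABLISHED or CLOSE_WAIT sockets means the application is not closing connections; a flood of TIME_WAIT means connections are being opened and closed too quickly. Both are often caused by a misconfigured connection pool.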
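Counting sockets per TCP state makes a leak visible at a glance. A sketch assuming the output of ss -tan, where the state is the first column; count_states is a hypothetical name. Note that ss prints states with a hyphen (TIME-WAIT, CLOSE-WAIT).

```shell
# Tally TCP sockets by state from "ss -tan" output on stdin, largest first.
count_states() {
    awk 'NR > 1 { count[$1]++ } END { for (s in count) print count[s], s }' | sort -rn
}
```

Usage: ss -tan | count_states, repeated every few minutes to see which state keeps growing.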
Q14: Cron job not running. How do you debug?
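I check the entry in both the user crontab (crontab -l) and the system crontabs (/etc/crontab, /etc/cron.d). I check /var/log/cron (RHEL family), /var/log/syslog (Debian family), or the journal to see whether the job fired. I verify PATH inside cron, which is minimal, and use full paths in cron commands.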
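A job that works in a terminal but not in cron can often be reproduced by running it with a stripped environment. A sketch only: run_like_cron is a hypothetical helper, and the PATH value is an assumption, since the actual default depends on the cron implementation.

```shell
# Run a command string with a minimal, cron-like environment.
# PATH below is an assumed typical cron default; adjust to match your cron.
run_like_cron() {
    env -i HOME="${HOME:-/}" SHELL=/bin/sh PATH=/usr/bin:/bin /bin/sh -c "$1"
}
```

Usage: run_like_cron '/opt/scripts/backup.sh', where the script path is a placeholder. A "command not found" error here usually explains the cron failure.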
Q15: How do you quickly inspect what changed recently on a system?
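I check logins with last and lastlog, shell history, and recently modified files with ls -lt or find -mtime. I diff config directories against a backup or version control. I check recent package updates in the package manager logs, for example /var/log/dpkg.log on Debian-based systems or dnf history on RHEL-based ones.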
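The file-mtime check scales better with find than with ls -lt. A minimal sketch, assuming a find that supports -mmin (GNU, BSD, and BusyBox do); recent_changes is a hypothetical name.

```shell
# List regular files under a directory modified within the last N minutes.
# Arguments: directory (default /etc), minutes (default 1440, i.e. one day).
recent_changes() {
    find "${1:-/etc}" -xdev -type f -mmin "-${2:-1440}" 2>/dev/null
}
```

Usage: recent_changes /etc 120 lists config files touched in the last two hours.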