CloudWatch Agent on EC2: Interview Deep Dive
Q1: What is the CloudWatch Agent and why do we use it?
CloudWatch Agent is an installable agent that collects OS-level metrics and logs from EC2 and sends them to CloudWatch.
Default EC2 metrics (AWS/EC2 namespace) only include:
- CPU utilization
- network in/out
- disk I/O operations and bytes (no filesystem usage %)
- status checks
CloudWatch Agent adds:
- memory usage
- disk usage %
- swap
- process metrics
- custom metrics
- application logs
Interview line:
Default EC2 metrics are hypervisor-level; CloudWatch Agent gives OS-level and application-level visibility.
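Why memory is missing from the default set: the hypervisor sees CPU and network, but memory accounting lives inside the guest OS, so only an in-guest process can read it. A minimal Python sketch of that idea, parsing a hypothetical /proc/meminfo sample. The formula (MemTotal minus MemAvailable) is one common definition and is an assumption, not necessarily what the agent computes internally.

```python
# Hypothetical /proc/meminfo excerpt (values in kB). On a real Linux guest,
# read the file itself with open("/proc/meminfo").read().
SAMPLE_MEMINFO = """\
MemTotal:        8000000 kB
MemFree:         1000000 kB
MemAvailable:    2000000 kB
Buffers:          100000 kB
Cached:           900000 kB
"""


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse 'Key:   value kB' lines into {key: value_in_kB}."""
    values: dict[str, int] = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields:
            values[key.strip()] = int(fields[0])
    return values


def mem_used_percent(text: str) -> float:
    """Used memory as (MemTotal - MemAvailable) / MemTotal * 100 (one common definition)."""
    info = parse_meminfo(text)
    return (info["MemTotal"] - info["MemAvailable"]) / info["MemTotal"] * 100.0


print(f"mem_used_percent: {mem_used_percent(SAMPLE_MEMINFO):.1f}")  # prints mem_used_percent: 75.0
```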
Q2: Real production use cases
Real uses you should mention:
- memory alerts (not available by default)
- disk space alerts
- log shipping to CloudWatch Logs
- application log centralization
- custom metrics (queue depth, process count)
- compliance log retention
- audit trail ingestion
In fintech, audit trails and log retention make this very common.
Q3: IAM Role Design for CloudWatch Agent
Never put access keys on EC2. Always use an IAM role attached through an instance profile.
Minimum Required Permissions Pattern
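IAM role attached to EC2, with permissions for:
- cloudwatch:PutMetricData
- logs:CreateLogStream
- logs:PutLogEvents
- logs:DescribeLogGroups
- logs:CreateLogGroup (only if the agent creates log groups itself)
- ssm:GetParameter (only if the config is stored in Parameter Store)
Usually use the AWS managed policy:
CloudWatchAgentServerPolicy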
Interview Line
I always attach an instance role with least-privilege CloudWatch permissions instead of embedding credentials.
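A sketch of the role's trust policy and a tighter custom permissions policy as Python dicts, ready for json.dumps. The account ID, region, the AmazonCloudWatch-* parameter prefix, and the CWAgent namespace condition are assumptions; the actions mirror those in CloudWatchAgentServerPolicy, so compare against the current managed policy before relying on this.

```python
import json


def agent_role_policies(region: str, account_id: str) -> tuple[dict, dict]:
    """Return (trust_policy, permissions_policy) for a hypothetical agent role."""
    # Trust policy: only EC2 may assume the role (delivered via the instance profile).
    trust = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "ec2.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }
    logs_arn = f"arn:aws:logs:{region}:{account_id}:log-group:*"
    ssm_arn = f"arn:aws:ssm:{region}:{account_id}:parameter/AmazonCloudWatch-*"
    permissions = {
        "Version": "2012-10-17",
        "Statement": [
            {
                # PutMetricData has no resource ARN; scope it by namespace instead
                # (CWAgent is the agent's default namespace, assumed unchanged here).
                "Effect": "Allow",
                "Action": "cloudwatch:PutMetricData",
                "Resource": "*",
                "Condition": {"StringEquals": {"cloudwatch:namespace": "CWAgent"}},
            },
            {
                # Log shipping. CreateLogGroup / PutRetentionPolicy are only needed
                # if the agent creates groups or sets retention_in_days.
                "Effect": "Allow",
                "Action": [
                    "logs:CreateLogGroup",
                    "logs:CreateLogStream",
                    "logs:PutLogEvents",
                    "logs:DescribeLogGroups",
                    "logs:DescribeLogStreams",
                    "logs:PutRetentionPolicy",
                ],
                "Resource": logs_arn,
            },
            {
                # Tag lookups for dimensions such as AutoScalingGroupName
                # (assumption based on the managed policy including it).
                "Effect": "Allow",
                "Action": "ec2:DescribeTags",
                "Resource": "*",
            },
            {
                # Read the agent config from Parameter Store.
                "Effect": "Allow",
                "Action": "ssm:GetParameter",
                "Resource": ssm_arn,
            },
        ],
    }
    return trust, permissions


trust, perms = agent_role_policies("us-east-1", "123456789012")
print(json.dumps(perms, indent=2))
```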
Q4: Installation Automation Pattern (systemd + bootstrap)
Interviewers want to hear automation, not manual steps.
Production Install Pattern
Done via:
- user-data script
- AMI baking
- config management (Ansible)
- launch template bootstrap
Example flow (describe it; don't paste commands in the interview)
Bootstrap script:
- install CloudWatch agent package
- fetch config from SSM Parameter Store
- write config file
- start agent service
- enable systemd auto-start
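In practice the bootstrap is usually a bash user-data script; this Python sketch only makes the step order explicit. It assumes Amazon Linux 2023 (dnf package amazon-cloudwatch-agent) and a hypothetical SSM parameter named AmazonCloudWatch-linux. The ctl path and fetch-config flags follow the agent's documentation: -c ssm:<name> fetches the config, and -s starts the agent once the config is written.

```python
import shlex
import subprocess

CTL = "/opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl"


def bootstrap_commands(ssm_param: str) -> list[list[str]]:
    """Ordered bootstrap steps (assumes Amazon Linux 2023 with dnf)."""
    return [
        # Install the agent package from the distro repo.
        ["dnf", "install", "-y", "amazon-cloudwatch-agent"],
        # Fetch config from SSM, write it locally, and start the agent (-s).
        [CTL, "-a", "fetch-config", "-m", "ec2", "-s", "-c", f"ssm:{ssm_param}"],
        # Make systemd start the agent on every boot.
        ["systemctl", "enable", "amazon-cloudwatch-agent"],
    ]


def run_bootstrap(ssm_param: str, dry_run: bool = True) -> None:
    """Print the steps (dry run) or execute them, stopping at the first failure."""
    for cmd in bootstrap_commands(ssm_param):
        if dry_run:
            print("would run:", shlex.join(cmd))
        else:
            subprocess.run(cmd, check=True)  # raises CalledProcessError on failure


run_bootstrap("AmazonCloudWatch-linux")  # hypothetical parameter name; dry run only
```

dry_run defaults to True so the sketch only prints; check=True makes a real run fail fast instead of leaving a half-configured instance.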
Q5: CloudWatch Agent Config Design
The agent uses a JSON config file.
Config defines:
- metrics to collect
- collection interval
- log file paths
- log group names
- dimensions (instance-id, ASG name)
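A config covering those fields, built as a Python dict and dumped to the agent's JSON format. The service name, file path, and log group are hypothetical. The ${aws:...} and {instance_id} strings are placeholders the agent resolves at runtime, not Python formatting.

```python
import json


def build_agent_config(service: str, env: str, interval_s: int = 60) -> dict:
    """Sketch of a CloudWatch agent config (service/env names are hypothetical)."""
    return {
        "agent": {"metrics_collection_interval": interval_s},
        "metrics": {
            "namespace": "CWAgent",
            # Literal placeholders; the agent substitutes real values.
            "append_dimensions": {
                "InstanceId": "${aws:InstanceId}",
                "AutoScalingGroupName": "${aws:AutoScalingGroupName}",
            },
            # Also publish a fleet-wide rollup keyed only by ASG name.
            "aggregation_dimensions": [["AutoScalingGroupName"]],
            "metrics_collected": {
                "mem": {"measurement": ["mem_used_percent"]},
                "swap": {"measurement": ["swap_used_percent"]},
                "disk": {"measurement": ["used_percent"], "resources": ["/"]},
            },
        },
        "logs": {
            "logs_collected": {
                "files": {
                    "collect_list": [
                        {
                            "file_path": f"/var/log/{service}/app.log",
                            "log_group_name": f"/app/{service}/{env}",
                            "log_stream_name": "{instance_id}",
                        }
                    ]
                }
            }
        },
    }


print(json.dumps(build_agent_config("payments", "prod"), indent=2))
```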
Production Pattern
Store config in:
- SSM Parameter Store
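Then the bootstrap fetches it at launch (amazon-cloudwatch-agent-ctl -a fetch-config -c ssm:<parameter-name>); config changes are rolled out by re-running fetch-config.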
Why this is good:
- centralized config
- change without AMI rebuild
- versionable
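Agent configs grow quickly, and a Standard-tier parameter holds at most 4 KB (Advanced allows 8 KB). A sketch that builds boto3 ssm.put_parameter arguments and picks the tier by size; Overwrite=True creates a new parameter version, which is what makes the config versionable. The boto3 call sits in a separate function so the builder runs without AWS access; the parameter name is hypothetical.

```python
import json

STANDARD_TIER_LIMIT = 4096  # bytes; Standard parameters hold up to 4 KB


def put_parameter_kwargs(name: str, config: dict) -> dict:
    """Build ssm.put_parameter arguments, choosing the tier by value size."""
    value = json.dumps(config, separators=(",", ":"))  # compact JSON saves space
    size = len(value.encode("utf-8"))
    return {
        "Name": name,
        "Value": value,
        "Type": "String",
        "Overwrite": True,  # each overwrite creates a new parameter version
        "Tier": "Standard" if size <= STANDARD_TIER_LIMIT else "Advanced",
    }


def publish_config(name: str, config: dict) -> int:
    """Upload the config and return the new version (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    response = boto3.client("ssm").put_parameter(**put_parameter_kwargs(name, config))
    return response["Version"]
```

One caveat: Parameter Store does not let an Advanced parameter go back to Standard, so once a config has crossed 4 KB, keep passing Tier="Advanced" even if it later shrinks.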
Q6: systemd Service Automation
The CloudWatch Agent runs as a systemd service (amazon-cloudwatch-agent).
systemd Design Pattern
Enable auto-start:
- start on boot
- restart on failure
- ordered after network-online.target (After= plus Wants=)
Interview phrase
I run the CloudWatch agent as a systemd-managed service with a restart policy, so metric collection survives reboots and transient failures.
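A systemd drop-in is one way to add the restart policy and network ordering without editing the packaged unit file. This sketch assumes the unit is named amazon-cloudwatch-agent.service; RestartSec=10 is an arbitrary choice. The file is written under a configurable root so it can be tried in a temporary directory first.

```python
import tempfile
from pathlib import Path

# Drop-in location for the packaged unit (assumed name: amazon-cloudwatch-agent.service).
DROPIN_REL = "etc/systemd/system/amazon-cloudwatch-agent.service.d/override.conf"

# Order after network-online and restart on failure; RestartSec=10 is arbitrary.
OVERRIDE = """\
[Unit]
Wants=network-online.target
After=network-online.target

[Service]
Restart=on-failure
RestartSec=10
"""


def install_override(root: Path = Path("/")) -> Path:
    """Write the drop-in below `root`; use a temp dir to try it without touching /etc."""
    target = root / DROPIN_REL
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(OVERRIDE)
    return target


# Dry run in a throwaway directory. On a real host, run as root with the default
# root, then: systemctl daemon-reload && systemctl enable --now amazon-cloudwatch-agent
with tempfile.TemporaryDirectory() as tmp:
    demo_text = install_override(Path(tmp)).read_text()
```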
Q7: Auto Scaling Group Pattern
Critical for interviews.
Fleet Pattern
For ASG:
- launch template includes the IAM instance profile
- user-data installs agent
- config pulled from SSM
- service enabled via systemd
Result: every new node publishes metrics and logs automatically.
Senior Line
Monitoring bootstrap is part of the instance launch template, not a manual post-launch step.
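For the fleet pattern, a sketch of boto3 ec2.create_launch_template arguments that carry both the instance profile and the bootstrap. The template name, AMI ID, profile name, and SSM parameter name are hypothetical, and the user data assumes Amazon Linux 2023. Unlike run_instances, launch template user data must be base64-encoded by the caller.

```python
import base64

# Hypothetical bash bootstrap embedded in the launch template.
USER_DATA = """#!/bin/bash
set -euo pipefail
dnf install -y amazon-cloudwatch-agent
/opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \\
  -a fetch-config -m ec2 -s -c ssm:AmazonCloudWatch-linux
systemctl enable amazon-cloudwatch-agent
"""


def launch_template_kwargs(name: str, ami_id: str, instance_profile: str) -> dict:
    """Arguments for ec2.create_launch_template (all names hypothetical)."""
    return {
        "LaunchTemplateName": name,
        "LaunchTemplateData": {
            "ImageId": ami_id,
            # The IAM role reaches each instance through its instance profile.
            "IamInstanceProfile": {"Name": instance_profile},
            # Launch template user data must be base64-encoded.
            "UserData": base64.b64encode(USER_DATA.encode()).decode(),
        },
    }
```

The ASG then references the template (for example LaunchTemplate={"LaunchTemplateName": ..., "Version": "$Latest"}), so every scale-out runs the same bootstrap.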
Q8: Log Collection Pattern
Typical Logs Collected
- /var/log/messages
- /var/log/secure
- nginx logs
- app logs
- audit logs
Log Group Strategy
Log group naming pattern:
/app/<service>/<env>
/os/<role>/<env>
Set retention explicitly rather than leaving the default (never expire).
In fintech, retention policy matters.
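A small helper that enforces the naming pattern and sets retention per file in the agent config. retention_in_days is an agent config field; its value must be one CloudWatch Logs accepts (365 is one), and the role then also needs logs:PutRetentionPolicy. The file paths and the 365-day value are illustrative, not a compliance recommendation.

```python
def log_group_name(kind: str, name: str, env: str) -> str:
    """Build '/app/<service>/<env>' or '/os/<role>/<env>'."""
    if kind not in ("app", "os"):
        raise ValueError(f"unknown log group kind: {kind!r}")
    return f"/{kind}/{name}/{env}"


def collect_entry(file_path: str, group: str, retention_days: int) -> dict:
    """One agent collect_list entry with explicit retention."""
    return {
        "file_path": file_path,
        "log_group_name": group,
        "log_stream_name": "{instance_id}",  # resolved by the agent, one stream per instance
        "retention_in_days": retention_days,  # must be a value CloudWatch Logs accepts
    }


entries = [
    collect_entry("/var/log/nginx/access.log", log_group_name("app", "nginx", "prod"), 365),
    collect_entry("/var/log/secure", log_group_name("os", "web", "prod"), 365),
]
```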
Q9: Custom Metrics Example (Good Interview Add)
Example:
Collect the process count for a critical service, or memory usage per application.
The agent also accepts StatsD and collectd input (collectd itself must be installed separately).
Mention this = bonus points.
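Pushing an application metric such as queue depth through the agent's StatsD listener. This assumes the agent config enables statsd under metrics_collected with the default service_address of :8125; the metric name is hypothetical. StatsD is fire-and-forget UDP, so a missing listener does not break the application.

```python
import socket


def statsd_gauge(name: str, value: float) -> bytes:
    """Format a StatsD gauge line: '<name>:<value>|g'."""
    return f"{name}:{value}|g".encode("ascii")


def send_gauge(name: str, value: float, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send one gauge over UDP to the agent's StatsD listener (port 8125 assumed)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(statsd_gauge(name, value), (host, port))


# Hypothetical usage from the application:
# send_gauge("orders_queue_depth", 42)
```

For process counts specifically, the agent's procstat plugin can report them from config alone, without application code.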
Q10: Alerting Pattern
Metrics → CloudWatch alarms → SNS → PagerDuty/Slack.
Example alerts:
- memory > 85%
- disk > 80%
- error count from a log metric filter (e.g. lines matching ERROR)
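Sketches of boto3 arguments for two of the alerts above: a memory alarm on the agent's mem_used_percent metric and a metric filter that counts ERROR lines. The topic ARN, names, and the App namespace are hypothetical. The alarm assumes the agent publishes mem_used_percent with only an InstanceId dimension; if append_dimensions adds more (for example AutoScalingGroupName), the alarm must list all of them or it will match no data.

```python
def memory_alarm_kwargs(instance_id: str, topic_arn: str, threshold: float = 85.0) -> dict:
    """Arguments for cloudwatch.put_metric_alarm on the agent's memory metric."""
    return {
        "AlarmName": f"mem-high-{instance_id}",
        "Namespace": "CWAgent",
        "MetricName": "mem_used_percent",
        # Must match the full dimension set the agent publishes (assumed: InstanceId only).
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,
        "EvaluationPeriods": 3,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],  # SNS topic fans out to PagerDuty/Slack
    }


def error_filter_kwargs(log_group: str) -> dict:
    """Arguments for logs.put_metric_filter counting log events containing ERROR."""
    return {
        "logGroupName": log_group,
        "filterName": "error-count",
        "filterPattern": "ERROR",
        "metricTransformations": [{
            "metricName": "AppErrorCount",
            "metricNamespace": "App",
            "metricValue": "1",  # each matching event adds 1
            "defaultValue": 0.0,  # publish 0 when nothing matches
        }],
    }
```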
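Q11: Common Failure Cases (Interview Gold)
Mention 3 or 4 of these; it shows hands-on experience:
- IAM role missing permissions → access-denied errors in the agent log
- config JSON invalid → agent won't start
- log group doesn't exist and the role lacks logs:CreateLogGroup → ingestion fails
- network egress blocked (no NAT or VPC endpoints for CloudWatch and CloudWatch Logs) → no metrics sent
- systemd service not enabled → agent not running after reboot
- wrong region in config → metrics land in a different region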
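Some of these failures can be caught before a config is deployed. A minimal preflight sketch covering invalid JSON, an unexpected agent.region, and log paths that match no files; the expected region is an input you supply. IAM, egress, and systemd problems only show up at runtime, in the agent log or via amazon-cloudwatch-agent-ctl -a status and systemctl is-enabled amazon-cloudwatch-agent.

```python
import glob
import json

# Agent log on Linux; check it first when metrics or logs go missing.
AGENT_LOG = "/opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log"


def preflight(config_text: str, expected_region: str = "") -> list[str]:
    """Static checks for a few of the failure cases above (not exhaustive)."""
    try:
        config = json.loads(config_text)
    except json.JSONDecodeError as exc:
        # Invalid JSON: the agent would refuse this config.
        return [f"invalid JSON: {exc.msg} (line {exc.lineno})"]

    problems: list[str] = []

    # An explicit agent.region overrides the instance's own region.
    region = config.get("agent", {}).get("region", "")
    if expected_region and region and region != expected_region:
        problems.append(f"agent.region is {region!r}, expected {expected_region!r}")

    # Paths (globs allowed) that match nothing ship no log events.
    collect_list = (
        config.get("logs", {}).get("logs_collected", {}).get("files", {}).get("collect_list", [])
    )
    for entry in collect_list:
        path = entry.get("file_path", "")
        if not glob.glob(path):
            problems.append(f"no file matches {path!r}")
    return problems
```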
Q12: CloudWatch Agent vs Fluent Bit vs Prometheus Node Exporter
Interview comparison:
- CloudWatch Agent → AWS-native metrics and logs
- Node Exporter → Prometheus metrics
- Fluent Bit → log forwarding
They are often used together; they are not mutually exclusive.
Strong Interview Summary Answer
If the interviewer asks an open-ended question:
I install the CloudWatch Agent on EC2 through launch template bootstrap and manage it as a systemd service. The agent config is stored in SSM Parameter Store and fetched at launch. Instances use an IAM role with least-privilege CloudWatch permissions. This setup collects OS metrics and logs centrally and scales automatically with ASGs. Alerts are built on agent metrics such as memory and disk usage, which default EC2 metrics don't provide.