
📊 CloudWatch Agent on EC2: Interview Deep Dive


✅ Q1: What is CloudWatch Agent and why do we use it?

CloudWatch Agent is an installable agent that collects OS-level metrics and logs from EC2 and sends them to CloudWatch.

Default EC2 metrics only include:

  • CPU
  • network
  • disk ops (basic)

CloudWatch Agent adds:

  • memory usage
  • disk usage %
  • swap
  • process metrics
  • custom metrics
  • application logs

Interview line:

Default EC2 metrics are hypervisor-level; CloudWatch Agent gives OS-level and application-level visibility.


✅ Q2: Real production use cases

Real uses you should mention:

  • memory alerts (not available by default)
  • disk space alerts
  • log shipping to CloudWatch Logs
  • application log centralization
  • custom metrics (queue depth, process count)
  • compliance log retention
  • audit trail ingestion

Fintech → audit + log retention → very common.


πŸ” Q3 β€” IAM Role Design for CloudWatch Agent

Never use access keys on EC2. Always use instance profile role.


βœ… Minimum Required Permissions Pattern

IAM role attached to EC2:

Permissions for:

  • PutMetricData
  • CreateLogStream
  • PutLogEvents
  • DescribeLogGroups

Usually use managed policy:

CloudWatchAgentServerPolicy
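
As a rough illustration of the permission list above, here is a minimal Python sketch that builds a custom policy document from those four actions. The function name `build_agent_policy`, the `Sid` labels, and the log group ARN pattern are assumptions made for the example. This is not the content of CloudWatchAgentServerPolicy, which grants a broader set of actions.

```python
import json


def build_agent_policy(log_group_arn_pattern: str) -> dict:
    """Build a least-privilege policy from the four actions listed above.

    The resource scoping is an illustrative assumption; adjust it to the
    log groups your instances actually write to.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "PutMetrics",
                "Effect": "Allow",
                "Action": ["cloudwatch:PutMetricData"],
                # PutMetricData cannot be scoped to a resource ARN.
                "Resource": "*",
            },
            {
                "Sid": "WriteLogs",
                "Effect": "Allow",
                "Action": ["logs:CreateLogStream", "logs:PutLogEvents"],
                "Resource": log_group_arn_pattern,
            },
            {
                "Sid": "DescribeLogGroups",
                "Effect": "Allow",
                "Action": ["logs:DescribeLogGroups"],
                "Resource": "*",
            },
        ],
    }


# Hypothetical ARN pattern covering log groups under /app/.
policy = build_agent_policy("arn:aws:logs:*:*:log-group:/app/*")
print(json.dumps(policy, indent=2))
```

The resulting JSON would be attached to the role behind the instance profile. If the agent is also expected to create log groups or read its config from SSM, the role needs more than these four actions.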


🧠 Interview Line

I always attach an instance role with least-privilege CloudWatch permissions instead of embedding credentials.


βš™οΈ Q4 β€” Installation Automation Pattern (systemd + bootstrap)

Interviewers like automation β€” not manual steps.


βœ… Production Install Pattern

Done via:

  • user-data script
  • AMI baking
  • config management (Ansible)
  • launch template bootstrap

Example flow (describe it; don't paste commands in an interview)

Bootstrap script:

  • install CloudWatch agent package
  • fetch config from SSM Parameter Store
  • write config file
  • start agent service
  • enable systemd auto-start
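
For practice outside the interview, the bootstrap steps above can be sketched as a small Python helper that renders a user-data script. It assumes an Amazon Linux style host where the agent is installable with `yum`, and a hypothetical parameter name `AmazonCloudWatch-linux-config`. The helper name `render_user_data` is made up for the example.

```python
def render_user_data(ssm_parameter_name: str) -> str:
    """Render a bootstrap script for the flow described above.

    Assumes yum and the package name amazon-cloudwatch-agent are available
    on the image; other distributions need a different install step.
    """
    lines = [
        "#!/bin/bash",
        "set -euo pipefail",
        "# 1. Install the agent package",
        "yum install -y amazon-cloudwatch-agent",
        "# 2-4. Fetch the config from SSM, write it locally, start the agent (-s)",
        "/opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \\",
        f"  -a fetch-config -m ec2 -s -c ssm:{ssm_parameter_name}",
        "# 5. Start the agent automatically on every boot",
        "systemctl enable amazon-cloudwatch-agent",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical parameter name, used only for illustration.
script = render_user_data("AmazonCloudWatch-linux-config")
print(script)
```

Reading the config with `ssm:` requires the instance role to allow `ssm:GetParameter` on that parameter.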

🧩 Q5: CloudWatch Agent Config Design

The agent uses a JSON config file.

Config defines:

  • metrics to collect
  • interval
  • log file paths
  • log group names
  • dimensions (instance-id, ASG name)
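
To make the config shape concrete, here is a minimal sketch that builds such a config as a Python dict and serializes it to JSON. The function name `build_agent_config`, the collected file, and the log group name are assumptions for the example. A real config would list whatever metrics and logs your workload needs.

```python
import json


def build_agent_config(log_group: str, interval_seconds: int = 60) -> dict:
    """Build a small agent config covering the items listed above."""
    return {
        # Collection interval, in seconds, for all metrics.
        "agent": {"metrics_collection_interval": interval_seconds},
        "metrics": {
            # Dimensions added to every metric the agent publishes.
            "append_dimensions": {
                "InstanceId": "${aws:InstanceId}",
                "AutoScalingGroupName": "${aws:AutoScalingGroupName}",
            },
            "metrics_collected": {
                "mem": {"measurement": ["mem_used_percent"]},
                "disk": {"measurement": ["used_percent"], "resources": ["/"]},
                "swap": {"measurement": ["swap_used_percent"]},
            },
        },
        "logs": {
            "logs_collected": {
                "files": {
                    "collect_list": [
                        {
                            "file_path": "/var/log/messages",
                            "log_group_name": log_group,
                            "log_stream_name": "{instance_id}",
                        }
                    ]
                }
            }
        },
    }


# Hypothetical log group name following an /os/<role>/<env> convention.
config = build_agent_config("/os/web/prod")
print(json.dumps(config, indent=2))
```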

✅ Production Pattern

Store config in:

  • SSM Parameter Store

The agent then pulls the config at startup.

Why this is good:

  • centralized config
  • change without AMI rebuild
  • versionable

πŸ” Q6 β€” systemd Service Automation

CloudWatch Agent runs as systemd service.


βœ… systemd Design Pattern

Enable auto-start:

  • start on boot
  • restart on failure
  • dependency after network-online
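
One way to express these three properties is a systemd drop-in for the agent's unit, `amazon-cloudwatch-agent.service`. The sketch below renders such a drop-in from Python. The helper name `render_dropin`, the 10-second restart delay, and the override path in the comment are assumptions. The unit shipped with the agent package may already set some of these directives.

```python
def render_dropin(restart_sec: int = 10) -> str:
    """Render a drop-in adding restart and network ordering to the unit.

    Intended location (an assumption for this sketch):
    /etc/systemd/system/amazon-cloudwatch-agent.service.d/override.conf
    """
    return (
        "[Unit]\n"
        # Order the agent after the network is reported as online.
        "Wants=network-online.target\n"
        "After=network-online.target\n"
        "\n"
        "[Service]\n"
        # Restart the agent if it exits with an error.
        "Restart=on-failure\n"
        f"RestartSec={restart_sec}\n"
    )


dropin = render_dropin()
print(dropin)
```

After writing the file, `systemctl daemon-reload` followed by `systemctl enable --now amazon-cloudwatch-agent` would cover the start-on-boot part.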

Interview phrase

I enable the CloudWatch agent as a systemd-managed service with a restart policy so metric collection survives reboots and transient failures.


📦 Q7: Auto Scaling Group Pattern

Critical for interviews.


✅ Fleet Pattern

For ASG:

  • launch template includes IAM role
  • user-data installs agent
  • config pulled from SSM
  • service enabled via systemd

Result: Every new node auto-registers metrics/logs.
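
A minimal sketch of how the role and the bootstrap script come together in the launch template data. The helper name `build_launch_template_data` and all argument values are placeholders. The dict mirrors the `LaunchTemplateData` structure of the EC2 `CreateLaunchTemplate` API, where user data must be base64-encoded.

```python
import base64


def build_launch_template_data(
    image_id: str,
    instance_type: str,
    instance_profile_name: str,
    user_data_script: str,
) -> dict:
    """Combine the IAM instance profile and bootstrap script for an ASG."""
    encoded = base64.b64encode(user_data_script.encode("utf-8")).decode("ascii")
    return {
        "ImageId": image_id,
        "InstanceType": instance_type,
        # The instance profile wraps the role with CloudWatch permissions.
        "IamInstanceProfile": {"Name": instance_profile_name},
        # Launch templates expect base64-encoded user data.
        "UserData": encoded,
    }


# All values below are placeholders for illustration.
template_data = build_launch_template_data(
    image_id="ami-EXAMPLE",
    instance_type="t3.micro",
    instance_profile_name="cw-agent-instance-profile",
    user_data_script="#!/bin/bash\necho bootstrap\n",
)
print(template_data)
```

The dict could be passed as `LaunchTemplateData` when creating the template, for example through boto3's `create_launch_template`.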


🧠 Senior Line

Monitoring bootstrap is part of the instance launch template, not a manual post-step.


🪵 Q8: Log Collection Pattern


✅ Typical Logs Collected

  • /var/log/messages
  • /var/log/secure
  • nginx logs
  • app logs
  • audit logs

✅ Log Group Strategy

Log group naming pattern:

/app/<service>/<env>
/os/<role>/<env>

Set retention explicitly; by default, log events never expire.

Fintech → retention policy matters.
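
The naming pattern and explicit retention can be sketched as a helper that produces entries for the agent's `collect_list`. The function names, the `payments` service, and the 90-day retention are assumptions for the example. `retention_in_days` must be one of the values CloudWatch Logs accepts, and setting it from the agent is assumed to require the `logs:PutRetentionPolicy` permission on the role.

```python
def log_group_name(category: str, name: str, env: str) -> str:
    """Build /app/<service>/<env> or /os/<role>/<env> style names."""
    if category not in ("app", "os"):
        raise ValueError(f"unknown category: {category!r}")
    return f"/{category}/{name}/{env}"


def collect_entry(
    file_path: str, category: str, name: str, env: str, retention_days: int
) -> dict:
    """Build one collect_list entry with an explicit retention period."""
    return {
        "file_path": file_path,
        "log_group_name": log_group_name(category, name, env),
        "log_stream_name": "{instance_id}",
        "retention_in_days": retention_days,
    }


# Hypothetical service, file path, and retention period.
entry = collect_entry("/var/log/payments/app.log", "app", "payments", "prod", 90)
print(entry)
```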


📈 Q9: Custom Metrics Example (Good Interview Add)

Example:

Collect process count for critical service or memory usage by app.

Agent supports StatsD and collectd input too.
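
As an illustration of the StatsD path, here is a minimal sketch that formats a gauge in the StatsD line protocol and sends it over UDP. It assumes the agent's config enables the `statsd` section and listens on the conventional port 8125 on localhost. The metric name `queue.depth` and both function names are made up for the example.

```python
import socket


def statsd_gauge(name: str, value: float) -> bytes:
    """Format a gauge in the StatsD line protocol: <name>:<value>|g"""
    return f"{name}:{value}|g".encode("ascii")


def send_metric(payload: bytes, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send one StatsD datagram; UDP is fire-and-forget, so no reply is read."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


# Hypothetical application metric.
payload = statsd_gauge("queue.depth", 42)

if __name__ == "__main__":
    send_metric(payload)
```

Because the application only speaks StatsD, it needs no AWS credentials or SDK; the agent publishes the metric using the instance role.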

Mention this = bonus points.


🚨 Q10: Alerting Pattern

Metrics → CloudWatch alarms → SNS → PagerDuty/Slack.

Example alerts:

  • memory > 85%
  • disk > 80%
  • log error pattern metric filter
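
The memory alert above could be sketched as a set of alarm parameters. The dict uses the parameter names of the CloudWatch `PutMetricAlarm` API and assumes the agent publishes `mem_used_percent` in its default `CWAgent` namespace with only an `InstanceId` dimension. The alarm name, period, and topic ARN are placeholders.

```python
def build_memory_alarm(
    instance_id: str, sns_topic_arn: str, threshold_percent: float = 85.0
) -> dict:
    """Build PutMetricAlarm parameters for a memory > threshold alert."""
    return {
        "AlarmName": f"high-memory-{instance_id}",
        "Namespace": "CWAgent",
        "MetricName": "mem_used_percent",
        # Dimensions must match exactly what the agent publishes.
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,
        "EvaluationPeriods": 2,
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanThreshold",
        # The SNS topic fans out to PagerDuty or Slack.
        "AlarmActions": [sns_topic_arn],
        # Treat silence as a problem: a dead agent stops reporting.
        "TreatMissingData": "breaching",
    }


# Placeholder instance ID and topic ARN.
alarm = build_memory_alarm(
    "i-0123456789abcdef0", "arn:aws:sns:us-east-1:111122223333:ops-alerts"
)
print(alarm)
```

With boto3 available, the dict could be passed as `boto3.client("cloudwatch").put_metric_alarm(**alarm)`. If the agent appends further dimensions such as the ASG name, the alarm's dimension list has to include them too.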

🧨 Q11: Common Failure Cases (Interview Gold)

Mention 3-4 of these; it shows hands-on experience:

  • IAM role missing permissions → agent errors
  • config JSON invalid → agent won't start
  • log group missing and role not allowed to create it → ingestion fails
  • network egress blocked → no metrics sent
  • systemd service not enabled → lost after reboot
  • wrong region config
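
The invalid-JSON case is cheap to catch before a config ever reaches an instance. Below is a minimal pre-flight check, for example as a CI step before the config is uploaded. The function name `check_agent_config` and the specific checks are assumptions for the sketch; it is a heuristic, not a substitute for the agent's own validation.

```python
import json
from typing import List


def check_agent_config(text: str) -> List[str]:
    """Return a list of problems found in an agent config, empty if none."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg} at line {exc.lineno}"]
    if not isinstance(config, dict):
        return ["top level must be a JSON object"]

    problems = []
    # Heuristic: a config that collects neither metrics nor logs is suspect.
    if "metrics" not in config and "logs" not in config:
        problems.append("config has neither a 'metrics' nor a 'logs' section")

    files = config.get("logs", {}).get("logs_collected", {}).get("files", {})
    for index, entry in enumerate(files.get("collect_list", [])):
        if "file_path" not in entry:
            problems.append(f"collect_list[{index}] is missing 'file_path'")
    return problems


print(check_agent_config('{"metrics": '))       # truncated JSON
print(check_agent_config('{"agent": {}}'))      # nothing collected
print(check_agent_config('{"metrics": {}}'))    # passes these checks
```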

βš–οΈ Q12 β€” CloudWatch Agent vs Fluent Bit vs Prometheus Node Exporter

Interview comparison:

CloudWatch Agent β†’ AWS native metrics/logs Node Exporter β†’ Prometheus metrics Fluent Bit β†’ log forwarding

Often used together β€” not exclusive.


🧠 Strong Interview Summary Answer

If interviewer asks open-ended:

I install CloudWatch Agent on EC2 using a launch template bootstrap and manage it as a systemd service. The agent config is stored in SSM Parameter Store and fetched at startup. Instances use an IAM role with least-privilege CloudWatch permissions. This setup collects OS metrics and logs centrally and scales automatically with ASGs. Alerts are built on agent-collected metrics like memory and disk usage, which default EC2 metrics don't provide.

