CloudWatch: Monitoring and Observability

Amazon CloudWatch is a monitoring and observability service that provides data and actionable insights for AWS applications, infrastructure, and services. CloudWatch collects monitoring and operational data in the form of logs, metrics, and events, providing a unified view of AWS resources, applications, and services running on AWS and on-premises servers.

What is CloudWatch?

CloudWatch enables you to monitor your AWS resources and applications in near real time. It collects and tracks metrics, monitors log files, evaluates alarms, and can automatically react to changes in your AWS resources. CloudWatch provides a single platform to monitor, troubleshoot, and optimize your applications and infrastructure.

Key Concepts

Metrics

Metrics are data points about the performance of your systems. CloudWatch collects metrics about AWS resources and applications.

Namespace
  • Metric Container: Metrics belong to a namespace (e.g., AWS/EC2, AWS/S3)
  • Organization: Namespaces help organize and identify metrics
  • Custom Namespaces: You can create custom namespaces for your applications
Metric Dimensions
  • Unique Identifier: Dimensions are name-value pairs that uniquely identify a metric
  • Filtering: Use dimensions to filter and query metrics
  • Multiple Dimensions: Metrics can have up to 30 dimensions
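Metric Data Types
  • Data Points: CloudWatch stores each metric as timestamped numeric data points; counters, gauges, and histograms are patterns you model on top of them, not built-in types
  • Counters: Values that can only increase (e.g., request count), typically aggregated with the Sum statistic
  • Gauges: Values that can increase or decrease (e.g., current connections), typically read with Average, Minimum, or Maximum
  • Histograms: Distribution of values over time (e.g., request latency), typically analyzed with percentile statistics such as p99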
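
To make namespaces and dimensions concrete, the following is a minimal sketch of how a custom metric request could be assembled for the PutMetricData API. The namespace MyApp/Checkout, the metric name OrdersPlaced, and the helper functions are hypothetical names chosen for illustration; the sketch only builds the request and does not call AWS.

```python
from datetime import datetime, timezone


def build_metric_datum(name, value, dimensions, unit="Count", when=None):
    """Build one entry of the MetricData list accepted by PutMetricData."""
    return {
        "MetricName": name,
        # Dimensions are name-value pairs; sorting keeps the output stable.
        "Dimensions": [
            {"Name": key, "Value": dimensions[key]} for key in sorted(dimensions)
        ],
        "Timestamp": when or datetime.now(timezone.utc),
        "Value": float(value),
        "Unit": unit,
    }


def build_put_metric_data(namespace, data):
    """Wrap metric data in a request for a custom namespace."""
    if namespace.startswith("AWS/"):
        raise ValueError("namespaces starting with AWS/ are reserved for AWS services")
    return {"Namespace": namespace, "MetricData": list(data)}


request = build_put_metric_data(
    "MyApp/Checkout",  # hypothetical custom namespace
    [
        build_metric_datum(
            "OrdersPlaced",  # hypothetical metric name
            3,
            {"Service": "checkout", "Environment": "prod"},
        )
    ],
)
print(request["MetricData"][0]["Dimensions"])
```

With boto3 installed and credentials configured, the request could be sent with boto3.client("cloudwatch").put_metric_data(**request). Each distinct combination of dimension values is tracked as a separate metric, so keep the set of values small.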

Logs

CloudWatch Logs enables you to centralize logs from all your systems, applications, and AWS services.

Log Groups
  • Container: Log groups are containers for log streams
  • Retention: Set retention policies per log group
  • Encryption: Enable encryption at rest for log groups
Log Streams
  • Instance/Service: Log streams represent sequences of log events from a single source
  • Automatic: Automatically created when logs are sent
  • Sequential: Logs within a stream are ordered by time
Log Events
  • Individual Entries: Log events are individual log entries
  • Timestamp: Each event has a timestamp
  • Message: Log event contains the actual log message
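
The relationship between log groups, log streams, and log events can be sketched as the request shape used by the PutLogEvents API. The group and stream names below are hypothetical, and the helper only builds the request; it does not send anything to AWS.

```python
def build_put_log_events(group, stream, entries):
    """Build a PutLogEvents request from (timestamp_ms, message) pairs."""
    events = [{"timestamp": int(ts), "message": msg} for ts, msg in entries]
    # Events in one request are expected in chronological order.
    events.sort(key=lambda event: event["timestamp"])
    return {"logGroupName": group, "logStreamName": stream, "logEvents": events}


request = build_put_log_events(
    "/myapp/checkout",  # hypothetical log group
    "host-01",          # hypothetical log stream (one source)
    [
        (1700000002000, "order 42 paid"),
        (1700000001000, "order 42 created"),
    ],
)
for event in request["logEvents"]:
    print(event["timestamp"], event["message"])
```

Timestamps are milliseconds since the Unix epoch. With boto3, the request could be passed as boto3.client("logs").put_log_events(**request), assuming the log group and stream already exist.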

Alarms

CloudWatch alarms monitor metrics and send notifications or take actions when thresholds are crossed.

Alarm States
  • OK: Metric is within the defined threshold
  • ALARM: Metric has breached the threshold
  • INSUFFICIENT_DATA: Not enough data to determine the alarm state
Alarm Actions
  • SNS Notifications: Send notifications via SNS topics
  • Auto Scaling: Trigger Auto Scaling actions
  • EC2 Actions: Stop, terminate, reboot, or recover EC2 instances
  • Lambda Functions: Invoke Lambda functions
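
As an illustration of how a threshold, alarm states, and actions fit together, here is a minimal sketch of the parameters for the PutMetricAlarm API. The instance ID, the SNS topic ARN, and the threshold of 80 percent are assumptions for the example, not recommended values.

```python
def build_cpu_alarm(instance_id, topic_arn, threshold=80.0):
    """Build PutMetricAlarm parameters for a high-CPU alarm on one instance."""
    return {
        "AlarmName": f"high-cpu-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,            # seconds per evaluated data point
        "EvaluationPeriods": 3,   # look at the last 3 periods
        "DatapointsToAlarm": 2,   # ALARM when 2 of those 3 breach
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanThreshold",
        # "missing" leaves gaps unclassified, which can lead to INSUFFICIENT_DATA.
        "TreatMissingData": "missing",
        "AlarmActions": [topic_arn],  # run when the state becomes ALARM
        "OKActions": [topic_arn],     # run when the state returns to OK
    }


alarm = build_cpu_alarm(
    "i-0123456789abcdef0",                             # hypothetical instance ID
    "arn:aws:sns:us-east-1:123456789012:ops-alerts",   # hypothetical SNS topic
)
print(alarm["AlarmName"], alarm["ComparisonOperator"], alarm["Threshold"])
```

With boto3, the alarm could be created with boto3.client("cloudwatch").put_metric_alarm(**alarm). Requiring 2 of 3 breaching data points is one way to avoid alerting on a single short spike.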

Monitoring AWS Services

EC2 Monitoring

  • Instance Metrics: CPU utilization, network in/out, disk read/write
  • Status Checks: System status and instance status checks
  • Detailed Monitoring: 1-minute metric granularity instead of the default 5-minute basic monitoring
  • Custom Metrics: Publish custom metrics from your applications

EBS Monitoring

  • Volume Metrics: Read/write operations, throughput, IOPS
  • Snapshot Metrics: Snapshot creation and completion
  • Volume Status: Volume health and performance

RDS Monitoring

  • Database Metrics: CPU utilization, database connections, read/write latency
  • Storage Metrics: Storage space, I/O operations
  • Enhanced Monitoring: Additional OS-level metrics

S3 Monitoring

  • Bucket Metrics: Request metrics, storage metrics, replication metrics
  • Storage Class Analysis: Analyze storage class usage
  • Request Metrics: Track GET, PUT, DELETE requests

Lambda Monitoring

  • Invocation Metrics: Invocations, errors, duration, throttles
  • Concurrent Executions: Track concurrent function executions
  • Dead Letter Queue: Monitor failed invocations
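
The invocation metrics above are often combined into an error rate. The sketch below builds the MetricDataQueries list for the GetMetricData API, using a metric math expression over the Errors and Invocations metrics in the AWS/Lambda namespace. The function name checkout-handler is hypothetical.

```python
def lambda_metric_query(query_id, metric_name, function_name, period=300):
    """Query the Sum of one AWS/Lambda metric for a single function."""
    return {
        "Id": query_id,  # must start with a lowercase letter
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/Lambda",
                "MetricName": metric_name,
                "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
            },
            "Period": period,
            "Stat": "Sum",
        },
        "ReturnData": False,  # only used as input to the expression
    }


def build_error_rate_queries(function_name):
    """Build queries that return the error rate as a percentage."""
    return [
        lambda_metric_query("errors", "Errors", function_name),
        lambda_metric_query("invocations", "Invocations", function_name),
        {
            "Id": "error_rate",
            "Expression": "100 * errors / invocations",
            "Label": "Error rate (%)",
            "ReturnData": True,
        },
    ]


queries = build_error_rate_queries("checkout-handler")  # hypothetical function
print([query["Id"] for query in queries])
```

With boto3, the list could be passed as the MetricDataQueries argument of boto3.client("cloudwatch").get_metric_data, together with StartTime and EndTime.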

CloudWatch Logs Insights

CloudWatch Logs Insights enables you to interactively search and analyze your log data.

Query Language

  • Pipe-Based Syntax: Chain commands such as fields, filter, stats, sort, and limit with the pipe character (|)
  • Time Ranges: Query logs for specific time ranges
  • Aggregations: Aggregate and summarize log data
  • Visualizations: Create visualizations from query results

Example Queries

# Count errors by hour
fields @timestamp, @message
| filter @message like /ERROR/
| stats count() as errorCount by bin(1h)

# Top 10 source IP addresses (assumes the log events contain a sourceIP field)
fields @timestamp, sourceIP
| stats count() as requestCount by sourceIP
| sort requestCount desc
| limit 10
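
Queries like these can also be run programmatically. The following is a minimal sketch, assuming a hypothetical log group named /myapp/checkout, of how the parameters for the StartQuery API could be built. It only prepares the request; running it requires boto3 and AWS credentials.

```python
from datetime import datetime, timedelta, timezone

ERRORS_BY_HOUR = "\n".join(
    [
        "fields @timestamp, @message",
        "| filter @message like /ERROR/",
        "| stats count() as errorCount by bin(1h)",
    ]
)


def build_start_query(log_group, query, end, hours=24):
    """Build StartQuery parameters covering the `hours` before `end`."""
    start = end - timedelta(hours=hours)
    return {
        "logGroupName": log_group,
        # StartQuery expects the time range as epoch seconds.
        "startTime": int(start.timestamp()),
        "endTime": int(end.timestamp()),
        "queryString": query,
    }


params = build_start_query(
    "/myapp/checkout",  # hypothetical log group
    ERRORS_BY_HOUR,
    end=datetime(2024, 1, 2, tzinfo=timezone.utc),
)
print(params["endTime"] - params["startTime"], "seconds queried")
```

With boto3, boto3.client("logs").start_query(**params) returns a queryId, and the results are then polled with get_query_results until the query status is Complete.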

CloudWatch Dashboards

CloudWatch Dashboards are customizable home pages in the CloudWatch console.

Features

  • Visualization: Create visualizations of your metrics
  • Multiple Widgets: Add multiple widgets to a dashboard
  • Automatic Refresh: Dashboards can refresh automatically at a configurable interval
  • Sharing: Share dashboards with team members

Widget Types

  • Line Charts: Time series line charts
  • Number Widgets: Single number displays
  • Stacked Area Charts: Stacked area visualizations
  • Pie Charts: Distribution visualizations
  • Custom Metrics: Any of these widgets can also display custom application metrics
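
Dashboards are defined by a JSON body that lists widgets and their positions. The sketch below builds a body with one line chart and one number widget, under the assumption of a hypothetical instance ID and the us-east-1 region, in the shape accepted by the PutDashboard API.

```python
import json


def metric_widget(title, metric, view, x, y, region="us-east-1"):
    """Build one metric widget; `metric` is [namespace, name, dim_name, dim_value]."""
    return {
        "type": "metric",
        "x": x,
        "y": y,
        "width": 12,   # the dashboard grid is 24 columns wide
        "height": 6,
        "properties": {
            "title": title,
            "metrics": [metric],
            "view": view,  # "timeSeries" for a line chart, "singleValue" for a number
            "stat": "Average",
            "period": 300,
            "region": region,
        },
    }


cpu = ["AWS/EC2", "CPUUtilization", "InstanceId", "i-0123456789abcdef0"]  # hypothetical
body = {
    "widgets": [
        metric_widget("CPU over time", cpu, "timeSeries", x=0, y=0),
        metric_widget("Current CPU", cpu, "singleValue", x=12, y=0),
    ]
}
dashboard_body = json.dumps(body)
print(len(body["widgets"]), "widgets")
```

With boto3, the dashboard could be created with boto3.client("cloudwatch").put_dashboard(DashboardName="ops-overview", DashboardBody=dashboard_body), where ops-overview is a hypothetical name.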

CloudWatch Events (EventBridge)

CloudWatch Events (now part of Amazon EventBridge) delivers a near real-time stream of system events.

Event Sources

  • AWS Services: Events from AWS services
  • Custom Applications: Custom events from your applications
  • Scheduled Events: Cron-like scheduled events
  • Partner Events: Events from AWS partners

Event Targets

  • Lambda Functions: Invoke Lambda functions
  • SNS Topics: Publish to SNS topics
  • SQS Queues: Send messages to SQS queues
  • ECS Tasks: Run ECS tasks
  • Step Functions: Trigger Step Functions state machines
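
To show how sources and targets connect, here is a minimal sketch of the parameters for the EventBridge PutRule and PutTargets APIs: one rule that matches EC2 state changes, one scheduled rule, and a target list. The rule names and the Lambda ARN are hypothetical.

```python
import json


def build_ec2_state_rule(name, states):
    """Rule matching EC2 instances that enter one of the given states."""
    pattern = {
        "source": ["aws.ec2"],
        "detail-type": ["EC2 Instance State-change Notification"],
        "detail": {"state": list(states)},
    }
    return {"Name": name, "EventPattern": json.dumps(pattern), "State": "ENABLED"}


def build_schedule_rule(name, expression):
    """Rule that fires on a rate(...) or cron(...) schedule expression."""
    return {"Name": name, "ScheduleExpression": expression, "State": "ENABLED"}


def build_targets(rule_name, target_arns):
    """Attach each ARN to the rule with a generated target ID."""
    return {
        "Rule": rule_name,
        "Targets": [
            {"Id": f"target-{index}", "Arn": arn}
            for index, arn in enumerate(target_arns, start=1)
        ],
    }


state_rule = build_ec2_state_rule("ec2-stopped", ["stopped", "terminated"])
nightly_rule = build_schedule_rule("nightly-report", "cron(0 2 * * ? *)")
targets = build_targets(
    "ec2-stopped",
    ["arn:aws:lambda:us-east-1:123456789012:function:notify-ops"],  # hypothetical
)
print(state_rule["EventPattern"])
```

With boto3, these could be applied with boto3.client("events").put_rule(**state_rule) followed by put_targets(**targets). A target such as a Lambda function also needs a resource policy that lets EventBridge invoke it, which this sketch does not cover.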

CloudWatch Synthetics

CloudWatch Synthetics creates canaries that monitor your endpoints and APIs.

Canaries

  • Automated Testing: Automated tests for your applications
  • Scheduled Runs: Run canaries on a schedule
  • Real Browsers: Run tests using real browsers
  • Alerting: Get alerts when tests fail

Use Cases

  • API Monitoring: Monitor API endpoints
  • Web Page Monitoring: Monitor web page availability
  • User Journeys: Test complete user workflows

Best Practices

Metrics

  • Custom Metrics: Publish custom metrics for application-specific monitoring
  • Metric Dimensions: Use dimensions effectively for filtering
  • High-Resolution Metrics: Use high-resolution metrics when needed
  • Cost Optimization: Monitor metric costs and use standard resolution when possible

Logs

  • Log Retention: Set appropriate retention policies
  • Log Filtering: Use metric filters to create metrics from logs
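  • Log Compression: CloudWatch Logs compresses stored log data itself, so reduce storage costs by limiting log verbosity rather than compressing logs before ingestion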
  • Log Parsing: Parse structured logs for better analysis
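
Structured logging and metric filters work well together. The sketch below emits JSON log lines and builds the parameters for the PutMetricFilter API so that every event with level ERROR increments a metric. The log group, the namespace, the filter name, and the field names are assumptions for illustration.

```python
import json


def format_log_line(level, message, **fields):
    """Serialize one log record as a single JSON line."""
    record = {"level": level, "message": message, **fields}
    return json.dumps(record, sort_keys=True)


def build_error_metric_filter(log_group, namespace):
    """Build PutMetricFilter parameters that count ERROR-level events."""
    return {
        "logGroupName": log_group,
        "filterName": "error-count",  # hypothetical filter name
        # JSON filter pattern: match events whose "level" field equals ERROR.
        "filterPattern": '{ $.level = "ERROR" }',
        "metricTransformations": [
            {
                "metricName": "ErrorCount",
                "metricNamespace": namespace,
                "metricValue": "1",    # add 1 per matching event
                "defaultValue": 0.0,   # report 0 when nothing matches
            }
        ],
    }


line = format_log_line("ERROR", "payment failed", order_id="A-100")
metric_filter = build_error_metric_filter("/myapp/checkout", "MyApp/Checkout")
print(line)
```

With boto3, the filter could be created with boto3.client("logs").put_metric_filter(**metric_filter). The resulting ErrorCount metric can then back an alarm like any other metric.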

Alarms

  • Thresholds: Set appropriate thresholds for alarms
  • Multiple Alarms: Create alarms for different severity levels
  • Alarm Actions: Configure actions for alarm states
  • Testing: Test alarm actions regularly

Cost Optimization

  • Log Retention: Reduce log retention periods for non-critical logs
  • Metric Resolution: Use standard resolution when high resolution isn't needed
  • Custom Metrics: Monitor costs of custom metrics
  • Log Filtering: Use log filtering to reduce stored log volume
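
Retention periods can only be set to specific numbers of days. The sketch below rounds a desired retention up to the next accepted value before building parameters for the PutRetentionPolicy API. The list of accepted values reflects the AWS documentation as recalled at the time of writing and should be verified against the current API reference; the log group name is hypothetical.

```python
import bisect

# Accepted retentionInDays values (assumption: verify against current AWS docs).
ALLOWED_RETENTION_DAYS = [
    1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545,
    731, 1096, 1827, 2192, 2557, 2922, 3288, 3653,
]


def round_up_retention(days):
    """Return the smallest accepted retention that is >= days."""
    if days < 1:
        raise ValueError("retention must be at least 1 day")
    index = bisect.bisect_left(ALLOWED_RETENTION_DAYS, days)
    if index == len(ALLOWED_RETENTION_DAYS):
        # Larger than every accepted value: fall back to the maximum.
        return ALLOWED_RETENTION_DAYS[-1]
    return ALLOWED_RETENTION_DAYS[index]


def build_retention_policy(log_group, desired_days):
    """Build PutRetentionPolicy parameters for one log group."""
    return {
        "logGroupName": log_group,
        "retentionInDays": round_up_retention(desired_days),
    }


policy = build_retention_policy("/myapp/debug", 10)  # hypothetical log group
print(policy)
```

With boto3, the policy could be applied with boto3.client("logs").put_retention_policy(**policy). Rounding up rather than down avoids deleting data earlier than intended.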

Integration with Other Services

CloudWatch + Auto Scaling

  • Scaling Triggers: Use CloudWatch alarms as scaling triggers
  • Target Tracking: Use target tracking for automatic scaling
  • Predictive Scaling: Use predictive scaling based on metrics
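
As a sketch of target tracking, the parameters below describe an EC2 Auto Scaling policy that tries to hold average CPU utilization near a target value; Auto Scaling then creates and manages the underlying CloudWatch alarms. The group name, the policy name, and the 50 percent target are assumptions for illustration.

```python
def build_target_tracking_policy(group_name, target_cpu=50.0):
    """Build PutScalingPolicy parameters for CPU-based target tracking."""
    if not 0 < target_cpu <= 100:
        raise ValueError("target CPU must be a percentage between 0 and 100")
    return {
        "AutoScalingGroupName": group_name,
        "PolicyName": f"{group_name}-cpu-target",  # hypothetical naming scheme
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingConfiguration": {
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageCPUUtilization"
            },
            "TargetValue": float(target_cpu),
        },
    }


scaling_policy = build_target_tracking_policy("web-asg")  # hypothetical group
print(scaling_policy["PolicyName"])
```

With boto3, the policy could be created with boto3.client("autoscaling").put_scaling_policy(**scaling_policy).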

CloudWatch + Lambda

  • Log Groups: Automatic log groups for Lambda functions
  • Metrics: Automatic metrics for Lambda invocations
  • X-Ray Integration: Trace requests through Lambda functions

CloudWatch + SNS

  • Alarm Notifications: Send alarm notifications via SNS
  • Multiple Recipients: Notify multiple recipients through SNS topics
  • Fan-Out: Fan out notifications to multiple channels

By leveraging CloudWatch's comprehensive monitoring capabilities, you can gain visibility into your AWS resources and applications, troubleshoot issues quickly, and optimize performance. Always refer to AWS documentation for the latest features and monitoring best practices.