Key metrics for AWS monitoring
AWS observability and monitoring are a de facto standard in most organizations. It’s worth considering the difference between the two: monitoring lets you know whether a system is working, while observability lets you understand why it isn’t working.
There are a few important metrics that we should have complete visibility into in order to avoid outages. I will not cover basic CloudWatch metrics such as CPU, memory, and storage, as these are already a default part of our monitoring stack.
This post covers the following services: RDS, NAT Gateway, EC2, ALB, and Lambda, plus a check of the AWS status page.
I would recommend starting with Amazon Managed Grafana and cross-account IAM roles. We can add CloudWatch as a data source and start using it right away.
You can also use Amazon Managed Service for Prometheus for metric aggregation, but you need to remote_write metrics from your own Prometheus servers to it.
AWS managed services for building an observability solution:
1. RDS:
- DiskQueueDepth: The number of outstanding I/Os (read/write requests) waiting to access the disk.
- DatabaseConnections: The number of client network connections to the database instance.
- WriteLatency: The average amount of time taken per disk write I/O operation, in seconds.
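As a rough illustration of how these three metrics could be turned into alarms, here is a minimal sketch that builds the arguments for CloudWatch's `put_metric_alarm` call. The thresholds, the alarm naming scheme, and the instance name `orders-db` are assumptions for illustration only; pick values that fit your own workload. The `create_alarms` helper assumes boto3 is installed and credentials are configured.

```python
# Hypothetical thresholds; tune them against your own baseline.
RDS_THRESHOLDS = {
    "DiskQueueDepth": 10,        # outstanding I/Os
    "DatabaseConnections": 400,  # client connections
    "WriteLatency": 0.05,        # seconds per write I/O
}


def rds_alarm_params(db_instance_id, metric_name, threshold, periods=3):
    """Build keyword arguments for CloudWatch put_metric_alarm."""
    return {
        "AlarmName": f"rds-{db_instance_id}-{metric_name}",
        "Namespace": "AWS/RDS",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": db_instance_id}],
        "Statistic": "Average",
        "Period": 60,
        "EvaluationPeriods": periods,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
    }


def build_rds_alarms(db_instance_id):
    """Return one alarm definition per metric in RDS_THRESHOLDS."""
    return [
        rds_alarm_params(db_instance_id, metric, threshold)
        for metric, threshold in RDS_THRESHOLDS.items()
    ]


def create_alarms(alarms):
    """Send the alarm definitions to CloudWatch (needs boto3 and AWS credentials)."""
    import boto3  # third-party; imported lazily so the builders work without it

    cloudwatch = boto3.client("cloudwatch")
    for alarm in alarms:
        cloudwatch.put_metric_alarm(**alarm)
```

Calling `create_alarms(build_rds_alarms("orders-db"))` would create the three alarms; keeping the builders separate from the API call makes the definitions easy to review before anything is created.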
2. NAT Gateway:
- ErrorPortAllocation: NAT gateways support up to 55,000 simultaneous connections to each unique destination. If this threshold is crossed, new connections to that destination fail and the ErrorPortAllocation metric for the NAT gateway increases in Amazon CloudWatch.
- PacketsDropCount: A healthy NAT gateway will always have a value of zero. A non-zero value indicates an ongoing transient issue with the NAT gateway. If the value is not zero, refer to the AWS Personal Health Dashboard. If there are no notifications on the AWS Personal Health Dashboard, open a case with AWS Support.
Resolution: https://aws.amazon.com/premiumsupport/knowledge-center/cloudwatch-nat-gateway-bandwidth/
- BytesOutToDestination: Monitor this value, as we should not be generating much egress traffic. Unless there is a business justification, we should use internal apt/yum/Artifactory/private repositories for deployments. For access to other public services from the VPC, we should create VPC endpoints so that traffic remains local.
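The NAT gateway checks above can be sketched as a small evaluation function. This is only an illustration: it assumes you have already fetched the `Sum` of each metric over some window (for example with CloudWatch `get_metric_statistics`), and the egress budget is a hypothetical number you would set yourself.

```python
def nat_gateway_findings(sums, egress_budget_bytes):
    """Return human-readable findings for one NAT gateway.

    `sums` maps a CloudWatch metric name to its Sum over the evaluation window.
    `egress_budget_bytes` is a hypothetical per-window budget chosen by the operator.
    """
    findings = []
    if sums.get("ErrorPortAllocation", 0) > 0:
        findings.append("ErrorPortAllocation > 0: port limit reached for a destination")
    if sums.get("PacketsDropCount", 0) > 0:
        findings.append("PacketsDropCount > 0: check the AWS health dashboard")
    if sums.get("BytesOutToDestination", 0) > egress_budget_bytes:
        findings.append("BytesOutToDestination over budget: review egress traffic")
    return findings
```

An empty list means the gateway looks healthy for that window; anything else is a candidate for an alert.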
3. EC2 (Compute):
- StatusCheckFailed (any): Reports whether the instance has passed both the instance status check and the system status check in the last minute. This metric can be either 0 (passed) or 1 (failed).
The following are examples of problems that can cause system status checks to fail:
- Loss of network connectivity
- Loss of system power
- Software issues on the physical host
- Hardware issues on the physical host that impact network reachability
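Since system status check failures come from the underlying host, one option is to pair the alarm with the EC2 recover action. The sketch below builds such an alarm definition for the `StatusCheckFailed_System` metric. The alarm name, the two-period evaluation window, and the example instance ID are assumptions, and the recover action is only available for some instance configurations, so verify that before relying on it.

```python
def system_status_recover_alarm(instance_id, region):
    """Build put_metric_alarm arguments that recover an instance on system check failure."""
    return {
        "AlarmName": f"ec2-{instance_id}-system-status-recover",
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 2,  # assumption: two failed minutes in a row
        "Threshold": 1,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # Built-in EC2 action that moves the instance to healthy hardware.
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:recover"],
    }
```

The resulting dictionary can be passed to boto3 as `boto3.client("cloudwatch").put_metric_alarm(**system_status_recover_alarm("i-0123456789abcdef0", "us-east-1"))`.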
4. Application Load Balancer metrics:
- HTTPCode_ELB_4XX_Count: The number of HTTP 4XX client error codes that originate from the load balancer. This count does not include any response codes generated by the targets. Client errors are generated when requests are malformed or incomplete. These requests were not received by the target, other than in the case where the load balancer returns an HTTP 460 error code.
- HTTPCode_ELB_5XX_Count: The number of HTTP 5XX server error codes that originate from the load balancer. This count does not include any response codes generated by the targets.
- RejectedConnectionCount: The number of connections that were rejected because the load balancer had reached its maximum number of connections.
- HealthyHostCount: The number of targets that are considered healthy.
- TargetResponseTime: The time elapsed, in seconds, after the request leaves the load balancer until a response from the target is received. This is equivalent to the target_processing_time field in the access logs.
- TargetTLSNegotiationErrorCount: The number of TLS connections initiated by the load balancer that did not establish a session with the target. Possible causes include a mismatch of ciphers or protocols. This metric does not apply if the target is a Lambda function.
- UnHealthyHostCount: The number of targets that are considered unhealthy.
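Raw error counts are hard to alert on when traffic varies, so it can help to express load balancer 5XX errors as a percentage of requests. The following sketch builds a CloudWatch metric math query for `get_metric_data`. It assumes the `RequestCount` metric as the denominator, and the query IDs, label, and example load balancer value are made up for illustration.

```python
def alb_5xx_rate_queries(load_balancer, period=60):
    """Build MetricDataQueries that compute the ELB-generated 5XX rate in percent.

    `load_balancer` is the LoadBalancer dimension value,
    for example "app/my-alb/50dc6c495c0c9188" (hypothetical).
    """

    def metric(query_id, metric_name):
        return {
            "Id": query_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/ApplicationELB",
                    "MetricName": metric_name,
                    "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer}],
                },
                "Period": period,
                "Stat": "Sum",
            },
            "ReturnData": False,  # only the computed rate is returned
        }

    return [
        metric("m5xx", "HTTPCode_ELB_5XX_Count"),
        metric("mreq", "RequestCount"),
        {
            "Id": "rate",
            "Expression": "100 * m5xx / mreq",
            "Label": "ELB 5XX percent",
            "ReturnData": True,
        },
    ]
```

With boto3, the list is passed as `MetricDataQueries` to `get_metric_data` together with `StartTime` and `EndTime`.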
5. Lambda:
- Errors: The number of errors thrown by a function. It can be used with the Invocations metric to calculate the percentage of invocations that failed.
- Duration: This is the amount of time taken for a Lambda invocation. Apart from the impact on cost, it’s also important to monitor any functions that are running close to their timeout value.
- ConcurrentExecutions: Monitor this value to ensure that your functions are not running close to the total concurrency limit for your AWS account. You can request a Service Quota increase if needed.
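The three checks above come down to simple arithmetic once the metric values are in hand. Here is a minimal sketch; the function names and the 80% default ratio are assumptions, not AWS defaults. Note that the Duration metric is reported in milliseconds while function timeouts are configured in seconds.

```python
def error_rate_percent(errors, invocations):
    """Percentage of invocations that ended in an error."""
    if invocations == 0:
        return 0.0
    return 100.0 * errors / invocations


def is_near_timeout(max_duration_ms, timeout_seconds, ratio=0.8):
    """True if the slowest invocation used at least `ratio` of the timeout."""
    return max_duration_ms >= ratio * timeout_seconds * 1000


def concurrency_utilization(concurrent_executions, account_limit):
    """Fraction of the account concurrency limit currently in use."""
    return concurrent_executions / account_limit
```

For example, `error_rate_percent(5, 200)` gives 2.5, and a function with a 30 second timeout whose slowest run took 28,000 ms would be flagged by `is_near_timeout(28000, 30)`.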
6. AWS Status Page:
- Scrape the AWS status page (https://status.aws.amazon.com) for all services that you support and set up alerting. Examples: AWS SSO (authz/authn), internet connectivity (all regions), LB degradation, NAT Gateways, Route 53 Resolver, RDS, Direct Connect, etc. Make sure to cover all services related to your production stack.
- We have also observed that AWS decommissions CloudFront edge locations as part of maintenance, and DNS may keep resolving to the old IP addresses, which causes requests to fail. Check the geo DNS resolution for your CDN and flush the DNS cache so that it starts resolving to the new IP addresses.
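One way to approach the status page check is to poll RSS feeds rather than scrape HTML. The sketch below assumes the status page exposes a standard RSS 2.0 feed for the services you care about and that you pass in the feed URL yourself; the function names and the keyword filter are illustrative only.

```python
import urllib.request
import xml.etree.ElementTree as ET


def fetch_feed(url, timeout=10):
    """Download a feed and return its raw bytes (the URL is supplied by the caller)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def parse_status_feed(xml_data):
    """Parse RSS 2.0 data (str or bytes) into a list of event dictionaries."""
    root = ET.fromstring(xml_data)
    events = []
    for item in root.iter("item"):
        events.append({
            "title": item.findtext("title", default="").strip(),
            "published": item.findtext("pubDate", default="").strip(),
            "description": item.findtext("description", default="").strip(),
        })
    return events


def filter_events(events, keywords):
    """Keep events whose title or description mentions any keyword (case-insensitive)."""
    lowered = [keyword.lower() for keyword in keywords]
    matches = []
    for event in events:
        text = (event["title"] + " " + event["description"]).lower()
        if any(keyword in text for keyword in lowered):
            matches.append(event)
    return matches
```

A scheduled job could call `filter_events(parse_status_feed(fetch_feed(url)), ["NAT Gateway", "Route 53"])` and forward any matches to your alerting channel.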