Introducing GPU Health Monitoring and Auto Repair for Amazon ECS Managed Instances
[Amazon Elastic Container Service (Amazon ECS)](https://aws.amazon.com/ecs/) now offers NVIDIA GPU health monitoring and auto repair functionality for Amazon ECS Managed Instances. The new capability automatically detects critical NVIDIA GPU hardware failures and replaces impaired instances, helping customers improve the availability and reliability of their GPU-accelerated containerized workloads.
Running GPU-accelerated workloads, such as GenAI inference, requires specialized hardware management to mitigate failures and minimize disruption. Amazon ECS Managed Instances now continuously monitor GPU health using NVIDIA Data Center GPU Manager (DCGM) and proactively replace impaired capacity when critical failures occur. You can monitor GPU health through the DescribeContainerInstances API and receive notifications through Amazon EventBridge when instances become impaired. For workloads where you prefer to manage instance lifecycle manually, you can opt out of auto repair at the capacity provider level and handle GPU error events with your own remediation logic.
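As a rough illustration of polling health through DescribeContainerInstances, the sketch below uses boto3 to request container instance health and list the instances reported as impaired. It relies on the existing `include=["CONTAINER_INSTANCE_HEALTH"]` option and the `healthStatus` structure in the response. Whether GPU failures appear there, and under which check type, is an assumption not confirmed by this announcement, so consult the Developer Guide for the exact fields. The helper names `impaired_instances` and `fetch_impaired` are hypothetical.

```python
from typing import Any


def impaired_instances(response: dict[str, Any]) -> list[dict[str, Any]]:
    """Return the ARN and failing check types for each IMPAIRED instance.

    Assumes GPU health is surfaced through the standard healthStatus
    structure of a DescribeContainerInstances response.
    """
    results = []
    for instance in response.get("containerInstances", []):
        health = instance.get("healthStatus", {})
        if health.get("overallStatus") != "IMPAIRED":
            continue
        # Collect only the individual checks that are themselves impaired.
        failing = [
            detail.get("type")
            for detail in health.get("details", [])
            if detail.get("status") == "IMPAIRED"
        ]
        results.append(
            {"arn": instance.get("containerInstanceArn"), "failing_checks": failing}
        )
    return results


def fetch_impaired(cluster: str, instance_arns: list[str]) -> list[dict[str, Any]]:
    """Call DescribeContainerInstances and filter for impaired instances."""
    # Imported lazily so the parsing helper works without boto3 or AWS access.
    import boto3

    ecs = boto3.client("ecs")
    response = ecs.describe_container_instances(
        cluster=cluster,
        containerInstances=instance_arns,
        include=["CONTAINER_INSTANCE_HEALTH"],
    )
    return impaired_instances(response)
```

Keeping the parsing separate from the API call means the same filter could be reused on payloads delivered by other means, for example if you choose to opt out of auto repair and drive your own remediation logic.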
GPU health auto repair is enabled by default on all Amazon ECS Managed Instances running on supported NVIDIA GPU instance types at no additional cost. The capability is available in all AWS Commercial Regions. To learn more, visit the [Amazon ECS Developer Guide](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/managed-instances-gpu-auto-repair.html).