SageMaker HyperPod now supports gang scheduling for distributed training workloads
Amazon SageMaker HyperPod task governance now supports gang scheduling, which ensures all pods required for a distributed training job are ready before training begins. Administrators can configure gang scheduling to prevent wasted compute from partial job runs and avoid deadlocks from jobs waiting for resources.
Data scientists running distributed AI/ML training jobs on Amazon SageMaker HyperPod clusters using the EKS orchestrator require multiple pods to work together across nodes with pod-to-pod communication. When some pods start but others do not, jobs can hold onto resources without making progress, block other workloads, and increase costs. Gang scheduling resolves this by monitoring all pods in a workload and pulling the workload back if not all pods are ready within a set time. Pulled-back workloads are automatically requeued to prevent stalling. Administrators can adjust settings on the HyperPod Console, such as how long to wait for pods to be ready, how to handle node failures, whether to admit workloads one at a time to avoid deadlocks on busy clusters, and how retries are scheduled.
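The all-or-nothing behavior described above can be illustrated with a small Python sketch. This is not the HyperPod or EKS API: the `GangPolicy` class, the `evaluate` and `retry_delay` functions, the field names, and the exponential backoff rule are all hypothetical, chosen only to show how a readiness timeout and a requeue limit might interact.

```python
from dataclasses import dataclass


@dataclass
class GangPolicy:
    """Hypothetical settings; names do not correspond to console fields."""
    ready_timeout_s: float       # how long to wait for every pod to be ready
    backoff_base_s: float = 60   # assumed base delay before a retry
    backoff_limit: int = 3       # assumed maximum number of requeues


def evaluate(total_pods, ready_pods, elapsed_s, requeue_count, policy):
    """Decide what to do with a workload given how many of its pods are ready."""
    if ready_pods >= total_pods:
        return "run"      # the whole gang is ready, so training can begin
    if elapsed_s < policy.ready_timeout_s:
        return "wait"     # still inside the readiness window
    if requeue_count >= policy.backoff_limit:
        return "fail"     # retries exhausted; stop requeueing
    return "requeue"      # pull the workload back and release its resources


def retry_delay(requeue_count, policy):
    """Assumed exponential backoff between requeue attempts."""
    return policy.backoff_base_s * 2 ** requeue_count


policy = GangPolicy(ready_timeout_s=300)
print(evaluate(8, 8, 120, 0, policy))   # all 8 pods ready
print(evaluate(8, 6, 120, 0, policy))   # 6 of 8 ready, still within timeout
print(evaluate(8, 6, 400, 0, policy))   # timeout exceeded, first attempt
print(evaluate(8, 6, 400, 3, policy))   # timeout exceeded, retries used up
```

The point of the sketch is that a partially started job never holds resources indefinitely: it either runs with every pod ready, or it is pulled back once the timeout passes.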
This capability is currently available for Amazon SageMaker HyperPod clusters using the EKS orchestrator in the following [AWS Regions](https://aws.amazon.com/about-aws/global-infrastructure/regional-product-services/): US East (N. Virginia), US East (Ohio), US West (N. California), US West (Oregon), Asia Pacific (Mumbai), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Asia Pacific (Jakarta), Europe (Frankfurt), Europe (Ireland), Europe (London), Europe (Stockholm), Europe (Spain), and South America (São Paulo).
To learn more, visit the [SageMaker HyperPod webpage](https://aws.amazon.com/sagemaker-ai/hyperpod/) and the [HyperPod task governance documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-operate-console-ui-governance-tasks-gang-scheduling.html).