April 1st, 2024

Amazon EMR on EC2 now gracefully replaces unhealthy core nodes


We are excited to introduce a new [Amazon EMR](https://aws.amazon.com/emr/) on EC2 feature that automatically and gracefully replaces unhealthy core nodes to keep clusters operating optimally and prevent data loss. Additionally, EMR on EC2 will publish CloudWatch events to provide visibility into node health and recovery actions. These improvements are available for all Amazon EMR releases.

With EMR on EC2, you can easily provision and scale your data processing clusters without having to manage compute infrastructure or open-source application setup. However, an EMR node can become unhealthy because of underlying hardware problems or memory over-utilization. Previously, for termination-protected clusters, unhealthy core nodes would remain idle and continue to count towards cluster capacity. For other clusters, the core node replacement process was not graceful.

With today’s launch, Amazon EMR minimizes job interruption and prevents data loss by [gracefully decommissioning](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-scaledown-behavior.html#emr-scaledown-terminate-task) and replacing unhealthy core nodes regardless of your cluster’s termination protection setting. Amazon EMR will also publish [unhealthy node replacement events](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-manage-cloudwatch-events.html#emr-cloudwatch-unhealthy-node-replacement-events) that will be available in the EMR console and [Amazon EventBridge](https://aws.amazon.com/eventbridge/).

Graceful unhealthy core node replacement is generally available in all [AWS Regions](https://docs.aws.amazon.com/general/latest/gr/emr.html) where Amazon EMR on EC2 is available. To ensure this launch doesn’t affect your existing workflows, unhealthy node replacement is turned off by default for clusters that run EMR release 7.0.0 or lower and have termination protection enabled. For all other clusters, the feature is turned on by default.
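The default behavior described above can be sketched in a few lines of Python. This is a minimal illustration, not AWS's implementation: the helper names `parse_release`, `default_unhealthy_node_replacement`, and `override_unhealthy_node_replacement` are hypothetical, and the sketch assumes release labels of the form `emr-7.0.0`. The last helper assumes the boto3 EMR client exposes `set_unhealthy_node_replacement` (mirroring the `SetUnhealthyNodeReplacement` API action); check the current SDK documentation before relying on it.

```python
def parse_release(label: str) -> tuple[int, ...]:
    """Turn a release label such as 'emr-7.0.0' into (7, 0, 0)."""
    version = label.removeprefix("emr-")
    return tuple(int(part) for part in version.split("."))


def default_unhealthy_node_replacement(release_label: str,
                                       termination_protected: bool) -> bool:
    """Return the default setting as described in this announcement.

    Off only for termination-protected clusters on release 7.0.0 or lower;
    on for every other cluster.
    """
    if termination_protected and parse_release(release_label) <= (7, 0, 0):
        return False
    return True


def override_unhealthy_node_replacement(cluster_id: str, enabled: bool) -> None:
    """Explicitly set the feature on a running cluster (hypothetical helper).

    Assumption: the boto3 EMR client provides set_unhealthy_node_replacement
    with these parameter names. boto3 is imported lazily so the pure helpers
    above work without the SDK or AWS credentials.
    """
    import boto3

    emr = boto3.client("emr")
    emr.set_unhealthy_node_replacement(
        JobFlowIds=[cluster_id],
        UnhealthyNodeReplacement=enabled,
    )
```

For example, `default_unhealthy_node_replacement("emr-6.15.0", True)` evaluates to `False`, while the same call with `"emr-7.1.0"` evaluates to `True`. If the default does not suit a workload, the setting can be changed explicitly per cluster rather than relying on the release-based default.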
To learn more, see [replacing unhealthy nodes](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-node-replacement.html).