April 18th, 2024

AWS Neuron introduces speculative decoding and vLLM support


Today, AWS announces the release of Neuron 2.18, which moves PyTorch 2.1 support out of beta to stable, adds continuous batching with vLLM support, and adds support for speculative decoding with a [Llama-2-70B sample](https://github.com/aws-neuron/aws-neuron-samples/tree/master/torch-neuronx/transformers-neuronx/inference/speculative%5Fsampling.ipynb) in the Transformers NeuronX library.

AWS Neuron is the SDK for Amazon EC2 Inferentia and Trainium based instances purpose-built for generative AI. Neuron integrates with popular ML frameworks like PyTorch and TensorFlow. It includes a compiler, runtime, tools, and libraries to support high performance training and inference of generative AI models on Trn1 instances and Inf2 instances.

This release also adds new features and performance improvements for both LLM training and inference, and updates Neuron DLAMIs and Neuron DLCs. For training, NeuronX Distributed adds asynchronous checkpointing support and auto partitioning pipeline parallelism, and introduces pipeline parallelism in PyTorch Lightning Trainer (Beta). For inference, Transformers NeuronX improves weight loading performance by adding support for the SafeTensor checkpoint format and adds new samples for Mixtral-8x7B-v0.1 and mistralai/Mistral-7B-Instruct-v0.2. NeuronX Distributed and PyTorch NeuronX add support for auto-bucketing.

You can use the AWS Neuron SDK to train and deploy models on Trn1 and Inf2 instances, available in AWS Regions as On-Demand Instances, Reserved Instances, Spot Instances, or as part of a Savings Plan.

For a full list of new features and enhancements in Neuron 2.18, visit the [Neuron Release Notes](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/index.html). To get started with Neuron, see:

- [AWS Neuron](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/)
- [Inf2 Instances](https://aws.amazon.com/ec2/instance-types/inf2/)
- [Trn1 Instances](https://aws.amazon.com/ec2/instance-types/trn1/)