AWS Neuron introduces speculative decoding and vLLM support
Today, AWS announces the release of Neuron 2.18, which introduces stable support (out of beta) for PyTorch 2.1, adds continuous batching with vLLM support, and adds support for speculative decoding in the Transformers NeuronX library, with a [Llama-2-70B sample](https://github.com/aws-neuron/aws-neuron-samples/tree/master/torch-neuronx/transformers-neuronx/inference/speculative%5Fsampling.ipynb).
AWS Neuron is the SDK for Amazon EC2 Inferentia- and Trainium-based instances, which are purpose-built for generative AI. Neuron integrates with popular ML frameworks such as PyTorch and TensorFlow. It includes a compiler, runtime, tools, and libraries that support high-performance training and inference of generative AI models on Trn1 and Inf2 instances.
This release also adds new features and performance improvements for both LLM training and inference, and updates the Neuron DLAMIs and Neuron DLCs. For training, NeuronX Distributed adds asynchronous checkpointing support and auto-partitioning for pipeline parallelism, and introduces pipeline parallelism in the PyTorch Lightning Trainer (Beta). For inference, Transformers NeuronX improves weight-loading performance by adding support for the SafeTensor checkpoint format, and adds new samples for Mixtral-8x7B-v0.1 and mistralai/Mistral-7B-Instruct-v0.2. NeuronX Distributed and PyTorch NeuronX add support for auto-bucketing.
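To illustrate the idea behind the speculative decoding feature mentioned above: a small, cheap "draft" model proposes several tokens ahead, and the large "target" model (for example, Llama-2-70B) verifies them, keeping the longest agreeing prefix. The sketch below is a minimal, framework-agnostic illustration, not the Transformers NeuronX API. It assumes greedy decoding and simplifies the probabilistic acceptance rule used in speculative *sampling* to an exact-match check; the `draft`/`target` callables are hypothetical stand-ins for real models.

```python
from typing import Callable, List

# Hypothetical stand-in for a model: given the tokens so far, return the
# greedy next token. Not a Neuron or Transformers NeuronX interface.
NextToken = Callable[[List[int]], int]


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], max_new_tokens: int,
                       k: int = 4) -> List[int]:
    """Greedy speculative decoding: same output as greedy decoding with `target`."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(tokens) < goal:
        # 1. The cheap draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        ctx = list(tokens)
        for _ in range(k):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)

        # 2. The target model verifies the proposal. A real system scores all
        #    k positions in one batched forward pass; here we loop for clarity.
        accepted: List[int] = []
        ctx = list(tokens)
        for t in proposal:
            expected = target(ctx)
            if expected != t:
                accepted.append(expected)  # first mismatch: keep target's token
                break
            accepted.append(t)
            ctx.append(t)
        else:
            # Every draft token was accepted: target contributes one bonus token.
            accepted.append(target(ctx))

        tokens.extend(accepted)
    return tokens[:goal]
```

Each loop iteration accepts at least one token, and every accepted token is the one the target model would have chosen, so the result matches plain greedy decoding. The speedup comes from verifying several draft tokens per expensive target-model pass.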
You can use the AWS Neuron SDK to train and deploy models on Trn1 and Inf2 instances, which are available in AWS Regions as On-Demand Instances, Reserved Instances, or Spot Instances, or as part of a Savings Plan.
For a full list of new features and enhancements in Neuron 2.18, visit the [Neuron Release Notes](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/index.html). To get started with Neuron, see:
- [AWS Neuron](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/)
- [Inf2 Instances](https://aws.amazon.com/ec2/instance-types/inf2/)
- [Trn1 Instances](https://aws.amazon.com/ec2/instance-types/trn1/)