
# Hex-LLM: High-Efficiency Large Language Model Serving is available in General Availability (GA)



## Feature

[Hex-LLM: High-Efficiency Large Language Model Serving](https://cloud.google.com/vertex-ai/generative-ai/docs/open-models/use-hex-llm) is available in [General Availability (GA)](https://cloud.google.com/products#product-launch-stages).

This launch adds support for the following models:

* Llama 3.1
* Llama 3.2
* Phi-3
* Qwen2 and Qwen2.5

Additional supported features:

* Multi-host serving.
* Disaggregated serving (experimental).
* Prefix caching.
* AWQ quantization.