Article written by Google Cloud software engineers Xiang Xu and Pengchong Jin. 

With Model Garden on Vertex AI, Google Cloud strives to deliver highly efficient and cost-optimized machine learning (ML) workflow recipes. Currently, Model Garden offers a selection of more than 150 first-party, open, and third-party foundation models. 

Last year, we introduced the popular open-source large language model (LLM) serving stack, vLLM on GPUs, in Model Garden. Since then, we have witnessed rapid growth in serving deployments. Today, we are thrilled to introduce Hex-LLM (High-Efficiency LLM serving with XLA) on TPUs in Vertex AI Model Garden.

Hex-LLM

Hex-LLM is Vertex AI's in-house LLM serving framework designed and optimized for Google's Cloud TPU hardware, available as part of AI Hypercomputer. Hex-LLM combines state-of-the-art LLM serving technologies, including continuous batching and paged attention, and in-house optimizations tailored for XLA/TPU. This represents our latest high-efficiency and low-cost LLM serving solution on TPU for open-source models.

Hex-LLM is now available in Vertex AI Model Garden via playground, notebook, and one-click deployment. We can't wait to see how Hex-LLM and Cloud TPUs can help your LLM serving workflows.

Design and benchmarks

Hex-LLM is inspired by several successful open-source projects, including vLLM and FlashAttention. It incorporates the latest LLM serving technologies and in-house optimizations tailored for XLA/TPU.

The key optimizations in Hex-LLM include:

  • A token-based continuous batching algorithm that ensures maximal memory utilization for the KV cache (a simplified sketch of the scheduling idea follows this list).
  • A complete rewrite of the PagedAttention kernel, optimized for XLA/TPU.
  • Flexible and composable data-parallelism and tensor-parallelism strategies, with special weight-sharing optimizations, to run large models efficiently on multiple TPU chips.
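
To illustrate the first point, here is a deliberately simplified, conceptual sketch of token-based continuous batching: requests are admitted whenever their tokens fit within a token budget (standing in for KV-cache capacity), and the batch is refilled at every decoding step as sequences finish. The class and function names are inventions for this example and do not reflect Hex-LLM's internal code.

    from collections import deque
    from dataclasses import dataclass, field

    @dataclass
    class Request:
        prompt_tokens: list          # already-tokenized prompt
        max_new_tokens: int
        generated: list = field(default_factory=list)

        @property
        def num_tokens(self):
            return len(self.prompt_tokens) + len(self.generated)

    class ContinuousBatcher:
        def __init__(self, token_budget):
            self.token_budget = token_budget   # proxy for KV-cache capacity
            self.waiting = deque()
            self.running = []

        def submit(self, request):
            self.waiting.append(request)

        def _admit(self):
            # Admit waiting requests for as long as their tokens fit in the budget.
            used = sum(r.num_tokens for r in self.running)
            while self.waiting and used + self.waiting[0].num_tokens <= self.token_budget:
                request = self.waiting.popleft()
                used += request.num_tokens
                self.running.append(request)

        def step(self, decode_fn):
            # One decoding iteration: top up the batch, generate one token per
            # running request, then retire finished sequences so their memory
            # can be reused immediately by waiting requests.
            self._admit()
            for request in self.running:
                request.generated.append(decode_fn(request))
            still_running, finished = [], []
            for request in self.running:
                done = len(request.generated) >= request.max_new_tokens
                (finished if done else still_running).append(request)
            self.running = still_running
            return finished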

In addition, Hex-LLM supports a wide range of popular dense and sparse LLM models, including:

  • Gemma 2B and 7B
  • Gemma 2 9B and 27B
  • Llama 2 7B, 13B, and 70B
  • Llama 3 8B and 70B
  • Mistral 7B and Mixtral 8x7B

As the LLM field keeps evolving, we are also committed to bringing in more advanced technologies and the latest and greatest foundation models to Hex-LLM.

Hex-LLM delivers competitive performance with high throughput and low latency. We conducted benchmark experiments, and the metrics we measured can be explained as follows: 

  • TPS (tokens per second) is the average number of tokens an LLM server receives per second. Like QPS (queries per second), which measures the traffic of a general-purpose server, TPS measures the traffic of an LLM server, but at a finer granularity.
  • Throughput measures how many tokens the server can generate over a given timespan at a specific TPS. It is a key metric for estimating how many concurrent requests the server can process.
  • Latency measures the average time to generate one output token at a specific TPS. This captures the end-to-end time spent on the server side for each request, including all queueing and processing time.

Note that there is usually a tradeoff between high throughput and low latency. As TPS increases, both throughput and latency should increase. The throughput will saturate at a certain TPS, while the latency will continue to increase with higher TPS. Thus, given a particular TPS, we can measure a pair of throughput and latency metrics of the server. The resultant throughput-latency plot with respect to different TPS gives an accurate measurement of the LLM server performance.
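
As a concrete illustration of how these quantities relate, the following sketch computes the throughput/latency pair for one benchmark run at a fixed input TPS. The record format and function name are assumptions made for this example, not Hex-LLM's actual benchmarking code.

    def summarize_run(records, run_duration_s):
        # records: one dict per request, with the number of generated output
        # tokens and the end-to-end server-side latency for that request.
        total_output_tokens = sum(r["output_tokens"] for r in records)

        # Throughput: output tokens generated per second over the whole run.
        throughput = total_output_tokens / run_duration_s

        # Latency: time per output token, averaged over requests.
        per_token = [r["latency_s"] / r["output_tokens"] for r in records]
        latency_per_token = sum(per_token) / len(per_token)

        return throughput, latency_per_token

Sweeping the input TPS and recording one (throughput, latency) pair per setting produces the throughput-latency curves shown in the charts below.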

The data used to benchmark Hex-LLM is sampled from the ShareGPT dataset, which is a widely adopted dataset containing prompts and outputs in variable lengths.

In the following charts, we present the performance of Gemma 7B and Llama 2 70B (int8 weight quantized) models on eight TPU v5e chips:

  • Gemma 7B model: 6ms per output token at the lowest TPS, 6250 output tokens per second at the highest TPS.
  • Llama 2 70B int8 model: 26ms per output token at the lowest TPS, 1510 output tokens per second at the highest TPS.
https://storage.googleapis.com/gweb-cloudblog-publish/images/1_xdU0TAU.max-1500x1500.png
Gemma 7B model
https://storage.googleapis.com/gweb-cloudblog-publish/images/2_SWx3NoW.max-1500x1500.png
Llama 2 70B int8 model

Get started with Model Garden on Vertex AI

We have integrated the Hex-LLM TPU serving container into Vertex AI Model Garden. Users can access this serving technology through the playground, one-click deployment, or Colab Enterprise Notebook examples for a variety of models.

Vertex AI Model Garden's playground is a pre-deployed Vertex AI Prediction endpoint integrated into the UI. Users type in the prompt text and any optional request arguments, click the SUBMIT button, and quickly get the model's response.

https://storage.googleapis.com/gweb-cloudblog-publish/images/3_lUqPdUM.max-700x700.png
Example prompt

To deploy a custom Vertex Prediction endpoint with Hex-LLM, one-click deployment through the model card UI is the easiest approach:

1. Navigate to the model card page and click on the "DEPLOY" button.

https://storage.googleapis.com/gweb-cloudblog-publish/images/4_tSl8PhV.max-1000x1000.png
Deploy button location

2. For the model variation of interest, select the TPU v5e machine type ct5lp-hightpu-*t for deployment. Click "DEPLOY" at the bottom to begin the deployment process. You will receive two email notifications: one when the model is uploaded and one when the endpoint is ready.

https://storage.googleapis.com/gweb-cloudblog-publish/images/5_hlp7Fom.max-600x600.png
Example deployment model selection

For maximum flexibility, users can use Colab Enterprise notebook examples to deploy a Vertex Prediction endpoint with Hex-LLM using the Vertex Python SDK.

1. Navigate to the model card page and click on the "OPEN NOTEBOOK" button.

https://storage.googleapis.com/gweb-cloudblog-publish/images/6_6XFILuu.max-1000x1000.png
Open Notebook button location

2. Select the Vertex Serving notebook. This will open the notebook in Colab Enterprise.

https://storage.googleapis.com/gweb-cloudblog-publish/images/7_jayfnyJ.max-1400x1400.png
Serving notebook selection
https://storage.googleapis.com/gweb-cloudblog-publish/images/8_JKobNAP.max-1600x1600.png
Sample notebook preview

3. Run through the notebook to deploy with Hex-LLM and send prediction requests to the endpoint. The notebook includes a code snippet for the deployment function.
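The exact snippet lives in the notebook; as a rough sketch of the same flow with the Vertex AI Python SDK, a deployment and prediction request could look like the following. The container URI, serving arguments, request fields, and the specific machine type are placeholders chosen for illustration, not the notebook's exact values.

    from google.cloud import aiplatform

    aiplatform.init(project="your-project-id", location="your-region")

    # Upload the model with the Hex-LLM TPU serving container.
    # The container URI and serving arguments are placeholders; use the values
    # provided in the Model Garden notebook for your chosen model.
    model = aiplatform.Model.upload(
        display_name="gemma-7b-hexllm",
        serving_container_image_uri="<hex-llm-serving-container-uri>",
        serving_container_args=[
            "--model=<model-id>",            # model to serve (placeholder)
            "--tensor_parallel_size=8",      # shard across TPU chips (assumption)
        ],
    )

    # Deploy to a Vertex AI Prediction endpoint backed by TPU v5e.
    # "ct5lp-hightpu-8t" is the 8-chip variant of the ct5lp-hightpu-*t family,
    # chosen here for illustration.
    endpoint = model.deploy(
        machine_type="ct5lp-hightpu-8t",
        min_replica_count=1,
        max_replica_count=1,
    )

    # Send a prediction request (the instance fields are illustrative).
    response = endpoint.predict(
        instances=[{"prompt": "What is a TPU?", "max_tokens": 128, "temperature": 0.7}]
    )
    print(response.predictions)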

Here, users can customize the deployment to best align with their needs. For instance, they can deploy with multiple replicas to handle a large amount of projected traffic.
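
For example, with the Vertex AI Python SDK the replica counts can be set directly in the deploy call; the parameter names below are standard SDK parameters, while the counts and machine type are placeholders.

    # Serve higher projected traffic with multiple replicas behind one endpoint.
    endpoint = model.deploy(
        machine_type="ct5lp-hightpu-8t",
        min_replica_count=2,     # illustrative values; size to your traffic
        max_replica_count=4,
    )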
