Partner POV | Developer Blog: Accurate and Efficient Collaborative Optimizations for Fast Generative AI on AMD GPUs
Blog post provided by AMD. Written by Eddie Wu (AMD), Cheng Ling (AMD), George Wang (AMD), Xu Wang (Heyintelligence, AI optimization technical lead) and Yuncong Yang (Heyintelligence, GPU technical lead).
AMD is advancing AI with an open ecosystem through its open-source ROCm™ software, a collection of drivers, development tools, libraries and APIs that make GPU programming straightforward. AMD Radeon™ GPUs and AMD ROCm software are designed to balance accuracy and efficiency, empowering developers to rapidly build high-performance, large-model applications on top of the underlying hardware architecture and software innovations. This opens an opportunity for more partners to co-innovate in the AMD AI ecosystem.
HEYINTELLIGENCE Delivers Generalized Co-optimization of LLMs on AMD GPU Platforms
HEYINTELLIGENCE delivers highly optimized AI solutions spanning both hardware and software. Founded in 2017, with deep experience in GPU architecture design and AI algorithm optimization, HEYINTELLIGENCE is developing Generalized Co-optimization Technology (GCT) for LLMs, which provides optimized kernels and hybrid quantization combinations tailored to each model's structural characteristics. GCT is designed to achieve a significant improvement in performance with almost no loss of accuracy. Recently, HEYINTELLIGENCE optimized the inference of ChatGLM2-6B on an AMD Radeon™ RX 7900 XTX GPU.
As shown in Figure 1, four operations in the original ChatGLM2-6B implementation were selected based on their share of the compute and memory bandwidth of the entire inference process: RMSNorm, MatMul fused with Rotary-EMB, MatMul fused with SwiGLU, and decoding attention. GCT developed four optimized kernels to implement these functions. All four kernels are designed to deliver significant performance gains, thanks to the flexibility of HIP and the ROCm components: the kernels compile into efficient backend instructions and map well onto the high execution efficiency of AMD GPUs. The key elements of the optimized kernels are as follows, with illustrative sketches after the list:
1. RMSNorm - normalizes the summed inputs to a neuron in one layer by their root mean square (RMS). Avoiding synchronization between warps is key to improving performance (see the first sketch after this list).
2. MatMul fused with Rotary-EMB - Fusing matrix multiplication (MatMul) with the rotary operation greatly reduces the launch cost of multiple kernels. Designing the kernel around the granularity of the rotary embedding is the key to increasing data sharing and improving compute efficiency (second sketch below).
3. MatMul fused with SwiGLU - Fusing matrix multiplication with SwiGLU removes the launch cost of two separate kernels. Designing the whole optimization from the perspective of the output tile also reduces memory-to-register load time (third sketch below).
4. Decoding Attention - Flexible choice of per-thread processing granularity based on the computational characteristics of attention, an optimized synchronization scheme between warps in the softmax, and judicious use of shared memory are the three key factors for improving performance (fourth sketch below).
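To make the first point concrete, here is a minimal RMSNorm sketch in HIP. It assumes one wavefront per token row, so the sum-of-squares reduction stays inside a single wavefront and is done purely with shuffle intrinsics: no shared memory, no inter-warp barrier. The kernel name, WAVE_SIZE, and the launch shape are our illustrative assumptions, not HEYINTELLIGENCE's actual kernel.

```cpp
// Minimal RMSNorm sketch in HIP: one wavefront normalizes one token row,
// so the reduction never crosses a warp boundary and no barrier is needed.
#include <hip/hip_runtime.h>

constexpr int WAVE_SIZE = 32;   // 32 on RDNA3 (e.g. RX 7900 XTX); 64 on CDNA

__global__ void rmsnorm_wave(const float* __restrict__ x,      // [rows, hidden]
                             const float* __restrict__ weight, // [hidden]
                             float* __restrict__ out,          // [rows, hidden]
                             int hidden, float eps) {
    // One wavefront per row; launch with blockDim.x == WAVE_SIZE.
    const float* row = x   + (size_t)blockIdx.x * hidden;
    float*       dst = out + (size_t)blockIdx.x * hidden;

    // Each lane accumulates a strided partial sum of squares.
    float acc = 0.0f;
    for (int i = threadIdx.x; i < hidden; i += WAVE_SIZE) {
        float v = row[i];
        acc += v * v;
    }
    // Wavefront-local reduction with shuffles; no __syncthreads() anywhere.
    for (int off = WAVE_SIZE / 2; off > 0; off >>= 1)
        acc += __shfl_down(acc, off, WAVE_SIZE);
    // Lane 0 now holds the full sum; broadcast the normalization factor.
    float inv_rms = __shfl(rsqrtf(acc / hidden + eps), 0, WAVE_SIZE);

    for (int i = threadIdx.x; i < hidden; i += WAVE_SIZE)
        dst[i] = row[i] * inv_rms * weight[i];
}
```

Launched as one workgroup of WAVE_SIZE threads per token, the kernel issues no workgroup-level barriers, which is exactly the synchronization cost the first list item targets.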
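The second item can be sketched by giving each thread one rotary pair of the projection output, so the rotation happens in registers immediately after the dot products and the separate rotary kernel launch disappears. The naive per-thread dot products, the names, and the pair layout (adjacent elements, all dims rotated; ChatGLM2-6B actually rotates only part of each head) are simplifying assumptions.

```cpp
// Hedged sketch of MatMul fused with rotary embedding (Rotary-EMB) in HIP.
// Each thread produces one rotary pair of the Q (or K) projection and
// rotates it in registers before the store, saving a second kernel launch.
#include <hip/hip_runtime.h>

__global__ void matmul_rotary(const float* __restrict__ x,   // [tokens, in_dim]
                              const float* __restrict__ w,   // [in_dim, head_dim]
                              float* __restrict__ q,         // [tokens, head_dim]
                              const int* __restrict__ pos,   // position id per token
                              int in_dim, int head_dim, float theta_base) {
    int token = blockIdx.x;
    int pair  = blockIdx.y * blockDim.x + threadIdx.x;  // rotary pair index
    if (2 * pair + 1 >= head_dim) return;

    const float* row = x + (size_t)token * in_dim;
    // Two dot products for the pair (2*pair, 2*pair+1) of the projection.
    float a = 0.0f, b = 0.0f;
    for (int k = 0; k < in_dim; ++k) {
        a += row[k] * w[(size_t)k * head_dim + 2 * pair];
        b += row[k] * w[(size_t)k * head_dim + 2 * pair + 1];
    }
    // Standard RoPE angle for this pair and position; rotate in registers.
    float ang = pos[token] * powf(theta_base, -2.0f * pair / head_dim);
    float c = cosf(ang), s = sinf(ang);
    q[(size_t)token * head_dim + 2 * pair]     = a * c - b * s;
    q[(size_t)token * head_dim + 2 * pair + 1] = a * s + b * c;
}
```

Matching the thread granularity to the rotary pair is the point of the list item: both elements of a pair live in one thread's registers, so no data exchange is needed to apply the rotation.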
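For the third item, a hedged sketch of MatMul fused with SwiGLU: each thread owns one element of the MLP output, computes the matching gate and up dot products over the same input row, and applies silu(gate) * up before a single store. This removes one kernel launch and one round trip of the intermediate activations through memory. The layout and names are assumptions for illustration.

```cpp
// Hedged sketch of MatMul fused with SwiGLU in HIP, designed from the output
// perspective: one thread per output element, one pass over the input row.
#include <hip/hip_runtime.h>

__global__ void matmul_swiglu(const float* __restrict__ x,       // [tokens, in_dim]
                              const float* __restrict__ w_gate,  // [in_dim, ffn_dim]
                              const float* __restrict__ w_up,    // [in_dim, ffn_dim]
                              float* __restrict__ out,           // [tokens, ffn_dim]
                              int in_dim, int ffn_dim) {
    int token = blockIdx.x;
    int j = blockIdx.y * blockDim.x + threadIdx.x;   // output column
    if (j >= ffn_dim) return;

    const float* row = x + (size_t)token * in_dim;
    float g = 0.0f, u = 0.0f;
    // One pass over the row feeds both dot products, halving the loads of x.
    for (int k = 0; k < in_dim; ++k) {
        float v = row[k];
        g += v * w_gate[(size_t)k * ffn_dim + j];
        u += v * w_up[(size_t)k * ffn_dim + j];
    }
    // SwiGLU activation fused with the single store: silu(g) * u.
    float silu = g / (1.0f + expf(-g));
    out[(size_t)token * ffn_dim + j] = silu * u;
}
```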
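Finally, a hedged sketch of single-token (decoding) attention: one workgroup per (sequence, head), each thread scores a stride of KV positions, the softmax max/sum reductions go through a small shared-memory scratchpad, and the weighted sum over V is accumulated per output channel. The capacity limits and names are illustrative; the real GCT kernel tunes the granularity and warp synchronization far more carefully than this.

```cpp
// Hedged sketch of decoding attention for a single query vector in HIP.
// Launch with BLOCK threads per workgroup; assumes seq_len <= MAX_SEQ.
#include <hip/hip_runtime.h>

constexpr int BLOCK   = 256;
constexpr int MAX_SEQ = 4096;   // assumed shared-memory budget for scores

__global__ void decode_attention(const float* __restrict__ q,  // [head_dim]
                                 const float* __restrict__ k,  // [seq_len, head_dim]
                                 const float* __restrict__ v,  // [seq_len, head_dim]
                                 float* __restrict__ out,      // [head_dim]
                                 int seq_len, int head_dim) {
    __shared__ float score[MAX_SEQ];
    __shared__ float red[BLOCK];

    // 1) Dot-product scores, one KV position per thread (strided).
    float local_max = -INFINITY;
    for (int t = threadIdx.x; t < seq_len; t += BLOCK) {
        float s = 0.0f;
        for (int d = 0; d < head_dim; ++d) s += q[d] * k[(size_t)t * head_dim + d];
        s *= rsqrtf((float)head_dim);
        score[t] = s;
        local_max = fmaxf(local_max, s);
    }
    // 2) Block-wide max, then exp-sum, for a numerically stable softmax.
    red[threadIdx.x] = local_max;
    __syncthreads();
    for (int off = BLOCK / 2; off > 0; off >>= 1) {
        if (threadIdx.x < off) red[threadIdx.x] = fmaxf(red[threadIdx.x], red[threadIdx.x + off]);
        __syncthreads();
    }
    float m = red[0];
    float local_sum = 0.0f;
    for (int t = threadIdx.x; t < seq_len; t += BLOCK) {
        score[t] = expf(score[t] - m);
        local_sum += score[t];
    }
    __syncthreads();
    red[threadIdx.x] = local_sum;
    __syncthreads();
    for (int off = BLOCK / 2; off > 0; off >>= 1) {
        if (threadIdx.x < off) red[threadIdx.x] += red[threadIdx.x + off];
        __syncthreads();
    }
    float inv_sum = 1.0f / red[0];
    // 3) Weighted sum over V, one output channel per thread (strided).
    for (int d = threadIdx.x; d < head_dim; d += BLOCK) {
        float acc = 0.0f;
        for (int t = 0; t < seq_len; ++t) acc += score[t] * v[(size_t)t * head_dim + d];
        out[d] = acc * inv_sum;
    }
}
```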
These kernel optimization techniques have minimal impact on accuracy and are independent of the quantization strategy, so they can be used as a standalone plug-in alongside various quantization algorithms for additional performance on top of GCT. For accuracy-sensitive applications, on the other hand, quantization may reduce model generalization and thus pose unpredictable risks; in such cases quantization should be used with caution, while the GCT kernel techniques can still be employed to optimize performance.
Accuracy Matters
In LLM applications, quantization strategies can be used to reduce GPU memory usage and increase the number of users that can be served simultaneously. While aggressive quantization significantly reduces the volume of weight data, the price paid in accuracy is sometimes unacceptable, especially in practical LLM applications.
GCT, however, offers optimizations without sacrificing accuracy, which matters for LLM applications like ChatGLM2-6B. Since matrix multiplication accounts for the bulk of the computation and data movement, GCT applies the SmoothQuant method to obtain per-channel 8-bit weights and stores the corresponding FP16 scales to a file. After quantization, the parameter volume of ChatGLM2-6B is reduced significantly with limited impact on accuracy. A hedged sketch of this scheme follows.
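The scheme described above can be sketched as straightforward host-side C++: symmetric per-output-channel INT8 rounding with one FP16 scale kept per channel, preceded by a SmoothQuant-style per-input-channel rescale that migrates activation outliers into the weights. All names are hypothetical, `_Float16` is a clang/hipcc extension, and the real pipeline derives the smoothing factors from activation statistics, which are omitted here.

```cpp
// Hedged sketch of per-channel INT8 weight quantization with FP16 scales.
#include <cstdint>
#include <cmath>
#include <vector>

struct QuantizedWeight {
    std::vector<int8_t>   q;       // [out_ch, in_ch] INT8 weights
    std::vector<_Float16> scale;   // one FP16 scale per output channel
};

QuantizedWeight quantize_per_channel(std::vector<float> w,             // [out_ch, in_ch]
                                     const std::vector<float>& smooth, // per-in_ch factors
                                     int out_ch, int in_ch) {
    QuantizedWeight result;
    result.q.resize((size_t)out_ch * in_ch);
    result.scale.resize(out_ch);

    // SmoothQuant step: scale each input channel of W up by the factor the
    // matching activation channel is scaled down by, keeping X*W unchanged.
    for (int o = 0; o < out_ch; ++o)
        for (int i = 0; i < in_ch; ++i)
            w[(size_t)o * in_ch + i] *= smooth[i];

    // Symmetric per-output-channel quantization: scale = max|w| / 127.
    for (int o = 0; o < out_ch; ++o) {
        float amax = 0.0f;
        for (int i = 0; i < in_ch; ++i)
            amax = std::fmax(amax, std::fabs(w[(size_t)o * in_ch + i]));
        float s = amax > 0.0f ? amax / 127.0f : 1.0f;
        result.scale[o] = (_Float16)s;   // FP16 scale stored alongside weights
        for (int i = 0; i < in_ch; ++i) {
            long v = std::lround(w[(size_t)o * in_ch + i] / s);
            if (v >  127) v =  127;
            if (v < -127) v = -127;
            result.q[(size_t)o * in_ch + i] = (int8_t)v;
        }
    }
    return result;
}
```

At inference time, a MatMul kernel multiplies the INT8 weights and rescales the accumulator by the per-channel FP16 scale, which is why the scales are persisted next to the quantized weights.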
Further Optimizations
HEYINTELLIGENCE has accumulated extensive experience applying AI models and hardware platforms to real-world scenarios. GCT includes many sub-techniques, such as LLM-serving techniques, quantization/de-quantization kernel fusion, and pipeline optimizations, and further optimization can be done based on customer requirements. The core idea is to let different optimization techniques work together to obtain the maximum performance improvement with the minimum accuracy loss under the constraints of real-world data and latency budgets.
Conclusion
The optimized implementations described above further enrich the AMD AI developer community and help highly efficient AMD AI accelerators process complex AI workloads such as LLMs, making it possible to provide data center users with a complete set of inference solutions that meet high-throughput, low-latency requirements. AMD is empowering more ecosystem partners and AI developers by building open software platforms, such as ROCm, ZenDNN™, Vitis™ AI, and Ryzen™ AI software, for innovation on GPUs, CPUs and adaptive SoCs.