The article also introduces NVIDIA's AI/ML GPU networking and the benefits of InfiniBand and Ethernet technologies. It discusses the features and components of NVIDIA Quantum InfiniBand, such as the Subnet Manager, switches, network adapters and NVIDIA Unified Fabric Manager (UFM). It also explains the collective computational power and in-network computing capabilities of InfiniBand. Additionally, it discusses the features of NVIDIA Spectrum™-X Ethernet, including RDMA, congestion control, performance isolation and security. The article concludes by emphasizing the importance of considering factors such as job completion times, cost and multi-tenancy needs when choosing between InfiniBand and Ethernet for AI/ML back-end architecture.

Deploying AI applications like generative models or training foundation models such as ChatGPT, BERT or DALL-E requires significant computational resources, particularly for larger, more complex models. As data volume and model complexity grow, distributed computing becomes crucial: it speeds up training by spreading tasks across many compute nodes, but the slowest node dictates the runtime of a distributed task. Network efficiency is therefore vital for delivering messages to all nodes, making tail latency (a measure of when the last message arrives) critical in large-scale deployments with competing workloads. Network scalability is equally important for managing more nodes and the vast data required for extensive AI model training.
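
To see why the last arrival matters more than the average, consider the quick sketch below (our own illustration, with made-up latency numbers, not measurements): a synchronized training step finishes only when the slowest node reports in, so as node counts grow, rare stragglers come to dominate the step time.

```python
import random

# Illustrative sketch (hypothetical numbers): a synchronized step completes only
# when the last message arrives, so rare stragglers dominate at large scale.
def step_time_us(num_nodes, rng):
    # Assume most nodes finish in ~10-15 us, with a 1% chance of a ~100 us straggler.
    samples = [10 + rng.expovariate(0.5) + (90 if rng.random() < 0.01 else 0)
               for _ in range(num_nodes)]
    return max(samples)  # the barrier waits for the last (tail) arrival

rng = random.Random(0)
for nodes in (8, 256, 8192):
    mean = sum(step_time_us(nodes, rng) for _ in range(200)) / 200
    print(f"{nodes:>5} nodes: mean synchronized step time ~ {mean:.0f} us")
```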

We will examine how the world's biggest networking companies approach their validated AI/ML designs.

As GPUs get more powerful and AI/ML training becomes increasingly critical to the business, GPU intra-node (single-node) PCIe speeds will ramp up quickly. Today, a top-of-the-line NVIDIA GPU can easily burst to 400 Gbps on the NIC during synchronization. Once PCIe 6 and 7 become mainstream, GPUs will push 800 Gbps, so GPU node networks will differ significantly from the traditional Ethernet networks most network engineers know. Look for 800 Gbps and 1.6 Tbps fabrics in the next two years, along with other innovations to provide the lowest latency to reduce Job Completion Times (JCT).
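
For context, a rough bandwidth calculation (our own arithmetic, using published per-lane PCIe signaling rates and ignoring encoding and protocol overhead) shows why PCIe 6.0 is the point at which an 800 Gbps NIC stops being bottlenecked by the host interface.

```python
# Rough arithmetic: x16 PCIe bandwidth per generation vs. NIC line rate.
# Per-lane figures are the published signaling rates; overheads are ignored here.
LANE_GTPS = {"PCIe 4.0": 16, "PCIe 5.0": 32, "PCIe 6.0": 64, "PCIe 7.0": 128}

for gen, gtps in LANE_GTPS.items():
    x16_gbps = gtps * 16          # 16 lanes, roughly 1 bit per transfer pre-overhead
    print(f"{gen}: ~{x16_gbps} Gb/s per x16 slot")

# A 400 Gbps NIC needs PCIe 5.0 x16 (~512 Gb/s); 800 Gbps needs PCIe 6.0 x16 (~1 Tb/s).
```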

Please note that we have purposely kept OEM model numbers out of the discussion; given the breakneck pace of OEM solutions for these designs, any specific references would be obsolete within six months. All OEM designs use spine/leaf RoCE or InfiniBand-based fabrics and are calculated to be non-blocking. Please get in touch with WWT sales when you are ready to create your validated OEM solution so we can guide you through the latest OEM solution we have validated in our new state-of-the-art WWT AI Proving Ground Labs!

Introduction to NVIDIA's AI/ML GPU networking

When designing networks, it's vital to sidestep common misconceptions. A widespread but incorrect belief is that it's acceptable to mix link speeds end to end in AI implementations; in practice, this often results in increased latency and diminished performance. Other areas of AI network design where misconceptions commonly arise include:

  • Ongoing AI evolution
  • Importance of switch radix
  • Choosing between shallow or deep buffer architectures
  • Strategies for ensuring network robustness

The foundational network infrastructure determines the data center's operational tier and the anticipated performance and efficiency levels. Thus, debunking these misconceptions and adopting an integrated approach that weighs performance, security, and adaptability is crucial to meeting the objectives of the data center, be it an AI Cloud or an AI Factory.

Given that NVIDIA makes some of the world's fastest GPUs and GPU systems and created the NVIDIA® NVLink® and NVIDIA® NVSwitch™ interconnects, the NVIDIA Collective Communications Library (NCCL) is critical to lowering latency for GPU-to-GPU connectivity and enabling faster JCTs. In hindsight, the Mellanox purchase looks like a great acquisition for NVIDIA: InfiniBand offers lower latency, millions of ports will be needed, and additional methods for reducing GPU-to-GPU latency will be required.
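
As a concrete (and deliberately simplified) illustration of where NCCL sits in the stack, the sketch below uses PyTorch's torch.distributed package with the NCCL backend to all-reduce a tensor across GPUs; the tensor size and the torchrun launch command are our own assumptions, not part of any NVIDIA reference design.

```python
# Minimal sketch: an NCCL-backed all-reduce with PyTorch's torch.distributed.
# Illustrative launch: torchrun --nproc_per_node=8 allreduce_demo.py
import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK in the environment.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # NCCL selects the best available transport (NVLink, PCIe, RoCE, or InfiniBand).
    dist.init_process_group(backend="nccl")

    # Each rank contributes a 1 GiB tensor; all-reduce sums it across all GPUs.
    tensor = torch.ones(256 * 1024 * 1024, device="cuda")  # 1 GiB of float32
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    torch.cuda.synchronize()

    if dist.get_rank() == 0:
        print(f"all-reduce complete across {dist.get_world_size()} ranks")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Whether those bytes move over NVLink inside a node or over InfiniBand or RoCE between nodes is decided by NCCL at run time, which is exactly why the fabric underneath matters so much for JCT.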

It is important to note that NVIDIA offers both InfiniBand and Ethernet solutions for AI fabrics. NVIDIA has tightly coupled these two offerings with its network cards, allowing customers to use either NVIDIA Quantum InfiniBand or NVIDIA Spectrum-X Ethernet.

NVIDIA Spectrum-X Ethernet: Designed for the Era of Generative AI

Without question, the NVIDIA Quantum InfiniBand platform has enabled many of the large-scale supercomputing deployments for complex distributed scientific computing. As a lossless network with ultra-low latencies, native RDMA architecture, and in-network computing capabilities, InfiniBand is revered as the gold standard in performance and has been pivotal in accelerating today's mainstream development and deployment of AI. InfiniBand networking sets the standard for lossless data transmission, ensuring complete and accurate packet delivery.

Spectrum-X and Spectrum-4 features

The NVIDIA Spectrum™-4 Ethernet switch, an application-specific integrated circuit (ASIC), sets a new standard for AI and cloud-native computing tasks, extending from the core to the cloud and edge. The latest generation of NVIDIA's Ethernet switch ASICs, Spectrum-4 powers the NVIDIA Spectrum-X platform, a pioneering accelerated Ethernet solution tailored for AI infrastructures. It offers unparalleled performance for applications that demand high bandwidth, such as generative AI, large language models, recommendation systems, video analytics, and more. Spectrum-4's capabilities are further enhanced by advanced remote direct memory access (RDMA) over Converged Ethernet (RoCE), featuring adaptive routing, performance isolation, and automatic path configuration, making it the go-to choice for Ethernet-based AI workloads. Moreover, Spectrum-4 equips clouds with a cutting-edge feature set, including novel security enhancements like MACsec over VXLAN and unmatched nanosecond-level timing accuracy from the switch to the host.

Spectrum-X seamlessly integrates the Spectrum-4 switch and NVIDIA BlueField®-3 SuperNICs, enhancing hyperscale generative AI infrastructures through comprehensive innovations. At the heart of these advancements are the novel RoCE dynamic datapath extensions, which include Adaptive Routing, Congestion Control, Performance Isolation, Port-to-Port (RP2P) Auto-path Configuration, and Synchronized Collectives, all designed to streamline and fortify data transmission processes.

Spectrum-4 design 

Spectrum-4, built on a cutting-edge 4-nanometer process, incorporates NVIDIA's most advanced and reliable SerDes to date.

With a shared, unified packet buffer, Spectrum-4 ensures all ports have dynamic access, enhancing microburst absorption and delivering genuine port-to-port cut-through latency. Moreover, Spectrum-4's pipeline and packet modifier/parser are fully programmable, maintaining high performance without affecting latency or packet throughput. The integrated packet buffer facilitates high-speed packet processing, ensuring steady and predictable performance, while its singular architecture eases buffer management and traffic coordination, promoting equitable resource distribution.

NVIDIA Spectrum-X recommended architecture

Ethernet, traditionally designed to be oversubscribed and lossy, can now meet the demands of GPU computing and AI in cloud environments through RDMA over Converged Ethernet (RoCE), Explicit Congestion Notification (ECN), and Priority Flow Control (PFC), alongside lossless-network solutions like Spectrum-X. A typical RoCE-based fabric looks similar to the diagram below. Expect a spine/leaf topology (rail-optimized where possible) with 400 Gbps BlueField-3 SuperNICs; the 400 or 800 Gbps uplink bandwidth and the spine count must work together to deliver a non-blocking fabric that is undersubscribed to leave room for growth.

Figure: Optimized 400 Gb/s end-to-end AI Cloud Ethernet topology.
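
To make the non-blocking requirement concrete, here is a back-of-the-envelope sketch (our own arithmetic with illustrative port counts, not an NVIDIA sizing tool) that checks a leaf's oversubscription ratio given its downlink and uplink counts and speeds.

```python
# Back-of-the-envelope sketch: oversubscription of a two-tier spine/leaf fabric.
# All port counts and speeds below are illustrative assumptions.

def oversubscription(downlinks, downlink_gbps, uplinks, uplink_gbps):
    """Ratio of host-facing bandwidth to fabric-facing bandwidth per leaf.
    <= 1.0 means non-blocking (or undersubscribed, leaving headroom for growth)."""
    down_bw = downlinks * downlink_gbps
    up_bw = uplinks * uplink_gbps
    return down_bw / up_bw

# Example leaf: 16 x 400G ports to BlueField-3 SuperNICs, 8 x 800G uplinks to spines.
ratio = oversubscription(downlinks=16, downlink_gbps=400, uplinks=8, uplink_gbps=800)
print(f"oversubscription ratio: {ratio:.2f}")  # 1.00 -> non-blocking

# Adding spine uplinks (e.g., 10 x 800G) undersubscribes the leaf for growth.
print(f"with 10 uplinks: {oversubscription(16, 400, 10, 800):.2f}")  # 0.80
```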

RDMA revolutionizes data transfer with its high-speed, low-latency capabilities. It allows direct memory-to-memory data movement across systems, GPUs, and storage, bypassing CPU involvement. RDMA has always been part of InfiniBand, and OEMs have used the InfiniBand verbs semantics to create RoCE (RDMA over Converged Ethernet). This contrasts with traditional networking's multi-step process, which incurs additional latency and inefficiency.


Data center apps often create numerous small data flows, allowing network traffic to be statistically averaged, so simple, static hash-based methods like ECMP are typically enough to route flows without creating hot spots. AI tasks on GPUs, however, produce fewer but larger "elephant flows" that consume substantial bandwidth (a single GPU can now push 400 Gbps out of an Ethernet or InfiniBand card) and can cause congestion and latency if routed poorly. Adaptive routing algorithms are therefore essential for dynamic load balancing and for avoiding collisions with other jobs or tenants. Packet spraying in a scheduled fabric can lead to out-of-order packets, necessitating a flexible reordering mechanism that keeps adaptive routing transparent to the application. Spectrum-X combines Spectrum-4's load balancing with the BlueField-3 SuperNIC's Direct Data Placement (DDP) for seamless end-to-end adaptive routing: flows are broken into individual packets that are dynamically routed to the least-used links, and the RDMA-capable SuperNIC reassembles the resulting out-of-order packets so the application never sees them.
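
The toy simulation below (our own illustration, not Spectrum-X code) shows why static ECMP hashing struggles with a handful of elephant flows: hashing four large flows onto four uplinks can pile several of them onto the same link while others sit idle, whereas per-packet adaptive spraying balances the load almost perfectly at the cost of out-of-order arrival that the receiving NIC must repair.

```python
# Toy simulation: static ECMP hashing vs. per-packet spraying for elephant flows.
# Flow tuples and link counts are illustrative assumptions.
from collections import Counter

UPLINKS = 4
flows = [("10.0.0.1", "10.0.1.1", 4791),  # a handful of 400 Gbps elephant flows
         ("10.0.0.2", "10.0.1.2", 4791),
         ("10.0.0.3", "10.0.1.3", 4791),
         ("10.0.0.4", "10.0.1.4", 4791)]

# Static ECMP: every packet of a flow follows the same hash -> collisions are possible.
ecmp_load = Counter(hash(flow) % UPLINKS for flow in flows)
print("ECMP elephant flows per uplink:", [ecmp_load.get(i, 0) for i in range(UPLINKS)])

# Per-packet adaptive spraying: each packet goes to the least-loaded uplink, so the
# load balances almost perfectly, but packets of one flow arrive out of order and
# must be reordered by the receiving NIC (e.g., via Direct Data Placement).
spray_load = [0] * UPLINKS
for _ in range(10_000):                      # 10k packets across all flows
    spray_load[spray_load.index(min(spray_load))] += 1
print("sprayed packets per uplink:", spray_load)
```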

Congestion control

In AI cloud scenarios with concurrent AI tasks, network congestion is expected. This occurs when multiple senders transmit data to one or various destinations, leading to latency and reduced bandwidth. Such congestion can also impact nearby tenants.

Traditional congestion control mechanisms like ECN fall short of what generative AI needs over Ethernet. Effective congestion relief requires regulating the transmitting device, but ECN only signals senders to meter data when switch buffers are near total capacity, which may be too late in high-traffic conditions and can cause packet loss. Effective congestion control requires collaboration between switches and NICs/DPUs: Spectrum-X uses in-band telemetry from Spectrum-4 switches to direct the BlueField-3 SuperNIC to meter flows. Other approaches rely on deep-buffer switches, which are unsuitable for AI, or on complex proprietary protocols.
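
Conceptually, telemetry-driven congestion control looks like the sketch below (an AIMD-style toy with made-up thresholds and gains; the real algorithm lives in Spectrum-4 and BlueField-3 and is not public in this form): the sender backs off sharply when in-band telemetry reports queue build-up and recovers gently when the path is clear.

```python
# Conceptual sketch of telemetry-driven congestion control (AIMD-style).
# Thresholds, gains, and the telemetry values are illustrative assumptions; the
# actual Spectrum-X algorithm runs between the switch and the BlueField-3 SuperNIC.

LINE_RATE_GBPS = 400.0
QUEUE_THRESHOLD = 0.2          # back off when telemetry shows >20% buffer occupancy
ADDITIVE_STEP_GBPS = 10.0      # gentle recovery when the path is clear
MULTIPLICATIVE_BACKOFF = 0.7   # sharp reduction on congestion

def next_rate(current_gbps: float, queue_occupancy: float) -> float:
    """Return the sender's next transmit rate given in-band telemetry."""
    if queue_occupancy > QUEUE_THRESHOLD:
        return max(1.0, current_gbps * MULTIPLICATIVE_BACKOFF)
    return min(LINE_RATE_GBPS, current_gbps + ADDITIVE_STEP_GBPS)

# Example: a burst of congestion, then a clear path.
rate = LINE_RATE_GBPS
for occupancy in [0.05, 0.35, 0.6, 0.4, 0.1, 0.05, 0.02]:
    rate = next_rate(rate, occupancy)
    print(f"queue {occupancy:4.0%} -> transmit at {rate:6.1f} Gbps")
```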

Performance isolation and security

In AI Cloud environments, safeguarding against interference from concurrent processes on shared systems is crucial. Traditional Ethernet switch architectures often lack intrinsic job protection at the ASIC level, leading to potential bandwidth starvation for specific tasks due to "noisy neighbors," that is, other processes transmitting data to the same port.

AI Clouds must accommodate diverse applications on identical infrastructure while ensuring equitable network distribution. Varied application data frame sizes necessitate isolation enhancements; otherwise, when large and small frames are transmitted to the same port simultaneously, the larger frames may monopolize bandwidth.

Implementing shared packet buffers is essential for achieving performance isolation, which helps mitigate the effects of noisy neighbors and ensures network fairness. A universal shared buffer granting equal cache access to every port on the switch is imperative for the predictable, uniform, and swift response times necessary for AI Cloud operations.
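
The toy model below (illustrative only; the buffer and burst sizes are our own assumptions) contrasts statically partitioned per-port buffers with a fully shared buffer: the same microburst that overflows a small fixed per-port slice can be absorbed when the congested port can draw on the common pool.

```python
# Toy model: microburst absorption with static per-port buffers vs. a shared buffer.
# Buffer sizes and burst size are illustrative assumptions.

TOTAL_BUFFER_MB = 64
PORTS = 32
PER_PORT_MB = TOTAL_BUFFER_MB / PORTS   # 2 MB each if statically partitioned

def drops_static(burst_mb: float) -> float:
    """A burst aimed at one congested port can only use that port's slice."""
    return max(0.0, burst_mb - PER_PORT_MB)

def drops_shared(burst_mb: float, other_ports_in_use_mb: float) -> float:
    """With a shared buffer, the burst can use whatever the common pool has free."""
    free = TOTAL_BUFFER_MB - other_ports_in_use_mb
    return max(0.0, burst_mb - free)

burst = 10.0   # a 10 MB microburst aimed at one egress port
print(f"static partitioning drops {drops_static(burst):.1f} MB")
print(f"shared buffer drops {drops_shared(burst, other_ports_in_use_mb=20.0):.1f} MB")
```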

Conclusion

As the battle for AI back-end network supremacy intensifies, InfiniBand and Ethernet are at the forefront, vying for market leadership, and high-performance network infrastructure is pivotal in supporting AI-generated content (AIGC) operations. To address AI cluster computing needs, the industry champions two primary network solutions, InfiniBand and RoCE, and we have explored and contrasted these technologies. NVIDIA made wise moves in purchasing Mellanox for its InfiniBand technology and in committing to develop RDMA-capable SuperNICs and InfiniBand adapters. The current InfiniBand and Spectrum-X lines of switches, cables, and DPUs show roadmaps out to 800 Gbps and 1.6 Tbps by 2026 and 3.2 Tbps by 2030, tracking the same bandwidth numbers we see from the Ethernet OEMs. Ethernet cannot yet match InfiniBand's ability to perform in-network computation, which reduces the all-to-all traffic inherent in GPU clusters and makes InfiniBand attractive for organizations that want the fastest job completion times. As speeds increase and new protocols are developed, we expect a significant share of the market (experts predict 20-30 percent) to move to RoCE instead of InfiniBand. Going forward, AI/ML back-end architects must look at the whole picture: the cost of longer job completion times, the cost per GPU port, the power and cooling cost of training runs, and multi-tenancy needs.
