The Battle of AI Networking: Ethernet vs InfiniBand
Introduction
Comparing Ethernet and InfiniBand is like a ring announcer introducing a prize fight, with the prize being market share in the $20 billion AI networking market.
That said, the contest is less pugilism and more fine detail, although it could be argued that "sweet science" applies to both.
InfiniBand was created to address Ethernet's shortcomings (lossy, stochastic and slow). Over time, however, the overall performance/reliability gap has substantially narrowed; with some tweaks, Ethernet can push data with the same bandwidth, latency and reliability as InfiniBand. While the ultra-high performance domain (perhaps top 3 to 5 percent of the total market?) still belongs to InfiniBand, the vast majority of current InfiniBand deployments can actually be handled by Ethernet.
Regardless of the changes in performance profiles, directly comparing Ethernet and InfiniBand is challenging. It's not even apples-to-oranges; it's comparing apples to wheelbarrows. In some ways, they're identical; in others, radically different. The stakes for the primary use case (both generative and inferential AI) are high from both an economic and strategic perspective, so it's important to get it right.
As mentioned in a previous article, The Basics of High-Performance Networking, the value of a network is derived not from the transport itself but from how it connects computing and storage together. When it comes to high performance, it boils down to a single question: How do you transport your RDMA? The performance of a system leveraging RDMA is a function of the type of storage and compute, enhancements to each, and how they're configured.
In recent proofs of concept (POCs) hosted in WWT's labs, engineering a true apples-to-apples Ethernet/InfiniBand comparison has meant duplicating a complex InfiniBand infrastructure on Ethernet, hop-by-hop, optic-by-optic, nerd-knob by nerd-knob. The environment was so customized that the results were largely only relevant to that exact build and its configuration. So, while we could absolutely say that Ethernet/RoCE was faster than InfiniBand, it only held true for those specific environments and the circumstances we tested.
Ethernet vs InfiniBand
Comparing the two "by the numbers," with attention to their differentiating factors:
| | ETHERNET | INFINIBAND |
| --- | --- | --- |
| Max Bandwidth | 800 Gbps | 800 Gbps |
| MTU | 9216 bytes (note: RDMA is optimized for 4096 bytes, so larger frames will not necessarily result in enhanced performance) | 4096 bytes |
| Layer 3 Support | Yes | No |
| Delivery | Best effort, enhanced to lossless | Lossless |
| Load Balancing | Hash values | Deterministic (NCCL) |
| RDMA Support | RoCEv2 | Native |
| Enhancements | | |
| Pros | | |
| Cons | | |
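The MTU row has a practical consequence: RoCE fabrics are often run with a link MTU slightly above 4096 bytes so that a full 4096-byte RDMA payload plus its headers fits in a single frame. A hedged sketch of the host-side tuning is below; the interface name and the 4200-byte value are illustrative assumptions, not the configuration used in the tests described here.

```shell
# Illustrative only: size the RoCE-facing link MTU for a 4096-byte RDMA path MTU.
# "eth0" is a placeholder interface name; persist the change via your distro's
# network configuration rather than running it ad hoc.
ip link set dev eth0 mtu 4200   # 4096-byte payload + RoCE/IP/UDP header overhead
```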
The question remains: How can you test the two in a way that applies more broadly?
The test
WWT, in collaboration with Arista and Cisco, recently conducted a series of independent tests designed to eliminate all variables except for the network transport. The raw metrics in these tests were expected to be worse than other publicly available numbers, precisely because many performance-optimizing features were disabled so as to position the network transport as the central component.
The test compared RoCEv2's performance profile and its enabling features (PFC, ECN) against InfiniBand's natively scheduled fabric.
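For readers unfamiliar with how these features are expressed on a switch, the sketch below shows the general shape of a "basic PFC plus ECN" setup in EOS-style syntax. Treat it as illustrative pseudoconfig only: the priority value, queue number and ECN thresholds are placeholder assumptions, not the configurations used in these tests.

```
interface Ethernet1/1
   ! Pause only the lossless (RoCE) traffic class instead of the whole link
   priority-flow-control mode on
   priority-flow-control priority 3 no-drop
   ! Mark congestion early with ECN so senders back off before PFC triggers
   tx-queue 3
      random-detect ecn minimum-threshold 160 kbytes maximum-threshold 1600 kbytes
```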
Equipment
| HARDWARE | MANUFACTURER | FUNCTION |
| --- | --- | --- |
| DGX | NVIDIA | Compute |
| H100 GPU | NVIDIA | Accelerator |
| Quantum 9700 NDR | NVIDIA | Network (InfiniBand) |
| 7060DX5-64S | Arista | Network (Ethernet) |
| Nexus 9332D-GX2B | Cisco | Network (Ethernet) |
Setup
Phase 1
For Phase 1, a single-switch network was deployed, representing the ideal minimum-variable scenario.
Methodology
Testing made use of industry-standard MLCommons benchmarks, specifically the MLPerf Training and MLPerf Inference Datacenter problem sets. These enabled an apples-to-apples analysis of how network transport affects generative and inference AI performance.
- Each selected benchmark test was run for each network solution and OEM:
  - NVIDIA InfiniBand
  - Arista Ethernet
  - Cisco Ethernet
- Ethernet was minimally optimized, with only basic PFC and ECN switch configurations used in accordance with industry best practices.
- Performance-enhancing features on the DGX (notably NVLink) were disabled.
- The intent was to force all GPU-GPU traffic out of the DGX and onto the network. Did this optimize performance? No. However, it allowed us to observe exclusively how the network contributed to the performance and then directly compare the differences.
- NCCL was modified between InfiniBand and Ethernet tests to whitelist DGX NICs (a requirement for Ethernet functionality).
- The same physical optical cables were used for all Ethernet tests.
- The same physical third-party optics were leveraged across systems.
- Storage was local to the DGX.
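The NCCL adjustments above (pushing GPU-to-GPU traffic off NVLink and whitelisting the DGX NICs) are typically expressed as environment variables read at job launch. A hedged sketch follows; the HCA names, GID index and interface name are illustrative placeholders, not the values used in these tests.

```shell
# Force NCCL off NVLink/P2P so GPU-GPU traffic must traverse the network
export NCCL_P2P_DISABLE=1
# Whitelist the RoCE-capable HCAs NCCL may use (device names are placeholders)
export NCCL_IB_HCA="mlx5_0,mlx5_1,mlx5_2,mlx5_3"
# Select the RoCEv2 GID entry (index 3 is common on ConnectX NICs; verify with show_gids)
export NCCL_IB_GID_INDEX=3
# Interface for NCCL's out-of-band bootstrap traffic (placeholder name)
export NCCL_SOCKET_IFNAME=eno1
```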
In short, we removed every variable not related to network transport.
Results
| BENCHMARK | MODEL | ETHERNET | INFINIBAND | ETH/IB RATIO |
| --- | --- | --- | --- | --- |
| MLPerf Training | BERT-Large | 10,886 s | 10,951 s | 0.9977 |
| MLPerf Inference | LLAMA2-70B-99.9 | 52.362 s | 52.003 s | 1.0166 |
Performance ratios were expressed in terms of Ethernet/InfiniBand (i.e., a longer Ethernet completion time will be reflected as a ratio greater than 1).
Observations
- Across generative tests and OEMs, the performance delta between InfiniBand and Ethernet was statistically insignificant (a ratio difference of less than 0.01).
- Ethernet beat InfiniBand's best time in 3 of 9 generative tests, although only by a few seconds.
- In inference tests, Ethernet averaged a ratio of 1.0166, i.e., roughly 1.7 percent slower.
Conclusions
In the evaluations discussed above, InfiniBand and unoptimized Ethernet are statistically neck-and-neck.
Performance differentials are expected to emerge in larger networks, but in other laboratory environments the gap has generally been observed to be below 5 percent.
The introduction of current and pending optimization features (e.g., Ultra Ethernet) should substantially improve Ethernet performance.
In larger, more complex multivariate tests that were not part of this particular evaluation (e.g., the "bespoke" client POCs run in WWT's Advanced Technology Center), Ethernet has been observed to sometimes outperform InfiniBand by a sizable margin, especially when packet size variance and multiple AI models shared the same fabric.
In published case studies of large-cluster Ethernet performance (e.g., Meta's LLAMA2 training on a 2000-GPU Ethernet cluster and LLAMA3 training on a 24,000-GPU Ethernet cluster), Ethernet and InfiniBand performance was at parity.
Connecting the dots between small-scale tests, complex multivariate POCs, BasePod production environments and industry case studies, we arrive at an actionable insight: for the vast majority of AI deployments, Ethernet is a viable transport alternative to InfiniBand.
How WWT can help
We have answered one fundamental question about high-performance networking, but that doesn't mean we've answered them all. In Phase 2, a complex ecosystem of GPUs, DPUs, storage and specific use cases still needs to be tested. As such, future tests will be run over a more conventional non-blocking spine/leaf architecture with 4 DGXs (32 H100 GPUs). In iterative tests, Ethernet-enhancing features (including Ultra Ethernet modifications, ECMP entropy improvements, flowlets, packet spraying, and network and NIC reordering) will be examined.
WWT has more than a decade of experience designing and implementing big data and AI/ML solutions for clients across industries. In late 2023, WWT announced a three-year, $500 million investment in the creation of a unique AI Proving Ground (AIPG). The AIPG provides an ecosystem of cutting-edge AI hardware, software and architecture where clients can answer pressing questions related to AI infrastructure and solution design, development, validation, implementation and optimization.