Introduction

Comparing Ethernet vs InfiniBand is like a ring announcer introducing a prize fight, with the prize being market share in the $20 billion AI networking market.

That said, it's less pugilism and more fine details, although it could be argued that "sweet science" applies to both.

InfiniBand was created to address Ethernet's shortcomings (lossy, stochastic and slow). Over time, however, the overall performance/reliability gap has substantially narrowed; with some tweaks, Ethernet can push data with the same bandwidth, latency and reliability as InfiniBand. While the ultra-high performance domain (perhaps top 3 to 5 percent of the total market?) still belongs to InfiniBand, the vast majority of current InfiniBand deployments can actually be handled by Ethernet.

Regardless of the changes in performance profiles, directly comparing Ethernet and InfiniBand is challenging. It's not even apples-to-oranges; it's comparing apples to wheelbarrows. In some ways, they're identical; in others, radically different. The stakes for the primary use case (both generative and inferential AI) are high from both an economic and strategic perspective, so it's important to get it right.

As mentioned in a previous article, The Basics of High-Performance Networking, the value of a network is derived not from the transport itself but from how it connects computing and storage together. When it comes to high performance, it boils down to a single question: How do you transport your RDMA? The performance of a system leveraging RDMA is a function of the type of storage and compute, enhancements to each, and how they're configured. 

In recent proofs of concept (POCs) hosted in WWT's labs, engineering a true apples-to-apples Ethernet/InfiniBand comparison has meant duplicating a complex InfiniBand infrastructure on Ethernet, hop-by-hop, optic-by-optic, nerd-knob by nerd-knob. The environment was so customized that the results were largely only relevant to that exact build and its configuration. So, while we could absolutely say that Ethernet/RoCE was faster than InfiniBand, it only held true for those specific environments and the circumstances we tested.

Ethernet vs InfiniBand

Comparing the two "by the numbers," with attention to their differentiating factors:

Max Bandwidth
  • Ethernet: 800 Gbps
  • InfiniBand: 800 Gbps

MTU
  • Ethernet: 9216 bytes (note: RDMA is optimized for 4096 bytes, so larger frames will not necessarily result in enhanced performance)
  • InfiniBand: 4096 bytes

Layer 3 Support
  • Ethernet: Yes
  • InfiniBand: No

Delivery
  • Ethernet: Best effort, enhanced to lossless
  • InfiniBand: Lossless

Load Balancing
  • Ethernet: Hash values
  • InfiniBand: Deterministic (NCCL)

RDMA Support
  • Ethernet: RoCEv2
  • InfiniBand: Native

Enhancements
  • Ethernet: Dynamic Load Balancing, Weighted ECMP, VOQ, Disaggregated Scheduled Fabric (DSF), Adaptive Routing, EtherLink, Performance Isolation
  • InfiniBand: DDP, Adaptive Routing, SHARP

Pros
  • Ethernet: Handles multi-workload fabrics (i.e., several different AIs with varying requirements); easily adapted skillset for existing network engineers
  • InfiniBand: Simple to install; self-optimizing

Cons
  • Ethernet: At present, requires a few QoS modifications to optimize performance
  • InfiniBand: Rare skillset; operationally difficult to support when something goes wrong
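
The load-balancing row is worth a brief illustration. Traditional Ethernet fabrics choose a path by hashing each flow's 5-tuple, which spreads many small flows well but can pile a handful of long-lived, high-bandwidth RDMA flows onto the same uplink. The toy Python sketch below makes the point; the flow count, uplink count and hash function are illustrative assumptions, not values from the tests described later.

```python
import random
from collections import Counter

# Toy model: a leaf switch with 8 uplinks, placing each flow on an uplink by
# hashing its 5-tuple (src IP, dst IP, protocol, src port, dst port).
# Note: Python salts string hashes per process, so the exact split varies
# run to run; a real switch uses a fixed hardware hash, but the effect is
# the same -- a flow's path is static for its whole lifetime.
UPLINKS = 8

def ecmp_uplink(five_tuple, uplinks=UPLINKS):
    """Static ECMP: every packet of a flow follows the same hashed path."""
    return hash(five_tuple) % uplinks

random.seed(7)

# A handful of long-lived, high-bandwidth "elephant" RDMA flows, as seen in
# AI training traffic.
flows = [
    ("10.0.0.%d" % random.randint(1, 32),   # source IP
     "10.0.1.%d" % random.randint(1, 32),   # destination IP
     17,                                    # protocol (UDP, carrying RoCEv2)
     random.randint(49152, 65535),          # source port
     4791)                                  # RoCEv2 destination port
    for _ in range(16)
]

placement = Counter(ecmp_uplink(flow) for flow in flows)
print("flows per uplink:", dict(placement))
print("busiest uplink carries %d of %d flows"
      % (max(placement.values()), len(flows)))
```

With only a few elephant flows, uneven placement like this is routine; it is exactly the problem the Ethernet enhancements listed above (dynamic load balancing, packet spraying, scheduled fabrics) are designed to mitigate, and one that InfiniBand's deterministic placement largely avoids.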

The question remains: How can you test the two in a way that applies more broadly?

The test

WWT, in collaboration with Arista and Cisco, recently conducted a series of independent tests designed to eliminate all variables except for the network transport. The raw metrics in these tests were expected to be worse than other publicly available numbers, precisely because many performance-optimizing features were disabled so as to position the network transport as the central component.

The test compared RoCEv2's performance profile and its enabling features (PFC, ECN) against InfiniBand's natively scheduled fabric.
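
A quick sketch of how those two features interact may help. ECN marks packets with increasing probability as an egress queue fills, prompting RoCEv2 senders to slow down before PFC has to pause the link outright. The Python snippet below is a toy model of WRED-style ECN marking only; the thresholds and marking curve are arbitrary illustrative values, not the switch configuration used in these tests.

```python
import random

# Toy WRED/ECN model: as the egress queue deepens, mark a growing fraction of
# packets with ECN Congestion Experienced (CE) rather than dropping them.
# Thresholds are illustrative placeholders, not a recommended configuration.
ECN_MIN_KB = 150     # queue depth where marking begins
ECN_MAX_KB = 3000    # queue depth where every packet is marked
MAX_MARK_PROB = 1.0  # marking probability at ECN_MAX_KB

def ecn_mark(queue_depth_kb: float) -> bool:
    """Return True if this packet should carry a CE mark."""
    if queue_depth_kb <= ECN_MIN_KB:
        return False
    if queue_depth_kb >= ECN_MAX_KB:
        return True
    fill = (queue_depth_kb - ECN_MIN_KB) / (ECN_MAX_KB - ECN_MIN_KB)
    return random.random() < fill * MAX_MARK_PROB

if __name__ == "__main__":
    random.seed(1)
    for depth in (100, 500, 1500, 2500, 3500):
        marked = sum(ecn_mark(depth) for _ in range(10_000))
        print(f"queue depth {depth:>4} KB -> {marked / 100:.1f}% of packets marked")
```

On the NIC side, a RoCEv2 receiver turns those CE marks into congestion notification packets and the sender backs off (e.g., via DCQCN), leaving PFC pause frames as the lossless backstop rather than the primary congestion control.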

Equipment

HARDWARE            MANUFACTURER    FUNCTION
DGX                 NVIDIA          Compute
H100 GPU            NVIDIA          Accelerator
Quantum 9700 NDR    NVIDIA          Network (InfiniBand)
7060DX5-64S         Arista          Network (Ethernet)
Nexus 9332D-GX2B    Cisco           Network (Ethernet)

Setup

Phase 1

For Phase 1, a single-switch network was deployed, representing the ideal minimum-variable scenario. 

[Figure: Phase 1 topology]

Methodology

Testing made use of industry-standard MLCommons benchmarks, specifically the MLPerf Training and MLPerf Inference Datacenter problem sets. These enabled an apples-to-apples analysis of how network transport affects generative and inference AI performance.

  • Each selected benchmark test was run for each network solution and OEM:
    • NVIDIA InfiniBand
    • Arista Ethernet
    • Cisco Ethernet
  • Ethernet was minimally optimized, with only basic PFC and ECN switch configurations used in accordance with industry best practices.
  • Performance-enhancing features on the DGX (notably NVLink) were disabled.
    • The intent was to force all GPU-GPU traffic out of the DGX and onto the network. Did this optimize performance? No. However, it allowed us to observe exclusively how the network contributed to the performance and then directly compare the differences.
  • NCCL was modified between InfiniBand and Ethernet tests to whitelist the DGX NICs, a requirement for Ethernet functionality (a configuration sketch follows below).
  • The same physical optical cables were used for all Ethernet tests.
  • The same physical third-party optics were used across systems.
  • Storage was local to the DGX.

In short, we removed every variable not related to network transport.
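
As a concrete example of the NCCL adjustment noted in the list above, steering NCCL onto specific NICs is normally done with environment variables rather than code changes. The Python sketch below shows the general pattern; the HCA and interface names are placeholders for whatever a given DGX exposes, not the exact values used in this testing.

```python
import os

# Illustrative NCCL settings for whitelisting the RDMA NICs a job may use.
# Device names are placeholders; substitute the HCAs/interfaces your system
# actually reports (e.g., via `ibv_devices` and `ip link`).

ethernet_rocev2 = {
    "NCCL_IB_HCA": "mlx5_0,mlx5_1,mlx5_2,mlx5_3",  # whitelist RDMA-capable NICs
    "NCCL_IB_GID_INDEX": "3",        # GID entry typically mapped to RoCEv2 (verify per system)
    "NCCL_SOCKET_IFNAME": "enp41s0f0",  # interface for NCCL bootstrap traffic (placeholder)
}

infiniband = {
    "NCCL_IB_HCA": "mlx5_0,mlx5_1,mlx5_2,mlx5_3",  # same HCAs, native InfiniBand transport
}

def apply(settings: dict) -> None:
    """Export the settings before the training job initializes NCCL."""
    for key, value in settings.items():
        os.environ[key] = value
        print(f"export {key}={value}")

if __name__ == "__main__":
    apply(ethernet_rocev2)  # or apply(infiniband) ahead of an InfiniBand run
```

Everything else about the job stays identical between runs, which is the point of the methodology: the transport is the only variable.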

Results

BENCHMARK           MODEL              ETHERNET     INFINIBAND    ETH/IB RATIO
MLPerf Training     BERT-Large         10,886 s     10,951 s      0.9977
MLPerf Inference    LLAMA2-70B-99.9    52.362 s     52.003 s      1.0166

Performance ratios were expressed in terms of Ethernet/InfiniBand (i.e., a longer Ethernet completion time will be reflected as a ratio greater than 1).

Observations

  • Across generative (training) tests and OEMs, the performance delta between InfiniBand and Ethernet was statistically insignificant (a ratio difference of less than 0.01, i.e., under 1 percent).
  • Ethernet was faster than InfiniBand's best time in 3 of 9 generative tests, although the margin was only a few seconds.
  • In inference tests, Ethernet averaged an ETH/IB ratio of 1.0166, i.e., roughly 1.7 percent slower.

Conclusions

In the evaluations discussed above, InfiniBand and unoptimized Ethernet are statistically neck-and-neck.

Performance differentials are expected to emerge in larger networks, but observations from other laboratory environments indicate the gap generally stays below 5 percent.

The introduction of current and pending optimization features (e.g., Ultra Ethernet) will substantially improve Ethernet performance.

In larger, more complex multivariate tests that were not part of this particular evaluation (e.g., the "bespoke" client POCs run in WWT's Advanced Technology Center), Ethernet has at times outperformed InfiniBand by a sizable margin, especially when packet sizes varied widely and multiple AI models shared the same fabric.

In published case studies of large-cluster Ethernet performance (e.g., Meta's LLAMA2 training on a 2000-GPU Ethernet cluster and LLAMA3 training on a 24,000-GPU Ethernet cluster), Ethernet and InfiniBand performance was at parity.

[Figure: Ethernet viability by cluster size, use case and LLM]

Connecting the dots between small-scale tests, complex multivariate POCs, BasePod production environments and industry case studies, we arrive at an actionable insight: outside the most extreme high-performance use cases, Ethernet is a viable transport for AI workloads.

How WWT can help

We have answered one fundamental question about high-performance networking, but that doesn't mean we've answered them all. In Phase 2, a complex ecosystem of GPUs, DPUs, storage and specific use cases still needs to be tested. As such, future tests will be run over a more conventional, non-blocking spine/leaf architecture with 4 DGXs (32 H100 GPUs). In iterative tests, Ethernet-enhancing features (including Ultra Ethernet modifications, ECMP entropy improvements, flowlets, packet spraying, and network- and NIC-based reordering) will be examined.
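
As a rough illustration of what "non-blocking" means for that Phase 2 fabric, each leaf needs as much uplink bandwidth toward the spines as it accepts from the DGXs below it. The arithmetic below uses assumed values (8 x 400 Gbps NICs per DGX, 400 Gbps fabric ports, two leaves); it is a sizing sketch, not the Phase 2 bill of materials.

```python
# Back-of-the-envelope sizing for a non-blocking leaf/spine fabric connecting
# 4 DGX systems (32 GPUs). All figures below are illustrative assumptions.

DGX_COUNT = 4
NICS_PER_DGX = 8          # assume one RDMA NIC per GPU
NIC_SPEED_GBPS = 400
PORT_SPEED_GBPS = 400     # leaf/spine fabric port speed
LEAVES = 2                # assume DGXs split evenly across two leaves

downlinks_per_leaf = DGX_COUNT * NICS_PER_DGX // LEAVES
downlink_bw_gbps = downlinks_per_leaf * NIC_SPEED_GBPS

# Non-blocking (1:1) design: uplink bandwidth must match downlink bandwidth.
uplinks_per_leaf = downlink_bw_gbps // PORT_SPEED_GBPS
total_spine_ports = uplinks_per_leaf * LEAVES

print(f"{downlinks_per_leaf} x {NIC_SPEED_GBPS}G downlinks per leaf -> "
      f"{uplinks_per_leaf} x {PORT_SPEED_GBPS}G uplinks per leaf (1:1)")
print(f"spine layer must terminate {total_spine_ports} x {PORT_SPEED_GBPS}G ports")
```

Anything less than that 1:1 ratio reintroduces oversubscription and potential congestion, which is exactly where the Ethernet enhancements discussed earlier come into play.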

[Figure: Phase 2 topology]

WWT has more than a decade of experience designing and implementing big data and AI/ML solutions for clients across industries. In late 2023, WWT announced a three-year, $500 million investment in the creation of a unique AI Proving Ground (AIPG). The AIPG provides an ecosystem of cutting-edge AI hardware, software and architecture where clients can answer pressing questions related to AI infrastructure and solution design, development, validation, implementation and optimization.

[Figure: AIPG logical network layout]

Keep learning

Check out our high-performance AI/ML networking Learning Path!

References

Data Center AI Networking, 650 Group (2024)
The Basics of High-Performance Networking, WWT (2024)
MLCommons (2024)
Meta AI, Meta (2024) 
AI/ML Data Center Networking Blueprint, Cisco (2024)
AI Networking, Arista (2024)
