Revving Up ECMP Routing for AI/ML Workloads
Today's high-performance networks support artificial intelligence/machine learning (AI/ML) and high-performance computing (HPC). These computing and storage workloads are typically delivered using Ethernet or InfiniBand. However, InfiniBand differs significantly from the traditional networks we use today, making troubleshooting and remediation challenging for operations teams with limited InfiniBand resources or operational knowledge. On the other hand, while Ethernet is more familiar, it was not initially designed for these types of workloads.
Mike Witte, a WWT principal architect, wrote a fantastic article detailing how these workloads communicate for GPU synchronization. It also covers how Remote Direct Memory Access (RDMA) is utilized to improve the performance of these workloads. Fortunately, RDMA can run on Ethernet and is called RDMA over Converged Ethernet v2 (RoCEv2). If you are unfamiliar with the details of GPU synchronization and RoCEv2, we recommend reading Mr. Witte's article, as we will not be covering those topics here.
In a series of blogs, we will address the challenges associated with using Ethernet for GPU networking, starting with the inefficiency of Ethernet packet routing.
ECMP history
We will review how traditional Equal-Cost Multi-Path (ECMP) works and the challenges it can bring to high-performance workloads. Then, we will compare it to newer functionality, such as flowlet switching and packet spraying, which are designed to overcome ECMP's limitations. To show the evolution of these features, we will provide example configurations from major networking vendors such as Cisco and NVIDIA, as well as merchant silicon ASIC providers.
ECMP routing is a network routing strategy that allows traffic distribution across multiple paths of equal cost. This technique is useful in networks where optimizing available bandwidth and enhancing redundancy is essential. By leveraging multiple paths, ECMP can effectively balance the load, preventing any single path from becoming a bottleneck. This improves the overall network performance and increases resilience to network route failures, as traffic can be rerouted through alternative paths if one becomes unavailable. If only it were that simple: AI/ML workloads are far from traditional network traffic.
In traditional ECMP, packets are distributed based on a hash of the packet headers, which ensures that packets belonging to the same flow follow the same path, maintaining the order of packets. The headers used for hashing are the source/destination IP addresses, the source/destination ports, and the protocol. Figure 1 shows an example of ECMP hashing across multiple uplinks between three flows. All three flows are destined for Host 4, but each takes a different path, showing high entropy across the uplinks.
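To make the hashing behavior concrete, here is a minimal Python sketch of five-tuple ECMP path selection, mirroring Figure 1's three flows to one host. The CRC32 hash and the flow tuples are illustrative stand-ins; real switches use vendor-specific hardware hash functions and seeds.

```python
# A minimal sketch of five-tuple ECMP hashing. The hash function and the
# flows are illustrative; real ASICs use hardware hash functions and seeds.
import zlib

UPLINKS = ["uplink-1", "uplink-2", "uplink-3", "uplink-4"]

def ecmp_pick(src_ip, dst_ip, src_port, dst_port, proto):
    """Hash the five-tuple and map it onto one of the available uplinks."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return UPLINKS[zlib.crc32(key) % len(UPLINKS)]

# Three flows to the same destination host. Each flow always hashes to the
# same uplink (preserving packet order); with enough variation in the
# tuples, the flows tend to spread across the uplinks.
flows = [
    ("10.0.1.1", "10.0.4.1", 40001, 443, "tcp"),
    ("10.0.2.1", "10.0.4.1", 40002, 443, "tcp"),
    ("10.0.3.1", "10.0.4.1", 40003, 443, "tcp"),
]
for flow in flows:
    print(flow, "->", ecmp_pick(*flow))
```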
However, this method can lead to uneven load distribution for AI/ML workloads. The ECMP hash function will not evenly distribute the flows due to low entropy in the IP addresses and ports being used. Low entropy refers to the lack of variability or randomness in the distribution of traffic flows. When traffic flows have low entropy, they tend to follow predictable patterns, leading to uneven load distribution across multiple paths. This can cause some paths to become congested while others remain underutilized, reducing the network's overall efficiency.
Most of this traffic will be RoCEv2, using UDP port 4791. Because the protocol type and RoCEv2 UDP port are the same, there is a much higher chance of poor distribution of flows across the uplinks, as the distribution relies only on the source/destination IP addresses and the source port. Figure 2 illustrates an example of low ECMP entropy, showing flows one and three hashed to the same uplink from the leaf to the spine switch. The situation worsens when all three flows share the same uplink from the spine to the leaf switch for Host 4. Although this is an extreme example, it highlights the potential problem.
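The following toy example, reusing the same illustrative CRC32 stand-in as above, shows why this hurts: when every flow is UDP to port 4791, only a few fields in the key vary, and flows can easily pile onto the same uplink. The addresses and flow counts are invented for illustration.

```python
# A toy illustration of low ECMP entropy with RoCEv2. Every flow is UDP to
# port 4791, so only the IP addresses and source port vary in the hash key.
import zlib
from collections import Counter

UPLINKS = ["uplink-1", "uplink-2", "uplink-3", "uplink-4"]

def ecmp_pick(src_ip, dst_ip, src_port, dst_port, proto):
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return UPLINKS[zlib.crc32(key) % len(UPLINKS)]

# Eight GPU nodes talking to one destination: identical protocol and
# destination port (RoCEv2 = UDP/4791) and near-identical addressing.
load = Counter(
    ecmp_pick(f"10.0.0.{i}", "10.0.0.100", 49152, 4791, "udp")
    for i in range(1, 9)
)
# The distribution is rarely uniform; some uplinks can end up carrying
# multiple elephant flows while others sit idle.
print(load)
```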
Another challenge is that these flows can send a large amount of data over a long period of time. These large continuous flows are often referred to as elephant flows. Newer techniques, such as flowlet switching and packet spraying, have been developed to address the limitations of traditional ECMP.
Flowlet switching
Flowlet switching, or flowlet load balancing, can be implemented in a couple of different ways. The first utilizes User-Defined Fields (UDF). With UDF, you tell the hashing algorithm to look deeper into the packet. A RoCEv2 packet encapsulates RDMA. Figure 3 is a visual of an Ethernet packet that contains RoCEv2.
The first 22 bytes are the typical Ethernet requirements: the preamble, the MAC addresses, and the EtherType. RoCEv2 starts to appear in the IP portion of the packet. The UDP destination port for RoCEv2 is 4791, followed by the InfiniBand header, payload, and iCRC. The RDMA header contains a parameter called the queue pair. The Destination Queue Pair is shown in the packet capture below (Figure 4); it is part of the encapsulated InfiniBand data.
These queue pairs establish communication between two RDMA endpoints for a particular piece of work, and they are unique to each connection. By looking deeper into the packet for these queue pairs, a switch can load balance with higher entropy and better load distribution because it has more information on which to base its decision.
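As a rough sketch of the idea, folding the queue pair into the hash key separates flows that are otherwise identical. The BTH byte values and the CRC32 hash below are illustrative, though the destination queue pair really is the 24-bit field at bytes 5 through 7 of the InfiniBand Base Transport Header (BTH).

```python
# A sketch of UDF-style hashing that reaches into the InfiniBand Base
# Transport Header (BTH) for the destination queue pair (DQP). The hash is
# a toy stand-in for the ASIC's hardware hash.
import zlib

UPLINKS = ["uplink-1", "uplink-2", "uplink-3", "uplink-4"]

def parse_dqp(bth: bytes) -> int:
    """Extract the 24-bit destination queue pair from BTH bytes 5-7."""
    return int.from_bytes(bth[5:8], "big")

def udf_ecmp_pick(src_ip, dst_ip, src_port, bth):
    # The DQP is unique per RDMA connection, so folding it into the key
    # restores the entropy that the fixed UDP port 4791 took away.
    dqp = parse_dqp(bth)
    key = f"{src_ip}|{dst_ip}|{src_port}|4791|udp|{dqp}".encode()
    return UPLINKS[zlib.crc32(key) % len(UPLINKS)]

# Two flows identical in their five-tuple but with different queue pairs
# (hypothetical BTH bytes) can now hash to different uplinks.
bth_a = bytes([0x04, 0x40, 0xff, 0xff, 0x00, 0x00, 0x01, 0x2a,
               0x80, 0x00, 0x00, 0x01])
bth_b = bytes([0x04, 0x40, 0xff, 0xff, 0x00, 0x00, 0x01, 0x2b,
               0x80, 0x00, 0x00, 0x01])
print(udf_ecmp_pick("10.0.0.1", "10.0.0.100", 49152, bth_a))
print(udf_ecmp_pick("10.0.0.1", "10.0.0.100", 49152, bth_b))
```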
The next approach is what many people think of as a flowlet: a larger flow chopped up into smaller flows. Figure 5 shows an example of flowlet switching. There is a large flow between Host 1 and Host 4. The leaf chops the flow into smaller flowlets and sends them up different uplinks.
The key to this method is the flowlet aging timer, which ages out a flow when the gap between packets reaches the timer's threshold. At that point, the switch picks another path for the next flowlet. While this does not guarantee it will be on a different physical transmit port, it is likely. Verifying that the flowlet aging timer is based on the round-trip time in your fabric is essential. Otherwise, you could encounter packet reordering issues if packets are placed on the fabric too soon after being moved to a different path. In Figure 5, you can see that flowlet 1a arrived before flowlet 1b.
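Here is a minimal sketch of the mechanism, assuming microsecond timestamps and a 500-microsecond aging threshold. The uplink choice is random for simplicity, whereas a real implementation would typically prefer the least utilized path.

```python
# A minimal sketch of flowlet detection. When the gap between consecutive
# packets of a flow exceeds the aging threshold, a new flowlet begins and
# the switch may place it on a different uplink.
import random

AGING_USEC = 500   # should exceed the latency skew between fabric paths
UPLINKS = ["uplink-1", "uplink-2", "uplink-3", "uplink-4"]

def assign_flowlets(packet_times_usec):
    """Yield (packet_time, flowlet_id, uplink) for a single flow."""
    flowlet_id, uplink, last_seen = 0, random.choice(UPLINKS), None
    for t in packet_times_usec:
        if last_seen is not None and t - last_seen > AGING_USEC:
            flowlet_id += 1                  # gap is big enough: new flowlet
            uplink = random.choice(UPLINKS)  # re-select a path (may repeat)
        last_seen = t
        yield t, flowlet_id, uplink

# A bursty flow: two bursts separated by a 1,000-microsecond idle gap.
times = [0, 10, 20, 30, 1030, 1040, 1050]
for t, fid, up in assign_flowlets(times):
    print(f"t={t:>5} us  flowlet={fid}  -> {up}")
```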
The key to both approaches is that they send smaller flows between the endpoints while maintaining packet sequencing, in contrast to packet spraying or per-packet load balancing. If packets arrive out of sequence and are not appropriately reordered, workload performance suffers.
Packet spraying
Another option for load balancing AI/ML traffic is packet spraying. Some people may remember Cisco Express Forwarding (CEF) per-packet load balancing from years ago. Packet spraying is a similar concept but handled by the switch's ASIC, and it is not broadly supported across OEMs or ASICs. The feature has one significant downside: packet reordering. With packet spraying, packets can arrive out of order, which significantly impacts AI/ML workload performance. Something needs to put these packets back in order before the GPU handles them.
One option is to have the re-ordering performed on the network switches. This is often referred to as a scheduled fabric or disaggregated scheduled fabric (DSF). We will not travel down that rabbit hole now; it will be addressed in a separate article dedicated to that topic.
The other option is reordering on the GPU's network interface card (NIC). Many new SmartNICs, SuperNICs, and DPUs support this function, but you must verify support on the NIC. Some examples of these next-generation NICs and DPUs are the NVIDIA ConnectX and BlueField-3 platforms.
Enabling packet spraying allows all the uplinks to be more equally utilized, as packets are balanced across uplinks in a round-robin fashion. This helps eliminate the challenge of elephant flows. Remember that you will still want to implement other Ethernet tools, such as Explicit Congestion Notification (ECN) and Priority Flow Control (PFC), to address Quality of Service (QoS) within the fabric.
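The sketch below illustrates both halves of the idea: a switch spraying packets round-robin across uplinks regardless of flow, and a NIC-style reorder buffer releasing packets to the GPU strictly in sequence. The sequence numbering and buffer logic are simplified illustrations, not any vendor's implementation.

```python
# A toy sketch of per-packet spraying with NIC-side reordering.
import itertools

UPLINKS = ["uplink-1", "uplink-2", "uplink-3", "uplink-4"]
rr = itertools.cycle(UPLINKS)

def spray(packets):
    """Assign each packet to the next uplink in round-robin order."""
    return [(pkt, next(rr)) for pkt in packets]

class ReorderBuffer:
    """Hold out-of-order arrivals; release packets in sequence order."""
    def __init__(self):
        self.expected, self.held = 0, set()
    def receive(self, seq):
        self.held.add(seq)
        released = []
        while self.expected in self.held:    # drain any in-order run
            released.append(self.expected)
            self.held.remove(self.expected)
            self.expected += 1
        return released

buf = ReorderBuffer()
for seq in [0, 2, 1, 4, 3]:                  # paths deliver out of order
    print(f"arrived {seq}, delivered {buf.receive(seq)}")
```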
How do we implement this?
In the following sections, we will explore how to implement the different ECMP enhancements for Cisco, NVIDIA and merchant silicon provider networking hardware and infrastructure. These are not comparisons of the technologies or switch vendors. We are referencing their publicly available configuration guides on implementing the technology. No single technology or OEM is considered better than the others in this section.
Merchant silicon providers
There are several merchant silicon switch ASIC providers, with Broadcom being one of the most well-known. The Tomahawk and Jericho chips from Broadcom are frequently referenced, each offering different methods to address entropy. This review will explore some of the features enabled on both platforms.
Multiple switching vendors using Broadcom chips offer a feature called Dynamic Load Balancing (DLB), though each vendor's implementation differs significantly. DLB allows the switch to move traffic between uplinks for better utilization, using either flowlet switching or per-packet load balancing (packet spraying). For flowlet switching, it is crucial to understand the latency of the fabric and to define the inactivity threshold accurately; setting the flowlet inactivity interval too low can cause packets to arrive out of order, significantly impacting performance. Below is an example of the DLB flowlet switching configuration (Figure 6).
The other option for DLB from some vendors is per-packet load balancing, or, as we often refer to it, packet spraying. As discussed earlier, this sprays each packet up a different uplink instead of sending a whole flow or flowlet along one path. There is one caveat to this approach: the NIC needs to support packet reordering. Please verify that the NIC supports it; otherwise, the results could be disastrous. Below is an example of DLB with per-packet load balancing (Figure 7).
Other vendors use the UDF hashing mentioned earlier in the article, looking deeper into packets to load balance based on queue pairs. Some switch manufacturers' UDF feature examines bytes 5 through 7 of the InfiniBand header in packets with a destination UDP port of 4791. There it finds the destination queue pair, giving the hash more information than a traditional five-tuple load balance.
Some Broadcom-based OEMs use cell spraying instead of packet spraying to overcome ECMP's low entropy on their Jericho-based platforms. In this method, packets are divided into smaller cells and transmitted across a scheduled, lossless backplane or fabric using Virtual Output Queues (VOQs). This ensures packets arrive in order without needing features such as ECN, PFC, and Data Center Quantized Congestion Notification (DCQCN). One platform uses cell spraying on the backplane when going from line card to line card, with no configuration needed, as it is enabled by default. Another platform uses a spine-leaf fabric that functions as a single device through a scheduled fabric mechanism between the spine and leaf switches. This topic will be addressed in a separate blog article dedicated to scheduled fabrics.
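Conceptually, cell spraying looks something like the sketch below: the ingress chops each packet into fixed-size cells tagged with an index, the cells traverse different fabric links, and the egress reassembles the packet before delivery. The cell size and tagging scheme are invented for illustration and do not reflect any vendor's wire format.

```python
# A simplified sketch of cell spraying and in-order reassembly.
CELL_SIZE = 64

def to_cells(packet_id: int, payload: bytes):
    """Chop a packet into (packet_id, cell_index, data) cells."""
    return [
        (packet_id, i, payload[off:off + CELL_SIZE])
        for i, off in enumerate(range(0, len(payload), CELL_SIZE))
    ]

def reassemble(cells):
    """Sort cells by index and stitch the payload back together."""
    return b"".join(data for _, _, data in sorted(cells, key=lambda c: c[1]))

payload = bytes(range(200))
cells = to_cells(packet_id=7, payload=payload)
# Cells may traverse different fabric links and arrive out of order...
arrived = list(reversed(cells))
# ...but the egress still delivers the original packet intact.
assert reassemble(arrived) == payload
print(f"{len(cells)} cells reassembled into {len(payload)} bytes")
```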
Cisco
Cisco also has options to help address the ECMP entropy challenges. Below is an example of how to configure basic ECMP in NX-OS (Figure 8).
The first ECMP option we will cover in NX-OS is UDF. With UDF, the switch looks deeper into the IP packet at the InfiniBand queue pairs encapsulated within the RoCEv2 packet. That helps provide higher entropy despite the packets all using port 4791 for RoCEv2. Cisco locates the queue pair at an offset of 33 bytes from the beginning of the IP packet, matching the next 24 bits. Below is an example of UDF load balancing (Figure 9).
Next, we will cover Dynamic Load Balancing (DLB). Only certain flavors of Nexus switches (FX3, GX, GX2 and HX) support DLB with NX-OS 10.5(1) and newer. We will review both methods of DLB available in the NX-OS Configuration Guide. Earlier, we talked about flowlet switching and per-packet load balancing. Cisco's DLB covers both capabilities.
First, we will start with flowlet switching. This capability within DLB allows a large flow of data to be broken up into smaller chunks called flowlets, as shown in Figure 5. Inactivity thresholds, discussed earlier, are crucial for this feature. The flowlet-aging timer should be based on the fabric's round-trip time; if the aging time is too short, packets may arrive out of order and severely impact performance. The default flowlet-aging timer is 500 microseconds. When the aging timer expires and a new flowlet is created, DLB assigns it to the least utilized uplink. Below is an example of the DLB flowlet switching configuration. Please note that changes to the DLB interface list require a switch reload. Many of the commands below have default values, which are shown to help you understand the commands used (Figure 10).
Cisco's per-packet load balancing (PLB) function in DLB has a configuration similar to flowlet switching. PLB can also be referred to as packet spraying. This feature allows all links to be equally utilized without worrying about elephant flows or flowlet aging. The key to enabling PLB is that the GPU's NIC must support packet reordering; NVIDIA has SuperNICs and DPUs that support it. Please verify all endpoint compatibility before enabling PLB. Here is an example of PLB (Figure 11).
NVIDIA
NVIDIA has introduced a new line of switches, the Spectrum-4 series, with switch models SN5600 and SN5400. These switches run the Cumulus Linux OS. NVIDIA also pairs its ConnectX-8 and BlueField-3 SuperNICs with the SN5600 switch to create an end-to-end Spectrum-X architecture.
We will start with general ECMP. NVIDIA supports ECMP with BGP using the multipaths command shown in Figure 12.
NVIDIA's answer to ECMP entropy is a feature called Adaptive Routing (AR). AR looks at queue occupancy and port utilization to determine path selection for RoCEv2 packets, and it performs this on a hop-by-hop basis. Adaptive routing dynamically distributes traffic between uplinks if the cumulative flow rate exceeds an uplink port's bandwidth. It is essential to understand that enabling or disabling the feature disruptively restarts the switch process. Below is an example of an adaptive routing configuration (Figure 13).
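Conceptually, the per-hop decision looks less like hashing and more like picking the least-loaded egress port, as in this sketch. The scoring weights and port values are invented for illustration and do not reflect NVIDIA's actual algorithm.

```python
# A conceptual sketch of an adaptive-routing decision at one hop: score
# each candidate egress port by queue occupancy and utilization, then
# forward on the least-loaded one instead of hashing.
def pick_egress(ports):
    """ports: {name: (queue_occupancy_pct, utilization_pct)}"""
    def score(name):
        occupancy, utilization = ports[name]
        return 0.5 * occupancy + 0.5 * utilization  # illustrative weights
    return min(ports, key=score)

ports = {
    "swp1": (80, 90),   # congested: deep queue, heavily utilized
    "swp2": (10, 30),   # lightly loaded
    "swp3": (40, 50),
}
print(pick_egress(ports))  # -> swp2
```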
Summary
There are multiple ways to address the ECMP entropy issue in AI/ML/HPC networks, as ECMP alone is insufficient. The primary option is flowlet switching, which helps mitigate the low uplink entropy caused by RoCEv2's use of a single UDP port; this can be achieved using flowlet timers or UDF to improve hashing. Another method involves packet or cell spraying. It is important to note that packet spraying requires NICs that support packet reordering.
I look forward to continuing the journey by exploring ECN, PFC, scheduled fabrics and Ultra Ethernet, each of which will receive a separate detailed article. These articles will be followed up with testing in WWT's AI Proving Ground. The testing will show how Ethernet performs with traditional ECMP and then apply those same tests to each of the features that help make Ethernet a viable GPU networking option.