Minivans and McLarens: Networks for AI (Part Deux)… Just for Fun!
by Micheline Murphy, for Cisco Blogs, VIP Perspectives
When historians look back on 2024, I'm sure they won't be thinking of it as the year Oprah gave AI out to everyone ("You get AI, and you get AI, and you get AI!"), even though that seems to be exactly the picture everyone's racing toward. As everyday engineers, we know that getting there takes far more work behind the scenes.
In the first installment of this series, The Elephant in the Room: Networks for AI… Just for Fun!, I covered how AI traffic behaves differently from non-AI enterprise or data center (DC) traffic. Because of that difference, many of the performance optimizations that grew up in non-AI networks don't work for AI workloads. In this installment, let's continue the conversation about purpose-built, high-performance networks for AI workloads.
Did You Do that On Purpose? YES, ma'am!
If you think about it, even our everyday networks are purpose-built. It's like the family minivan. It's not very sexy, it might be a bit cluttered, and don't mind the goldfish crackers in the seat. But the family minivan gets its job done.
Networks for AI workloads are not the family minivan.
As I discussed in the first installment, AI traffic is a very different beast from the non-AI enterprise traffic we've all gotten used to, so it makes sense that the network carrying it is built differently, too. Recall also that in many cases, we're talking about an entirely separate network.
We're talking about the backend now, people.
The backend network is the network that connects the GPUs to one another, and all the GPUs in the cluster need to function as if they were a single uber-compute node. That means each GPU must be able to reach every other GPU, communication must be lossless and fast, and the network must handle heavy, bursty, and inconsistent traffic.
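To put a rough number on "every GPU reaches every other GPU," here's a minimal Python sketch. It assumes a naive all-to-all exchange in which each GPU sends one message to each of its peers; the function names are hypothetical, and real collective libraries schedule traffic with smarter patterns (rings, trees) that spread the load differently.

```python
from itertools import permutations


def all_to_all_flows(num_gpus: int) -> list[tuple[int, int]]:
    """Return every directed (src, dst) GPU pair in a naive all-to-all exchange.

    Assumption: each GPU sends exactly one message to every other GPU.
    A GPU never sends to itself, so (i, i) pairs are excluded.
    """
    return list(permutations(range(num_gpus), 2))


def flow_count(num_gpus: int) -> int:
    # Each of N GPUs talks to N - 1 peers, so the count grows as N * (N - 1).
    return num_gpus * (num_gpus - 1)


if __name__ == "__main__":
    for n in (8, 64, 1024):
        print(f"{n:>5} GPUs -> {flow_count(n):>9,} directed GPU-to-GPU flows")
```

The exact count matters less than the shape of the curve: the number of GPU pairs the fabric has to serve grows quadratically with cluster size, and when a synchronized training step kicks off a collective, many of those flows can fire at nearly the same moment. That's one way to picture the heavy, bursty traffic the backend network has to absorb without dropping packets.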
AI workload requirements pressure the network in ways that have led to some very interesting developments. Let's dive deeper.