Unlocking Growth: Meeting Business Demands with High-Performance Networking in AI and HPC
The exciting thing about ChatGPT and other interfaces to public AI/ML models is that everyone is starting to engage with AI, from sales and marketing to finance and beyond. AI has been democratized to such an extent that everybody, from the CEO to the average worker, has a point of view on how it can be used. It's discussed on TV and social media, and nearly every company is on, or getting on, the AI bandwagon. Organizations and the people supporting them must stay on the front foot and have a strategy for what AI means for the business, rather than simply chasing the shiniest AI objects in the organization.
AI as necessity
Open models such as ChatGPT, Google Bard/Gemini and Meta's offerings are built on large language models (LLMs): computer algorithms that process natural language input and predict the next word based on what they have already seen. The biggest challenge with these cloud-based, publicly available interfaces is that they are open to everyone. Whatever you input into ChatGPT and other public engines is also used to train the public LLM, so if proprietary data is entered into one of these open models, that data is effectively out on the public internet for everyone to see and use. Today, hundreds of large open-source LLMs are available, a number that will grow exponentially.
Because of this growth and the security ramifications, companies and organizations must put their own front end on these LLMs and fine-tune them with private data for task-specific goals. Companies that pair these models with their own data stand to benefit enormously, and adopting AI/ML will be paramount to a company's future. For example, 61 percent of respondents to a recent Cisco survey believe they have at most one year to implement an AI strategy before their organization begins to incur significant negative business impact.
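As a rough illustration of that fine-tuning step, the sketch below assumes a Hugging Face-style workflow; the model name, the in-memory documents standing in for proprietary data and the training settings are all placeholders. The key point is simply that the data and the resulting weights stay inside your own environment.

```python
# Minimal causal-LM fine-tuning sketch using the Hugging Face "transformers" and
# "datasets" libraries. The model name and training data are placeholders.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

MODEL_NAME = "some-open-source-llm"  # hypothetical; substitute a real open-source checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Proprietary documents would normally come from an internal data store.
private_docs = ["Internal runbook text ...", "Product FAQ text ..."]
dataset = Dataset.from_dict({"text": private_docs})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-llm", num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # the fine-tuned weights never leave your own environment
```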
AI/ML is a must-have for any company moving forward to compete and be successful.
Cloud vs. on-premises deployment
Most customers who are starting to explore AI/ML today are utilizing public cloud resources because (a) it's easy to run AI/ML workloads in the public cloud, and (b) most customers do not have sizeable GPU-accelerated infrastructure of their own. Within the next 12 to 18 months, a decision must be made to either stay in the public cloud or build new high-performance infrastructure to support business needs. Remaining in the public cloud and expanding the number of GPU-enabled nodes to train larger models will be very expensive given rising cloud costs. Building new infrastructure will also be very expensive, as power, cooling, rack space, hardware and engineering staff must all be factored into the decision.
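To make that trade-off concrete, a back-of-the-envelope comparison like the one below can help frame the discussion with leadership. Every dollar figure, node count and utilization rate in this sketch is a hypothetical placeholder, not a quote or benchmark; substitute your own cloud pricing, power, cooling, facilities and staffing numbers.

```python
# Rough cloud-vs-on-premises cost framing for a GPU training cluster.
# Every number here is a placeholder assumption; substitute real quotes.

NODES = 16                      # 8-GPU servers in the cluster
MONTHS = 36                     # planning horizon

# Hypothetical public cloud: hourly rate per 8-GPU instance, running 60% of the time
cloud_hourly_rate = 30.0
utilization = 0.6
cloud_total = NODES * cloud_hourly_rate * 24 * 30 * MONTHS * utilization

# Hypothetical on-premises build: capex plus opex over the same horizon
capex_per_node = 300_000        # server + GPUs + high-speed NICs
network_capex = 500_000         # switches, optics, cabling
monthly_opex = 40_000           # power, cooling, rack space, support staff share
onprem_total = NODES * capex_per_node + network_capex + monthly_opex * MONTHS

print(f"Cloud (36 months, placeholder rates):   ${cloud_total:,.0f}")
print(f"On-prem (36 months, placeholder rates): ${onprem_total:,.0f}")
```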
According to a recent IDC forecast, global spending on AI is expected to reach a minimum of $300 billion by 2026. Tirias Research projects that by 2028, the combined infrastructure and operational expenses for generative AI (GenAI) data center server nodes alone will surpass $76 billion, posing challenges to the business models and profitability of emerging services like search, content creation, and business automation incorporating GenAI.
Furthermore, a report from research firm 650 Group indicates that by 2028, nearly one in five Ethernet switch ports acquired by data centers will be associated with AI/ML and accelerated computing. Navigating this evolving landscape effectively gives you better cost predictability for GenAI models, but achieving the proper performance at enterprise scale comes with additional complexity and effort.
High-performance networking
Today, network engineers, especially in the data center space, must acquire AI/ML infrastructure skills and be able to discuss the required infrastructure upgrades, and the reasoning behind them, with upper management. Ninety-five percent of businesses know that AI will increase infrastructure workloads, but only 17 percent have fully flexible networks able to handle that complexity. Expect 400 Gb/s GPU node connectivity to become standard, and as PCIe 6.0 and 7.0 hit the market, the vast majority of switch ports deployed in AI networks in 2025 will operate at 800 Gb/s, doubling to 1,600 Gb/s by 2027. The faster we can connect GPU nodes, the faster we can run AI/ML jobs, lowering AI/ML infrastructure OPEX.
In a traditional data center, one would only expect to see 400 Gb/s at the aggregation layer, as your typical server node just isn't going to saturate 100 Gb/s of bandwidth, let alone 400 Gb/s.
AI clusters, meanwhile, are an entirely different beast. The average AI node comes equipped with one 400 Gb/s NIC per GPU. Such nodes can pack four or eight GPUs — so do the math for NICs — and they're all needed to handle the immense data flows AI workloads generate.
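Doing that math for an eight-GPU node (the GPU count and NIC speed below simply restate the figures above) shows why an AI backend fabric looks nothing like a traditional access layer:

```python
# Per-node network bandwidth for an AI training server: one 400 Gb/s NIC per GPU.
gpus_per_node = 8            # four-GPU nodes would halve these figures
nic_speed_gbps = 400

node_bandwidth_gbps = gpus_per_node * nic_speed_gbps
print(f"Per-node GPU fabric bandwidth: {node_bandwidth_gbps} Gb/s "
      f"({node_bandwidth_gbps / 1000:.1f} Tb/s)")
# -> Per-node GPU fabric bandwidth: 3200 Gb/s (3.2 Tb/s)
```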
Today, NVIDIA's InfiniBand continues to dominate AI networking, with an estimated 90 percent of deployments using NVIDIA/Mellanox InfiniBand rather than Ethernet. However, emerging technologies, such as smart NICs and AI-optimized switch ASICs with deep packet buffers, have helped curb packet loss, making Ethernet behave more like InfiniBand.
Analysts predict an increase in Ethernet's role in AI networks to capture about 20 points of revenue share by 2027. One of the reasons for this is the industry's familiarity with Ethernet. While AI deployments may still require specific tuning, enterprises already know how to deploy and manage Ethernet infrastructure.
Today's data center network engineers must acquire a diverse skill set to navigate the complexities of the evolving AI networking landscape effectively. Here are key areas they must focus on:
Business strategy
- Understand how to demonstrate the value of low-latency and lossless networking for AI and high-performance computing (HPC) to leadership.
- Articulate to management why traditional data center networks will not suffice for connecting GPU nodes in the context of AI initiatives.
- Understand and be able to explain why GPU nodes need low-latency, lossless fabrics and why most current networking infrastructure cannot support them.
Basics of AI infrastructure
- Architect HPC node connectivity with a focus on AI accelerator infrastructures.
- Differentiate between Ethernet and InfiniBand solutions.
- Implement non-blocking fat-tree networking in a spine-leaf topology to ensure a low-latency, lossless fabric (see the port-count sketch after this list).
- Become familiar with RDMA over Converged Ethernet (RoCE) and InfiniBand solutions to make informed decisions that support business needs.
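As a rough illustration of what "non-blocking" means in a two-tier spine-leaf (folded fat-tree) design, the sketch below sizes a fabric so that each leaf's uplink capacity matches its host-facing capacity. The port counts and speeds are assumptions chosen for the example, not a reference design.

```python
# Non-blocking two-tier spine-leaf sizing: uplink capacity per leaf must equal
# (or exceed) the capacity of its host-facing ports. All values are example assumptions.

leaf_host_ports = 32          # 400 Gb/s ports facing GPU-node NICs on each leaf
leaf_uplink_ports = 16        # 800 Gb/s ports facing the spines on each leaf
host_port_gbps = 400
uplink_port_gbps = 800

downlink_capacity = leaf_host_ports * host_port_gbps      # 12.8 Tb/s
uplink_capacity = leaf_uplink_ports * uplink_port_gbps    # 12.8 Tb/s
oversubscription = downlink_capacity / uplink_capacity

print(f"Leaf downlink: {downlink_capacity/1000:.1f} Tb/s, "
      f"uplink: {uplink_capacity/1000:.1f} Tb/s, "
      f"oversubscription {oversubscription:.2f}:1 "
      f"({'non-blocking' if oversubscription <= 1 else 'oversubscribed'})")

# With one uplink from every leaf to every spine, 16 uplinks per leaf implies 16 spines,
# and a spine with 64 x 800 Gb/s ports could then serve up to 64 leaves
# (64 leaves x 32 host ports = 2,048 GPU-facing 400 Gb/s ports).
```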
Basics of automation and APIs
- Manage large HPC backend networks in an automated manner.
- Use APIs to manage and integrate tools for monitoring and managing HPC fabrics.
- Implement automation for fabric provisioning, port allocation and Quality of Service (QoS) management on a per-port basis (a minimal sketch follows this list).
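Here is a minimal sketch of that kind of per-port automation against a hypothetical fabric controller REST API; the endpoint, URL paths and payload fields are invented for illustration, and a real deployment would use the data model of whatever fabric manager or controller is in place.

```python
# Sketch: push a standard RoCE-friendly QoS profile to every GPU-facing port
# through a hypothetical fabric controller's REST API.
import requests

CONTROLLER = "https://fabric-controller.example/api/v1"   # hypothetical endpoint
TOKEN = {"Authorization": "Bearer <token>"}

# Desired per-port settings: enable PFC on the RoCE traffic class and mark it lossless.
QOS_PROFILE = {
    "pfc_enabled": True,
    "no_drop_class": 3,          # traffic class carrying RoCEv2
    "ecn_enabled": True,
    "mtu": 9216,
}

def apply_qos(switch: str, port: str) -> None:
    """Apply the standard QoS profile to one port (hypothetical API schema)."""
    url = f"{CONTROLLER}/switches/{switch}/ports/{port}/qos"
    resp = requests.put(url, json=QOS_PROFILE, headers=TOKEN, timeout=10)
    resp.raise_for_status()

# Inventory would normally come from the controller or a source of truth (IPAM/CMDB).
gpu_ports = {"leaf-101": ["Ethernet1/1", "Ethernet1/2"], "leaf-102": ["Ethernet1/1"]}

for switch, ports in gpu_ports.items():
    for port in ports:
        apply_qos(switch, port)
        print(f"Applied RoCE QoS profile to {switch} {port}")
```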
Basics of creating lossless fabrics
- Understand advanced congestion-management mechanisms such as Data Center Quantized Congestion Notification (DCQCN), Priority Flow Control (PFC) and Explicit Congestion Notification (ECN) in order to successfully implement and manage RDMA over Ethernet and ensure a lossless fabric for HPC backend networks (a simplified illustration of the DCQCN sender behavior follows).
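To build intuition for how DCQCN keeps a RoCE fabric lossless without constantly pausing traffic, here is a deliberately simplified model of the sender-side (reaction point) rate logic: switches mark packets with ECN as queues build, receivers echo congestion notification packets (CNPs), and the sender cuts its rate multiplicatively when a CNP arrives and recovers toward its target rate otherwise. Real DCQCN has additional increase phases and tuned timers; this is only a teaching sketch.

```python
# Highly simplified DCQCN-style reaction point: multiplicative decrease on CNP
# arrival, recovery toward the target rate otherwise. Not a faithful implementation.

LINE_RATE = 400.0   # Gb/s, the NIC's maximum send rate
G = 0.0625          # alpha update gain (a commonly cited default, used here as an assumption)

current_rate = LINE_RATE
target_rate = LINE_RATE
alpha = 1.0         # running estimate of how congested the path is

def on_update(cnp_received: bool) -> None:
    """One update interval of the simplified rate controller."""
    global current_rate, target_rate, alpha
    if cnp_received:
        target_rate = current_rate                 # remember where we were
        current_rate *= (1 - alpha / 2)            # multiplicative decrease
        alpha = (1 - G) * alpha + G                # congestion estimate rises
    else:
        alpha = (1 - G) * alpha                    # congestion estimate decays
        current_rate = (current_rate + target_rate) / 2   # recover toward target
        current_rate = min(current_rate, LINE_RATE)

# Simulate a burst of congestion followed by recovery.
pattern = [True, True, False, False, False, False, False, False]
for step, cnp in enumerate(pattern, 1):
    on_update(cnp)
    print(f"step {step}: CNP={cnp!s:5} rate={current_rate:6.1f} Gb/s alpha={alpha:.3f}")
```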
Basics of securing an HPC network
- Implement segmentation and strict access controls to enhance network security.
- Develop disaster recovery (DR) scenarios to ensure data integrity and availability.
- Standardize and automate infrastructure configurations to mitigate the risk of human errors that could compromise security.
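One small, concrete way to act on that last point is to continuously diff running configurations against a golden template. The sketch below is a minimal illustration; the device name, template text and fetch function are placeholders rather than a real toolchain.

```python
# Sketch: flag switches whose running config drifts from a standardized "golden" snippet.
# get_running_config() is a placeholder for however configs are retrieved (API, NETCONF, backups).
import difflib

GOLDEN_SNIPPET = """\
priority-flow-control enable
ecn marking enable
mtu 9216
"""

def get_running_config(device: str) -> str:
    """Placeholder: return the relevant config section for a device."""
    raise NotImplementedError("fetch from your controller, NETCONF, or config backups")

def drift_report(device: str, running: str) -> list[str]:
    """Return a unified diff between the golden snippet and the running config."""
    return list(difflib.unified_diff(
        GOLDEN_SNIPPET.splitlines(), running.splitlines(),
        fromfile="golden", tofile=device, lineterm=""))

# Example with a canned config instead of a live fetch:
running_example = "priority-flow-control enable\nmtu 1500\n"
for line in drift_report("leaf-101", running_example):
    print(line)   # any output means the device has drifted from the standard
```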
By mastering these areas, data center network engineers can contribute effectively to implementing and managing advanced AI and HPC infrastructure, aligning technological capabilities with business objectives while maintaining a robust and secure network environment.
Summary
Businesses at all levels must proactively prepare for the anticipated rise in infrastructure costs driven by the growing prominence of GenAI. Strategic decisions regarding data center architecture, energy efficiency and operational enhancements are crucial for maintaining profitability and fostering sustainable growth. The escalating bandwidth demands of AI/ML are poised to drive significant development in data center switching over the next five years, with Ethernet switching tied to AI/ML and accelerated computing growing from a niche into a substantial share of the market by 2028. Anticipated shipments of 800 Gb/s switches and optics signal a pivotal advancement, pending scalable production to meet the evolving needs of AI/ML applications. Data center network engineers must prioritize learning HPC accelerated network design to plan effectively for the future.
WWT is creating learning paths and building composable AI/ML and HPC lab environments (see our AI Proving Ground) where engineers can learn about AI/ML and become proficient in the skills the business will need as it leverages AI/ML in the workplace.