Article written by Mike Kieran, Senior Marketing Manager, IBM. 

Today's advanced analytics and artificial intelligence applications are built on deep learning models that require extreme computational power and high-speed data access at every stage, from data ingest and model development to AI inferencing. These workloads need high-performance storage systems tightly coupled with accelerated AI infrastructure. 

We're excited to announce that IBM Storage Scale System 6000 is now an approved storage solution for NVIDIA DGX SuperPOD.

DGX SuperPOD is a turnkey AI infrastructure designed to help solve the world's most challenging computational problems, aimed at organizations looking to rapidly deploy a robust platform for deep learning (DL) and AI development. When customers deploy Storage Scale System 6000 storage with DGX SuperPOD, IBM works with NVIDIA to test, plan, and install the system, with the storage backed by IBM global deployment and support services.

Performance requirements for AI workloads vary significantly depending on the types of AI models and data formats being used. A storage system approved for DGX SuperPOD needs to be optimized for small, random I/O patterns and to provide both high peak system performance and high aggregate filesystem performance to meet the variety of workloads an organization may encounter.

For example, large language model training requires regular checkpointing to save the state of the training. Model training is paused until the complete checkpoint is written, so peak write throughput is an important storage requirement. Inferencing is latency-sensitive, but the workload is often random with a mixture of reads and writes. 
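To make that concrete, a rough estimate of the stall is simply checkpoint size divided by sustained write throughput. The sketch below uses illustrative assumptions (model size, bytes per parameter, optimizer-state multiplier, and write rate are hypothetical, not IBM or NVIDIA figures):

```python
# Rough estimate of how long training pauses while a synchronous checkpoint
# is written. All numbers below are illustrative assumptions.

def checkpoint_stall_seconds(params_billions: float,
                             bytes_per_param: int,
                             optimizer_multiplier: float,
                             write_gbps: float) -> float:
    """Return seconds of training stall for one synchronous checkpoint."""
    # billions of params x bytes/param yields gigabytes directly
    checkpoint_gb = params_billions * bytes_per_param * optimizer_multiplier
    return checkpoint_gb / write_gbps

# Assumption: a 70B-parameter model, 2 bytes/param (bf16 weights), optimizer
# state roughly tripling the footprint, written at a sustained 150 GB/s.
stall = checkpoint_stall_seconds(70, 2, 3.0, 150.0)
print(f"Estimated checkpoint stall: {stall:.1f} s")  # ~2.8 s
```

Doubling the sustained write rate halves the stall, which is why peak write throughput matters so much for frequent checkpointing.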

To help customers characterize their own performance requirements, NVIDIA provides guidance on workloads and datasets, as shown in Table 1. 

Performance Level Required | Example Workloads | Dataset Size
Good | Natural language processing | Most datasets fit in cache
Better | Image processing with compressed images (e.g., ImageNet/ResNet-50) | Many to most datasets can fit within the local system's cache
Best | Training with 1080p, 4K, or uncompressed images; offline inference; ETL | Datasets too large to fit in cache; massive first-epoch I/O requirements; workflows that read the dataset only once

Table 1 – NVIDIA guidance characterizing different I/O workloads

Storage Scale System 6000 attains the performance levels of the guidance's "Best" category.

Multiple classes of validation tests are used to evaluate a particular storage technology and configuration for use with DGX SuperPOD: microbenchmark performance, real application performance, and functional testing. Beyond performance, storage solutions are evaluated for robustness and resiliency as part of functional testing. A pair of Storage Scale System 6000 systems met the requirements of all three test classes.

Storage Scale System 6000 uses a simple building-block approach to grow capacity and performance. Each Storage Scale System is a single 4U node with active-active controllers and redundant hardware, providing up to 310 gigabytes per second (GB/s) of throughput and up to 13 million IOPS using NVMe over Fabrics (NVMe-oF). Performance scaling is linear: a cluster of 10 Storage Scale System 6000 systems is capable of more than 3 terabytes per second (TB/s) of throughput. Each system also supports up to nine SAS hard disk drive expansion enclosures.
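A minimal back-of-the-envelope sketch of that building-block math, using the per-node figures quoted above and assuming the linear scaling described in the text holds at cluster scale:

```python
# Cluster sizing estimate from the per-node figures quoted in the text.
# Assumes linear scaling, as described above.

NODE_THROUGHPUT_GBPS = 310   # per 4U Storage Scale System 6000 node
NODE_IOPS_MILLIONS = 13      # per node, NVMe-oF

def cluster_estimate(nodes: int) -> tuple[float, float]:
    """Return (aggregate TB/s, aggregate millions of IOPS) for a cluster."""
    return nodes * NODE_THROUGHPUT_GBPS / 1000, nodes * NODE_IOPS_MILLIONS

tbps, miops = cluster_estimate(10)
print(f"10 nodes: ~{tbps:.1f} TB/s, ~{miops:.0f}M IOPS")  # ~3.1 TB/s
```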

To help unlock the full potential of AI and ensure that fast accelerated infrastructure isn't being hampered by slow I/O, Storage Scale System 6000 supports NVIDIA GPUDirect Storage, which enables a direct data path between GPU memory and remote storage. This GPUDirect architecture removes the host server CPU and DRAM from the data path, so the I/O path between storage and the GPU is shorter. Storage Scale System 6000 supports NVIDIA ConnectX-7 network interface cards, which enable 200 Gb/s and 400 Gb/s of NVIDIA InfiniBand and Ethernet networking between the storage system and the GPUs.
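From an application's point of view, this path can be exercised through NVIDIA's kvikio Python bindings for the cuFile (GPUDirect Storage) API. The sketch below is a minimal illustration, assuming a GDS-enabled driver stack and a filesystem that supports it; the file path is hypothetical:

```python
# Minimal GPUDirect Storage read sketch using NVIDIA's kvikio (cuFile) bindings.
# Assumes GDS-enabled drivers; the file path below is a hypothetical example.
import cupy
import kvikio

# Allocate the destination buffer directly in GPU memory.
buf = cupy.empty(1024 * 1024, dtype=cupy.uint8)

# cuFile transfers data from storage into GPU memory, bypassing host
# CPU/DRAM bounce buffers when the underlying filesystem supports GDS.
f = kvikio.CuFile("/gpfs/fs1/dataset.bin", "r")
nbytes = f.read(buf)
f.close()

print(f"Read {nbytes} bytes directly into GPU memory")
```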

Test IBM's Storage Scale System in WWT's AI Proving Ground.
