Large language models (LLMs) continue to grow rapidly in size, increasing the compute required to process inference requests. Meeting real-time latency requirements while serving a growing number of users makes multi-GPU computing essential, according to the NVIDIA Technical Blog.
Benefits of Multi-GPU Computing
Even if a large model fits within a single state-of-the-art GPU's memory, the rate at which tokens are generated depends on the total compute power available. Combining the capabilities of multiple cutting-edge GPUs makes real-time user experiences possible. Techniques like tensor parallelism (TP) allow for fast processing of inference requests, optimizing both user experience and cost by carefully selecting the number of GPUs for each model.
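To make the idea concrete, below is a minimal sketch of a tensor-parallel (row-parallel) linear layer in PyTorch. It is illustrative only: the dimensions, process-group setup, helper names, and launch command are assumptions, not NVIDIA's implementation. Each GPU holds a slice of the weight, computes a partial result, and an all-reduce over the interconnect combines the partials.

```python
# Minimal tensor-parallel (TP) sketch: a row-parallel linear layer.
# Illustrative assumptions throughout: toy sizes, torchrun launch, NCCL backend.
import os
import torch
import torch.distributed as dist

def row_parallel_linear(x_shard, w_shard, group=None):
    """Each rank holds a slice of the weight along the input dimension.
    Partial results are summed with an all-reduce over NVLink/NVSwitch."""
    partial = x_shard @ w_shard                       # [batch, d_out], partial sum
    dist.all_reduce(partial, op=dist.ReduceOp.SUM, group=group)
    return partial                                    # full result on every rank

def main():
    dist.init_process_group("nccl")                   # NCCL uses NVLink/NVSwitch when present
    rank, world = dist.get_rank(), dist.get_world_size()
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", rank)))

    d_in, d_out, batch = 8192, 8192, 4                # toy sizes for illustration
    shard = d_in // world
    # In practice x_shard would come from a preceding column-parallel layer;
    # here it is random data just to exercise the communication pattern.
    x_shard = torch.randn(batch, shard, device="cuda")
    w_shard = torch.randn(shard, d_out, device="cuda")

    y = row_parallel_linear(x_shard, w_shard)
    if rank == 0:
        print("output shape:", tuple(y.shape))

if __name__ == "__main__":
    main()  # e.g. torchrun --nproc_per_node=8 tp_sketch.py
```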
Multi-GPU Inference: Communication-Intensive
Multi-GPU TP inference involves splitting each model layer's calculations across multiple GPUs. The GPUs must communicate extensively, exchanging partial results before the next model layer can proceed. This communication must be as fast as possible, because the GPUs' Tensor Cores sit idle while they wait for data. For instance, a single query to Llama 3.1 70B may require up to 20 GB of data transfer per GPU, highlighting the need for a high-bandwidth interconnect.
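A rough back-of-the-envelope estimate shows how traffic of that magnitude can accumulate. The model dimensions below are the public Llama 3.1 70B values; the two-all-reduces-per-layer pattern, prompt length, and ring all-reduce cost model are illustrative assumptions, not the blog's exact accounting.

```python
# Rough estimate of per-GPU all-reduce traffic for one tensor-parallel query.
# Assumptions (illustrative, not the blog's methodology):
#   - Llama 3.1 70B: 80 layers, hidden size 8192
#   - 2 all-reduces per layer (after attention and after the MLP)
#   - FP16 activations (2 bytes), TP across 8 GPUs
#   - Ring all-reduce moves ~2*(N-1)/N bytes per GPU per byte reduced
layers, hidden, bytes_per_elem, tp = 80, 8192, 2, 8
prompt_tokens = 4096                      # example long prompt (prefill)

activation_bytes = prompt_tokens * hidden * bytes_per_elem
per_layer = 2 * activation_bytes          # two all-reduces per layer
ring_factor = 2 * (tp - 1) / tp           # bytes each GPU sends and receives
total_per_gpu = layers * per_layer * ring_factor

print(f"~{total_per_gpu / 1e9:.1f} GB moved per GPU for this prompt")
# Prints roughly 18.8 GB -- the same order as the ~20 GB figure quoted above.
```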
NVSwitch: Key for Fast Multi-GPU LLM Inference
Effective multi-GPU scaling requires both high per-GPU interconnect bandwidth and fast connectivity between every pair of GPUs. NVIDIA Hopper architecture GPUs, equipped with fourth-generation NVLink, can communicate at 900 GB/s. When combined with NVSwitch, every GPU in a server can communicate at this speed simultaneously, ensuring non-blocking communication. Systems such as NVIDIA HGX H100 and HGX H200 pair eight GPUs with multiple NVSwitch chips, so every GPU can reach every other GPU at full NVLink bandwidth.
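On a running system, one way to sanity-check that the full NVLink fabric is available is to count active NVLink links per GPU through NVML. The sketch below uses the nvidia-ml-py (pynvml) bindings and should be treated as an illustrative check, not an official tool; an H100 with all links up reports 18 active links, which at 50 GB/s each corresponds to the 900 GB/s figure above.

```python
# Sketch: count active NVLink links per GPU via NVML (nvidia-ml-py / pynvml).
# Illustrative only; error handling is deliberately minimal.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        active = 0
        for link in range(pynvml.NVML_NVLINK_MAX_LINKS):
            try:
                if pynvml.nvmlDeviceGetNvLinkState(handle, link) == pynvml.NVML_FEATURE_ENABLED:
                    active += 1
            except pynvml.NVMLError:
                break  # link index not present on this GPU
        print(f"GPU {i}: {active} active NVLink links")
finally:
    pynvml.nvmlShutdown()
```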
Performance Comparisons
Without NVSwitch, each GPU must split its NVLink bandwidth across multiple point-to-point connections, so the bandwidth between any pair of GPUs shrinks as more GPUs are added. In an eight-GPU server, for example, a point-to-point topology leaves only about 128 GB/s between any two GPUs (900 GB/s divided across seven links), whereas NVSwitch sustains the full 900 GB/s between every pair. This difference substantially impacts overall inference throughput and user experience. Tables in the original blog illustrate the bandwidth and throughput benefits of NVSwitch over point-to-point connections.
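The practical effect is easy to observe with a small all-reduce micro-benchmark. The sketch below uses PyTorch with the NCCL backend; the message size, iteration counts, and bus-bandwidth formula (the 2*(N-1)/N convention used by nccl-tests) are illustrative choices, not the blog's benchmark setup.

```python
# Micro-benchmark sketch: measure all-reduce bus bandwidth across the GPUs in
# one node. Launch with torchrun --nproc_per_node=<num_gpus>.
import os, time
import torch
import torch.distributed as dist

def main():
    dist.init_process_group("nccl")
    rank, world = dist.get_rank(), dist.get_world_size()
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", rank)))

    nbytes = 1 << 30                                   # 1 GiB message
    x = torch.ones(nbytes // 2, dtype=torch.float16, device="cuda")

    for _ in range(5):                                 # warmup
        dist.all_reduce(x)
    torch.cuda.synchronize()

    iters = 20
    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / iters

    # Bus bandwidth per GPU, same convention as nccl-tests.
    bus_bw = nbytes / elapsed * 2 * (world - 1) / world / 1e9
    if rank == 0:
        print(f"all-reduce bus bandwidth: ~{bus_bw:.0f} GB/s per GPU")

if __name__ == "__main__":
    main()
```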
Future Innovations
NVIDIA continues to innovate with NVLink and NVSwitch technologies to push real-time inference performance boundaries. The upcoming NVIDIA Blackwell architecture will feature fifth-generation NVLink, doubling speeds to 1,800 GB/s. Additionally, new NVSwitch chips and NVLink switch trays will enable larger NVLink domains, further enhancing performance for trillion-parameter models.
The NVIDIA GB200 NVL72 system, connecting 36 NVIDIA Grace CPUs and 72 NVIDIA Blackwell GPUs, exemplifies these advancements. This system allows all 72 GPUs to function as a single unit, achieving 30x faster real-time trillion-parameter inference compared to previous generations.