AMD, Broadcom, Cisco, Google, Hewlett Packard Enterprise, Intel, Meta and Microsoft form Ultra Accelerator Link (UALink) advocacy group to boost AI connectivity in data centres
A broad industry alliance of data centre infrastructure companies has joined forces to challenge Nvidia’s dominance of AI computing accelerators and, more specifically, of the high-speed networking technology that ties them together.
Current AI compute clusters differ from older systems in that the interconnect within each computing node now links the AI accelerators (or GPUs) directly to one another, bypassing the CPUs. This direct connection lets the accelerators share data quickly and work together on large AI models that no single accelerator’s memory can hold on its own.
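The pattern this enables is already familiar from today’s multi-GPU nodes. Below is a minimal sketch, assuming a node with two CUDA GPUs that support peer access; since UALink’s wire protocol is not yet public, NVLink/PCIe peer-to-peer access stands in here:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Sketch: direct GPU-to-GPU copy inside one node, bypassing host memory.
// Error handling trimmed for brevity.
int main() {
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);  // can GPU 0 reach GPU 1 directly?
    if (!canAccess) { printf("no peer access between GPU 0 and GPU 1\n"); return 1; }

    const size_t bytes = 64 << 20;              // 64 MiB payload
    float *src, *dst;

    cudaSetDevice(0);
    cudaMalloc(&src, bytes);
    cudaDeviceEnablePeerAccess(1, 0);           // let GPU 0 address GPU 1 directly

    cudaSetDevice(1);
    cudaMalloc(&dst, bytes);

    // The transfer travels accelerator-to-accelerator; the CPU only orchestrates.
    cudaMemcpyPeer(dst, 1, src, 0, bytes);
    cudaDeviceSynchronize();
    printf("copied %zu bytes GPU0 -> GPU1\n", bytes);
    return 0;
}
```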
These accelerators are not only connected within a single node; external links through high-speed switches tie multiple nodes together, so data can move between them with high bandwidth and low latency, which is essential for processing large AI workloads effectively.
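Across nodes, the workhorse operations are collectives such as all-reduce, which sum gradients or activations over every accelerator in the cluster. Below is a minimal sketch of that scale-out pattern, assuming a CUDA-aware MPI build (for example Open MPI compiled with CUDA support), which accepts device pointers and can move data over RDMA-capable NICs without staging through host memory:

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

// Sketch: all-reduce across nodes on GPU-resident buffers.
// Launch with one rank per GPU, e.g. `mpirun -np <gpus> ./a.out`.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, world;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world);

    const int n = 1 << 20;                      // 1M floats per rank
    float* grad;
    cudaMalloc(&grad, n * sizeof(float));
    // ... fill `grad` with locally computed gradients (omitted) ...

    // Sum gradients in place across every rank; the fabric between the
    // switches determines the bandwidth and latency of this step.
    MPI_Allreduce(MPI_IN_PLACE, grad, n, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

    cudaFree(grad);
    MPI_Finalize();
    return 0;
}
```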
NVLink rules
The issue is that Nvidia owns this space with its proprietary NVLink. In response, AMD, Broadcom, Cisco, Google, Hewlett Packard Enterprise (HPE), Intel, Meta and Microsoft announced that they are aligned on the development of an open interconnect, Ultra Accelerator Link (UALink), rather than each pursuing its own proprietary link. They intend UALink to form the external bridge that links up to 1,024 accelerators, spanning multiple nodes, within a single AI pod.
By building the interconnect on open standards, UALink will give OEMs, IT professionals and systems integrators a path to easier integration, greater flexibility and better scalability in their AI data centres.
Specification 1.0 will allow up to 1,024 accelerators in an AI computing pod and permit direct loads and stores between the memory attached to those accelerators, whether GPUs such as AMD’s Instinct or specialised processors such as Intel’s Gaudi. The UALink promoter group has formed the UALink Consortium and expects it to be incorporated in the third quarter of 2024. The specification itself is expected in the same quarter and will be made available to companies that join the consortium.
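“Direct loads and stores” are what distinguish a memory-semantic fabric from message passing: code on one accelerator dereferences another accelerator’s memory as if it were local. Below is a minimal sketch of that semantic, again using CUDA peer access between two GPUs in one node as a stand-in for a UALink pod (capability checks omitted):

```cuda
#include <cuda_runtime.h>

// Sketch: a kernel running on GPU 0 loads from and stores to memory that
// physically resides on GPU 1, with no explicit copy.
__global__ void scale_remote(float* remote, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) remote[i] *= factor;             // load + store hit GPU 1's memory
}

int main() {
    const int n = 1 << 20;
    float* buf;

    cudaSetDevice(1);
    cudaMalloc(&buf, n * sizeof(float));        // allocation lives on GPU 1

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);           // map GPU 1 into GPU 0's view
    scale_remote<<<(n + 255) / 256, 256>>>(buf, 2.0f, n);  // runs on GPU 0
    cudaDeviceSynchronize();
    return 0;
}
```

UALink’s stated goal is to extend this kind of memory-semantic access beyond a single chassis, to as many as 1,024 accelerators in a pod.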
AI milestone
“The work the companies in UALink are doing to create an open, high-performance and scalable accelerator fabric is essential to the future of AI. Together, we bring extensive experience in building large-scale AI and high-performance computing solutions based on open standards, efficiency and robust ecosystem support,” said AMD Data Center Solutions Group EVP and GM Forrest Norrod.
“It is essential to support open ecosystem collaboration to enable scale-up networks with a variety of high-speed, low-latency solutions,” said Broadcom Data Center Solutions Group VP and GM Jas Tremblay.
“Very high-performance interconnects become increasingly important as AI workloads continue to grow in size and scope,” said Cisco Common Hardware Group EVP Martin Lund. “Together, we are committed to developing UALink, which will be a scalable and open solution to help overcome some of the challenges in developing AI supercomputers.”
“As a founding member of the industry’s UALink consortium, we look forward to contributing our expertise in high-performance networks and systems, and collaborating on the development of a new open standard for accelerator interconnects for the next generation of supercomputing,” said HPE HPC & AI Infrastructure Solutions SVP and GM Trish Damkroger.
“UALink is an important milestone in the advancement of artificial intelligence computing. Intel is proud to jointly lead this new technology and bring our expertise in creating an open, dynamic AI ecosystem,” said Intel Network and Edge Group SVP and GM Sachin Katti. “This initiative extends Intel’s commitment to AI connectivity innovation, which includes leadership roles in the Ultra Ethernet Consortium and other standards organisations.”
“In a short space of time, the technology industry has embraced the challenges that AI and HPC have uncovered. Interconnecting accelerators like GPUs requires a holistic perspective when seeking to improve efficiency and performance,” said Ultra Ethernet Consortium president J Metz. “At UEC, we believe that UALink’s scale-up approach to connecting accelerators within a pod complements our own scale-out protocol, and we look forward to collaborating on an open, industry-wide, ecosystem-friendly solution that addresses both kinds of needs in the future.”
The future with Ethernet
It makes sense for the Ultra Ethernet Consortium to be involved in this initiative. It was formed last summer by Intel, AMD, Meta, HPE and others to advance high-performance Ethernet networking, and Ultra Accelerator Link complements that work by linking GPUs within pods.
There is plenty of work happening on the Ethernet front too, such as RDMA over Converged Ethernet (RoCE), which can be used for high-performance networking of clusters. The companies involved in UALink are themselves working on 800G Ethernet adaptors, and a specification for 1.6 Tbit/s Ethernet (the IEEE P802.3dj draft) is already underway. The Ultra Ethernet Consortium has been working since last year, under the auspices of the Linux Foundation, on its plan to accelerate every part of the Ethernet stack, from the physical and link layers up through the transport and software layers.