Gartner: Considerations when using GPUs in the datacentre

CIOs expect extensive value from their artificial intelligence (AI) investments, including increased productivity, enhanced customer experience (CX) and digital transformation. As a result, Gartner client interest in deploying AI infrastructure – including graphics processing units (GPUs) and AI servers – has grown substantially. 

Specifically, client enquiries regarding GPUs and AI infrastructure increased nearly fourfold annually from October 2022 through October 2024. Clients are exploring hosted, cloud and on-premises options for GPU deployment. In some cases, enterprises will select a “full-stack” AI offering that bundles GPU, compute, storage and networking into a single package. In other instances, they will select, integrate and deploy the components individually. The requirements of AI workloads differ from those of most existing datacentre workloads.

Multiple interconnect technologies are available to support GPU connectivity. A common question from Gartner clients is: “Should I use Ethernet, InfiniBand or NVLink to connect GPU clusters?” All three approaches can be valid, depending on the scenario.

These technologies are not mutually exclusive. Enterprises can deploy them in conjunction with one another, for example pairing a scale-up interconnect such as NVLink inside the rack with InfiniBand or Ethernet to scale out beyond it. A common misconception is that only InfiniBand or a supplier-proprietary interconnect technology (such as NVLink) can deliver appropriate performance and reliability.

However, Gartner recommends that enterprises deploy Ethernet rather than alternative technologies, such as InfiniBand, for clusters of up to several thousand GPUs. Ethernet-based infrastructure can provide the necessary reliability and performance, and there is widespread enterprise experience with the technology. Furthermore, a broad ecosystem of suppliers is associated with Ethernet technology.

Optimise network deployments for GPU traffic 

The current state of practice for central processing unit (CPU)-based, general-purpose computing workloads is a leaf-spine network topology.

However, leaf-spine topologies are not always optimal for AI workloads. In addition, running AI workloads on existing datacentre networks alongside other traffic can create noisy-neighbour effects that degrade performance for both AI and existing workloads. This can delay processing and extend job completion times for AI workloads, which is highly inefficient.

In a buildout of AI infrastructure, networking switches typically represent 15% or less of the cost. Saving money by reusing existing switches therefore often leads to suboptimal overall price/performance for the AI workload investment. Gartner consequently makes several recommendations.
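To see why, consider a rough worked example (the figures below are illustrative assumptions, not Gartner data): if switching is 15% of the buildout but a congested network idles the remaining 85% of the investment for even a tenth of each job, the idle-time loss exceeds any saving made on switches. A minimal Python sketch:

```python
# Illustrative model of AI cluster price/performance (hypothetical figures,
# not Gartner data): compare saving money on switches against the cost of
# the GPU idle time a congested network can cause.

CLUSTER_COST = 10_000_000   # total buildout cost (currency units, assumed)
SWITCH_SHARE = 0.15         # switches are 15% or less of cost, per the article

switch_cost = CLUSTER_COST * SWITCH_SHARE
gpu_and_other_cost = CLUSTER_COST - switch_cost

# Suppose reusing existing switches saves half the switch budget...
saving = 0.5 * switch_cost
# ...but network contention stretches job completion times by 10%,
# effectively idling 10% of the non-switch investment.
idle_fraction = 0.10
effective_loss = idle_fraction * gpu_and_other_cost

print(f"Switch saving:  {saving:,.0f}")
print(f"Idle-time loss: {effective_loss:,.0f}")
# Switch saving:  750,000
# Idle-time loss: 850,000  -> the 'saving' costs more than it saves
```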

Due to the unique traffic requirements and GPU costs, Gartner suggests building out dedicated physical switches for GPU connectivity. Furthermore, rather than defaulting to a leaf-spine topology, Gartner suggests using a minimal number of physical switches to reduce physical “hops”. This could still result in a leaf-spine topology, but also in alternatives such as single-switch, two-switch, full-mesh, cube-mesh and dragonfly topologies.

Avoid using the same switches for other generalised datacentre computing needs. For clusters below 500 GPUs, one or two physical switches are ideal. For organisations with more than 500 GPUs, Gartner advises IT decision-makers to build out a dedicated AI Ethernet fabric. This is likely to require a deviation from standard, state-of-practice top-of-rack topologies towards middle-of-row and/or modular switching implementations.
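As a back-of-the-envelope illustration of these thresholds, the sketch below estimates how many switches a cluster needs and how many switch hops the resulting design implies. The fixed and modular switch radices are assumptions chosen for illustration, not figures from the report.

```python
# Rough sizing sketch: how many physical switches does a GPU cluster need,
# and how many switch hops does the design imply? Radices are assumptions.

import math

def size_fabric(gpus: int, fixed_radix: int = 64, modular_radix: int = 576):
    """Return (design, switch_count, worst_case_hops) for a GPU count."""
    if gpus <= modular_radix:
        # One big middle-of-row/modular switch: every GPU is one hop away.
        return ("single modular switch", 1, 1)
    if gpus <= 2 * modular_radix:
        # Two interlinked modular switches: at most two hops.
        return ("two modular switches", 2, 2)
    # Beyond that, a dedicated leaf-spine AI fabric with fixed-radix leaves,
    # half of each leaf's ports facing GPUs and half facing spines (1:1 split).
    leaves = math.ceil(gpus / (fixed_radix // 2))
    spines = math.ceil(leaves * (fixed_radix // 2) / fixed_radix)
    return ("dedicated leaf-spine fabric", leaves + spines, 3)

for gpus in (256, 500, 2000):
    design, switches, hops = size_fabric(gpus)
    print(f"{gpus:5d} GPUs -> {design}: {switches} switches, "
          f"worst case {hops} switch hop(s)")
```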

Enhance Ethernet buildouts

Gartner recommends using dedicated switches for GPU connectivity. When deploying Ethernet (as opposed to InfiniBand or shelf/rack/row-optimised interconnects), use switches that meet specific requirements. Switches need to support:

  • High-speed interfaces for GPUs, including access ports of 400Gbps and above.
  • Lossless Ethernet, including advanced congestion-handling mechanisms such as datacentre quantised congestion notification (DCQCN); a minimal sketch of this marking behaviour follows this list.
  • Advanced traffic-balancing capabilities, including congestion-aware load balancing.
  • Remote direct memory access (RDMA)-aware load balancing and packet spraying.
  • Static pinning of flows.
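To give a feel for how DCQCN-style congestion handling works at the switch, here is a minimal Python sketch of the RED/ECN marking curve it relies on: as the egress queue deepens, the switch marks a growing share of packets so that RDMA senders slow down before the lossless buffer overflows. The Kmin, Kmax and Pmax values are illustrative assumptions, not recommended settings.

```python
# Minimal sketch of the RED/ECN marking curve used by DCQCN-style
# congestion control: below K_MIN nothing is marked, between K_MIN and
# K_MAX the marking probability ramps linearly to P_MAX, and above K_MAX
# every packet is marked. Thresholds are illustrative only.

import random

K_MIN = 100   # queue depth (KB) below which nothing is marked (assumed)
K_MAX = 400   # queue depth (KB) above which every packet is marked (assumed)
P_MAX = 0.20  # marking probability reached at K_MAX (assumed)

def mark_probability(queue_kb: float) -> float:
    """ECN marking probability for the current egress queue depth."""
    if queue_kb <= K_MIN:
        return 0.0
    if queue_kb >= K_MAX:
        return 1.0
    # Linear ramp from 0 at K_MIN up to P_MAX at K_MAX.
    return P_MAX * (queue_kb - K_MIN) / (K_MAX - K_MIN)

def should_mark(queue_kb: float) -> bool:
    """Randomised per-packet marking decision."""
    return random.random() < mark_probability(queue_kb)

for depth in (50, 150, 300, 450):
    print(f"queue {depth:3d} KB -> mark probability {mark_probability(depth):.2f}")
```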

Furthermore, the software used to manage AI networking fabrics must be enhanced as well, with functionality at the management layer to alert on, diagnose and remediate issues quickly. In particular, management software that provides granular telemetry at sub-second, and ideally sub-100 millisecond, intervals is ideal for troubleshooting and visibility. The ability to monitor and alert in real time, and to provide historical reporting on bandwidth utilisation, packet loss, jitter, latency and availability at the sub-second level, is also required.
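As an illustration of what sub-second telemetry implies in practice, the sketch below polls counters every 100 milliseconds and alerts when packet loss or latency crosses a threshold. The read_counters() collector is a hypothetical stand-in for a real streaming-telemetry source, and the thresholds are assumptions, not recommended values.

```python
# Illustrative sub-100ms telemetry loop: poll interface counters, derive
# packet loss over the interval, and alert when thresholds are breached.
# read_counters() is a hypothetical stand-in for a real telemetry source
# (for example gNMI or vendor streaming telemetry); thresholds are assumed.

import time

POLL_INTERVAL_S = 0.1        # 100 ms, per the sub-second requirement
LOSS_THRESHOLD = 0.001       # alert above 0.1% packet loss (assumed)
LATENCY_THRESHOLD_US = 50.0  # alert above 50 microseconds (assumed)

def read_counters(port: str) -> dict:
    """Hypothetical collector: replace with a real streaming-telemetry call."""
    return {"tx_packets": 0, "dropped_packets": 0, "avg_latency_us": 0.0}

def poll_loop(port: str) -> None:
    prev = read_counters(port)
    while True:
        time.sleep(POLL_INTERVAL_S)
        cur = read_counters(port)
        sent = cur["tx_packets"] - prev["tx_packets"]
        dropped = cur["dropped_packets"] - prev["dropped_packets"]
        loss = dropped / sent if sent else 0.0
        if loss > LOSS_THRESHOLD:
            print(f"ALERT {port}: packet loss {loss:.4%} in last interval")
        if cur["avg_latency_us"] > LATENCY_THRESHOLD_US:
            print(f"ALERT {port}: latency {cur['avg_latency_us']:.1f} us")
        prev = cur
```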

Ultra Ethernet (and accelerator) support

When building fabrics, Gartner advises IT leaders to consider hardware providers that pledge to support the Ultra Ethernet Consortium (UEC) and Ultra Accelerator Link (UAL) specifications.

The UEC is developing an industry standard to support high-performance workloads on Ethernet. As of February 2025, there is no proposed standard available, but Gartner expects a proposal before the end of 2025. The need for a standard stems from the fact that suppliers currently use proprietary mechanisms to provide the high-performance Ethernet necessary for AI connectivity. 

Long term, this reduces interoperability for customers, as it locks them into a single supplier’s implementation. The benefit of suppliers conforming to a consistent UEC standard is the ability to interoperate.

There is also a separate but related standards effort, the Ultra Accelerator Link (UAL), for a shelf/rack/row-optimised accelerator interconnect. The goal of UAL is to standardise a high-speed, scale-up accelerator interconnect technology aimed at addressing scale-up network bandwidth needs beyond what Ethernet and InfiniBand currently deliver.

Reduce risk with co-certified implementations

Finally, because of the stringent performance requirements for AI workloads, connectivity between GPU and network switches needs to be optimised and error-free from a hardware and software perspective. This can be increasingly challenging, given the rapid pace of change associated with both networking and GPU technology.

To mitigate the potential for implementation challenges, Gartner recommends following validated implementation guides that are co-certified by both the networking and GPU suppliers. The value of following a co-certified design is that both suppliers should stand behind deployments done to that specification, ultimately reducing the likelihood of issues and decreasing mean time to repair (MTTR) when issues do occur.


This article is based on an excerpt of the Gartner report, Key networking practices to support AI workloads in the data center. Andrew Lerner is a distinguished vice-president analyst at Gartner.
