Given that the objective behind any AI initiative is to improve productivity, it is somewhat ironic that performance is so often undermined by a neglected element of the stack. When companies build their own AI systems, the focus typically falls on models, data, GPUs and software frameworks, yet one critical component of successful deployment is routinely overlooked: the network itself.
AI workloads behave very differently from traditional enterprise applications. Distributed training generates intense east-west traffic between servers, not just north-south traffic between users and data centres. Large volumes of data move continuously between GPUs and nodes, making ultra-low and, crucially, deterministic latency essential. Even small amounts of jitter can slow synchronisation and extend training times because distributed AI workloads rely on tightly coordinated communication between nodes. This means that delays in even a single data exchange can hold up the entire process and impact AI’s performance and the end-user’s quality of experience.
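To make that concrete, here is a minimal sketch, in Python with purely illustrative numbers, of how a synchronous training step is gated by its slowest participant: if gradient exchanges nominally take 2 ms but a small fraction of them are delayed by network jitter, every node waits for the late one before the next step can begin.

```python
import random

# Toy model of synchronous distributed training: a step finishes only when the
# slowest node's gradient exchange finishes, so jitter on any one node stalls all.
NODES = 64
STEPS = 1_000
BASE_EXCHANGE_MS = 2.0      # nominal exchange time (illustrative assumption)
JITTER_MS = 0.5             # extra delay when jitter strikes (illustrative)
JITTER_PROBABILITY = 0.01   # 1% of individual exchanges are delayed

random.seed(42)
total_ms = 0.0
for _ in range(STEPS):
    per_node = [
        BASE_EXCHANGE_MS + (JITTER_MS if random.random() < JITTER_PROBABILITY else 0.0)
        for _ in range(NODES)
    ]
    total_ms += max(per_node)  # the whole step waits for the slowest node

ideal_ms = STEPS * BASE_EXCHANGE_MS
print(f"ideal run time:  {ideal_ms:.0f} ms")
print(f"with jitter:     {total_ms:.0f} ms (+{100 * (total_ms / ideal_ms - 1):.1f}%)")
```

Under these assumptions, jitter that touches only around one percent of individual exchanges stretches the overall run time by roughly ten percent, because any single late node holds up all sixty-four.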
Throughput is equally important. AI clusters often run links at sustained high utilisation, where even minor packet loss can trigger retransmissions and degrade performance. Lossless or near-lossless behaviour, intelligent congestion management and carefully designed spine–leaf architectures become fundamental rather than optional.
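To put rough numbers on how sensitive throughput is to loss, the back-of-the-envelope sketch below applies the well-known Mathis steady-state TCP approximation. The MSS, round-trip time and loss rates are illustrative assumptions rather than measurements, and RDMA transports such as RoCE follow different mechanics, but the broad sensitivity to loss is the point.

```python
from math import sqrt

# Mathis et al. steady-state TCP throughput approximation:
#   throughput <= (MSS / RTT) * (C / sqrt(p)),  with C ~= 1.22
# Values below are illustrative for an intra-data-centre path, not measurements.
MSS_BYTES = 1460        # typical Ethernet MSS
RTT_S = 100e-6          # assumed 100 microsecond round-trip time
C = 1.22

for loss_rate in (1e-6, 1e-5, 1e-4, 1e-3):
    throughput_bps = (MSS_BYTES * 8 / RTT_S) * (C / sqrt(loss_rate))
    print(f"loss {loss_rate:.0e} -> throughput ceiling ~{throughput_bps / 1e9:,.1f} Gbit/s")
```

With these assumed figures, a loss rate of just 0.01% caps a single flow at roughly 14 Gbit/s, a small fraction of a 100 or 400 Gbit/s link, which is why lossless or near-lossless behaviour matters so much.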
There are also several lesser-known network characteristics that can influence the success of AI deployments. From subtle traffic behaviours within high-performance compute clusters, to timing and synchronisation considerations, and the way large datasets move across infrastructure, these factors are often only discovered once organisations begin scaling their AI systems. For example:
Timing and synchronisation
Many AI training environments rely on precise timing across nodes, meaning clock synchronisation protocols such as PTP (Precision Time Protocol) become critical.
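The arithmetic PTP uses to estimate a node's clock offset is straightforward, and the sketch below works through the standard two-way exchange with made-up nanosecond timestamps.

```python
# PTP two-way exchange: the master sends Sync at t1, the slave receives it at t2,
# the slave sends Delay_Req at t3, and the master receives it at t4.
# Assuming a symmetric path, the slave's offset and the one-way delay are:
#   offset = ((t2 - t1) - (t4 - t3)) / 2
#   delay  = ((t2 - t1) + (t4 - t3)) / 2
# Timestamps below are illustrative nanosecond values, not captured data.

def ptp_offset_and_delay(t1_ns: int, t2_ns: int, t3_ns: int, t4_ns: int):
    forward = t2_ns - t1_ns   # master -> slave: offset plus path delay
    reverse = t4_ns - t3_ns   # slave -> master: path delay minus offset
    offset_ns = (forward - reverse) / 2
    delay_ns = (forward + reverse) / 2
    return offset_ns, delay_ns

offset, delay = ptp_offset_and_delay(
    t1_ns=1_000_000, t2_ns=1_001_500, t3_ns=1_050_000, t4_ns=1_050_700
)
print(f"estimated offset: {offset:.0f} ns, path delay: {delay:.0f} ns")
```

Because the formula assumes the forward and reverse paths are symmetric, queuing jitter or asymmetry on either leg feeds directly into the offset estimate, another reason deterministic latency matters.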
Microburst traffic
AI clusters often produce microbursts of traffic when gradient updates are exchanged between nodes. These bursts can exceed switch buffer capacity even when average utilisation appears safe, as the sketch after the incast example below illustrates.
Incast patterns
During training phases, multiple workers may simultaneously send updates to a single parameter server or aggregation node, creating incast congestion that can overwhelm switch queues.
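A toy calculation, with purely illustrative figures, shows how the microburst and incast behaviours described above bite even on a link that looks nearly idle: thirty-two workers each pushing a one-megabyte gradient shard towards a single aggregation node at the same moment.

```python
# Toy model of an incast microburst: N workers each push a gradient shard towards
# one aggregation node at roughly the same moment. All figures are illustrative.
WORKERS = 32
SHARD_BYTES = 1 * 1024 * 1024        # 1 MiB per worker per exchange (assumed)
EGRESS_GBPS = 100                    # egress link towards the aggregation node
BURST_WINDOW_S = 1e-3                # the shards all arrive within ~1 ms
EXCHANGE_INTERVAL_S = 100e-3         # an exchange every 100 ms (assumed)
BUFFER_BYTES = 16 * 1024 * 1024      # buffer available to this port (assumed)

arriving = WORKERS * SHARD_BYTES
drained_during_burst = EGRESS_GBPS * 1e9 / 8 * BURST_WINDOW_S
peak_queue = max(0.0, arriving - drained_during_burst)
avg_util = (arriving * 8) / (EGRESS_GBPS * 1e9 * EXCHANGE_INTERVAL_S)

print(f"average egress utilisation: {avg_util:.1%}")
print(f"peak queue during burst:    {peak_queue / 1e6:.1f} MB "
      f"(buffer: {BUFFER_BYTES / 1e6:.1f} MB)")
print("drops likely" if peak_queue > BUFFER_BYTES else "fits in buffer")
```

In this example the egress port averages under three percent utilisation, yet the synchronised burst briefly needs more than 20 MB of buffering, exceeding the assumed per-port buffer, so packets are dropped, retransmitted and the whole exchange slows down.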
Because of this, AI environments require dedicated network testing and validation processes. Simply applying traditional performance checks is rarely sufficient because networks need to be evaluated under the types of traffic patterns and load conditions that AI workloads actually create.
This includes validating precise timing and synchronisation across nodes to minimise jitter, analysing how the network handles sudden microbursts of high-intensity traffic that can overwhelm buffers, and stress-testing incast scenarios where many nodes transmit simultaneously to a single point, potentially triggering congestion.
Beyond this, companies must also consider how networks perform under sustained high utilisation, how they respond to even minimal packet loss, whether bandwidth is consistently and fairly distributed across highly parallel workloads, and how resilient the network remains during failures or disruption. Together, these factors highlight that validating AI networks requires a far broader and more specialised approach than traditional testing alone. In short, the network is no longer just supporting infrastructure – it is a key enabler of AI system performance and of a productive quality of experience for your users.
These are some of the issues network engineers encounter when AI workloads begin to scale. At Future Networks LIVE on 28th April, we’ll be exploring these behaviours in more detail, including how organisations are testing and validating their infrastructure.
Click here for the full agenda and registration: www.redhelix.com/media/future-networks-live-agenda/