Scale-Up, Scale-Out, and Scale-Across: The Three Core Scaling Strategies in AI Data Centers

With the rapid advancement of artificial intelligence models, model sizes are growing at an unprecedented pace. From large language models to multimodal models, the number of parameters, training data volume, and computational requirements have far exceeded what traditional single-node or single-cluster architectures can handle.

In this context, relying on a single scaling approach is no longer sufficient to meet the performance, efficiency, and resource utilization demands of modern AI systems. Neither adding GPUs to a single machine nor adding more compute nodes can, on its own, address the communication overhead and system bottlenecks of large-scale model training.

As a result, three core scaling strategies have gradually emerged in modern AI data center architectures: Scale-Up (vertical scaling), Scale-Out (horizontal scaling), and Scale-Across (cross-domain scaling). These approaches address computational capacity and system scale from different dimensions, and together form the foundation of today’s AI infrastructure design and optimization.

What is Scale-Up?

Scale-Up refers to increasing computational capacity by adding more resources within a single system. In AI infrastructure, this typically means increasing the number of GPUs, memory capacity, and high-speed interconnect bandwidth within a single node (server), rather than distributing workloads across multiple machines.

The core goal of Scale-Up is to maximize intra-node performance, ensuring that communication between accelerators is as fast and low-latency as possible.

Evolution of GPU Servers

The evolution of GPU servers has been a key driver behind modern Scale-Up architectures.

Early GPU systems typically consisted of only 1–2 GPUs connected via PCIe. However, this approach had clear limitations in terms of bandwidth and latency, making it insufficient for large-scale AI workloads. As model sizes continued to grow, server designs evolved to support 4 GPUs, 8 GPUs, or even more within a single node.

Modern AI servers increasingly rely on tightly integrated multi-GPU architectures, where multiple GPUs can be treated as a unified pool of compute resources. This design enables large-scale tensor operations and model parallelism to be executed more efficiently within a single machine, reducing the need for cross-server communication.

The Importance of NVLink and NVSwitch

NVLink plays a critical role in Scale-Up architectures. Compared to traditional PCIe connections, NVLink provides significantly higher bandwidth and lower latency, enabling more efficient data exchange between GPUs.

However, in modern high-end GPU servers, point-to-point NVLink connections alone are no longer sufficient for scaling to many GPUs. As a result, NVLink is typically combined with NVSwitch to build a fully connected high-speed switching fabric within a single node.

The introduction of NVSwitch allows any GPU to communicate with any other GPU at high bandwidth, effectively merging multiple discrete GPUs into a unified logical compute pool. This architecture is essential for AI training workloads that require frequent large-scale inter-GPU communication, such as tensor parallelism and global gradient synchronization.
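To see why tensor parallelism leans so heavily on the intra-node fabric, the following minimal sketch splits one matrix multiplication column-wise across several simulated GPUs. It is plain Python with no real devices involved; the function names (`split_columns`, `all_gather_columns`, `column_parallel_matmul`) are illustrative assumptions, not any library's API. The final concatenation stands in for the all-gather that would cross NVLink/NVSwitch on real hardware.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain dense matrix multiply: (m x k) @ (k x n) -> (m x n)."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def split_columns(w: Matrix, parts: int) -> List[Matrix]:
    """Split the weight matrix into `parts` equal column blocks, one per GPU."""
    n_cols = len(w[0])
    if n_cols % parts != 0:
        raise ValueError("column count must be divisible by the number of GPUs")
    width = n_cols // parts
    return [[row[p * width:(p + 1) * width] for row in w] for p in range(parts)]


def all_gather_columns(shards: List[Matrix]) -> Matrix:
    """Concatenate per-GPU outputs column-wise (stand-in for an all-gather)."""
    n_rows = len(shards[0])
    return [[v for shard in shards for v in shard[r]] for r in range(n_rows)]


def column_parallel_matmul(x: Matrix, w: Matrix, num_gpus: int) -> Matrix:
    """Each simulated GPU multiplies x by its own column block of w."""
    partial_outputs = [matmul(x, shard) for shard in split_columns(w, num_gpus)]
    # On real hardware this step is the inter-GPU communication.
    return all_gather_columns(partial_outputs)


if __name__ == "__main__":
    x = [[1.0, 2.0], [3.0, 4.0]]
    w = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]
    print(column_parallel_matmul(x, w, num_gpus=2) == matmul(x, w))
```

Every layer that is split this way ends in a communication step, which is why the speed of that step, rather than raw compute, tends to set the pace.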

Without high-speed interconnect technologies like NVLink and NVSwitch, the overall performance of Scale-Up systems would be severely constrained. The primary bottleneck would shift from compute capability to inefficient GPU-to-GPU communication.

The essence of Scale-Up is to integrate multiple GPUs within a single node into a unified, low-latency, high-bandwidth computing system by increasing GPU count and leveraging high-speed interconnects such as NVLink and NVSwitch.

| Dimension | PCIe | NVLink | NVSwitch |
| --- | --- | --- | --- |
| Technology Type | General-purpose I/O bus | High-speed GPU interconnect | GPU switching fabric |
| Design Goal | Connect diverse devices (GPU, SSD, NIC) | Accelerate GPU-to-GPU communication | Build a unified multi-GPU compute domain |
| Typical Topology | Tree (Root Complex-based) | Point-to-point (Mesh/Ring) | Fully connected Crossbar |
| AI Optimization | ❌ No | ✅ Yes | ✅ Yes (advanced) |
| GPU Communication Path | Via CPU / PCIe switch | Direct GPU-to-GPU | Through NVSwitch ASIC |
| Latency | Relatively high | Low | Extremely low |
| Bandwidth | Gen4 x16 ≈ 32 GB/s per direction; Gen5 x16 ≈ 64 GB/s per direction | NVLink 4 ≈ 25 GB/s per direction per link (≈ 50 GB/s bidirectional), aggregated across multiple links | Aggregate bandwidth up to TB/s scale |
| Scalability | Limited (mainly inter-node) | Moderate (within a node, limited GPU count) | High (8/16+ GPUs fully interconnected) |
| Communication Model | CPU-centric | GPU Peer-to-Peer | Global GPU memory fabric |
| All-to-All Efficiency | ❌ Poor | ⚠️ Moderate | ✅ Excellent |
| Unified Memory Support | ❌ No | ⚠️ Partial | ✅ Strong (coherent memory space) |
| Typical Use Cases | General-purpose servers | 4–8 GPU servers | DGX / HGX / large-scale AI training |
| Representative Architectures | Standard x86 servers | NVLink Bridge / HGX platforms | NVSwitch-based systems (e.g., DGX H100) |
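To make the bandwidth gap concrete, here is a back-of-the-envelope sketch that estimates how long it takes to move a fixed payload over each interconnect. The bandwidth values are approximate per-direction figures and should be read as assumptions: roughly 32 and 64 GB/s for PCIe Gen4/Gen5 x16, and roughly 450 GB/s for the aggregated NVLink links of an H100-class GPU. The 10 GB payload is hypothetical, and latency and protocol overhead are ignored.

```python
# Assumed, approximate per-direction bandwidths in GB/s.
BANDWIDTH_GB_PER_S = {
    "pcie_gen4_x16": 32.0,
    "pcie_gen5_x16": 64.0,
    "nvlink4_aggregate": 450.0,  # assumption: all NVLink links of one H100-class GPU
}


def transfer_time_ms(payload_gb: float, bandwidth_gb_per_s: float) -> float:
    """Ideal transfer time in milliseconds, ignoring latency and overhead."""
    if bandwidth_gb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return payload_gb / bandwidth_gb_per_s * 1000.0


if __name__ == "__main__":
    payload_gb = 10.0  # hypothetical payload, e.g. a batch of gradients
    for name, bandwidth in BANDWIDTH_GB_PER_S.items():
        print(f"{name:>18}: {transfer_time_ms(payload_gb, bandwidth):8.1f} ms")
```

Even as an idealized estimate, the order-of-magnitude difference shows why frequent GPU-to-GPU exchanges are kept on the NVLink/NVSwitch fabric whenever possible.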

What is Scale-Out?

Scale-Out refers to expanding AI computing capacity by connecting multiple servers across a data center network, rather than relying on resources within a single machine. In modern AI infrastructure, this approach is essential for training large-scale models that require coordination across multiple nodes and thousands or even tens of thousands of GPUs.

Unlike Scale-Up, the core focus of Scale-Out is not single-node performance, but cross-server communication capability and network architecture design, because the efficiency of distributed training heavily depends on the speed of data exchange between nodes.

Core Enabling Technologies

• 800G High-Speed Ethernet

In Scale-Out architectures, 800G Ethernet switches serve as the core infrastructure of the data center network, carrying massive volumes of inter-server GPU communication traffic. As AI models continue to grow, the bandwidth demand between GPUs increases rapidly, and 800G networks provide the necessary bandwidth foundation to prevent the network from becoming a system bottleneck in large-scale distributed training.

• RoCE Technology

To achieve low-latency and high-throughput communication, AI clusters widely adopt RoCE (RDMA over Converged Ethernet). RoCE carries RDMA traffic over Ethernet, so a network adapter can read and write remote memory directly. Combined with GPUDirect RDMA, this extends to GPU memory on different servers, bypassing the CPU and the operating system kernel path and thereby significantly reducing communication overhead.

This mechanism is particularly critical in distributed training operations such as gradient synchronization and parameter aggregation, where every worker must finish exchanging data before the next training step can proceed. It effectively reduces communication latency and improves overall training efficiency.
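Gradient synchronization is commonly implemented as an all-reduce. The sketch below simulates one well-known variant, ring all-reduce, in plain Python: a reduce-scatter phase followed by an all-gather phase, each taking N-1 steps for N workers. It is an illustration under simplifying assumptions (lists instead of GPU buffers, no real network), and `ring_all_reduce_mean` is a hypothetical name rather than a library function.

```python
from typing import List


def ring_all_reduce_mean(grads: List[List[float]]) -> List[List[float]]:
    """Average equal-length gradient vectors across workers using a ring."""
    n = len(grads)
    length = len(grads[0])
    if length % n != 0:
        raise ValueError("vector length must be divisible by the worker count")
    chunk = length // n
    bufs = [list(g) for g in grads]  # each worker's local buffer

    def span(c: int) -> slice:
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1, reduce-scatter: after n-1 steps, worker i holds the full sum
    # of chunk (i + 1) % n.
    for step in range(n - 1):
        # Snapshot all sends first so the step behaves as if simultaneous.
        sends = [(i, (i - step) % n, bufs[i][span((i - step) % n)]) for i in range(n)]
        for src, c, data in sends:
            dst = (src + 1) % n
            bufs[dst][span(c)] = [a + b for a, b in zip(bufs[dst][span(c)], data)]

    # Phase 2, all-gather: circulate the fully reduced chunks around the ring.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, bufs[i][span((i + 1 - step) % n)]) for i in range(n)]
        for src, c, data in sends:
            bufs[(src + 1) % n][span(c)] = data

    return [[v / n for v in buf] for buf in bufs]


if __name__ == "__main__":
    local_grads = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
    for worker, averaged in enumerate(ring_all_reduce_mean(local_grads)):
        print(worker, averaged)
```

Each worker sends and receives only one chunk per step, so the per-worker traffic stays roughly constant as the ring grows; what grows is the number of latency-bound steps, which is where low-latency RDMA transports help.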

• InfiniBand (IB)

InfiniBand is a high-performance networking architecture designed specifically for HPC and AI workloads. It provides more stable low latency and higher communication efficiency compared to Ethernet-based solutions. Its drawback is a separate protocol stack and an ecosystem that is, in practice, dominated by a single vendor, which raises cost and lock-in concerns.

Leaf–Spine Network Topology

Scale-Out networks are typically built on a Leaf–Spine architecture to support non-blocking forwarding of large-scale east-west traffic.

  • Leaf switches connect compute nodes (servers or GPU nodes)
  • Spine switches provide high-bandwidth interconnectivity across the entire network

This design ensures highly symmetric network paths, making communication latency more predictable while delivering near-uniform bandwidth across the cluster. As a result, it is well-suited for large-scale AI training workloads.
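A two-tier Leaf–Spine fabric can be sized with simple arithmetic. The sketch below assumes every port runs at the same speed and that each leaf has exactly one uplink to every spine; real deployments often differ (mixed port speeds, multiple links per leaf-spine pair, additional tiers), so treat the function and its parameter names as illustrative.

```python
from typing import Dict


def leaf_spine_capacity(leaf_ports: int, spine_ports: int,
                        downlinks_per_leaf: int) -> Dict[str, float]:
    """Size a two-tier fabric with one uplink from each leaf to each spine."""
    if not 0 < downlinks_per_leaf < leaf_ports:
        raise ValueError("downlinks_per_leaf must be between 1 and leaf_ports - 1")
    uplinks_per_leaf = leaf_ports - downlinks_per_leaf
    max_leaves = spine_ports  # each spine port connects to one leaf
    return {
        "spines": uplinks_per_leaf,  # one spine per leaf uplink
        "max_leaves": max_leaves,
        "max_hosts": max_leaves * downlinks_per_leaf,
        # 1.0 means non-blocking; above 1.0 the uplinks are oversubscribed.
        "oversubscription": downlinks_per_leaf / uplinks_per_leaf,
    }


if __name__ == "__main__":
    # Hypothetical 64-port switches, half the leaf ports facing servers.
    print(leaf_spine_capacity(leaf_ports=64, spine_ports=64, downlinks_per_leaf=32))
    # Same switches, but 48 of the 64 leaf ports facing servers.
    print(leaf_spine_capacity(leaf_ports=64, spine_ports=64, downlinks_per_leaf=48))
```

Under these assumptions, splitting leaf ports evenly between servers and spines yields a non-blocking fabric, while giving more ports to servers raises host count at the cost of oversubscribed uplinks.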

AI Traffic Model as the Driving Force

From a workload perspective, Scale-Out architectures are driven by AI-specific traffic patterns, which are fundamentally different from traditional internet applications.

AI training typically generates large-scale All-to-All and Many-to-Many communication patterns, such as:

  • Gradient All-Reduce: Each GPU computes its own gradients, but all results must be synchronized and averaged across all GPUs to keep the model consistent.
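  • Parameter Exchange: When model parameters are sharded across GPUs, each GPU must fetch the shards it does not hold before it can compute, so the parameters themselves (not just outputs or gradients) travel over the network.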
  • Tensor Parallelism: A large model is split across multiple GPUs, which jointly compute different parts of the same operation.
  • MoE (Mixture-of-Experts) Routing Communication: Data is dynamically routed to different GPUs (experts), rather than following a fixed computation path.

These communication patterns place extremely high demands on network bandwidth, latency, and congestion control, making network performance a critical factor in overall training efficiency.
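The all-to-all character of MoE routing can be illustrated with a small traffic-matrix simulation. The sketch assumes one expert per GPU and uses random top-k selection as a stand-in for a learned router, so the numbers only show the shape of the traffic, not realistic load. All names here are hypothetical.

```python
import random
from typing import List


def moe_traffic_matrix(num_gpus: int, tokens_per_gpu: int, top_k: int,
                       seed: int = 0) -> List[List[int]]:
    """Count tokens sent from each source GPU to each expert GPU.

    Assumes one expert per GPU and random top-k routing (a placeholder for
    a learned gating network). Requires top_k <= num_gpus.
    """
    rng = random.Random(seed)
    traffic = [[0] * num_gpus for _ in range(num_gpus)]
    for src in range(num_gpus):
        for _ in range(tokens_per_gpu):
            # Each token is dispatched to top_k distinct experts.
            for dst in rng.sample(range(num_gpus), top_k):
                traffic[src][dst] += 1
    return traffic


if __name__ == "__main__":
    matrix = moe_traffic_matrix(num_gpus=4, tokens_per_gpu=1000, top_k=2)
    for src, row in enumerate(matrix):
        print(f"GPU {src} -> experts: {row}")
```

Nearly every cell of the matrix is non-zero, meaning every GPU talks to every other GPU in the same step. That dense pattern is what stresses bisection bandwidth and congestion control.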

What is Scale-Across?

Scale-Across refers to an architecture that extends AI workloads beyond a single data center by connecting multiple compute clusters distributed across different data centers or cloud regions. Unlike Scale-Up, which focuses on scaling within a single node, and Scale-Out, which focuses on scaling within a single data center, Scale-Across builds a unified AI infrastructure composed of geographically distributed clusters.

In this architecture, each cluster operates its own GPU resources and training tasks, while collectively participating in a global AI training or inference workflow. As a result, Scale-Across is critical for frontier AI systems that require extremely large-scale compute capacity, high availability, and multi-region deployment capabilities.

• Multi-Cluster AI Systems and Hybrid Interconnects

The foundation of Scale-Across lies in multi-cluster AI systems, which are connected through hybrid cloud and data center interconnect (Hybrid Cloud / DC Interconnect) technologies. These clusters may be distributed across on-premises data centers, public cloud regions, and edge environments, collectively forming a distributed computing network.

The interconnect layer typically relies on wide-area networks (WAN), dedicated private links, and high-bandwidth backbone networks. Compared to intra–data center networks, these cross-domain links have significantly higher latency and lower effective bandwidth, making communication constraints much more stringent.

In practice, Scale-Across architectures often combine on-prem GPU clusters with cloud-based compute resources, dynamically scheduling workloads across regions based on performance, cost, and resource availability.
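A placement decision of this kind can be sketched as a simple scoring function. The fields, weights, and cluster names below are hypothetical; real schedulers weigh many more factors (data gravity, quotas, preemption, egress fees), so this is only a minimal illustration of trading cost against latency subject to availability.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Cluster:
    name: str
    free_gpus: int
    cost_per_gpu_hour: float  # assumed unit: USD
    rtt_ms_to_data: float     # round-trip time to the training data


def pick_cluster(clusters: List[Cluster], gpus_needed: int,
                 cost_weight: float = 1.0,
                 latency_weight: float = 0.05) -> Optional[Cluster]:
    """Return the lowest-score cluster with enough free GPUs, or None."""
    candidates = [c for c in clusters if c.free_gpus >= gpus_needed]
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda c: cost_weight * c.cost_per_gpu_hour
        + latency_weight * c.rtt_ms_to_data,
    )


if __name__ == "__main__":
    fleet = [
        Cluster("on-prem", free_gpus=64, cost_per_gpu_hour=1.5, rtt_ms_to_data=1.0),
        Cluster("cloud-a", free_gpus=512, cost_per_gpu_hour=3.0, rtt_ms_to_data=40.0),
        Cluster("cloud-b", free_gpus=1024, cost_per_gpu_hour=2.5, rtt_ms_to_data=80.0),
    ]
    for need in (32, 256, 4096):
        chosen = pick_cluster(fleet, need)
        print(need, "->", chosen.name if chosen else "no capacity")
```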

• Workload Orchestration

Workload orchestration is a key capability in Scale-Across architectures, responsible for coordinating AI task execution across multiple clusters and heterogeneous environments.

It includes distributed task scheduling, cross-region model partitioning, data locality management, and cross-cluster synchronization mechanisms. Due to the much higher latency of WAN compared to intra–data center communication, orchestration systems must carefully balance local compute efficiency with cross-cluster coordination to avoid overall performance degradation.

In practical systems, Scale-Across typically adopts a hierarchical orchestration approach: each data center independently optimizes local training efficiency, while a higher-level control plane manages global model updates and cross-cluster resource scheduling.
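One way to picture the hierarchical approach is a loop in which each cluster runs several local optimization steps and only occasionally synchronizes over the WAN. The toy sketch below uses a one-parameter quadratic objective per cluster and periodic averaging of the parameter. It is an assumption-laden illustration of the idea, not a description of how any particular operator trains models.

```python
from typing import List


def run_local_steps(w: float, target: float, lr: float, steps: int) -> float:
    """Gradient descent on (w - target)^2 inside one cluster (no WAN traffic)."""
    for _ in range(steps):
        grad = 2.0 * (w - target)
        w -= lr * grad
    return w


def hierarchical_train(cluster_targets: List[float], rounds: int,
                       local_steps_per_round: int, lr: float = 0.1) -> float:
    """Alternate fast local training with one global average per round."""
    w_global = 0.0
    for _ in range(rounds):
        # Each data center optimizes independently from the shared model.
        local_models = [
            run_local_steps(w_global, target, lr, local_steps_per_round)
            for target in cluster_targets
        ]
        # The control plane performs a single cross-cluster sync per round.
        w_global = sum(local_models) / len(local_models)
    return w_global


if __name__ == "__main__":
    # Hypothetical: three clusters whose local data pull toward 1, 2, and 3.
    print(hierarchical_train([1.0, 2.0, 3.0], rounds=20, local_steps_per_round=10))
```

The number of local steps per round is the key knob: more local steps mean fewer WAN synchronizations, at the risk of the clusters' models drifting apart between syncs.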

Scale-Across extends AI infrastructure beyond a single data center by connecting multiple compute clusters, enabling unified orchestration and coordination at a global scale.

However, this capability comes at the cost of significantly increased complexity in cross-cluster communication, synchronization, and system coordination, with the primary limitation being WAN latency and bandwidth constraints.

It should also be emphasized that Scale-Across capabilities are currently only partially implemented by a small number of hyperscalers such as Google, Microsoft/OpenAI, AWS, and Meta. The technology is still in an early stage, mainly used for multi-region resource scheduling and distributed deployment, rather than true cross–data center synchronous training.

Why AI Needs Three Types of Scaling

Modern AI systems cannot rely on a single scaling strategy, because performance bottlenecks arise in fundamentally different dimensions. As model sizes continue to grow, AI infrastructure has evolved into a multi-layer scaling system where Scale-Up, Scale-Out, and Scale-Across are all essential, jointly enabling efficient end-to-end computation and global deployment.

• Scale-Up: Improving Single-Node Performance

Scale-Up is the foundational layer of AI compute efficiency. It primarily increases the number of GPUs within a single compute node and leverages high-speed interconnect technologies such as NVLink and NVSwitch to boost performance.

The core goal of this layer is to minimize intra-node communication latency, allowing multiple GPUs to function as a unified compute engine. Without strong Scale-Up capability, the single node becomes a performance bottleneck before distributed training even begins.

• Scale-Out: Expanding Training Scale

Scale-Out extends compute capacity by connecting multiple servers within a data center, making large-scale model training possible. It relies on high-speed networking technologies such as 800G Ethernet, RoCE, and InfiniBand to interconnect thousands of GPUs.

This layer primarily supports distributed training paradigms such as data parallelism, tensor parallelism, and pipeline parallelism. However, its efficiency is highly dependent on network bandwidth, latency, and congestion control capabilities.

• Scale-Across: Expanding System Boundaries

Scale-Across further extends AI infrastructure beyond a single data center by connecting multiple clusters across regions or cloud environments, enabling global-scale compute integration.

Unlike Scale-Up and Scale-Out, Scale-Across is primarily constrained by WAN latency and cross-region synchronization overhead. As a result, it is more commonly used for task distribution, multi-region deployment, and large-scale resource orchestration rather than high-frequency synchronous training.

| Dimension | Scale-Up | Scale-Out | Scale-Across |
| --- | --- | --- | --- |
| Scaling Scope | Within a single server / node | Across multiple servers within a data center | Across multiple data centers / cloud regions |
| Primary Goal | Maximize single-node compute performance | Expand cluster-level compute capacity | Build global-scale AI computing systems |
| Compute Unit | Multi-GPU within one server | Multi-server GPU clusters | Multi-cluster across regions / clouds |
| Communication Scope | GPU ↔ GPU (intra-node) | Server ↔ Server (intra–data center) | Data Center ↔ Data Center |
| Key Interconnect Technologies | NVLink / NVSwitch | 800G Ethernet / InfiniBand / RoCE | WAN / Cloud Interconnect / Backbone Networks |
| Typical Topology | NVSwitch fabric | Leaf-Spine architecture | Hierarchical / multi-region mesh |
| Latency Level | Nanoseconds to microseconds | Microseconds | Milliseconds |
| Bandwidth Characteristics | Extremely high (intra-node) | High (data center scale) | Relatively limited and variable |
| Main Bottlenecks | GPU interconnect limits | Network congestion / all-to-all traffic | WAN latency / synchronization overhead |
| AI Traffic Patterns | Tensor / pipeline / MoE (intra-node) | All-to-all / gradient synchronization | Parameter synchronization / distributed consistency |
| Optimization Focus | GPU interconnect efficiency | Network bandwidth + congestion control | Scheduling + synchronization + geo-latency optimization |
| Typical Use Cases | Single-node large model training | Large-scale distributed training | Global AI training / multi-cloud AI systems |
| Core Value | Extreme single-node performance | Scalable training across clusters | Global resource aggregation and coordination |
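The latency and bandwidth rows can be combined into a rough estimate of synchronization time per tier. The sketch below applies the standard latency-bandwidth cost model for ring all-reduce, T = 2(N-1)·α + 2·((N-1)/N)·S/B, where α is per-step latency, S the payload size, and B the per-link bandwidth. The tier parameters are illustrative assumptions chosen only to reflect the orders of magnitude discussed above, not measurements.

```python
from typing import Dict, Tuple


def ring_all_reduce_seconds(num_workers: int, payload_bytes: float,
                            latency_s: float,
                            bandwidth_bytes_per_s: float) -> float:
    """Estimated ring all-reduce time under the latency-bandwidth model."""
    if num_workers < 2:
        return 0.0
    steps = 2 * (num_workers - 1)  # reduce-scatter plus all-gather
    latency_term = steps * latency_s
    bandwidth_term = steps / num_workers * payload_bytes / bandwidth_bytes_per_s
    return latency_term + bandwidth_term


# (workers, per-step latency in s, per-link bandwidth in bytes/s); all assumed.
TIERS: Dict[str, Tuple[int, float, float]] = {
    "scale_up": (8, 2e-6, 450e9),         # GPUs on an NVLink/NVSwitch fabric
    "scale_out": (8, 10e-6, 50e9),        # servers with one 400 Gb/s NIC each
    "scale_across": (8, 30e-3, 1.25e9),   # sites sharing a 10 Gb/s WAN path
}

if __name__ == "__main__":
    payload = 1e9  # hypothetical 1 GB of gradients
    for tier, (workers, latency, bandwidth) in TIERS.items():
        seconds = ring_all_reduce_seconds(workers, payload, latency, bandwidth)
        print(f"{tier:>12}: {seconds * 1000:10.2f} ms")
```

Under these assumed numbers, the same synchronization grows from milliseconds inside a node to seconds across sites, which is consistent with the article's point that Scale-Across is rarely used for high-frequency synchronous training.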

Summary

Together, these three scaling approaches form a complete modern AI infrastructure stack:

  • Scale-Up: Optimizing single-node compute efficiency
  • Scale-Out: Scaling large distributed training within a data center
  • Scale-Across: Extending AI systems beyond a single data center to global scale

Only when all three work in coordination can AI systems achieve a balance between compute efficiency, scalability, and global deployment capability.
