Optimizing Checkpoint Bandwidth for LLM Training
Summary: We analyzed 85,000 checkpoints across real-world systems to see how production training workloads actually interact with storage, and we found that global checkpoint bandwidth requirements are modest, typically well below 1 TB/s even for trillion-parameter-scale models. To make these findings actionable, we developed a simple sizing model that relates GPU scale and reliability to the required global checkpoint bandwidth. This offers system designers a practical way to provision storage for efficient checkpointing while preserving as much power, space, and budget as possible for GPUs.
The field of training LLMs at scale is evolving rapidly, and AI service providers are struggling to design infrastructure that can scale to meet the current and future needs of frontier models.
The scale and throughput of GPU clusters limit the rate at which new models can be developed, so it is reasonable to expect their associated network and storage subsystems to keep pace. However, although GPU designers publish reference architectures that quantify the I/O bandwidth required to keep their accelerators fully saturated, these guidelines capture only half of the design equation: the supply side, or what GPUs could consume under ideal conditions. They do not account for how production workloads actually interact with storage.
VAST operates the data platform for many of the world’s largest AI training clusters, giving us visibility into the other half of the equation: the demand side. While we do not observe training on the GPU nodes directly, our systems capture the I/O patterns generated during pretraining and fine-tuning. This allows us to characterize checkpoint frequency, duration, and bandwidth, and to assess how much checkpointing overlaps with compute. By quantifying what training actually demands of storage, we can help customers optimally balance their investments in GPUs, storage, networking, power, cooling, and space.
To ground this analysis of the demand-side view, we analyzed checkpoint behavior across dozens of frontier-scale training runs.
Characterizing 85,000 AI model checkpoints
We surveyed 40 large (i.e., more than 1,024-GPU) production training runs, covering more than 85,000 checkpoints across 18 clusters at AI cloud providers. Model sizes ranged from 45 billion to over 1 trillion parameters, spanning modest-to-frontier-scale LLMs¹. What we observed is that checkpoint bandwidth per GPU decreases as model size grows.
This trend reflects the mechanics of data-parallel training: although larger models require more GPUs to train, only a single data-parallel replica's state must be written when checkpointing. As a result, checkpoint bandwidth does not scale linearly with cluster size, and per-GPU demand drops as more data-parallel replicas are spread across more GPUs.
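To make the arithmetic concrete, here is a small illustrative sketch (the cluster sizes are hypothetical, not drawn from the runs above): the checkpoint size is fixed by the model, so the per-GPU share shrinks as data-parallel replicas are added.

```python
# Illustrative sketch with hypothetical numbers: checkpoint size is set by the
# model (assumed here at 14 bytes per parameter, as in this analysis), so the
# per-GPU share of checkpoint traffic falls as data-parallel replicas are added.
BYTES_PER_PARAM = 14

def per_gpu_checkpoint_gb(params_billion: float, gpus_per_replica: int, replicas: int) -> float:
    """Average checkpoint gigabytes per GPU; only one replica's state is written."""
    checkpoint_gb = params_billion * BYTES_PER_PARAM  # e.g., 400B params -> 5,600 GB
    return checkpoint_gb / (gpus_per_replica * replicas)

# Same 400B-parameter model on progressively larger clusters:
for replicas in (4, 16, 64):
    print(f"{replicas:>2} replicas: {per_gpu_checkpoint_gb(400, 256, replicas):6.2f} GB per GPU")
```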
In addition, we found that the absolute bandwidth required for checkpointing decreases for the largest models.
This may seem counterintuitive: larger models require more GPUs to train, and more components mean a greater chance of a failure interrupting training. To maintain efficient training of large models, therefore, we might expect checkpoints to be written more frequently and completed more quickly, increasing bandwidth demands on storage.
However, our data suggest that model trainers are relying on asynchronous checkpointing, a technique where frequent checkpoints are written quickly to node-local storage and then drained to global storage at a lower, fixed frequency. Because whole-node failures are rare compared to subcomponent issues (such as link flaps or GPU ECC errors), these local checkpoints usually survive crashes. Our analysis suggests that model trainers optimize for this by synchronizing checkpoints to shared storage only often enough to protect against the relatively rare cases where node-local checkpoints are lost.
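As a rough illustration of the pattern (a minimal sketch of the general technique, not any particular framework's API; the paths and drain cadence are made up for this example):

```python
import shutil
import threading
from pathlib import Path

LOCAL_CKPT_DIR = Path("/local_nvme/ckpts")       # hypothetical node-local NVMe path
GLOBAL_CKPT_DIR = Path("/shared_storage/ckpts")  # hypothetical global storage mount
DRAIN_EVERY_N = 10                               # drain every Nth checkpoint to global storage

def save_checkpoint(step: int, write_local) -> None:
    """Write every checkpoint to fast local NVMe; GPUs block only for this write."""
    local_path = LOCAL_CKPT_DIR / f"step_{step}"
    write_local(local_path)  # framework-specific local checkpoint write

    if step % DRAIN_EVERY_N == 0:
        # Copy to global storage in the background so training continues while it drains.
        threading.Thread(
            target=shutil.copytree,
            args=(local_path, GLOBAL_CKPT_DIR / f"step_{step}"),
            daemon=True,
        ).start()
```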
While the exact rate of whole-node failures varies (one study by the Llama team at Meta found this to be only 5% of failures), the implication is clear: global storage needs only enough bandwidth to absorb periodic drains of node-local checkpoints, not the full write rate implied by GPU throughput.
To validate this, we compared the checkpoint duration (the time required to copy a checkpoint to VAST) with the checkpoint interval (the time between successive checkpoints arriving on VAST). In one 800-billion-parameter training run, for example, the checkpoint interval was 40 minutes and the median checkpoint duration was 3.6 minutes. From this, we can determine the checkpoint overlap: the fraction of time between checkpoints during which asynchronous checkpointing is generating background I/O. In this 800-billion-parameter run, the checkpoint overlap was about 9%.
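Written out, using the same definitions:

checkpoint overlap = checkpoint duration / checkpoint interval = 3.6 min / 40 min ≈ 9%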
Across the 40 production runs we studied, almost every one kept the median checkpoint overlap under 10% of total training time.
While there is no formal threshold for what qualifies as a “tolerable” checkpoint overlap, the consistency of this result highlights how model trainers manage checkpoint overlap in practice.
Across all runs, asynchronous checkpoints drained to global storage at rates of 50–200 GB/s, with no clear correlation to model size. This indicates that even frontier-scale training can be sustained with relatively modest global checkpoint bandwidth (and could, in principle, tolerate higher checkpoint overlap and operate with even lower bandwidth). Global storage does not need to match peak GPU throughput.
Optimally balancing compute and data infrastructure for AI
This demand-side perspective has important implications for designing balanced training infrastructure across compute, networks, and storage.
In well-architected high-performance storage systems, bandwidth scales with the number of SSDs that can write in parallel, rather than with networking, software, or drive capacity. As a result, most AI training servers ship with 4 to 8 NVMe SSDs instead of a single high-capacity device. This performance comes at a cost, however: more drives increase the cost per terabyte, and in distributed storage systems they also mean more servers, more networking, and more power, space, and cooling.
So, although it may be tempting to demand the highest possible I/O performance to match theoretical GPU capabilities, that is often overkill. In fact, our analysis suggests that even trillion-parameter-scale models can train efficiently with well under 1 TB/s of checkpoint bandwidth (though many large VAST deployments are capable of TB/s-scale throughput). Checkpoints will ultimately use all of the I/O bandwidth available to them at scale, so provisioning more bandwidth will always allow checkpoints to drain faster. However, overprovisioning also consumes resources that could otherwise support more GPUs, without improving time to train; remember, the GPUs aren't blocked while checkpoints drain asynchronously.
The optimal architecture for AI training neither minimizes nor maximizes I/O bandwidth, but instead balances GPU capability, reliability, and checkpoint draining throughput to maximize training efficiency within the constraints of power, space, and cooling. Assuming a 10% checkpoint overlap, we see that the required global bandwidth is modest for all but the highest checkpoint frequencies and largest models.
Here are some guidelines to shape how system architects should think about training LLMs at scale:
- Checkpoint overlap, not GB/s, is the most relevant performance metric for checkpointing. Lower overlap is better because it reduces the likelihood of a catastrophic failure happening before a checkpoint can be synchronized to shared storage. Our findings indicate that 10% overlap is sufficient.
- Global storage should be designed to drain asynchronous checkpoints. Attempting to provision performance to match every GPU’s theoretical maximum adds little to overall training efficiency, because synchronous checkpoints only block GPUs for as long as it takes to write to node-local NVMe, and node-local NVMe capacity already scales with GPU count and model size.
- Design for job reliability, not hero bandwidth. Global storage should be used to protect against infrequent whole-node failures. Most failures affect individual GPUs or network links and can be recovered using node-local checkpoints.
- Balance performance headroom against GPU investment. Over-provisioning I/O performance reduces the number of GPUs that a datacenter can support. Storage should be an enabler of training efficiency, not an independent race for the highest I/O benchmark number.
And here’s a simple sizing model that relates model size, checkpoint interval, and acceptable overlap to the global checkpoint bandwidth required:
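In rough form, and using the same 14-bytes-per-parameter assumption as the rest of this analysis, the model reduces to:

required global bandwidth ≈ (parameters × 14 bytes) / (checkpoint interval × acceptable overlap)

The denominator is simply the window of time in which a checkpoint must finish draining to stay within the overlap budget.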
In practice, the checkpoint interval should be proportional to the rate at which the job crashes, which in turn scales with the number of GPUs in use. For example, a 405-billion-parameter model trained on 16,000 H100 GPUs experienced a mean time to interrupt of 150 minutes. At that scale, a reasonable checkpointing interval may have been one-tenth of this, or about 15 minutes². Assuming 14 bytes of checkpoint per parameter, the required checkpoint bandwidth is:
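Working through the arithmetic (rounded):

405 × 10⁹ parameters × 14 bytes ≈ 5.7 TB per checkpoint
15-minute interval × 10% overlap = 90 seconds to drain
5.7 TB ÷ 90 s ≈ 63 GB/s of global checkpoint bandwidth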
This example demonstrates how the sizing model translates cluster reliability and model design into defensible global bandwidth requirements. It can also be applied to a range of model sizes and checkpoint intervals to understand how bandwidth requirements vary as models scale.
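For example, a few lines of Python (a sketch using the same assumptions: 14 bytes per parameter and a 10% overlap budget) sweep model sizes and checkpoint intervals:

```python
BYTES_PER_PARAM = 14   # checkpoint bytes per parameter, as assumed throughout this analysis
OVERLAP = 0.10         # acceptable checkpoint overlap (fraction of the interval spent draining)

def required_bandwidth_gbps(params_billion: float, interval_min: float) -> float:
    """Global checkpoint bandwidth (GB/s) needed to drain within the overlap budget."""
    checkpoint_bytes = params_billion * 1e9 * BYTES_PER_PARAM
    drain_seconds = interval_min * 60 * OVERLAP
    return checkpoint_bytes / drain_seconds / 1e9

for params in (45, 405, 800, 1000):      # billions of parameters
    for interval in (15, 30, 60):        # minutes between checkpoints drained to global storage
        print(f"{params:>5}B params, {interval:>2} min interval: "
              f"{required_bandwidth_gbps(params, interval):7.1f} GB/s")
```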
And if there’s one thing we can take away from the 85,000 checkpoints we examined, it’s that even the largest LLMs can train efficiently with modest global checkpoint bandwidth, thus freeing up datacenter resources for more compute and faster time to training completion.
¹ We use clustering to attribute individual checkpoints to a single model training run, then use cluster centers to estimate the checkpoint size. We then assume all models checkpoint 14 bytes per parameter to estimate the parameter size of each model trained. This introduces a systematic uncertainty of around ±15% across all estimated model sizes, but it does not materially change the conclusions.
² One-tenth (10%) is a relatively arbitrary number we chose to keep the expected lost work from a random failure small relative to the job’s mean time between failures. We are also assuming that all restarts must come from checkpoints that drained to shared storage, which is very conservative. Recall that only 5% of failures in this 405-billion-parameter model training resulted from a whole node (and its un-drained checkpoint) being lost.