Optimizing Checkpoint Bandwidth for LLM Training
Summary: We analyzed 85,000 checkpoints across real-world systems to see how production training workloads actually interact with storage, and we found that global checkpoint bandwidth requirements are modest, typically well below 1 TB/s even for trillion-parameter-scale models. To make these findings actionable, we developed a simple sizing model that relates GPU scale and reliability to the required global checkpoint bandwidth. This offers system designers a practical way to provision storage for efficient checkpointing while preserving as much power, space, and budget as possible for GPUs.
The field of training LLMs at scale is evolving rapidly, and AI service providers are struggling to design infrastructure that can scale to meet the current and future needs of frontier models.
The scale and throughput of GPU clusters limit the rate at which new models can be developed, so it is reasonable to expect their associated network and storage subsystems to keep pace. However, although GPU designers publish reference architectures that quantify the I/O bandwidth required to keep their accelerators fully saturated, these guidelines capture only half of the design equation: the supply side, or what GPUs could consume under ideal conditions. They do not account for how production workloads actually interact with storage.
VAST operates the data platform for many of the world’s largest AI training clusters, giving us visibility into the other half of the equation: the demand side. While we do not observe training on the GPU nodes directly, our systems capture the I/O patterns generated during pretraining and fine-tuning. This allows us to characterize checkpoint frequency, duration, and bandwidth, and to assess how much checkpointing overlaps with compute. By quantifying what training actually demands of storage, we can help customers optimally balance their investments in GPUs, storage, networking, power, cooling, and space.
To ground this analysis of the demand-side view, we analyzed checkpoint behavior across dozens of frontier-scale training runs.
Characterizing 85,000 AI model checkpoints
We surveyed 40 large (i.e., more than 1,024-GPU) production training runs, covering more than 85,000 checkpoints across 18 clusters at AI cloud providers. Model sizes ranged from 45 billion to over 1 trillion parameters, spanning modest-to-frontier-scale LLMs¹. What we observed is that checkpoint bandwidth per GPU decreases as model size grows.
This trend reflects the mechanics of data-parallel training: although larger models require more GPUs to train, only a single data-parallel replica's state must be written when checkpointing. As a result, checkpoint bandwidth does not scale linearly with cluster size, and per-GPU demand drops as more data-parallel replicas are spread across more GPUs.
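To make the arithmetic concrete, here is a small illustrative sketch (the cluster sizes are hypothetical, not drawn from the runs above): the checkpoint size is fixed by the model, so the per-GPU share shrinks as data-parallel replicas are added.

```python
# Illustrative sketch with hypothetical numbers: checkpoint size is set by the
# model (assumed here at 14 bytes per parameter, as in this analysis), so the
# per-GPU share of checkpoint traffic falls as data-parallel replicas are added.
BYTES_PER_PARAM = 14

def per_gpu_checkpoint_gb(params_billion: float, gpus_per_replica: int, replicas: int) -> float:
    """Average checkpoint gigabytes per GPU; only one replica's state is written."""
    checkpoint_gb = params_billion * BYTES_PER_PARAM  # e.g., 400B params -> 5,600 GB
    return checkpoint_gb / (gpus_per_replica * replicas)

# Same 400B-parameter model on progressively larger clusters:
for replicas in (4, 16, 64):
    print(f"{replicas:>2} replicas: {per_gpu_checkpoint_gb(400, 256, replicas):6.2f} GB per GPU")
```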
In addition, we found that the absolute bandwidth required for checkpointing decreases for the largest models.
This may seem counterintuitive: larger models require more GPUs to train, and more components mean a greater chance of a failure interrupting training. To maintain efficient training of large models, therefore, we might expect checkpoints to be written more frequently and completed more quickly, increasing bandwidth demands on storage.
However, our data suggest that model trainers are relying on asynchronous checkpointing, a technique where frequent checkpoints are written quickly to node-local storage and then drained to global storage at a lower, fixed frequency. Because whole-node failures are rare compared to subcomponent issues (such as link flaps or GPU ECC errors), these local checkpoints usually survive crashes. Our analysis suggests that model trainers optimize for this by synchronizing checkpoints to shared storage only often enough to protect against the relatively rare cases where node-local checkpoints are lost.
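As a rough illustration of the pattern (a minimal sketch of the general technique, not any particular framework's API; the paths and drain cadence are made up for this example):

```python
import shutil
import threading
from pathlib import Path

LOCAL_CKPT_DIR = Path("/local_nvme/ckpts")       # hypothetical node-local NVMe path
GLOBAL_CKPT_DIR = Path("/shared_storage/ckpts")  # hypothetical global storage mount
DRAIN_EVERY_N = 10                               # drain every Nth checkpoint to global storage

def save_checkpoint(step: int, write_local) -> None:
    """Write every checkpoint to fast local NVMe; GPUs block only for this write."""
    local_path = LOCAL_CKPT_DIR / f"step_{step}"
    write_local(local_path)  # framework-specific local checkpoint write

    if step % DRAIN_EVERY_N == 0:
        # Copy to global storage in the background so training continues while it drains.
        threading.Thread(
            target=shutil.copytree,
            args=(local_path, GLOBAL_CKPT_DIR / f"step_{step}"),
            daemon=True,
        ).start()
```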
While the exact rate of whole-node failures varies (one study by the Llama team at Meta found this to be only 5% of failures), the implication is clear: global storage needs only enough bandwidth to absorb periodic drains of node-local checkpoints, not the full write rate implied by GPU throughput.
To validate this, we compared the checkpoint duration (the time required to copy a checkpoint to VAST) with the checkpoint interval (the time between successive checkpoints arriving on VAST). In one 800-billion-parameter training run, for example, the checkpoint interval was 40 minutes and the median checkpoint duration was 3.6 minutes. From this, we can determine the checkpoint overlap: the fraction of time between checkpoints during which asynchronous checkpointing is generating background I/O. In this 800-billion-parameter run, the checkpoint overlap was about 9%.
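Written out, using the same definitions:

checkpoint overlap = checkpoint duration / checkpoint interval = 3.6 min / 40 min ≈ 9%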
Across the 40 production runs we studied, almost every one kept the median checkpoint overlap under 10% of total training time.
While there is no formal threshold for what qualifies as a “tolerable” checkpoint overlap, the consistency of this result highlights how model trainers manage checkpoint overlap in practice.
Across all runs, asynchronous checkpoints drained to global storage at rates of 50–200 GB/s, with no clear correlation to model size. This indicates that even frontier-scale training can be sustained with relatively modest global checkpoint bandwidth (and could, in principle, tolerate higher checkpoint overlap and operate with even lower bandwidth). Global storage does not need to match peak GPU throughput.
Optimally balancing compute and data infrastructure for AI
This demand-side perspective has important implications for designing balanced training infrastructure across compute, networks, and storage.
In well-architected high-performance storage systems, bandwidth scales with the number of SSDs that can write in parallel, rather than with networking, software, or drive capacity. As a result, most AI training servers ship with 4 to 8 NVMe SSDs instead of a single high-capacity device. This performance comes at a cost, however: more drives increase the cost per terabyte, and in distributed storage systems they also mean more servers, more networking, and more power, space, and cooling.
So, although it may be tempting to demand the highest possible I/O performance to match theoretical GPU capabilities, that is often overkill. In fact, our analysis suggests that even trillion-parameter-scale models can train efficiently with well under 1 TB/s of checkpoint bandwidth (though many large VAST deployments are capable of TB/s-scale throughput). Checkpoints will ultimately use all of the I/O bandwidth available to them at scale, so provisioning more bandwidth will always allow checkpoints to drain faster. However, overprovisioning also consumes resources that could otherwise support more GPUs, without improving time to train; remember, the GPUs aren't blocked while checkpoints drain asynchronously.
The optimal architecture for AI training neither minimizes nor maximizes I/O bandwidth, but instead balances GPU capability, reliability, and checkpoint draining throughput to maximize training efficiency within the constraints of power, space, and cooling. Assuming a 10% checkpoint overlap, we see that the required global bandwidth is modest for all but the highest checkpoint frequencies and largest models.
Here are some guidelines to shape how system architects should think about training LLMs at scale:
- Checkpoint overlap, not GB/s, is the most relevant performance metric for checkpointing. Lower overlap is better because it reduces the likelihood of a catastrophic failure happening before a checkpoint can be synchronized to shared storage. Our findings indicate that 10% overlap is sufficient.
- Global storage should be designed to drain asynchronous checkpoints. Attempting to provision performance to match every GPU’s theoretical maximum adds little to overall training efficiency, because synchronous checkpoints only block GPUs for as long as it takes to write to node-local NVMe, and node-local NVMe capacity already scales with GPU count and model size.
- Design for job reliability, not hero bandwidth. Global storage should be used to protect against infrequent whole-node failures. Most failures affect individual GPUs or network links and can be recovered using node-local checkpoints.
- Balance performance headroom against GPU investment. Over-provisioning I/O performance reduces the number of GPUs that a datacenter can support. Storage should be an enabler of training efficiency, not an independent race for the highest I/O benchmark number.
And here’s a simple sizing model that relates model size, checkpoint interval, and acceptable overlap to the global checkpoint bandwidth required:
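In rough form, and using the same 14-bytes-per-parameter assumption as the rest of this analysis, the model reduces to:

required global bandwidth ≈ (parameters × 14 bytes) / (checkpoint interval × acceptable overlap)

The denominator is simply the window of time in which a checkpoint must finish draining to stay within the overlap budget.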
In practice, the checkpoint interval should be proportional to the rate at which the job crashes, which in turn scales with the number of GPUs in use. For example, a 405-billion-parameter model trained on 16,000 H100 GPUs experienced a mean time to interrupt of 150 minutes. At that scale, a reasonable checkpointing interval may have been one-tenth of this, or about 15 minutes². Assuming 14 bytes of checkpoint per parameter, the required checkpoint bandwidth is:
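Working through the arithmetic (rounded):

405 × 10⁹ parameters × 14 bytes ≈ 5.7 TB per checkpoint
15-minute interval × 10% overlap = 90 seconds to drain
5.7 TB ÷ 90 s ≈ 63 GB/s of global checkpoint bandwidth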
This example demonstrates how the sizing model translates cluster reliability and model design into defensible global bandwidth requirements. It can also be applied to a range of model sizes and checkpoint intervals to understand how bandwidth requirements vary as models scale.
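For example, a few lines of Python (a sketch using the same assumptions: 14 bytes per parameter and a 10% overlap budget) sweep model sizes and checkpoint intervals:

```python
BYTES_PER_PARAM = 14   # checkpoint bytes per parameter, as assumed throughout this analysis
OVERLAP = 0.10         # acceptable checkpoint overlap (fraction of the interval spent draining)

def required_bandwidth_gbps(params_billion: float, interval_min: float) -> float:
    """Global checkpoint bandwidth (GB/s) needed to drain within the overlap budget."""
    checkpoint_bytes = params_billion * 1e9 * BYTES_PER_PARAM
    drain_seconds = interval_min * 60 * OVERLAP
    return checkpoint_bytes / drain_seconds / 1e9

for params in (45, 405, 800, 1000):      # billions of parameters
    for interval in (15, 30, 60):        # minutes between checkpoints drained to global storage
        print(f"{params:>5}B params, {interval:>2} min interval: "
              f"{required_bandwidth_gbps(params, interval):7.1f} GB/s")
```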
And if there’s one thing we can take away from the 85,000 checkpoints we examined, it’s that even the largest LLMs can train efficiently with modest global checkpoint bandwidth, thus freeing up datacenter resources for more compute and faster time to training completion.
¹ We use clustering to attribute individual checkpoints to a single model training run, then use cluster centers to estimate the checkpoint size. We then assume all models checkpoint 14 bytes per parameter to estimate the parameter size of each model trained. This introduces a systematic uncertainty of around ±15% across all estimated model sizes, but it does not materially change the conclusions.
² One-tenth (10%) is a relatively arbitrary number we chose to keep the expected lost work from a random failure small relative to the job’s mean time between failures. We are also assuming that all restarts must come from checkpoints that drained to shared storage, which is very conservative. Recall that only 5% of failures in this 405-billion-parameter model training resulted from a whole node (and its un-drained checkpoint) being lost.