arXiv (Cornell University)
Checkmate: Zero-Overhead Model Checkpointing via Network Gradient Replication
July 2025 • Ankit Bhardwaj, Weiyang Wang, Manya Ghobadi
This paper presents Checkmate, a system that enables per-iteration checkpointing in DNN training without any training slowdown. The traditional approach to checkpointing pauses training to copy model state to a separate location, so that the state can be restored after a failure. This approach imposes a fundamental tradeoff between checkpoint frequency and the cost of a failure: frequent checkpoints add more pauses, while infrequent checkpoints lose more work when a failure occurs. We avoid this tradeoff; our key insight is that in data-parallel training, all information necessary to creat…
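The tradeoff between checkpoint frequency and failure cost in the traditional, pause-based approach can be illustrated with a rough back-of-the-envelope model. The sketch below is not taken from the paper: the function name `pause_based_overhead`, its parameters, and the assumption that a failure loses half a checkpoint interval of work on average are all hypothetical simplifications chosen for illustration.

```python
def pause_based_overhead(interval_iters, iter_time_s, ckpt_pause_s,
                         num_failures, total_iters):
    """Rough cost model for pause-based checkpointing (illustrative only).

    Returns (pause_cost_s, lost_work_s):
      pause_cost_s: total time training is paused to write checkpoints
      lost_work_s:  expected training time redone after failures
    """
    # One checkpoint is taken every `interval_iters` iterations.
    num_checkpoints = total_iters // interval_iters
    pause_cost_s = num_checkpoints * ckpt_pause_s

    # Assumption: a failure strikes at a uniformly random point within an
    # interval, so on average half an interval of work is lost per failure.
    lost_work_s = num_failures * (interval_iters / 2) * iter_time_s
    return pause_cost_s, lost_work_s


# Infrequent checkpoints: few pauses, but more work lost per failure.
sparse = pause_based_overhead(100, 1.0, 5.0, 2, 1000)
# Frequent checkpoints: little work lost, but many pauses.
dense = pause_based_overhead(10, 1.0, 5.0, 2, 1000)
```

Under this toy model, shrinking the interval moves cost from lost work to pause time rather than removing it. Per-iteration checkpointing with no pause, as the abstract describes, would correspond to `interval_iters = 1` with `ckpt_pause_s = 0`.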