arXiv (Cornell University)
Checkmate: Zero-Overhead Model Checkpointing via Network Gradient Replication
July 2025 • Ankit Bhardwaj, Weiyang Wang, Manya Ghobadi
This paper presents Checkmate, a system that enables per-iteration checkpointing in DNN training without any training slowdown. The traditional approach to checkpointing requires a pause in training to copy model states to a separate location, allowing the state to be restored in the event of failure. This approach fundamentally has a tradeoff between the frequency of checkpoints and the cost of a failure. We avoid this tradeoff; our key insight is that in data-parallel training, all information necessary to creat…