26 February 2026
Breaking Through The Bottleneck: Revolutionizing Large-Scale Language Model Training

The Quest for Efficiency in Large-Scale Language Model Training
The field of large language models (LLMs) continues to grow, and the scale and complexity of training them keep rising. A modern pretraining run often involves thousands of accelerators and a massive token corpus, and can last for days to months. Raw speed matters, but it does not fully capture efficiency in this context.
Throughput, measured in tokens per second, is an essential outcome for large-scale LLM pretraining. However, it paints an incomplete picture of efficiency: throughput depends on GPU count, network topology, storage bandwidth, data modality, sequence length, model architecture, and hyperparameters such as global batch size.
Throughput is therefore highly context-sensitive. It says nothing about the reliability, recovery behavior, or compute efficiency of the training process. In other words, throughput is an outcome, not a normalized measure of efficiency.
Goodput is a metric that aims to address these limitations by providing a more comprehensive understanding of efficiency in LLM pretraining. It measures the fraction of theoretical training capacity that is converted into useful training progress, accounting for both productive time and time lost to faults and overheads.
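At its simplest, this definition is a ratio of productive time to the total measurement window. A minimal sketch (the function name and example numbers are illustrative, not from any particular library):

```python
def goodput(productive_seconds: float, total_seconds: float) -> float:
    """Fraction of the measurement window spent making useful training progress."""
    if total_seconds <= 0:
        raise ValueError("measurement window must be positive")
    return productive_seconds / total_seconds

# Example: 20 of 24 hours in the window were productive training.
print(f"{goodput(20 * 3600, 24 * 3600):.3f}")  # 0.833
```

Everything outside the numerator (restarts, checkpoint restores, queue time) is what the decomposition below attributes to specific layers.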
To understand goodput, it’s essential to consider its decomposable nature. The metric can be broken down into three layers:
- Infra Goodput: This layer captures the availability of the training system, measuring how often the job is actually training rather than being unavailable due to infrastructure faults or orchestration delays.
- Framework Goodput: This layer assesses the loss of progress when failures occur, focusing on checkpointing overhead and recovery waste.
- Model Goodput: This layer examines the efficiency of compute utilization, measuring how effectively the training program converts peak accelerator capability into model computations.
These three layers provide a holistic view of goodput, allowing for actionable insights into where time is lost and why.
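One way to make the layered view concrete is to treat the layers as compounding fractions, so that overall goodput is their product. This multiplicative framing is an assumption of the sketch below, and the class and field names are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class GoodputBreakdown:
    infra: float      # fraction of time the job is actually training (availability)
    framework: float  # fraction of training time not lost to checkpoint/recovery waste
    model: float      # fraction of peak accelerator capability used, e.g. MFU

    def overall(self) -> float:
        # Under the multiplicative assumption, losses at each layer compound:
        # a 5% infra loss and a 3% framework loss both scale down model goodput.
        return self.infra * self.framework * self.model

run = GoodputBreakdown(infra=0.95, framework=0.97, model=0.40)
print(f"{run.overall():.3f}")  # 0.369
```

The example also shows why attribution matters: model goodput (here 0.40) often dominates the headline number, but the infra and framework terms are where operational work recovers lost time.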
Goodput is essential because it is a normalized measure of efficiency that can be compared across different systems and stacks. Beyond its utility as a standalone metric, it also serves as a framework for decomposition: by breaking the training process into its constituent layers, engineers can pinpoint specific bottlenecks and prioritize optimization work more effectively.
The Google Approach
Google has introduced ML Productivity Goodput as an efficiency metric for end-to-end training systems. The company’s API-driven approach enables developers to compute goodput and diagnose badput sources across the stack.
The Google approach highlights the importance of attribution in understanding efficiency. By attributing losses to specific layers, engineers can identify areas where improvements are needed most.
Practical Considerations
Measuring goodput in practice requires careful instrumentation. A practical measurement stack typically includes:
- Establishing a measurement window
- Recording “productive training time” explicitly
- Tying each disruption to a fault event
- Computing model FLOPs utilization (MFU) from steady-state training steps
These steps enable engineers to track productivity loss and identify areas for improvement.
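The accounting behind these steps can be sketched as a small tracker: a fixed window, explicit productive-time recording, and every disruption tied to a named fault event. The class and method names below are illustrative, not an existing library's API:

```python
from dataclasses import dataclass, field

@dataclass
class Disruption:
    cause: str      # the fault event this lost time is attributed to
    seconds: float  # wall-clock time lost to the disruption

@dataclass
class MeasurementWindow:
    total_seconds: float
    productive_seconds: float = 0.0
    disruptions: list = field(default_factory=list)

    def record_training(self, seconds: float) -> None:
        """Explicitly account time spent making real training progress."""
        self.productive_seconds += seconds

    def record_disruption(self, cause: str, seconds: float) -> None:
        """Tie lost time to a specific fault event for later attribution."""
        self.disruptions.append(Disruption(cause, seconds))

    def goodput(self) -> float:
        return self.productive_seconds / self.total_seconds

window = MeasurementWindow(total_seconds=24 * 3600)
window.record_training(22 * 3600)
window.record_disruption("node failure", 1.5 * 3600)
window.record_disruption("checkpoint restore", 0.5 * 3600)
print(f"{window.goodput():.3f}")  # 0.917
```

Because every non-productive second carries a cause, the same log that yields the goodput number also yields the badput attribution.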
Large-scale LLM pretraining is fundamentally a distributed systems problem wrapped around a massively parallel math problem. Throughput is an essential outcome, but it does not provide a complete picture of efficiency.
Goodput provides a normalized, decomposable alternative, measuring the fraction of theoretical training capacity that translates into actual progress and attributing losses to specific layers of the stack. By understanding goodput, engineers can identify areas for improvement and prioritize efforts more effectively, ultimately leading to more efficient and productive large-scale LLM pretraining processes.
If you’re building or operating large-scale ML systems and starting to think in terms of stack-level goodput rather than just tokens per second, the deeper work lies in how infrastructure, frameworks, and model design interact.