throughput
Throughput is the rate at which a system completes useful work, typically measured as units per second averaged over a given time interval.
In AI workloads, throughput often refers to tokens per second or sequences per second for language models, and to images or samples per second for training and inference jobs.
Throughput can be increased with techniques such as batching, parallelism, vectorized kernels, and efficient I/O. However, larger batches and resource contention can also increase latency or memory usage. In practice, practitioners monitor both throughput and latency to balance user experience against hardware utilization and cost.
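As a minimal sketch, throughput can be measured by timing a workload and dividing the number of completed items by the elapsed time. The helper below is a hypothetical example, not part of any specific library; the dummy workload stands in for real model inference or data processing:

```python
import time

def measure_throughput(process_batch, batches):
    """Return items processed per second across a sequence of batches."""
    start = time.perf_counter()
    total_items = 0
    for batch in batches:
        process_batch(batch)       # do the actual work
        total_items += len(batch)  # count completed units
    elapsed = time.perf_counter() - start
    return total_items / elapsed

# Dummy workload: 100 batches of 32 "samples" each
batches = [[0] * 32 for _ in range(100)]
rate = measure_throughput(sum, batches)
print(f"{rate:,.0f} samples/sec")
```

Note that increasing the batch size here would raise measured throughput for many real workloads, but each individual batch would also take longer to finish, which is the throughput-versus-latency trade-off described above.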
By Leodanis Pozo Ramos • Updated Nov. 17, 2025