Latency

Latency is the time that elapses between sending a request and receiving a response. In distributed systems, it includes network delay, queuing, and processing time, and it's usually reported as percentiles to capture tail behavior and jitter.
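The sketch below shows one way to measure per-call latency and summarize it with percentiles in Python. The `slow_operation` function and the sample count are placeholders standing in for whatever request you actually want to time:

```python
import random
import statistics
import time

def slow_operation():
    # Placeholder for the request you want to measure.
    time.sleep(random.uniform(0.01, 0.05))

def measure_latency_ms(func, samples=200):
    """Call func repeatedly and return per-call latencies in milliseconds."""
    latencies = []
    for _ in range(samples):
        start = time.perf_counter()
        func()
        latencies.append((time.perf_counter() - start) * 1000)
    return latencies

latencies = measure_latency_ms(slow_operation)
q = statistics.quantiles(latencies, n=100)  # 99 percentile cut points
print(f"p50={q[49]:.1f} ms  p95={q[94]:.1f} ms  p99={q[98]:.1f} ms")
```

Reporting p95 or p99 rather than the mean highlights the slow tail that a handful of unlucky requests experience.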

Latency is different from throughput, which measures how much work a system completes per unit of time. Latency is affected by batching, concurrency, caching, cold starts, and resource contention.
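To see the difference, you can measure both at once. In this sketch, `fake_request` is a hypothetical I/O-bound call that takes about 50 ms; raising concurrency increases throughput (requests per second) while each individual request still waits roughly the same amount of time:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_request():
    # Placeholder for a network call that takes ~50 ms.
    time.sleep(0.05)

def timed_request(_):
    start = time.perf_counter()
    fake_request()
    return time.perf_counter() - start

def run(concurrency, total=40):
    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed_request, range(total)))
    wall = time.perf_counter() - wall_start
    mean_latency_ms = 1000 * sum(latencies) / len(latencies)
    print(
        f"concurrency={concurrency}: mean latency {mean_latency_ms:.0f} ms, "
        f"throughput {total / wall:.1f} req/s"
    )

for concurrency in (1, 8):
    run(concurrency)
```

With one worker, throughput is limited by latency; with eight, throughput rises sharply even though each request's latency barely changes.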

For AI and LLM applications, important latency metrics include time to first token (TTFT), tokens per second, delays from tool calls and retrieval, and the extra overhead of loading models or containers versus serving them from a warm cache.
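Here's a minimal sketch of measuring TTFT and tokens per second from a streaming response. The `stream_tokens` generator is a hypothetical stand-in for an LLM streaming API, with an assumed startup delay before the first token:

```python
import time

def stream_tokens():
    # Hypothetical stand-in for a streaming LLM response: the first token
    # arrives after a startup delay, then tokens stream steadily.
    time.sleep(0.4)  # assumed model load + prompt processing
    for _ in range(50):
        time.sleep(0.02)
        yield "token"

start = time.perf_counter()
first_token_at = None
count = 0
for _token in stream_tokens():
    count += 1
    if first_token_at is None:
        first_token_at = time.perf_counter()
total = time.perf_counter() - start

ttft = first_token_at - start
tps = (count - 1) / (total - ttft)
print(f"time to first token: {ttft * 1000:.0f} ms")
print(f"tokens per second after first token: {tps:.1f}")
```

Tracking TTFT and tokens per second separately matters because users perceive a fast first token as responsiveness, while the streaming rate determines how long the full answer takes.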

By Leodanis Pozo Ramos • Updated Nov. 18, 2025