Inference

Inference is the stage in which a trained model takes new inputs and produces outputs, operating under real-world constraints of latency, throughput, and cost.

It encompasses an end-to-end flow, sketched in the example after this list:

  • Request handling
  • Input encoding or pre-processing
  • Model execution
  • Output decoding or post-processing
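
The following sketch walks a single request through those four stages. Everything in it is a stand-in invented for illustration: the "model" is a keyword-to-score lookup rather than a trained network, so the example runs on its own without any ML framework.

```python
def handle_request(raw_request: dict) -> str:
    """Request handling: validate the payload and extract the input."""
    if "text" not in raw_request:
        raise ValueError("request must contain a 'text' field")
    return raw_request["text"]


def preprocess(text: str) -> list[str]:
    """Input encoding / pre-processing: normalize and tokenize."""
    return text.lower().split()


def run_model(tokens: list[str]) -> list[float]:
    """Model execution: apply fixed 'parameters' to the encoded input."""
    weights = {"great": 1.0, "good": 0.5, "bad": -1.0}  # stand-in for trained parameters
    return [weights.get(token, 0.0) for token in tokens]


def postprocess(scores: list[float]) -> dict:
    """Output decoding / post-processing: turn raw scores into a response."""
    total = sum(scores)
    return {"label": "positive" if total >= 0 else "negative", "score": total}


request = {"text": "The service was great"}
print(postprocess(run_model(preprocess(handle_request(request)))))
```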

Unlike training, which optimizes model parameters, inference applies those fixed parameters to unseen data.
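
As a minimal illustration, assuming PyTorch, an untrained nn.Linear below stands in for any trained model; the point is only that inference runs a forward pass with gradients disabled and the parameters left untouched:

```python
import torch
from torch import nn

model = nn.Linear(4, 2)  # stand-in for a trained network
model.eval()             # switch layers such as dropout to inference behavior

new_input = torch.randn(1, 4)  # "unseen data"

with torch.no_grad():    # no gradients: parameters stay fixed
    output = model(new_input)

print(output)
```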

In the case of large language models (LLMs), inference typically means autoregressive generation: the model first processes the prompt in a prefill phase, then produces output token by token in a decoding loop, feeding each new token back in as input.
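
Here's a toy version of that loop. The vocabulary, the scoring rule, and the generate function are all hypothetical stand-ins for a real LLM forward pass; only the control flow (prefill, then one token per iteration fed back into the context) reflects how generation actually proceeds.

```python
VOCAB = ["<eos>", "the", "cat", "sat", "on", "mat"]


def next_token_scores(token_ids: list[int]) -> list[float]:
    """Stand-in for a forward pass that scores every vocabulary token."""
    favored = (token_ids[-1] + 1) % len(VOCAB)  # toy rule: favor the "next" token
    return [1.0 if i == favored else 0.0 for i in range(len(VOCAB))]


def generate(prompt_ids: list[int], max_new_tokens: int = 8) -> list[int]:
    token_ids = list(prompt_ids)  # prefill: the prompt seeds the context
    for _ in range(max_new_tokens):  # decode loop: one new token per step
        scores = next_token_scores(token_ids)
        next_id = max(range(len(scores)), key=scores.__getitem__)  # greedy pick
        if VOCAB[next_id] == "<eos>":
            break
        token_ids.append(next_id)  # feed the new token back into the context
    return token_ids


prompt = [VOCAB.index("the"), VOCAB.index("cat")]
print(" ".join(VOCAB[i] for i in generate(prompt)))  # the cat sat on mat
```

Greedy selection keeps the sketch short; production decoders usually sample from the probability distribution instead, controlled by settings such as temperature and top-p.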

Production systems further optimize inference through batching, caching, quantization, and hardware and compiler tuning.
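
Batching is the easiest of these to show in a few lines. Assuming PyTorch again, with an untrained nn.Linear as a stand-in for the served model, several pending requests are stacked into one tensor so a single forward pass serves all of them:

```python
import torch
from torch import nn

model = nn.Linear(4, 2)  # stand-in for the served model
model.eval()

# Batching: stack pending requests so one forward pass amortizes
# per-call overhead across all of them.
pending_requests = [torch.randn(4) for _ in range(8)]
batch = torch.stack(pending_requests)  # shape: (8, 4)

with torch.no_grad():
    batched_outputs = model(batch)     # one pass instead of eight

results = list(batched_outputs)        # unbatch: one output per request
print(len(results), results[0].shape)
```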


By Leodanis Pozo Ramos • Updated Nov. 3, 2025