Tokenization

Tokenization is the process of converting raw text into a sequence of discrete tokens that models can handle, typically by segmenting text into words, subwords, characters, or bytes and mapping them to vocabulary IDs.
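To make the idea concrete, here is a minimal sketch of word-level tokenization in Python. The vocabulary and the <unk> fallback token are invented for illustration; real systems learn much larger vocabularies from data.

```python
# A minimal sketch of word-level tokenization: split text into tokens
# and map each token to an integer ID from a small, made-up vocabulary.
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5}

def tokenize(text: str) -> list[int]:
    """Lowercase, split on whitespace, and look up each token's ID."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

print(tokenize("The cat sat on the mat"))  # [1, 2, 3, 4, 1, 5]
```

Any word missing from the vocabulary falls back to the <unk> ID, which is exactly the coverage problem that subword methods address.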

Modern natural language processing (NLP) systems favor subword tokenization to balance vocabulary size and coverage, using schemes such as Byte Pair Encoding (BPE), WordPiece, and Unigram, often trained directly from raw text with tools like SentencePiece. These methods handle rare and out-of-vocabulary words, support many languages and scripts, and improve robustness by representing text as variable-length units.
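The sketch below illustrates the core BPE training loop in plain Python: start from characters and repeatedly merge the most frequent adjacent pair. The corpus and number of merges are made up for illustration; in practice you would train with a library such as SentencePiece or Hugging Face's tokenizers rather than hand-rolling this.

```python
from collections import Counter

# Toy corpus; real training data would be raw text at scale.
corpus = ["lower", "lowest", "newer", "wider"]

def train_bpe(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn a list of merge rules (pairs of symbols) from the corpus."""
    # Represent each word as a tuple of symbols, starting from characters.
    word_freqs = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pair_counts = Counter()
        for symbols, freq in word_freqs.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, replacing the best pair with its merged symbol.
        merged = best[0] + best[1]
        new_freqs = Counter()
        for symbols, freq in word_freqs.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_freqs[tuple(out)] += freq
        word_freqs = new_freqs
    return merges

print(train_bpe(corpus, num_merges=5))
```

Each learned merge becomes a subword unit, so frequent fragments like "er" or "low" end up as single tokens while rare words still decompose into smaller pieces instead of mapping to an unknown token.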

Practical pipelines also define special tokens for boundaries or control signals and include a reversible detokenization step, allowing model outputs to be turned back into readable text.
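As a rough illustration of that round trip, the sketch below wraps token IDs with boundary markers and strips them again on decode. The <bos> and <eos> tokens and the tiny vocabulary are assumptions for this example, not a specific library's conventions.

```python
# A minimal sketch of special tokens plus reversible detokenization,
# using a made-up vocabulary. <bos> and <eos> mark sequence boundaries.
vocab = {"<bos>": 0, "<eos>": 1, "hello": 2, "world": 3}
id_to_token = {i: t for t, i in vocab.items()}

def encode(text: str) -> list[int]:
    """Map words to IDs and wrap the sequence with boundary markers."""
    return [vocab["<bos>"]] + [vocab[w] for w in text.split()] + [vocab["<eos>"]]

def decode(ids: list[int]) -> str:
    """Drop special tokens and join the rest back into readable text."""
    specials = {vocab["<bos>"], vocab["<eos>"]}
    return " ".join(id_to_token[i] for i in ids if i not in specials)

ids = encode("hello world")
print(ids)          # [0, 2, 3, 1]
print(decode(ids))  # hello world
```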


By Leodanis Pozo Ramos • Updated Oct. 21, 2025