
Tokenization

NLP

The process of breaking text into smaller units called tokens, which serve as the fundamental input elements for language models.

Tokenization is the process of converting raw text into a sequence of discrete tokens that a language model can process. Tokens can be whole words, subwords, individual characters, or even bytes, depending on the tokenization strategy. The tokenizer defines the vocabulary of a model and directly impacts its ability to handle different languages, rare words, and specialized terminology.
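The different granularities can be contrasted with a minimal sketch. The same string yields very different sequence lengths depending on whether it is split into words, characters, or UTF-8 bytes:

```python
# Contrast three tokenization granularities on the same text.
text = "Tokenization matters!"

word_tokens = text.split()                # word-level: split on whitespace
char_tokens = list(text)                  # character-level: one token per character
byte_tokens = list(text.encode("utf-8"))  # byte-level: one token per UTF-8 byte

print(len(word_tokens), len(char_tokens), len(byte_tokens))  # → 2 21 21
```

For ASCII text the character and byte counts coincide; for non-ASCII text (accented characters, CJK scripts, emoji) the byte sequence grows longer, which is one reason byte-level vocabularies trade universality for sequence length.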

Modern language models primarily use subword tokenization algorithms such as Byte Pair Encoding (BPE), WordPiece, and SentencePiece. These methods strike a balance between character-level tokenization (which has a tiny vocabulary but produces long sequences) and word-level tokenization (which cannot handle out-of-vocabulary words). Subword tokenizers learn to split rare words into meaningful sub-units while keeping common words as single tokens.
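The core BPE training loop can be sketched in a few lines: repeatedly count adjacent symbol pairs across a frequency-weighted corpus and merge the most frequent pair into a new vocabulary symbol. The toy corpus below follows the classic example from the original BPE-for-NLP paper; a production tokenizer would add careful boundary handling and byte-level fallbacks.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, words):
    """Merge every occurrence of the pair into a single symbol.

    Naive string replace is fine for this toy corpus; real implementations
    match on symbol boundaries to avoid accidental partial merges.
    """
    pattern = " ".join(pair)
    replacement = "".join(pair)
    return {word.replace(pattern, replacement): freq for word, freq in words.items()}

# Toy corpus: words pre-split into characters, with an end-of-word marker.
words = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

merges = []
for _ in range(5):
    pairs = get_pair_counts(words)
    best = max(pairs, key=pairs.get)   # most frequent adjacent pair
    words = merge_pair(best, words)
    merges.append(best)

print(merges)
# → [('e', 's'), ('es', 't'), ('est', '</w>'), ('l', 'o'), ('lo', 'w')]
```

After five merges, frequent substrings like "est" have become single tokens, while rare words remain decomposed into smaller pieces, which is exactly the balance described above.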

Tokenization has significant practical implications for language model performance and cost. The number of tokens in a text determines computational requirements and, for commercial APIs, the cost of inference. Different tokenizers handle multilingual text, code, and special characters with varying effectiveness. Research into tokenizer-free models and byte-level processing aims to overcome some of the limitations inherent in predefined tokenization schemes.
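The cost implication is simple arithmetic: commercial APIs typically bill per 1,000 (or per million) tokens, often at different rates for input and output. A minimal sketch, using hypothetical placeholder prices rather than any real provider's rates:

```python
# Hypothetical per-1K-token prices for illustration only; real API
# rates vary by provider and model.
PRICE_PER_1K_INPUT = 0.0005   # USD per 1,000 input tokens (assumed)
PRICE_PER_1K_OUTPUT = 0.0015  # USD per 1,000 output tokens (assumed)

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate a request's cost from its input and output token counts."""
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT \
         + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT

# An 8,000-token prompt with a 1,000-token completion:
print(round(estimate_cost(8000, 1000), 4))  # → 0.0055
```

Because a tokenizer that splits text less efficiently produces more tokens for the same content, tokenizer choice directly affects both latency and per-request cost, especially for languages the vocabulary covers poorly.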

Last updated: February 20, 2026