Token / Tokenization

Tokenization is the process of splitting text into smaller units called tokens, which are the basic input elements for language models.

Tokenization is a fundamental preprocessing step in natural language processing (NLP) that converts raw text into a sequence of tokens. Tokens can be words, subwords, characters, or individual bytes, depending on the tokenization algorithm used. Common subword methods include Byte-Pair Encoding (BPE), WordPiece, and the Unigram language model; SentencePiece is a widely used toolkit that implements BPE and Unigram directly on raw text. Each method is designed to balance vocabulary size against representation efficiency. The choice of tokenizer significantly affects how a model interprets and generates language, as it determines the granularity of linguistic units the model can process.

For example, the sentence “I love AI!” might be tokenized into [“I”, “love”, “AI”, “!”] by a word-level tokenizer, or into [“I”, “lov”, “e”, “AI”, “!”] by a subword tokenizer. Subword tokenization is particularly useful for handling out-of-vocabulary words and morphologically rich languages, as it can represent rare or unseen words by combining known subword units. Tokenization also involves assigning each token a unique numerical ID from a predefined vocabulary, which is then fed into the model’s embedding layer.
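The word-level split and the token-to-ID mapping described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular library does it: the regular expression, the function names, and the `<unk>` placeholder for unknown tokens are all assumptions made for the example.

```python
import re

def word_tokenize(text):
    # Split into runs of word characters or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

def build_vocab(tokens):
    # Assign IDs in order of first appearance; ID 0 is reserved for
    # unknown tokens (a hypothetical convention for this sketch).
    vocab = {"<unk>": 0}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

def encode(tokens, vocab):
    # Map each token to its ID, falling back to the unknown-token ID.
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]

tokens = word_tokenize("I love AI!")   # ['I', 'love', 'AI', '!']
vocab = build_vocab(tokens)
ids = encode(tokens, vocab)            # [1, 2, 3, 4]

# A word that is not in the vocabulary collapses to the unknown ID.
unseen_ids = encode(word_tokenize("I love NLP!"), vocab)  # [1, 2, 0, 4]
```

The last line shows the weakness of a word-level vocabulary: “NLP” was never seen, so all information about it is lost in the single unknown ID.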

Modern large language models (LLMs) use subword tokenization to balance vocabulary size against coverage: the GPT family uses byte-level BPE (since GPT-2), and BERT uses WordPiece. The tokenizer is typically trained on a large corpus to learn the most frequent subword units, and it remains fixed during model training and inference. Understanding tokenization is crucial for debugging model behavior, as token boundaries can affect how the model interprets punctuation, capitalization, and compound words. Additionally, token count directly impacts computational cost, and every model has a maximum context window measured in tokens.

Why it matters

Tokenization directly influences model performance, vocabulary size, and computational efficiency. A well-designed tokenizer reduces the number of tokens needed to represent text, so more text fits within a fixed context window and processing is faster. It also affects how models handle rare words, multilingual text, and domain-specific terminology. Practitioners must consider tokenization when optimizing input length, fine-tuning models, or deploying systems with limited memory, as token count is a key factor in API pricing and latency.

FAQ

How does tokenization work?

Tokenization works by applying a predefined algorithm to split text into tokens. For subword methods like BPE, training starts from individual characters (or bytes) and iteratively merges the most frequent pair of adjacent symbols in a training corpus, recording each merge, until the vocabulary reaches a target size. At inference time, the tokenizer applies the learned merges to the input text, so words it never saw during training are still split into smaller known pieces.
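The BPE merge loop can be illustrated with a small, simplified sketch. It assumes a toy corpus given as word frequencies, starts from single characters, and omits details that production tokenizers add (byte-level fallback, pre-tokenization rules, end-of-word markers). The function names and the corpus are hypothetical.

```python
from collections import Counter

def get_pair_counts(words):
    # words maps a tuple of symbols to the word's corpus frequency.
    counts = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(pair, words):
    # Replace every adjacent occurrence of `pair` with one merged symbol.
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(word_freqs, num_merges):
    # Start from characters and learn up to `num_merges` merges.
    words = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break
        best = max(counts, key=counts.get)  # most frequent adjacent pair
        merges.append(best)
        words = merge_pair(best, words)
    return merges

def apply_bpe(word, merges):
    # Tokenize a word by replaying the learned merges in order.
    symbols = tuple(word)
    for pair in merges:
        symbols = next(iter(merge_pair(pair, {symbols: 1})))
    return list(symbols)

corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
merges = train_bpe(corpus, num_merges=3)
# merges == [('u', 'g'), ('u', 'n'), ('h', 'ug')]
```

With these three merges, `apply_bpe("hugs", merges)` yields `['hug', 's']`, and the unseen word “bug” becomes `['b', 'ug']`: it is still representable because it decomposes into known pieces.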

What is the difference between word-level and subword tokenization?

Word-level tokenization splits text into whole words, resulting in a large vocabulary and poor handling of rare or compound words. Subword tokenization breaks words into smaller units like prefixes, suffixes, or character groups, allowing a smaller vocabulary and better coverage of unseen words. Subword methods are preferred in modern LLMs for their efficiency and flexibility.
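The contrast can be made concrete with a greedy longest-match subword tokenizer, in the style of WordPiece inference. This is a sketch under stated assumptions: the tiny vocabulary is invented for the example, and the `##` prefix marking a word-internal piece follows the convention used by BERT's tokenizer.

```python
def subword_tokenize(word, vocab, unk="<unk>"):
    # Greedy longest-match-first: at each position, take the longest
    # vocabulary entry that matches, then continue after it.
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No known piece fits here: the whole word is unknown.
            return [unk]
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical vocabularies for illustration only.
word_vocab = {"token", "model"}
subword_vocab = {"token", "model", "##ization", "##s", "##ing"}

word = "tokenization"
word_level = [word] if word in word_vocab else ["<unk>"]
subword_level = subword_tokenize(word, subword_vocab)
# word_level    == ['<unk>']
# subword_level == ['token', '##ization']
```

The word-level vocabulary has no entry for “tokenization” and falls back to the unknown token, while the subword vocabulary covers it with two known pieces.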

How does token count affect model usage?

Token count drives computational cost and determines how much text fits in a model's context window. Models have a maximum token limit (e.g., 4,096 tokens for the original GPT-3.5 Turbo), and input that exceeds it must be truncated or split into chunks. API pricing is often per token, so longer token sequences increase cost. A tokenizer that encodes the same text in fewer tokens saves resources and leaves room for longer conversations or documents.
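A minimal sketch of chunking and cost estimation follows. It assumes the text has already been converted to a list of token IDs; the function names, the optional overlap between chunks, and the price used in the usage note are hypothetical and do not reflect any provider's actual API or rates.

```python
def chunk_tokens(token_ids, max_tokens, overlap=0):
    # Split a token sequence into windows of at most `max_tokens`,
    # optionally repeating `overlap` tokens between neighbors.
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in [0, max_tokens)")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_tokens])
        if start + max_tokens >= len(token_ids):
            break  # the last window already reaches the end
    return chunks

def estimate_cost(num_tokens, price_per_1k_tokens):
    # Per-token pricing, quoted per 1,000 tokens (hypothetical rate).
    return num_tokens / 1000 * price_per_1k_tokens

ids = list(range(10))                  # stand-in for real token IDs
chunks = chunk_tokens(ids, max_tokens=4)
# chunks == [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Under an assumed price of 0.002 per 1,000 tokens, `estimate_cost(4096, 0.002)` comes to roughly 0.0082 in whatever currency the rate is quoted. A small overlap between chunks is a common way to avoid cutting context exactly at a chunk boundary.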