Context Window
The maximum number of tokens a language model can process at once, which limits how much retrieved content can be included alongside a query.
The context window is the maximum number of tokens that a language model can accept as input in a single request. This is a hard architectural limit determined by the model's positional encoding scheme and training configuration. Examples include 2,048 tokens for GPT-3, 128K tokens for GPT-4 Turbo, and 200K tokens for Claude 3.
The context window is a critical constraint in RAG system design because it limits how much retrieved content can be provided to the LLM alongside the user query and system instructions. If you retrieve 20 chunks of 500 tokens each, that is 10,000 tokens of context before accounting for the prompt template and desired output length. Larger context windows allow more retrieved content but increase cost (API pricing scales with tokens) and can degrade quality due to the "lost in the middle" problem where LLMs pay less attention to information in the middle of long contexts.
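The budgeting arithmetic above can be made concrete. This is a minimal sketch with illustrative numbers (the window size, prompt overhead, and reserved output space are all assumptions, not fixed limits):

```python
# Sketch of context-window budgeting for a RAG request.
# All constants below are illustrative assumptions.

CONTEXT_WINDOW = 128_000        # assumed model limit (e.g. GPT-4 Turbo)
SYSTEM_PROMPT_TOKENS = 400      # assumed prompt-template overhead
QUERY_TOKENS = 100              # assumed user-query length
RESERVED_OUTPUT_TOKENS = 1_000  # space kept for the model's answer

def max_chunks(chunk_tokens: int = 500) -> int:
    """How many retrieved chunks of a given size fit in the remaining budget."""
    budget = (CONTEXT_WINDOW - SYSTEM_PROMPT_TOKENS
              - QUERY_TOKENS - RESERVED_OUTPUT_TOKENS)
    return budget // chunk_tokens

# 20 chunks of 500 tokens consume 10,000 tokens, as in the example above.
print(20 * 500)        # 10000
print(max_chunks(500))
```

In practice the ceiling returned here is far more context than is useful; the "lost in the middle" effect means quality, not capacity, is usually the binding constraint.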
Context window optimization involves balancing retrieval breadth (more documents for better recall) against context efficiency (fewer, more relevant documents for better precision and lower cost). Most RAG systems find a sweet spot of 5-20 retrieved chunks, with reranking ensuring only the most relevant content occupies the limited context space.
Last updated: February 22, 2026