>_TheQuery

TF-IDF

Information Retrieval

A numerical statistic combining term frequency and inverse document frequency to measure how important a word is to a document within a collection.

TF-IDF (Term Frequency-Inverse Document Frequency) is a classic information retrieval weighting scheme that measures the importance of a word to a document within a larger collection. It combines two factors: Term Frequency (TF), which measures how often a term appears in a document, and Inverse Document Frequency (IDF), which measures how rare a term is across the entire collection.

The formula TF-IDF(t,d) = TF(t,d) × IDF(t), where IDF(t) is typically computed as log(N / df(t)) for a collection of N documents of which df(t) contain the term t, automatically downweights common words like "the" (high TF but near-zero IDF because it appears in nearly every document) and upweights distinctive terms (moderate TF but high IDF because they appear in few documents). This simple multiplication captures the intuition that a word is important to a document if it appears frequently in that document but rarely in others.
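The weighting described above can be sketched in a few lines. This is a minimal illustration, not a production implementation: it assumes pre-tokenized documents, raw-count TF, and the unsmoothed IDF variant log(N / df(t)); real systems (e.g. scikit-learn's TfidfVectorizer) use smoothed and normalized variants.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute a TF-IDF weight for each term in each document.

    docs: list of documents, each a list of tokens.
    Returns one {term: weight} dict per document.
    Uses raw-count TF and IDF(t) = log(N / df(t)).
    """
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "slept"],
]
w = tf_idf(docs)
# "the" appears in all 3 documents, so IDF = log(3/3) = 0 and its
# weight is 0 everywhere; "dog" appears in only one document, so it
# gets the highest IDF, log(3).
```

Note how the ubiquitous "the" is zeroed out automatically, with no stopword list required: that is the downweighting effect the formula is designed to produce.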

While BM25 has largely superseded TF-IDF in modern retrieval systems by adding term frequency saturation and document length normalization, TF-IDF remains conceptually foundational. Understanding TF-IDF is essential for grasping why BM25 works, how sparse retrieval operates, and when keyword-based methods outperform neural approaches.

Last updated: February 22, 2026