Stop Word Removal

NLP

The practice of filtering out high-frequency function words, such as 'the', 'is', and 'in', that carry little semantic content.

Stop word removal is a text preprocessing technique that discards words deemed too common to contribute meaningful signal to a model or retrieval system. Typical stop words include articles, prepositions, conjunctions, and auxiliary verbs. Removing them leaves a vocabulary that is denser in content-bearing terms, which can improve efficiency and relevance in tasks like keyword search and topic modeling.
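The basic operation can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the stop list here is a tiny hypothetical set chosen for the example (real lists are much longer), and the names `STOP_WORDS`, `tokenize`, and `remove_stop_words` are made up for this sketch.

```python
import re

# Tiny illustrative stop list (an assumption; real lists contain far more entries).
STOP_WORDS = frozenset({"a", "an", "the", "is", "are", "in", "on", "of", "and", "or", "to"})

def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not in the stop word set, preserving order."""
    return [t for t in tokens if t not in stop_words]

tokens = tokenize("The cat is sleeping in the garden")
filtered = remove_stop_words(tokens)
print(filtered)  # ['cat', 'sleeping', 'garden']
```

Storing the list as a set keeps each membership check constant-time, which matters when filtering large corpora.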

The definition of a stop word is not fixed. Standard lists exist for most languages, but effective stop word sets are often domain-specific. In legal or medical text, words like 'not' or 'without' carry critical meaning, since negation can reverse a finding or a clause, and they should not be removed even though some general-purpose lists include them. By contrast, a product search engine might aggressively strip function words to focus on nouns.
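One way to adapt a general list to a domain is to subtract protected terms and add corpus-specific ones. The sketch below assumes a hypothetical clinical setting; every word set in it (`BASE_STOP_WORDS`, `PROTECTED`, `DOMAIN_EXTRAS`) is invented for illustration and is not taken from any published list.

```python
# Hypothetical general-purpose list that happens to include negation-like words.
BASE_STOP_WORDS = {"the", "is", "a", "of", "in", "not", "without", "was", "with"}

# Words whose removal would change meaning in clinical text (assumed for this example).
PROTECTED = {"not", "without", "no"}

# Words assumed to be frequent but uninformative in this hypothetical corpus.
DOMAIN_EXTRAS = {"patient"}

# Domain list = general list, minus protected terms, plus domain-specific noise.
clinical_stop_words = (BASE_STOP_WORDS - PROTECTED) | DOMAIN_EXTRAS

def filter_tokens(tokens, stop_words):
    """Drop tokens found in stop_words, comparing case-insensitively."""
    return [t for t in tokens if t.lower() not in stop_words]

sentence = "patient was discharged without signs of infection".split()
generic = filter_tokens(sentence, BASE_STOP_WORDS)
clinical = filter_tokens(sentence, clinical_stop_words)

print(generic)   # ['patient', 'discharged', 'signs', 'infection']
print(clinical)  # ['discharged', 'without', 'signs', 'infection']
```

With the generic list, the output reads as if the patient had signs of infection; the domain list preserves 'without' and with it the original meaning.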

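Modern neural NLP systems, particularly transformers, typically do not apply stop word removal: attention mechanisms can learn to weight tokens by importance, and function words often carry syntactic cues that the model relies on. Stop word removal remains common in classical pipelines such as BM25 retrieval, bag-of-words classifiers, and TF-IDF weighting, where every occurrence of a token counts equally toward raw term frequency. Inverse document frequency already down-weights very common terms in BM25 and TF-IDF, but dropping them still reduces index and vocabulary size.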
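The effect on raw term frequency counts can be seen in a small bag-of-words sketch. It uses only the standard library; the stop list and the sample document are assumptions made for the example, and `term_frequencies` is a hypothetical helper name.

```python
from collections import Counter

# Small illustrative stop list (an assumption, not a standard list).
STOP_WORDS = frozenset({"the", "is", "in", "of", "a", "and", "to"})

def term_frequencies(tokens, stop_words=frozenset()):
    """Count raw term frequencies, skipping any token found in stop_words."""
    return Counter(t for t in tokens if t not in stop_words)

doc = "the price of the phone is the lowest in the store".split()

raw = term_frequencies(doc)                   # every token counted
filtered = term_frequencies(doc, STOP_WORDS)  # stop words excluded

print(raw.most_common(1))  # [('the', 4)]
print(sorted(filtered))    # ['lowest', 'phone', 'price', 'store']
```

In the unfiltered counts, 'the' is the most frequent term even though it says nothing about the document's topic; after filtering, only content-bearing terms remain in the vocabulary.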

Last updated: March 6, 2026
