Stemming
NLP
A rule-based process that strips suffixes from words to reduce them to a common root form, often producing non-dictionary stems.
Stemming reduces inflected or derived words to a base form called a stem by applying heuristic rules that chop off known suffixes. For example, 'running' and 'runs' are typically reduced to 'run'; whether 'runner' joins them depends on the stemmer's rules. The stem is not necessarily a valid dictionary word ('studies' might become 'studi'), but grouping variants under one stem improves recall in text retrieval and reduces vocabulary size in bag-of-words models.
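To make the vocabulary reduction concrete, here is a minimal sketch of suffix stripping applied to a bag-of-words count. The suffix list, the minimum stem length, and the names `toy_stem` and `bag_of_words` are illustrative assumptions, not part of any published stemmer.

```python
from collections import Counter

# Hypothetical suffix list for illustration; real stemmers use far richer rules.
SUFFIXES = ("ing", "es", "s")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bag_of_words(tokens, stem=False):
    """Count tokens, optionally merging variants that share a stem."""
    return Counter(toy_stem(t) if stem else t.lower() for t in tokens)


tokens = ["walk", "walks", "walking", "jump", "jumps", "jumping"]
raw = bag_of_words(tokens)                 # six distinct entries
stemmed = bag_of_words(tokens, stem=True)  # two entries: 'walk' and 'jump'
print(len(raw), len(stemmed))
print(toy_stem("studies"))  # a non-dictionary stem
```

Because this sketch has no rule for doubled consonants, 'running' would come out as 'runn' rather than 'run'; handling such cases is part of what the rule sets of real stemmers add.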
One of the most widely used stemmers is the Porter stemmer, published by Martin Porter in 1980, which applies a fixed sequence of suffix-stripping rules. Later work such as the Snowball framework, also by Porter, extends the approach to many languages. Stemming is fast and requires no dictionary lookup, making it suitable for high-throughput pipelines.
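As an illustration of what one suffix-stripping rule group looks like, the sketch below follows the plural-handling rules usually cited as step 1a of the Porter algorithm (SSES to SS, IES to I, SS unchanged, S removed). It covers only that first step, not the full algorithm, and the function name `porter_step_1a` is chosen here for illustration.

```python
def porter_step_1a(word: str) -> str:
    """Apply the first matching plural rule; longer suffixes are tried first."""
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress (left unchanged)
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word


for w in ["caresses", "ponies", "caress", "cats", "studies"]:
    print(w, "->", porter_step_1a(w))
```

Rule order matters: checking "ss" before "s" is what keeps 'caress' from being cut down to 'cares'. No dictionary is consulted at any point, which is why this style of stemming is cheap.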
Stemming has two main failure modes: over-stemming, which collapses words with different meanings onto one stem, and under-stemming, which fails to merge words that should map to the same root. For tasks requiring linguistic precision, lemmatization is preferred. For classical search engines and term-frequency models, where speed matters more than linguistic accuracy, stemming is often sufficient.
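The two error types can be checked mechanically once word pairs are labeled by hand. The sketch below uses a deliberately crude stemmer; its suffix list, the `should_merge` labels, and the names `crude_stem` and `classify` are all assumptions made for illustration.

```python
# Hypothetical suffix list, chosen so that both error types show up.
SUFFIXES = ("ity", "al", "e")


def crude_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def classify(pair, should_merge, stem=crude_stem):
    """Compare the stemmer's grouping of a word pair with a hand label."""
    a, b = pair
    merged = stem(a) == stem(b)
    if merged and not should_merge:
        return "over-stemming"
    if not merged and should_merge:
        return "under-stemming"
    return "ok"


# Hand-assigned labels: should the two words share a stem?
cases = [
    (("universe", "university"), False),  # different meanings
    (("alumnus", "alumni"), True),        # same root, irregular plural
    (("national", "nation"), True),
]
results = [classify(pair, flag) for pair, flag in cases]
print(results)
```

A lemmatizer with access to a dictionary could resolve a pair like 'alumnus' and 'alumni', at the cost of the lookup that stemming avoids.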
Last updated: March 6, 2026