Lemmatization

NLP

The process of reducing a word to its dictionary base form (lemma) using vocabulary and morphological analysis.

Lemmatization maps each word to its canonical dictionary form, called the lemma. Unlike stemming, which applies mechanical suffix-stripping rules, lemmatization uses a vocabulary and an understanding of part-of-speech context to return valid words. 'Better' becomes 'good', 'was' becomes 'be', and 'running' becomes 'run' — all genuine dictionary entries.
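The contrast with stemming can be sketched in a few lines. This is a toy illustration, not how any production tool works: the suffix list, the `LEMMA_TABLE` entries, and the function names `toy_stem` and `toy_lemmatize` are all made up for this example, and a real lemmatizer would rely on a full vocabulary rather than a four-entry table.

```python
# Hypothetical suffix list for a deliberately naive stemmer.
SUFFIXES = ("ing", "es", "ed", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-vocabulary mapping surface forms to lemmas.
LEMMA_TABLE = {
    "better": "good",
    "was": "be",
    "running": "run",
    "studies": "study",
}

def toy_lemmatize(word):
    """Look the word up in the vocabulary; fall back to the word itself."""
    word = word.lower()
    return LEMMA_TABLE.get(word, word)

for w in ["better", "was", "running", "studies"]:
    print(f"{w:10} stem={toy_stem(w):10} lemma={toy_lemmatize(w)}")
```

With these toy rules, the stemmer turns 'running' into 'runn' and 'studies' into 'studi', neither of which is a word, and leaves 'better' and 'was' unchanged. The table lookup returns 'run', 'study', 'good', and 'be'.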

To lemmatize accurately, a system typically needs to know the part of speech of each word, because the same surface form can have different lemmas depending on usage. 'Saw' as a verb lemmatizes to 'see'; as a noun it remains 'saw'. Tools like spaCy, NLTK, and Stanford CoreNLP include lemmatizers that combine morphological rules with vocabulary lookup.
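The part-of-speech dependence can be illustrated by keying the lookup on both the word and its tag. Again, this is a minimal sketch under stated assumptions: the `LEMMAS_BY_POS` table, the tag names `VERB`, `NOUN`, and `ADJ`, and the `lemmatize` function are hypothetical, and the tags are supplied by hand here rather than produced by a tagger.

```python
# Hypothetical lookup table keyed by (surface form, part-of-speech tag).
LEMMAS_BY_POS = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("was", "VERB"): "be",
    ("better", "ADJ"): "good",
}

def lemmatize(word, pos):
    """Return the lemma for (word, pos); fall back to the lowercased word."""
    key = (word.lower(), pos)
    return LEMMAS_BY_POS.get(key, word.lower())

# The same surface form resolves differently depending on the tag.
print(lemmatize("saw", "VERB"))  # see
print(lemmatize("saw", "NOUN"))  # saw
```

In practice the tag would come from a POS tagger that runs before the lemmatizer. Some libraries also let the caller pass the tag explicitly; NLTK's WordNetLemmatizer, for example, accepts a `pos` argument.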

Lemmatization is more linguistically accurate than stemming and is preferred when downstream tasks require interpretable tokens, such as building knowledge bases, extracting named entities, or preparing training data for language models. The trade-off is speed: lemmatization requires a dictionary lookup and, for accurate results, usually POS tagging as well, making it slower than rule-based stemming.

Last updated: March 6, 2026
