
KNN

Fundamentals

K-Nearest Neighbors - a simple algorithm that classifies a data point based on the majority class of its K closest neighbors in the feature space.

K-Nearest Neighbors (KNN) is a non-parametric supervised learning algorithm that makes predictions based on the similarity of a new data point to its neighbors in the training set. For classification, it assigns the class most common among the K nearest neighbors. For regression, it returns the average of their values. The algorithm stores the entire training dataset and computes distances at prediction time, making it an instance-based or lazy learning method.
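The prediction step can be sketched in a few lines. This is a minimal, illustrative implementation (the function name `knn_predict` and the toy dataset are made up for this example), not a production library:

```python
from collections import Counter
import math

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    # Lazy learning: nothing is fit in advance. At prediction time we
    # compute the Euclidean distance from x to every stored training point.
    dists = [(math.dist(p, x), label) for p, label in zip(X_train, y_train)]
    dists.sort(key=lambda t: t[0])
    k_labels = [label for _, label in dists[:k]]
    # Majority class among the k nearest neighbors
    return Counter(k_labels).most_common(1)[0][0]

# Toy 2-D dataset: two well-separated clusters
X = [(1, 1), (1, 2), (2, 1), (6, 6), (6, 7), (7, 6)]
y = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(X, y, (2, 2), k=3))  # prints "a"
```

For regression, the last line would instead return the mean of the neighbors' values, e.g. `sum(k_values) / k`.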

The choice of K (the number of neighbors) directly affects model behavior. A small K (such as 1 or 3) makes the model sensitive to noise and outliers but captures local patterns; a large K smooths out noise but may blur class boundaries. For binary classification, an odd K avoids ties in the majority vote. The distance metric also matters: Euclidean distance is the usual default, but Manhattan distance, the more general Minkowski distance (of which Euclidean and Manhattan are the p = 2 and p = 1 cases), and cosine similarity are used depending on the data. Feature scaling is critical because KNN relies on distance calculations, and unscaled features with larger ranges will dominate the distance computation.
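The scaling problem is easy to see numerically. In the sketch below (the `min_max_scale` helper and the income-like feature are illustrative assumptions), a feature ranging over tens of thousands swamps a feature ranging over [0, 1] until both are rescaled:

```python
import math

def min_max_scale(X):
    """Rescale each feature column to [0, 1] so no feature dominates distances."""
    cols = list(zip(*X))
    mins = [min(c) for c in cols]
    # Guard against zero range to avoid division by zero
    rngs = [(max(c) - mn) or 1.0 for c, mn in zip(cols, mins)]
    return [tuple((v - mn) / r for v, mn, r in zip(row, mins, rngs)) for row in X]

# Feature 0 lies in [0, 1]; feature 1 is income-like, in [0, 100_000]
a, b = (0.0, 50_000.0), (1.0, 50_010.0)
print(math.dist(a, b))  # ~10.05: dominated by the income difference of 10

# After scaling (extra rows pin down each feature's observed range),
# the full unit difference in feature 0 drives the distance instead
sa, sb = min_max_scale([a, b, (0.5, 0.0), (0.5, 100_000.0)])[:2]
print(math.dist(sa, sb))  # ~1.0
```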

KNN's simplicity makes it a useful baseline and teaching tool, but it has practical limitations. Prediction is slow on large datasets because each query computes a distance to every training point (O(nd) for n training points and d features, unless an index such as a k-d tree or ball tree is used). High-dimensional data suffers from the curse of dimensionality: as dimensions increase, distances between points concentrate and become less informative. Despite these limitations, KNN remains widely used for recommendation systems, anomaly detection, and as a component in approximate nearest neighbor search, which underpins the vector databases used in retrieval-augmented generation systems.
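Distance concentration can be demonstrated directly. The sketch below (the `contrast` helper is an illustrative construction, assuming points drawn uniformly from the unit hypercube) measures the ratio of the farthest to the nearest neighbor distance from the origin; as the dimension grows, the ratio shrinks toward 1, meaning "nearest" barely differs from "farthest":

```python
import math
import random

def contrast(dim, n=500, seed=0):
    """Farthest-to-nearest distance ratio from the origin for n random
    points in [0, 1]^dim. A value near 1.0 means distances carry little
    information about which points are actually 'close'."""
    rng = random.Random(seed)
    origin = [0.0] * dim
    dists = [math.dist(origin, [rng.random() for _ in range(dim)])
             for _ in range(n)]
    return max(dists) / min(dists)

for d in (2, 10, 100, 1000):
    print(d, round(contrast(d), 2))  # the ratio drops toward 1 as d grows
```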

Last updated: February 27, 2026