The KNN algorithm is widely used in ML applications, from supervised settings such as classification and regression to retrieving similar items in recommendation systems, search, question answering, and so on.
What is the KNN Algorithm?
KNN for Nearest Neighbour Search: The KNN algorithm retrieves the K data points that are nearest in distance to a query point. It can be used for classification or regression by aggregating the target values of these nearest neighbours to make a prediction. However, simply retrieving the nearest neighbours is itself an important operation in several applications. For instance, in a movie recommender system, once we have a suitable vector representation for every movie, recommending the five movies closest to a given one amounts to retrieving its five nearest neighbour vectors.
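As a minimal sketch of that retrieval step (the movie vectors below are random placeholders standing in for real learned representations):

```python
import numpy as np

rng = np.random.default_rng(0)
movie_vectors = rng.normal(size=(1000, 64))   # 1000 movies, 64-dim placeholder vectors
query = movie_vectors[42]                     # the movie we start from

# Euclidean distance from the query to every movie
distances = np.linalg.norm(movie_vectors - query, axis=1)

# Indices of the 5 closest movies, skipping index 0 (the query itself)
nearest = np.argsort(distances)[1:6]
print(nearest)
```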
KNN for Classification: KNN can be used for classification in a supervised setting where we are given a dataset with target labels. KNN finds the k nearest data points in the training set and predicts the target label as the mode of the labels of these k nearest neighbours.
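A minimal sketch of this procedure in plain NumPy (the function name knn_classify is just illustrative):

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x, k=5):
    """Predict the label of x as the mode of its k nearest neighbours' labels."""
    distances = np.linalg.norm(X_train - x, axis=1)   # distance to every training point
    nearest = np.argsort(distances)[:k]               # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]
```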
KNN for Regression: KNN can be used for regression in a supervised setting where we are given a dataset with continuous target values. KNN finds the k nearest data points in the training set and predicts the target value as the mean of the target values of these k nearest neighbours.
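The regression variant differs only in the aggregation step; a minimal sketch, again with an illustrative function name:

```python
import numpy as np

def knn_regress(X_train, y_train, x, k=5):
    """Predict the target of x as the mean of its k nearest neighbours' targets."""
    distances = np.linalg.norm(X_train - x, axis=1)   # distance to every training point
    nearest = np.argsort(distances)[:k]               # indices of the k closest points
    return y_train[nearest].mean()
```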
What are the advantages of KNN?
- Simple to implement and intuitive to understand
- Can learn non-linear decision boundaries when used for classification and regression, and can produce a highly flexible decision boundary by adjusting the value of K.
- No training time for classification/regression: The KNN algorithm has no explicit training step; all the work happens during prediction.
- Constantly evolves with new data: Since there is no explicit training step, as we keep adding new data to the dataset, the predictions adjust without having to retrain the model.
- Single hyperparameter: There is a single hyperparameter, the value of K. This makes hyperparameter tuning easy (see the tuning sketch after this list).
- Choice of distance metric: There are many distance metrics to choose from. Popular choices include Euclidean, Manhattan, Minkowski, and Hamming distance.
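To illustrate how simple tuning is in practice, here is a minimal sketch that searches over both K and the distance metric with scikit-learn's GridSearchCV, using the bundled Iris dataset as a stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Search over K and the distance metric with 5-fold cross-validation
param_grid = {
    "n_neighbors": [1, 3, 5, 7, 9],
    "metric": ["euclidean", "manhattan", "minkowski"],
}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```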
What are the disadvantages of KNN?
- High prediction complexity for large datasets: Not great for large datasets, since the entire training set is processed for every prediction. The time complexity of each prediction is O(MN log k), where M is the dimension of the data and N is the number of instances in the training set. Note that there are specialized ways of organizing the data, such as k-d trees and ball trees, that address this issue and make KNN faster (see the sketch after this list).
- Higher prediction complexity with higher dimensions: The prediction complexity grows with the dimensionality of the data (see the dependence of the time complexity on the dimension M in the previous point).
- KNN assumes equal importance for all features: Since KNN expects points to be close in ALL dimensions, it may not favour points that are very close in several dimensions but farther away in a few. This can be adjusted by choosing an appropriate distance measure. It also means KNN is sensitive to features with different ranges, which can be addressed by pre-processing that scales the features (see the sketch after this list).
- Sensitive to outliers: A single mislabeled example can change the class boundaries. This can be a bigger problem in higher dimensions: since the average separation between points tends to grow with the dimension (the curse of dimensionality), an outlier in even one dimension can have a larger impact.
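The sketch below illustrates two of the mitigations mentioned above, assuming scikit-learn is available: features with very different ranges are standardized first, and a ball tree index is built so a query does not have to scan every training point. The data is synthetic and the names are placeholders.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(0, 1, 100_000),      # feature on a small scale
    rng.normal(0, 1000, 100_000),   # feature on a much larger scale
])

# Scale features so no single feature dominates the distance
X_scaled = StandardScaler().fit_transform(X)

# A ball tree index avoids scanning all N points on every query
index = NearestNeighbors(n_neighbors=5, algorithm="ball_tree").fit(X_scaled)
distances, indices = index.kneighbors(X_scaled[:1])
print(indices)
```

On low-dimensional data, tree-based indexes like this typically answer a query with far fewer than N distance computations, though the benefit fades as the dimension grows.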