K-Nearest Neighbor

“Show me who your friends are and I’ll tell you who you are.” This is the assumption k-NN uses in order to identify the class label of a point.

In machine learning, k-NN is a non-parametric technique used for both classification and regression. The idea is simple: just look at the k nearest neighbors of a point to choose its class label. The goal of this article is to cover the mathematical aspects of k-NN.

k-NN for classification

The idea behind classification is simple: just take a majority vote of the class labels among the k nearest neighbors. The class label with the most votes is assigned to the point. This makes implementing k-NN extremely simple (for illustration, see the figure below).

Example of k-NN classification. The test sample (green circle) should be classified either to the first class of blue squares or to the second class of red triangles. If k = 3 (solid line circle) it is assigned to the second class because there are 2 triangles and only 1 square inside the inner circle. If k = 5 (dashed line circle) it is assigned to the first class (3 squares vs. 2 triangles inside the outer circle). Image source: Wikipedia.

k-NN algorithm for classification:

  1. Choose k.
  2. Compute the distance from the query point to every point in the training data.
  3. Find the k nearest neighbors of the query point.
  4. Take a majority vote of their class labels.
  5. Assign the class label with the highest vote to the query point (see the sketch after this list).
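
Below is a minimal sketch of these steps in Python, assuming NumPy is available; the names knn_classify, X_train, y_train and the toy data are purely illustrative.

import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, query, k=3):
    """Classify `query` by majority vote among its k nearest neighbors."""
    # Step 2: Euclidean distance from the query point to every training point
    distances = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    # Step 3: indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Steps 4-5: majority vote over their class labels
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Illustrative data: two well-separated 2-D classes
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array([0, 0, 1, 1])
print(knn_classify(X_train, y_train, np.array([1.1, 0.9]), k=3))  # -> 0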

k-NN for regression

The idea behind regression is simple: find the k nearest neighbors, take the mean or (preferably) the median of their values, and assign that as the value of the query point. There is no figure here to describe k-NN regression graphically.

k-NN algorithm for regression:

  1. Choose k.
  2. Compute the distance from the query point to every point in the training data.
  3. Find the k nearest neighbors of the query point.
  4. Take the mean or median of the values of the k neighbors.
  5. Assign it as the value of the query point (see the sketch after this list).
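
Under the same assumptions (NumPy, illustrative names and toy data), the regression variant differs only in the last step, using the median as preferred above:

import numpy as np

def knn_regress(X_train, y_train, query, k=3):
    """Predict the value of `query` as the median of its k nearest neighbors."""
    distances = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    return np.median(y_train[nearest])  # swap in np.mean for the mean instead

X_train = np.array([[1.0], [2.0], [3.0], [10.0]])
y_train = np.array([1.1, 1.9, 3.2, 9.8])
print(knn_regress(X_train, y_train, np.array([2.5]), k=3))  # -> 1.9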

There are two factors that directly affect the accuracy and performance of k-NN: the choice of distance metric and the value of k. Let’s first discuss which distance metric to use.

There are many distance metrics available, but we will focus on the two most popular ones:

  1. Euclidean distance
  2. Manhattan distance

Euclidean distance: This can be interpreted as the direct, straight-line distance from one point to another, or in simple words, the Pythagorean theorem. The Euclidean distance from one point to another can easily be calculated as follows.

Let x = \{x_1, x_2, x_3, \dots, x_n\} and y = \{y_1, y_2, y_3, \dots, y_n\}; then the Euclidean distance is

d = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + (x_3 - y_3)^2 + \dots + (x_n - y_n)^2} = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2}
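
As a quick sketch in Python (the vectors x and y are made up for illustration), this is a one-liner with NumPy:

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])
d = np.sqrt(((x - y) ** 2).sum())  # equivalent to np.linalg.norm(x - y)
print(d)  # 5.0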

Manhattan distance: This can be interpreted as the distance from one point to another along axis-aligned steps, like walking a city grid, rather than the straight-line distance. The Manhattan distance from one point to another can easily be calculated as follows.

Let x = \{x_1, x_2, x_3, \dots, x_n\} and y = \{y_1, y_2, y_3, \dots, y_n\}; then the Manhattan distance is

d = |x_1 - y_1| + |x_2 - y_2| + |x_3 - y_3| + \dots + |x_n - y_n| = \sum_{i=1}^{n}|x_i - y_i|
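
The same illustrative vectors give the Manhattan distance as a sum of absolute differences:

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])
d = np.abs(x - y).sum()  # equivalent to np.linalg.norm(x - y, ord=1)
print(d)  # 7.0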

Now, to choose the best value of k, there are several options available, but note that these methods are not universal.

  • Choose k on the basis of intuition (not recommended).
  • Choose k on the basis of the data (very problem specific).
  • Use cross-validation to choose the best k (recommended approach; see the sketch after this list).
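
A sketch of the cross-validation approach using scikit-learn; the Iris dataset and the candidate range of k are just examples:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Score each candidate k with 5-fold cross validation and keep the best one
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in range(1, 21)}
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])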

There are also some modified versions of k-NN; both are illustrated in the sketch after this list:

  1. Weighted k-NN
  2. Using the Minkowski distance metric.
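
Both variants are available, for example, in scikit-learn's KNeighborsClassifier: weights="distance" gives closer neighbors a larger vote, and p selects the order of the Minkowski metric (p=1 is Manhattan, p=2 is Euclidean). A short sketch, reusing the Iris data from above:

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Weighted k-NN with the Minkowski metric of order 1 (Manhattan distance)
model = KNeighborsClassifier(n_neighbors=5, weights="distance", p=1).fit(X, y)
print(model.predict(X[:3]))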

k-NN is one of the simplest approaches in all of machine learning. Even with this simplicity, k-NN can give very good results. Hope you enjoyed reading!

HAPPY MACHINE LEARNING

 
