In this chapter, we will understand the concepts of k-Nearest Neighbour (kNN) algorithm.
In this chapter, we will understand the concepts of the k-Nearest Neighbour (kNN) algorithm.
Theory
------
kNN is one of the simplest of classification algorithms available for supervised learning. The idea
is to search for closest match of the test data in feature space. We will look into it with below
kNN is one of the simplest classification algorithms available for supervised learning. The idea
is to search for the closest match(es) of the test data in the feature space. We will look into it with the below
image.
![image](images/knn_theory.png)
In the image, there are two families, Blue Squares and Red Triangles. We call each family as
**Class**. Their houses are shown in their town map which we call feature space. *(You can consider
a feature space as a space where all datas are projected. For example, consider a 2D coordinate
space. Each data has two features, x and y coordinates. You can represent this data in your 2D
coordinate space, right? Now imagine if there are three features, you need 3D space. Now consider N
features, where you need N-dimensional space, right? This N-dimensional space is its feature space.
In our image, you can consider it as a 2D case with two features)*.
Now a new member comes into the town and creates a new home, which is shown as green circle. He
should be added to one of these Blue/Red families. We call that process, **Classification**. What we
do? Since we are dealing with kNN, let us apply this algorithm.
One method is to check who is his nearest neighbour. From the image, it is clear it is the Red
Triangle family. So he is also added into Red Triangle. This method is called simply **Nearest
Neighbour**, because classification depends only on the nearest neighbour.
But there is a problem with that. Red Triangle may be the nearest. But what if there are lot of Blue
Squares near to him? Then Blue Squares have more strength in that locality than Red Triangle. So
just checking nearest one is not sufficient. Instead we check some k nearest families. Then whoever
is majority in them, the new guy belongs to that family. In our image, let's take k=3, ie 3 nearest
families. He has two Red and one Blue (there are two Blues equidistant, but since k=3, we take only
In the image, there are two families: Blue Squares and Red Triangles. We refer to each family as
a **Class**. Their houses are shown in their town map which we call the **Feature Space**. You can consider
a feature space as a space where all data are projected. For example, consider a 2D coordinate
space. Each datum has two features, a x coordinate and a y coordinate. You can represent this datum in your 2D
coordinate space, right? Now imagine that there are three features, you will need 3D space. Now consider N
features: you need N-dimensional space, right? This N-dimensional space is its feature space.
In our image, you can consider it as a 2D case with two features.
Now consider what happens if a new member comes into the town and creates a new home, which is shown as the green circle. He
should be added to one of these Blue or Red families (or *classes*). We call that process, **Classification**. How exactly should this new member be classified? Since we are dealing with kNN, let us apply the algorithm.
One simple method is to check who is his nearest neighbour. From the image, it is clear that it is a member of the Red
Triangle family. So he is classified as a Red Triangle. This method is called simply **Nearest Neighbour** classification, because classification depends only on the *nearest neighbour*.
But there is a problem with this approach! Red Triangle may be the nearest neighbour, but what if there are also a lot of Blue
Squares nearby? Then Blue Squares have more strength in that locality than Red Triangles, so
just checking the nearest one is not sufficient. Instead we may want to check some **k** nearest families. Then whichever family is the majority amongst them, the new guy should belong to that family. In our image, let's take k=3, i.e. consider the 3 nearest
neighbours. The new member has two Red neighbours and one Blue neighbour (there are two Blues equidistant, but since k=3, we can take only
one of them), so again he should be added to Red family. But what if we take k=7? Then he has 5 Blue
families and 2 Red families. Great!! Now he should be added to Blue family. So it all changes with
value of k. More funny thing is, what if k = 4? He has 2 Red and 2 Blue neighbours. It is a tie !!!
So better take k as an odd number. So this method is called **k-Nearest Neighbour** since
classification depends on k nearest neighbours.
neighbours and 2 Red neighbours and should be added to the Blue family. The result will vary with the selected
value of k. Note that if k is not an odd number, we can get a tie, as would happen in the above case with k=4. We would see that our new member has 2 Red and 2 Blue neighbours as his four nearest neighbours and we would need to choose a method for breaking the tie to perform classification. So to reiterate, this method is called **k-Nearest Neighbour** since
classification depends on the *k nearest neighbours*.
Again, in kNN, it is true we are considering k neighbours, but we are giving equal importance to
all, right? Is it justice? For example, take the case of k=4. We told it is a tie. But see, the 2
Red families are more closer to him than the other 2 Blue families. So he is more eligible to be
added to Red. So how do we mathematically explain that? We give some weights to each family
depending on their distance to the new-comer. For those who are near to him get higher weights while
those are far away get lower weights. Then we add total weights of each family separately. Whoever
gets highest total weights, new-comer goes to that family. This is called **modified kNN**.
all, right? Is this justified? For example, take the tied case of k=4. As we can see, the 2
Red neighbours are actually closer to the new member than the other 2 Blue neighbours, so he is more eligible to be
added to the Red family. How do we mathematically explain that? We give some weights to each neighbour
depending on their distance to the new-comer: those who are nearer to him get higher weights, while
those that are farther away get lower weights. Then we add the total weights of each family separately and classify the new-comer as part of whichever family
received higher total weights. This is called **modified kNN** or **weighted kNN**.
So what are some important things you see here?
- You need to have information about all the houses in town, right? Because, we have to check
the distance from new-comer to all the existing houses to find the nearest neighbour. If there
are plenty of houses and families, it takes lots of memory, and more time for calculation
also.
- There is almost zero time for any kind of training or preparation.
- Because we have to check
the distance from the new-comer to all the existing houses to find the nearest neighbour(s), you need to have information about all of the houses in town, right? If there are plenty of houses and families, it takes a lot of memory, and also more time for calculation.
- There is almost zero time for any kind of "training" or preparation. Our "learning" involves only memorizing (storing) the data, before testing and classifying.
Now let's see it in OpenCV.
Now let's see this algorithm at work in OpenCV.
kNN in OpenCV
-------------
@ -67,11 +61,11 @@ We will do a simple example here, with two families (classes), just like above.
chapter, we will do an even better example.
So here, we label the Red family as **Class-0** (so denoted by 0) and Blue family as **Class-1**
(denoted by 1). We create 25 families or 25 training data, and label them either Class-0 or Class-1.
We do all these with the help of Random Number Generator in Numpy.
(denoted by 1). We create 25 neighbours or 25 training data, and label each of them as either part of Class-0 or Class-1.
We can do this with the help of a Random Number Generator from NumPy.
Then we plot it with the help of Matplotlib. Red families are shown as Red Triangles and Blue
families are shown as Blue Squares.
Then we can plot it with the help of Matplotlib. Red neighbours are shown as Red Triangles and Blue
neighbours are shown as Blue Squares.
@code{.py}
import cv2 as cv
import numpy as np
@ -80,36 +74,36 @@ import matplotlib.pyplot as plt
# Feature set containing (x,y) values of 25 known/training data