
K-NN

K-Nearest Neighbour

The k-nearest neighbors (KNN) algorithm is a simple, easy-to-implement supervised machine learning algorithm that can be used to solve both classification and regression problems.

In short, KNN assumes that similar things exist near each other.

The KNN algorithm uses ‘feature similarity’ to predict the values of any new data points.

I am going to explain KNN with a simple example:


In the above table, we have S.No, Height, Weight & Age. For S.No 5 the Weight is missing, so we need to predict the weight of that person based on his Height and Age.

[Graph: the five points from the table plotted with Age on the x-axis and Height on the y-axis; point 5 is circled.]


In the above graph, the x-axis represents the Age and the y-axis represents the Height of a person.
I plotted 5 numbered points; 4 of them have a known output and one does not. Now let's see how KNN helps us.
The 5th point, which is circled, is the one I want to predict.
Hint: by looking at the graph, we can see that point 5 is near points 2 and 4.
YES.

Suppose I choose my k-value as 2; the selected neighbours are points 2 and 4.

For S.No 2 and 4, the weights are 63 and 78.
So we already know the weight of the 5th person should lie between 63 and 78.
Now, what about the exact weight of person 5? KNN takes the average: (63 + 78) / 2 = 70.5.

THE WEIGHT OF THE 5th PERSON IS 70.5. FINALLY, WE GOT IT.
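To make this concrete, here is a minimal sketch of the same prediction using scikit-learn's KNeighborsRegressor with k = 2. Only the weights 63 and 78 (for S.No 2 and 4) come from the example above; the Height and Age values are made-up placeholders, since the original table is not reproduced here.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Features: [Height, Age] for the four people with a known weight.
# NOTE: these Height/Age values are hypothetical placeholders; only the
# weights 63 and 78 (S.No 2 and 4) come from the example above.
X_train = np.array([
    [167, 51],  # S.No 1
    [182, 35],  # S.No 2 -> weight 63
    [176, 45],  # S.No 3
    [173, 39],  # S.No 4 -> weight 78
])
y_train = np.array([59, 63, 71, 78])  # 59 and 71 are placeholders too

# Person 5: Height and Age known, Weight unknown. The placeholder
# coordinates are chosen so that points 2 and 4 are the two nearest.
X_new = np.array([[177, 37]])

knn = KNeighborsRegressor(n_neighbors=2)
knn.fit(X_train, y_train)
print(knn.predict(X_new))  # -> [70.5], the average of 63 and 78
```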

Question: how do we calculate the distance from one point to another point?
To calculate the distance between the unknown data point and the known data points, the techniques below are used.

Answer: the most common methods are Euclidean and Manhattan distance; these techniques are used when we have continuous variables.
Euclidean Distance: Euclidean distance is calculated as the square root of the sum of the squared differences between a new point (x) and an existing point (y).

Manhattan Distance: This is the distance between real vectors, calculated as the sum of the absolute differences of their coordinates.

For categorical variables we use the Hamming Distance:
Hamming Distance: It is used for categorical variables. If the value (x) and the value (y) are the same, the distance D is 0; otherwise D = 1.
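As a quick sketch, here are the three distance measures written out in plain Python (the example points are made up):

```python
from math import sqrt

def euclidean_distance(x, y):
    """Square root of the sum of squared differences."""
    return sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def manhattan_distance(x, y):
    """Sum of absolute differences."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def hamming_distance(x, y):
    """Count of positions where the categorical values differ."""
    return sum(0 if xi == yi else 1 for xi, yi in zip(x, y))

print(euclidean_distance([1, 2], [4, 6]))            # 5.0
print(manhattan_distance([1, 2], [4, 6]))            # 7
print(hamming_distance(["red", "S"], ["red", "M"]))  # 1
```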

KNN Algorithm:
Step 1: Calculate the distance between the unknown data point and all known data points and store the distances in an array.
Step 2: Sort the array in ascending order of distance.
Step 3: Based on the number of neighbours k selected by the user, select the first k rows from the array.
Step 4: Perform voting among the selected records and pick the label that wins the majority.
Step 5: Assign the winning label to the unknown data point.
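Here is a minimal from-scratch sketch of these five steps for classification, reusing the Euclidean distance function from above (the training points are hypothetical):

```python
from collections import Counter
from math import sqrt

def euclidean_distance(x, y):
    return sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def knn_predict(X_train, y_train, x_new, k):
    # Step 1: distance from the unknown point to every known point.
    distances = [(euclidean_distance(x, x_new), label)
                 for x, label in zip(X_train, y_train)]
    # Step 2: sort in ascending order of distance.
    distances.sort(key=lambda pair: pair[0])
    # Step 3: take the first k rows.
    k_nearest = distances[:k]
    # Steps 4 and 5: majority vote, then assign the winning label.
    votes = Counter(label for _, label in k_nearest)
    return votes.most_common(1)[0][0]

# Tiny usage example with hypothetical 2-D points.
X_train = [[1, 1], [2, 1], [8, 8], [9, 7]]
y_train = ["A", "A", "B", "B"]
print(knn_predict(X_train, y_train, [1.5, 1.2], k=3))  # -> "A"
```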
How to Select the K-value in KNN?
If we choose a very small value (k = 1), the model overfits the training data, which leads to a high error on the validation data.
If we choose a very high k-value, the model underfits and performs poorly on both the training dataset and the testing dataset.

The best K-value changes for every dataset, i.e. it doesn't have any default value.
The best technique for selecting the K-value is to plot the validation error for a range of k values (the elbow method) and pick the k where the curve flattens.
It will definitely help you.
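Here is a sketch of the elbow method with scikit-learn; the iris dataset is just a convenient stand-in for your own data:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42)

k_values = range(1, 26)
errors = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    errors.append(1 - knn.score(X_val, y_val))  # validation error rate

plt.plot(list(k_values), errors, marker="o")
plt.xlabel("k")
plt.ylabel("Validation error")
plt.title("Elbow method for choosing k")
plt.show()
```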


When to use KNN:
a) KNN works well on small datasets.
b) It won't work well on large datasets, because every prediction must compute the distance to all training points.
c) KNN needs scaling. YES!! When we calculate distances, a feature with a large range dominates features with small ranges, so the neighbours we find can be misleading. The model will still run without scaling, but the results suffer, so it is better to scale the data.
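A minimal sketch of scaling before KNN with scikit-learn's StandardScaler (again, the dataset is just a stand-in):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# The pipeline fits the scaler on the training data only and applies
# the same transformation to the test data, avoiding leakage.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```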

Pros: Simple and easy to implement; no training step is needed before making predictions; works for both classification and regression problems.

Cons: Accuracy depends on the quality of the data; prediction becomes slow on large datasets; the algorithm is sensitive to unscaled features.


Finally, love your Neighbour.... 😉

FOR THE CODE PART, PLEASE CHECK MY GITHUB.
