
K-Nearest Neighbour (K-NN)

The k-nearest neighbors (KNN) algorithm is a simple, easy-to-implement supervised machine learning algorithm that can be used to solve both classification and regression problems.

In short, KNN assumes that similar things exist near each other.

The KNN algorithm uses feature similarity to predict the value of any new data point.

Let me explain KNN with a simple example:


In the above table, we have S.No, Height, Weight, and Age. For S.No 5 the weight is missing, so we need to predict that person's weight based on their Height and Age.

graph example


In the above graph, the x-axis represents Age and the y-axis represents the Height of a person. I have plotted all five persons: four of them have a known weight, and one (the circled point, person 5) does not. Now let's see how KNN helps us.
Hint: just by looking at the graph, can we see that point 5 is closest to points 4 and 2?
YES.

Suppose I choose my k-value as 2. The selected neighbours are persons 2 and 4, whose weights are 63 and 78.

So we already know that the weight of person 5 should lie between 63 and 78. To get the actual prediction, KNN takes the average: (63 + 78) / 2 = 70.5.

THE WEIGHT OF THE 5th PERSON IS 70.5. FINALLY, WE GOT IT.
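The averaging step above can be sketched in a few lines of Python. Note that the height and age values below are made up for illustration (only the neighbour weights 63 and 78 come from the example), so treat this as a sketch rather than the post's actual table:

```python
def knn_regress(query, points, k=2):
    """points: list of ((height, age), weight); query: (height, age).
    Predict by averaging the weights of the k nearest known points."""
    # Rank known points by squared Euclidean distance to the query
    ranked = sorted(
        points,
        key=lambda p: (p[0][0] - query[0]) ** 2 + (p[0][1] - query[1]) ** 2,
    )
    nearest = ranked[:k]
    # Average the weights of the k nearest neighbours
    return sum(weight for _, weight in nearest) / k

# Illustrative data: two of the persons have weights 63 and 78 as in the post
points = [((167, 51), 61), ((182, 29), 63), ((176, 54), 69), ((173, 34), 78)]
print(knn_regress((170, 30), points, k=2))  # → 70.5
```

With k = 2 the two nearest neighbours of the query happen to be the 63 kg and 78 kg persons, so the prediction is their average, 70.5.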

Question: How do we calculate the distance between one data point and another?

Answer: The most common methods are Euclidean and Manhattan distance; these techniques are used when the features are continuous.
Euclidean Distance: Euclidean distance is calculated as the square root of the sum of the squared differences between a new point (x) and an existing point (y).

Manhattan Distance: This is the distance between real vectors, calculated as the sum of their absolute differences.

For categorical variables we use Hamming Distance:
Hamming Distance: It is used for categorical variables. If the value (x) and the value (y) are the same, the distance D is 0; otherwise D = 1. The per-feature distances are summed over all features.
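All three distance measures can be written directly from their definitions. A minimal sketch:

```python
import math

def euclidean(x, y):
    # Square root of the sum of squared differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # Sum of absolute differences
    return sum(abs(a - b) for a, b in zip(x, y))

def hamming(x, y):
    # Count of positions where the categorical values differ
    return sum(1 for a, b in zip(x, y) if a != b)

print(euclidean((0, 0), (3, 4)))            # → 5.0
print(manhattan((0, 0), (3, 4)))            # → 7
print(hamming(("red", "S"), ("red", "M")))  # → 1
```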

KNN Algorithm:
Step 1: Calculate the distance between the unknown data point and all known data points, and store the distances in an array.
Step 2: Sort the array in ascending order of distance.
Step 3: Based on the k chosen by the user, select the first k rows from the array.
Step 4: Take a majority vote among the labels of the selected records.
Step 5: Assign the winning label to the unknown data point.
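The five steps map almost line-for-line onto code. A minimal sketch for classification (the dataset and labels below are invented for illustration):

```python
import math
from collections import Counter

def knn_classify(query, data, k):
    """data: list of (features, label); query: a feature tuple."""
    # Step 1: distance from the query to every known point
    dists = [(math.dist(query, feats), label) for feats, label in data]
    # Step 2: sort ascending by distance
    dists.sort(key=lambda d: d[0])
    # Step 3: take the first k rows
    top_k = dists[:k]
    # Step 4: majority vote among the k labels
    votes = Counter(label for _, label in top_k)
    # Step 5: assign the winning label
    return votes.most_common(1)[0][0]

data = [((1, 1), 'A'), ((1, 2), 'A'), ((5, 5), 'B'), ((6, 5), 'B')]
print(knn_classify((1.5, 1.5), data, k=3))  # → A
```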
How to Select the K value in KNN?
If we choose a very small value (k = 1), the model overfits the training data and gives a high error on validation data.
If we choose a very high k-value, the model performs poorly on both the training and testing datasets (underfitting).

The best K-value changes from dataset to dataset, i.e. it doesn't have a default value.
A good technique for selecting the K-value is the elbow method: plot the validation error against k and pick the point where the error stops improving.
It will definitely help you.
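A rough sketch of the elbow idea with a tiny made-up dataset: compute the validation error for each candidate k and look for where the error levels off (normally you would plot these values and pick the "elbow"):

```python
import math
from collections import Counter

def knn_predict(query, train, k):
    # Majority vote among the k nearest training points
    ranked = sorted(train, key=lambda p: math.dist(query, p[0]))
    return Counter(label for _, label in ranked[:k]).most_common(1)[0][0]

# Made-up training and validation sets, purely for illustration
train = [((1, 1), 'A'), ((1, 2), 'A'), ((2, 1), 'A'),
         ((5, 5), 'B'), ((6, 5), 'B'), ((5, 6), 'B')]
val = [((1.5, 1.5), 'A'), ((5.5, 5.5), 'B')]

def validation_error(k):
    wrong = sum(knn_predict(x, train, k) != y for x, y in val)
    return wrong / len(val)

# Error per k; on real data you would plot this curve and pick the elbow
for k in range(1, 6):
    print(k, validation_error(k))
```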


When to use KNN:
a) KNN works well with small datasets.
b) It won't work well with large datasets, because every prediction requires computing the distance to all training points.
c) KNN needs scaling. YES!! When we calculate distances, a feature with a large range (say height in cm) can dominate one with a small range (say age), so the neighbours it finds are distorted. The model will still run without scaling, but the results are worse and slower. So it is better to scale the data.
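One common way to scale is min-max scaling, which rescales each feature to [0, 1] so no single feature dominates the distance. A small sketch with illustrative values:

```python
def min_max_scale(rows):
    """rows: list of equal-length numeric tuples. Rescale each column to [0, 1]."""
    cols = list(zip(*rows))                 # transpose: one tuple per feature
    lo = [min(c) for c in cols]             # per-feature minimum
    hi = [max(c) for c in cols]             # per-feature maximum
    return [
        tuple((v - l) / (h - l) if h != l else 0.0
              for v, l, h in zip(row, lo, hi))
        for row in rows
    ]

# Illustrative rows: (height in cm, weight in kg)
rows = [(150, 40), (200, 100)]
print(min_max_scale(rows))  # → [(0.0, 0.0), (1.0, 1.0)]
```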

Pros: Simple and easy to implement, no explicit training phase, and it works for both classification and regression.

Cons: Accuracy depends on the quality of the data, and prediction is slow on large datasets.


Finally, love your Neighbour....😉

FOR THE CODE PART, PLEASE CHECK MY GITHUB.
