Python: an application of knn

This is a short example of how we can use the knn algorithm to classify examples. In this article I’ll be using a dataset from Kaggle.com that unfortunately no longer exists, but you can download the csv file here: data. This is a dataset of employees in a company, and the goal is to study employee attrition.

Basically, it is a very simple dataset: no missing values, small skewness of the data, 5 quantitative features and 2 categorical variables. Each employee is represented by 7 features and labelled 1 or 0, depending on whether the person left or stayed in the company. The goal of this study is to build a model using the knn algorithm which predicts the risk of attrition for each employee. In this article I’ll be writing my own implementation of knn and comparing it to the scikit-learn solution. This is just a quick way to play with the knn algorithm, not a complete data analysis or machine learning project.

Dependencies
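The following imports are assumed throughout this article (numpy and pandas for the data, matplotlib and seaborn for the plots) — a minimal sketch, the exact list in the original code may differ:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```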

Data loading

We use pandas to load the csv file and separate X (input values) from y (labels). Feature values have different ranges, which is a problem when calculating distances (a feature with a large range would have a big impact on the distance whereas the others would only have a small one). Normalizing the data feature-wise between 0 and 1 is a possible solution.
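A minimal sketch of this loading step, assuming the csv file is named HR_comma_sep.csv and the columns follow the usual naming of this Kaggle dataset:

```python
# Load the csv file (file name assumed, adapt to your own copy).
df = pd.read_csv("HR_comma_sep.csv")

# Separate inputs X from labels y (column names assumed).
features = ["satisfaction_level", "last_evaluation", "number_project",
            "average_montly_hours", "time_spend_company",
            "Work_accident", "promotion_last_5years"]
X = df[features].values.astype(float)
y = df["left"].values

# Min-max normalization: scale each feature between 0 and 1.
X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
```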

I also prepare subsets of the dataset in order to be able to test the performance of our model. The scikit-learn library has a function for this: we just need to import train_test_split from model_selection.
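For example (the 80/20 split ratio and random seed are my own choices):

```python
from sklearn.model_selection import train_test_split

# Keep 20% of the examples aside to evaluate the model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```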

Dataset description

Our dataset is composed of 14,999 employees, each represented by 7 features: satisfaction_level, last_evaluation, number_project, average_montly_hours, time_spend_company, Work_accident, promotion_last_5years. Each employee is then labelled in the “left” column: 1 if they left the company, 0 if they stayed.

Data analysis and visualization

Let’s start with a small statistical description of the data.
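With pandas this is a single call:

```python
# Summary statistics (count, mean, std, min, quartiles, max) per column.
df.describe()
```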

Apparently no feature has missing values, which is confirmed by the following command:
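Something along these lines (sketch):

```python
# Count missing values per column; every count should be 0.
df.isnull().sum()
```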

Value ranges vary a lot from one feature to another, which is why normalizing our data was a good idea.

Let’s boxplot each feature for visualization.
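One possible way to draw them, reusing the X and features names from the loading sketch above so all boxes share the same 0–1 scale (the original plots may have been produced differently):

```python
# One boxplot per normalized feature.
pd.DataFrame(X, columns=features).plot(kind="box", figsize=(12, 6), rot=45)
plt.tight_layout()
plt.show()
```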

And a correlation plot, to check for links between variables.
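For instance with a seaborn heatmap (a sketch, again using the column names assumed above):

```python
# Correlation matrix of the features plus the "left" label.
corr = df[features + ["left"]].corr()
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()
```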

Highly correlated features could be combined to form a new feature. In this case we might think about combining average_montly_hours and number_project, which have a correlation of 0.41. Intuitively it seems obvious that these 2 parameters are correlated. Still, I consider this correlation coefficient quite low and decided to keep all the information.

numpy implementation of knn

Each of our individuals is represented by 7 features. Let’s simplify the problem in order to understand how knn works and say that each of our examples is represented by only 2 features. A representation of our dataset in 2-dimensional space could be:

This is the dataset we are going to build our model on. In this example, 2 groups with different labels (green and pink) are clearly differentiated.

Let’s now introduce a new unlabelled employee for whom we would like to predict whether he will stay or leave the company. We can position him in our space.


To which group does this new employee belong?

To answer this, knn proposes to do as follows:

  • calculate the distance from the new employee to every employee in the dataset
  • select the k nearest employees (the k minimum values in our table of distances)
  • vote for the most represented label among the k nearest employees

To measure the distance between 2 employees, we choose the squared Euclidean distance metric.

Each example is represented by its feature values. Let’s have two individuals a=(a_{1},...,a_{n}) and b=(b_{1},...,b_{n}) (in our simplified example, n=2); the squared Euclidean distance between these 2 individuals can be calculated with the following formula:

d(a,b) = \sum_{i=1}^{n}(a_{i}-b_{i})^{2}

function 1 : calculating the distance between 2 elements
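A minimal version of this function (the name distance is my own):

```python
def distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return np.sum((a - b) ** 2)
```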

function 2 : generating the distance table

We iterate through the labelled data and calculate the distance to the new employee for each example in our dataset.
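A possible sketch, reusing the distance helper above:

```python
def distance_table(X_train, new_example):
    """Distance from a new example to every labelled example."""
    return np.array([distance(x, new_example) for x in X_train])
```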

function 3 : vote for the k nearest neighbors

The choice of k can be discussed. My advice about k:

  • k should not be divisible by the number of different labels. In our example we have 2 groups of labels, so k should be an odd number. This avoids ties, where our algorithm could not make a choice for the prediction
  • k should be smaller than the number of individuals in the smallest group. My concern is that if 2 groups are close to each other, a small group may struggle to compete against a bigger neighbouring group.

For this study, I choose k=5.
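A possible sketch of the voting function (helper names are my own):

```python
def vote(distances, y_train, k=5):
    """Return the majority label among the k nearest neighbors."""
    nearest = np.argsort(distances)[:k]   # indices of the k smallest distances
    return np.bincount(y_train[nearest]).argmax()
```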

function 4 : predictions and accuracy

We compare our predictions to our labelled data in the test dataset.
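Putting the pieces together, a sketch of the prediction loop and the accuracy computation:

```python
def predict(X_train, y_train, X_test, k=5):
    """Predict a label for every example of the test set."""
    return np.array([vote(distance_table(X_train, x), y_train, k)
                     for x in X_test])

y_pred = predict(X_train, y_train, X_test, k=5)
print("accuracy: {:.1%}".format(np.mean(y_pred == y_test)))
```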

Results

We reach 95.8% accuracy on our predictions.

Knn is a very simple algorithm that can perform quite well for a preliminary study. It’s not as powerful as other machine learning algorithms, but it’s easy to set up and does not require a lot of computation power.

This version of the algorithm is not optimized, yet we still get pretty good results with it. Let’s implement the scikit-learn version of knn and compare the results with the same value of k=5.
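The scikit-learn version fits in a few lines:

```python
from sklearn.neighbors import KNeighborsClassifier

# Same k=5 as in our numpy implementation.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print("accuracy: {:.1%}".format(knn.score(X_test, y_test)))
```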

Results

We have similar accuracy values, which confirms that our algorithm has been properly coded.

Resources

  1. http://scikit-learn.org/stable/index.html