Here is a summary of machine learning techniques in R. It is meant as a quick reminder, with some of my own comments.
1. Linear regression and multiple regression
This regression algorithm tries to fit a model to the dataset, assuming that the relationship between variables is linear. It is a good way to get a first look at your data, but it is likely to underfit and it cannot represent nonlinear relationships between variables.
Linear regression: 1 Y, 1 X
A short example with the iris dataset. If we plot the petal length against the sepal length, it looks like a linear model could fit these values.
library(ggplot2) # needed for the chart below
fit <- lm(Petal.Length ~ Sepal.Length, data = iris) # fit the linear model
fit # display the coefficients of the formula (y = ax + b)
chart <- ggplot(iris) + geom_jitter(aes(x = Sepal.Length, y = Petal.Length)) + ggtitle("Petal vs Sepal Length")
chart
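To judge the quality of the fit, you can inspect the model with summary() and overlay the regression line on the chart with geom_smooth(). A minimal sketch, reusing the fit model and the ggplot2 import from the block above:

summary(fit) # coefficients, R-squared, p-values
ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) +
  geom_jitter() +
  geom_smooth(method = "lm") + # draws the fitted line with a confidence band
  ggtitle("Petal vs Sepal Length")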
Multiple regression: 1 Y, several X
R base library
fit <- lm(y ~ x1 + x2, dataset) # generate the model
plot(fit) # diagnostic plots (residuals, QQ plot, ...)
predict(fit, data.frame(x1 = value1, x2 = value2), interval = "confidence") # predictions
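As a concrete illustration (my example, not part of the original snippet), predicting fuel consumption from weight and horsepower in the built-in mtcars dataset:

fit <- lm(mpg ~ wt + hp, data = mtcars) # mpg explained by weight (wt) and horsepower (hp)
summary(fit) # coefficients and R-squared
predict(fit, data.frame(wt = 3.0, hp = 120), interval = "confidence") # prediction with its confidence interval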
2. Logistic regression
Logistic regression is used to predict a category from a set of features. Unlike linear regression, which predicts a continuous value, it predicts the category an element is most likely to belong to.
fit <- glm(y ~ x1 + x2, family = binomial(link = "logit"), dataset) # generate the model
plot(fit) # diagnostic plots
predict(fit, data.frame(x1 = value1, x2 = value2), type = "response") # predicted probabilities
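For instance (my example), predicting the transmission type in mtcars, where am is already coded as 0/1:

fit <- glm(am ~ wt + hp, family = binomial(link = "logit"), data = mtcars)
summary(fit)
predict(fit, data.frame(wt = 2.5, hp = 110), type = "response") # probability that am = 1 (manual)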
3. Support Vector Machine
Support Vector Machines can model more complex relationships between parameters than linear regression. By tuning parameters such as the kernel, cost and gamma, you can also reduce the overall prediction error. We need to install the e1071 package in order to use svm(); a usage sketch follows the installation snippet below.
install.packages("e1071") # installing the library
library(e1071) # loading the library
4. Kmeans
We can qualify this algorithm as "lazy": for a set of data we do not presume the definitions of the categories (unsupervised learning). We let the algorithm separate the examples and group them into clusters, deciding the boundaries itself. You only have to set the number of centers when launching the computation, i.e. estimate the number of categories (one heuristic for choosing it is sketched at the end of this section).
R base library
k <- kmeans(data, centers = 3)
plot(data, col = k$cluster) # colour each point by its cluster
library(dplyr)
library(ggplot2)
data <- select(iris, Sepal.Length, Sepal.Width)
k <- kmeans(data, centers = 3)
data <- cbind(data, cluster = k$cluster) # we bind the cluster label to each point of the dataset
chart <- ggplot(data, aes(x = Sepal.Length, y = Sepal.Width, col = factor(cluster))) + geom_jitter()
chart
Each point is coloured according to its cluster, which makes the 3 different categories clearly visible.
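One way to estimate the number of categories (my addition, the classic elbow heuristic): compute the total within-cluster sum of squares for several values of k and look for the bend in the curve:

data <- iris[, c("Sepal.Length", "Sepal.Width")]
wss <- sapply(1:8, function(k) kmeans(data, centers = k, nstart = 10)$tot.withinss)
plot(1:8, wss, type = "b", xlab = "number of clusters k", ylab = "total within-cluster sum of squares")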
5. Random Forest
Random Forest algorithms are able to perform classification tasks. They may overfit the training set, so evaluate them on held-out data.
randomForest library: install.packages("randomForest")
library(randomForest)
rf <- randomForest(x, y, ntree = 50) # x: matrix of input features, y: vector of output labels
p <- predict(rf, x) # predictions
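A concrete run on iris (my example), where the first four columns are the features and Species is the label:

library(randomForest)
set.seed(42) # reproducibility: the forest is built from random samples
rf <- randomForest(iris[, -5], iris$Species, ntree = 50)
p <- predict(rf, iris[, -5])
table(p, iris$Species) # confusion matrix on the training set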
6. Neural Network
Neural networks are able to model complex relations between parameters, but they are difficult to tune. The neuralnet library can be used to build simple models between inputs and outputs.
neuralnet library: install.packages("neuralnet")
library(neuralnet)
nn <- neuralnet(y ~ x1 + x2 + x3, data, hidden = c(3, 2)) # hidden gives the number of neurons per hidden layer (here 3 in the first hidden layer and 2 in the second)
p <- compute(nn, test)$net.result # predictions (compute returns a list; net.result holds the outputs)
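A small runnable sketch on mtcars (my example; the variables are scaled first, since neural networks train poorly on raw, unscaled inputs):

library(neuralnet)
d <- as.data.frame(scale(mtcars[, c("mpg", "wt", "hp", "disp")])) # scale all variables
set.seed(42)
nn <- neuralnet(mpg ~ wt + hp + disp, data = d, hidden = c(3, 2))
plot(nn) # draws the network with its weights
p <- compute(nn, d[, c("wt", "hp", "disp")])$net.result # predictions, in scaled units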