Python : statistics and simple models

Work in progress

This page aims to give an overview of basic data analysis and of how we should approach a dataset. Which tools can we use to gain better insight into our data?

1. Data

quantitative data : variables made of numerical values. Values can be continuous or discrete.

Ex1 : temperature is a continuous quantitative variable

Ex2 : number of kids in a family is a discrete quantitative variable

qualitative data : variables that represent categories. Labeling can be done using text or numerical values. A qualitative variable can be ordinal or nominal. We say it is ordinal if ordering its values has a meaning.

Ex3 : locations (eg countries) are nominal qualitative values

Ex4 : market segments can be ordinal qualitative values

dates, times

Different formats can be found in datasets. Here is a description of some usual formats and some tools to manipulate them in Python.

Nothing fancy here; we just need to remember that several formats for time can be found in datasets and that we will have to translate everything to the same format for further analysis. The pandas library has great tools to test whether a particular timestamp belongs to a period of time.
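For instance, a minimal sketch with pandas (the dates below are made up) :

    import pandas as pd

    # convert strings in different formats to a common pandas Timestamp type
    t1 = pd.to_datetime("2021-03-14 15:09:26")
    t2 = pd.to_datetime("14/03/2021", dayfirst=True)

    # test whether a timestamp belongs to a period of time
    march = pd.Period("2021-03", freq="M")
    print(t1.to_period("M") == march)                  # True
    print(march.start_time <= t2 <= march.end_time)    # True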

pictures (as bitmap arrays)

Pictures are arrays of pixels. In a grayscale image, each pixel can be represented by a value between 0 (black) and 255 (white).
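A minimal sketch with numpy and matplotlib, using a small synthetic image :

    import numpy as np
    import matplotlib.pyplot as plt

    # a tiny synthetic grayscale "image": an 8x8 horizontal gradient
    img = np.tile(np.linspace(0, 255, 8, dtype=np.uint8), (8, 1))
    print(img.shape)   # (8, 8): one intensity value per pixel

    plt.imshow(img, cmap="gray", vmin=0, vmax=255)   # 0 shown as black, 255 as white
    plt.colorbar(label="pixel intensity")
    plt.show()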


sounds (.wav format)

original mp3 piano file : https://www.auboutdufil.com/get.php?web=https://archive.org/download/auboutdufil-archives/492/Myuu-TenderRemains.mp3

We can visualize the 2 channels (orange and blue) of the audio file. The discrete time step is \frac{1}{f_{s}} second.
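A sketch with scipy, assuming the linked mp3 has been converted beforehand to a stereo wav file named piano.wav (a hypothetical name) :

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.io import wavfile

    fs, samples = wavfile.read("piano.wav")   # fs: sampling rate, samples: (n, 2) for stereo

    t = np.arange(samples.shape[0]) / fs      # time axis with step 1/fs second
    plt.plot(t, samples[:, 0], label="channel 1")
    plt.plot(t, samples[:, 1], label="channel 2")
    plt.xlabel("time (s)")
    plt.legend()
    plt.show()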

2. Distributions

For this study, let's group the variables we created in chapter 1 into a single pandas dataframe. All our lists contain the same number of elements.
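The original lists are not shown here, so here is a hypothetical reconstruction with the same shape and variable types :

    import pandas as pd

    # hypothetical values: 7 individuals described by 4 features
    data = pd.DataFrame({
        "temperature":  [21.5, 19.0, 23.2, 18.7, 25.1, 20.3, 22.8],                 # continuous
        "kids":         [0, 2, 1, 3, 2, 1, 0],                                      # discrete
        "locations":    ["FR", "US", "FR", "DE", "US", "FR", "DE"],                 # nominal
        "market_value": ["low", "high", "medium", "low", "high", "medium", "low"],  # ordinal
    })
    print(data.shape)   # (7, 4)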

data.shape returns the size of our dataset : we have 7 individuals represented by 4 different features.

For each feature, let’s try to represent its distribution graphically.

a. Pie chart (qualitative data) : on locations column

We can access the labels and values of our counts array using these commands :
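A possible version of these commands, using pandas value_counts and matplotlib :

    import matplotlib.pyplot as plt

    counts = data["locations"].value_counts()
    print(counts.index.tolist())   # the labels, e.g. ['FR', 'US', 'DE']
    print(counts.values)           # the counts, e.g. [3 2 2]

    plt.pie(counts.values, labels=counts.index, autopct="%1.0f%%")
    plt.title("locations")
    plt.show()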


b. bar chart (qualitative data) : on market_value column
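A sketch using the value counts again :

    import matplotlib.pyplot as plt

    counts = data["market_value"].value_counts()
    counts.plot(kind="bar")
    plt.ylabel("count")
    plt.title("market_value")
    plt.show()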

c. histogram (quantitative data) :

loading a bigger dataset for this chart (the iris dataset, see 6. Correlations)
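One easy way to load it is through seaborn (the dataset is downloaded on first use) :

    import seaborn as sns
    import matplotlib.pyplot as plt

    iris = sns.load_dataset("iris")
    iris["sepal_length"].hist(bins=20)
    plt.xlabel("sepal_length")
    plt.ylabel("count")
    plt.show()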

d. density plot
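A density plot smooths the histogram into an estimated continuous distribution; with pandas (it relies on scipy under the hood) :

    iris["sepal_length"].plot(kind="density")
    plt.xlabel("sepal_length")
    plt.show()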

3. Statistical inference

Statistical inference is the process of drawing conclusions about a population from data. One way to draw conclusions about a population is to observe a sample of it. We use indicators called estimators to describe a variable. Still, one has to be aware that summing up data with estimators must be done with care, and that we should keep an eye on the distribution of our values. An estimator is said to be relevant if :

  • it is consistent
  • it has low bias and low variance

Let's review the most common estimators used in statistics.

a. mode

In a list, the mode is the most represented value.
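With pandas :

    print(data["locations"].mode())     # most frequent label(s); several values can tie
    print(iris["sepal_length"].mode())  # works on quantitative variables too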

b. mean

The mean of a variable is defined by :

\bar{x}=\frac{\sum x}{n}
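A quick check on a toy sample with numpy :

    import numpy as np

    x = np.array([2.0, 4.0, 4.0, 5.0, 7.0, 9.0])   # a toy sample
    print(x.sum() / len(x))   # 5.1666..., i.e. the formula above
    print(np.mean(x))         # same result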


c. median

The median is the value that divides your variable into 2 groups containing equal numbers of individuals.
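Reusing the toy sample x from the mean example :

    print(np.median(x))   # 4.5: half the values are below, half above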

d. variance and standard deviation

\sigma=\sqrt{\frac{\sum (x-\bar{x}) ^{2}}{n-1}}


Standard deviation and variance are measurements of dispersion.
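Still on the toy sample x (note the ddof=1 argument, matching the n-1 denominator of the formula above) :

    print(x.var(ddof=1))        # sample variance
    print(np.std(x, ddof=1))    # sample standard deviation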

e. quantile, interquartile, decile, etc…

Q1 : the value that cuts the dataset with 1/4 of individuals below Q1 and 3/4 above

Q3 : the value that cuts the dataset with 3/4 of individuals below Q3 and 1/4 above

IQ, the interquartile distance : IQ = Q3 - Q1

IQ is often used to remove outliers from our dataset (see 10. Outliers and missing values).
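On the toy sample x :

    q1, q3 = np.percentile(x, [25, 75])
    iq = q3 - q1
    print(q1, q3, iq)
    # values outside [q1 - 1.5*iq, q3 + 1.5*iq] are flagged as outliers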

f. range

The range is the difference between the minimum and the maximum value of your variable.
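numpy even has a dedicated function for it :

    print(x.max() - x.min())   # the range
    print(np.ptp(x))           # "peak to peak", the same thing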

g. boxplot

A good representation of a dataset and some of its statistics is the boxplot. On this chart we can read the values associated with quantiles : median, Q1, Q3, Q1 - 1.5 IQ, Q3 + 1.5 IQ. Outliers are plotted as dots.
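A sketch with matplotlib, still on the toy sample x :

    import matplotlib.pyplot as plt

    plt.boxplot(x, whis=1.5)   # whiskers at Q1 - 1.5*IQ and Q3 + 1.5*IQ
    plt.show()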

4. Skewness

Skewness is a measurement of the asymmetry of a distribution (a comparison of mode and mean).

3 cases :

skewness=0 : the distribution is symmetric

skewness<0 : the distribution spreads more to the left of the mean value

skewness>0 : the distribution spreads more to the right of the mean value
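With pandas, on the iris dataset loaded in chapter 2 :

    print(iris["sepal_length"].skew())   # slightly positive: a longer tail on the right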

5. Kurtosis

Kurtosis is a measurement of the flatness of our distribution, compared to a normal distribution.
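With pandas (Fisher's definition, where a normal distribution has kurtosis 0) :

    print(iris["sepal_length"].kurtosis())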

The kurtosis is negative and low here, which could indicate quite a flat profile for our variable.

6. Correlations : bivariate analysis on quantitative variables

The correlation between 2 variables is a measure of how linearly dependent these 2 variables are. The iris dataset is a famous dataset used in data science to learn the basic principles; I'll use it to illustrate correlation.

r(x,y)=\frac{1}{n-1}\sum \frac{(x-\bar{x})(y-\bar{y})}{\sigma_{x}\sigma_{y}}


Interpretation : positive values of correlation indicate that the variables tend to evolve in the same way (when one grows, the other one does the same), whereas negative values indicate that they tend to evolve in opposite ways (when one grows, the other one tends to decrease).
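With pandas :

    # correlation matrix of the quantitative variables
    print(iris.drop(columns="species").corr())

    # or the correlation between 2 specific variables
    print(iris["petal_length"].corr(iris["petal_width"]))   # ~0.96, strongly correlated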

Strongly linearly correlated variables should not be used at the same time for modeling. Removing one of the variables, or creating a new feature from these variables, is a good way to address this problem.

7. Scatter plot

Plotting the relation between 2 quantitative variables :

labeling data with colors

It is a good way to assess any tendency between 2 variables and to check the distribution of each variable in our dataset.
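A sketch with seaborn's pairplot, which combines both ideas (the diagonal shows the distribution of each variable) :

    import seaborn as sns
    import matplotlib.pyplot as plt

    # one scatter plot per pair of variables, colored by species
    sns.pairplot(iris, hue="species")
    plt.show()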

8. ANOVA : bivariate analysis on mixed variables

Let's try to analyze our iris dataset and compare sepal_length per class of flower.

How can we evaluate a correlation between a qualitative and a quantitative variable?
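One standard answer is a one-way ANOVA, which compares the variance of sepal_length between classes to the variance within classes. A sketch with scipy :

    from scipy.stats import f_oneway

    groups = [g["sepal_length"].values for _, g in iris.groupby("species")]
    f_stat, p_value = f_oneway(*groups)
    print(f_stat, p_value)   # a very small p-value: the mean sepal_length differs per class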


9. Chi-2 test

The Chi-2 test checks whether or not 2 qualitative variables are related. We have the following hypotheses :

H0 : the 2 variables are independent

H1 : the 2 variables are not independent
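A sketch with scipy, reusing the qualitative variables from chapter 2 :

    import pandas as pd
    from scipy.stats import chi2_contingency

    # contingency table between our 2 qualitative variables
    table = pd.crosstab(data["locations"], data["market_value"])
    chi2_stat, p_value, dof, expected = chi2_contingency(table)
    print(p_value)   # a small p-value leads us to reject H0 (the variables would be dependent)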


10. Outliers and missing values

Datasets are not always perfect and frequently present missing values and inconsistent data.

count missing values

If missing values are found in only a small number of individuals in our dataset, we can simply drop these individuals.
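With pandas :

    print(iris.isna().sum())   # number of missing values per column
    cleaned = iris.dropna()    # drop the individuals that have missing values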

remove outliers using quartiles

An outlier is a value that appears to be far away and unusual in a given variable or set of individuals. One has to detect and remove them before trying to build a model from the data; their influence is also bad for our statistical estimations. Outliers can be found in a single variable, but some individuals in our dataset can also be considered outliers as a whole. Before any statistical analysis or modeling, we should clean our data from these values.

We usually use the interquartile distance (IQ, see chapter 3) to detect outliers.
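A sketch on the sepal_width column :

    q1 = iris["sepal_width"].quantile(0.25)
    q3 = iris["sepal_width"].quantile(0.75)
    iq = q3 - q1
    kept = iris["sepal_width"].between(q1 - 1.5 * iq, q3 + 1.5 * iq)
    iris_clean = iris[kept]   # individuals outside the fences are dropped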

missing values imputation

Imputing the mean or the median of the sample is an easy strategy, but it can have a bad impact on modeling.

A slightly better approach is to build a simple model that predicts the missing values from the other variables.
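A sketch of both strategies with scikit-learn (the tiny matrix X is made up, and KNNImputer stands in for the "simple model" idea) :

    import numpy as np
    from sklearn.impute import SimpleImputer, KNNImputer

    X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0]])

    # easy strategy: impute the median of each column
    print(SimpleImputer(strategy="median").fit_transform(X))

    # slightly better: predict the missing value from the other variables (here via neighbors)
    print(KNNImputer(n_neighbors=1).fit_transform(X))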

11. 3d scatterplot
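A minimal sketch with matplotlib, on 3 of the iris variables :

    import matplotlib.pyplot as plt

    fig = plt.figure()
    ax = fig.add_subplot(projection="3d")
    ax.scatter(iris["sepal_length"], iris["sepal_width"], iris["petal_length"])
    ax.set_xlabel("sepal_length")
    ax.set_ylabel("sepal_width")
    ax.set_zlabel("petal_length")
    plt.show()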

12. Dimensionality reduction

Reducing the number of features in our dataset is a good option to reduce computation time when building our model. Nevertheless, we are losing some information when reducing the space, and it should always be done carefully.

a. PCA

disadvantage of PCA : we lose the readability of our features (because our data is projected into a brand new space).
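A sketch with scikit-learn :

    from sklearn.decomposition import PCA

    pca = PCA(n_components=2)
    projected = pca.fit_transform(iris.drop(columns="species"))
    print(pca.explained_variance_ratio_)   # share of the variance kept by each new axis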

b. feature selection
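Unlike PCA, feature selection keeps a subset of the original, readable features. A sketch with scikit-learn's SelectKBest :

    from sklearn.feature_selection import SelectKBest, f_classif

    X = iris.drop(columns="species")
    y = iris["species"]
    selector = SelectKBest(f_classif, k=2).fit(X, y)
    print(X.columns[selector.get_support()])   # the 2 most discriminative original features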

13. basic signal processing

see more about time series and signal processing on this page.

autocorrelation

The autocorrelation of a signal helps to find periodicities in it.
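A sketch with numpy on a synthetic 5 Hz sine wave (fs is a made-up sampling rate) :

    import numpy as np

    fs = 100                                  # sampling rate (Hz)
    t = np.arange(0, 2, 1 / fs)
    sig = np.sin(2 * np.pi * 5 * t)           # a 5 Hz sine wave

    sig0 = sig - sig.mean()
    ac = np.correlate(sig0, sig0, mode="full")[len(sig0) - 1:]
    ac = ac / ac[0]                           # normalized autocorrelation
    # `ac` peaks every fs / 5 = 20 samples, revealing the 5 Hz periodicity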

temporal to frequency : Fast Fourier Transform
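Reusing sig and fs from the autocorrelation example :

    freqs = np.fft.rfftfreq(len(sig), d=1 / fs)
    spectrum = np.abs(np.fft.rfft(sig))
    print(freqs[spectrum.argmax()])   # 5.0: the dominant frequency in Hz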

14. Normalizations

Variables in a dataset usually run over very different ranges. Normalization techniques help to bring every variable to the same range : usually [0,1] or [-1,1]. I'll introduce the 2 easiest and most famous types. Please refer to this wikipedia page for more information.

a. standard score
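The standard score (z-score) subtracts the mean and divides by the standard deviation :

    col = iris["sepal_length"]
    z = (col - col.mean()) / col.std()
    print(z.mean(), z.std())   # ~0 and 1 after standardization
    # scikit-learn's StandardScaler does the same on whole datasets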

b. feature scaling
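Min-max feature scaling brings the variable into [0,1] :

    col = iris["sepal_length"]
    scaled = (col - col.min()) / (col.max() - col.min())
    print(scaled.min(), scaled.max())   # 0.0 and 1.0
    # scikit-learn's MinMaxScaler does the same on whole datasets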

15. Probabilistic distribution models

There are a lot of different models to simulate the behavior of a variable. In the following, I am just going to describe the most commonly used ones. Please refer to this wikipedia page for more insight.

Normal law : when a random variable X follows a normal law, we write X \sim N(\mu, \sigma^{2}). This law is often used to model natural phenomena. In practice, it's very common to meet this kind of variable in our datasets.
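Drawing a sample with numpy :

    import numpy as np

    rng = np.random.default_rng(0)
    sample = rng.normal(loc=0.0, scale=1.0, size=10_000)   # X ~ N(0, 1)
    print(sample.mean(), sample.std())                     # close to 0 and 1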

Bernoulli :

The Bernoulli law gives the probability distribution of a random variable taking the value 1 with probability p and 0 with probability q=1-p.

Repeating the experiment n times, we want to calculate the probability of getting a given number of successes : this count follows a binomial law B(n, p).
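A sketch with scipy (p is a made-up parameter) :

    from scipy.stats import bernoulli, binom

    p = 0.3
    print(bernoulli.pmf(1, p))        # P(X = 1) = 0.3
    print(binom.pmf(2, n=10, p=p))    # probability of exactly 2 successes in 10 trials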

Poisson : the Poisson law models the number of events occurring in a fixed interval of time, given a constant mean rate.

Chi2 : let's assume a random variable X follows a standard normal law N(0,1).

We create a new variable Q such that Q=X^{2}; we say that Q follows a Chi2 law, Q \sim \chi^{2}, with 1 degree of freedom.

We can build a new variable Q_{2}=X_{1}^{2}+X_{2}^{2} \sim \chi^{2} with 2 degrees of freedom.
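A quick empirical check with numpy and scipy :

    import numpy as np
    from scipy.stats import chi2

    rng = np.random.default_rng(0)
    x1 = rng.standard_normal(100_000)
    x2 = rng.standard_normal(100_000)
    q2 = x1 ** 2 + x2 ** 2             # should follow a Chi2 law with 2 degrees of freedom
    print(q2.mean(), chi2.mean(df=2))  # both close to 2 (the mean of a Chi2 with k dof is k)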


16. Statistical inference : the frequentist and the Bayesian approaches

H0 and H1

H0 is called the null hypothesis

H1 is called the alternative hypothesis

Significance levels (alpha, beta, type 1 and 2 errors) and p-value
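As a quick illustration of these notions, a two-sample t-test with scipy on synthetic data (H0 : the 2 samples have the same mean) :

    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(0)
    a = rng.normal(0.0, 1.0, size=50)
    b = rng.normal(0.5, 1.0, size=50)

    t_stat, p_value = ttest_ind(a, b)
    print(p_value)   # compare to the chosen significance level alpha, e.g. 0.05, to reject H0 or not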

17. Prestudy : run simple machine learning models

Naive and simple models (classification and regression) can be built to replace missing data with predicted values. In some cases, simple models turn out to have very good performance, and we won't need to move on to much more costly models (in computation and time) such as neural networks.

KNN : K nearest neighbors (classification)

For a better understanding of how this classification algorithm works, please refer to this page. Knn classifies an individual according to its neighbors, assuming that individuals with similar features belong to the same group. We choose the K nearest neighbors of our new individual and assign it to the most common class among them.

Back to the example of iris flower. Here is the code to implement such a model using scikit-learn library :
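A minimal sketch of such a model (the split ratio and K=5 are illustrative choices) :

    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X = iris.drop(columns="species")
    y = iris["species"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train, y_train)
    print(knn.score(X_test, y_test))   # accuracy on held-out individuals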

The Knn model works really well on this example. Sometimes a simple model can solve our classification task.

Knn algorithms use a lot of computation time when the number of compared features or examples in the dataset is big. We will usually avoid it when these quantities are too big (>1000) and choose an algorithm with better performance.

Linear Regression and R square coefficient of determination (value prediction)

Let's assume that the sepal length can be approximated by a linear function of the sepal_width, petal_length and petal_width variables. Let's fit a simple linear model and evaluate its performance.
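A sketch with scikit-learn :

    from sklearn.linear_model import LinearRegression

    X_lin = iris[["sepal_width", "petal_length", "petal_width"]]
    y_lin = iris["sepal_length"]
    reg = LinearRegression().fit(X_lin, y_lin)
    print(reg.score(X_lin, y_lin))   # the R square coefficient of determination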

Interpretation of R square : R square is between 0 and 1, 0 being the case where the model explains none of the variability of the data around its mean, and 1 being the case where it explains all of it. Usually a high value of R square indicates a good model. Nevertheless, this is not a sufficient condition to conclude with certainty.

Logistic Regression (classification)
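A sketch reusing the train/test split from the Knn example (max_iter is simply raised to ensure convergence) :

    from sklearn.linear_model import LogisticRegression

    logreg = LogisticRegression(max_iter=200)
    logreg.fit(X_train, y_train)
    print(logreg.score(X_test, y_test))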

Logistic regression has poor performance on this dataset.

Random Forest and Gini function (classification)
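A sketch, again on the same split (Gini impurity is scikit-learn's default split criterion) :

    from sklearn.ensemble import RandomForestClassifier

    rf = RandomForestClassifier(criterion="gini", random_state=0)
    rf.fit(X_train, y_train)
    print(rf.score(X_test, y_test))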

In this case, the Random Forest classifier has a slightly lower accuracy than Knn. It's often a good idea to try different models and compare their performances.

18. feature engineering

to be continued…