In this page I propose a way to tackle the complexity of flows in a company, in order to help businesses gain better insight into their processes.
These methods can be applied to any type of process flow: financial, physical (materials, products, spare parts) or human. I am not digging too deep into the details, and this page should be read as a general guideline.
The aim of this study is to reduce the WIP (Work In Progress) and the TAT (Turn Around Time, or Lead Time) of individuals moving through a process.
We assume access to a large quantity of data recording each movement of an individual in a given process (with timestamp, step and type of movement).
Source: https://en.wikipedia.org/wiki/Turnaround_time
-
Sampling methods
The population of our dataset may be large, and some of the algorithms we are going to use are heavy consumers of computational power. We therefore propose to sample our data in order to work on a smaller dataset. The sampling step can be delicate: we want a sample that properly represents our population. Several sampling techniques exist, but in the case of a large dataset a random sampling technique is often appropriate.
Random sampling and sample size justification
If we have access to a large quantity of data, we do not really need to optimize our picking method (for instance with stratified sampling based on pre-identified clusters).
Sample size justification: theory
R script: generate a mixture of Gaussian variables and look for the sample size beyond which our sampling estimate hits a plateau.
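A minimal R sketch of this experiment (the mixture weights, means and the use of the sample mean as the tracked statistic are my own arbitrary choices):

```r
set.seed(42)

# Population: a mixture of two Gaussians (weights and parameters chosen arbitrarily)
population <- c(rnorm(70000, mean = 0, sd = 1),
                rnorm(30000, mean = 5, sd = 2))

# Track a summary statistic (here the sample mean) as the sample size grows
sizes <- seq(10, 5000, by = 10)
estimates <- sapply(sizes, function(n) mean(sample(population, n)))

# The curve flattens (hits a plateau) once the sample is large enough
plot(sizes, estimates, type = "l",
     xlab = "Sample size", ylab = "Estimated mean")
abline(h = mean(population), col = "red", lty = 2)
```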
Theoretically, a sample size of just 30 elements can already give a good representation of the population (the classic rule of thumb derived from the central limit theorem). In practice, we choose the largest sample size that allows our algorithms to run in an acceptable amount of time.
-
Adjacency matrix
In order to build algorithms, we need to represent our flows as mathematical objects. A flowchart, or graph, can be written as an adjacency matrix. In our case, links between storage locations are represented by a 1 when vertices are adjacent in the flowchart and by a 0 when there is no connection between the nodes. Here is a small example of how such matrices are built:
Given the following process map:
It can be represented by the following matrix:
Flowchart graphs can be oriented. In that case, the adjacency matrix may not be symmetric. Here is an example of a flowchart with oriented flows:
And its related matrix:
Source: https://en.wikipedia.org/wiki/Adjacency_matrix
We build one adjacency matrix for each element of our dataset. Each matrix has one row and one column for every node of the dataset, so that all matrices share the same dimensions and can be compared with one another.
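As a sketch, assuming the movement log for one individual is a data frame with hypothetical `from` and `to` columns, one oriented adjacency matrix could be built in R like this:

```r
# All nodes observed in the whole dataset (shared by every matrix)
nodes <- c("Receiving", "Storage", "Assembly", "Shipping")  # example node names

# One individual's movements as (from, to) pairs
moves <- data.frame(from = c("Receiving", "Storage", "Assembly"),
                    to   = c("Storage",   "Assembly", "Shipping"))

# Oriented adjacency matrix: 1 where a movement exists, 0 elsewhere
adj <- matrix(0, nrow = length(nodes), ncol = length(nodes),
              dimnames = list(nodes, nodes))
adj[cbind(match(moves$from, nodes), match(moves$to, nodes))] <- 1
adj
```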
-
Distance matrix
Now that we can represent the flow of every element in our dataset, we would like to define a way to measure how similar or how different they are from each other. We therefore need to choose a norm to assess the distance between two elements.
Different definitions are available:
- Norm L1
- Norm L2 (Euclidean)
- Norm Ln
- Frobenius norm
- Etc.
I only focus on the Euclidean norm here.
-
Euclidean norm L2 and Ln
We introduce the distance of order n as follows. Given two matrices A = (a_ij) and B = (b_ij) of identical dimensions, we define their relative distance as:

d_n(A, B) = ( Σ_{i,j} |a_ij - b_ij|^n )^(1/n)

With n = 2 we recover the Euclidean norm L2, i.e. the Frobenius norm of A - B.
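A direct R translation of this definition (a small helper of my own, not a library function):

```r
# Distance of order n between two matrices of identical dimensions
ln_dist <- function(A, B, n = 2) {
  sum(abs(A - B)^n)^(1 / n)
}

# n = 2 gives the Euclidean (Frobenius) distance between A and B
A <- matrix(c(0, 1, 0, 0), nrow = 2)
B <- matrix(c(0, 1, 1, 0), nrow = 2)
ln_dist(A, B)         # 1 (the matrices differ by a single link)
ln_dist(A, B, n = 1)  # 1
```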
-
Distance matrix calculation
We can now create the distance matrix. Each element being an individual in our process, this matrix holds the distance from each element to every other element. It is symmetric and has zeros on its diagonal.
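Assuming `adj_list` is a list holding one adjacency matrix per sampled individual, and reusing the `ln_dist` helper sketched above, the distance matrix can be filled as follows:

```r
m <- length(adj_list)
D <- matrix(0, nrow = m, ncol = m)

# Fill the upper triangle only, then mirror it: the matrix is
# symmetric and its diagonal is already zero
for (i in seq_len(m - 1)) {
  for (j in (i + 1):m) {
    D[i, j] <- ln_dist(adj_list[[i]], adj_list[[j]])
    D[j, i] <- D[i, j]
  }
}
```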
-
Dimensionality reduction: t-SNE vs PCA vs UMAP
We would like to plot our elements in a reduced space, with elements that follow similar processes close together and elements with very different processes far apart. We want to represent the elements in a 2D or 3D space, i.e. reduce from N dimensions down to 2 or 3. Several dimensionality reduction techniques exist; we only focus on two of them here: PCA and t-SNE.
PCA: Principal Component Analysis. Each principal component is a linear combination of the original variables. Components are extracted such that the first principal component explains the maximum variance in the dataset; the second explains as much of the remaining variance as possible while being uncorrelated with the first. PCA works well when the number of variables is large.
PCA is, however, not able to capture complex, nonlinear (e.g., polynomial) relationships between features.
t-SNE: t-distributed stochastic neighbor embedding.
It is a nonlinear dimensionality reduction technique.
2 main approaches exist:
- Local approaches map nearby points on the manifold to nearby points in the low-dimensional representation.
- Global approaches attempt to preserve geometry at all scales: nearby points stay nearby, far-away points stay far away.

t-SNE is one of the few algorithms capable of retaining both the local and the global structure of the data at the same time.
PCA (Principal Component Analysis) vs t-SNE (t-distributed stochastic neighbor embedding)
Both are used for dimensionality reduction, but they differ on several points:
- PCA is deterministic; t-SNE is stochastic, so different runs can give different embeddings.
- PCA components are directly interpretable as linear combinations of the original variables; the t-SNE mapping is much harder to interpret.
- PCA can be applied to new, unseen data; standard t-SNE cannot, as it does not learn an explicit mapping.
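A sketch of both reductions in R, assuming `adj_list` and the distance matrix `D` from the previous steps (the Rtsne package can consume a precomputed distance matrix via `is_distance = TRUE`):

```r
library(Rtsne)

# PCA on the flattened adjacency matrices (one row per individual)
X <- t(sapply(adj_list, as.vector))
pca <- prcomp(X)
plot(pca$x[, 1:2], main = "PCA")

# t-SNE directly on the precomputed distance matrix
set.seed(42)
tsne <- Rtsne(D, is_distance = TRUE, dims = 2, perplexity = 30)
plot(tsne$Y, main = "t-SNE", xlab = "dim 1", ylab = "dim 2")
```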
t-SNE gives a more insightful picture:
Expected result
We obtain 6 well-separated clusters. The idea now is to link these clusters to business-oriented clusters: analyze the individuals within each cluster and try to characterize them with business features.
Source: https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding
-
Removing outliers
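One simple option (my own suggestion rather than a prescribed method): flag as outliers the elements whose average distance to all the others is abnormally high, and drop them before clustering:

```r
# Average distance of each element to all the others
avg_dist <- rowMeans(D)

# Keep elements below the 95th percentile (the threshold is an arbitrary choice)
keep <- avg_dist < quantile(avg_dist, 0.95)
D_clean   <- D[keep, keep]
emb_clean <- tsne$Y[keep, ]  # matching rows of the t-SNE embedding
```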
-
Automatic clustering: Gaussian Mixture Model vs k-means
After t-SNE vs before t-SNE
Pros and Cons
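A minimal comparison in R on the outlier-filtered t-SNE embedding (`emb_clean` from the sketch above; the choice of 6 clusters echoes the expected result shown earlier):

```r
library(mclust)

# k-means: hard assignments, spherical clusters
set.seed(42)
km <- kmeans(emb_clean, centers = 6, nstart = 25)

# Gaussian Mixture Model: soft assignments, elliptical clusters
gmm <- Mclust(emb_clean, G = 6)

# Cross-tabulate the two partitions to compare them
table(kmeans = km$cluster, gmm = gmm$classification)
```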
-
Visualization of clusters (2D and 3D)
-
Models
Random Forest Model
Decision Tree
Segmentation and threshold values
Model performance
-
To predict cluster membership:
We would like to define a model that predicts cluster membership from a series of inputs. I chose to use the series of nodes visited by each individual as input features.
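A sketch with the randomForest package, assuming a hypothetical `features` data frame with one 0/1 column per node (visited or not) for each individual, aligned with the cluster labels:

```r
library(randomForest)

# One row per individual: node-visit indicators plus the cluster label
df <- data.frame(features, cluster = as.factor(km$cluster))

set.seed(42)
rf <- randomForest(cluster ~ ., data = df, ntree = 500, importance = TRUE)

print(rf)       # out-of-bag error gives a first performance estimate
varImpPlot(rf)  # which nodes drive cluster membership
```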
-
To predict time consumption
-
Analysis and conclusions
-
% of data in nicely defined clusters
A few indicators are interesting to build here. We would like to assess the quality of our clustering and to define, at least roughly, what these clusters represent.
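For instance, silhouette widths give both an overall quality score and the share of points sitting in well-defined clusters (the 0.25 cut-off below is an arbitrary choice of mine):

```r
library(cluster)

# Silhouette width of each point under the k-means partition
sil <- silhouette(km$cluster, dist = dist(emb_clean))

mean(sil[, "sil_width"])               # overall clustering quality
mean(sil[, "sil_width"] > 0.25) * 100  # % of data in well-defined clusters
```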
-
Visualization of clusters
-
TAT calculation propositions
Once our clusters are well defined, we can plot the process maps they correspond to, propose a way to position our milestones, and assess our different TATs.
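As an illustration, assuming the raw movement log is a data frame `log_df` with hypothetical `id` and `timestamp` columns, and `cluster_df` maps each id to its cluster:

```r
# TAT per individual: time between the first and the last recorded movement
tat <- aggregate(timestamp ~ id, data = log_df,
                 FUN = function(t) as.numeric(difftime(max(t), min(t), units = "days")))
names(tat)[2] <- "tat_days"

# Median TAT per cluster
aggregate(tat_days ~ cluster, data = merge(tat, cluster_df, by = "id"), FUN = median)
```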
We applied this method to study the flows of parts in a company, and the results were quite outstanding.
-
WIP and TAT analysis
-
References / papers
https://distill.pub/2016/misread-tsne/
https://towardsdatascience.com/gaussian-mixture-models-explained-6986aaf5a95
https://towardsdatascience.com/how-to-cluster-in-high-dimensions-4ef693bacc6