Python : RNN & LSTM

Introduction

Recurrent Neural Networks (RNN) are particularly suitable for language analysis and for generating data based on previous sequences of information. Most of the notions and functions we created for DNN will apply to our new system. However, RNN are slightly more difficult to understand and some functions will need a bit more hard work. If you are new to neural networks, I would recommend starting there for a first approach on how they work.

Imagine you are reading a text and after a few lines, you are asked to complete the story. Instinctively you will use the available data (environment, characters, atmosphere) to feed your imagination.

“Once upon a time, a knight was riding back home, impatient to meet his children and wife again. Unfortunately,…”

Our brain promptly understands the situation described in this short sentence.

We could imagine feeding a simple neural network with the first 9 words and labelling the example with the 10th word. Do this for every word in a long text and we would be able to generate the next word of a given sentence. Neural networks would be able to generate text.

Recurrent Neural Networks are a particular model of neural network which allows the network to build a “memory” of past events. One issue with classic RNNs is that they only have a short-term memory. LSTM (Long Short-Term Memory) networks allow us to capture long-term dependencies. RNN and LSTM are specific neural networks that can both learn from sequential data (e.g. text, video, music, time series data, ...); they just have a different method to calculate the hidden state.

On this page, I will try to build an intuition on how RNN / LSTM models work and apply them to a simple example with the help of the Keras library.

Classic functions

sigmoid

tanh

softmax
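
As a quick reminder, these three activation functions can be written in a few lines of numpy (a minimal sketch; the softmax subtracts the max of its input for numerical stability) :

import numpy as np

def sigmoid(z):
    # squashes any real value into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # squashes any real value into (-1, 1); numpy already provides it
    return np.tanh(z)

def softmax(z):
    # turns a vector of scores into a probability distribution
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / np.sum(e)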

Notations (inspired from deeplearning.ai notations)

x : the input sequence, x_t being the input at time step t

y : the true output sequence

a : the hidden state (activation) of the recurrent cell

c : the memory cell state (specific to the LSTM cell)

Weights and biases : the parameters learned by the network

Tx : length of the input sequence

Ty : length of the output sequence

vocab : the vocabulary, i.e. the list of all possible words or characters

m : number of training examples

n_a : number of units in the hidden state

n_x : size of an input vector x_t

n_y : size of an output vector y_t

ŷ : the prediction of the network

one hot vector : a vector of zeros with a single 1 at the index of the word or character it represents

Useful numpy functions

np.multiply : element-wise multiplication of two arrays

np.matmul : matrix product of two arrays

np.zeros : creates an array filled with zeros

np.concatenate : joins a sequence of arrays along an existing axis
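
A quick illustration of what each of these functions does (the shapes below are arbitrary example values) :

import numpy as np

a = np.array([[1., 2.], [3., 4.]])
b = np.array([[10., 20.], [30., 40.]])

print(np.multiply(a, b))               # element-wise product, shape (2, 2)
print(np.matmul(a, b))                 # matrix product, shape (2, 2)
print(np.zeros((3, 2)))                # 3x2 matrix filled with zeros
print(np.concatenate((a, b), axis=0))  # stacks a on top of b, shape (4, 2)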

RNN cell

Dimensions of vectors and matrices : x_t is a vector of size n_x, the hidden state s_t a vector of size n_a and the prediction \widehat{y}_{t} a vector of size n_y, so U has shape (n_a, n_x), W has shape (n_a, n_a) and V has shape (n_y, n_a).

RNN equations :

\left\{\begin{matrix} s_{t}=tanh(Ux_{t}+Ws_{t-1})\\ \widehat{y}_{t}=softmax(Vs_{t})  \end{matrix}\right.
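
Here is a minimal numpy sketch of one forward step of this cell, keeping the same U, W, V and s notation (the dimensions used at the end are arbitrary example values) :

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def rnn_cell_forward(x_t, s_prev, U, W, V):
    # s_t = tanh(U x_t + W s_{t-1})
    s_t = np.tanh(np.matmul(U, x_t) + np.matmul(W, s_prev))
    # y_hat_t = softmax(V s_t)
    y_hat_t = softmax(np.matmul(V, s_t))
    return s_t, y_hat_t

# example dimensions : n_x = 5 (one hot input), n_a = 4 hidden units, n_y = 5 outputs
n_x, n_a, n_y = 5, 4, 5
U = np.random.randn(n_a, n_x) * 0.01
W = np.random.randn(n_a, n_a) * 0.01
V = np.random.randn(n_y, n_a) * 0.01
s = np.zeros(n_a)   # initial hidden state
x = np.zeros(n_x)
x[2] = 1.0          # a one hot input vector
s, y_hat = rnn_cell_forward(x, s, U, W, V)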

At each time step t we can calculate the error :

E_{t}=-y_{t}*log(\widehat{y}_{t})

y_{t} being the real output value and \widehat{y}_{t} being our prediction.

The total error of the RNN network is the sum of the errors at each time step t :

E_{total}=\sum_{t=1}^{timesteps}E_{t}

E_{total}=-\sum_{t=1}^{timesteps}y_{t}*log(\widehat{y}_{t})
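
This total error translates directly into numpy (a sketch assuming y and y_hat are lists of vectors indexed by time step, with y_t one hot encoded) :

import numpy as np

def total_error(y, y_hat):
    # y[t] is the one hot ground truth at time t, y_hat[t] the predicted distribution
    return -sum(np.sum(y[t] * np.log(y_hat[t])) for t in range(len(y)))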

Different architectures for different purposes

  1. Classic RNN
  2. Bidirectional RNN
  3. Many-to-One
  4. One-to-Many
  5. Encoder / Decoder
  6. Attention Model

Now, in order to train our network, we need to minimize this error value by finding the optimal values for the U, V and W weight matrices. We will use the backpropagation through time algorithm to achieve this task.

Calculation of backpropagation through time :

We need to calculate the derivative of our error with respect to the weights U, V and W of our network. It is a more complex version of the calculation we did for a simple deep neural network.

It is certainly the most difficult part of understanding how recurrent neural networks work, so I am going to describe the calculation step by step.

Derivative of Error regarding V

We want to calculate \frac{\partial E_{total}}{\partial V} using the chain rule.

\frac{\partial E_{total}}{\partial V}=\sum_{t=1}^{timesteps}\frac{\partial E_{t}}{\partial V}

Using the chain rule and the fact that the derivative of a sum is the sum of the derivatives, and writing z_{t}=Vs_{t} for the input of the softmax (so that \widehat{y}_{t}=softmax(z_{t})), we can write (I removed the bounds around the sum to make it clearer) :

\frac{\partial E_{total}}{\partial V}=\sum \frac{\partial E_{t}}{\partial \widehat{y_{t}}}\frac{\partial \widehat{y_{t}}}{\partial z_{t}}\frac{\partial z_{t}}{\partial V}

Let’s derive each factor of our equation :

first factor :

we have E_{t}=-y_{t}*log(\widehat{y}_{t}), so

\frac{\partial E_{t}}{\partial \widehat{y_{t}}}=-y_{t}\frac{\partial log(\widehat{y_{t}})}{\partial \widehat{y_{t}}}  with  (log(z))'=\frac{1}{z}  which gives us

\frac{\partial E_{t}}{\partial \widehat{y_{t}}}=\frac{-y_{t}}{\widehat{y_{t}}}

second factor : derivative of the softmax function

we have softmax(z)_{j}=\sigma (z)_{j}=\frac{e^{z_{j}}}{\sum_{k=1}^{K}e^{z_{k}}} for all j \in \{1,...,K\}. I’ll write down an example of what the softmax vector looks like, so it’s easier to understand.

Imagine we have a 3-dimensional vector z such that z=\begin{bmatrix} z_{1}\\ z_{2} \\ z_{3} \end{bmatrix}

\sigma will be :

\sigma=\begin{bmatrix}  \sigma_{1} \\ \sigma_{2} \\ \sigma_{3} \end{bmatrix}=\begin{bmatrix} \frac{e^{z_{1}}}{e^{z_{1}}+e^{z_{2}}+e^{z_{3}}} \\ \frac{e^{z_{2}}}{e^{z_{1}}+e^{z_{2}}+e^{z_{3}}} \\ \frac{e^{z_{3}}}{e^{z_{1}}+e^{z_{2}}+e^{z_{3}}} \end{bmatrix}

so, using derivation rules (\frac{u}{v})'=\frac{u'v-uv'}{v^{2}}  and  (\frac{1}{u})'=\frac{-u'}{u^{2}}  we have :

 \frac{\partial \sigma }{\partial z_{1}}=\begin{bmatrix} \frac{e^{z_{1}}(e^{z_{1}}+e^{z_{2}}+e^{z_{3}})-(e^{z_{1}})^{2}}{(e^{z_{1}}+e^{z_{2}}+e^{z_{3}})^{2}} \\ \frac{-e^{z_{2}}e^{z_{1}}}{(e^{z_{1}}+e^{z_{2}}+e^{z_{3}})^{2}} \\ \frac{-e^{z_{3}}e^{z_{1}}}{(e^{z_{1}}+e^{z_{2}}+e^{z_{3}})^{2}} \end{bmatrix} \frac{\partial \sigma }{\partial z_{2}}=\begin{bmatrix} \frac{-e^{z_{1}}e^{z_{2}}}{(e^{z_{1}}+e^{z_{2}}+e^{z_{3}})^{2}} \\ \frac{e^{z_{2}}(e^{z_{1}}+e^{z_{2}}+e^{z_{3}})-(e^{z_{2}})^{2}}{(e^{z_{1}}+e^{z_{2}}+e^{z_{3}})^{2}} \\ \frac{-e^{z_{3}}e^{z_{2}}}{(e^{z_{1}}+e^{z_{2}}+e^{z_{3}})^{2}} \end{bmatrix} \frac{\partial \sigma }{\partial z_{3}}=\begin{bmatrix} \frac{-e^{z_{1}}e^{z_{3}}}{(e^{z_{1}}+e^{z_{2}}+e^{z_{3}})^{2}} \\ \frac{-e^{z_{2}}e^{z_{3}}}{(e^{z_{1}}+e^{z_{2}}+e^{z_{3}})^{2}} \\ \frac{e^{z_{3}}(e^{z_{1}}+e^{z_{2}}+e^{z_{3}})-(e^{z_{3}})^{2}}{(e^{z_{1}}+e^{z_{2}}+e^{z_{3}})^{2}} \end{bmatrix}

which we can write :

 \frac{\partial \sigma }{\partial z_{1}}=\begin{bmatrix} \sigma_{1}(1-\sigma_{1}) \\ -\sigma_{1}\sigma_{2} \\ -\sigma_{1}\sigma_{3} \end{bmatrix} \frac{\partial \sigma }{\partial z_{2}}=\begin{bmatrix} -\sigma_{2}\sigma_{1} \\ \sigma_{2}(1-\sigma_{2}) \\ -\sigma_{2}\sigma_{3} \end{bmatrix} \frac{\partial \sigma }{\partial z_{3}}=\begin{bmatrix} -\sigma_{3}\sigma_{1} \\ -\sigma_{3}\sigma_{2} \\ \sigma_{3}(1-\sigma_{3}) \end{bmatrix}

Let’s generalize this equation. We have 2 different cases :

if  i=j  then  \frac{\partial \sigma_{j} }{\partial z_{i}}=\sigma_{i}(1-\sigma_{i})

 

if i\neq j  then  \frac{\partial \sigma_{j} }{\partial z_{i}}=-\sigma_{j}\sigma_{i}
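
Both cases can be written in a single line using the Kronecker delta \delta_{ij} (equal to 1 if i=j and 0 otherwise) :

\frac{\partial \sigma_{j}}{\partial z_{i}}=\sigma_{j}(\delta_{ij}-\sigma_{i})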

third factor :

we have z_{t}=Vs_{t}, so

\frac{\partial z_{t}}{\partial V}=s_{t}
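
Putting the three factors together, and using the fact that y_{t} is a one hot vector, the expression simplifies to the following standard result (writing \otimes for the outer product) :

\frac{\partial E_{t}}{\partial V}=(\widehat{y}_{t}-y_{t})\otimes s_{t}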

 

Derivative of Error regarding W

Using the derivation rule (g\circ f)'=(g'\circ f)f'
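
Because s_{t} depends on W directly but also through s_{t-1}, s_{t-2}, ..., the chain rule has to be applied back through every previous time step. A sketch of the resulting sum (this is the “through time” part of backpropagation through time) :

\frac{\partial E_{t}}{\partial W}=\sum_{k=0}^{t}\frac{\partial E_{t}}{\partial \widehat{y}_{t}}\frac{\partial \widehat{y}_{t}}{\partial s_{t}}\frac{\partial s_{t}}{\partial s_{k}}\frac{\partial s_{k}}{\partial W}  with  \frac{\partial s_{t}}{\partial s_{k}}=\prod_{j=k+1}^{t}\frac{\partial s_{j}}{\partial s_{j-1}}

This long product of factors is what will later cause the vanishing and exploding gradient problems.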

 

Derivative of Error regarding U
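
The calculation for U is analogous to the one for W, since U also influences every hidden state s_{k} through the input term Ux_{k}. A sketch of the corresponding sum :

\frac{\partial E_{t}}{\partial U}=\sum_{k=0}^{t}\frac{\partial E_{t}}{\partial \widehat{y}_{t}}\frac{\partial \widehat{y}_{t}}{\partial s_{t}}\frac{\partial s_{t}}{\partial s_{k}}\frac{\partial s_{k}}{\partial U}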

 

Vanishing and exploding gradient problem discussion

Clipping the gradients between a min and a max value to avoid exploding gradients is an effective solution. We can use numpy's built-in np.clip function.
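
A minimal sketch of element-wise clipping applied to every gradient before the weight update (the max_value of 5 is an arbitrary example) :

import numpy as np

def clip_gradients(gradients, max_value=5.0):
    # clip every gradient between -max_value and +max_value
    return {name: np.clip(grad, -max_value, max_value)
            for name, grad in gradients.items()}

# example : gradients stored in a dictionary keyed by parameter name
gradients = {"dU": np.random.randn(4, 5) * 100,
             "dW": np.random.randn(4, 4) * 100,
             "dV": np.random.randn(5, 4) * 100}
gradients = clip_gradients(gradients, max_value=5.0)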

LSTM cell

The LSTM cell is a solution to the vanishing gradient problem. I’ll introduce new notations which are specific to this kind of cell.
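
As a sketch of the standard formulation, with \Gamma denoting the gates, c_{t} the cell state, a_{t} the hidden state, [a_{t-1},x_{t}] the concatenation of the previous hidden state with the current input and * the element-wise product :

\left\{\begin{matrix} \Gamma_{f}=sigmoid(W_{f}[a_{t-1},x_{t}]+b_{f})\\ \Gamma_{u}=sigmoid(W_{u}[a_{t-1},x_{t}]+b_{u})\\ \widetilde{c}_{t}=tanh(W_{c}[a_{t-1},x_{t}]+b_{c})\\ c_{t}=\Gamma_{u}*\widetilde{c}_{t}+\Gamma_{f}*c_{t-1}\\ \Gamma_{o}=sigmoid(W_{o}[a_{t-1},x_{t}]+b_{o})\\ a_{t}=\Gamma_{o}*tanh(c_{t}) \end{matrix}\right.

The forget gate \Gamma_{f} decides what to erase from the previous cell state, the update gate \Gamma_{u} what to write into it, and the output gate \Gamma_{o} what to expose in the hidden state. Because the cell state c_{t} is carried along with only element-wise operations, gradients can flow over many time steps without vanishing as quickly as in a classic RNN.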

Cost function and optimization

 

Tensorflow implementation

Let’s load up dependencies and information about the system.
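
A minimal version of this setup step (printing the library versions is just an example of the kind of system information one might check) :

import numpy as np
import tensorflow as tf

print("numpy version      :", np.__version__)
print("tensorflow version :", tf.__version__)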

We need to prepare the data properly before we feed it into our RNN. For this example we would like to predict the next outputs of a time series. One-hot encoding is the important step here. Our data is made of batches of sequences, each element of a sequence being one-hot encoded over num_classes possible values.

Particular attention to the dimensions of the tensors and matrices is required in order to understand how the forward propagation works. In our example we are trying to predict the next outputs of a time series. As usual, we are feeding the network with batches of inputs and labels (batch_x, batch_y).

Each input sequence contains time_steps elements, each represented as a one-hot vector (of length num_classes).
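
To make these dimensions concrete, here is a sketch of how a batch could be built (batch_size, time_steps and num_classes are example values, not taken from the original code) :

import numpy as np

batch_size, time_steps, num_classes = 32, 10, 8

# integer class index for every element of every sequence
raw_batch = np.random.randint(0, num_classes, size=(batch_size, time_steps))

# one hot encoding : shape becomes (batch_size, time_steps, num_classes)
batch_x = np.eye(num_classes)[raw_batch]

# labels : the same series shifted by one time step (predict the next value);
# the wrap-around at the last time step is ignored in this sketch
batch_y = np.roll(raw_batch, shift=-1, axis=1)

print(batch_x.shape)   # (32, 10, 8)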

 

An application of LSTM with Keras

Word-level implementation of text generation is highly demanding in terms of computational power. One trick is to build a sequence model which relies on character-level generation. A vocabulary of words usually contains more than 10 000 different words, which leads to very high-dimensional one-hot vectors, and computation becomes really slow with such a model. On the other hand, a character-based sequence model can be reduced to fewer than 100 different inputs (counting special characters and so on). Although this kind of model does not seem very natural (different from what real humans are actually doing with their brains), it’s still fun to play with.

This is not the best option for NLP; if you are interested in going further on how to reduce the dimensionality of language, I would recommend having a look at word embedding techniques.
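
Coming back to the character-level approach, a Keras model for it can be as small as the sketch below (seq_length and num_chars are placeholders for whatever the data preparation step produces) :

from keras.models import Sequential
from keras.layers import LSTM, Dense

seq_length = 40   # number of characters fed to the network at each step (example value)
num_chars = 60    # size of the character vocabulary (example value)

model = Sequential()
model.add(LSTM(128, input_shape=(seq_length, num_chars)))
model.add(Dense(num_chars, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.summary()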

Data preparation & analysis

 

lexicon of french rap lyrics

 

number of words per verse

 

Generate new lyrics

Let’s create a new Python file, generator.py, and load the Keras model that we created previously.

We just need to input the beginning of a sentence, on which the model can rely in order to generate a new sequence.
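
A sketch of what generator.py could look like (the model file name, the sequence length and the character set are assumptions and must match whatever was saved during training) :

import numpy as np
from keras.models import load_model

seq_length = 40                                        # must match the training configuration
chars = sorted(set("abcdefghijklmnopqrstuvwxyz '\n"))  # example character set
char2index = {c: i for i, c in enumerate(chars)}
index2char = {i: c for i, c in enumerate(chars)}

model = load_model('lyrics_model.h5')                  # hypothetical file name

def generate(seed, length=200):
    text = seed
    for _ in range(length):
        # one hot encode the last seq_length characters of the current text
        x = np.zeros((1, seq_length, len(chars)))
        for t, c in enumerate(text[-seq_length:]):
            x[0, t, char2index[c]] = 1.0
        probs = model.predict(x, verbose=0)[0]
        probs = probs / np.sum(probs)                        # renormalize before sampling
        next_index = np.random.choice(len(chars), p=probs)   # sample instead of argmax
        text += index2char[next_index]
    return text

print(generate("une fois de plus "))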

 

Github link

Text tokenization :

 

word2index and index2word

 

We need to separate the data sentence by sentence and then word by word. In the end, the input data will have the following format :

X=\begin{bmatrix} [sentence1]\\ [sentence2] \\ [sentence3] \end{bmatrix}

Each sentence being a matrix of words (each word being represented by a one-hot encoded vector) :

sentence=\begin{bmatrix} 1\\ 0 \\ 0 \end{bmatrix}\begin{bmatrix} 0\\ 1 \\ 0 \end{bmatrix}\begin{bmatrix} 0\\ 0 \\ 1 \end{bmatrix}
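
A sketch of this preparation step in plain Python (the corpus below is a tiny stand-in for the actual lyrics file) :

import numpy as np

corpus = "le chevalier rentre chez lui\nil pense a sa famille"   # stand-in for the real lyrics

# split sentence by sentence, then word by word
sentences = [line.split() for line in corpus.split("\n")]

# word2index and index2word mappings
vocab = sorted(set(word for sentence in sentences for word in sentence))
word2index = {w: i for i, w in enumerate(vocab)}
index2word = {i: w for i, w in enumerate(vocab)}

# each word becomes a one hot vector of length len(vocab)
def one_hot(word):
    v = np.zeros(len(vocab))
    v[word2index[word]] = 1.0
    return v

X = [[one_hot(word) for word in sentence] for sentence in sentences]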

How is modern rap constructed?

References :

  1. https://www.tensorflow.org/tutorials/recurrent
  2. https://colah.github.io/posts/2015-08-Understanding-LSTMs/
  3. https://www.youtube.com/watch?v=9zhrxE5PQgY&t=2s
  4. https://iamtrask.github.io/2015/11/15/anyone-can-code-lstm/
  5. http://blog.varunajayasiri.com/numpy_lstm.html
  6. http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
  7. http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-2-implementing-a-language-model-rnn-with-python-numpy-and-theano/
  8. https://www.coursera.org/learn/nlp-sequence-models/home/welcome