Python : RNN & LSTM

Introduction

Recurrent Neural Networks (RNN) are particularly suitable for language analysis and for generating data based on previous sequences of information. Most of the notions and functions we created for DNN will apply to our new system. However, RNN are slightly more difficult to understand and some functions will need a bit more hard work. If you are new to neural networks, I would recommend starting there for a first approach on how they work.

Imagine you are reading a text and after a few lines, you are asked to complete the story. Instinctively you will use the available data (environment, characters, atmosphere) to feed your imagination.

“Once upon a time, a knight was riding back home, impatient to meet his children and wife again. Unfortunately,…”

Our brain promptly understands the situation described in this short sentence.

We could imagine feeding a simple neural network with the first 9 words and labelling the example with the 10th word. Do this for every word in a long text and we would be able to generate the next word of a given sentence. Neural networks would be able to generate text.

Recurrent Neural Networks are a particular model of neural network which allows the network to build a “memory” of past events. One issue with classic RNNs is that they only have a short-term memory. LSTM (Long Short-Term Memory) networks allow us to capture long-term dependencies. RNN and LSTM are specific neural networks that can both learn from sequential data (e.g. text, video, music, time series data, ...); they just have a different method to calculate the hidden state.

On this page, I will try to build an intuition on how RNN / LSTM models work and apply them to a simple example with the help of the Keras library.

Classic functions

sigmoid

tanh

softmax
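
As a quick reminder, these three activation functions can be written in a few lines of numpy (a minimal sketch; the softmax subtracts the max of its input for numerical stability) :

import numpy as np

def sigmoid(z):
    # squashes any real value into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # squashes any real value into (-1, 1); numpy already provides it
    return np.tanh(z)

def softmax(z):
    # turns a vector of scores into a probability distribution
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / np.sum(e)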

Notations (inspired from deeplearning.ai notations)

x : the input sequence, x_t being the input at time step t

y : the true output sequence

a : the hidden state (activation) of the recurrent cell

c : the memory cell state (specific to the LSTM cell)

Weights and biases : the parameters learned by the network

Tx : length of the input sequence

Ty : length of the output sequence

vocab : the vocabulary, i.e. the list of all possible words or characters

m : number of training examples

n_a : number of units in the hidden state

n_x : size of an input vector x_t

n_y : size of an output vector y_t

ŷ : the prediction of the network

one hot vector : a vector of zeros with a single 1 at the index of the word or character it represents

Useful numpy functions

np.multiply : element-wise multiplication of two arrays

np.matmul : matrix product of two arrays

np.zeros : creates an array filled with zeros

np.concatenate : joins a sequence of arrays along an existing axis
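
A quick illustration of what each of these functions does (the shapes below are arbitrary example values) :

import numpy as np

a = np.array([[1., 2.], [3., 4.]])
b = np.array([[10., 20.], [30., 40.]])

print(np.multiply(a, b))               # element-wise product, shape (2, 2)
print(np.matmul(a, b))                 # matrix product, shape (2, 2)
print(np.zeros((3, 2)))                # 3x2 matrix filled with zeros
print(np.concatenate((a, b), axis=0))  # stacks a on top of b, shape (4, 2)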

RNN cell

Dimensions of vectors and matrices : x_t is a vector of size n_x, the hidden state s_t a vector of size n_a and the prediction \widehat{y}_{t} a vector of size n_y, so U has shape (n_a, n_x), W has shape (n_a, n_a) and V has shape (n_y, n_a).

RNN equations :

\left\{\begin{matrix} s_{t}=tanh(Ux_{t}+Ws_{t-1})\\ \widehat{y}_{t}=softmax(Vs_{t})  \end{matrix}\right.
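
Here is a minimal numpy sketch of one forward step of this cell, keeping the same U, W, V and s notation (the dimensions used at the end are arbitrary example values) :

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def rnn_cell_forward(x_t, s_prev, U, W, V):
    # s_t = tanh(U x_t + W s_{t-1})
    s_t = np.tanh(np.matmul(U, x_t) + np.matmul(W, s_prev))
    # y_hat_t = softmax(V s_t)
    y_hat_t = softmax(np.matmul(V, s_t))
    return s_t, y_hat_t

# example dimensions : n_x = 5 (one hot input), n_a = 4 hidden units, n_y = 5 outputs
n_x, n_a, n_y = 5, 4, 5
U = np.random.randn(n_a, n_x) * 0.01
W = np.random.randn(n_a, n_a) * 0.01
V = np.random.randn(n_y, n_a) * 0.01
s = np.zeros(n_a)   # initial hidden state
x = np.zeros(n_x)
x[2] = 1.0          # a one hot input vector
s, y_hat = rnn_cell_forward(x, s, U, W, V)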

At each time step t we can calculate the error :

E_{t}=-y_{t}*log(\widehat{y}_{t})

y_{t} being the real output value and \widehat{y}_{t} being our prediction.

The total error of the RNN network is the sum of the errors at each time step t :

E_{total}=\sum_{t=1}^{timesteps}E_{t}

E_{total}=-\sum_{t=1}^{timesteps}y_{t}*log(\widehat{y}_{t})
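
This total error translates directly into numpy (a sketch assuming y and y_hat are lists of vectors indexed by time step, with y_t one hot encoded) :

import numpy as np

def total_error(y, y_hat):
    # y[t] is the one hot ground truth at time t, y_hat[t] the predicted distribution
    return -sum(np.sum(y[t] * np.log(y_hat[t])) for t in range(len(y)))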

Different architectures for different purposes

  1. Classic RNN
  2. Bidirectional RNN
  3. Many-to-One
  4. One-to-Many
  5. Encoder / Decoder
  6. Attention Model

Now, in order to train our network, we need to minimize this error value by finding the optimal values for the U, V and W weight matrices. We will use the backpropagation through time algorithm to achieve this task.

Calculation of backpropagation through time :

We need to calculate the derivative of our error with respect to the weights U, V and W of our network. It is a more complex version of the calculation we did for a simple deep neural network.

It is certainly the most difficult part of understanding how recurrent neural networks work, so I am going to describe the calculation step by step.

Derivative of Error regarding V

We want to calculate \frac{\partial E_{total}}{\partial V} using the chain rule.

\frac{\partial E_{total}}{\partial V}=\sum_{t=1}^{timesteps}\frac{\partial E_{t}}{\partial V}

Using the chain rule and the fact that the derivative of a sum is the sum of the derivatives, and writing z_{t}=Vs_{t} for the input of the softmax (so that \widehat{y}_{t}=softmax(z_{t})), we can write (I removed the bounds around the sum to make it clearer) :

\frac{\partial E_{total}}{\partial V}=\sum \frac{\partial E_{t}}{\partial \widehat{y_{t}}}\frac{\partial \widehat{y_{t}}}{\partial z_{t}}\frac{\partial z_{t}}{\partial V}

Let’s derive each factor of our equation :

first factor :

we have E_{t}=-y_{t}*log(\widehat{y}_{t}), so

\frac{\partial E_{t}}{\partial \widehat{y_{t}}}=-y_{t}\frac{\partial log(\widehat{y_{t}})}{\partial \widehat{y_{t}}}  with  (log(z))'=\frac{1}{z}  which gives us

\frac{\partial E_{t}}{\partial \widehat{y_{t}}}=\frac{-y_{t}}{\widehat{y_{t}}}

second factor : derivative of the softmax function

we have softmax(z)_{j}=\sigma (z)_{j}=\frac{e^{z_{j}}}{\sum_{k=1}^{K}e^{z_{k}}} for all j \in \{1,...,K\}. I’ll write down an example of what the softmax vector looks like, so it’s easier to understand.

Imagine we have a 3-dimensional vector z such that z=\begin{bmatrix} z_{1}\\ z_{2} \\ z_{3} \end{bmatrix}

\sigma will be :

\sigma=\begin{bmatrix}  \sigma_{1} \\ \sigma_{2} \\ \sigma_{3} \end{bmatrix}=\begin{bmatrix} \frac{e^{z_{1}}}{e^{z_{1}}+e^{z_{2}}+e^{z_{3}}} \\ \frac{e^{z_{2}}}{e^{z_{1}}+e^{z_{2}}+e^{z_{3}}} \\ \frac{e^{z_{3}}}{e^{z_{1}}+e^{z_{2}}+e^{z_{3}}} \end{bmatrix}

so, using derivation rules (\frac{u}{v})'=\frac{u'v-uv'}{v^{2}}  and  (\frac{1}{u})'=\frac{-u'}{u^{2}}  we have :

 \frac{\partial \sigma }{\partial z_{1}}=\begin{bmatrix} \frac{e^{z_{1}}(e^{z_{1}}+e^{z_{2}}+e^{z_{3}})-(e^{z_{1}})^{2}}{(e^{z_{1}}+e^{z_{2}}+e^{z_{3}})^{2}} \\ \frac{-e^{z_{2}}e^{z_{1}}}{(e^{z_{1}}+e^{z_{2}}+e^{z_{3}})^{2}} \\ \frac{-e^{z_{3}}e^{z_{1}}}{(e^{z_{1}}+e^{z_{2}}+e^{z_{3}})^{2}} \end{bmatrix} \frac{\partial \sigma }{\partial z_{2}}=\begin{bmatrix} \frac{-e^{z_{1}}e^{z_{2}}}{(e^{z_{1}}+e^{z_{2}}+e^{z_{3}})^{2}} \\ \frac{e^{z_{2}}(e^{z_{1}}+e^{z_{2}}+e^{z_{3}})-(e^{z_{2}})^{2}}{(e^{z_{1}}+e^{z_{2}}+e^{z_{3}})^{2}} \\ \frac{-e^{z_{3}}e^{z_{2}}}{(e^{z_{1}}+e^{z_{2}}+e^{z_{3}})^{2}} \end{bmatrix} \frac{\partial \sigma }{\partial z_{3}}=\begin{bmatrix} \frac{-e^{z_{1}}e^{z_{3}}}{(e^{z_{1}}+e^{z_{2}}+e^{z_{3}})^{2}} \\ \frac{-e^{z_{2}}e^{z_{3}}}{(e^{z_{1}}+e^{z_{2}}+e^{z_{3}})^{2}} \\ \frac{e^{z_{3}}(e^{z_{1}}+e^{z_{2}}+e^{z_{3}})-(e^{z_{3}})^{2}}{(e^{z_{1}}+e^{z_{2}}+e^{z_{3}})^{2}} \end{bmatrix}

which we can write :

 \frac{\partial \sigma }{\partial z_{1}}=\begin{bmatrix} \sigma_{1}(1-\sigma_{1}) \\ -\sigma_{1}\sigma_{2} \\ -\sigma_{1}\sigma_{3} \end{bmatrix} \frac{\partial \sigma }{\partial z_{2}}=\begin{bmatrix} -\sigma_{2}\sigma_{1} \\ \sigma_{2}(1-\sigma_{2}) \\ -\sigma_{2}\sigma_{3} \end{bmatrix} \frac{\partial \sigma }{\partial z_{3}}=\begin{bmatrix} -\sigma_{3}\sigma_{1} \\ -\sigma_{3}\sigma_{2} \\ \sigma_{3}(1-\sigma_{3}) \end{bmatrix}

Let’s generalize this equation. We have 2 different cases :

if  i=j  then  \frac{\partial \sigma_{j} }{\partial z_{i}}=\sigma_{i}(1-\sigma_{i})

 

if i\neq j  then  \frac{\partial \sigma_{j} }{\partial z_{i}}=-\sigma_{j}\sigma_{i}
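
Both cases can be written in a single line using the Kronecker delta \delta_{ij} (equal to 1 if i=j and 0 otherwise) :

\frac{\partial \sigma_{j}}{\partial z_{i}}=\sigma_{j}(\delta_{ij}-\sigma_{i})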

third factor :

we have z_{t}=Vs_{t}, so

\frac{\partial z_{t}}{\partial V}=s_{t}
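
Putting the three factors together, and using the fact that y_{t} is a one hot vector, the expression simplifies to the following standard result (writing \otimes for the outer product) :

\frac{\partial E_{t}}{\partial V}=(\widehat{y}_{t}-y_{t})\otimes s_{t}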

 

Derivative of Error regarding W

Using the derivation rule (g\circ f)'=(g'\circ f)f'
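
Because s_{t} depends on W directly but also through s_{t-1}, s_{t-2}, ..., the chain rule has to be applied back through every previous time step. A sketch of the resulting sum (this is the “through time” part of backpropagation through time) :

\frac{\partial E_{t}}{\partial W}=\sum_{k=0}^{t}\frac{\partial E_{t}}{\partial \widehat{y}_{t}}\frac{\partial \widehat{y}_{t}}{\partial s_{t}}\frac{\partial s_{t}}{\partial s_{k}}\frac{\partial s_{k}}{\partial W}  with  \frac{\partial s_{t}}{\partial s_{k}}=\prod_{j=k+1}^{t}\frac{\partial s_{j}}{\partial s_{j-1}}

This long product of factors is what will later cause the vanishing and exploding gradient problems.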

 

Derivative of Error regarding U
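
The calculation for U is analogous to the one for W, since U also influences every hidden state s_{k} through the input term Ux_{k}. A sketch of the corresponding sum :

\frac{\partial E_{t}}{\partial U}=\sum_{k=0}^{t}\frac{\partial E_{t}}{\partial \widehat{y}_{t}}\frac{\partial \widehat{y}_{t}}{\partial s_{t}}\frac{\partial s_{t}}{\partial s_{k}}\frac{\partial s_{k}}{\partial U}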

 

Vanishing and exploding gradient problem discussion

Clipping the gradients between a min and a max value to avoid exploding gradients is an effective solution. We can use numpy's built-in np.clip function.
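
A minimal sketch of element-wise clipping applied to every gradient before the weight update (the max_value of 5 is an arbitrary example) :

import numpy as np

def clip_gradients(gradients, max_value=5.0):
    # clip every gradient between -max_value and +max_value
    return {name: np.clip(grad, -max_value, max_value)
            for name, grad in gradients.items()}

# example : gradients stored in a dictionary keyed by parameter name
gradients = {"dU": np.random.randn(4, 5) * 100,
             "dW": np.random.randn(4, 4) * 100,
             "dV": np.random.randn(5, 4) * 100}
gradients = clip_gradients(gradients, max_value=5.0)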

LSTM cell

The LSTM cell is a solution to the vanishing gradient problem. I’ll introduce new notations which are specific to this kind of cell.
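
As a sketch of the standard formulation, with \Gamma denoting the gates, c_{t} the cell state, a_{t} the hidden state, [a_{t-1},x_{t}] the concatenation of the previous hidden state with the current input and * the element-wise product :

\left\{\begin{matrix} \Gamma_{f}=sigmoid(W_{f}[a_{t-1},x_{t}]+b_{f})\\ \Gamma_{u}=sigmoid(W_{u}[a_{t-1},x_{t}]+b_{u})\\ \widetilde{c}_{t}=tanh(W_{c}[a_{t-1},x_{t}]+b_{c})\\ c_{t}=\Gamma_{u}*\widetilde{c}_{t}+\Gamma_{f}*c_{t-1}\\ \Gamma_{o}=sigmoid(W_{o}[a_{t-1},x_{t}]+b_{o})\\ a_{t}=\Gamma_{o}*tanh(c_{t}) \end{matrix}\right.

The forget gate \Gamma_{f} decides what to erase from the previous cell state, the update gate \Gamma_{u} what to write into it, and the output gate \Gamma_{o} what to expose in the hidden state. Because the cell state c_{t} is carried along with only element-wise operations, gradients can flow over many time steps without vanishing as quickly as in a classic RNN.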

Cost function and optimization

 

Tensorflow implementation

Let’s load up dependencies and information about the system.
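
A minimal version of this setup step (printing the library versions is just an example of the kind of system information one might check) :

import numpy as np
import tensorflow as tf

print("numpy version      :", np.__version__)
print("tensorflow version :", tf.__version__)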

We need to prepare the data properly before we feed it into our RNN. For this example we would like to predict the next outputs of a time series. One-hot encoding is the important step here. Our data is made of batches of sequences, each element of a sequence being one-hot encoded over num_classes possible values.

Particular attention to the dimensions of the tensors and matrices is required in order to understand how the forward propagation works. In our example we are trying to predict the next outputs of a time series. As usual, we are feeding the network with batches of inputs and labels (batch_x, batch_y).

Each input sequence contains time_steps elements, each represented as a one-hot vector (of length num_classes).
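
To make these dimensions concrete, here is a sketch of how a batch could be built (batch_size, time_steps and num_classes are example values, not taken from the original code) :

import numpy as np

batch_size, time_steps, num_classes = 32, 10, 8

# integer class index for every element of every sequence
raw_batch = np.random.randint(0, num_classes, size=(batch_size, time_steps))

# one hot encoding : shape becomes (batch_size, time_steps, num_classes)
batch_x = np.eye(num_classes)[raw_batch]

# labels : the same series shifted by one time step (predict the next value);
# the wrap-around at the last time step is ignored in this sketch
batch_y = np.roll(raw_batch, shift=-1, axis=1)

print(batch_x.shape)   # (32, 10, 8)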

 

An application of LSTM with Keras

Word-level implementation of text generation is highly demanding in terms of computational power. One trick is to build a sequence model which relies on character-level generation. A vocabulary of words usually contains more than 10 000 different words, which leads to very high-dimensional one-hot vectors, and computation becomes really slow with such a model. On the other hand, a character-based sequence model can be reduced to fewer than 100 different inputs (counting special characters and so on). Although this kind of model does not seem very natural (different from what real humans are actually doing with their brains), it’s still fun to play with.

This is not the best option for NLP; if you are interested in going further on how to reduce the dimensionality of language, I would recommend having a look at word embedding techniques.
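
Coming back to the character-level approach, a Keras model for it can be as small as the sketch below (seq_length and num_chars are placeholders for whatever the data preparation step produces) :

from keras.models import Sequential
from keras.layers import LSTM, Dense

seq_length = 40   # number of characters fed to the network at each step (example value)
num_chars = 60    # size of the character vocabulary (example value)

model = Sequential()
model.add(LSTM(128, input_shape=(seq_length, num_chars)))
model.add(Dense(num_chars, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.summary()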

Data preparation & analysis

 

lexicon of french rap lyrics

 

number of words per verse

 

Generate new lyrics

Let’s create a new Python file, generator.py, and load the Keras model that we created previously.

We just need to input the beginning of a sentence, on which the model can rely in order to generate a new sequence.
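
A sketch of what generator.py could look like (the model file name, the sequence length and the character set are assumptions and must match whatever was saved during training) :

import numpy as np
from keras.models import load_model

seq_length = 40                                        # must match the training configuration
chars = sorted(set("abcdefghijklmnopqrstuvwxyz '\n"))  # example character set
char2index = {c: i for i, c in enumerate(chars)}
index2char = {i: c for i, c in enumerate(chars)}

model = load_model('lyrics_model.h5')                  # hypothetical file name

def generate(seed, length=200):
    text = seed
    for _ in range(length):
        # one hot encode the last seq_length characters of the current text
        x = np.zeros((1, seq_length, len(chars)))
        for t, c in enumerate(text[-seq_length:]):
            x[0, t, char2index[c]] = 1.0
        probs = model.predict(x, verbose=0)[0]
        probs = probs / np.sum(probs)                        # renormalize before sampling
        next_index = np.random.choice(len(chars), p=probs)   # sample instead of argmax
        text += index2char[next_index]
    return text

print(generate("une fois de plus "))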

 

Github link

Text tokenization :

 

word2index and index2word

 

We need to separate the data sentence by sentence and then word by word. In the end, the input data will have the following format :

X=\begin{bmatrix} [sentence1]\\ [sentence2] \\ [sentence3] \end{bmatrix}

Each sentence being a matrix of words (each word being represented by a one-hot encoded vector) :

sentence=\begin{bmatrix} 1\\ 0 \\ 0 \end{bmatrix}\begin{bmatrix} 0\\ 1 \\ 0 \end{bmatrix}\begin{bmatrix} 0\\ 0 \\ 1 \end{bmatrix}
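
A sketch of this preparation step in plain Python (the corpus below is a tiny stand-in for the actual lyrics file) :

import numpy as np

corpus = "le chevalier rentre chez lui\nil pense a sa famille"   # stand-in for the real lyrics

# split sentence by sentence, then word by word
sentences = [line.split() for line in corpus.split("\n")]

# word2index and index2word mappings
vocab = sorted(set(word for sentence in sentences for word in sentence))
word2index = {w: i for i, w in enumerate(vocab)}
index2word = {i: w for i, w in enumerate(vocab)}

# each word becomes a one hot vector of length len(vocab)
def one_hot(word):
    v = np.zeros(len(vocab))
    v[word2index[word]] = 1.0
    return v

X = [[one_hot(word) for word in sentence] for sentence in sentences]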

How is modern rap constructed?

References :

  1. https://www.tensorflow.org/tutorials/recurrent
  2. https://colah.github.io/posts/2015-08-Understanding-LSTMs/
  3. https://www.youtube.com/watch?v=9zhrxE5PQgY&t=2s
  4. https://iamtrask.github.io/2015/11/15/anyone-can-code-lstm/
  5. http://blog.varunajayasiri.com/numpy_lstm.html
  6. http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
  7. http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-2-implementing-a-language-model-rnn-with-python-numpy-and-theano/
  8. https://www.coursera.org/learn/nlp-sequence-models/home/welcome