Introduction
Recurrent Neural Networks (RNNs) are particularly suitable for language analysis and for generating data based on previous sequences of information. Most of the notions and functions we created for DNNs will apply to our new system. However, RNNs are slightly more difficult to understand and some functions will require a bit more work. If you are new to neural networks, I recommend starting there for a first look at how they work.
Imagine you are reading a text and after a few lines, you are asked to complete the story. Instinctively you will use the available data (environment, characters, atmosphere) to feed your imagination.
“Once upon a time, a knight was riding back home, impatient to meet his children and wife again. Unfortunately,…”
Our brain promptly understands the situation from this short sentence.
We could imagine feeding a simple neural network with the first 9 words and using the 10th word as the label. Do this for every position in a long text and we would be able to generate the next word of a given sentence. Neural networks would then be able to generate text.
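To make the idea concrete, here is a toy sketch of that windowing step (the `build_training_pairs` helper and the 9-word window are illustrative choices, not code from any real library):

```python
# a toy sketch of the sliding-window idea: 9 words of context, the 10th word as label
def build_training_pairs(text, window=9):
    words = text.lower().split()
    pairs = []
    for i in range(len(words) - window):
        context = words[i:i + window]   # the first 9 words
        target = words[i + window]      # the 10th word is the label
        pairs.append((context, target))
    return pairs

story = ("once upon a time a knight was riding back home "
         "impatient to meet his children and wife again")
for context, target in build_training_pairs(story)[:2]:
    print(context, "->", target)
```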
Recurrent Neural Networks are a particular kind of neural network that allows the model to build a “memory” of past events. One issue with classic RNNs is that they only have short-term memory. LSTMs (Long Short-Term Memory networks) allow us to capture long-term dependencies. RNNs and LSTMs are specific neural networks that can both learn from sequential data (e.g. text, video, music, time series, …); they just use a different method to calculate the hidden state.
In this page, I will try to build an intuition of how RNN / LSTM models work and apply them to simple examples with the help of the Keras library.
Classic functions
- sigmoid: $\sigma(z) = \dfrac{1}{1 + e^{-z}}$
- tanh: $\tanh(z) = \dfrac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$
- softmax: $\mathrm{softmax}(z)_j = \dfrac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}$
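As a quick reference, here is a minimal NumPy version of these three functions (the max-shift inside softmax is a standard numerical-stability trick, not something required by the definition):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def softmax(z):
    e = np.exp(z - np.max(z))  # shift for numerical stability
    return e / np.sum(e)

z = np.array([1.0, -2.0, 0.5])
print(sigmoid(z), tanh(z), softmax(z))
```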
Notations (inspired by the deeplearning.ai notation)
- $x^{\langle t \rangle}$ : input at time step $t$
- $y^{\langle t \rangle}$ : true output (label) at time step $t$
- $a^{\langle t \rangle}$ : hidden state (activation) at time step $t$
- $c^{\langle t \rangle}$ : memory cell state (used in the LSTM cell)
- Weights and biases : the parameters learned during training
- $T_x$ : length of the input sequence
- $T_y$ : length of the output sequence
- vocab : vocabulary, the set of all possible values an input can take
- $m$ : number of training examples
- $n_a$ : number of hidden units
- $n_x$ : dimension of one input vector
- $n_y$ : dimension of one output vector
- $\hat{y}^{\langle t \rangle}$ : prediction of the network at time step $t$
- one hot vector : a vector full of zeros except for a single 1 at the index of the encoded value
Useful numpy functions
- np.multiply : element-wise multiplication of two arrays
- np.matmul : matrix multiplication of two arrays
- np.zeros : creates an array filled with zeros (useful to initialise the hidden state)
- np.concatenate : joins arrays along an existing axis
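A small sketch of how these four functions typically show up inside a recurrent cell (the dimensions $n_a = 4$, $n_x = 3$, $m = 2$ and the gating combination are arbitrary illustrative choices):

```python
import numpy as np

n_a, n_x, m = 4, 3, 2
a_prev = np.zeros((n_a, m))                     # np.zeros: hidden state initialised to zeros
x_t = np.random.randn(n_x, m)
Wg = np.random.randn(n_a, n_a + n_x)            # a demonstration weight matrix

concat = np.concatenate((a_prev, x_t), axis=0)  # np.concatenate: stack a_prev on top of x_t -> (n_a + n_x, m)
z = np.matmul(Wg, concat)                       # np.matmul: matrix product -> (n_a, m)
gate = 1 / (1 + np.exp(-z))                     # a sigmoid "gate"
gated = np.multiply(gate, np.tanh(z))           # np.multiply: element-wise product of gate and candidate
print(concat.shape, z.shape, gated.shape)       # (7, 2) (4, 2) (4, 2)
```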
RNN cell
Dimensions of vectors and matrices :
- $x^{\langle t \rangle}$ has shape $(n_x, m)$
- $a^{\langle t \rangle}$ has shape $(n_a, m)$
- $\hat{y}^{\langle t \rangle}$ has shape $(n_y, m)$
- $U$ (input weights, sometimes noted $W_{ax}$) has shape $(n_a, n_x)$
- $W$ (recurrent weights, sometimes noted $W_{aa}$) has shape $(n_a, n_a)$
- $V$ (output weights, sometimes noted $W_{ya}$) has shape $(n_y, n_a)$
- $b_a$ has shape $(n_a, 1)$ and $b_y$ has shape $(n_y, 1)$
RNN equations :
$$a^{\langle t \rangle} = \tanh\left(U x^{\langle t \rangle} + W a^{\langle t-1 \rangle} + b_a\right)$$
$$z^{\langle t \rangle} = V a^{\langle t \rangle} + b_y \qquad \hat{y}^{\langle t \rangle} = \mathrm{softmax}\left(z^{\langle t \rangle}\right)$$
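To make these equations concrete, here is a minimal NumPy sketch of one forward step of the cell (the random initialisation and the dimensions are arbitrary):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z, axis=0, keepdims=True))
    return e / np.sum(e, axis=0, keepdims=True)

def rnn_cell_forward(x_t, a_prev, U, W, V, b_a, b_y):
    # a<t> = tanh(U x<t> + W a<t-1> + b_a)
    a_t = np.tanh(np.matmul(U, x_t) + np.matmul(W, a_prev) + b_a)
    # y_hat<t> = softmax(V a<t> + b_y)
    y_hat_t = softmax(np.matmul(V, a_t) + b_y)
    return a_t, y_hat_t

n_x, n_a, n_y, m = 3, 5, 2, 4            # arbitrary dimensions
x_t = np.random.randn(n_x, m)
a_prev = np.zeros((n_a, m))
U = np.random.randn(n_a, n_x)
W = np.random.randn(n_a, n_a)
V = np.random.randn(n_y, n_a)
b_a = np.zeros((n_a, 1))
b_y = np.zeros((n_y, 1))

a_t, y_hat_t = rnn_cell_forward(x_t, a_prev, U, W, V, b_a, b_y)
print(a_t.shape, y_hat_t.shape)          # (5, 4) (2, 4)
```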
At each time step $t$ we can calculate the error with the cross-entropy loss: $E_t = -\sum_{k} y^{\langle t \rangle}_k \log \hat{y}^{\langle t \rangle}_k$, $y^{\langle t \rangle}$ being the real output value and $\hat{y}^{\langle t \rangle}$ being our prediction.
The total error of the RNN network is the sum of the errors at each time step: $E = \sum_{t} E_t$.
Different architectures for different purposes
- Classic RNN
- Bidirectional RNN (Bi-RNN)
- Many-to-one
- One-to-many
- Encoder / Decoder
- Attention model
Now, in order to train our network we need to minimize this error by finding the optimal values for the U, V and W weight matrices. We will use the backpropagation through time algorithm to achieve this task.
Calculation of backpropagation through time :
We need to calculate the derivative of our error with respect to the weights U, V and W of our network. It's a more complex version of the calculation we did for a simple deep neural network.
It's certainly the most difficult part of understanding how recurrent neural networks work, so I am going to describe the calculation step by step.
Derivative of the error with respect to V
We want to calculate $\frac{\partial E}{\partial V}$ using the chain rule.
Using the chain rule and the linearity of the derivative (the derivative of a sum is the sum of the derivatives), we can write (I removed the indices around the sum to make it clearer):
$$\frac{\partial E}{\partial V} = \sum \frac{\partial E_t}{\partial \hat{y}^{\langle t \rangle}} \cdot \frac{\partial \hat{y}^{\langle t \rangle}}{\partial z^{\langle t \rangle}} \cdot \frac{\partial z^{\langle t \rangle}}{\partial V}$$
Let's differentiate each factor of this equation.
First factor: we have $E_t = -\sum_k y^{\langle t \rangle}_k \log \hat{y}^{\langle t \rangle}_k$, so $\frac{\partial E_t}{\partial \hat{y}^{\langle t \rangle}_k} = -\frac{y^{\langle t \rangle}_k}{\hat{y}^{\langle t \rangle}_k}$, with $y^{\langle t \rangle}$ being a one hot vector, which gives us $-\frac{1}{\hat{y}^{\langle t \rangle}_k}$ for the true class and $0$ for every other component.
Second factor: derivative of the softmax function.
We have, for all $j$ from $1$ to $K$ ($K$ being the number of classes): $\hat{y}_j = \mathrm{softmax}(z)_j = \dfrac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}$. I'll write down an example of what the softmax vector looks like, so it's easier to understand.
Imagine we have a 3-dimensional vector $z = (z_1, z_2, z_3)$; $\mathrm{softmax}(z)$ will be:
$$\hat{y} = \left( \frac{e^{z_1}}{e^{z_1}+e^{z_2}+e^{z_3}},\; \frac{e^{z_2}}{e^{z_1}+e^{z_2}+e^{z_3}},\; \frac{e^{z_3}}{e^{z_1}+e^{z_2}+e^{z_3}} \right)$$
So, using the quotient rule $\left(\dfrac{u}{v}\right)' = \dfrac{u'v - uv'}{v^2}$ and $\dfrac{\partial e^{z_1}}{\partial z_1} = e^{z_1}$, we have:
$$\frac{\partial \hat{y}_1}{\partial z_1} = \frac{e^{z_1}\left(e^{z_1}+e^{z_2}+e^{z_3}\right) - e^{z_1} e^{z_1}}{\left(e^{z_1}+e^{z_2}+e^{z_3}\right)^2}$$
which we can write: $\dfrac{\partial \hat{y}_1}{\partial z_1} = \hat{y}_1 \left(1 - \hat{y}_1\right)$.
Let's generalize this equation. We have 2 different cases:
- if $i = j$ then $\dfrac{\partial \hat{y}_i}{\partial z_j} = \hat{y}_i \left(1 - \hat{y}_i\right)$
- if $i \neq j$ then $\dfrac{\partial \hat{y}_i}{\partial z_j} = -\,\hat{y}_i \hat{y}_j$
Third factor: we have $z^{\langle t \rangle} = V a^{\langle t \rangle} + b_y$, so $\dfrac{\partial z^{\langle t \rangle}}{\partial V} = a^{\langle t \rangle}$.
Putting the three factors together (and using the fact that $y^{\langle t \rangle}$ is one hot), we obtain the compact result $\dfrac{\partial E_t}{\partial V} = \left(\hat{y}^{\langle t \rangle} - y^{\langle t \rangle}\right) \otimes a^{\langle t \rangle}$, with $\otimes$ the outer product.
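If you want to convince yourself that the softmax derivative used above is correct, here is a small NumPy check (the test vector z is arbitrary) comparing the analytical Jacobian with finite differences:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([1.0, 2.0, 0.5])
y_hat = softmax(z)

# analytical Jacobian: d y_hat_i / d z_j = y_hat_i * (delta_ij - y_hat_j)
jac = np.diag(y_hat) - np.outer(y_hat, y_hat)

# numerical check with central finite differences
eps = 1e-6
num = np.zeros((3, 3))
for j in range(3):
    dz = np.zeros(3)
    dz[j] = eps
    num[:, j] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)

print(np.allclose(jac, num, atol=1e-6))  # True
```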
Derivative of the error with respect to W
Using the chain rule again, the extra difficulty is that $a^{\langle t \rangle}$ depends on $W$ both directly and through $a^{\langle t-1 \rangle}$, which itself depends on $W$, and so on back to the first time step. We therefore have to sum the contributions of every previous time step:
$$\frac{\partial E_t}{\partial W} = \sum_{k=0}^{t} \frac{\partial E_t}{\partial \hat{y}^{\langle t \rangle}} \cdot \frac{\partial \hat{y}^{\langle t \rangle}}{\partial a^{\langle t \rangle}} \cdot \frac{\partial a^{\langle t \rangle}}{\partial a^{\langle k \rangle}} \cdot \frac{\partial a^{\langle k \rangle}}{\partial W}$$
Derivative of the error with respect to U
The calculation is the same as for $W$: $U$ also influences every hidden state, so we sum over all previous time steps and simply replace $\frac{\partial a^{\langle k \rangle}}{\partial W}$ by $\frac{\partial a^{\langle k \rangle}}{\partial U}$ in the last factor.
Vanishing and exploding gradient problem discussion
The term $\frac{\partial a^{\langle t \rangle}}{\partial a^{\langle k \rangle}}$ in the sum above is a product of $t-k$ Jacobian matrices: when the recurrent weights are small this product shrinks towards zero (vanishing gradient, which prevents the network from learning long-term dependencies), and when they are large it blows up (exploding gradient). Clipping the gradient between a minimum and a maximum value to avoid exploding gradients is an effective solution. We can use NumPy's built-in clip function.
```python
# clip every value of the gradient array in place between minValue and maxValue
gradient.clip(minValue, maxValue, out=gradient)
```
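As a small usage illustration (the gradient values and the [-5, 5] range are made up), clipping a dictionary of gradients in place could look like this:

```python
import numpy as np

gradients = {"dU": np.array([[12.3, -0.4], [3.1, -80.0]]),
             "dW": np.array([[0.2, 7.9], [-6.5, 1.1]])}

# clip every gradient to the range [-5, 5] before the parameter update
for name, gradient in gradients.items():
    gradient.clip(-5, 5, out=gradient)  # in-place clipping

print(gradients["dU"])  # values now lie between -5 and 5
```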
LSTM cell
The LSTM cell is a solution to the vanishing gradient problem. I'll introduce new notations which are specific to this kind of cell.
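I am not reproducing the full derivation here, but for reference these are the standard LSTM cell equations, written with deeplearning.ai-style gate notation ($\Gamma_u$, $\Gamma_f$ and $\Gamma_o$ being the update, forget and output gates, and $*$ the element-wise product); the exact symbols may differ from the figures that originally illustrated this section:

$$\begin{aligned}
\tilde{c}^{\langle t \rangle} &= \tanh\left(W_c\left[a^{\langle t-1 \rangle}, x^{\langle t \rangle}\right] + b_c\right)\\
\Gamma_u &= \sigma\left(W_u\left[a^{\langle t-1 \rangle}, x^{\langle t \rangle}\right] + b_u\right)\\
\Gamma_f &= \sigma\left(W_f\left[a^{\langle t-1 \rangle}, x^{\langle t \rangle}\right] + b_f\right)\\
\Gamma_o &= \sigma\left(W_o\left[a^{\langle t-1 \rangle}, x^{\langle t \rangle}\right] + b_o\right)\\
c^{\langle t \rangle} &= \Gamma_u * \tilde{c}^{\langle t \rangle} + \Gamma_f * c^{\langle t-1 \rangle}\\
a^{\langle t \rangle} &= \Gamma_o * \tanh\left(c^{\langle t \rangle}\right)
\end{aligned}$$

The cell state $c^{\langle t \rangle}$ acts as a highway through time: the forget and update gates decide what to erase and what to write, which is what lets gradients flow over long ranges instead of vanishing.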
Cost function and optimization
Tensorflow implementation
Let’s load up dependencies and information about the system.
```python
# a tensorflow implementation of RNN and LSTM networks
# dependencies
import tensorflow as tf
import numpy as np
import pandas as pd
import sys
import os
import random
import matplotlib.pyplot as plt

# information about the system
print("Python", sys.version)
print("Tensorflow", tf.__version__)
current_folder = os.getcwd()
```
We need to prepare the data properly before we feed it into our RNN. For this example we would like to predict the next outputs of a time series. One-hot encoding is the important step here. Our data is made of a short sequence of integers:
```python
# data
data = [0,1,2,3,4,1,2,3,4,5,2,3,4,5,6,3,4,1,2,3,4,5,2,3,4,5,6,4,5,6,7,8]
# plt.plot(data)
# plt.show()

Tx = len(data) - 1
vocabulary = set(data)
num_classes = len(vocabulary)
num_hidden = 10
batch_size = 2

# build dictionaries
value_to_idx = dict((v, i) for i, v in enumerate(vocabulary))
idx_to_value = dict((i, v) for i, v in enumerate(vocabulary))
integer_encoded = [value_to_idx[value] for value in data]

# one-hot encode each integer value
one_hot = []
for idx in integer_encoded:
    x_temp = np.zeros(num_classes)
    x_temp[idx] = 1
    one_hot.append(x_temp)

# data preparation for the neural network: sliding windows of time_steps inputs,
# the following value being the label
x = []
y = []
time_steps = 3
for i in range(0, len(one_hot) - time_steps, 1):
    X_buffer = one_hot[i:i + time_steps]
    y_buffer = one_hot[i + time_steps]
    x.append(X_buffer)
    y.append(y_buffer)

x = np.asarray(x)
y = np.asarray(y)
# print("x=", x)
# print("y=", y)
print("x has shape", x.shape)
print("y has shape", y.shape)
```
Particular attention to the dimensions of tensors and matrices is required in order to understand how forward propagation works. In our example we are trying to predict outputs for a time series. As usual, we are feeding the network with batches of inputs and labels (batch_x, batch_y).
Each input is made of time_steps steps, each step being represented as a one-hot vector (of length num_classes).
```python
# tensorflow graph
X = tf.placeholder(tf.float32, [None, time_steps, num_classes])
Y = tf.placeholder(tf.float32, [None, num_classes])
learning_rate = 0.001

# weights and biases applied to the hidden state to calculate the prediction
weights = tf.Variable(tf.random_normal([num_hidden, num_classes]))
biases = tf.Variable(tf.random_normal([num_classes]))

def rnn_model(x, weights, biases):
    cell = tf.contrib.rnn.BasicRNNCell(num_hidden)
    outputs, states = tf.nn.dynamic_rnn(cell=cell, inputs=x, dtype=tf.float32)
    print("outputs has shape (batch_size,time_steps,num_hidden)", outputs.shape)
    print("outputs[:,-1] has shape", outputs[:, -1].shape)
    print("states has shape (batch_size,num_hidden)", states.shape)
    # use the hidden state of the last time step to compute the logits
    logits = tf.matmul(outputs[:, -1], weights) + biases
    print("logits has shape (batch_size,num_classes)", logits.shape)
    return logits

logits = rnn_model(X, weights, biases)
prediction = tf.nn.softmax(logits)

cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=Y))
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)

correct_prediction = tf.equal(tf.argmax(prediction, 1), tf.argmax(Y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

init = tf.global_variables_initializer()
saver = tf.train.Saver()

with tf.Session() as sess:
    sess.run(init)
    print("Starting training of the network:")
    for epoch in range(1000):
        for batch in range(int(Tx / batch_size)):
            # consecutive, non-overlapping batches of inputs and labels
            batch_x = x[batch * batch_size:(batch + 1) * batch_size]
            batch_y = y[batch * batch_size:(batch + 1) * batch_size]
            sess.run(optimizer, feed_dict={X: batch_x, Y: batch_y})
        if epoch % 500 == 0:
            acc = sess.run(accuracy, feed_dict={X: batch_x, Y: batch_y})
            loss = sess.run(cost, feed_dict={X: batch_x, Y: batch_y})
            print("Epoch:", epoch)
            print("Accuracy:", acc)
            print("loss:", loss)
    save_path = saver.save(sess, current_folder + "/model/model.ckpt")
    print("Model saved")
```
An application of LSTM with Keras
A word-level implementation for text generation is very demanding in terms of computation power. One trick is to build a sequence model which relies on character-level generation. A vocabulary of words usually contains more than 10,000 different words, which leads to high-dimensional one-hot vectors; computation becomes really slow with such a model. On the other hand, a character-based sequence model can be reduced to fewer than 100 different inputs (counting special characters and so on). Although this kind of model does not seem very natural (different from what real humans actually do with their brains), it's still fun to play with.
This is not the best option for serious NLP work; if you are interested in going further on how to reduce the dimensionality of language, I recommend having a look at word embedding techniques.
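As a pointer, here is a minimal Keras sketch of what a word-embedding based model could look like (the vocabulary size, embedding dimension and layer sizes are arbitrary assumptions, not values from this project):

```python
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

vocab_size = 10000   # assumed vocabulary size
embedding_dim = 100  # assumed embedding dimension
time_steps = 12      # assumed sequence length

# instead of 10 000-dimensional one-hot vectors, the Embedding layer maps each
# word index to a dense 100-dimensional vector learned with the rest of the model
model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=time_steps))
model.add(LSTM(128))
model.add(Dense(vocab_size, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.summary()
```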
Data preparation & analysis
Lexicon of French rap lyrics
Number of words per verse
Generate new lyrics
Let’s create a new python file generator.py and load the keras model that we created previously.
```python
# generate text
import keras
import numpy as np  # needed below for the generation loop

print("---------------------------------------")
print("TEXT GENERATION")

# load the network weights
filename = "classifier128128100.h5"
model = keras.models.load_model(filename)
```
We just need to input the beginning of a sentence on which the model can rely in order to generate a new sequence.
```python
# pick a seed (the input has to be lower case)
# toonehot, value_to_idx, idx_to_value and num_classes are assumed to be
# defined in (or imported from) the data preparation code
start = " "
integer_list = [value_to_idx[value] for value in start]
print("initial integer list:", integer_list)

inputs = np.asarray(toonehot(start))
inputs = np.reshape(inputs, (1, inputs.shape[0], inputs.shape[1]))
print(type(inputs))
print(inputs.shape)

# generate characters
for i in range(4000):
    probs = model.predict(inputs)
    # sample the next character index from the predicted distribution
    integer_list.append(np.random.choice(num_classes, p=probs.ravel()))
    # keep only the last 12 characters as the next input window
    last_window = integer_list[-12:]
    inputs = []
    for value in last_window:
        x_temp = np.zeros(num_classes)
        x_temp[value] = 1
        inputs.append(x_temp)
    inputs = np.asarray(inputs)
    inputs = np.reshape(inputs, (1, inputs.shape[0], inputs.shape[1]))

generated_text = [idx_to_value[idx] for idx in integer_list]
print(''.join(generated_text))
print("Done.")
```
Text tokenization :
word2index and index2word
```python
# creating word2index and index2word dictionaries
word2index = dict((w, i) for i, w in enumerate(unique))
index2word = dict((i, w) for i, w in enumerate(unique))
integer_encoded = [word2index[word] for word in words]

printed = False
if printed:
    print(words)            # word version of the data
    print(integer_encoded)  # integer version of the data (each word corresponds to one integer value)

onehot_encoded = list()  # empty one-hot list

def one_hot(integer_encoded):
    for value in integer_encoded:
        val = [0 for _ in range(len(unique))]
        val[value] = 1
        onehot_encoded.append(val)
    return onehot_encoded

save = False
if save:
    onehot = one_hot(integer_encoded)
    print(len(onehot), "values in onehot vector")
    onehot = np.asarray(onehot)
    print(type(onehot))
    onehot.tofile("data")
```
We need to separate the data sentence by sentence and then word by word. In the end, the input data will have the following format:
each sentence is a matrix of words, each word being represented by a one-hot encoded vector.
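As a toy illustration of that format (the 4-word vocabulary is made up), a sentence can be turned into a matrix of one-hot word vectors like this:

```python
import numpy as np

vocab = ["the", "rap", "is", "modern"]
word2index = {w: i for i, w in enumerate(vocab)}

def sentence_to_matrix(sentence):
    # one row per word, one column per vocabulary entry
    matrix = np.zeros((len(sentence), len(vocab)))
    for t, word in enumerate(sentence):
        matrix[t, word2index[word]] = 1
    return matrix

print(sentence_to_matrix(["the", "rap", "is", "modern"]).shape)  # (4, 4)
```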
How is modern rap constructed?
References :
- https://www.tensorflow.org/tutorials/recurrent
- https://colah.github.io/posts/2015-08-Understanding-LSTMs/
- https://www.youtube.com/watch?v=9zhrxE5PQgY&t=2s
- https://iamtrask.github.io/2015/11/15/anyone-can-code-lstm/
- http://blog.varunajayasiri.com/numpy_lstm.html
- http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
- http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-2-implementing-a-language-model-rnn-with-python-numpy-and-theano/
- https://www.coursera.org/learn/nlp-sequence-models/home/welcome