Text Generation with LSTM in PyTorch

By Adrian Tam on April 8, 2023 in Deep Learning with PyTorch

Recurrent neural networks can be used for time series prediction, in which case a regression model is created. They can also be used as generative models, which are usually classification models. A generative model learns certain patterns from data so that, when presented with a prompt, it can create a complete output in the same style as the learned patterns.

In this post, you will discover how to build a generative model for text using LSTM recurrent neural networks in PyTorch. After finishing this post, you will know:

Where to download a free corpus of text that you can use to train text generative models
How to frame the problem of text sequences to a recurrent neural network generative model
How to develop an LSTM to generate plausible text sequences for a given problem

Let’s get started.

Text Generation with LSTM in PyTorch
Photo by Egor Lyfar. Some rights reserved.

Overview

This post is divided into six parts; they are:

What is a Generative Model
Getting Text Data
A Small LSTM Network to Predict the Next Character
Generating Text with an LSTM Model
Using a Larger LSTM Network
Faster Training with GPU

What is a Generative Model

A generative model is just another machine learning model that happens to be able to create new things. Generative adversarial networks (GANs) are a class of their own. Transformer models that use the attention mechanism have also been found useful for generating text passages.

It is just a machine learning model because it has been trained on existing data and has learned something from it. Depending on how they are trained, generative models can work vastly differently. In this post, a character-based generative model is created. This means training a model that takes a sequence of characters (letters and punctuation) as input and the immediately following character as the target. As long as it can predict the next character given the preceding ones, you can run the model in a loop to generate a long piece of text.

This model is probably the simplest one. However, human language is complex, so you shouldn’t expect it to produce very high-quality output. Even so, you need a lot of data and must train the model for a long time before you see sensible results.
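The predict-then-loop idea can be sketched without any neural network. The toy below is only an illustration and not the model built in this post: it assumes a hypothetical `predict_next` helper that looks up the most frequent follower of the last few characters in a tiny made-up corpus, and then feeds each prediction back into the input window.

```python
from collections import Counter, defaultdict

def build_table(text, window):
    """Count which character follows each `window`-character context."""
    table = defaultdict(Counter)
    for i in range(len(text) - window):
        table[text[i:i + window]][text[i + window]] += 1
    return table

def predict_next(table, context):
    """Return the most frequent next character, or a space for unseen contexts."""
    if context not in table:
        return " "
    return table[context].most_common(1)[0][0]

def generate(table, prompt, n_chars):
    """Repeatedly predict one character and slide the window forward by one."""
    window = len(prompt)
    context, output = prompt, []
    for _ in range(n_chars):
        ch = predict_next(table, context)
        output.append(ch)
        context = (context + ch)[-window:]  # append the new char, drop the oldest
    return "".join(output)

# made-up corpus, for illustration only
corpus = "the cat sat on the mat. the cat sat on the mat."
table = build_table(corpus, window=3)
print(generate(table, "the", 12))
```

A lookup table like this can only repeat what it has seen. The LSTM in this post plays the role of `predict_next`, but it learns a prediction from 100 characters of context instead of memorizing exact matches.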

Getting Text Data

Obtaining high quality data is important for a successful generative model. Fortunately, many of the classical texts are no longer protected under copyright. This means you can download all the text for these books for free and use them in experiments, like creating generative models. Perhaps the best place to get access to free books that are no longer protected by copyright is Project Gutenberg.

In this post, you will use a favorite book from childhood as the dataset, Alice’s Adventures in Wonderland by Lewis Carroll:

https://www.gutenberg.org/ebooks/11

Your model will learn the dependencies between characters and the conditional probabilities of characters in sequences so that you can, in turn, generate wholly new and original sequences of characters. This post is a lot of fun, and repeating these experiments with other books from Project Gutenberg is recommended. These experiments are not limited to text; you can also experiment with other ASCII data, such as computer source code, marked-up documents in LaTeX, HTML or Markdown, and more.

You can download the complete text in ASCII format (Plaintext UTF-8) for this book for free and place it in your working directory with the filename wonderland.txt. Now, you need to prepare the dataset for modeling. Project Gutenberg adds a standard header and footer to each book, which are not part of the original text. Open the file in a text editor and delete the header and footer. The header is obvious and ends with the text:

*** START OF THIS PROJECT GUTENBERG EBOOK ALICE'S ADVENTURES IN WONDERLAND ***

The footer is all the text after the line of text that says:

THE END

You should be left with a text file that has about 3,400 lines of text.
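If you prefer not to edit the file by hand, the same trimming can be sketched in code. The helper below is hypothetical (`strip_gutenberg` and its default markers are assumptions for illustration); the exact header wording can differ between editions of the file, so check the markers against your copy first.

```python
def strip_gutenberg(text, start_marker="*** START OF", end_marker="THE END"):
    """Keep only the text after the header line and up to the end marker."""
    lines = text.splitlines()
    # first line that begins with the header marker (raises StopIteration if absent)
    start = next(i for i, line in enumerate(lines) if line.startswith(start_marker))
    # last line that reads exactly "THE END" (raises ValueError if absent)
    end = max(i for i, line in enumerate(lines) if line.strip() == end_marker)
    return "\n".join(lines[start + 1:end + 1]).strip()

# made-up miniature file, for illustration only
sample = """Title: Example
*** START OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***

Chapter I.
Some story text.

THE END

End of Project Gutenberg's Example"""

body = strip_gutenberg(sample)
print(body)
```

The function raises an error when a marker is missing rather than silently returning the whole file, so a changed header format is noticed early.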

A Small LSTM Network to Predict the Next Character

First, you need to do some preprocessing on the data before you can build a model. Neural network models can only work with numbers, not text. Therefore you need to transform the characters into numbers. To make the problem simpler, you also want to transform all uppercase letters into lowercase.

Below, you open the text file, transform all letters to lowercase, and create a Python dict char_to_int that maps each character to a distinct integer. For example, the list of unique sorted lowercase characters in the book is as follows:

['\n', '\r', ' ', '!', '"', "'", '(', ')', '*', ',', '-', '.', ':', ';', '?', '[', ']',
'_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p',
'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '\xbb', '\xbf', '\xef']

Since this problem is character-based, the “vocabulary” is the set of distinct characters used in the text.

import numpy as np

# load ascii text and convert to lowercase
filename = "wonderland.txt"
raw_text = open(filename, 'r', encoding='utf-8').read()
raw_text = raw_text.lower()

# create mapping of unique chars to integers
chars = sorted(list(set(raw_text)))
char_to_int = dict((c, i) for i, c in enumerate(chars))

# summarize the loaded data
n_chars = len(raw_text)
n_vocab = len(chars)
print("Total Characters: ", n_chars)
print("Total Vocab: ", n_vocab)

This should print:

Total Characters: 144574
Total Vocab: 50

You can see the book has just under 150,000 characters, and when converted to lowercase, there are only 50 distinct characters in the vocabulary for the network to learn. That is still many more than the 26 letters of the alphabet, because punctuation and whitespace count as characters too.
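To see the mapping at work on something small, here is a minimal sketch on a single made-up sentence. It builds the mapping the same way as above, plus the inverse mapping `int_to_char` that is needed later to turn predictions back into text.

```python
# made-up sample sentence, for illustration only
text = "Alice was beginning to get very tired".lower()

# same construction as in the post: sorted unique characters -> integer ids
chars = sorted(set(text))
char_to_int = {c: i for i, c in enumerate(chars)}
int_to_char = {i: c for c, i in char_to_int.items()}  # inverse mapping

encoded = [char_to_int[c] for c in text]
decoded = "".join(int_to_char[i] for i in encoded)

print(len(chars), "distinct characters")
print(encoded[:5])
assert decoded == text  # encoding and decoding round-trips the lowercase text
```

Because the characters are sorted before numbering, the ids depend on which characters appear in the text. This is why the mapping has to be saved together with the trained model.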

Next, you need to separate the text into inputs and targets. A window of 100 characters is used here. That is, with characters 1 to 100 as input, your model predicts character 101. If a window of 5 were used, the word “chapter” would become two data samples:

chapt -> e
hapte -> r
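The “chapter” example can be reproduced with a few lines of Python. This is a minimal sketch; `make_windows` is a hypothetical helper name, not part of the code used later in this post.

```python
def make_windows(text, window):
    """Return (input, target) pairs: `window` characters and the one that follows."""
    return [(text[i:i + window], text[i + window])
            for i in range(len(text) - window)]

pairs = make_windows("chapter", 5)
for seq_in, seq_out in pairs:
    print(seq_in, "->", seq_out)
```

A text of n characters yields n minus the window length samples, since every window needs one more character after it to serve as the target.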

In a long text such as this one, a myriad of windows can be created, producing a dataset with a large number of samples:

# prepare the dataset of input to output pairs encoded as integers
seq_length = 100
dataX = []
dataY = []
for i in range(0, n_chars - seq_length, 1):
    seq_in = raw_text[i:i + seq_length]
    seq_out = raw_text[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
n_patterns = len(dataX)
print("Total Patterns: ", n_patterns)

Running the above, you can see that a total of 144,474 samples are created. Each sample is now in the form of integers, transformed using the mapping char_to_int. However, a PyTorch model prefers to see the data as floating-point tensors, so you should convert these into PyTorch tensors. Since an LSTM layer is going to be used in the model, the input tensor should have the dimensions (sample, time steps, features). To help training, it is also a good idea to normalize the input to the range 0 to 1. Hence you have the following:

import torch
import torch.nn as nn
import torch.optim as optim

# reshape X to be [samples, time steps, features]
X = torch.tensor(dataX, dtype=torch.float32).reshape(n_patterns, seq_length, 1)
X = X / float(n_vocab)
y = torch.tensor(dataY)
print(X.shape, y.shape)
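The shapes and data types are easier to check on a toy dataset. The sketch below uses three made-up samples of four time steps each (the integer ids are arbitrary) and applies the same reshape and scaling as above.

```python
import torch

n_vocab = 50                        # vocabulary size reported above
dataX = [[3, 7, 1, 0], [7, 1, 0, 4], [1, 0, 4, 9]]   # 3 toy samples of 4 integer ids
dataY = [4, 9, 2]                   # the integer id that follows each sample

# (samples, time steps, features): one feature per time step
X = torch.tensor(dataX, dtype=torch.float32).reshape(len(dataX), 4, 1)
X = X / float(n_vocab)              # ids 0..49 are scaled into the range [0, 1)
y = torch.tensor(dataY)             # stays integer: class indices for the loss

print(X.shape, X.dtype)
print(y.shape, y.dtype)
```

Note that only the input is scaled. The target stays as integer class indices, which is the form the cross-entropy loss used later expects.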

You can now define your LSTM model. Here, you define a single hidden LSTM layer with 256 hidden units. The input has a single feature (i.e., one integer for one character). A dropout layer with probability 0.2 is added after the LSTM layer. The output of the LSTM layer is a tuple, of which the first element is the hidden states from the LSTM cell for each time step. It is a history of how the hidden state evolved as the LSTM cell accepted each time step of input. Presumably, the last hidden state contains the most information, hence only the last hidden state is passed on to the output layer. The output layer is a fully-connected layer that produces logits for the 50 characters in the vocabulary. The logits can be converted into probability-like predictions using a softmax function.

import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data

class CharModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=256, num_layers=1, batch_first=True)
        self.dropout = nn.Dropout(0.2)
        self.linear = nn.Linear(256, n_vocab)
    def forward(self, x):
        x, _ = self.lstm(x)
        # take only the last output
        x = x[:, -1, :]
        # produce output
        x = self.linear(self.dropout(x))
        return x
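To make the tensor shapes concrete, the sketch below runs a random toy batch through the same kind of LSTM layer outside of the model class. The batch size of 8 is an arbitrary choice for illustration, and the vocabulary size of 50 is taken from this post.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=1, hidden_size=256, num_layers=1, batch_first=True)
linear = nn.Linear(256, 50)            # 50 = vocabulary size in this post

x = torch.rand(8, 100, 1)              # toy batch: 8 samples, 100 time steps, 1 feature
output, (h_n, c_n) = lstm(x)

print(output.shape)                    # hidden state at every time step
print(h_n.shape)                       # final hidden state of each layer
last = output[:, -1, :]                # what the model passes to the output layer
# for a single-layer, one-directional LSTM this equals the final hidden state
print(torch.allclose(last, h_n[-1]))

logits = linear(last)
probs = torch.softmax(logits, dim=1)   # each row now sums to 1
print(logits.shape, probs.sum(dim=1))
```

Slicing `output[:, -1, :]` is used in the model instead of `h_n` because it keeps working unchanged when more LSTM layers are stacked.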

This is a model for single-character classification over 50 classes. Therefore cross-entropy loss should be used. It is optimized using the Adam optimizer. The training loop is as follows. For simplicity, no test set has been created, but the model is evaluated with the training set once again at the end of each epoch to keep track of the progress.

This program can run for a long time, especially on CPU! To preserve the fruits of this work, the best model found so far is saved for future reuse.

n_epochs = 40
batch_size = 128
model = CharModel()

optimizer = optim.Adam(model.parameters())
loss_fn = nn.CrossEntropyLoss(reduction="sum")
loader = data.DataLoader(data.TensorDataset(X, y), shuffle=True, batch_size=batch_size)

best_model = None
best_loss = np.inf
for epoch in range(n_epochs):
    model.train()
    for X_batch, y_batch in loader:
        y_pred = model(X_batch)
        loss = loss_fn(y_pred, y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # Validation
    model.eval()
    loss = 0
    with torch.no_grad():
        for X_batch, y_batch in loader:
            y_pred = model(X_batch)
            loss += loss_fn(y_pred, y_batch)
        if loss < best_loss:
            best_loss = loss
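            # clone the tensors: state_dict() returns references that later training overwrites
            best_model = {k: v.clone() for k, v in model.state_dict().items()}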
        print("Epoch %d: Cross-entropy: %.4f" % (epoch, loss))

torch.save([best_model, char_to_int], "single-char.pth")

Running the above may produce the following:

...
Epoch 35: Cross-entropy: 245745.2500
Epoch 36: Cross-entropy: 243908.7031
Epoch 37: Cross-entropy: 238833.5000
Epoch 38: Cross-entropy: 239069.0000
Epoch 39: Cross-entropy: 234176.2812

The cross entropy decreases in almost every epoch. This suggests the model has not fully converged and that you can train it for more epochs. When the training loop completes, you should have the file single-char.pth containing the best model weights found, as well as the character-to-integer mapping used by this model.
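The reported numbers are large because the loss uses reduction="sum", so each figure is the total over all training samples. The sketch below first checks on a random toy batch that the summed loss is the mean loss times the number of samples. It then divides the last figure above by the 144,474 samples to get an average per predicted character. Comparing against the loss of guessing uniformly among 50 characters is an added reference point, not something computed in the original code.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(6, 50)                 # toy batch: 6 samples, 50 classes
targets = torch.randint(0, 50, (6,))

loss_sum = nn.CrossEntropyLoss(reduction="sum")(logits, targets)
loss_mean = nn.CrossEntropyLoss(reduction="mean")(logits, targets)
# the summed loss is the mean loss times the number of samples
print(torch.allclose(loss_sum, loss_mean * len(targets)))

# applying the same idea to the last epoch reported above
n_patterns = 144474                         # number of training samples in this post
per_char = 234176.2812 / n_patterns         # average loss per predicted character
print("per-character loss: %.4f" % per_char)
print("uniform-guess baseline: %.4f" % math.log(50))
```

The per-character value comes out at roughly 1.62, against about 3.91 for uniform guessing, which gives a scale for judging how far training has progressed.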

For completeness, the script below ties together everything above:

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data

# load ascii text and convert to lowercase
filename = "wonderland.txt"
raw_text = open(filename, 'r', encoding='utf-8').read()
raw_text = raw_text.lower()

# create mapping of unique chars to integers
chars = sorted(list(set(raw_text)))
char_to_int = dict((c, i) for i, c in enumerate(chars))

# summarize the loaded data
n_chars = len(raw_text)
n_vocab = len(chars)
print("Total Characters: ", n_chars)
print("Total Vocab: ", n_vocab)

# prepare the dataset of input to output pairs encoded as integers
seq_length = 100
dataX = []
dataY = []
for i in range(0, n_chars - seq_length, 1):
    seq_in = raw_text[i:i + seq_length]
    seq_out = raw_text[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
n_patterns = len(dataX)
print("Total Patterns: ", n_patterns)

# reshape X to be [samples, time steps, features]
X = torch.tensor(dataX, dtype=torch.float32).reshape(n_patterns, seq_length, 1)
X = X / float(n_vocab)
y = torch.tensor(dataY)

class CharModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=256, num_layers=1, batch_first=True)
        self.dropout = nn.Dropout(0.2)
        self.linear = nn.Linear(256, n_vocab)
    def forward(self, x):
        x, _ = self.lstm(x)
        # take only the last output
        x = x[:, -1, :]
        # produce output
        x = self.linear(self.dropout(x))
        return x
    
n_epochs = 40
batch_size = 128
model = CharModel()

optimizer = optim.Adam(model.parameters())
loss_fn = nn.CrossEntropyLoss(reduction="sum")
loader = data.DataLoader(data.TensorDataset(X, y), shuffle=True, batch_size=batch_size)

best_model = None
best_loss = np.inf
for epoch in range(n_epochs):
    model.train()
    for X_batch, y_batch in loader:
        y_pred = model(X_batch)
        loss = loss_fn(y_pred, y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # Validation
    model.eval()
    loss = 0
    with torch.no_grad():
        for X_batch, y_batch in loader:
            y_pred = model(X_batch)
            loss += loss_fn(y_pred, y_batch)
        if loss < best_loss:
            best_loss = loss
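            # clone the tensors: state_dict() returns references that later training overwrites
            best_model = {k: v.clone() for k, v in model.state_dict().items()}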
        print("Epoch %d: Cross-entropy: %.4f" % (epoch, loss))

torch.save([best_model, char_to_int], "single-char.pth")

Generating Text with an LSTM Model

Given that the model is well trained, generating text using the trained LSTM network is relatively straightforward. First, you need to recreate the network and load the trained model weights from the saved checkpoint. Then you need to create a prompt for the model to start on. The prompt can be anything that the model can understand. It is a seed sequence given to the model to obtain one generated character. Then, the generated character is added to the end of this sequence, and the first character is trimmed off to keep the length consistent. This process is repeated for as long as you want to predict new characters (e.g., a sequence of 1,000 characters in length). You can pick a random input pattern as your seed sequence, then print generated characters as you generate them.

A simple way to generate a prompt is to pick a random sample from the original dataset. For example, with the raw_text obtained in the previous section, a prompt can be created as:

seq_length = 100
start = np.random.randint(0, len(raw_text)-seq_length)
prompt = raw_text[start:start+seq_length]

But remember that you need to transform it, since this prompt is a string while the model expects a vector of integers.
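That transformation can be sketched as a small function. `encode_prompt` is a hypothetical helper name, and the toy mapping below stands in for the char_to_int mapping saved with the model.

```python
import torch

def encode_prompt(prompt, char_to_int):
    """Turn a string into a float tensor of shape (1, len(prompt), 1)."""
    n_vocab = len(char_to_int)
    pattern = [char_to_int[c] for c in prompt.lower()]   # KeyError on unseen characters
    x = torch.tensor(pattern, dtype=torch.float32).reshape(1, len(pattern), 1)
    return x / float(n_vocab)                             # same scaling as in training

# toy mapping for illustration; in practice use the one saved with the model
toy_map = {c: i for i, c in enumerate(sorted(set("alice was here")))}
x = encode_prompt("Alice", toy_map)
print(x.shape)
```

A prompt you type yourself may contain characters that never occurred in the book. The lookup then fails with a KeyError, so such characters need to be removed or replaced first.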

The entire code is as follows:

import numpy as np
import torch
import torch.nn as nn

best_model, char_to_int = torch.load("single-char.pth")
n_vocab = len(char_to_int)
int_to_char = dict((i, c) for c, i in char_to_int.items())

# reload the model
class CharModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=256, num_layers=1, batch_first=True)
        self.dropout = nn.Dropout(0.2)
        self.linear = nn.Linear(256, n_vocab)
    def forward(self, x):
        x, _ = self.lstm(x)
        # take only the last output
        x = x[:, -1, :]
        # produce output
        x = self.linear(self.dropout(x))
        return x
model = CharModel()
model.load_state_dict(best_model)

# randomly generate a prompt
filename = "wonderland.txt"
seq_length = 100
raw_text = open(filename, 'r', encoding='utf-8').read()
raw_text = raw_text.lower()
start = np.random.randint(0, len(raw_text)-seq_length)
prompt = raw_text[start:start+seq_length]
pattern = [char_to_int[c] for c in prompt]

model.eval()
print('Prompt: "%s"' % prompt)
with torch.no_grad():
    for i in range(1000):
        # format input array of int into PyTorch tensor
        x = np.reshape(pattern, (1, len(pattern), 1)) / float(n_vocab)
        x = torch.tensor(x, dtype=torch.float32)
        # generate logits as output from the model
        prediction = model(x)
        # convert logits into one character
        index = int(prediction.argmax())
        result = int_to_char[index]
        print(result, end="")
        # append the new character into the prompt for the next iteration
        pattern.append(index)
        pattern = pattern[1:]
print()
print("Done.")

Running this example first outputs the prompt used, then each character as it is generated. For example, below are the results from one run of this text generator. The prompt was:

Prompt: "nother rush at the stick, and tumbled head
over heels in its hurry to get hold of it; then alice, th"

The generated text was:

e was qot a litule soteet of thet was sh the thiee harden an the courd, and was tuitk a little toaee th thite ththe and said to the suher, and the whrtght the pacbit sese tha woode of the soeee, and the white rabbit ses ani thr gort to the thite rabbit, and then she was aoiinnene th the three baaed of the sueen and saed “ota turpe ”hun mot,”

“i don’t know the ter ano _enend to mere,” said the maccht ar a sore of great roaee. “ie you don’t teink if thet soued to soeed to the boeie the mooer, io you bane thing it wo
tou het bn the crur,
“h whsh you cen not,” said the manch hare.

“wes, it aadi,” said the manch hare.

“weat you tail to merer ae in an a gens if gre” ”he were thing,” said the maccht ar a sore of geeaghen asd tothe to the thieg harden an the could.
“h dan tor toe taie thing,” said the manch hare.

“wes, it aadi,” said the manch hare.

“weat you tail to merer ae in an a gens if gre” ”he were thing,” said the maccht ar a sore of geeaghen asd tothe to the thieg harden an t

Let’s note some observations about the generated text.

It can emit line breaks. The original text limited the line width to 80 characters, and the generative model attempted to replicate this pattern.
The characters are separated into word-like groups, and some groups are actual English words (e.g., “the,” “said,” and “rabbit”), but many are not (e.g., “thite,” “soteet,” and “tha”).
Some of the words in sequence make sense (e.g., “i don’t know the”), but many do not (e.g., “he were thing”).
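One rough way to put a number on the second observation is to count how many word-like groups in the output also occur in the source text. This is only a sketch: `known_word_ratio` is a hypothetical helper, the two strings below are made up, and the simple letter-only tokenizer splits contractions such as “don’t” into two groups.

```python
import re

def known_word_ratio(generated, reference):
    """Share of word-like groups in `generated` that also occur in `reference`."""
    def tokenize(s):
        return re.findall(r"[a-z]+", s.lower())
    vocabulary = set(tokenize(reference))
    words = tokenize(generated)
    if not words:
        return 0.0
    return sum(w in vocabulary for w in words) / len(words)

# made-up strings for illustration; in practice pass the generated text and raw_text
reference = "said the march hare to the white rabbit"
generated = "said the manch hare and the thite rabbit"
ratio = known_word_ratio(generated, reference)
print(ratio)
```

Tracking such a ratio across models gives a crude but repeatable complement to reading the samples by eye.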

The fact that this character-based model of the book produces output like this is very impressive. It gives you a sense of the learning capabilities of LSTM networks. However, the results are not perfect. In the next section, you will look at improving the quality of results by developing a much larger LSTM network.

Using a Larger LSTM Network

Recall that LSTM is a recurrent neural network. It takes a sequence as input, and at each step of the sequence the input is mixed with its internal states to produce an output. Hence the output from an LSTM is also a sequence. In the model above, the output from the last time step is taken for further processing in the neural network, while those from earlier steps are discarded. However, it does not have to be this way. You can treat the sequence output from one LSTM layer as input to another LSTM layer, and in doing so build a larger network.

Similar to convolutional neural networks, a stacked LSTM network is supposed to have the earlier LSTM layers learn low-level features while the later LSTM layers learn high-level features. This may not always be useful, but you can try it to see whether the model produces a better result.
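To get a feel for how much larger a stacked network is, the sketch below compares a one-layer and a two-layer nn.LSTM of the same width on a random toy batch. `count_params` is a hypothetical helper, and the batch size of 4 is arbitrary.

```python
import torch
import torch.nn as nn

def count_params(module):
    """Total number of trainable and non-trainable parameters in a module."""
    return sum(p.numel() for p in module.parameters())

single = nn.LSTM(input_size=1, hidden_size=256, num_layers=1, batch_first=True)
stacked = nn.LSTM(input_size=1, hidden_size=256, num_layers=2, batch_first=True)

x = torch.rand(4, 100, 1)                  # toy batch: 4 samples, 100 time steps
out, (h_n, c_n) = stacked(x)

print(out.shape)                           # sequence output of the top layer only
print(h_n.shape)                           # one final hidden state per layer
print(count_params(single), count_params(stacked))
```

The second layer receives the 256-dimensional output of the first layer rather than the single input feature, so it holds about twice as many weights as the first layer and the stacked network is roughly three times the size.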

In PyTorch, making a stacked LSTM layer is easy. Let’s modify the above model into the following:

class CharModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=256, num_layers=2, batch_first=True, dropout=0.2)
        self.dropout = nn.Dropout(0.2)
        self.linear = nn.Linear(256, n_vocab)
    def forward(self, x):
        x, _ = self.lstm(x)
        # take only the last output
        x = x[:, -1, :]
        # produce output
        x = self.linear(self.dropout(x))
        return x

The only change is in the parameters to nn.LSTM(): you set num_layers=2 instead of 1 to add another LSTM layer. Between the two LSTM layers, you also add a dropout layer through the parameter dropout=0.2. Replacing the previous model with this one is the only change you need to make. Rerun the training and you should see the following:

...
Epoch 34: Cross-entropy: 203763.0312
Epoch 35: Cross-entropy: 204002.5938
Epoch 36: Cross-entropy: 210636.5625
Epoch 37: Cross-entropy: 199619.6875
Epoch 38: Cross-entropy: 199240.2969
Epoch 39: Cross-entropy: 196966.1250

You should see that the cross entropy here is lower than that in the previous section. This means this model is performing better. In fact, with this model, you can see the generated text looks more sensible:

Prompt: "ll
say that ‘i see what i eat’ is the same thing as ‘i eat what i see’!”

“you might just as well sa"
y it to sea,” she katter said to the jury. and the thoee hardeners vhine she was seady to alice the was a long tay of the sooe of the court, and she was seady to and taid to the coor and the court.
“well you see what you see, the mookee of the soog of the season of the shase of the court!”

“i don’t know the rame thing is it?” said the caterpillar.

“the cormous was it makes he it was it taie the reason of the shall bbout it, you know.”

“i don’t know the rame thing i can’t gelp the sea,” the hatter went on, “i don’t know the peally was in the shall sereat it would be a teally.
the mookee of the court ”

“i don’t know the rame thing is it?” said the caterpillar.

“the cormous was it makes he it was it taie the reason of the shall bbout it, you know.”

“i don’t know the rame thing i can’t gelp the sea,” the hatter went on, “i don’t know the peally was in the shall sereat it would be a teally.
the mookee of the court ”

“i don’t know the rame thing is it?” said the caterpillar.

“the
Done.

Not only are more words spelled correctly, but the text is also more English-like. Since the cross-entropy loss was still decreasing as you trained the model, you can assume the model has not converged yet. You can expect to make the model better by increasing the number of training epochs.

For completeness, below is the complete code for using this new model, including training and text generation.

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data

# load ascii text and convert to lowercase
filename = "wonderland.txt"
raw_text = open(filename, 'r', encoding='utf-8').read()
raw_text = raw_text.lower()

# create mapping of unique chars to integers
chars = sorted(list(set(raw_text)))
char_to_int = dict((c, i) for i, c in enumerate(chars))

# summarize the loaded data
n_chars = len(raw_text)
n_vocab = len(chars)
print("Total Characters: ", n_chars)
print("Total Vocab: ", n_vocab)

# prepare the dataset of input to output pairs encoded as integers
seq_length = 100
dataX = []
dataY = []
for i in range(0, n_chars - seq_length, 1):
    seq_in = raw_text[i:i + seq_length]
    seq_out = raw_text[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
n_patterns = len(dataX)
print("Total Patterns: ", n_patterns)

# reshape X to be [samples, time steps, features]
X = torch.tensor(dataX, dtype=torch.float32).reshape(n_patterns, seq_length, 1)
X = X / float(n_vocab)
y = torch.tensor(dataY)

class CharModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=256, num_layers=2, batch_first=True, dropout=0.2)
        self.dropout = nn.Dropout(0.2)
        self.linear = nn.Linear(256, n_vocab)
    def forward(self, x):
        x, _ = self.lstm(x)
        # take only the last output
        x = x[:, -1, :]
        # produce output
        x = self.linear(self.dropout(x))
        return x

n_epochs = 40
batch_size = 128
model = CharModel()

optimizer = optim.Adam(model.parameters())
loss_fn = nn.CrossEntropyLoss(reduction="sum")
loader = data.DataLoader(data.TensorDataset(X, y), shuffle=True, batch_size=batch_size)

best_model = None
best_loss = np.inf
for epoch in range(n_epochs):
    model.train()
    for X_batch, y_batch in loader:
        y_pred = model(X_batch)
        loss = loss_fn(y_pred, y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # Validation
    model.eval()
    loss = 0
    with torch.no_grad():
        for X_batch, y_batch in loader:
            y_pred = model(X_batch)
            loss += loss_fn(y_pred, y_batch)
        if loss < best_loss:
            best_loss = loss
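            # clone the tensors; state_dict() alone returns references to the live weights
            best_model = {k: v.clone() for k, v in model.state_dict().items()}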
        print("Epoch %d: Cross-entropy: %.4f" % (epoch, loss))

torch.save([best_model, char_to_int], "single-char.pth")

# Generation using the trained model
best_model, char_to_int = torch.load("single-char.pth")
n_vocab = len(char_to_int)
int_to_char = dict((i, c) for c, i in char_to_int.items())
model.load_state_dict(best_model)

# randomly generate a prompt
filename = "wonderland.txt"
seq_length = 100
raw_text = open(filename, 'r', encoding='utf-8').read()
raw_text = raw_text.lower()
start = np.random.randint(0, len(raw_text)-seq_length)
prompt = raw_text[start:start+seq_length]
pattern = [char_to_int[c] for c in prompt]

model.eval()
print('Prompt: "%s"' % prompt)
with torch.no_grad():
    for i in range(1000):
        # format input array of int into PyTorch tensor
        x = np.reshape(pattern, (1, len(pattern), 1)) / float(n_vocab)
        x = torch.tensor(x, dtype=torch.float32)
        # generate logits as output from the model
        prediction = model(x)
        # convert logits into one character
        index = int(prediction.argmax())
        result = int_to_char[index]
        print(result, end="")
        # append the new character into the prompt for the next iteration
        pattern.append(index)
        pattern = pattern[1:]
print()
print("Done.")
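
A note on the best-model bookkeeping in a training loop like the one above: state_dict() returns tensors that share memory with the live parameters, so keeping a usable snapshot requires copying them. The sketch below illustrates this with a stand-in nn.Linear model (not the CharModel from this post); the names live_view and snapshot are made up for the illustration.

```python
import torch
import torch.nn as nn

# Stand-in model, used only to demonstrate the behavior
model = nn.Linear(2, 1)

# state_dict() hands back tensors that share memory with the live parameters
live_view = model.state_dict()
# Cloning each tensor makes an independent snapshot
snapshot = {k: v.clone() for k, v in model.state_dict().items()}

# Simulate a later training step by changing the weights in place
with torch.no_grad():
    model.weight.add_(1.0)

# The uncloned view followed the update; the cloned snapshot did not
print(torch.equal(live_view["weight"], model.weight.detach()))
print(torch.equal(snapshot["weight"], model.weight.detach()))
```

Cloning each tensor, or calling copy.deepcopy on the whole dictionary, keeps the best weights from being overwritten by later epochs.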

Faster Training with GPU

Running the programs from this post can be painfully slow. Even if you have a GPU, you will not see an immediate improvement, because by design PyTorch does not use your GPU automatically. However, if you have a CUDA-capable GPU, you can improve performance considerably by explicitly moving the heavy computation from the CPU to the GPU.

A PyTorch model is a program of tensor calculations. Tensors can be stored on the GPU or on the CPU, and an operation can be carried out as long as all of its operands are on the same device. In this particular example, the model weights (i.e., those of the LSTM layers and the fully connected layer) can be moved to the GPU. Once they are there, the input must also be moved to the GPU before execution, and the output will stay on the GPU unless you move it back.
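
As a rough illustration of the same-device rule, the sketch below places two small tensors on whichever device is available and adds them. It is a toy example rather than part of the training program; the variable names are arbitrary, and it uses the torch.cuda.is_available() check described next so that it also runs on a machine without a GPU.

```python
import torch

# Pick the GPU when PyTorch can see one, otherwise stay on the CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Tensors are created on the CPU unless you ask otherwise
a = torch.ones(2, 3)
b = torch.ones(2, 3)

# .to(device) returns the tensor on the target device
a_dev = a.to(device)
b_dev = b.to(device)

# Both operands live on the same device, so the addition is allowed
c = a_dev + b_dev
print(c.device)

# Move the result back to the CPU, e.g. before converting it to NumPy
c_cpu = c.cpu()
print(c_cpu.device)
```

The result of an operation lives on the same device as its operands, which is why the output has to be moved back explicitly when the CPU needs it.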

In PyTorch, you can check if you have a CUDA-capable GPU using the following function:

torch.cuda.is_available()

It returns a Boolean to indicate whether you can use a GPU, which in turn depends on your hardware, whether your OS has the appropriate library installed, and whether your PyTorch build was compiled with the corresponding GPU support. If everything works in concert, you can create a device and assign your model to it:

device = torch.device("cuda:0")
model.to(device)

If your model is on a CUDA device but your input tensor is not, PyTorch will raise an error and refuse to proceed. To move your tensor to the CUDA device, run something like the following:

y_pred = model(X_batch.to(device))
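
To see where the tensors end up after such a call, here is a minimal sketch of a single forward pass and loss computation. It assumes a stand-in nn.Linear classifier and random data rather than the CharModel and dataset from this post, and it falls back to the CPU when no GPU is found.

```python
import torch
import torch.nn as nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# A stand-in model, not the CharModel from this post
model = nn.Linear(4, 3).to(device)
loss_fn = nn.CrossEntropyLoss()

# The batch starts on the CPU, as it would coming out of a DataLoader
X_batch = torch.randn(8, 4)
y_batch = torch.randint(0, 3, (8,))

# The input goes to the model's device; the prediction stays there
y_pred = model(X_batch.to(device))
# The target must follow, or the loss function sees tensors on two devices
loss = loss_fn(y_pred, y_batch.to(device))

print(y_pred.device, loss.device)
# .item() copies the scalar back to the host as a plain Python float
print(type(loss.item()))
```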

The .to(device) call on X_batch does the magic. But remember that y_pred produced above will also be on the CUDA device, so the target must be moved the same way when you run the loss function. Modifying the above program to run on a GPU gives the following:

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data

# load ascii text and convert to lowercase
filename = "wonderland.txt"
raw_text = open(filename, 'r', encoding='utf-8').read()
raw_text = raw_text.lower()

# create mapping of unique chars to integers
chars = sorted(list(set(raw_text)))
char_to_int = dict((c, i) for i, c in enumerate(chars))

# summarize the loaded data
n_chars = len(raw_text)
n_vocab = len(chars)
print("Total Characters: ", n_chars)
print("Total Vocab: ", n_vocab)

# prepare the dataset of input to output pairs encoded as integers
seq_length = 100
dataX = []
dataY = []
for i in range(0, n_chars - seq_length, 1):
    seq_in = raw_text[i:i + seq_length]
    seq_out = raw_text[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
n_patterns = len(dataX)
print("Total Patterns: ", n_patterns)

# reshape X to be [samples, time steps, features]
X = torch.tensor(dataX, dtype=torch.float32).reshape(n_patterns, seq_length, 1)
X = X / float(n_vocab)
y = torch.tensor(dataY)

class CharModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=256, num_layers=2, batch_first=True, dropout=0.2)
        self.dropout = nn.Dropout(0.2)
        self.linear = nn.Linear(256, n_vocab)
    def forward(self, x):
        x, _ = self.lstm(x)
        # take only the last output
        x = x[:, -1, :]
        # produce output
        x = self.linear(self.dropout(x))
        return x

n_epochs = 40
batch_size = 128
model = CharModel()
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model.to(device)

optimizer = optim.Adam(model.parameters())
loss_fn = nn.CrossEntropyLoss(reduction="sum")
loader = data.DataLoader(data.TensorDataset(X, y), shuffle=True, batch_size=batch_size)

best_model = None
best_loss = np.inf
for epoch in range(n_epochs):
    model.train()
    for X_batch, y_batch in loader:
        y_pred = model(X_batch.to(device))
        loss = loss_fn(y_pred, y_batch.to(device))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # Validation
    model.eval()
    loss = 0
    with torch.no_grad():
        for X_batch, y_batch in loader:
            y_pred = model(X_batch.to(device))
            loss += loss_fn(y_pred, y_batch.to(device))
        if loss < best_loss:
            best_loss = loss
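            # clone the tensors; state_dict() alone returns references to the live weights
            best_model = {k: v.clone() for k, v in model.state_dict().items()}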
        print("Epoch %d: Cross-entropy: %.4f" % (epoch, loss))

torch.save([best_model, char_to_int], "single-char.pth")

# Generation using the trained model
best_model, char_to_int = torch.load("single-char.pth")
n_vocab = len(char_to_int)
int_to_char = dict((i, c) for c, i in char_to_int.items())
model.load_state_dict(best_model)

# randomly generate a prompt
filename = "wonderland.txt"
seq_length = 100
raw_text = open(filename, 'r', encoding='utf-8').read()
raw_text = raw_text.lower()
start = np.random.randint(0, len(raw_text)-seq_length)
prompt = raw_text[start:start+seq_length]
pattern = [char_to_int[c] for c in prompt]

model.eval()
print('Prompt: "%s"' % prompt)
with torch.no_grad():
    for i in range(1000):
        # format input array of int into PyTorch tensor
        x = np.reshape(pattern, (1, len(pattern), 1)) / float(n_vocab)
        x = torch.tensor(x, dtype=torch.float32)
        # generate logits as output from the model
        prediction = model(x.to(device))
        # convert logits into one character
        index = int(prediction.argmax())
        result = int_to_char[index]
        print(result, end="")
        # append the new character into the prompt for the next iteration
        pattern.append(index)
        pattern = pattern[1:]
print()
print("Done.")

Compared to the code in the previous section, this is essentially the same, except that the CUDA device is detected with the line:

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

which will be your GPU, or fall back to the CPU if no CUDA device is found. Afterward, .to(device) is added at several strategic locations to move the computation to the GPU.
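
One practical consequence of training on a GPU is that the saved checkpoint holds GPU tensors. Below is a minimal sketch, assuming a stand-in nn.Linear model and a made-up file name demo-char.pth, of how torch.load with its map_location argument can reopen such a file on a CPU-only machine, so that text generation can run later without retraining.

```python
import os
import tempfile
import torch
import torch.nn as nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Stand-ins for the trained model and its character table
model = nn.Linear(4, 3).to(device)
char_to_int = {"a": 0, "b": 1, "c": 2}

# Save the weights and the table together, as the program above does
path = os.path.join(tempfile.mkdtemp(), "demo-char.pth")
torch.save([model.state_dict(), char_to_int], path)

# map_location remaps every stored tensor to the given device,
# so a checkpoint written on a GPU also opens on a CPU-only machine
state, table = torch.load(path, map_location=torch.device("cpu"))

# Rebuild the same architecture and load the weights; no training needed
restored = nn.Linear(4, 3)
restored.load_state_dict(state)
restored.eval()
print(state["weight"].device, table)
```

For the model in this post, the same pattern would apply with CharModel() in place of the stand-in layer and "single-char.pth" as the file name.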

Summary

In this post, you discovered how you can develop an LSTM recurrent neural network for text generation in PyTorch. After completing this post, you know:

How to find the text of classical books for free as a dataset for your machine learning model
How to train an LSTM network on text sequences
How to use an LSTM network to generate text sequences
How to optimize deep learning training in PyTorch using CUDA devices

5 Responses to Text Generation with LSTM in PyTorch

Zineb April 4, 2023 at 9:27 pm #

Thanks a lot for this wonderful article

torch.save([best_model, char_to_dict], “single-char.pth”) should be torch.save([best_model, char_to_int], “single-char.pth”)

Emmett Polhemus March 20, 2024 at 5:05 am #

I am having trouble were the algorithm does not print the final output and instead stops.

- James Carmichael March 20, 2024 at 8:49 am #
  
  Hi Emmett…Please provide any specific error messages you may be encountering. That will better enable us to guide you.
  
  - Emmett polhemus March 21, 2024 at 10:10 pm #
    
    I have gotten it working the error had to do with the signle-char.path file I was using a .txt. Also, do you need to train it every time you want a prompt?
    
shadow April 8, 2024 at 10:41 am #

Very nice
