The post A Gentle Introduction to Generative Adversarial Network Loss Functions appeared first on MachineLearningMastery.com.

]]>The GAN architecture is relatively straightforward, although one aspect that remains challenging for beginners is the topic of GAN loss functions. The main reason is that the architecture involves the simultaneous training of two models: the generator and the discriminator.

The discriminator model is updated like any other deep learning neural network, although the generator uses the discriminator as the loss function, meaning that the loss function for the generator is implicit and learned during training.

In this post, you will discover an introduction to loss functions for generative adversarial networks.

After reading this post, you will know:

- The GAN architecture is defined with the minimax GAN loss, although it is typically implemented using the non-saturating loss function.
- Common alternate loss functions used in modern GANs include the least squares and Wasserstein loss functions.
- Large-scale evaluation of GAN loss functions suggests little difference when other concerns, such as computational budget and model hyperparameters, are held constant.

**Kick-start your project** with my new book Generative Adversarial Networks with Python, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

This tutorial is divided into four parts; they are:

- Challenge of GAN Loss
- Standard GAN Loss Functions
- Alternate GAN Loss Functions
- Effect of Different GAN Loss Functions

The generative adversarial network, or GAN for short, is a deep learning architecture for training a generative model for image synthesis.

They have proven very effective, achieving impressive results in generating photorealistic faces, scenes, and more.

The GAN architecture is relatively straightforward, although one aspect that remains challenging for beginners is the topic of GAN loss functions.

The GAN architecture is comprised of two models: a discriminator and a generator. The discriminator is trained directly on real and generated images and is responsible for classifying images as real or fake (generated). The generator is not trained directly and instead is trained via the discriminator model.

Specifically, the discriminator is learned to provide the loss function for the generator.

The two models compete in a two-player game, where simultaneous improvements are made to both generator and discriminator models that compete.

We typically seek convergence of a model on a training dataset observed as the minimization of the chosen loss function on the training dataset. In a GAN, convergence signals the end of the two player game. Instead, equilibrium between generator and discriminator loss is sought.

We will take a closer look at the official GAN loss function used to train the generator and discriminator models and some alternate popular loss functions that may be used instead.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

The GAN architecture was described by Ian Goodfellow, et al. in their 2014 paper titled “Generative Adversarial Networks.”

The approach was introduced with two loss functions: the first that has become known as the Minimax GAN Loss and the second that has become known as the Non-Saturating GAN Loss.

Under both schemes, the discriminator loss is the same. The discriminator seeks to maximize the probability assigned to real and fake images.

We train D to maximize the probability of assigning the correct label to both training examples and samples from G.

— Generative Adversarial Networks, 2014.

Described mathematically, the discriminator seeks to maximize the average of the log probability for real images and the log of the inverted probabilities of fake images.

- maximize log D(x) + log(1 – D(G(z)))

If implemented directly, this would require changes be made to model weights using stochastic ascent rather than stochastic descent.

It is more commonly implemented as a traditional binary classification problem with labels 0 and 1 for generated and real images respectively.

The model is fit seeking to minimize the average binary cross entropy, also called log loss.

- minimize y_true * -log(y_predicted) + (1 – y_true) * -log(1 – y_predicted)

Minimax GAN loss refers to the minimax simultaneous optimization of the discriminator and generator models.

Minimax refers to an optimization strategy in two-player turn-based games for minimizing the loss or cost for the worst case of the other player.

For the GAN, the generator and discriminator are the two players and take turns involving updates to their model weights. The min and max refer to the minimization of the generator loss and the maximization of the discriminator’s loss.

- min max(D, G)

As stated above, the discriminator seeks to maximize the average of the log probability of real images and the log of the inverse probability for fake images.

- discriminator: maximize log D(x) + log(1 – D(G(z)))

The generator seeks to minimize the log of the inverse probability predicted by the discriminator for fake images. This has the effect of encouraging the generator to generate samples that have a low probability of being fake.

- generator: minimize log(1 – D(G(z)))

Here the generator learns to generate samples that have a low probability of being fake.

— Are GANs Created Equal? A Large-Scale Study, 2018.

This framing of the loss for the GAN was found to be useful in the analysis of the model as a minimax game, but in practice, it was found that, in practice, this loss function for the generator saturates.

This means that if it cannot learn as quickly as the discriminator, the discriminator wins, the game ends, and the model cannot be trained effectively.

In practice, [the loss function] may not provide sufficient gradient for G to learn well. Early in learning, when G is poor, D can reject samples with high confidence because they are clearly different from the training data.

— Generative Adversarial Networks, 2014.

The Non-Saturating GAN Loss is a modification to the generator loss to overcome the saturation problem.

It is a subtle change that involves the generator maximizing the log of the discriminator probabilities for generated images instead of minimizing the log of the inverted discriminator probabilities for generated images.

- generator: maximize log(D(G(z)))

This is a change in the framing of the problem.

In the previous case, the generator sought to minimize the probability of images being predicted as fake. Here, the generator seeks to maximize the probability of images being predicted as real.

To improve the gradient signal, the authors also propose the non-saturating loss, where the generator instead aims to maximize the probability of generated samples being real.

— Are GANs Created Equal? A Large-Scale Study, 2018.

The result is better gradient information when updating the weights of the generator and a more stable training process.

This objective function results in the same fixed point of the dynamics of G and D but provides much stronger gradients early in learning.

— Generative Adversarial Networks, 2014.

In practice, this is also implemented as a binary classification problem, like the discriminator. Instead of maximizing the loss, we can flip the labels for real and fake images and minimize the cross-entropy.

… one approach is to continue to use cross-entropy minimization for the generator. Instead of flipping the sign on the discriminator’s cost to obtain a cost for the generator, we flip the target used to construct the cross-entropy cost.

— NIPS 2016 Tutorial: Generative Adversarial Networks, 2016.

The choice of loss function is a hot research topic and many alternate loss functions have been proposed and evaluated.

Two popular alternate loss functions used in many GAN implementations are the least squares loss and the Wasserstein loss.

The least squares loss was proposed by Xudong Mao, et al. in their 2016 paper titled “Least Squares Generative Adversarial Networks.”

Their approach was based on the observation of the limitations for using binary cross entropy loss when generated images are very different from real images, which can lead to very small or vanishing gradients, and in turn, little or no update to the model.

… this loss function, however, will lead to the problem of vanishing gradients when updating the generator using the fake samples that are on the correct side of the decision boundary, but are still far from the real data.

— Least Squares Generative Adversarial Networks, 2016.

The discriminator seeks to minimize the sum squared difference between predicted and expected values for real and fake images.

- discriminator: minimize (D(x) – 1)^2 + (D(G(z)))^2

The generator seeks to minimize the sum squared difference between predicted and expected values as though the generated images were real.

- generator: minimize (D(G(z)) – 1)^2

In practice, this involves maintaining the class labels of 0 and 1 for fake and real images respectively, minimizing the least squares, also called mean squared error or L2 loss.

- l2 loss = sum (y_predicted – y_true)^2

The benefit of the least squares loss is that it gives more penalty to larger errors, in turn resulting in a large correction rather than a vanishing gradient and no model update.

… the least squares loss function is able to move the fake samples toward the decision boundary, because the least squares loss function penalizes samples that lie in a long way on the correct side of the decision boundary.

— Least Squares Generative Adversarial Networks, 2016.

The Wasserstein loss was proposed by Martin Arjovsky, et al. in their 2017 paper titled “Wasserstein GAN.”

The Wasserstein loss is informed by the observation that the traditional GAN is motivated to minimize the distance between the actual and predicted probability distributions for real and generated images, the so-called Kullback-Leibler divergence, or the Jensen-Shannon divergence.

Instead, they propose modeling the problem on the Earth-Mover’s distance, also referred to as the Wasserstein-1 distance. The Earth-Mover’s distance calculates the distance between two probability distributions in terms of the cost of turning one distribution (pile of earth) into another.

The GAN using Wasserstein loss involves changing the notion of the discriminator into a critic that is updated more often (e.g. five times more often) than the generator model. The critic scores images with a real value instead of predicting a probability. It also requires that model weights be kept small, e.g. clipped to a hypercube of [-0.01, 0.01].

The score is calculated such that the distance between scores for real and fake images are maximally separate.

The loss function can be implemented by calculating the average predicted score across real and fake images and multiplying the average score by 1 and -1 respectively. This has the desired effect of driving the scores for real and fake images apart.

The benefit of Wasserstein loss is that it provides a useful gradient almost everywhere, allowing for the continued training of the models. It also means that a lower Wasserstein loss correlates with better generator image quality, meaning that we are explicitly seeking a minimization of generator loss.

To our knowledge, this is the first time in GAN literature that such a property is shown, where the loss of the GAN shows properties of convergence.

— Wasserstein GAN, 2017.

Many loss functions have been developed and evaluated in an effort to improve the stability of training GAN models.

The most common is the non-saturating loss, generally, and the Least Squares and Wasserstein loss in larger and more recent GAN models.

As such, there is much interest in whether one loss function is truly better than another for a given model implementation.

This question motivated a large study of GAN loss functions by Mario Lucic, et al. in their 2018 paper titled “Are GANs Created Equal? A Large-Scale Study.”

Despite a very rich research activity leading to numerous interesting GAN algorithms, it is still very hard to assess which algorithm(s) perform better than others. We conduct a neutral, multi-faceted large-scale empirical study on state-of-the-art models and evaluation measures.

— Are GANs Created Equal? A Large-Scale Study, 2018.

They fix the computational budget and hyperparameter configuration for models and look at a suite of seven loss functions.

This includes the Minimax loss (MM GAN), Non-Saturating loss (NS GAN), Wasserstein loss (WGAN), and Least-Squares loss (LS GAN) described above. The study also includes an extension of Wasserstein loss to remove the weight clipping called Wasserstein Gradient Penalty loss (WGAN GP) and two others, DRAGAN and BEGAN.

The table below, taken from the paper, provides a useful summary of the different loss functions for both the discriminator and generator.

The models were evaluated systematically using a range of GAN evaluation metrics, including the popular Frechet Inception Distance, or FID.

Surprisingly, they discover that all evaluated loss functions performed approximately the same when all other elements were held constant.

We provide a fair and comprehensive comparison of the state-of-the-art GANs, and empirically demonstrate that nearly all of them can reach similar values of FID, given a high enough computational budget.

— Are GANs Created Equal? A Large-Scale Study, 2018.

This does not mean that the choice of loss does not matter for specific problems and model configurations.

Instead, the result suggests that the difference in the choice of loss function disappears when the other concerns of the model are held constant, such as computational budget and model configuration.

This section provides more resources on the topic if you are looking to go deeper.

- Generative Adversarial Networks, 2014.
- NIPS 2016 Tutorial: Generative Adversarial Networks, 2016.
- Least Squares Generative Adversarial Networks, 2016.
- Wasserstein GAN, 2017.
- Improved Training of Wasserstein GANs, 2017.
- Are GANs Created Equal? A Large-Scale Study, 2018.

In this post, you discovered an introduction to loss functions for generative adversarial networks.

Specifically, you learned:

- The GAN architecture is defined with the minimax GAN loss, although it is typically implemented using the non-saturating loss function.
- Common alternate loss functions used in modern GANs include the least squares and Wasserstein loss functions.
- Large-scale evaluation of GAN loss functions suggests little difference when other concerns, such as computational budget and model hyperparameters, are held constant.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to Generative Adversarial Network Loss Functions appeared first on MachineLearningMastery.com.

]]>The post How to Implement the Frechet Inception Distance (FID) for Evaluating GANs appeared first on MachineLearningMastery.com.

]]>The score summarizes how similar the two groups are in terms of statistics on computer vision features of the raw images calculated using the inception v3 model used for image classification. Lower scores indicate the two groups of images are more similar, or have more similar statistics, with a perfect score being 0.0 indicating that the two groups of images are identical.

The FID score is used to evaluate the quality of images generated by generative adversarial networks, and lower scores have been shown to correlate well with higher quality images.

In this tutorial, you will discover how to implement the Frechet Inception Distance for evaluating generated images.

After completing this tutorial, you will know:

- The Frechet Inception Distance summarizes the distance between the Inception feature vectors for real and generated images in the same domain.
- How to calculate the FID score and implement the calculation from scratch in NumPy.
- How to implement the FID score using the Keras deep learning library and calculate it with real images.

**Kick-start your project** with my new book Generative Adversarial Networks with Python, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

**Updated Oct 2019**: Fixed minor typo in the description of the method.

This tutorial is divided into five parts; they are:

- What Is the Frechet Inception Distance?
- How to Calculate the Frechet Inception Distance
- How to Implement the Frechet Inception Distance With NumPy
- How to Implement the Frechet Inception Distance With Keras
- How to Calculate the Frechet Inception Distance for Real Images

The Frechet Inception Distance, or FID for short, is a metric for evaluating the quality of generated images and specifically developed to evaluate the performance of generative adversarial networks.

The FID score was proposed and used by Martin Heusel, et al. in their 2017 paper titled “GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium.”

The score was proposed as an improvement over the existing Inception Score, or IS.

For the evaluation of the performance of GANs at image generation, we introduce the “Frechet Inception Distance” (FID) which captures the similarity of generated images to real ones better than the Inception Score.

— GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium, 2017.

The inception score estimates the quality of a collection of synthetic images based on how well the top-performing image classification model Inception v3 classifies them as one of 1,000 known objects. The scores combine both the confidence of the conditional class predictions for each synthetic image (quality) and the integral of the marginal probability of the predicted classes (diversity).

The inception score does not capture how synthetic images compare to real images. The goal in developing the FID score was to evaluate synthetic images based on the statistics of a collection of synthetic images compared to the statistics of a collection of real images from the target domain.

Drawback of the Inception Score is that the statistics of real world samples are not used and compared to the statistics of synthetic samples.

— GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium, 2017.

Like the inception score, the FID score uses the inception v3 model. Specifically, the coding layer of the model (the last pooling layer prior to the output classification of images) is used to capture computer-vision-specific features of an input image. These activations are calculated for a collection of real and generated images.

The activations are summarized as a multivariate Gaussian by calculating the mean and covariance of the images. These statistics are then calculated for the activations across the collection of real and generated images.

The distance between these two distributions is then calculated using the Frechet distance, also called the Wasserstein-2 distance.

The difference of two Gaussians (synthetic and real-world images) is measured by the Frechet distance also known as Wasserstein-2 distance.

— GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium, 2017.

The use of activations from the Inception v3 model to summarize each image gives the score its name of “*Frechet Inception Distance*.”

A lower FID indicates better-quality images; conversely, a higher score indicates a lower-quality image and the relationship may be linear.

The authors of the score show that lower FID scores correlate with better-quality images when systematic distortions were applied such as the addition of random noise and blur.

The FID score is calculated by first loading a pre-trained Inception v3 model.

The output layer of the model is removed and the output is taken as the activations from the last pooling layer, a global spatial pooling layer.

This output layer has 2,048 activations, therefore, each image is predicted as 2,048 activation features. This is called the coding vector or feature vector for the image.

A 2,048 feature vector is then predicted for a collection of real images from the problem domain to provide a reference for how real images are represented. Feature vectors can then be calculated for synthetic images.

The result will be two collections of 2,048 feature vectors for real and generated images.

The FID score is then calculated using the following equation taken from the paper:

- d^2 = ||mu_1 – mu_2||^2 + Tr(C_1 + C_2 – 2*sqrt(C_1*C_2))

The score is referred to as *d^2*, showing that it is a distance and has squared units.

The “*mu_1*” and “*mu_2*” refer to the feature-wise mean of the real and generated images, e.g. 2,048 element vectors where each element is the mean feature observed across the images.

The *C_1* and *C_2* are the covariance matrix for the real and generated feature vectors, often referred to as sigma.

The *||mu_1 – mu_2||^2* refers to the sum squared difference between the two mean vectors. *Tr* refers to the trace linear algebra operation, e.g. the sum of the elements along the main diagonal of the square matrix.

The sqrt is the square root of the square matrix, given as the product between the two covariance matrices.

The square root of a matrix is often also written as *M^(1/2)*, e.g. the matrix to the power of one half, which has the same effect. This operation can fail depending on the values in the matrix because the operation is solved using numerical methods. Commonly, some elements in the resulting matrix may be imaginary, which often can be detected and removed.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Implementing the calculation of the FID score in Python with NumPy arrays is straightforward.

First, let’s define a function that will take a collection of activations for real and generated images and return the FID score.

The *calculate_fid()* function listed below implements the procedure.

Here, we implement the FID calculation almost directly. It is worth noting that the official implementation in TensorFlow implements elements of the calculation in a slightly different order, likely for efficiency, and introduces additional checks around the matrix square root to handle possible numerical instabilities.

I recommend reviewing the official implementation and extending the implementation below to add these checks if you experience problems calculating the FID on your own datasets.

# calculate frechet inception distance def calculate_fid(act1, act2): # calculate mean and covariance statistics mu1, sigma1 = act1.mean(axis=0), cov(act1, rowvar=False) mu2, sigma2 = act2.mean(axis=0), cov(act2, rowvar=False) # calculate sum squared difference between means ssdiff = numpy.sum((mu1 - mu2)**2.0) # calculate sqrt of product between cov covmean = sqrtm(sigma1.dot(sigma2)) # check and correct imaginary numbers from sqrt if iscomplexobj(covmean): covmean = covmean.real # calculate score fid = ssdiff + trace(sigma1 + sigma2 - 2.0 * covmean) return fid

We can then test out this function to calculate the inception score for some contrived feature vectors.

Feature vectors will probably contain small positive values and will have a length of 2,048 elements. We can construct two lots of 10 images worth of feature vectors with small random numbers as follows:

... # define two collections of activations act1 = random(10*2048) act1 = act1.reshape((10,2048)) act2 = random(10*2048) act2 = act2.reshape((10,2048))

One test would be to calculate the FID between a set of activations and itself, which we would expect to have a score of 0.0.

We can then calculate the distance between the two sets of random activations, which we would expect to be a large number.

... # fid between act1 and act1 fid = calculate_fid(act1, act1) print('FID (same): %.3f' % fid) # fid between act1 and act2 fid = calculate_fid(act1, act2) print('FID (different): %.3f' % fid)

Tying this all together, the complete example is listed below.

# example of calculating the frechet inception distance import numpy from numpy import cov from numpy import trace from numpy import iscomplexobj from numpy.random import random from scipy.linalg import sqrtm # calculate frechet inception distance def calculate_fid(act1, act2): # calculate mean and covariance statistics mu1, sigma1 = act1.mean(axis=0), cov(act1, rowvar=False) mu2, sigma2 = act2.mean(axis=0), cov(act2, rowvar=False) # calculate sum squared difference between means ssdiff = numpy.sum((mu1 - mu2)**2.0) # calculate sqrt of product between cov covmean = sqrtm(sigma1.dot(sigma2)) # check and correct imaginary numbers from sqrt if iscomplexobj(covmean): covmean = covmean.real # calculate score fid = ssdiff + trace(sigma1 + sigma2 - 2.0 * covmean) return fid # define two collections of activations act1 = random(10*2048) act1 = act1.reshape((10,2048)) act2 = random(10*2048) act2 = act2.reshape((10,2048)) # fid between act1 and act1 fid = calculate_fid(act1, act1) print('FID (same): %.3f' % fid) # fid between act1 and act2 fid = calculate_fid(act1, act2) print('FID (different): %.3f' % fid)

Running the example first reports the FID between the act1 activations and itself, which is 0.0 as we expect (**Note**: the sign of the score can be ignored).

The distance between the two collections of random activations is also as we expect: a large number, which in this case was 358.

FID (same): -0.000 FID (different): 358.927

You may want to experiment with the calculation of the FID score and test other pathological cases.

Now that we know how to calculate the FID score and to implement it in NumPy, we can develop an implementation in Keras.

This involves the preparation of the image data and using a pretrained Inception v3 model to calculate the activations or feature vectors for each image.

First, we can load the Inception v3 model in Keras directly.

... # load inception v3 model model = InceptionV3()

This will prepare a version of the inception model for classifying images as one of 1,000 known classes. We can remove the output (the top) of the model via the *include_top=False* argument. Painfully, this also removes the global average pooling layer that we require, but we can add it back via specifying the *pooling=’avg’* argument.

When the output layer of the model is removed, we must specify the shape of the input images, which is 299x299x3 pixels, e.g. the *input_shape=(299,299,3)* argument.

Therefore, the inception model can be loaded as follows:

... # prepare the inception v3 model model = InceptionV3(include_top=False, pooling='avg', input_shape=(299,299,3))

This model can then be used to predict the feature vector for one or more images.

Our images are likely to not have the required shape. We will use the scikit-image library to resize the NumPy array of pixel values to the required size. The *scale_images()* function below implements this.

# scale an array of images to a new size def scale_images(images, new_shape): images_list = list() for image in images: # resize with nearest neighbor interpolation new_image = resize(image, new_shape, 0) # store images_list.append(new_image) return asarray(images_list)

Note, you may need to install the scikit-image library. This can be achieved as follows:

sudo pip install scikit-image

Once resized, the image pixel values will also need to be scaled to meet the expectations for inputs to the inception model. This can be achieved by calling the *preprocess_input()* function.

We can update our *calculate_fid()* function defined in the previous section to take the loaded inception model and two NumPy arrays of image data as arguments, instead of activations. The function will then calculate the activations before calculating the FID score as before.

The updated version of the *calculate_fid()* function is listed below.

# calculate frechet inception distance def calculate_fid(model, images1, images2): # calculate activations act1 = model.predict(images1) act2 = model.predict(images2) # calculate mean and covariance statistics mu1, sigma1 = act1.mean(axis=0), cov(act1, rowvar=False) mu2, sigma2 = act2.mean(axis=0), cov(act2, rowvar=False) # calculate sum squared difference between means ssdiff = numpy.sum((mu1 - mu2)**2.0) # calculate sqrt of product between cov covmean = sqrtm(sigma1.dot(sigma2)) # check and correct imaginary numbers from sqrt if iscomplexobj(covmean): covmean = covmean.real # calculate score fid = ssdiff + trace(sigma1 + sigma2 - 2.0 * covmean) return fid

We can then test this function with some contrived collections of images, in this case, 10 32×32 images with random pixel values in the range [0,255].

... # define two fake collections of images images1 = randint(0, 255, 10*32*32*3) images1 = images1.reshape((10,32,32,3)) images2 = randint(0, 255, 10*32*32*3) images2 = images2.reshape((10,32,32,3))

We can then convert the integer pixel values to floating point values and scale them to the required size of 299×299 pixels.

... # convert integer to floating point values images1 = images1.astype('float32') images2 = images2.astype('float32') # resize images images1 = scale_images(images1, (299,299,3)) images2 = scale_images(images2, (299,299,3))

Then the pixel values can be scaled to meet the expectations of the Inception v3 model.

... # pre-process images images1 = preprocess_input(images1) images2 = preprocess_input(images2)

Then calculate the FID scores, first between a collection of images and itself, then between the two collections of images.

... # fid between images1 and images1 fid = calculate_fid(model, images1, images1) print('FID (same): %.3f' % fid) # fid between images1 and images2 fid = calculate_fid(model, images1, images2) print('FID (different): %.3f' % fid)

Tying all of this together, the complete example is listed below.

# example of calculating the frechet inception distance in Keras import numpy from numpy import cov from numpy import trace from numpy import iscomplexobj from numpy import asarray from numpy.random import randint from scipy.linalg import sqrtm from keras.applications.inception_v3 import InceptionV3 from keras.applications.inception_v3 import preprocess_input from keras.datasets.mnist import load_data from skimage.transform import resize # scale an array of images to a new size def scale_images(images, new_shape): images_list = list() for image in images: # resize with nearest neighbor interpolation new_image = resize(image, new_shape, 0) # store images_list.append(new_image) return asarray(images_list) # calculate frechet inception distance def calculate_fid(model, images1, images2): # calculate activations act1 = model.predict(images1) act2 = model.predict(images2) # calculate mean and covariance statistics mu1, sigma1 = act1.mean(axis=0), cov(act1, rowvar=False) mu2, sigma2 = act2.mean(axis=0), cov(act2, rowvar=False) # calculate sum squared difference between means ssdiff = numpy.sum((mu1 - mu2)**2.0) # calculate sqrt of product between cov covmean = sqrtm(sigma1.dot(sigma2)) # check and correct imaginary numbers from sqrt if iscomplexobj(covmean): covmean = covmean.real # calculate score fid = ssdiff + trace(sigma1 + sigma2 - 2.0 * covmean) return fid # prepare the inception v3 model model = InceptionV3(include_top=False, pooling='avg', input_shape=(299,299,3)) # define two fake collections of images images1 = randint(0, 255, 10*32*32*3) images1 = images1.reshape((10,32,32,3)) images2 = randint(0, 255, 10*32*32*3) images2 = images2.reshape((10,32,32,3)) print('Prepared', images1.shape, images2.shape) # convert integer to floating point values images1 = images1.astype('float32') images2 = images2.astype('float32') # resize images images1 = scale_images(images1, (299,299,3)) images2 = scale_images(images2, (299,299,3)) print('Scaled', images1.shape, images2.shape) # pre-process images images1 = preprocess_input(images1) images2 = preprocess_input(images2) # fid between images1 and images1 fid = calculate_fid(model, images1, images1) print('FID (same): %.3f' % fid) # fid between images1 and images2 fid = calculate_fid(model, images1, images2) print('FID (different): %.3f' % fid)

Running the example first summarizes the shapes of the fabricated images and their rescaled versions, matching our expectations.

**Note**: the first time the InceptionV3 model is used, Keras will download the model weights and save them into the *~/.keras/models/* directory on your workstation. The weights are about 100 megabytes and may take a moment to download depending on the speed of your internet connection.

The FID score between a given set of images and itself is 0.0, as we expect, and the distance between the two collections of random images is about 35.

Prepared (10, 32, 32, 3) (10, 32, 32, 3) Scaled (10, 299, 299, 3) (10, 299, 299, 3) FID (same): -0.000 FID (different): 35.495

It may be useful to calculate the FID score between two collections of real images.

The Keras library provides a number of computer vision datasets, including the CIFAR-10 dataset. These are color photos with the small size of 32×32 pixels and is split into train and test elements and can be loaded as follows:

... # load cifar10 images (images1, _), (images2, _) = cifar10.load_data()

The training dataset has 50,000 images, whereas the test dataset has only 10,000 images. It may be interesting to calculate the FID score between these two datasets to get an idea of how representative the test dataset is of the training dataset.

Scaling and scoring 50K images takes a long time, therefore, we can reduce the “*training set*” to a 10K random sample as follows:

... shuffle(images1) images1 = images1[:10000]

Tying this all together, we can calculate the FID score between a sample of the train and the test dataset as follows.

# example of calculating the frechet inception distance in Keras for cifar10 import numpy from numpy import cov from numpy import trace from numpy import iscomplexobj from numpy import asarray from numpy.random import shuffle from scipy.linalg import sqrtm from keras.applications.inception_v3 import InceptionV3 from keras.applications.inception_v3 import preprocess_input from keras.datasets.mnist import load_data from skimage.transform import resize from keras.datasets import cifar10 # scale an array of images to a new size def scale_images(images, new_shape): images_list = list() for image in images: # resize with nearest neighbor interpolation new_image = resize(image, new_shape, 0) # store images_list.append(new_image) return asarray(images_list) # calculate frechet inception distance def calculate_fid(model, images1, images2): # calculate activations act1 = model.predict(images1) act2 = model.predict(images2) # calculate mean and covariance statistics mu1, sigma1 = act1.mean(axis=0), cov(act1, rowvar=False) mu2, sigma2 = act2.mean(axis=0), cov(act2, rowvar=False) # calculate sum squared difference between means ssdiff = numpy.sum((mu1 - mu2)**2.0) # calculate sqrt of product between cov covmean = sqrtm(sigma1.dot(sigma2)) # check and correct imaginary numbers from sqrt if iscomplexobj(covmean): covmean = covmean.real # calculate score fid = ssdiff + trace(sigma1 + sigma2 - 2.0 * covmean) return fid # prepare the inception v3 model model = InceptionV3(include_top=False, pooling='avg', input_shape=(299,299,3)) # load cifar10 images (images1, _), (images2, _) = cifar10.load_data() shuffle(images1) images1 = images1[:10000] print('Loaded', images1.shape, images2.shape) # convert integer to floating point values images1 = images1.astype('float32') images2 = images2.astype('float32') # resize images images1 = scale_images(images1, (299,299,3)) images2 = scale_images(images2, (299,299,3)) print('Scaled', images1.shape, images2.shape) # pre-process images images1 = preprocess_input(images1) images2 = preprocess_input(images2) # calculate fid fid = calculate_fid(model, images1, images2) print('FID: %.3f' % fid)

Running the example may take some time depending on the speed of your workstation.

At the end of the run, we can see that the FID score between the train and test datasets is about five.

Loaded (10000, 32, 32, 3) (10000, 32, 32, 3) Scaled (10000, 299, 299, 3) (10000, 299, 299, 3) FID: 5.492

This section provides more resources on the topic if you are looking to go deeper.

- GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium, 2017.
- Are GANs Created Equal? A Large-Scale Study, 2017.
- Pros and Cons of GAN Evaluation Measures, 2018.

- Official Implementation in TensorFlow, GitHub.
- Frechet Inception Distance (FID score) in PyTorch, GitHub.

- numpy.trace API.
- numpy.cov API.
- numpy.iscomplexobj API.
- Keras Inception v3 Model
- scikit-image Library

- Frechet Inception Distance, 2017.
- Frechet Inception Distance, 2018.
- Frechet distance, Wikipedia.
- Covariance matrix, Wikipedia.
- Square root of a matrix, Wikipedia.

In this tutorial, you discovered how to implement the Frechet Inception Distance for evaluating generated images.

Specifically, you learned:

- The Frechet Inception Distance summarizes the distance between the Inception feature vectors for real and generated images in the same domain.
- How to calculate the FID score and implement the calculation from scratch in NumPy.
- How to implement the FID score using the Keras deep learning library and calculate it with real images.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How to Implement the Frechet Inception Distance (FID) for Evaluating GANs appeared first on MachineLearningMastery.com.

]]>The post How to Implement the Inception Score (IS) for Evaluating GANs appeared first on MachineLearningMastery.com.

]]>A problem with generative models is that there is no objective way to evaluate the quality of the generated images.

As such, it is common to periodically generate and save images during the model training process and use subjective human evaluation of the generated images in order to both evaluate the quality of the generated images and to select a final generator model.

Many attempts have been made to establish an objective measure of generated image quality. An early and somewhat widely adopted example of an objective evaluation method for generated images is the Inception Score, or IS.

In this tutorial, you will discover the inception score for evaluating the quality of generated images.

After completing this tutorial, you will know:

- How to calculate the inception score and the intuition behind what it measures.
- How to implement the inception score in Python with NumPy and the Keras deep learning library.
- How to calculate the inception score for small images such as those in the CIFAR-10 dataset.

**Kick-start your project** with my new book Generative Adversarial Networks with Python, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

**Update Oct/2019**: Updated small bug in inception score for equal distribution example.

This tutorial is divided into five parts; they are:

- What Is the Inception Score?
- How to Calculate the Inception Score
- How to Implement the Inception Score With NumPy
- How to Implement the Inception Score With Keras
- Problems With the Inception Score

The Inception Score, or IS for short, is an objective metric for evaluating the quality of generated images, specifically synthetic images output by generative adversarial network models.

The inception score was proposed by Tim Salimans, et al. in their 2016 paper titled “Improved Techniques for Training GANs.”

In the paper, the authors use a crowd-sourcing platform (Amazon Mechanical Turk) to evaluate a large number of GAN generated images. They developed the inception score as an attempt to remove the subjective human evaluation of images.

The authors discover that their scores correlated well with the subjective evaluation.

As an alternative to human annotators, we propose an automatic method to evaluate samples, which we find to correlate well with human evaluation …

— Improved Techniques for Training GANs, 2016.

The inception score involves using a pre-trained deep learning neural network model for image classification to classify the generated images. Specifically, the Inception v3 model described by Christian Szegedy, et al. in their 2015 paper titled “Rethinking the Inception Architecture for Computer Vision.” The reliance on the inception model gives the inception score its name.

A large number of generated images are classified using the model. Specifically, the probability of the image belonging to each class is predicted. These predictions are then summarized into the inception score.

The score seeks to capture two properties of a collection of generated images:

**Image Quality**. Do images look like a specific object?**Image Diversity**. Is a wide range of objects generated?

The inception score has a lowest value of 1.0 and a highest value of the number of classes supported by the classification model; in this case, the Inception v3 model supports the 1,000 classes of the ILSVRC 2012 dataset, and as such, the highest inception score on this dataset is 1,000.

The CIFAR-10 dataset is a collection of 50,000 images divided into 10 classes of objects. The original paper that introduces the inception calculated the score on the real CIFAR-10 training dataset, achieving a result of 11.24 +/- 0.12.

Using the GAN model also introduced in their paper, they achieved an inception score of 8.09 +/- .07 when generating synthetic images for this dataset.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

The inception score is calculated by first using a pre-trained Inception v3 model to predict the class probabilities for each generated image.

These are conditional probabilities, e.g. class label conditional on the generated image. Images that are classified strongly as one class over all other classes indicate a high quality. As such, the conditional probability of all generated images in the collection should have a low entropy.

Images that contain meaningful objects should have a conditional label distribution p(y|x) with low entropy.

— Improved Techniques for Training GANs, 2016.

The entropy is calculated as the negative sum of each observed probability multiplied by the log of the probability. The intuition here is that large probabilities have less information than small probabilities.

- entropy = -sum(p_i * log(p_i))

The conditional probability captures our interest in image quality.

To capture our interest in a variety of images, we use the marginal probability. This is the probability distribution of all generated images. We, therefore, would prefer the integral of the marginal probability distribution to have a high entropy.

Moreover, we expect the model to generate varied images, so the marginal integral p(y|x = G(z))dz should have high entropy.

— Improved Techniques for Training GANs, 2016.

These elements are combined by calculating the Kullback-Leibler divergence, or KL divergence (relative entropy), between the conditional and marginal probability distributions.

Calculating the divergence between two distributions is written using the “||” operator, therefore we can say we are interested in the KL divergence between C for conditional and M for marginal distributions or:

- KL (C || M)

Specifically, we are interested in the average of the KL divergence for all generated images.

Combining these two requirements, the metric that we propose is: exp(Ex KL(p(y|x)||p(y))).

— Improved Techniques for Training GANs, 2016.

We don’t need to translate the calculation of the inception score. Thankfully, the authors of the paper also provide source code on GitHub that includes an implementation of the inception score.

The calculation of the score assumes a large number of images for a range of objects, such as 50,000.

The images are split into 10 groups, e.g 5,000 images per group, and the inception score is calculated on each group of images, then the average and standard deviation of the score is reported.

The calculation of the inception score on a group of images involves first using the inception v3 model to calculate the conditional probability for each image (p(y|x)). The marginal probability is then calculated as the average of the conditional probabilities for the images in the group (p(y)).

The KL divergence is then calculated for each image as the conditional probability multiplied by the log of the conditional probability minus the log of the marginal probability.

- KL divergence = p(y|x) * (log(p(y|x)) – log(p(y)))

The KL divergence is then summed over all images and averaged over all classes and the exponent of the result is calculated to give the final score.

This defines the official inception score implementation used when reported in most papers that use the score, although variations on how to calculate the score do exist.

Implementing the calculation of the inception score in Python with NumPy arrays is straightforward.

First, let’s define a function that will take a collection of conditional probabilities and calculate the inception score.

The *calculate_inception_score()* function listed below implements the procedure.

One small change is the introduction of an epsilon (a tiny number close to zero) when calculating the log probabilities to avoid blowing up when trying to calculate the log of a zero probability. This is probably not needed in practice (e.g. with real generated images) but is useful here and good practice when working with log probabilities.

# calculate the inception score for p(y|x) def calculate_inception_score(p_yx, eps=1E-16): # calculate p(y) p_y = expand_dims(p_yx.mean(axis=0), 0) # kl divergence for each image kl_d = p_yx * (log(p_yx + eps) - log(p_y + eps)) # sum over classes sum_kl_d = kl_d.sum(axis=1) # average over images avg_kl_d = mean(sum_kl_d) # undo the logs is_score = exp(avg_kl_d) return is_score

We can then test out this function to calculate the inception score for some contrived conditional probabilities.

We can imagine the case of three classes of image and a perfect confident prediction for each class for three images.

# conditional probabilities for high quality images p_yx = asarray([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])

We would expect the inception score for this case to be 3.0 (or very close to it). This is because we have the same number of images for each image class (one image for each of the three classes) and each conditional probability is maximally confident.

The complete example for calculating the inception score for these probabilities is listed below.

# calculate inception score in numpy from numpy import asarray from numpy import expand_dims from numpy import log from numpy import mean from numpy import exp # calculate the inception score for p(y|x) def calculate_inception_score(p_yx, eps=1E-16): # calculate p(y) p_y = expand_dims(p_yx.mean(axis=0), 0) # kl divergence for each image kl_d = p_yx * (log(p_yx + eps) - log(p_y + eps)) # sum over classes sum_kl_d = kl_d.sum(axis=1) # average over images avg_kl_d = mean(sum_kl_d) # undo the logs is_score = exp(avg_kl_d) return is_score # conditional probabilities for high quality images p_yx = asarray([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]) score = calculate_inception_score(p_yx) print(score)

Running the example gives the expected score of 3.0 (or a number extremely close).

2.999999999999999

We can also try the worst case.

This is where we still have the same number of images for each class (one for each of the three classes), but the objects are unknown, giving a uniform predicted probability distribution across each class.

# conditional probabilities for low quality images p_yx = asarray([[0.33, 0.33, 0.33], [0.33, 0.33, 0.33], [0.33, 0.33, 0.33]]) score = calculate_inception_score(p_yx) print(score)

In this case, we would expect the inception score to be the worst possible where there is no difference between the conditional and marginal distributions, e.g. an inception score of 1.0.

Tying this together, the complete example is listed below.

# calculate inception score in numpy from numpy import asarray from numpy import expand_dims from numpy import log from numpy import mean from numpy import exp # calculate the inception score for p(y|x) def calculate_inception_score(p_yx, eps=1E-16): # calculate p(y) p_y = expand_dims(p_yx.mean(axis=0), 0) # kl divergence for each image kl_d = p_yx * (log(p_yx + eps) - log(p_y + eps)) # sum over classes sum_kl_d = kl_d.sum(axis=1) # average over images avg_kl_d = mean(sum_kl_d) # undo the logs is_score = exp(avg_kl_d) return is_score # conditional probabilities for low quality images p_yx = asarray([[0.33, 0.33, 0.33], [0.33, 0.33, 0.33], [0.33, 0.33, 0.33]]) score = calculate_inception_score(p_yx) print(score)

Running the example reports the expected inception score of 1.0.

1.0

You may want to experiment with the calculation of the inception score and test other pathological cases.

Now that we know how to calculate the inception score and to implement it in Python, we can develop an implementation in Keras.

This involves using the real Inception v3 model to classify images and to average the calculation of the score across multiple splits of a collection of images.

First, we can load the Inception v3 model in Keras directly.

... # load inception v3 model model = InceptionV3()

The model expects images to be color and to have the shape 299×299 pixels.

Additionally, the pixel values must be scaled in the same way as the training data images, before they can be classified.

This can be achieved by converting the pixel values from integers to floating point values and then calling the *preprocess_input()* function for the images.

... # convert from uint8 to float32 processed = images.astype('float32') # pre-process raw images for inception v3 model processed = preprocess_input(processed)

Then the conditional probabilities for each of the 1,000 image classes can be predicted for the images.

... # predict class probabilities for images yhat = model.predict(images)

The inception score can then be calculated directly on the NumPy array of probabilities as we did in the previous section.

Before we do that, we must split the conditional probabilities into groups, controlled by a *n_split* argument and set to the default of 10 as was used in the original paper.

... n_part = floor(images.shape[0] / n_split)

We can then enumerate over the conditional probabilities in blocks of *n_part* images or predictions and calculate the inception score.

... # retrieve p(y|x) ix_start, ix_end = i * n_part, (i+1) * n_part p_yx = yhat[ix_start:ix_end]

After calculating the scores for each split of conditional probabilities, we can calculate and return the average and standard deviation inception scores.

... # average across images is_avg, is_std = mean(scores), std(scores)

Tying all of this together, the *calculate_inception_score()* function below takes an array of images with the expected size and pixel values in [0,255] and calculates the average and standard deviation inception scores using the inception v3 model in Keras.

# assumes images have the shape 299x299x3, pixels in [0,255] def calculate_inception_score(images, n_split=10, eps=1E-16): # load inception v3 model model = InceptionV3() # convert from uint8 to float32 processed = images.astype('float32') # pre-process raw images for inception v3 model processed = preprocess_input(processed) # predict class probabilities for images yhat = model.predict(processed) # enumerate splits of images/predictions scores = list() n_part = floor(images.shape[0] / n_split) for i in range(n_split): # retrieve p(y|x) ix_start, ix_end = i * n_part, i * n_part + n_part p_yx = yhat[ix_start:ix_end] # calculate p(y) p_y = expand_dims(p_yx.mean(axis=0), 0) # calculate KL divergence using log probabilities kl_d = p_yx * (log(p_yx + eps) - log(p_y + eps)) # sum over classes sum_kl_d = kl_d.sum(axis=1) # average over images avg_kl_d = mean(sum_kl_d) # undo the log is_score = exp(avg_kl_d) # store scores.append(is_score) # average across images is_avg, is_std = mean(scores), std(scores) return is_avg, is_std

We can test this function with 50 artificial images with the value 1.0 for all pixels.

... # pretend to load images images = ones((50, 299, 299, 3)) print('loaded', images.shape)

This will calculate the score for each group of five images and the low quality would suggest that an average inception score of 1.0 will be reported.

The complete example is listed below.

# calculate inception score with Keras from math import floor from numpy import ones from numpy import expand_dims from numpy import log from numpy import mean from numpy import std from numpy import exp from keras.applications.inception_v3 import InceptionV3 from keras.applications.inception_v3 import preprocess_input # assumes images have the shape 299x299x3, pixels in [0,255] def calculate_inception_score(images, n_split=10, eps=1E-16): # load inception v3 model model = InceptionV3() # convert from uint8 to float32 processed = images.astype('float32') # pre-process raw images for inception v3 model processed = preprocess_input(processed) # predict class probabilities for images yhat = model.predict(processed) # enumerate splits of images/predictions scores = list() n_part = floor(images.shape[0] / n_split) for i in range(n_split): # retrieve p(y|x) ix_start, ix_end = i * n_part, i * n_part + n_part p_yx = yhat[ix_start:ix_end] # calculate p(y) p_y = expand_dims(p_yx.mean(axis=0), 0) # calculate KL divergence using log probabilities kl_d = p_yx * (log(p_yx + eps) - log(p_y + eps)) # sum over classes sum_kl_d = kl_d.sum(axis=1) # average over images avg_kl_d = mean(sum_kl_d) # undo the log is_score = exp(avg_kl_d) # store scores.append(is_score) # average across images is_avg, is_std = mean(scores), std(scores) return is_avg, is_std # pretend to load images images = ones((50, 299, 299, 3)) print('loaded', images.shape) # calculate inception score is_avg, is_std = calculate_inception_score(images) print('score', is_avg, is_std)

Running the example first defines the 50 fake images, then calculates the inception score on each batch and reports the expected inception score of 1.0, with a standard deviation of 0.0.

**Note**: the first time the InceptionV3 model is used, Keras will download the model weights and save them into the *~/.keras/models/* directory on your workstation. The weights are about 100 megabytes and may take a moment to download depending on the speed of your internet connection.

loaded (50, 299, 299, 3) score 1.0 0.0

We can test the calculation of the inception score on some real images.

The Keras API provides access to the CIFAR-10 dataset.

These are color photos with the small size of 32×32 pixels. First, we can split the images into groups, then upsample the images to the expected size of 299×299, preprocess the pixel values, predict the class probabilities, then calculate the inception score.

This will be a useful example if you intend to calculate the inception score on your own generated images, as you may have to either scale the images to the expected size for the inception v3 model or change the model to perform the upsampling for you.

First, the images can be loaded and shuffled to ensure each split covers a diverse set of classes.

... # load cifar10 images (images, _), (_, _) = cifar10.load_data() # shuffle images shuffle(images)

Next, we need a way to scale the images.

We will use the scikit-image library to resize the NumPy array of pixel values to the required size. The *scale_images()* function below implements this.

# scale an array of images to a new size def scale_images(images, new_shape): images_list = list() for image in images: # resize with nearest neighbor interpolation new_image = resize(image, new_shape, 0) # store images_list.append(new_image) return asarray(images_list)

Note, you may have to install the scikit-image library if it is not already installed. This can be achieved as follows:

sudo pip install scikit-image

We can then enumerate the number of splits, select a subset of the images, scale them, pre-process them, and use the model to predict the conditional class probabilities.

... # retrieve images ix_start, ix_end = i * n_part, (i+1) * n_part subset = images[ix_start:ix_end] # convert from uint8 to float32 subset = subset.astype('float32') # scale images to the required size subset = scale_images(subset, (299,299,3)) # pre-process images, scale to [-1,1] subset = preprocess_input(subset) # predict p(y|x) p_yx = model.predict(subset)

The rest of the calculation of the inception score is the same.

Tying this all together, the complete example for calculating the inception score on the real CIFAR-10 training dataset is listed below.

Based on the similar calculation reported in the original inception score paper, we would expect the reported score on this dataset to be approximately 11. Interestingly, the best inception score for CIFAR-10 with generated images is about 8.8 at the time of writing using a progressive growing GAN.

# calculate inception score for cifar-10 in Keras from math import floor from numpy import ones from numpy import expand_dims from numpy import log from numpy import mean from numpy import std from numpy import exp from numpy.random import shuffle from keras.applications.inception_v3 import InceptionV3 from keras.applications.inception_v3 import preprocess_input from keras.datasets import cifar10 from skimage.transform import resize from numpy import asarray # scale an array of images to a new size def scale_images(images, new_shape): images_list = list() for image in images: # resize with nearest neighbor interpolation new_image = resize(image, new_shape, 0) # store images_list.append(new_image) return asarray(images_list) # assumes images have any shape and pixels in [0,255] def calculate_inception_score(images, n_split=10, eps=1E-16): # load inception v3 model model = InceptionV3() # enumerate splits of images/predictions scores = list() n_part = floor(images.shape[0] / n_split) for i in range(n_split): # retrieve images ix_start, ix_end = i * n_part, (i+1) * n_part subset = images[ix_start:ix_end] # convert from uint8 to float32 subset = subset.astype('float32') # scale images to the required size subset = scale_images(subset, (299,299,3)) # pre-process images, scale to [-1,1] subset = preprocess_input(subset) # predict p(y|x) p_yx = model.predict(subset) # calculate p(y) p_y = expand_dims(p_yx.mean(axis=0), 0) # calculate KL divergence using log probabilities kl_d = p_yx * (log(p_yx + eps) - log(p_y + eps)) # sum over classes sum_kl_d = kl_d.sum(axis=1) # average over images avg_kl_d = mean(sum_kl_d) # undo the log is_score = exp(avg_kl_d) # store scores.append(is_score) # average across images is_avg, is_std = mean(scores), std(scores) return is_avg, is_std # load cifar10 images (images, _), (_, _) = cifar10.load_data() # shuffle images shuffle(images) print('loaded', images.shape) # calculate inception score is_avg, is_std = calculate_inception_score(images) print('score', is_avg, is_std)

Running the example loads the dataset, prepares the model, and calculates the inception score on the CIFAR-10 training dataset.

We can see that the score is 11.3, which is close to the expected score of 11.24.

**Note**: the first time that the CIFAR-10 dataset is used, Keras will download the images in a compressed format and store them in the *~/.keras/datasets/* directory. The download is about 161 megabytes and may take a few minutes based on the speed of your internet connection.

loaded (50000, 32, 32, 3) score 11.317895 0.14821531

The inception score is effective, but it is not perfect.

Generally, the inception score is appropriate for generated images of objects known to the model used to calculate the conditional class probabilities.

In this case, because the inception v3 model is used, this means that it is most suitable for 1,000 object types used in the ILSVRC 2012 dataset. This is a lot of classes, but not all objects that may interest us.

You can see a full list of the classes here:

It also requires that the images are square and have the relatively small size of about 300×300 pixels, including any scaling required to get your generated images to that size.

A good score also requires having a good distribution of generated images across the possible objects supported by the model, and close to an even number of examples for each class. This can be hard to control for many GAN models that don’t offer controls over the types of objects generated.

Shane Barratt and Rishi Sharma take a closer look at the inception score and list a number of technical issues and edge cases in their 2018 paper titled “A Note on the Inception Score.” This is a good reference if you wish to dive deeper.

This section provides more resources on the topic if you are looking to go deeper.

- Improved Techniques for Training GANs, 2016.
- A Note on the Inception Score, 2018.
- Rethinking the Inception Architecture for Computer Vision, 2015.

- Code for the paper “Improved Techniques for Training GANs”
- Large Scale Visual Recognition Challenge 2012 (ILSVRC2012)

- Image Generation on CIFAR-10
- Inception Score calculation, 2017.
- A simple explanation of the Inception Score
- Inception Score — evaluating the realism of your GAN, 2018.
- Kullback–Leibler divergence, Wikipedia.
- Entropy (information theory), Wikipedia.

In this tutorial, you discovered the inception score for evaluating the quality of generated images.

Specifically, you learned:

- How to calculate the inception score and the intuition behind what it measures.
- How to implement the inception score in Python with NumPy and the Keras deep learning library.
- How to calculate the inception score for small images such as those in the CIFAR-10 dataset.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How to Implement the Inception Score (IS) for Evaluating GANs appeared first on MachineLearningMastery.com.

]]>The post How to Evaluate Generative Adversarial Networks appeared first on MachineLearningMastery.com.

]]>Unlike other deep learning neural network models that are trained with a loss function until convergence, a GAN generator model is trained using a second model called a discriminator that learns to classify images as real or generated. Both the generator and discriminator model are trained together to maintain an equilibrium.

As such, there is no objective loss function used to train the GAN generator models and no way to objectively assess the progress of the training and the relative or absolute quality of the model from loss alone.

Instead, a suite of qualitative and quantitative techniques have been developed to assess the performance of a GAN model based on the quality and diversity of the generated synthetic images.

In this post, you will discover techniques for evaluating generative adversarial network models based on generated synthetic images.

After reading this post, you will know:

- There is no objective function used when training GAN generator models, meaning models must be evaluated using the quality of the generated synthetic images.
- Manual inspection of generated images is a good starting point when getting started.
- Quantitative measures, such as the inception score and the Frechet inception distance, can be combined with qualitative assessment to provide a robust assessment of GAN models.

**Kick-start your project** with my new book Generative Adversarial Networks with Python, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

This tutorial is divided into five parts; they are:

- The Problem of Evaluating GAN Generator Models
- Manual GAN Generator Evaluation
- Qualitative GAN Generator Evaluation
- Quantitative GAN Generator Evaluation
- Which GAN Evaluation Scheme to Use

Generative adversarial networks are a type of deep-learning-based generative model.

GANs have proved to be remarkably effective at generating both high-quality and large synthetic images in a range of problem domains.

Instead of being trained directly, the generator models are trained by a second model, called the discriminator, that learns to differentiate real images from fake or generated images. As such, there is no objective function or objective measure for the generator model.

Generative adversarial networks lack an objective function, which makes it difficult to compare performance of different models.

— Improved Techniques for Training GANs, 2016.

This means that there is no generally agreed upon way of evaluating a given GAN generator model.

This is a problem for the research and use of GANs; for example, when:

- Choosing a final GAN generator model during a training run.
- Choosing generated images to demonstrate the capability of a GAN generator model.
- Comparing GAN model architectures.
- Comparing GAN model configurations.

The objective evaluation of GAN generator models remains an open problem.

While several measures have been introduced, as of yet, there is no consensus as to which measure best captures strengths and limitations of models and should be used for fair model comparison.

— Pros and Cons of GAN Evaluation Measures, 2018.

As such, GAN generator models are evaluated based on the quality of the images generated, often in the context of the target problem domain.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Many GAN practitioners fall back to the evaluation of GAN generators via the manual assessment of images synthesized by a generator model.

This involves using the generator model to create a batch of synthetic images, then evaluating the quality and diversity of the images in relation to the target domain.

This may be performed by the researcher or practitioner themselves.

Visual examination of samples by humans is one of the common and most intuitive ways to evaluate GANs.

— Pros and Cons of GAN Evaluation Measures, 2018.

The generator model is trained iteratively over many training epochs. As there is no objective measure of model performance, we cannot know when the training process should stop and when a final model should be saved for later use.

Therefore, it is common to use the current state of the model during training to generate a large number of synthetic images and to save the current state of the generator used to generate the images. This allows for the post-hoc evaluation of each saved generator model via its generated images.

One training epoch refers to one cycle through the images in the training dataset used to update the model. Models may be saved systematically across training epochs, such as every one, five, ten, or more training epochs.

Although manual inspection is the simplest method of model evaluation, it has many limitations, including:

- It is subjective, including biases of the reviewer about the model, its configuration, and the project objective.
- It requires knowledge of what is realistic and what is not for the target domain.
- It is limited to the number of images that can be reviewed in a reasonable time.

… evaluating the quality of generated images with human vision is expensive and cumbersome, biased […] difficult to reproduce, and does not fully reflect the capacity of models.

— Pros and Cons of GAN Evaluation Measures, 2018.

The subjective nature almost certainty leads to biased model selection and cherry picking and should not be used for final model selection on non-trivial projects.

Nevertheless, it is a starting point for practitioners when getting familiar with the technique.

Thankfully, more sophisticated GAN generator evaluation methods have been proposed and adopted.

For a thorough survey, see the 2018 paper titled “Pros and Cons of GAN Evaluation Measures.” This paper divides GAN generator model evaluation into qualitative and quantitative measures, and we will review some of them in the following sections using this division.

Qualitative measures are those measures that are not numerical and often involve human subjective evaluation or evaluation via comparison.

Five qualitative techniques for evaluating GAN generator models are listed below.

- Nearest Neighbors.
- Rapid Scene Categorization.
- Rating and Preference Judgment.
- Evaluating Mode Drop and Mode Collapse.
- Investigating and Visualizing the Internals of Networks.

Perhaps the most used qualitative GAN generator model is an extension of the manual inspection of images referred to as “*Rating and Preference Judgment*.”

These types of experiments ask subjects to rate models in terms of the fidelity of their generated images.

— Pros and Cons of GAN Evaluation Measures, 2018.

This is where human judges are asked to rank or compare examples of real and generated images from the domain.

The “*Rapid Scene Categorization*” method is generally the same, although images are presented to human judges for a very limited amount of time, such as a fraction of a second, and classified as real or fake.

Images are often presented in pairs and the human judge is asked which image they prefer, e.g. which image is more realistic. A score or rating is determined based on the number of times a specific model generated images on such tournaments. Variance in the judging is reduced by averaging the ratings across multiple different human judges.

This is a labor-intensive exercise, although costs can be lowered by using a crowdsourcing platform like Amazon’s Mechanical Turk, and efficiency can be increased by using a web interface.

One intuitive metric of performance can be obtained by having human annotators judge the visual quality of samples. We automate this process using Amazon Mechanical Turk […] using the web interface […] which we use to ask annotators to distinguish between generated data and real data.

— Improved Techniques for Training GANs, 2016.

A major downside of the approach is that the performance of human judges is not fixed and can improve over time. This is especially the case if they are given feedback, such as clues on how to detect generated images.

By learning from such feedback, annotators are better able to point out the flaws in generated images, giving a more pessimistic quality assessment.

— Improved Techniques for Training GANs, 2016.

Another popular approach for subjectively summarizing generator performance is “*Nearest Neighbors*.” This involves selecting examples of real images from the domain and locating one or more most similar generated images for comparison.

Distance measures, such as Euclidean distance between the image pixel data, is often used for selecting the most similar generated images.

The nearest neighbor approach is useful to give context for evaluating how realistic the generated images happen to be.

Quantitative GAN generator evaluation refers to the calculation of specific numerical scores used to summarize the quality of generated images.

Twenty-four quantitative techniques for evaluating GAN generator models are listed below.

- Average Log-likelihood
- Coverage Metric
- Inception Score (IS)
- Modified Inception Score (m-IS)
- Mode Score
- AM Score
- Frechet Inception Distance (FID)
- Maximum Mean Discrepancy (MMD)
- The Wasserstein Critic
- Birthday Paradox Test
- Classifier Two-sample Tests (C2ST)
- Classification Performance
- Boundary Distortion
- Number of Statistically-Different Bins (NDB)
- Image Retrieval Performance
- Generative Adversarial Metric (GAM)
- Tournament Win Rate and Skill Rating
- Normalized Relative Discriminative Score (NRDS)
- Adversarial Accuracy and Adversarial Divergence
- Geometry Score
- Reconstruction Error
- Image Quality Measures (SSIM, PSNR and Sharpness Difference)
- Low-level Image Statistics
- Precision, Recall and F1 Score

The original 2014 GAN paper by Goodfellow, et al. titled “Generative Adversarial Networks” used the “*Average Log-likelihood*” method, also referred to as kernel estimation or Parzen density estimation, to summarize the quality of the generated images.

This involves the challenging approach of estimating how well the generator captures the probability distribution of images in the domain and has generally been found not to be effective for evaluating GANs.

Parzen windows estimation of likelihood favors trivial models and is irrelevant to visual fidelity of samples. Further, it fails to approximate the true likelihood in high dimensional spaces or to rank models

— Pros and Cons of GAN Evaluation Measures, 2018.

Two widely adopted metrics for evaluating generated images are the Inception Score and the Frechet Inception Distance.

The inception score was proposed by Tim Salimans, et al. in their 2016 paper titled “Improved Techniques for Training GANs.”

Inception Score (IS) […] is perhaps the most widely adopted score for GAN evaluation.

— Pros and Cons of GAN Evaluation Measures, 2018.

Calculating the inception score involves using a pre-trained deep learning neural network model for image classification to classify the generated images. Specifically, the Inception v3 model described by Christian Szegedy, et al. in their 2015 paper titled “Rethinking the Inception Architecture for Computer Vision.” The reliance on the inception model gives the inception score its name.

A large number of generated images are classified using the model. Specifically, the probability of the image belonging to each class is predicted. The probabilities are then summarized in the score to both capture how much each image looks like a known class and how diverse the set of images are across the known classes.

A higher inception score indicates better-quality generated images.

The Frechet Inception Distance, or FID, score was proposed and used by Martin Heusel, et al. in their 2017 paper titled “GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium.” The score was proposed as an improvement over the existing Inception Score.

FID performs well in terms of discriminability, robustness and computational efficiency. […] It has been shown that FID is consistent with human judgments and is more robust to noise than IS.

— Pros and Cons of GAN Evaluation Measures, 2018.

Like the inception score, the FID score uses the inception v3 model. Specifically, the coding layer of the model (the last pooling layer prior to the output classification of images) is used to capture computer vision specific features of an input image. These activations are calculated for a collection of real and generated images.

The activations for each real and generated image are summarized as a multivariate Gaussian and the distance between these two distributions is then calculated using the Frechet distance, also called the Wasserstein-2 distance.

A lower FID score indicates more realistic images that match the statistical properties of real images.

When getting started, it is a good idea to start with the manual inspection of generated images in order to evaluate and select generator models.

- Manual Image Inspection

Developing GAN models is complex enough for beginners. Manual inspection can get you a long way while refining your model implementation and testing model configurations.

Once your confidence in developing GAN models improves, both the Inception Score and the Frechet Inception Distance can be used to quantitatively summarize the quality of generated images. There is no single best and agreed upon measure, although, these two measures come close.

As of yet, there is no consensus regarding the best score. Different scores assess various aspects of the image generation process, and it is unlikely that a single score can cover all aspects. Nevertheless, some measures seem more plausible than others (e.g. FID score).

— Pros and Cons of GAN Evaluation Measures, 2018.

These measures capture the quality and diversity of generated images, both alone (former) and compared to real images (latter) and are widely used.

- Inception Score
- Frechet Inception Distance

Both measures are easy to implement and calculate on batches of generated images. As such, the practice of systematically generating images and saving models during training can and should continue to be used to allow post-hoc model selection.

The nearest neighbor method can be used to qualitatively summarize generated images. Human-based ratings and preference judgments can also be used if needed via a crowdsourcing platform.

- Nearest Neighbors
- Rating and Preference Judgment

This section provides more resources on the topic if you are looking to go deeper.

- Generative Adversarial Networks, 2014.
- Pros and Cons of GAN Evaluation Measures, 2018.
- Improved Techniques for Training GANs, 2016.
- GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium, 2017.
- Are GANs Created Equal? A Large-Scale Study, 2017.

In this post, you discovered techniques for evaluating generative adversarial network models based on generated synthetic images.

Specifically, you learned:

- There is no objective function used when training GAN generator models, meaning models must be evaluated using the quality of the generated synthetic images.
- Manual inspection of generated images is a good starting point when getting started.
- Quantitative measures, such as the inception score and the Frechet inception distance, can be combined with qualitative assessment to provide a robust assessment of GAN models.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How to Evaluate Generative Adversarial Networks appeared first on MachineLearningMastery.com.

]]>The post A Gentle Introduction to BigGAN the Big Generative Adversarial Network appeared first on MachineLearningMastery.com.

]]>Nevertheless, they are typically restricted to generating small images and the training process remains fragile, dependent upon specific augmentations and hyperparameters in order to achieve good results.

The BigGAN is an approach to pull together a suite of recent best practices in training class-conditional images and scaling up the batch size and number of model parameters. The result is the routine generation of both high-resolution (large) and high-quality (high-fidelity) images.

In this post, you will discover the BigGAN model for scaling up class-conditional image synthesis.

After reading this post, you will know:

- Image size and training brittleness remain large problems for GANs.
- Scaling up model size and batch size can result in dramatically larger and higher-quality images.
- Specific model architectural and training configurations required to scale up GANs.

**Kick-start your project** with my new book Generative Adversarial Networks with Python, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

This tutorial is divided into four parts; they are:

- Brittleness of GAN Training
- Develop Better GANs by Scaling Up
- How to Scale-Up GANs With BigGAN
- Example of Images Generated by BigGAN

Generative Adversarial Networks, or GANs for short, are capable of generating high-quality synthetic images.

Nevertheless, the size of generated images remains relatively small, e.g. 64×64 or 128×128 pixels.

Additionally, the model training process remains brittle regardless of the large number of studies that have investigated and proposed improvements.

Without auxiliary stabilization techniques, this training procedure is notoriously brittle, requiring finely-tuned hyperparameters and architectural choices to work at all.

— Large Scale GAN Training for High Fidelity Natural Image Synthesis, 2018.

Most of the improvements to the training process have focused on changes to the objective function or constraining the discriminator model during the training process.

Much recent research has accordingly focused on modifications to the vanilla GAN procedure to impart stability, drawing on a growing body of empirical and theoretical insights. One line of work is focused on changing the objective function […] to encourage convergence. Another line is focused on constraining D through gradient penalties […] or normalization […] both to counteract the use of unbounded loss functions and ensure D provides gradients everywhere to G.

— Large Scale GAN Training for High Fidelity Natural Image Synthesis, 2018.

More recently, work has focused on the effective application of the GAN for generating both high-quality and larger images.

One approach is to try scaling up GAN models that already work well.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

The BigGAN is an implementation of the GAN architecture designed to leverage the best from what has been reported to work more generally.

It was described by Andrew Brock, et al. in their 2018 paper titled “Large Scale GAN Training for High Fidelity Natural Image Synthesis” and presented at the ICLR 2019 conference.

Specifically, the BigGAN is designed for class-conditional image generation. That is, the generation of images using both a point from latent space and image class information as input. Example datasets used to train class-conditional GANs include the CIFAR or ImageNet image classification datasets that have tens, hundreds, or thousands of image classes.

As its name suggests, the BigGAN is focused on scaling up the GAN models.

This includes GAN models with:

- More model parameters (e.g. more feature maps).
- Larger Batch Sizes
- Architectural changes

We demonstrate that GANs benefit dramatically from scaling, and train models with two to four times as many parameters and eight times the batch size compared to prior art. We introduce two simple, general architectural changes that improve scalability, and modify a regularization scheme to improve conditioning, demonstrably boosting performance.

— Large Scale GAN Training for High Fidelity Natural Image Synthesis, 2018.

The BigGAN architecture also introduces a “*truncation trick*” used during image generation that results in an improvement in image quality, and a corresponding regularizing technique to better support this trick.

The result is an approach capable of generating larger and higher-quality images, such as 256×256 and 512×512 images.

When trained on ImageNet at 128×128 resolution, our models (BigGANs) improve the state-of-the-art […] We also successfully train BigGANs on ImageNet at 256×256 and 512×512 resolution …

— Large Scale GAN Training for High Fidelity Natural Image Synthesis, 2018.

The contribution of the BigGAN model is the design decisions for both the models and the training process.

These design decisions are important for both re-implementing the BigGAN, but also in providing insight on configuration options that may prove beneficial with GANs more generally.

The focus of the BigGAN model is to increase the number of model parameters and batch size, then configure the model and training process to achieve the best results.

In this section, we will review the specific design decisions in the BigGAN.

The base for the model is the Self-Attention GAN, or SAGAN for short, described by Han Zhang, et al. in the 2018 paper tilted “Self-Attention Generative Adversarial Networks.” This involves introducing an attention map that is applied to feature maps, allowing the generator and discriminator models to focus on different parts of the image.

This involves adding an attention module to the deep convolutional model architecture.

Additionally, the model is trained via hinge loss, commonly used for training support vector machines.

In SAGAN, the proposed attention module has been applied to both generator and discriminator, which are trained in an alternating fashion by minimizing the hinge version of the adversarial loss

— Self-Attention Generative Adversarial Networks, 2018.

The BigGAN uses the model architecture with attention modules from SAGAN and is trained via hinge loss.

Appendix B of the paper titled “*Architectural Details*” provides a summary of the modules and their configurations used in the generator and discriminator models. There are two versions of the model described BigGAN and BigGAN-deep, the latter involving deeper resnet modules and, in turn, achieving better results.

The class information is provided to the generator model via class-conditional batch normalization.

This was described by Vincent Dumoulin, et al. in their 2016 paper titled “A Learned Representation For Artistic Style.” In the paper, the technique is referred to as “*conditional instance normalization*” that involves normalizing activations based on the statistics from images of a given style, or in the case of BigGAN, images of a given class.

We call this approach conditional instance normalization. The goal of the procedure is [to] transform a layer’s activations x into a normalized activation z specific to painting style s.

— A Learned Representation For Artistic Style, 2016.

Class information is provided to the discriminator via projection.

This is described by Takeru Miyato, et al. in their 2018 paper titled “Spectral Normalization for Generative Adversarial Networks.” This involves using an integer embedding of the class value that is concatenated into an intermediate layer of the network.

Discriminator for conditional GANs. For computational ease, we embedded the integer label y in {0, . . . , 1000} into 128 dimension before concatenating the vector to the output of the intermediate layer.

— Spectral Normalization for Generative Adversarial Networks, 2018.

Instead of using one class embedding per class label, a shared embedding was used in order to reduce the number of weights.

Instead of having a separate layer for each embedding, we opt to use a shared embedding, which is linearly projected to each layer’s gains and biases. This reduces computation and memory costs, and improves training speed (in number of iterations required to reach a given performance) by 37%.

— Large Scale GAN Training for High Fidelity Natural Image Synthesis, 2018.

The weights of the generator are normalized using spectral normalization.

Spectral normalization for use in GANs was described by Takeru Miyato, et al. in their 2018 paper titled “Spectral Normalization for Generative Adversarial Networks.” Specifically, it involves normalizing the spectral norm of the weight matrix.

Our spectral normalization normalizes the spectral norm of the weight matrix W so that it satisfies the Lipschitz constraint sigma(W) = 1:

— Spectral Normalization for Generative Adversarial Networks, 2018.

The efficient implementation requires a change to the weight updates during mini-batch stochastic gradient descent, described in Appendix A of the spectral normalization paper.

In the GAN training algorithm, it is common to first update the discriminator model and then to update the generator model.

The BigGAN slightly modifies this and updates the discriminator model twice before updating the generator model in each training iteration.

The generator model is evaluated based on the images that are generated.

Before images are generated for evaluation, the model weights are averaged across prior training iterations using a moving average.

This approach to model weight moving average for generator evaluation was described and used by Tero Karras, et al. in their 2017 paper titled “Progressive Growing of GANs for Improved Quality, Stability, and Variation.”

… for visualizing generator output at any given point during the training, we use an exponential running average for the weights of the generator with decay 0.999.

— Progressive Growing of GANs for Improved Quality, Stability, and Variation, 2017.

Model weights are initialized using Orthogonal Initialization.

This was described by Andrew Saxe, et al. in their 2013 paper titled “Exact Solutions To The Nonlinear Dynamics Of Learning In Deep Linear Neural Networks.” This involves setting the weights to be a random orthogonal matrix.

… the initial weights in each layer to be a random orthogonal matrix (satisfying W^T . W = I) …

— Exact Solutions To The Nonlinear Dynamics Of Learning In Deep Linear Neural Networks, 2013.

Note that Keras supports orthogonal weight initialization directly.

Very large batch sizes were tested and evaluated.

This includes batch sizes of 256, 512, 1024, and 2,048 images.

Larger batch sizes generally resulted in better quality images, with the best image quality achieved with a batch size of 2,048 images.

… simply increasing the batch size by a factor of 8 improves the state-of-the-art IS by 46%.

— Large Scale GAN Training for High Fidelity Natural Image Synthesis, 2018.

The intuition is that the larger batch size provides more “*modes*”, and in turn, provides better gradient information for updating the models.

We conjecture that this is a result of each batch covering more modes, providing better gradients for both networks.

— Large Scale GAN Training for High Fidelity Natural Image Synthesis, 2018.

The number of model parameters was also dramatically increased.

This was achieved by doubling the number of channels or feature maps (filters) in each layer.

We then increase the width (number of channels) in each layer by 50%, approximately doubling the number of parameters in both models. This leads to a further IS improvement of 21%, which we posit is due to the increased capacity of the model relative to the complexity of the dataset.

— Large Scale GAN Training for High Fidelity Natural Image Synthesis, 2018.

Skip connections were added to the generator model to directly connect the input latent point to specific layers deep in the network.

These are referred to as skip-z connections, where z refers to the input latent vector.

Next, we add direct skip connections (skip-z) from the noise vector z to multiple layers of G rather than just the initial layer. The intuition behind this design is to allow G to use the latent space to directly influence features at different resolutions and levels of hierarchy. […] Skip-z provides a modest performance improvement of around 4%, and improves training speed by a further 18%.

— Large Scale GAN Training for High Fidelity Natural Image Synthesis, 2018.

The truncation trick involves using a different distribution for the generator’s latent space during training than during inference or image synthesis.

A Gaussian distribution is used during training, and a truncated Gaussian is used during inference. This is referred to as the “*truncation trick*.”

We call this the Truncation Trick: truncating a z vector by resampling the values with magnitude above a chosen threshold leads to improvement in individual sample quality at the cost of reduction in overall sample variety.

— Large Scale GAN Training for High Fidelity Natural Image Synthesis, 2018.

The truncation trick provides a trade-off between image quality or fidelity and image variety. A more narrow sampling range results in better quality, whereas a larger sampling range results in more variety in sampled images.

This technique allows fine-grained, post-hoc selection of the trade-off between sample quality and variety for a given G.

— Large Scale GAN Training for High Fidelity Natural Image Synthesis, 2018.

Not all models respond well to the truncation trick.

Some of the deeper models would provide saturated artifacts when the truncation trick was used.

To better encourage a broader range of models to work well with the truncation trick, orthogonal regularization was used.

This was introduced by Andrew Brock, et al. in their 2016 paper titled “Neural Photo Editing with Introspective Adversarial Networks.”

This is related to the orthogonal weight initialization and introduces a weight regularization term to encourage the weights to maintain their orthogonal property.

Orthogonality is a desirable quality in ConvNet filters, partially because multiplication by an orthogonal matrix leaves the norm of the original matrix unchanged. […] we propose a simple weight regularization technique, Orthogonal Regularization, that encourages weights to be orthogonal by pushing them towards the nearest orthogonal manifold.

— Neural Photo Editing with Introspective Adversarial Networks, 2016.

The BigGAN is capable of generating large, high-quality images.

In this section, we will review a few examples presented in the paper.

Below are some examples of high-quality images generated by BigGAN.

Below are examples of large and high-quality images generated by BigGAN.

One of the issues described when training BigGAN generators is the idea of “class leakage”, a new type of failure mode.

Below is an example of class leakage from a partially trained BigGAN, showing a cross between a tennis ball and perhaps a dog.

Below are some additional images generated by the BigGAN at 256×256 resolution.

Below are some more images generated by the BigGAN at 512×512 resolution.

This section provides more resources on the topic if you are looking to go deeper.

- Large Scale GAN Training for High Fidelity Natural Image Synthesis, 2018.
- Large Scale GAN Training for High Fidelity Natural Image Synthesis, ICLR 2019.
- Self-Attention Generative Adversarial Networks, 2018.
- A Learned Representation For Artistic Style, 2016.
- Spectral Normalization for Generative Adversarial Networks, 2018.
- Progressive Growing of GANs for Improved Quality, Stability, and Variation, 2017.
- Exact Solutions To The Nonlinear Dynamics Of Learning In Deep Linear Neural Networks, 2013.
- Neural Photo Editing with Introspective Adversarial Networks, 2016.

In this post, you discovered the BigGAN model for scaling up class-conditional image synthesis.

Specifically, you learned:

- Image size and training brittleness remain large problems for GANs.
- Scaling up model size and batch size can result in dramatically larger and higher-quality images.
- Specific model architectural and training configurations required to scale up GANs.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to BigGAN the Big Generative Adversarial Network appeared first on MachineLearningMastery.com.

]]>The post 9 Books on Generative Adversarial Networks (GANs) appeared first on MachineLearningMastery.com.

]]>Since then, GANs have seen a lot of attention given that they are perhaps one of the most effective techniques for generating large, high-quality synthetic images.

As such, a number of books have been written about GANs, mostly focusing on how to develop and use the models in practice.

In this post, you will discover books written on Generative Adversarial Networks.

**Kick-start your project** with my new book Generative Adversarial Networks with Python, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

Most of the books have been written and released under the Packt publishing company.

Almost all of the books suffer the same problems: that is, they are generally low quality and summarize the usage of third-party code on GitHub with little original content. This particularly applies to the books from Packt.

Nevertheless, it is useful to have an idea of what books are available and the topics covered. This can be helpful both in choosing a book for self-study and to get an idea of the types of topics you may want to explore when getting started with GANs.

We will review the following seven books:

- GANs in Action.
- Generative Deep Learning.
- Advanced Deep Learning with Keras.
- Learning Generative Adversarial Networks.
- Generative Adversarial Networks Projects.
- Generative Adversarial Networks Cookbook.
- Hands-On Generative Adversarial Networks with Keras.

Additionally, we will also review the GAN section of two popular deep learning books.

If I have missed a book on GANs, please let me know in the comments below.

The books mostly seem to cover the same GAN architectures, such as:

**Standard**: GAN, DCGAN.**Conditional**: cGAN, SS-GAN, InfoGAN, ACGAN.**Loss**: WGAN, WGAN-GP, LSGAN.**Image Translation**: Pix2Pix, CycleGAN.**Advanced GANs**: BigGAN, PG-GAN, StyleGAN.**Other**: StackGAN, 3DGAN, BEGAN, SRGAN, DiscoGAN, SEGAN.

Let’s take a closer look at the topics covered by each book.

**Title**: GANs in Action: Deep learning with Generative Adversarial Networks

Written by Jakub Langr and Vladimir Bok, published in 2019.

This book provides a gentle introduction to GANs using the Keras deep learning library.

- Chapter 1: Introduction to GANs
- Chapter 2: Autoencoders as a Path to GANs
- Chapter 3: Your First GAN: Generating Handwritten Digits
- Chapter 4: Deep Convolutional GAN (DCGAN)
- Chapter 5: Training and Common Challenges: GANing for Success
- Chapter 6: Progressing with GANs
- Chapter 7: Semi-Supervised GAN
- Chapter 8: Conditional GAN
- Chapter 9: CycleGaN
- Chapter 10: Adversarial Examples
- Chapter 11: Practical Applications of GANs
- Chapter 12: Looking Ahead

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

**Title**: Generative Deep Learning: Teaching Machines to Paint, Write, Compose, and Play

Written by David Foster, published in 2019.

This book focuses on the more general problem of generative modeling with deep learning, allowing variational autoencoders to be discussed. It does cover a range of GAN models, but also language modeling with LSTMs.

- Part 1: Introduction to Generative Deep Learning
- Chapter 1. Generative Modeling
- Chapter 2. Deep Learning
- Chapter 3. Variational Autoencoders
- Chapter 4. Generative Adversarial Networks

- Part 2: Teaching Machines to Paint, Write, Compose and Play
- Chapter 5. Paint
- Chapter 6. Write
- Chapter 7. Compose
- Chapter 8. Play
- Chapter 9. The Future of Generative Modeling

**Title**: Advanced Deep Learning with Keras: Apply deep learning techniques, autoencoders, GANs, variational autoencoders, deep reinforcement learning, policy gradients, and more

Written by Rowel Atienza, published in 2018.

This book is on the more general topic of advanced deep learning with Keras, allowing the coverage of autoencoders, variational autoencoders, and deep reinforcement learning. Nevertheless, the book has four chapters on GANs and I consider it a GAN book.

- Advanced Deep Learning with Keras, Amazon.
- Advanced Deep Learning with Keras, Packt.
- Book Source Code, GitHub.

- Chapter 1: Introducing Advanced Deep Learning with Kera
- Chapter 2: Deep Neural Networks
- Chapter 3: Autoencoders
- Chapter 4: Generative Adversarial Networks (GANs)
- Chapter 5: Improved GANs
- Chapter 6: Disentangled Representation GANs
- Chapter 7: Cross-Domain GANs
- Chapter 8: Variational Autoencoders (VAEs)
- Chapter 9: Deep Reinforcement Learning
- Chapter 10: Policy Gradient Methods

**Title**: Learning Generative Adversarial Networks: Next-generation deep learning simplified.

Written by Kuntal Ganguly, published in 2017.

This book provides a very simple introduction to GANs. The book may have been removed or unpublished by Packt and replaced with a video course.

- Chapter 1: Introduction to Deep Learning
- Chapter 2: Unsupervised Learning with GAN
- Chapter 3: Transfer Image Style Across Various Domains
- Chapter 4: Building Realistic Images from Your Text
- Chapter 5: Using Various Generative Models to Generate Images
- Chapter 6: Taking Machine Learning to Production

**Title**: Generative Adversarial Networks Projects: Build next-generation generative models using TensorFlow and Keras.

Written by Kailash Ahirwar, published in 2019.

This book summarizes a range of GANs with code examples in Keras.

- Generative Adversarial Networks Projects, Amazon.
- Generative Adversarial Networks Projects, Packt.
- Book Source Code

- Chapter 1: Introduction to Generative Adversarial Networks
- Chapter 2: 3D-GAN – Generating Shapes Using GANs
- Chapter 3: Face Aging Using Conditional GAN
- Chapter 4: Generating Anime Characters Using DCGANs
- Chapter 5: Using SRGANs to Generate Photo-Realistic Images
- Chapter 6: StackGAN – Text to Photo-Realistic Image Synthesis
- Chapter 7: CycleGAN – Turn Painting into Photos
- Chapter 8: Conditional GAN – Image-to-Image Translation Using Conditional Adversarial Networks
- Chapter 9: Predicting the Future of GANs

**Title**: Generative Adversarial Networks Cookbook: Over 100 recipes to build generative models using Python, TensorFlow, and Keras

Written by Josh Kalin, published in 2018.

- Generative Adversarial Networks Cookbook, Amazon.
- Generative Adversarial Networks Cookbook, Packt.
- Book Source Code.

- Chapter 1: What is a Generative Adversarial Network
- Chapter 2: Data First, Easy Environment, and Data Prep
- Chapter 3: My First GAN in Under 100 Lines
- Chapter 4: Dreaming of New Outdoor Structures Using DCGAN
- Chapter 5: Pix2Pix Image-to-Image Translation
- Chapter 6: Style Transferring Your Image Using CycleGAN
- Chapter 7: Using Simulated Images to Create Photo-Realistic Eyeballs with SimGAN
- Chapter 8: From Image to 3D Models Using GANs

**Title**: Hands-On Generative Adversarial Networks with Keras: Your guide to implementing next-generation generative adversarial networks

Written by Rafael Valle, published in 2019.

This may be one of the better Packt published books as the code appears to be better quality and a wider array of GANs are covered.

- Hands-On Generative Adversarial Networks with Keras, Amazon.
- Hands-On Generative Adversarial Networks with Keras, Packt.
- Book Source Code.

- Section 1: Introduction and Environmental Setup
- Chapter 1: Deep Learning Basics and Environment Setup
- Chapter 2: Introduction to Generative Models

- Section 2: Training GANs
- Chapter 3: Training GANs
- Chapter 4: Evaluating Your First GAN
- Chapter 5: Improving Your First GAN

- Section 3: Applications of GANS in Computer Vision, Natural Language Processing and Audio
- Chapter 6: Synthesizing and Manipulating Images with GANs
- Chapter 7: Progressive Growing of GANs
- Chapter 8: Generation of Discrete Sequences Using GANs
- Chapter 9: Text-to-Image Synthesis with GANs
- Chapter 10: Speech Enhancement with GANs
- Chapter 11: TequilaGAN – Identifying GAN Samples
- Chapter 12: What’s next in GANs

The topic of GANs has been covered in other modern books on deep learning.

Two important examples are listed below.

GANs were described in the 2016 textbook titled “Deep Learning” by Ian Goodfellow, et al., specifically:

- Chapter 20: Deep Generative Models.

Section 20.10.4 titled “*Generative Adversarial Networks*” provides a short introduction to GANs at the time of writing, two years after the original paper.

It would be great to see Goodfellow write a dedicated textbook on the topic sometime in the future.

GANs were also covered by Francois Chollet in his 2017 book titled “Deep Learning with Python“, specifically:

- Chapter 8: Generative Deep Learning.

In Section 8.5 titled “*Introduction to generative adversarial networks*,” the topic of GANs is introduced and a worked example of developing a GAN for one image class (frogs) in the CIFAR-10 dataset is covered. Source code is provided here:

In this post, you discovered a suite of books on the topic of Generative Adversarial Networks, or GANs.

Have you read any of the listed books?

Let me know what you think of it in the comments below.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post 9 Books on Generative Adversarial Networks (GANs) appeared first on MachineLearningMastery.com.

]]>The post A Gentle Introduction to StyleGAN the Style Generative Adversarial Network appeared first on MachineLearningMastery.com.

]]>Most improvement has been made to discriminator models in an effort to train more effective generator models, although less effort has been put into improving the generator models.

The Style Generative Adversarial Network, or StyleGAN for short, is an extension to the GAN architecture that proposes large changes to the generator model, including the use of a mapping network to map points in latent space to an intermediate latent space, the use of the intermediate latent space to control style at each point in the generator model, and the introduction to noise as a source of variation at each point in the generator model.

The resulting model is capable not only of generating impressively photorealistic high-quality photos of faces, but also offers control over the style of the generated image at different levels of detail through varying the style vectors and noise.

In this post, you will discover the Style Generative Adversarial Network that gives control over the style of generated synthetic images.

After reading this post, you will know:

- The lack of control over the style of synthetic images generated by traditional GAN models.
- The architecture of StyleGAN model that introduces control over the style of generated images at different levels of detail.
- Impressive results achieved with the StyleGAN architecture when used to generate synthetic human faces.

**Kick-start your project** with my new book Generative Adversarial Networks with Python, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

This tutorial is divided into four parts; they are:

- Lacking Control Over Synthesized Images
- Control Style Using New Generator Model
- What Is the StyleGAN Model Architecture
- Examples of StyleGAN Generated Images

Generative adversarial networks are effective at generating high-quality and large-resolution synthetic images.

The generator model takes as input a point from latent space and generates an image. This model is trained by a second model, called the discriminator, that learns to differentiate real images from the training dataset from fake images generated by the generator model. As such, the two models compete in an adversarial game and find a balance or equilibrium during the training process.

Many improvements to the GAN architecture have been achieved through enhancements to the discriminator model. These changes are motivated by the idea that a better discriminator model will, in turn, lead to the generation of more realistic synthetic images.

As such, the generator has been somewhat neglected and remains a black box. For example, the source of randomness used in the generation of synthetic images is not well understood, including both the amount of randomness in the sampled points and the structure of the latent space.

Yet the generators continue to operate as black boxes, and despite recent efforts, the understanding of various aspects of the image synthesis process, […] is still lacking. The properties of the latent space are also poorly understood …

— A Style-Based Generator Architecture for Generative Adversarial Networks, 2018.

This limited understanding of the generator is perhaps most exemplified by the general lack of control over the generated images. There are few tools to control the properties of generated images, e.g. the style. This includes high-level features such as background and foreground, and fine-grained details such as the features of synthesized objects or subjects.

This requires both disentangling features or properties in images and adding controls for these properties to the generator model.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

The Style Generative Adversarial Network, or StyleGAN for short, is an extension to the GAN architecture to give control over the disentangled style properties of generated images.

Our generator starts from a learned constant input and adjusts the “style” of the image at each convolution layer based on the latent code, therefore directly controlling the strength of image features at different scales

— A Style-Based Generator Architecture for Generative Adversarial Networks, 2018.

The StyleGAN is an extension of the progressive growing GAN that is an approach for training generator models capable of synthesizing very large high-quality images via the incremental expansion of both discriminator and generator models from small to large images during the training process.

In addition to the incremental growing of the models during training, the style GAN changes the architecture of the generator significantly.

The StyleGAN generator no longer takes a point from the latent space as input; instead, there are two new sources of randomness used to generate a synthetic image: a standalone mapping network and noise layers.

The output from the mapping network is a vector that defines the styles that is integrated at each point in the generator model via a new layer called adaptive instance normalization. The use of this style vector gives control over the style of the generated image.

Stochastic variation is introduced through noise added at each point in the generator model. The noise is added to entire feature maps that allow the model to interpret the style in a fine-grained, per-pixel manner.

This per-block incorporation of style vector and noise allows each block to localize both the interpretation of style and the stochastic variation to a given level of detail.

The new architecture leads to an automatically learned, unsupervised separation of high-level attributes (e.g., pose and identity when trained on human faces) and stochastic variation in the generated images (e.g., freckles, hair), and it enables intuitive, scale-specific control of the synthesis

— A Style-Based Generator Architecture for Generative Adversarial Networks, 2018.

The StyleGAN is described as a progressive growing GAN architecture with five modifications, each of which was added and evaluated incrementally in an ablative study.

The incremental list of changes to the generator are:

- Baseline Progressive GAN.
- Addition of tuning and bilinear upsampling.
- Addition of mapping network and AdaIN (styles).
- Removal of latent vector input to generator.
- Addition of noise to each block.
- Addition Mixing regularization.

The image below summarizes the StyleGAN generator architecture.

We can review each of these changes in more detail.

The StyleGAN generator and discriminator models are trained using the progressive growing GAN training method.

This means that both models start with small images, in this case, 4×4 images. The models are fit until stable, then both discriminator and generator are expanded to double the width and height (quadruple the area), e.g. 8×8.

A new block is added to each model to support the larger image size, which is faded in slowly over training. Once faded-in, the models are again trained until reasonably stable and the process is repeated with ever-larger image sizes until the desired target image size is met, such as 1024×1024.

For more on the progressive growing GAN, see the paper:

The progressive growing GAN uses nearest neighbor layers for upsampling instead of transpose convolutional layers that are common in other generator models.

The first point of deviation in the StyleGAN is that bilinear upsampling layers are unused instead of nearest neighbor.

We replace the nearest-neighbor up/downsampling in both networks with bilinear sampling, which we implement by lowpass filtering the activations with a separable 2nd order binomial filter after each upsampling layer and before each downsampling layer.

— A Style-Based Generator Architecture for Generative Adversarial Networks, 2018.

Next, a standalone mapping network is used that takes a randomly sampled point from the latent space as input and generates a style vector.

The mapping network is comprised of eight fully connected layers, e.g. it is a standard deep neural network.

For simplicity, we set the dimensionality of both [the latent and intermediate latent] spaces to 512, and the mapping f is implemented using an 8-layer MLP …

— A Style-Based Generator Architecture for Generative Adversarial Networks, 2018.

The style vector is then transformed and incorporated into each block of the generator model after the convolutional layers via an operation called adaptive instance normalization or AdaIN.

The AdaIN layers involve first standardizing the output of feature map to a standard Gaussian, then adding the style vector as a bias term.

Learned affine transformations then specialize [the intermediate latent vector] to styles y = (ys, yb) that control adaptive instance normalization (AdaIN) operations after each convolution layer of the synthesis network g.

— A Style-Based Generator Architecture for Generative Adversarial Networks, 2018.

The addition of the new mapping network to the architecture also results in the renaming of the generator model to a “*synthesis network*.”

The next change involves modifying the generator model so that it no longer takes a point from the latent space as input.

Instead, the model has a constant 4x4x512 constant value input in order to start the image synthesis process.

The output of each convolutional layer in the synthesis network is a block of activation maps.

Gaussian noise is added to each of these activation maps prior to the AdaIN operations. A different sample of noise is generated for each block and is interpreted using per-layer scaling factors.

These are single-channel images consisting of uncorrelated Gaussian noise, and we feed a dedicated noise image to each layer of the synthesis network. The noise image is broadcasted to all feature maps using learned per-feature scaling factors and then added to the output of the corresponding convolution …

— A Style-Based Generator Architecture for Generative Adversarial Networks, 2018.

This noise is used to introduce style-level variation at a given level of detail.

Mixing regularization involves first generating two style vectors from the mapping network.

A split point in the synthesis network is chosen and all AdaIN operations prior to the split point use the first style vector and all AdaIN operations after the split point get the second style vector.

… we employ mixing regularization, where a given percentage of images are generated using two random latent codes instead of one during training.

— A Style-Based Generator Architecture for Generative Adversarial Networks, 2018.

This encourages the layers and blocks to localize the style to specific parts of the model and corresponding level of detail in the generated image.

The StyleGAN is both effective at generating large high-quality images and at controlling the style of the generated images.

In this section, we will review some examples of generated images.

A video demonstrating the capability of the model was released by the authors of the paper, providing a useful overview.

The image below taken from the paper shows synthetic faces generated with the StyleGAN with the sizes 4×4, 8×8, 16×16, and 32×32.

The use of different style vectors at different points of the synthesis network gives control over the styles of the resulting image at different levels of detail.

For example, blocks of layers in the synthesis network at lower resolutions (e.g. 4×4 and 8×8) control high-level styles such as pose and hairstyle. Blocks of layers in the model of the network (e.g. as 16×16 and 32×32) control hairstyles and facial expression. Finally, blocks of layers closer to the output end of the network (e.g. 64×64 to 1024×1024) control color schemes and very fine details.

The image below taken from the paper shows generated images on the left and across the top. The two rows of intermediate images are examples of the style vectors used to generate the images on the left, where the style vectors used for the images on the top are used only in the lower levels. This allows the images on the left to adopt high-level styles such as pose and hairstyle from the images on the top in each column.

Copying the styles corresponding to coarse spatial resolutions (4^2 – 8^2) brings high-level aspects such as pose, general hair style, face shape, and eyeglasses from source B, while all colors (eyes, hair, lighting) and finer facial features resemble A.

— A Style-Based Generator Architecture for Generative Adversarial Networks, 2018.

The authors varied the use of noise at different levels of detail in the model (e.g. fine, middle, coarse), much like the previous example of varying style.

The result is that noise gives control over the generation of detail, from broader structure when noise is used in the coarse blocks of layers to the generation of fine detail when noise is added to the layers closer to the output of the network.

We can see that the artificial omission of noise leads to featureless “painterly” look. Coarse noise causes large-scale curling of hair and appearance of larger background features, while the fine noise brings out the finer curls of hair, finer background detail, and skin pores.

— A Style-Based Generator Architecture for Generative Adversarial Networks, 2018.

This section provides more resources on the topic if you are looking to go deeper.

- A Style-Based Generator Architecture for Generative Adversarial Networks, 2018.
- Progressive Growing of GANs for Improved Quality, Stability, and Variation, 2017.
- StyleGAN – Official TensorFlow Implementation, GitHub.
- StyleGAN Results Video, YouTube.

In this post, you discovered the Style Generative Adversarial Network that gives control over the style of generated synthetic images.

Specifically, you learned:

- The lack of control over the style of synthetic images generated by traditional GAN models.
- The architecture of StyleGAN model GAN model that introduces control over the style of generated images at different levels of detail
- Impressive results achieved with the StyleGAN architecture when used to generate synthetic human faces.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to StyleGAN the Style Generative Adversarial Network appeared first on MachineLearningMastery.com.

]]>The post How to Train a Progressive Growing GAN in Keras for Synthesizing Faces appeared first on MachineLearningMastery.com.

]]>A limitation of GANs is that the are only capable of generating relatively small images, such as 64×64 pixels.

The Progressive Growing GAN is an extension to the GAN training procedure that involves training a GAN to generate very small images, such as 4×4, and incrementally increasing the size of the generated images to 8×8, 16×16, until the desired output size is met. This has allowed the progressive GAN to generate photorealistic synthetic faces with 1024×1024 pixel resolution.

The key innovation of the progressive growing GAN is the two-phase training procedure that involves the fading-in of new blocks to support higher-resolution images followed by fine-tuning.

In this tutorial, you will discover how to implement and train a progressive growing generative adversarial network for generating celebrity faces.

After completing this tutorial, you will know:

- How to prepare the celebrity faces dataset for training a progressive growing GAN model.
- How to define and train the progressive growing GAN on the celebrity faces dataset.
- How to load saved generator models and use them for generating ad hoc synthetic celebrity faces.

**Kick-start your project** with my new book Generative Adversarial Networks with Python, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

**Updated Sep/2019**: Fixed small bug when summarizing performance during training.

This tutorial is divided into five parts; they are:

- What Is the Progressive Growing GAN
- How to Prepare the Celebrity Faces Dataset
- How to Develop Progressive Growing GAN Models
- How to Train Progressive Growing GAN Models
- How to Synthesize Images With a Progressive Growing GAN Model

GANs are effective at generating crisp synthetic images, although are typically limited in the size of the images that can be generated.

The Progressive Growing GAN is an extension to the GAN that allows the training generator models to be capable of generating large high-quality images, such as photorealistic faces with the size 1024×1024 pixels. It was described in the 2017 paper by Tero Karras, et al. from Nvidia titled “Progressive Growing of GANs for Improved Quality, Stability, and Variation.”

The key innovation of the Progressive Growing GAN is the incremental increase in the size of images output by the generator, starting with a 4×4 pixel image and doubling to 8×8, 16×16, and so on until the desired output resolution.

This is achieved by a training procedure that involves periods of fine-tuning the model with a given output resolution, and periods of slowly phasing in a new model with a larger resolution. All layers remain trainable during the training process, including existing layers when new layers are added.

Progressive Growing GAN involves using a generator and discriminator model with the same general structure and starting with very small images. During training, new blocks of convolutional layers are systematically added to both the generator model and the discriminator models.

The incremental addition of the layers allows the models to effectively learn coarse-level detail and later learn ever-finer detail, both on the generator and discriminator sides.

This incremental nature allows the training to first discover large-scale structure of the image distribution and then shift attention to increasingly finer-scale detail, instead of having to learn all scales simultaneously.

The next step is to select a dataset to use for developing a Progressive Growing GAN.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

In this tutorial, we will use the Large-scale Celebrity Faces Attributes Dataset, referred to as CelebA.

This dataset was developed and published by Ziwei Liu, et al. for their 2015 paper tilted “From Facial Parts Responses to Face Detection: A Deep Learning Approach.”

The dataset provides about 200,000 photographs of celebrity faces along with annotations for what appears in given photos, such as glasses, face shape, hats, hair type, etc. As part of the dataset, the authors provide a version of each photo centered on the face and cropped to the portrait with varying sizes around 150 pixels wide and 200 pixels tall. We will use this as the basis for developing our GAN model.

The dataset can be easily downloaded from the Kaggle webpage. Note: this may require an account with Kaggle.

Specifically, download the file “*img_align_celeba.zip*“, which is about 1.3 gigabytes. To do this, click on the filename on the Kaggle website and then click the download icon.

The download might take a while depending on the speed of your internet connection.

After downloading, unzip the archive.

This will create a new directory named “*img_align_celeba*” that contains all of the images with filenames like *202599.jpg* and *202598.jpg*.

When working with a GAN, it is easier to model a dataset if all of the images are small and square in shape.

Further, as we are only interested in the face in each photo and not the background, we can perform face detection and extract only the face before resizing the result to a fixed size.

There are many ways to perform face detection. In this case, we will use a pre-trained Multi-Task Cascaded Convolutional Neural Network, or MTCNN. This is a state-of-the-art deep learning model for face detection, described in the 2016 paper titled “Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks.”

We will use the implementation provided by Iván de Paz Centeno in the ipazc/mtcnn project. This can also be installed via pip as follows:

sudo pip install mtcnn

We can confirm that the library was installed correctly by importing the library and printing the version; for example:

# confirm mtcnn was installed correctly import mtcnn # print version print(mtcnn.__version__)

Running the example prints the current version of the library.

0.0.8

The MTCNN model is very easy to use.

First, an instance of the MTCNN model is created, then the *detect_faces()* function can be called passing in the pixel data for one image.

The result a list of detected faces, with a bounding box defined in pixel offset values.

... # prepare model model = MTCNN() # detect face in the image faces = model.detect_faces(pixels) # extract details of the face x1, y1, width, height = faces[0]['box']

Although the progressive growing GAN supports the synthesis of large images, such as 1024×1024, this requires enormous resources, such as a single top of the line GPU training the model for a month.

Instead, we will reduce the size of the generated images to 128×128 which will, in turn, allow us to train a reasonable model on a GPU in a few hours and still discover how the progressive growing model can be implemented, trained, and used.

As such, we can develop a function to load a file and extract the face from the photo, then and resize the extracted face pixels to a predefined size. In this case, we will use the square shape of 128×128 pixels.

The *load_image()* function below will load a given photo file name as a NumPy array of pixels.

# load an image as an rgb numpy array def load_image(filename): # load image from file image = Image.open(filename) # convert to RGB, if needed image = image.convert('RGB') # convert to array pixels = asarray(image) return pixels

The *extract_face()* function below takes the MTCNN model and pixel values for a single photograph as arguments and returns a 128x128x3 array of pixel values with just the face, or *None* if no face was detected (which can happen rarely).

# extract the face from a loaded image and resize def extract_face(model, pixels, required_size=(128, 128)): # detect face in the image faces = model.detect_faces(pixels) # skip cases where we could not detect a face if len(faces) == 0: return None # extract details of the face x1, y1, width, height = faces[0]['box'] # force detected pixel values to be positive (bug fix) x1, y1 = abs(x1), abs(y1) # convert into coordinates x2, y2 = x1 + width, y1 + height # retrieve face pixels face_pixels = pixels[y1:y2, x1:x2] # resize pixels to the model size image = Image.fromarray(face_pixels) image = image.resize(required_size) face_array = asarray(image) return face_array

The *load_faces()* function below enumerates all photograph files in a directory and extracts and resizes the face from each and returns a NumPy array of faces.

We limit the total number of faces loaded via the *n_faces* argument, as we don’t need them all.

# load images and extract faces for all images in a directory def load_faces(directory, n_faces): # prepare model model = MTCNN() faces = list() # enumerate files for filename in listdir(directory): # load the image pixels = load_image(directory + filename) # get face face = extract_face(model, pixels) if face is None: continue # store faces.append(face) print(len(faces), face.shape) # stop once we have enough if len(faces) >= n_faces: break return asarray(faces)

Tying this together, the complete example of preparing a dataset of celebrity faces for training a GAN model is listed below.

In this case, we increase the total number of loaded faces to 50,000 to provide a good training dataset for our GAN model.

# example of extracting and resizing faces into a new dataset from os import listdir from numpy import asarray from numpy import savez_compressed from PIL import Image from mtcnn.mtcnn import MTCNN from matplotlib import pyplot # load an image as an rgb numpy array def load_image(filename): # load image from file image = Image.open(filename) # convert to RGB, if needed image = image.convert('RGB') # convert to array pixels = asarray(image) return pixels # extract the face from a loaded image and resize def extract_face(model, pixels, required_size=(128, 128)): # detect face in the image faces = model.detect_faces(pixels) # skip cases where we could not detect a face if len(faces) == 0: return None # extract details of the face x1, y1, width, height = faces[0]['box'] # force detected pixel values to be positive (bug fix) x1, y1 = abs(x1), abs(y1) # convert into coordinates x2, y2 = x1 + width, y1 + height # retrieve face pixels face_pixels = pixels[y1:y2, x1:x2] # resize pixels to the model size image = Image.fromarray(face_pixels) image = image.resize(required_size) face_array = asarray(image) return face_array # load images and extract faces for all images in a directory def load_faces(directory, n_faces): # prepare model model = MTCNN() faces = list() # enumerate files for filename in listdir(directory): # load the image pixels = load_image(directory + filename) # get face face = extract_face(model, pixels) if face is None: continue # store faces.append(face) print(len(faces), face.shape) # stop once we have enough if len(faces) >= n_faces: break return asarray(faces) # directory that contains all images directory = 'img_align_celeba/' # load and extract all faces all_faces = load_faces(directory, 50000) print('Loaded: ', all_faces.shape) # save in compressed format savez_compressed('img_align_celeba_128.npz', all_faces)

Running the example may take a few minutes given the larger number of faces to be loaded.

At the end of the run, the array of extracted and resized faces is saved as a compressed NumPy array with the filename ‘*img_align_celeba_128.npz*‘.

The prepared dataset can then be loaded any time, as follows.

# load the prepared dataset from numpy import load # load the face dataset data = load('img_align_celeba_128.npz') faces = data['arr_0'] print('Loaded: ', faces.shape)

Loading the dataset summarizes the shape of the array, showing 50K images with the size of 128×128 pixels and three color channels.

Loaded: (50000, 128, 128, 3)

We can elaborate on this example and plot the first 100 faces in the dataset as a 10×10 grid. The complete example is listed below.

# load the prepared dataset from numpy import load from matplotlib import pyplot # plot a list of loaded faces def plot_faces(faces, n): for i in range(n * n): # define subplot pyplot.subplot(n, n, 1 + i) # turn off axis pyplot.axis('off') # plot raw pixel data pyplot.imshow(faces[i].astype('uint8')) pyplot.show() # load the face dataset data = load('img_align_celeba_128.npz') faces = data['arr_0'] print('Loaded: ', faces.shape) plot_faces(faces, 10)

Running the example loads the dataset and creates a plot of the first 100 images.

We can see that each image only contains the face and all faces have the same square shape. Our goal is to generate new faces with the same general properties.

We are now ready to develop a GAN model to generate faces using this dataset.

There are many ways to implement the progressive growing GAN models.

In this tutorial, we will develop and implement each phase of growth as a separate Keras model and each model will share the same layers and weights.

This approach allows for the convenient training of each model, just like a normal Keras model, although it requires a slightly complicated model construction process to ensure that the layers are reused correctly.

First, we will define some custom layers required in the definition of the generator and discriminator models, then proceed to define functions to create and grow the discriminator and generator models themselves.

There are three custom layers required to implement the progressive growing generative adversarial network.

They are the layers:

**WeightedSum**: Used to control the weighted sum of the old and new layers during a growth phase.**MinibatchStdev**: Used to summarize statistics for a batch of images in the discriminator.**PixelNormalization**: Used to normalize activation maps in the generator model.

Additionally, a weight constraint is used in the paper referred to as “*equalized learning rate*“. This too would need to be implemented as a custom layer. In the interest of brevity, we won’t use equalized learning rate in this tutorial and instead we use a simple max norm weight constraint.

The *WeightedSum* layer is a merge layer that combines the activations from two input layers, such as two input paths in a discriminator or two output paths in a generator model. It uses a variable called *alpha* that controls how much to weight the first and second inputs.

It is used during the growth phase of training when the model is in transition from one image size to a new image size with double the width and height (quadruple the area), such as from 4×4 to 8×8 pixels.

During the growth phase, the alpha parameter is linearly scaled from 0.0 at the beginning to 1.0 at the end, allowing the output of the layer to transition from giving full weight to the old layers to giving full weight to the new layers (second input).

- weighted sum = ((1.0 – alpha) * input1) + (alpha * input2)

The *WeightedSum* class is defined below as an extension to the *Add* merge layer.

# weighted sum output class WeightedSum(Add): # init with default value def __init__(self, alpha=0.0, **kwargs): super(WeightedSum, self).__init__(**kwargs) self.alpha = backend.variable(alpha, name='ws_alpha') # output a weighted sum of inputs def _merge_function(self, inputs): # only supports a weighted sum of two inputs assert (len(inputs) == 2) # ((1-a) * input1) + (a * input2) output = ((1.0 - self.alpha) * inputs[0]) + (self.alpha * inputs[1]) return output

The mini-batch standard deviation layer, or *MinibatchStdev*, is only used in the output block of the discriminator layer.

The objective of the layer is to provide a statistical summary of the batch of activations. The discriminator can then learn to better detect batches of fake samples from batches of real samples. This, in turn, encourages the generator that is trained via the discriminator to create batches of samples with realistic batch statistics.

It is implemented as calculating the standard deviation for each pixel value in the activation maps across the batch, calculating the average of this value, and then creating a new activation map (one channel) that is appended to the list of activation maps provided as input.

The *MinibatchStdev* layer is defined below.

# mini-batch standard deviation layer class MinibatchStdev(Layer): # initialize the layer def __init__(self, **kwargs): super(MinibatchStdev, self).__init__(**kwargs) # perform the operation def call(self, inputs): # calculate the mean value for each pixel across channels mean = backend.mean(inputs, axis=0, keepdims=True) # calculate the squared differences between pixel values and mean squ_diffs = backend.square(inputs - mean) # calculate the average of the squared differences (variance) mean_sq_diff = backend.mean(squ_diffs, axis=0, keepdims=True) # add a small value to avoid a blow-up when we calculate stdev mean_sq_diff += 1e-8 # square root of the variance (stdev) stdev = backend.sqrt(mean_sq_diff) # calculate the mean standard deviation across each pixel coord mean_pix = backend.mean(stdev, keepdims=True) # scale this up to be the size of one input feature map for each sample shape = backend.shape(inputs) output = backend.tile(mean_pix, (shape[0], shape[1], shape[2], 1)) # concatenate with the output combined = backend.concatenate([inputs, output], axis=-1) return combined # define the output shape of the layer def compute_output_shape(self, input_shape): # create a copy of the input shape as a list input_shape = list(input_shape) # add one to the channel dimension (assume channels-last) input_shape[-1] += 1 # convert list to a tuple return tuple(input_shape)

The generator and discriminator models don’t use batch normalization like other GAN models; instead, each pixel in the activation maps is normalized to unit length.

This is a variation of local response normalization and is referred to in the paper as pixelwise feature vector normalization. Also, unlike other GAN models, normalization is only used in the generator model, not the discriminator.

This is a type of activity regularization and could be implemented as an activity constraint, although it is easily implemented as a new layer that scales the activations of the prior layer.

The *PixelNormalization* class below implements this and can be used after each Convolution layer in the generator, but before any activation function.

# pixel-wise feature vector normalization layer class PixelNormalization(Layer): # initialize the layer def __init__(self, **kwargs): super(PixelNormalization, self).__init__(**kwargs) # perform the operation def call(self, inputs): # calculate square pixel values values = inputs**2.0 # calculate the mean pixel values mean_values = backend.mean(values, axis=-1, keepdims=True) # ensure the mean is not zero mean_values += 1.0e-8 # calculate the sqrt of the mean squared value (L2 norm) l2 = backend.sqrt(mean_values) # normalize values by the l2 norm normalized = inputs / l2 return normalized # define the output shape of the layer def compute_output_shape(self, input_shape): return input_shape

We now have all of the custom layers required and can define our models.

The discriminator model is defined as a deep convolutional neural network that expects a 4×4 color image as input and predicts whether it is real or fake.

The first hidden layer is a 1×1 convolutional layer. The output block involves a *MinibatchStdev*, 3×3, and 4×4 convolutional layers, and a fully connected layer that outputs a prediction. Leaky ReLU activation functions are used after all layers and the output layers use a linear activation function.

This model is trained for normal interval then the model undergoes a growth phase to 8×8. This involves adding a block of two 3×3 convolutional layers and an average pooling downsample layer. The input image passes through the new block with a new 1×1 convolutional hidden layer. The input image is also passed through a downsample layer and through the old 1×1 convolutional hidden layer. The output of the old 1×1 convolution layer and the new block are then combined via a *WeightedSum* layer.

After an interval of training transitioning the *WeightedSum’s* alpha parameter from 0.0 (all old) to 1.0 (all new), another training phase is run to tune the new model with the old layer and pathway removed.

This process repeats until the desired image size is met, in our case, 128×128 pixel images.

We can achieve this with two functions: the *define_discriminator()* function that defines the base model that accepts 4×4 images and the *add_discriminator_block()* function that takes a model and creates a growth version of the model with two pathways and the *WeightedSum* and a second version of the model with the same layers/weights but without the old 1×1 layer and *WeightedSum* layers. The *define_discriminator()* function can then call the *add_discriminator_block()* function as many times as is needed to create the models up to the desired level of growth.

All layers are initialized with small Gaussian random numbers with a standard deviation of 0.02, which is common for GAN models. A maxnorm weight constraint is used with a value of 1.0, instead of the more elaborate ‘*equalized learning rate*‘ weight constraint used in the paper.

The paper defines a number of filters that increases with the depth of the model from 16 to 32, 64, all the way up to 512. This requires projection of the number of feature maps during the growth phase so that the weighted sum can be calculated correctly. To avoid this complication, we fix the number of filters to be the same in all layers.

Each model is compiled and will be fit. In this case, we will use Wasserstein loss (or WGAN loss) and the Adam version of stochastic gradient descent configured as is specified in the paper. The authors of the paper recommend exploring using both WGAN-GP loss and least squares loss and found that the former performed slightly better. Nevertheless, we will use Wasserstein loss as it greatly simplifies the implementation.

First, we must define the loss function as the average predicted value multiplied by the target value. The target value will be 1 for real images and -1 for fake images. This means that weight updates will seek to increase the divide between real and fake images.

# calculate wasserstein loss def wasserstein_loss(y_true, y_pred): return backend.mean(y_true * y_pred)

The functions for defining and creating the growth versions of the discriminator models are listed below.

We make careful use of the functional API and knowledge of the model structure to create the two models for each growth phase. The growth phase also always doubles the expected input shape.

# add a discriminator block def add_discriminator_block(old_model, n_input_layers=3): # weight initialization init = RandomNormal(stddev=0.02) # weight constraint const = max_norm(1.0) # get shape of existing model in_shape = list(old_model.input.shape) # define new input shape as double the size input_shape = (in_shape[-2].value*2, in_shape[-2].value*2, in_shape[-1].value) in_image = Input(shape=input_shape) # define new input processing layer d = Conv2D(128, (1,1), padding='same', kernel_initializer=init, kernel_constraint=const)(in_image) d = LeakyReLU(alpha=0.2)(d) # define new block d = Conv2D(128, (3,3), padding='same', kernel_initializer=init, kernel_constraint=const)(d) d = LeakyReLU(alpha=0.2)(d) d = Conv2D(128, (3,3), padding='same', kernel_initializer=init, kernel_constraint=const)(d) d = LeakyReLU(alpha=0.2)(d) d = AveragePooling2D()(d) block_new = d # skip the input, 1x1 and activation for the old model for i in range(n_input_layers, len(old_model.layers)): d = old_model.layers[i](d) # define straight-through model model1 = Model(in_image, d) # compile model model1.compile(loss=wasserstein_loss, optimizer=Adam(lr=0.001, beta_1=0, beta_2=0.99, epsilon=10e-8)) # downsample the new larger image downsample = AveragePooling2D()(in_image) # connect old input processing to downsampled new input block_old = old_model.layers[1](downsample) block_old = old_model.layers[2](block_old) # fade in output of old model input layer with new input d = WeightedSum()([block_old, block_new]) # skip the input, 1x1 and activation for the old model for i in range(n_input_layers, len(old_model.layers)): d = old_model.layers[i](d) # define straight-through model model2 = Model(in_image, d) # compile model model2.compile(loss=wasserstein_loss, optimizer=Adam(lr=0.001, beta_1=0, beta_2=0.99, epsilon=10e-8)) return [model1, model2] # define the discriminator models for each image resolution def define_discriminator(n_blocks, input_shape=(4,4,3)): # weight initialization init = RandomNormal(stddev=0.02) # weight constraint const = max_norm(1.0) model_list = list() # base model input in_image = Input(shape=input_shape) # conv 1x1 d = Conv2D(128, (1,1), padding='same', kernel_initializer=init, kernel_constraint=const)(in_image) d = LeakyReLU(alpha=0.2)(d) # conv 3x3 (output block) d = MinibatchStdev()(d) d = Conv2D(128, (3,3), padding='same', kernel_initializer=init, kernel_constraint=const)(d) d = LeakyReLU(alpha=0.2)(d) # conv 4x4 d = Conv2D(128, (4,4), padding='same', kernel_initializer=init, kernel_constraint=const)(d) d = LeakyReLU(alpha=0.2)(d) # dense output layer d = Flatten()(d) out_class = Dense(1)(d) # define model model = Model(in_image, out_class) # compile model model.compile(loss=wasserstein_loss, optimizer=Adam(lr=0.001, beta_1=0, beta_2=0.99, epsilon=10e-8)) # store model model_list.append([model, model]) # create submodels for i in range(1, n_blocks): # get prior model without the fade-on old_model = model_list[i - 1][0] # create new model for next resolution models = add_discriminator_block(old_model) # store model model_list.append(models) return model_list

The *define_discriminator()* function is called by specifying the number of blocks to create.

We will create 6 blocks, which will create 6 pairs of models that expect the input image sizes of 4×4, 8×8, 16×16, 32×32, 64×64, 128×128.

The function returns a list where each element in the list contains two models. The first model is the ‘*normal model*‘ or straight through model, and the second is the version of the model that includes the old 1×1 and new block with the weighted sum, used for the transition or growth phase of training.

The generator model takes a random point from the latent space as input and generates a synthetic image.

The generator models are defined in the same way as the discriminator models.

Specifically, a base model for generating 4×4 images is defined and growth versions of the model are created for the large image output size.

The main difference is that during the growth phase, the output of the model is the output of the *WeightedSum* layer. The growth phase version of the model involves first adding a nearest neighbor upsampling layer; this is then connected to the new block with the new output layer and to the old old output layer. The old and new output layers are then combined via a *WeightedSum* output layer.

The base model has an input block defined with a fully connected layer with a sufficient number of activations to create a given number of 4×4 feature maps. This is followed by 4×4 and 3×3 convolution layers and a 1×1 output layer that generates color images. New blocks are added with an upsample layer and two 3×3 convolutional layers.

The *LeakyReLU* activation function is used and the *PixelNormalization* layer is used after each convolutional layer. A linear activation function is used in the output layer, instead of the more common tanh function, yet real images are still scaled to the range [-1,1], which is common for most GAN models.

The paper defines the number of feature maps decreasing with the depth of the model from 512 to 16. As with the discriminator, the difference in the number of feature maps across blocks introduces a challenge for the *WeightedSum*, so for simplicity, we fix all layers to have the same number of filters.

Also like the discriminator model, weights are initialized with Gaussian random numbers with a standard deviation of 0.02 and the maxnorm weight constraint is used with a value of 1.0, instead of the equalized learning rate weight constraint used in the paper.

The functions for defining and growing the generator models are defined below.

# add a generator block def add_generator_block(old_model): # weight initialization init = RandomNormal(stddev=0.02) # weight constraint const = max_norm(1.0) # get the end of the last block block_end = old_model.layers[-2].output # upsample, and define new block upsampling = UpSampling2D()(block_end) g = Conv2D(128, (3,3), padding='same', kernel_initializer=init, kernel_constraint=const)(upsampling) g = PixelNormalization()(g) g = LeakyReLU(alpha=0.2)(g) g = Conv2D(128, (3,3), padding='same', kernel_initializer=init, kernel_constraint=const)(g) g = PixelNormalization()(g) g = LeakyReLU(alpha=0.2)(g) # add new output layer out_image = Conv2D(3, (1,1), padding='same', kernel_initializer=init, kernel_constraint=const)(g) # define model model1 = Model(old_model.input, out_image) # get the output layer from old model out_old = old_model.layers[-1] # connect the upsampling to the old output layer out_image2 = out_old(upsampling) # define new output image as the weighted sum of the old and new models merged = WeightedSum()([out_image2, out_image]) # define model model2 = Model(old_model.input, merged) return [model1, model2] # define generator models def define_generator(latent_dim, n_blocks, in_dim=4): # weight initialization init = RandomNormal(stddev=0.02) # weight constraint const = max_norm(1.0) model_list = list() # base model latent input in_latent = Input(shape=(latent_dim,)) # linear scale up to activation maps g = Dense(128 * in_dim * in_dim, kernel_initializer=init, kernel_constraint=const)(in_latent) g = Reshape((in_dim, in_dim, 128))(g) # conv 4x4, input block g = Conv2D(128, (3,3), padding='same', kernel_initializer=init, kernel_constraint=const)(g) g = PixelNormalization()(g) g = LeakyReLU(alpha=0.2)(g) # conv 3x3 g = Conv2D(128, (3,3), padding='same', kernel_initializer=init, kernel_constraint=const)(g) g = PixelNormalization()(g) g = LeakyReLU(alpha=0.2)(g) # conv 1x1, output block out_image = Conv2D(3, (1,1), padding='same', kernel_initializer=init, kernel_constraint=const)(g) # define model model = Model(in_latent, out_image) # store model model_list.append([model, model]) # create submodels for i in range(1, n_blocks): # get prior model without the fade-on old_model = model_list[i - 1][0] # create new model for next resolution models = add_generator_block(old_model) # store model model_list.append(models) return model_list

Calling the *define_generator()* function requires that the size of the latent space be defined.

Like the discriminator, we will set the *n_blocks* argument to 6 to create six pairs of models.

The function returns a list of models where each item in the list contains the normal or straight-through version of each generator and the growth version for phasing in the new block at the larger output image size.

The generator models are not compiled as they are not trained directly.

Instead, the generator models are trained via the discriminator models using Wasserstein loss.

This involves presenting generated images to the discriminator as real images and calculating the loss that is then used to update the generator models.

A given generator model must be paired with a given discriminator model both in terms of the same image size (e.g. 4×4 or 8×8) and in terms of the same phase of training, such as growth phase (introducing the new block) or fine-tuning phase (normal or straight-through).

We can achieve this by creating a new model for each pair of models that stacks the generator on top of the discriminator so that the synthetic image feeds directly into the discriminator model to be deemed real or fake. This composite model can then be used to train the generator via the discriminator and the weights of the discriminator can be marked as not trainable (only in this model) to ensure they are not changed during this misleading process.

As such, we can create pairs of composite models, e.g. six pairs for the six levels of image growth, where each pair is comprised of a composite model for the normal or straight-through model, and the growth version of the model.

The *define_composite()* function implements this and is defined below.

# define composite models for training generators via discriminators def define_composite(discriminators, generators): model_list = list() # create composite models for i in range(len(discriminators)): g_models, d_models = generators[i], discriminators[i] # straight-through model d_models[0].trainable = False model1 = Sequential() model1.add(g_models[0]) model1.add(d_models[0]) model1.compile(loss=wasserstein_loss, optimizer=Adam(lr=0.001, beta_1=0, beta_2=0.99, epsilon=10e-8)) # fade-in model d_models[1].trainable = False model2 = Sequential() model2.add(g_models[1]) model2.add(d_models[1]) model2.compile(loss=wasserstein_loss, optimizer=Adam(lr=0.001, beta_1=0, beta_2=0.99, epsilon=10e-8)) # store model_list.append([model1, model2]) return model_list

Now that we have seen how to define the generator and discriminator models, let’s look at how we can fit these models on the celebrity faces dataset.

First, we need to define some convenience functions for working with samples of data.

The *load_real_samples()* function below loads our prepared celebrity faces dataset, then converts the pixels to floating point values and scales them to the range [-1,1], common to most GAN implementations.

# load dataset def load_real_samples(filename): # load dataset data = load(filename) # extract numpy array X = data['arr_0'] # convert from ints to floats X = X.astype('float32') # scale from [0,255] to [-1,1] X = (X - 127.5) / 127.5 return X

Next, we need to be able to retrieve a random sample of images used to update the discriminator.

The *generate_real_samples()* function below implements this, returning a random sample of images from the loaded dataset and their corresponding target value of *class=1* to indicate that the images are real.

# select real samples def generate_real_samples(dataset, n_samples): # choose random instances ix = randint(0, dataset.shape[0], n_samples) # select images X = dataset[ix] # generate class labels y = ones((n_samples, 1)) return X, y

Next, we need a sample of latent points used to create synthetic images with the generator model.

The *generate_latent_points()* function below implements this, returning a batch of latent points with the required dimensionality.

# generate points in latent space as input for the generator def generate_latent_points(latent_dim, n_samples): # generate points in the latent space x_input = randn(latent_dim * n_samples) # reshape into a batch of inputs for the network x_input = x_input.reshape(n_samples, latent_dim) return x_input

The latent points can be used as input to the generator to create a batch of synthetic images.

This is required to update the discriminator model. It is also required to update the generator model via the discriminator model with the composite models defined in the previous section.

The *generate_fake_samples()* function below takes a generator model and generates and returns a batch of synthetic images and the corresponding target for the discriminator of *class=-1* to indicate that the images are fake. The *generate_latent_points()* function is called to create the required batch worth of random latent points.

# use the generator to generate n fake examples, with class labels def generate_fake_samples(generator, latent_dim, n_samples): # generate points in latent space x_input = generate_latent_points(latent_dim, n_samples) # predict outputs X = generator.predict(x_input) # create class labels y = -ones((n_samples, 1)) return X, y

Training the models occurs in two phases: a fade-in phase that involves the transition from a lower-resolution to a higher-resolution image, and the normal phase that involves the fine-tuning of the models at a given higher resolution image.

During the phase-in, the *alpha* value of the *WeightedSum* layers in the discriminator and generator model at a given level requires linear transition from 0.0 to 1.0 based on the training step. The *update_fadein()* function below implements this; given a list of models (such as the generator, discriminator, and composite model), the function locates the *WeightedSum* layer in each and sets the value for the alpha attribute based on the current training step number.

Importantly, this alpha attribute is not a constant but is instead defined as a changeable variable in the *WeightedSum* class and whose value can be changed using the Keras backend *set_value()* function.

This is a clumsy but effective approach to changing the *alpha* values. Perhaps a cleaner implementation would involve a Keras Callback and is left as an exercise for the reader.

# update the alpha value on each instance of WeightedSum def update_fadein(models, step, n_steps): # calculate current alpha (linear from 0 to 1) alpha = step / float(n_steps - 1) # update the alpha for each model for model in models: for layer in model.layers: if isinstance(layer, WeightedSum): backend.set_value(layer.alpha, alpha)

Next, we can define the procedure for training the models for a given training phase.

A training phase takes one generator, discriminator, and composite model and updates them on the dataset for a given number of training epochs. The training phase may be a fade-in transition to a higher resolution, in which case the *update_fadein()* must be called each iteration, or it may be a normal tuning training phase, in which case there are no *WeightedSum* layers present.

The *train_epochs()* function below implements the training of the discriminator and generator models for a single training phase.

A single training iteration involves first selecting a half batch of real images from the dataset and generating a half batch of fake images from the current state of the generator model. These samples are then used to update the discriminator model.

Next, the generator model is updated via the discriminator with the composite model, indicating that the generated images are, in fact, real, and updating generator weights in an effort to better fool the discriminator.

A summary of model performance is printed at the end of each training iteration, summarizing the loss of the discriminator on the real (d1) and fake (d2) images and the loss of the generator (g).

# train a generator and discriminator def train_epochs(g_model, d_model, gan_model, dataset, n_epochs, n_batch, fadein=False): # calculate the number of batches per training epoch bat_per_epo = int(dataset.shape[0] / n_batch) # calculate the number of training iterations n_steps = bat_per_epo * n_epochs # calculate the size of half a batch of samples half_batch = int(n_batch / 2) # manually enumerate epochs for i in range(n_steps): # update alpha for all WeightedSum layers when fading in new blocks if fadein: update_fadein([g_model, d_model, gan_model], i, n_steps) # prepare real and fake samples X_real, y_real = generate_real_samples(dataset, half_batch) X_fake, y_fake = generate_fake_samples(g_model, latent_dim, half_batch) # update discriminator model d_loss1 = d_model.train_on_batch(X_real, y_real) d_loss2 = d_model.train_on_batch(X_fake, y_fake) # update the generator via the discriminator's error z_input = generate_latent_points(latent_dim, n_batch) y_real2 = ones((n_batch, 1)) g_loss = gan_model.train_on_batch(z_input, y_real2) # summarize loss on this batch print('>%d, d1=%.3f, d2=%.3f g=%.3f' % (i+1, d_loss1, d_loss2, g_loss))

Next, we need to call the *train_epochs()* function for each training phase.

This involves first scaling the training dataset to the required pixel dimensions, such as 4×4 or 8×8. The *scale_dataset()* function below implements this, taking the dataset and returning a scaled version.

These scaled versions of the dataset could be pre-computed and loaded instead of re-scaled on each run. This might be a nice extension if you intend to run the example many times.

# scale images to preferred size def scale_dataset(images, new_shape): images_list = list() for image in images: # resize with nearest neighbor interpolation new_image = resize(image, new_shape, 0) # store images_list.append(new_image) return asarray(images_list)

After each training run, we also need to save a plot of generated images and the current state of the generator model.

This is useful so that at the end of the run we can see the progression of the capability and quality of the model, and load and use a generator model at any point during the training process. A generator model could be used to create ad hoc images, or used as the starting point for continued training.

The *summarize_performance()* function below implements this, given a status string such as “*faded*” or “*tuned*“, a generator model, and the size of the latent space. The function will proceed to create a unique name for the state of the system using the “*status*” string such as “*04×04-faded*“, then create a plot of 25 generated images and save the plot and the generator model to file using the defined name.

# generate samples and save as a plot and save the model def summarize_performance(status, g_model, latent_dim, n_samples=25): # devise name gen_shape = g_model.output_shape name = '%03dx%03d-%s' % (gen_shape[1], gen_shape[2], status) # generate images X, _ = generate_fake_samples(g_model, latent_dim, n_samples) # normalize pixel values to the range [0,1] X = (X - X.min()) / (X.max() - X.min()) # plot real images square = int(sqrt(n_samples)) for i in range(n_samples): pyplot.subplot(square, square, 1 + i) pyplot.axis('off') pyplot.imshow(X[i]) # save plot to file filename1 = 'plot_%s.png' % (name) pyplot.savefig(filename1) pyplot.close() # save the generator model filename2 = 'model_%s.h5' % (name) g_model.save(filename2) print('>Saved: %s and %s' % (filename1, filename2))

The *train()* function below pulls this together, taking the lists of defined models as input as well as the list of batch sizes and the number of training epochs for the normal and fade-in phases at each level of growth for the model.

The first generator and discriminator model for 4×4 images are fit by calling *train_epochs()* and saved by calling *summarize_performance()*.

Then the steps of growth are enumerated, involving first scaling the image dataset to the preferred size, training and saving the fade-in model for the new image size, then training and saving the normal or fine-tuned model for the new image size.

# train the generator and discriminator def train(g_models, d_models, gan_models, dataset, latent_dim, e_norm, e_fadein, n_batch): # fit the baseline model g_normal, d_normal, gan_normal = g_models[0][0], d_models[0][0], gan_models[0][0] # scale dataset to appropriate size gen_shape = g_normal.output_shape scaled_data = scale_dataset(dataset, gen_shape[1:]) print('Scaled Data', scaled_data.shape) # train normal or straight-through models train_epochs(g_normal, d_normal, gan_normal, scaled_data, e_norm[0], n_batch[0]) summarize_performance('tuned', g_normal, latent_dim) # process each level of growth for i in range(1, len(g_models)): # retrieve models for this level of growth [g_normal, g_fadein] = g_models[i] [d_normal, d_fadein] = d_models[i] [gan_normal, gan_fadein] = gan_models[i] # scale dataset to appropriate size gen_shape = g_normal.output_shape scaled_data = scale_dataset(dataset, gen_shape[1:]) print('Scaled Data', scaled_data.shape) # train fade-in models for next level of growth train_epochs(g_fadein, d_fadein, gan_fadein, scaled_data, e_fadein[i], n_batch[i], True) summarize_performance('faded', g_fadein, latent_dim) # train normal or straight-through models train_epochs(g_normal, d_normal, gan_normal, scaled_data, e_norm[i], n_batch[i]) summarize_performance('tuned', g_normal, latent_dim)

We can then define the configuration, models, and call *train()* to start the training process.

The paper recommends using a batch size of 16 for images sized between 4×4 and 128×128 before reducing the size. It also recommends training each phase for about 800K images. The paper also recommends a latent space of 512 dimensions.

The models are defined with six levels of growth to meet the 128×128 pixel size of our dataset. We also shrink the latent space accordingly to 100 dimensions.

Instead of keeping the batch size and number of epochs constant, we vary it to speed up the training process, using larger batch sizes for early training phases and smaller batch sizes for later training phases for fine-tuning and stability. Additionally, fewer training epochs are used for the smaller models and more epochs for the larger models.

The choice of batch sizes and training epochs is somewhat arbitrary and you may want to experiment with different values and review their effects.

# number of growth phases, e.g. 6 == [4, 8, 16, 32, 64, 128] n_blocks = 6 # size of the latent space latent_dim = 100 # define models d_models = define_discriminator(n_blocks) # define models g_models = define_generator(latent_dim, n_blocks) # define composite models gan_models = define_composite(d_models, g_models) # load image data dataset = load_real_samples('img_align_celeba_128.npz') print('Loaded', dataset.shape) # train model n_batch = [16, 16, 16, 8, 4, 4] # 10 epochs == 500K images per training phase n_epochs = [5, 8, 8, 10, 10, 10] train(g_models, d_models, gan_models, dataset, latent_dim, n_epochs, n_epochs, n_batch)

We can tie all of this together.

The complete example of training a progressive growing generative adversarial network on the celebrity faces dataset is listed below.

# example of progressive growing gan on celebrity faces dataset from math import sqrt from numpy import load from numpy import asarray from numpy import zeros from numpy import ones from numpy.random import randn from numpy.random import randint from skimage.transform import resize from keras.optimizers import Adam from keras.models import Sequential from keras.models import Model from keras.layers import Input from keras.layers import Dense from keras.layers import Flatten from keras.layers import Reshape from keras.layers import Conv2D from keras.layers import UpSampling2D from keras.layers import AveragePooling2D from keras.layers import LeakyReLU from keras.layers import Layer from keras.layers import Add from keras.constraints import max_norm from keras.initializers import RandomNormal from keras import backend from matplotlib import pyplot # pixel-wise feature vector normalization layer class PixelNormalization(Layer): # initialize the layer def __init__(self, **kwargs): super(PixelNormalization, self).__init__(**kwargs) # perform the operation def call(self, inputs): # calculate square pixel values values = inputs**2.0 # calculate the mean pixel values mean_values = backend.mean(values, axis=-1, keepdims=True) # ensure the mean is not zero mean_values += 1.0e-8 # calculate the sqrt of the mean squared value (L2 norm) l2 = backend.sqrt(mean_values) # normalize values by the l2 norm normalized = inputs / l2 return normalized # define the output shape of the layer def compute_output_shape(self, input_shape): return input_shape # mini-batch standard deviation layer class MinibatchStdev(Layer): # initialize the layer def __init__(self, **kwargs): super(MinibatchStdev, self).__init__(**kwargs) # perform the operation def call(self, inputs): # calculate the mean value for each pixel across channels mean = backend.mean(inputs, axis=0, keepdims=True) # calculate the squared differences between pixel values and mean squ_diffs = backend.square(inputs - mean) # calculate the average of the squared differences (variance) mean_sq_diff = backend.mean(squ_diffs, axis=0, keepdims=True) # add a small value to avoid a blow-up when we calculate stdev mean_sq_diff += 1e-8 # square root of the variance (stdev) stdev = backend.sqrt(mean_sq_diff) # calculate the mean standard deviation across each pixel coord mean_pix = backend.mean(stdev, keepdims=True) # scale this up to be the size of one input feature map for each sample shape = backend.shape(inputs) output = backend.tile(mean_pix, (shape[0], shape[1], shape[2], 1)) # concatenate with the output combined = backend.concatenate([inputs, output], axis=-1) return combined # define the output shape of the layer def compute_output_shape(self, input_shape): # create a copy of the input shape as a list input_shape = list(input_shape) # add one to the channel dimension (assume channels-last) input_shape[-1] += 1 # convert list to a tuple return tuple(input_shape) # weighted sum output class WeightedSum(Add): # init with default value def __init__(self, alpha=0.0, **kwargs): super(WeightedSum, self).__init__(**kwargs) self.alpha = backend.variable(alpha, name='ws_alpha') # output a weighted sum of inputs def _merge_function(self, inputs): # only supports a weighted sum of two inputs assert (len(inputs) == 2) # ((1-a) * input1) + (a * input2) output = ((1.0 - self.alpha) * inputs[0]) + (self.alpha * inputs[1]) return output # calculate wasserstein loss def wasserstein_loss(y_true, y_pred): return backend.mean(y_true * y_pred) # add a discriminator block def add_discriminator_block(old_model, n_input_layers=3): # weight initialization init = RandomNormal(stddev=0.02) # weight constraint const = max_norm(1.0) # get shape of existing model in_shape = list(old_model.input.shape) # define new input shape as double the size input_shape = (in_shape[-2].value*2, in_shape[-2].value*2, in_shape[-1].value) in_image = Input(shape=input_shape) # define new input processing layer d = Conv2D(128, (1,1), padding='same', kernel_initializer=init, kernel_constraint=const)(in_image) d = LeakyReLU(alpha=0.2)(d) # define new block d = Conv2D(128, (3,3), padding='same', kernel_initializer=init, kernel_constraint=const)(d) d = LeakyReLU(alpha=0.2)(d) d = Conv2D(128, (3,3), padding='same', kernel_initializer=init, kernel_constraint=const)(d) d = LeakyReLU(alpha=0.2)(d) d = AveragePooling2D()(d) block_new = d # skip the input, 1x1 and activation for the old model for i in range(n_input_layers, len(old_model.layers)): d = old_model.layers[i](d) # define straight-through model model1 = Model(in_image, d) # compile model model1.compile(loss=wasserstein_loss, optimizer=Adam(lr=0.001, beta_1=0, beta_2=0.99, epsilon=10e-8)) # downsample the new larger image downsample = AveragePooling2D()(in_image) # connect old input processing to downsampled new input block_old = old_model.layers[1](downsample) block_old = old_model.layers[2](block_old) # fade in output of old model input layer with new input d = WeightedSum()([block_old, block_new]) # skip the input, 1x1 and activation for the old model for i in range(n_input_layers, len(old_model.layers)): d = old_model.layers[i](d) # define straight-through model model2 = Model(in_image, d) # compile model model2.compile(loss=wasserstein_loss, optimizer=Adam(lr=0.001, beta_1=0, beta_2=0.99, epsilon=10e-8)) return [model1, model2] # define the discriminator models for each image resolution def define_discriminator(n_blocks, input_shape=(4,4,3)): # weight initialization init = RandomNormal(stddev=0.02) # weight constraint const = max_norm(1.0) model_list = list() # base model input in_image = Input(shape=input_shape) # conv 1x1 d = Conv2D(128, (1,1), padding='same', kernel_initializer=init, kernel_constraint=const)(in_image) d = LeakyReLU(alpha=0.2)(d) # conv 3x3 (output block) d = MinibatchStdev()(d) d = Conv2D(128, (3,3), padding='same', kernel_initializer=init, kernel_constraint=const)(d) d = LeakyReLU(alpha=0.2)(d) # conv 4x4 d = Conv2D(128, (4,4), padding='same', kernel_initializer=init, kernel_constraint=const)(d) d = LeakyReLU(alpha=0.2)(d) # dense output layer d = Flatten()(d) out_class = Dense(1)(d) # define model model = Model(in_image, out_class) # compile model model.compile(loss=wasserstein_loss, optimizer=Adam(lr=0.001, beta_1=0, beta_2=0.99, epsilon=10e-8)) # store model model_list.append([model, model]) # create submodels for i in range(1, n_blocks): # get prior model without the fade-on old_model = model_list[i - 1][0] # create new model for next resolution models = add_discriminator_block(old_model) # store model model_list.append(models) return model_list # add a generator block def add_generator_block(old_model): # weight initialization init = RandomNormal(stddev=0.02) # weight constraint const = max_norm(1.0) # get the end of the last block block_end = old_model.layers[-2].output # upsample, and define new block upsampling = UpSampling2D()(block_end) g = Conv2D(128, (3,3), padding='same', kernel_initializer=init, kernel_constraint=const)(upsampling) g = PixelNormalization()(g) g = LeakyReLU(alpha=0.2)(g) g = Conv2D(128, (3,3), padding='same', kernel_initializer=init, kernel_constraint=const)(g) g = PixelNormalization()(g) g = LeakyReLU(alpha=0.2)(g) # add new output layer out_image = Conv2D(3, (1,1), padding='same', kernel_initializer=init, kernel_constraint=const)(g) # define model model1 = Model(old_model.input, out_image) # get the output layer from old model out_old = old_model.layers[-1] # connect the upsampling to the old output layer out_image2 = out_old(upsampling) # define new output image as the weighted sum of the old and new models merged = WeightedSum()([out_image2, out_image]) # define model model2 = Model(old_model.input, merged) return [model1, model2] # define generator models def define_generator(latent_dim, n_blocks, in_dim=4): # weight initialization init = RandomNormal(stddev=0.02) # weight constraint const = max_norm(1.0) model_list = list() # base model latent input in_latent = Input(shape=(latent_dim,)) # linear scale up to activation maps g = Dense(128 * in_dim * in_dim, kernel_initializer=init, kernel_constraint=const)(in_latent) g = Reshape((in_dim, in_dim, 128))(g) # conv 4x4, input block g = Conv2D(128, (3,3), padding='same', kernel_initializer=init, kernel_constraint=const)(g) g = PixelNormalization()(g) g = LeakyReLU(alpha=0.2)(g) # conv 3x3 g = Conv2D(128, (3,3), padding='same', kernel_initializer=init, kernel_constraint=const)(g) g = PixelNormalization()(g) g = LeakyReLU(alpha=0.2)(g) # conv 1x1, output block out_image = Conv2D(3, (1,1), padding='same', kernel_initializer=init, kernel_constraint=const)(g) # define model model = Model(in_latent, out_image) # store model model_list.append([model, model]) # create submodels for i in range(1, n_blocks): # get prior model without the fade-on old_model = model_list[i - 1][0] # create new model for next resolution models = add_generator_block(old_model) # store model model_list.append(models) return model_list # define composite models for training generators via discriminators def define_composite(discriminators, generators): model_list = list() # create composite models for i in range(len(discriminators)): g_models, d_models = generators[i], discriminators[i] # straight-through model d_models[0].trainable = False model1 = Sequential() model1.add(g_models[0]) model1.add(d_models[0]) model1.compile(loss=wasserstein_loss, optimizer=Adam(lr=0.001, beta_1=0, beta_2=0.99, epsilon=10e-8)) # fade-in model d_models[1].trainable = False model2 = Sequential() model2.add(g_models[1]) model2.add(d_models[1]) model2.compile(loss=wasserstein_loss, optimizer=Adam(lr=0.001, beta_1=0, beta_2=0.99, epsilon=10e-8)) # store model_list.append([model1, model2]) return model_list # load dataset def load_real_samples(filename): # load dataset data = load(filename) # extract numpy array X = data['arr_0'] # convert from ints to floats X = X.astype('float32') # scale from [0,255] to [-1,1] X = (X - 127.5) / 127.5 return X # select real samples def generate_real_samples(dataset, n_samples): # choose random instances ix = randint(0, dataset.shape[0], n_samples) # select images X = dataset[ix] # generate class labels y = ones((n_samples, 1)) return X, y # generate points in latent space as input for the generator def generate_latent_points(latent_dim, n_samples): # generate points in the latent space x_input = randn(latent_dim * n_samples) # reshape into a batch of inputs for the network x_input = x_input.reshape(n_samples, latent_dim) return x_input # use the generator to generate n fake examples, with class labels def generate_fake_samples(generator, latent_dim, n_samples): # generate points in latent space x_input = generate_latent_points(latent_dim, n_samples) # predict outputs X = generator.predict(x_input) # create class labels y = -ones((n_samples, 1)) return X, y # update the alpha value on each instance of WeightedSum def update_fadein(models, step, n_steps): # calculate current alpha (linear from 0 to 1) alpha = step / float(n_steps - 1) # update the alpha for each model for model in models: for layer in model.layers: if isinstance(layer, WeightedSum): backend.set_value(layer.alpha, alpha) # train a generator and discriminator def train_epochs(g_model, d_model, gan_model, dataset, n_epochs, n_batch, fadein=False): # calculate the number of batches per training epoch bat_per_epo = int(dataset.shape[0] / n_batch) # calculate the number of training iterations n_steps = bat_per_epo * n_epochs # calculate the size of half a batch of samples half_batch = int(n_batch / 2) # manually enumerate epochs for i in range(n_steps): # update alpha for all WeightedSum layers when fading in new blocks if fadein: update_fadein([g_model, d_model, gan_model], i, n_steps) # prepare real and fake samples X_real, y_real = generate_real_samples(dataset, half_batch) X_fake, y_fake = generate_fake_samples(g_model, latent_dim, half_batch) # update discriminator model d_loss1 = d_model.train_on_batch(X_real, y_real) d_loss2 = d_model.train_on_batch(X_fake, y_fake) # update the generator via the discriminator's error z_input = generate_latent_points(latent_dim, n_batch) y_real2 = ones((n_batch, 1)) g_loss = gan_model.train_on_batch(z_input, y_real2) # summarize loss on this batch print('>%d, d1=%.3f, d2=%.3f g=%.3f' % (i+1, d_loss1, d_loss2, g_loss)) # scale images to preferred size def scale_dataset(images, new_shape): images_list = list() for image in images: # resize with nearest neighbor interpolation new_image = resize(image, new_shape, 0) # store images_list.append(new_image) return asarray(images_list) # generate samples and save as a plot and save the model def summarize_performance(status, g_model, latent_dim, n_samples=25): # devise name gen_shape = g_model.output_shape name = '%03dx%03d-%s' % (gen_shape[1], gen_shape[2], status) # generate images X, _ = generate_fake_samples(g_model, latent_dim, n_samples) # normalize pixel values to the range [0,1] X = (X - X.min()) / (X.max() - X.min()) # plot real images square = int(sqrt(n_samples)) for i in range(n_samples): pyplot.subplot(square, square, 1 + i) pyplot.axis('off') pyplot.imshow(X[i]) # save plot to file filename1 = 'plot_%s.png' % (name) pyplot.savefig(filename1) pyplot.close() # save the generator model filename2 = 'model_%s.h5' % (name) g_model.save(filename2) print('>Saved: %s and %s' % (filename1, filename2)) # train the generator and discriminator def train(g_models, d_models, gan_models, dataset, latent_dim, e_norm, e_fadein, n_batch): # fit the baseline model g_normal, d_normal, gan_normal = g_models[0][0], d_models[0][0], gan_models[0][0] # scale dataset to appropriate size gen_shape = g_normal.output_shape scaled_data = scale_dataset(dataset, gen_shape[1:]) print('Scaled Data', scaled_data.shape) # train normal or straight-through models train_epochs(g_normal, d_normal, gan_normal, scaled_data, e_norm[0], n_batch[0]) summarize_performance('tuned', g_normal, latent_dim) # process each level of growth for i in range(1, len(g_models)): # retrieve models for this level of growth [g_normal, g_fadein] = g_models[i] [d_normal, d_fadein] = d_models[i] [gan_normal, gan_fadein] = gan_models[i] # scale dataset to appropriate size gen_shape = g_normal.output_shape scaled_data = scale_dataset(dataset, gen_shape[1:]) print('Scaled Data', scaled_data.shape) # train fade-in models for next level of growth train_epochs(g_fadein, d_fadein, gan_fadein, scaled_data, e_fadein[i], n_batch[i], True) summarize_performance('faded', g_fadein, latent_dim) # train normal or straight-through models train_epochs(g_normal, d_normal, gan_normal, scaled_data, e_norm[i], n_batch[i]) summarize_performance('tuned', g_normal, latent_dim) # number of growth phases, e.g. 6 == [4, 8, 16, 32, 64, 128] n_blocks = 6 # size of the latent space latent_dim = 100 # define models d_models = define_discriminator(n_blocks) # define models g_models = define_generator(latent_dim, n_blocks) # define composite models gan_models = define_composite(d_models, g_models) # load image data dataset = load_real_samples('img_align_celeba_128.npz') print('Loaded', dataset.shape) # train model n_batch = [16, 16, 16, 8, 4, 4] # 10 epochs == 500K images per training phase n_epochs = [5, 8, 8, 10, 10, 10] train(g_models, d_models, gan_models, dataset, latent_dim, n_epochs, n_epochs, n_batch)

**Note**: The example can be run on the CPU, although a GPU is recommended.

Running the example may take a number of hours to complete on modern GPU hardware.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

If loss values during the training iterations go to zero or very large/small numbers, this may be an example of a failure mode and may require a restart of the training process.

Running the example first reports the successful loading of the prepared dataset and the scaling of the dataset to the first image size, then reports the loss of each model for each step of the training process.

Loaded (50000, 128, 128, 3) Scaled Data (50000, 4, 4, 3) >1, d1=0.993, d2=0.001 g=0.951 >2, d1=0.861, d2=0.118 g=0.982 >3, d1=0.829, d2=0.126 g=0.875 >4, d1=0.774, d2=0.202 g=0.912 >5, d1=0.687, d2=0.035 g=0.911 ...

Plots of generated images and the generator model are saved after each fade-in training phase with filenames like:

*plot_008x008-faded.png**model_008x008-faded.h5*

Plots and models are also saved after each tuning phase, with filenames like:

*plot_008x008-tuned.png**model_008x008-tuned.h5*

Reviewing plots of the generated images at each point helps to see the progression both in the size of supported images and their quality before and after the tuning phase.

For example, below is a sample of images generated after the first 4×4 training phase (*plot_004x004-tuned.png*). At this point, we cannot see much at all.

Reviewing generated images after the fade-in training phase for 8×8 images shows more structure (*plot_008x008-faded.png*). The images are blocky but we can see faces.

Next, we can contrast the generated images for 16×16 after the fade-in training phase (*plot_016x016-faded.png*) and after the tuning training phase (*plot_016x016-tuned.png*).

We can see that the images are clearly faces and we can see that the fine-tuning phase appears to improve the coloring or tone of the faces and perhaps the structure.

Finally, we can review generated faces after tuning for the remaining 32×32, 64×64, and 128×128 resolutions. We can see that each step in resolution, the image quality is improved, allowing the model to fill in more structure and detail.

Although not perfect, the generated images show that the progressive growing GAN is capable of not only generating plausible human faces at different resolutions, but it is able to scale building upon what was learned at lower resolutions to generate plausible faces at higher resolutions.

Now that we have seen how the generator models can be fit, next we can see how we might load and use a saved generator model.

In this section, we will explore how to load a generator model and use it to generate synthetic images on demand.

The saved Keras models can be loaded via the *load_model()* function.

Because the generator models use custom layers, we must specify how to load the custom layers. This is achieved by providing a dict to the load_model() function that maps each of the custom layer names to the appropriate class.

... # load model cust = {'PixelNormalization': PixelNormalization, 'MinibatchStdev': MinibatchStdev, 'WeightedSum': WeightedSum} model = load_model('model_016x016-tuned.h5', cust)

We can then use the *generate_latent_points()* function from the previous section to generate points in latent space as input for the generator model.

... # size of the latent space latent_dim = 100 # number of images to generate n_images = 25 # generate images latent_points = generate_latent_points(latent_dim, n_images) # generate images X = model.predict(latent_points)

We can then plot the results by first scaling the pixel values to the range [0,1] and plotting each image, in this case in a square grid pattern.

# create a plot of generated images def plot_generated(images, n_images): # plot images square = int(sqrt(n_images)) # normalize pixel values to the range [0,1] images = (images - images.min()) / (images.max() - images.min()) for i in range(n_images): # define subplot pyplot.subplot(square, square, 1 + i) # turn off axis pyplot.axis('off') # plot raw pixel data pyplot.imshow(images[i]) pyplot.show()

Tying this together, the complete example of loading a saved progressive growing GAN generator model and using it to generate new faces is listed below.

In this case, we demonstrate loading the tuned model for generating 16×16 faces.

# example of loading the generator model and generating images from math import sqrt from numpy import asarray from numpy.random import randn from numpy.random import randint from keras.layers import Layer from keras.layers import Add from keras import backend from keras.models import load_model from matplotlib import pyplot # pixel-wise feature vector normalization layer class PixelNormalization(Layer): # initialize the layer def __init__(self, **kwargs): super(PixelNormalization, self).__init__(**kwargs) # perform the operation def call(self, inputs): # calculate square pixel values values = inputs**2.0 # calculate the mean pixel values mean_values = backend.mean(values, axis=-1, keepdims=True) # ensure the mean is not zero mean_values += 1.0e-8 # calculate the sqrt of the mean squared value (L2 norm) l2 = backend.sqrt(mean_values) # normalize values by the l2 norm normalized = inputs / l2 return normalized # define the output shape of the layer def compute_output_shape(self, input_shape): return input_shape # mini-batch standard deviation layer class MinibatchStdev(Layer): # initialize the layer def __init__(self, **kwargs): super(MinibatchStdev, self).__init__(**kwargs) # perform the operation def call(self, inputs): # calculate the mean value for each pixel across channels mean = backend.mean(inputs, axis=0, keepdims=True) # calculate the squared differences between pixel values and mean squ_diffs = backend.square(inputs - mean) # calculate the average of the squared differences (variance) mean_sq_diff = backend.mean(squ_diffs, axis=0, keepdims=True) # add a small value to avoid a blow-up when we calculate stdev mean_sq_diff += 1e-8 # square root of the variance (stdev) stdev = backend.sqrt(mean_sq_diff) # calculate the mean standard deviation across each pixel coord mean_pix = backend.mean(stdev, keepdims=True) # scale this up to be the size of one input feature map for each sample shape = backend.shape(inputs) output = backend.tile(mean_pix, (shape[0], shape[1], shape[2], 1)) # concatenate with the output combined = backend.concatenate([inputs, output], axis=-1) return combined # define the output shape of the layer def compute_output_shape(self, input_shape): # create a copy of the input shape as a list input_shape = list(input_shape) # add one to the channel dimension (assume channels-last) input_shape[-1] += 1 # convert list to a tuple return tuple(input_shape) # weighted sum output class WeightedSum(Add): # init with default value def __init__(self, alpha=0.0, **kwargs): super(WeightedSum, self).__init__(**kwargs) self.alpha = backend.variable(alpha, name='ws_alpha') # output a weighted sum of inputs def _merge_function(self, inputs): # only supports a weighted sum of two inputs assert (len(inputs) == 2) # ((1-a) * input1) + (a * input2) output = ((1.0 - self.alpha) * inputs[0]) + (self.alpha * inputs[1]) return output # generate points in latent space as input for the generator def generate_latent_points(latent_dim, n_samples): # generate points in the latent space x_input = randn(latent_dim * n_samples) # reshape into a batch of inputs for the network z_input = x_input.reshape(n_samples, latent_dim) return z_input # create a plot of generated images def plot_generated(images, n_images): # plot images square = int(sqrt(n_images)) # normalize pixel values to the range [0,1] images = (images - images.min()) / (images.max() - images.min()) for i in range(n_images): # define subplot pyplot.subplot(square, square, 1 + i) # turn off axis pyplot.axis('off') # plot raw pixel data pyplot.imshow(images[i]) pyplot.show() # load model cust = {'PixelNormalization': PixelNormalization, 'MinibatchStdev': MinibatchStdev, 'WeightedSum': WeightedSum} model = load_model('model_016x016-tuned.h5', cust) # size of the latent space latent_dim = 100 # number of images to generate n_images = 25 # generate images latent_points = generate_latent_points(latent_dim, n_images) # generate images X = model.predict(latent_points) # plot the result plot_generated(X, n_images)

Running the example loads the model and generates 25 faces that are plotted in a 5×5 grid.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

We can then change the filename to a different model, such as the tuned model for generating 128×128 faces.

... model = load_model('model_128x128-tuned.h5', cust)

Re-running the example generates a plot of higher-resolution synthetic faces.

This section lists some ideas for extending the tutorial that you may wish to explore.

**Change Alpha via Callback**. Update the example to use a Keras callback to update the alpha value for the WeightedSum layers during fade-in training.**Pre-Scale Dataset**. Update the example to pre-scale each dataset and save each version to file to be loaded when needed during training.**Equalized Learning Rate**. Update the example to implement the equalized learning rate weight scaling method described in the paper.**Progression in Number of Filters**. Update the example to decrease the number of filters with depth in the generator and increase the number of filters with depth in the discriminator to match the configuration in the paper.**Larger Image Size**. Update the example to generate large image sizes, such as 512×512.

If you explore any of these extensions, I’d love to know.

Post your findings in the comments below.

This section provides more resources on the topic if you are looking to go deeper.

- Progressive Growing of GANs for Improved Quality, Stability, and Variation, 2017.
- Progressive Growing of GANs for Improved Quality, Stability, and Variation, Official.
- progressive_growing_of_gans Project (official), GitHub.
- Progressive Growing of GANs for Improved Quality, Stability, and Variation. Open Review.
- Progressive Growing of GANs for Improved Quality, Stability, and Variation, YouTube.
- Progressive growing of GANs for improved quality, stability and variation, KeyNote, YouTube.

- Keras Datasets API.
- Keras Sequential Model API
- Keras Convolutional Layers API
- How can I “freeze” Keras layers?
- Keras Contrib Project
- skimage.transform.resize API

- Keras-progressive_growing_of_gans Project, GitHub.
- Hands-On-Generative-Adversarial-Networks-with-Keras Project, GitHub.

In this tutorial, you discovered how to implement and train a progressive growing generative adversarial network for generating celebrity faces.

Specifically, you learned:

- How to prepare the celebrity faces dataset for training a progressive growing GAN model.
- How to define and train the progressive growing GAN on the celebrity faces dataset.
- How to load saved generator models and use them for generating ad hoc synthetic celebrity faces.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How to Train a Progressive Growing GAN in Keras for Synthesizing Faces appeared first on MachineLearningMastery.com.

]]>The post How to Implement Progressive Growing GAN Models in Keras appeared first on MachineLearningMastery.com.

]]>It is an extension of the more traditional GAN architecture that involves incrementally growing the size of the generated image during training, starting with a very small image, such as a 4×4 pixels. This allows the stable training and growth of GAN models capable of generating very large high-quality images, such as images of synthetic celebrity faces with the size of 1024×1024 pixels.

In this tutorial, you will discover how to develop progressive growing generative adversarial network models from scratch with Keras.

After completing this tutorial, you will know:

- How to develop pre-defined discriminator and generator models at each level of output image growth.
- How to define composite models for training the generator models via the discriminator models.
- How to cycle the training of fade-in version and normal versions of models at each level of output image growth.

**Kick-start your project** with my new book Generative Adversarial Networks with Python, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

This tutorial is divided into five parts; they are:

- What Is the Progressive Growing GAN Architecture?
- How to Implement the Progressive Growing GAN Discriminator Model
- How to Implement the Progressive Growing GAN Generator Model
- How to Implement Composite Models for Updating the Generator
- How to Train Discriminator and Generator Models

GANs are effective at generating crisp synthetic images, although are typically limited in the size of the images that can be generated.

The Progressive Growing GAN is an extension to the GAN that allows the training of generator models capable of outputting large high-quality images, such as photorealistic faces with the size 1024×1024 pixels. It was described in the 2017 paper by Tero Karras, et al. from Nvidia titled “Progressive Growing of GANs for Improved Quality, Stability, and Variation.”

The key innovation of the Progressive Growing GAN is the incremental increase in the size of images output by the generator starting with a 4×4 pixel image and double to 8×8, 16×16, and so on until the desired output resolution.

Our primary contribution is a training methodology for GANs where we start with low-resolution images, and then progressively increase the resolution by adding layers to the networks.

— Progressive Growing of GANs for Improved Quality, Stability, and Variation, 2017.

This is achieved by a training procedure that involves periods of fine-tuning the model with a given output resolution, and periods of slowly phasing in a new model with a larger resolution.

When doubling the resolution of the generator (G) and discriminator (D) we fade in the new layers smoothly

— Progressive Growing of GANs for Improved Quality, Stability, and Variation, 2017.

All layers remain trainable during the training process, including existing layers when new layers are added.

All existing layers in both networks remain trainable throughout the training process.

— Progressive Growing of GANs for Improved Quality, Stability, and Variation, 2017.

Progressive Growing GAN involves using a generator and discriminator model with the same general structure and starting with very small images. During training, new blocks of convolutional layers are systematically added to both the generator model and the discriminator models.

The incremental addition of the layers allows the models to effectively learn coarse-level detail and later learn ever finer detail, both on the generator and discriminator side.

This incremental nature allows the training to first discover the large-scale structure of the image distribution and then shift attention to increasingly finer-scale detail, instead of having to learn all scales simultaneously.

— Progressive Growing of GANs for Improved Quality, Stability, and Variation, 2017.

The model architecture is complex and cannot be implemented directly.

In this tutorial, we will focus on how the progressive growing GAN can be implemented using the Keras deep learning library.

We will step through how each of the discriminator and generator models can be defined, how the generator can be trained via the discriminator model, and how each model can be updated during the training process.

These implementation details will provide the basis for you developing a progressive growing GAN for your own applications.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

The discriminator model is given images as input and must classify them as either real (from the dataset) or fake (generated).

During the training process, the discriminator must grow to support images with ever-increasing size, starting with 4×4 pixel color images and doubling to 8×8, 16×16, 32×32, and so on.

This is achieved by inserting a new input layer to support the larger input image followed by a new block of layers. The output of this new block is then downsampled. Additionally, the new image is also downsampled directly and passed through the old input processing layer before it is combined with the output of the new block.

During the transition from a lower resolution to a higher resolution, e.g. 16×16 to 32×32, the discriminator model will have two input pathways as follows:

- [32×32 Image] -> [fromRGB Conv] -> [NewBlock] -> [Downsample] ->
- [32×32 Image] -> [Downsample] -> [fromRGB Conv] ->

The output of the new block that is downsampled and the output of the old input processing layer are combined using a weighted average, where the weighting is controlled by a new hyperparameter called *alpha*. The weighted sum is calculated as follows:

- Output = ((1 – alpha) * fromRGB) + (alpha * NewBlock)

The weighted average of the two pathways is then fed into the rest of the existing model.

Initially, the weighting is completely biased towards the old input processing layer (*alpha=0*) and is linearly increased over training iterations so that the new block is given more weight until eventually, the output is entirely the product of the new block (*alpha=1*). At this time, the old pathway can be removed.

This can be summarized with the following figure taken from the paper showing a model before growing (a), during the phase-in of the larger resolution (b), and the model after the phase-in (c).

The *fromRGB* layers are implemented as a 1×1 convolutional layer. A block is comprised of two convolutional layers with 3×3 sized filters and the leaky ReLU activation function with a slope of 0.2, followed by a downsampling layer. Average pooling is used for downsampling, which is unlike most other GAN models that use transpose convolutional layers.

The output of the model involves two convolutional layers with 3×3 and 4×4 sized filters and Leaky ReLU activation, followed by a fully connected layer that outputs the single value prediction. The model uses a linear activation function instead of a sigmoid activation function like other discriminator models and is trained directly either by Wasserstein loss (specifically WGAN-GP) or least squares loss; we will use the latter in this tutorial. Model weights are initialized using He Gaussian (he_normal), which is very similar to the method used in the paper.

The model uses a custom layer called Minibatch standard deviation at the beginning of the output block, and instead of batch normalization, each layer uses local response normalization, referred to as pixel-wise normalization in the paper. We will leave out the minibatch normalization and use batch normalization in this tutorial for brevity.

One approach to implementing the progressive growing GAN would be to manually expand a model on demand during training. Another approach is to pre-define all of the models prior to training and carefully use the Keras functional API to ensure that layers are shared across the models and continue training.

I believe the latter approach might be easier and is the approach we will use in this tutorial.

First, we must define a custom layer that we can use when fading in a new higher-resolution input image and block. This new layer must take two sets of activation maps with the same dimensions (width, height, channels) and add them together using a weighted sum.

We can implement this as a new layer called *WeightedSum* that extends the *Add* merge layer and uses a hyperparameter ‘*alpha*‘ to control the contribution of each input. This new class is defined below. The layer assumes only two inputs: the first for the output of the old or existing layers and the second for the newly added layers. The new hyperparameter is defined as a backend variable, meaning that we can change it any time via changing the value of the variable.

# weighted sum output class WeightedSum(Add): # init with default value def __init__(self, alpha=0.0, **kwargs): super(WeightedSum, self).__init__(**kwargs) self.alpha = backend.variable(alpha, name='ws_alpha') # output a weighted sum of inputs def _merge_function(self, inputs): # only supports a weighted sum of two inputs assert (len(inputs) == 2) # ((1-a) * input1) + (a * input2) output = ((1.0 - self.alpha) * inputs[0]) + (self.alpha * inputs[1]) return output

The discriminator model is by far more complex than the generator to grow because we have to change the model input, so let’s step through this slowly.

Firstly, we can define a discriminator model that takes a 4×4 color image as input and outputs a prediction of whether the image is real or fake. The model is comprised of a 1×1 input processing layer (fromRGB) and an output block.

... # base model input in_image = Input(shape=(4,4,3)) # conv 1x1 g = Conv2D(64, (1,1), padding='same', kernel_initializer='he_normal')(in_image) g = LeakyReLU(alpha=0.2)(g) # conv 3x3 (output block) g = Conv2D(128, (3,3), padding='same', kernel_initializer='he_normal')(g) g = BatchNormalization()(g) g = LeakyReLU(alpha=0.2)(g) # conv 4x4 g = Conv2D(128, (4,4), padding='same', kernel_initializer='he_normal')(g) g = BatchNormalization()(g) g = LeakyReLU(alpha=0.2)(g) # dense output layer g = Flatten()(g) out_class = Dense(1)(g) # define model model = Model(in_image, out_class) # compile model model.compile(loss='mse', optimizer=Adam(lr=0.001, beta_1=0, beta_2=0.99, epsilon=10e-8))

Next, we need to define a new model that handles the intermediate stage between this model and a new discriminator model that takes 8×8 color images as input.

The existing input processing layer must receive a downsampled version of the new 8×8 image. A new input process layer must be defined that takes the 8×8 input image and passes it through a new block of two convolutional layers and a downsampling layer. The output of the new block after downsampling and the old input processing layer must be added together using a weighted sum via our new *WeightedSum* layer and then must reuse the same output block (two convolutional layers and the output layer).

Given the first defined model and our knowledge about this model (e.g. the number of layers in the input processing layer is 2 for the Conv2D and LeakyReLU), we can construct this new intermediate or fade-in model using layer indexes from the old model.

... old_model = model # get shape of existing model in_shape = list(old_model.input.shape) # define new input shape as double the size input_shape = (in_shape[-2].value*2, in_shape[-2].value*2, in_shape[-1].value) in_image = Input(shape=input_shape) # define new input processing layer g = Conv2D(64, (1,1), padding='same', kernel_initializer='he_normal')(in_image) g = LeakyReLU(alpha=0.2)(g) # define new block g = Conv2D(64, (3,3), padding='same', kernel_initializer='he_normal')(g) g = BatchNormalization()(g) g = LeakyReLU(alpha=0.2)(g) g = Conv2D(64, (3,3), padding='same', kernel_initializer='he_normal')(g) g = BatchNormalization()(g) g = LeakyReLU(alpha=0.2)(g) g = AveragePooling2D()(g) # downsample the new larger image downsample = AveragePooling2D()(in_image) # connect old input processing to downsampled new input block_old = old_model.layers[1](downsample) block_old = old_model.layers[2](block_old) # fade in output of old model input layer with new input g = WeightedSum()([block_old, g]) # skip the input, 1x1 and activation for the old model for i in range(3, len(old_model.layers)): g = old_model.layers[i](g) # define straight-through model model = Model(in_image, g) # compile model model.compile(loss='mse', optimizer=Adam(lr=0.001, beta_1=0, beta_2=0.99, epsilon=10e-8))

So far, so good.

We also need a version of the same model with the same layers without the fade-in of the input from the old model’s input processing layers.

This straight-through version is required for training before we fade-in the next doubling of the input image size.

We can update the above example to create two versions of the model. First, the straight-through version as it is simpler, then the version used for the fade-in that reuses the layers from the new block and the output layers of the old model.

The *add_discriminator_block()* function below implements this, returning a list of the two defined models (straight-through and fade-in), and takes the old model as an argument and defines the number of input layers as a default argument (3).

To ensure that the *WeightedSum* layer works correctly, we have fixed all convolutional layers to always have 64 filters, and in turn, output 64 feature maps. If there is a mismatch between the old model’s input processing layer and the new blocks output in terms of the number of feature maps (channels), then the weighted sum will fail.

# add a discriminator block def add_discriminator_block(old_model, n_input_layers=3): # get shape of existing model in_shape = list(old_model.input.shape) # define new input shape as double the size input_shape = (in_shape[-2].value*2, in_shape[-2].value*2, in_shape[-1].value) in_image = Input(shape=input_shape) # define new input processing layer d = Conv2D(64, (1,1), padding='same', kernel_initializer='he_normal')(in_image) d = LeakyReLU(alpha=0.2)(d) # define new block d = Conv2D(64, (3,3), padding='same', kernel_initializer='he_normal')(d) d = BatchNormalization()(d) d = LeakyReLU(alpha=0.2)(d) d = Conv2D(64, (3,3), padding='same', kernel_initializer='he_normal')(d) d = BatchNormalization()(d) d = LeakyReLU(alpha=0.2)(d) d = AveragePooling2D()(d) block_new = d # skip the input, 1x1 and activation for the old model for i in range(n_input_layers, len(old_model.layers)): d = old_model.layers[i](d) # define straight-through model model1 = Model(in_image, d) # compile model model1.compile(loss='mse', optimizer=Adam(lr=0.001, beta_1=0, beta_2=0.99, epsilon=10e-8)) # downsample the new larger image downsample = AveragePooling2D()(in_image) # connect old input processing to downsampled new input block_old = old_model.layers[1](downsample) block_old = old_model.layers[2](block_old) # fade in output of old model input layer with new input d = WeightedSum()([block_old, block_new]) # skip the input, 1x1 and activation for the old model for i in range(n_input_layers, len(old_model.layers)): d = old_model.layers[i](d) # define straight-through model model2 = Model(in_image, d) # compile model model2.compile(loss='mse', optimizer=Adam(lr=0.001, beta_1=0, beta_2=0.99, epsilon=10e-8)) return [model1, model2]

It is not an elegant function as we have some repetition, but it is readable and will get the job done.

We can then call this function again and again as we double the size of input images. Importantly, the function expects the straight-through version of the prior model as input.

The example below defines a new function called *define_discriminator()* that defines our base model that expects a 4×4 color image as input, then repeatedly adds blocks to create new versions of the discriminator model each time that expects images with quadruple the area.

# define the discriminator models for each image resolution def define_discriminator(n_blocks, input_shape=(4,4,3)): model_list = list() # base model input in_image = Input(shape=input_shape) # conv 1x1 d = Conv2D(64, (1,1), padding='same', kernel_initializer='he_normal')(in_image) d = LeakyReLU(alpha=0.2)(d) # conv 3x3 (output block) d = Conv2D(128, (3,3), padding='same', kernel_initializer='he_normal')(d) d = BatchNormalization()(d) d = LeakyReLU(alpha=0.2)(d) # conv 4x4 d = Conv2D(128, (4,4), padding='same', kernel_initializer='he_normal')(d) d = BatchNormalization()(d) d = LeakyReLU(alpha=0.2)(d) # dense output layer d = Flatten()(d) out_class = Dense(1)(d) # define model model = Model(in_image, out_class) # compile model model.compile(loss='mse', optimizer=Adam(lr=0.001, beta_1=0, beta_2=0.99, epsilon=10e-8)) # store model model_list.append([model, model]) # create submodels for i in range(1, n_blocks): # get prior model without the fade-on old_model = model_list[i - 1][0] # create new model for next resolution models = add_discriminator_block(old_model) # store model model_list.append(models) return model_list

This function will return a list of models, where each item in the list is a two-element list that contains first the straight-through version of the model at that resolution, and second the fade-in version of the model for that resolution.

We can tie all of this together and define a new “discriminator model” that will grow from 4×4, through to 8×8, and finally to 16×16. This is achieved by passing he *n_blocks* argument to 3 when calling the *define_discriminator()* function, for the creation of three sets of models.

The complete example is listed below.

# example of defining discriminator models for the progressive growing gan from keras.optimizers import Adam from keras.models import Model from keras.layers import Input from keras.layers import Dense from keras.layers import Flatten from keras.layers import Conv2D from keras.layers import AveragePooling2D from keras.layers import LeakyReLU from keras.layers import BatchNormalization from keras.layers import Add from keras.utils.vis_utils import plot_model from keras import backend # weighted sum output class WeightedSum(Add): # init with default value def __init__(self, alpha=0.0, **kwargs): super(WeightedSum, self).__init__(**kwargs) self.alpha = backend.variable(alpha, name='ws_alpha') # output a weighted sum of inputs def _merge_function(self, inputs): # only supports a weighted sum of two inputs assert (len(inputs) == 2) # ((1-a) * input1) + (a * input2) output = ((1.0 - self.alpha) * inputs[0]) + (self.alpha * inputs[1]) return output # add a discriminator block def add_discriminator_block(old_model, n_input_layers=3): # get shape of existing model in_shape = list(old_model.input.shape) # define new input shape as double the size input_shape = (in_shape[-2].value*2, in_shape[-2].value*2, in_shape[-1].value) in_image = Input(shape=input_shape) # define new input processing layer d = Conv2D(64, (1,1), padding='same', kernel_initializer='he_normal')(in_image) d = LeakyReLU(alpha=0.2)(d) # define new block d = Conv2D(64, (3,3), padding='same', kernel_initializer='he_normal')(d) d = BatchNormalization()(d) d = LeakyReLU(alpha=0.2)(d) d = Conv2D(64, (3,3), padding='same', kernel_initializer='he_normal')(d) d = BatchNormalization()(d) d = LeakyReLU(alpha=0.2)(d) d = AveragePooling2D()(d) block_new = d # skip the input, 1x1 and activation for the old model for i in range(n_input_layers, len(old_model.layers)): d = old_model.layers[i](d) # define straight-through model model1 = Model(in_image, d) # compile model model1.compile(loss='mse', optimizer=Adam(lr=0.001, beta_1=0, beta_2=0.99, epsilon=10e-8)) # downsample the new larger image downsample = AveragePooling2D()(in_image) # connect old input processing to downsampled new input block_old = old_model.layers[1](downsample) block_old = old_model.layers[2](block_old) # fade in output of old model input layer with new input d = WeightedSum()([block_old, block_new]) # skip the input, 1x1 and activation for the old model for i in range(n_input_layers, len(old_model.layers)): d = old_model.layers[i](d) # define straight-through model model2 = Model(in_image, d) # compile model model2.compile(loss='mse', optimizer=Adam(lr=0.001, beta_1=0, beta_2=0.99, epsilon=10e-8)) return [model1, model2] # define the discriminator models for each image resolution def define_discriminator(n_blocks, input_shape=(4,4,3)): model_list = list() # base model input in_image = Input(shape=input_shape) # conv 1x1 d = Conv2D(64, (1,1), padding='same', kernel_initializer='he_normal')(in_image) d = LeakyReLU(alpha=0.2)(d) # conv 3x3 (output block) d = Conv2D(128, (3,3), padding='same', kernel_initializer='he_normal')(d) d = BatchNormalization()(d) d = LeakyReLU(alpha=0.2)(d) # conv 4x4 d = Conv2D(128, (4,4), padding='same', kernel_initializer='he_normal')(d) d = BatchNormalization()(d) d = LeakyReLU(alpha=0.2)(d) # dense output layer d = Flatten()(d) out_class = Dense(1)(d) # define model model = Model(in_image, out_class) # compile model model.compile(loss='mse', optimizer=Adam(lr=0.001, beta_1=0, beta_2=0.99, epsilon=10e-8)) # store model model_list.append([model, model]) # create submodels for i in range(1, n_blocks): # get prior model without the fade-on old_model = model_list[i - 1][0] # create new model for next resolution models = add_discriminator_block(old_model) # store model model_list.append(models) return model_list # define models discriminators = define_discriminator(3) # spot check m = discriminators[2][1] m.summary() plot_model(m, to_file='discriminator_plot.png', show_shapes=True, show_layer_names=True)

Running the example first summarizes the fade-in version of the third model showing the 16×16 color image inputs and the single value output.

__________________________________________________________________________________________________ Layer (type) Output Shape Param # Connected to ================================================================================================== input_3 (InputLayer) (None, 16, 16, 3) 0 __________________________________________________________________________________________________ conv2d_7 (Conv2D) (None, 16, 16, 64) 256 input_3[0][0] __________________________________________________________________________________________________ leaky_re_lu_7 (LeakyReLU) (None, 16, 16, 64) 0 conv2d_7[0][0] __________________________________________________________________________________________________ conv2d_8 (Conv2D) (None, 16, 16, 64) 36928 leaky_re_lu_7[0][0] __________________________________________________________________________________________________ batch_normalization_5 (BatchNor (None, 16, 16, 64) 256 conv2d_8[0][0] __________________________________________________________________________________________________ leaky_re_lu_8 (LeakyReLU) (None, 16, 16, 64) 0 batch_normalization_5[0][0] __________________________________________________________________________________________________ conv2d_9 (Conv2D) (None, 16, 16, 64) 36928 leaky_re_lu_8[0][0] __________________________________________________________________________________________________ average_pooling2d_4 (AveragePoo (None, 8, 8, 3) 0 input_3[0][0] __________________________________________________________________________________________________ batch_normalization_6 (BatchNor (None, 16, 16, 64) 256 conv2d_9[0][0] __________________________________________________________________________________________________ conv2d_4 (Conv2D) (None, 8, 8, 64) 256 average_pooling2d_4[0][0] __________________________________________________________________________________________________ leaky_re_lu_9 (LeakyReLU) (None, 16, 16, 64) 0 batch_normalization_6[0][0] __________________________________________________________________________________________________ leaky_re_lu_4 (LeakyReLU) (None, 8, 8, 64) 0 conv2d_4[1][0] __________________________________________________________________________________________________ average_pooling2d_3 (AveragePoo (None, 8, 8, 64) 0 leaky_re_lu_9[0][0] __________________________________________________________________________________________________ weighted_sum_2 (WeightedSum) (None, 8, 8, 64) 0 leaky_re_lu_4[1][0] average_pooling2d_3[0][0] __________________________________________________________________________________________________ conv2d_5 (Conv2D) (None, 8, 8, 64) 36928 weighted_sum_2[0][0] __________________________________________________________________________________________________ batch_normalization_3 (BatchNor (None, 8, 8, 64) 256 conv2d_5[2][0] __________________________________________________________________________________________________ leaky_re_lu_5 (LeakyReLU) (None, 8, 8, 64) 0 batch_normalization_3[2][0] __________________________________________________________________________________________________ conv2d_6 (Conv2D) (None, 8, 8, 64) 36928 leaky_re_lu_5[2][0] __________________________________________________________________________________________________ batch_normalization_4 (BatchNor (None, 8, 8, 64) 256 conv2d_6[2][0] __________________________________________________________________________________________________ leaky_re_lu_6 (LeakyReLU) (None, 8, 8, 64) 0 batch_normalization_4[2][0] __________________________________________________________________________________________________ average_pooling2d_1 (AveragePoo (None, 4, 4, 64) 0 leaky_re_lu_6[2][0] __________________________________________________________________________________________________ conv2d_2 (Conv2D) (None, 4, 4, 128) 73856 average_pooling2d_1[2][0] __________________________________________________________________________________________________ batch_normalization_1 (BatchNor (None, 4, 4, 128) 512 conv2d_2[4][0] __________________________________________________________________________________________________ leaky_re_lu_2 (LeakyReLU) (None, 4, 4, 128) 0 batch_normalization_1[4][0] __________________________________________________________________________________________________ conv2d_3 (Conv2D) (None, 4, 4, 128) 262272 leaky_re_lu_2[4][0] __________________________________________________________________________________________________ batch_normalization_2 (BatchNor (None, 4, 4, 128) 512 conv2d_3[4][0] __________________________________________________________________________________________________ leaky_re_lu_3 (LeakyReLU) (None, 4, 4, 128) 0 batch_normalization_2[4][0] __________________________________________________________________________________________________ flatten_1 (Flatten) (None, 2048) 0 leaky_re_lu_3[4][0] __________________________________________________________________________________________________ dense_1 (Dense) (None, 1) 2049 flatten_1[4][0] ================================================================================================== Total params: 488,449 Trainable params: 487,425 Non-trainable params: 1,024 __________________________________________________________________________________________________

A plot of the same fade-in version of the model is created and saved to file.

**Note**: creating this plot assumes that the pygraphviz and pydot libraries are installed. If this is a problem, comment out the import statement and call to plot_model().

The plot shows the 16×16 input image that is downsampled and passed through the 8×8 input processing layers from the prior model (left). It also shows the addition of the new block (right) and the weighted average that combines both streams of input, before using the existing model layers to continue processing and outputting a prediction.

Now that we have seen how we can define the discriminator models, let’s look at how we can define the generator models.

The generator models for the progressive growing GAN are easier to implement in Keras than the discriminator models.

The reason for this is because each fade-in requires a minor change to the output of the model.

Increasing the resolution of the generator involves first upsampling the output of the end of the last block. This is then connected to the new block and a new output layer for an image that is double the height and width dimensions or quadruple the area. During the phase-in, the upsampling is also connected to the output layer from the old model and the output from both output layers is merged using a weighted average.

After the phase-in is complete, the old output layer is removed.

This can be summarized with the following figure, taken from the paper showing a model before growing (a), during the phase-in of the larger resolution (b), and the model after the phase-in (c).

The toRGB layer is a convolutional layer with 3 1×1 filters, sufficient to output a color image.

The model takes a point in the latent space as input, e.g. such as a 100-element or 512-element vector as described in the paper. This is scaled up to provided the basis for 4×4 activation maps, followed by a convolutional layer with 4×4 filters and another with 3×3 filters. Like the discriminator, LeakyReLU activations are used, as is pixel normalization, which we will substitute with batch normalization for brevity.

A block involves an upsample layer followed by two convolutional layers with 3×3 filters. Upsampling is achieved using a nearest neighbor method (e.g. duplicating input rows and columns) via a UpSampling2D layer instead of the more common transpose convolutional layer.

We can define the baseline model that will take a point in latent space as input and output a 4×4 color image as follows:

... # base model latent input in_latent = Input(shape=(100,)) # linear scale up to activation maps g = Dense(128 * 4 * 4, kernel_initializer='he_normal')(in_latent) g = Reshape((4, 4, 128))(g) # conv 4x4, input block g = Conv2D(128, (3,3), padding='same', kernel_initializer='he_normal')(g) g = BatchNormalization()(g) g = LeakyReLU(alpha=0.2)(g) # conv 3x3 g = Conv2D(128, (3,3), padding='same', kernel_initializer='he_normal')(g) g = BatchNormalization()(g) g = LeakyReLU(alpha=0.2)(g) # conv 1x1, output block out_image = Conv2D(3, (1,1), padding='same', kernel_initializer='he_normal')(g) # define model model = Model(in_latent, out_image)

Next, we need to define a version of the model that uses all of the same input layers, although adds a new block (upsample and 2 convolutional layers) and a new output layer (a 1×1 convolutional layer).

This would be the model after the phase-in to the new output resolution. This can be achieved by using own knowledge about the baseline model and that the end of the last block is the second last layer, e.g. layer at index -2 in the model’s list of layers.

The new model with the addition of a new block and output layer is defined as follows:

... old_model = model # get the end of the last block block_end = old_model.layers[-2].output # upsample, and define new block upsampling = UpSampling2D()(block_end) g = Conv2D(64, (3,3), padding='same', kernel_initializer='he_normal')(upsampling) g = BatchNormalization()(g) g = LeakyReLU(alpha=0.2)(g) g = Conv2D(64, (3,3), padding='same', kernel_initializer='he_normal')(g) g = BatchNormalization()(g) g = LeakyReLU(alpha=0.2)(g) # add new output layer out_image = Conv2D(3, (1,1), padding='same', kernel_initializer='he_normal')(g) # define model model = Model(old_model.input, out_image)

That is pretty straightforward; we have chopped off the old output layer at the end of the last block and grafted on a new block and output layer.

Now we need a version of this new model to use during the fade-in.

This involves connecting the old output layer to the new upsampling layer at the start of the new block and using an instance of our WeightedSum layer defined in the previous section to combine the output of the old and new output layers.

... # get the output layer from old model out_old = old_model.layers[-1] # connect the upsampling to the old output layer out_image2 = out_old(upsampling) # define new output image as the weighted sum of the old and new models merged = WeightedSum()([out_image2, out_image]) # define model model2 = Model(old_model.input, merged)

We can combine the definition of these two operations into a function named *add_generator_block()*, defined below, that will expand a given model and return both the new generator model with the added block (*model1*) and a version of the model with the fading in of the new block with the old output layer (*model2*).

# add a generator block def add_generator_block(old_model): # get the end of the last block block_end = old_model.layers[-2].output # upsample, and define new block upsampling = UpSampling2D()(block_end) g = Conv2D(64, (3,3), padding='same', kernel_initializer='he_normal')(upsampling) g = BatchNormalization()(g) g = LeakyReLU(alpha=0.2)(g) g = Conv2D(64, (3,3), padding='same', kernel_initializer='he_normal')(g) g = BatchNormalization()(g) g = LeakyReLU(alpha=0.2)(g) # add new output layer out_image = Conv2D(3, (1,1), padding='same', kernel_initializer='he_normal')(g) # define model model1 = Model(old_model.input, out_image) # get the output layer from old model out_old = old_model.layers[-1] # connect the upsampling to the old output layer out_image2 = out_old(upsampling) # define new output image as the weighted sum of the old and new models merged = WeightedSum()([out_image2, out_image]) # define model model2 = Model(old_model.input, merged) return [model1, model2]

We can then call this function with our baseline model to create models with one added block and continue to call it with subsequent models to keep adding blocks.

The *define_generator()* function below implements this, taking the size of the latent space and number of blocks to add (models to create).

The baseline model is defined as outputting a color image with the shape 4×4, controlled by the default argument *in_dim*.

# define generator models def define_generator(latent_dim, n_blocks, in_dim=4): model_list = list() # base model latent input in_latent = Input(shape=(latent_dim,)) # linear scale up to activation maps g = Dense(128 * in_dim * in_dim, kernel_initializer='he_normal')(in_latent) g = Reshape((in_dim, in_dim, 128))(g) # conv 4x4, input block g = Conv2D(128, (3,3), padding='same', kernel_initializer='he_normal')(g) g = BatchNormalization()(g) g = LeakyReLU(alpha=0.2)(g) # conv 3x3 g = Conv2D(128, (3,3), padding='same', kernel_initializer='he_normal')(g) g = BatchNormalization()(g) g = LeakyReLU(alpha=0.2)(g) # conv 1x1, output block out_image = Conv2D(3, (1,1), padding='same', kernel_initializer='he_normal')(g) # define model model = Model(in_latent, out_image) # store model model_list.append([model, model]) # create submodels for i in range(1, n_blocks): # get prior model without the fade-on old_model = model_list[i - 1][0] # create new model for next resolution models = add_generator_block(old_model) # store model model_list.append(models) return model_list

We can tie all of this together and define a baseline generator and the addition of two blocks, so three models in total, where a straight-through and fade-in version of each model is defined.

The complete example is listed below.

# example of defining generator models for the progressive growing gan from keras.models import Model from keras.layers import Input from keras.layers import Dense from keras.layers import Reshape from keras.layers import Conv2D from keras.layers import UpSampling2D from keras.layers import LeakyReLU from keras.layers import BatchNormalization from keras.layers import Add from keras.utils.vis_utils import plot_model from keras import backend # weighted sum output class WeightedSum(Add): # init with default value def __init__(self, alpha=0.0, **kwargs): super(WeightedSum, self).__init__(**kwargs) self.alpha = backend.variable(alpha, name='ws_alpha') # output a weighted sum of inputs def _merge_function(self, inputs): # only supports a weighted sum of two inputs assert (len(inputs) == 2) # ((1-a) * input1) + (a * input2) output = ((1.0 - self.alpha) * inputs[0]) + (self.alpha * inputs[1]) return output # add a generator block def add_generator_block(old_model): # get the end of the last block block_end = old_model.layers[-2].output # upsample, and define new block upsampling = UpSampling2D()(block_end) g = Conv2D(64, (3,3), padding='same', kernel_initializer='he_normal')(upsampling) g = BatchNormalization()(g) g = LeakyReLU(alpha=0.2)(g) g = Conv2D(64, (3,3), padding='same', kernel_initializer='he_normal')(g) g = BatchNormalization()(g) g = LeakyReLU(alpha=0.2)(g) # add new output layer out_image = Conv2D(3, (1,1), padding='same', kernel_initializer='he_normal')(g) # define model model1 = Model(old_model.input, out_image) # get the output layer from old model out_old = old_model.layers[-1] # connect the upsampling to the old output layer out_image2 = out_old(upsampling) # define new output image as the weighted sum of the old and new models merged = WeightedSum()([out_image2, out_image]) # define model model2 = Model(old_model.input, merged) return [model1, model2] # define generator models def define_generator(latent_dim, n_blocks, in_dim=4): model_list = list() # base model latent input in_latent = Input(shape=(latent_dim,)) # linear scale up to activation maps g = Dense(128 * in_dim * in_dim, kernel_initializer='he_normal')(in_latent) g = Reshape((in_dim, in_dim, 128))(g) # conv 4x4, input block g = Conv2D(128, (3,3), padding='same', kernel_initializer='he_normal')(g) g = BatchNormalization()(g) g = LeakyReLU(alpha=0.2)(g) # conv 3x3 g = Conv2D(128, (3,3), padding='same', kernel_initializer='he_normal')(g) g = BatchNormalization()(g) g = LeakyReLU(alpha=0.2)(g) # conv 1x1, output block out_image = Conv2D(3, (1,1), padding='same', kernel_initializer='he_normal')(g) # define model model = Model(in_latent, out_image) # store model model_list.append([model, model]) # create submodels for i in range(1, n_blocks): # get prior model without the fade-on old_model = model_list[i - 1][0] # create new model for next resolution models = add_generator_block(old_model) # store model model_list.append(models) return model_list # define models generators = define_generator(100, 3) # spot check m = generators[2][1] m.summary() plot_model(m, to_file='generator_plot.png', show_shapes=True, show_layer_names=True)

The example chooses the fade-in model for the last model to summarize.

Running the example first summarizes a linear list of the layers in the model. We can see that the last model takes a point from the latent space and outputs a 16×16 image.

This matches as our expectations as the baseline model outputs a 4×4 image, adding one block increases this to 8×8, and adding one more block increases this to 16×16.

__________________________________________________________________________________________________ Layer (type) Output Shape Param # Connected to ================================================================================================== input_1 (InputLayer) (None, 100) 0 __________________________________________________________________________________________________ dense_1 (Dense) (None, 2048) 206848 input_1[0][0] __________________________________________________________________________________________________ reshape_1 (Reshape) (None, 4, 4, 128) 0 dense_1[0][0] __________________________________________________________________________________________________ conv2d_1 (Conv2D) (None, 4, 4, 128) 147584 reshape_1[0][0] __________________________________________________________________________________________________ batch_normalization_1 (BatchNor (None, 4, 4, 128) 512 conv2d_1[0][0] __________________________________________________________________________________________________ leaky_re_lu_1 (LeakyReLU) (None, 4, 4, 128) 0 batch_normalization_1[0][0] __________________________________________________________________________________________________ conv2d_2 (Conv2D) (None, 4, 4, 128) 147584 leaky_re_lu_1[0][0] __________________________________________________________________________________________________ batch_normalization_2 (BatchNor (None, 4, 4, 128) 512 conv2d_2[0][0] __________________________________________________________________________________________________ leaky_re_lu_2 (LeakyReLU) (None, 4, 4, 128) 0 batch_normalization_2[0][0] __________________________________________________________________________________________________ up_sampling2d_1 (UpSampling2D) (None, 8, 8, 128) 0 leaky_re_lu_2[0][0] __________________________________________________________________________________________________ conv2d_4 (Conv2D) (None, 8, 8, 64) 73792 up_sampling2d_1[0][0] __________________________________________________________________________________________________ batch_normalization_3 (BatchNor (None, 8, 8, 64) 256 conv2d_4[0][0] __________________________________________________________________________________________________ leaky_re_lu_3 (LeakyReLU) (None, 8, 8, 64) 0 batch_normalization_3[0][0] __________________________________________________________________________________________________ conv2d_5 (Conv2D) (None, 8, 8, 64) 36928 leaky_re_lu_3[0][0] __________________________________________________________________________________________________ batch_normalization_4 (BatchNor (None, 8, 8, 64) 256 conv2d_5[0][0] __________________________________________________________________________________________________ leaky_re_lu_4 (LeakyReLU) (None, 8, 8, 64) 0 batch_normalization_4[0][0] __________________________________________________________________________________________________ up_sampling2d_2 (UpSampling2D) (None, 16, 16, 64) 0 leaky_re_lu_4[0][0] __________________________________________________________________________________________________ conv2d_7 (Conv2D) (None, 16, 16, 64) 36928 up_sampling2d_2[0][0] __________________________________________________________________________________________________ batch_normalization_5 (BatchNor (None, 16, 16, 64) 256 conv2d_7[0][0] __________________________________________________________________________________________________ leaky_re_lu_5 (LeakyReLU) (None, 16, 16, 64) 0 batch_normalization_5[0][0] __________________________________________________________________________________________________ conv2d_8 (Conv2D) (None, 16, 16, 64) 36928 leaky_re_lu_5[0][0] __________________________________________________________________________________________________ batch_normalization_6 (BatchNor (None, 16, 16, 64) 256 conv2d_8[0][0] __________________________________________________________________________________________________ leaky_re_lu_6 (LeakyReLU) (None, 16, 16, 64) 0 batch_normalization_6[0][0] __________________________________________________________________________________________________ conv2d_6 (Conv2D) multiple 195 up_sampling2d_2[0][0] __________________________________________________________________________________________________ conv2d_9 (Conv2D) (None, 16, 16, 3) 195 leaky_re_lu_6[0][0] __________________________________________________________________________________________________ weighted_sum_2 (WeightedSum) (None, 16, 16, 3) 0 conv2d_6[1][0] conv2d_9[0][0] ================================================================================================== Total params: 689,030 Trainable params: 688,006 Non-trainable params: 1,024 __________________________________________________________________________________________________

A plot of the same fade-in version of the model is created and saved to file.

**Note**: creating this plot assumes that the pygraphviz and pydot libraries are installed. If this is a problem, comment out the import statement and call to *plot_model()*.

We can see that the output from the last block passes through an UpSampling2D layer before feeding the added block and a new output layer as well as the old output layer before being merged via a weighted sum into the final output layer.

Now that we have seen how to define the generator models, we can review how the generator models may be updated via the discriminator models.

The discriminator models are trained directly with real and fake images as input and a target value of 0 for fake and 1 for real.

The generator models are not trained directly; instead, they are trained indirectly via the discriminator models, just like a normal GAN model.

We can create a composite model for each level of growth of the model, e.g. pair 4×4 generators and 4×4 discriminators. We can also pair the straight-through models together, and the fade-in models together.

For example, we can retrieve the generator and discriminator models for a given level of growth.

... g_models, d_models = generators[0], discriminators[0]

Then we can use them to create a composite model for training the straight-through generator, where the output of the generator is fed directly to the discriminator in order to classify.

# straight-through model d_models[0].trainable = False model1 = Sequential() model1.add(g_models[0]) model1.add(d_models[0]) model1.compile(loss='mse', optimizer=Adam(lr=0.001, beta_1=0, beta_2=0.99, epsilon=10e-8))

And do the same for the composite model for the fade-in generator.

# fade-in model d_models[1].trainable = False model2 = Sequential() model2.add(g_models[1]) model2.add(d_models[1]) model2.compile(loss='mse', optimizer=Adam(lr=0.001, beta_1=0, beta_2=0.99, epsilon=10e-8))

The function below, named *define_composite()*, automates this; given a list of defined discriminator and generator models, it will create an appropriate composite model for training each generator model.

# define composite models for training generators via discriminators def define_composite(discriminators, generators): model_list = list() # create composite models for i in range(len(discriminators)): g_models, d_models = generators[i], discriminators[i] # straight-through model d_models[0].trainable = False model1 = Sequential() model1.add(g_models[0]) model1.add(d_models[0]) model1.compile(loss='mse', optimizer=Adam(lr=0.001, beta_1=0, beta_2=0.99, epsilon=10e-8)) # fade-in model d_models[1].trainable = False model2 = Sequential() model2.add(g_models[1]) model2.add(d_models[1]) model2.compile(loss='mse', optimizer=Adam(lr=0.001, beta_1=0, beta_2=0.99, epsilon=10e-8)) # store model_list.append([model1, model2]) return model_list

Tying this together with the definition of the discriminator and generator models above, the complete example of defining all models at each pre-defined level of growth is listed below.

# example of defining composite models for the progressive growing gan from keras.optimizers import Adam from keras.models import Sequential from keras.models import Model from keras.layers import Input from keras.layers import Dense from keras.layers import Flatten from keras.layers import Reshape from keras.layers import Conv2D from keras.layers import UpSampling2D from keras.layers import AveragePooling2D from keras.layers import LeakyReLU from keras.layers import BatchNormalization from keras.layers import Add from keras.utils.vis_utils import plot_model from keras import backend # weighted sum output class WeightedSum(Add): # init with default value def __init__(self, alpha=0.0, **kwargs): super(WeightedSum, self).__init__(**kwargs) self.alpha = backend.variable(alpha, name='ws_alpha') # output a weighted sum of inputs def _merge_function(self, inputs): # only supports a weighted sum of two inputs assert (len(inputs) == 2) # ((1-a) * input1) + (a * input2) output = ((1.0 - self.alpha) * inputs[0]) + (self.alpha * inputs[1]) return output # add a discriminator block def add_discriminator_block(old_model, n_input_layers=3): # get shape of existing model in_shape = list(old_model.input.shape) # define new input shape as double the size input_shape = (in_shape[-2].value*2, in_shape[-2].value*2, in_shape[-1].value) in_image = Input(shape=input_shape) # define new input processing layer d = Conv2D(64, (1,1), padding='same', kernel_initializer='he_normal')(in_image) d = LeakyReLU(alpha=0.2)(d) # define new block d = Conv2D(64, (3,3), padding='same', kernel_initializer='he_normal')(d) d = BatchNormalization()(d) d = LeakyReLU(alpha=0.2)(d) d = Conv2D(64, (3,3), padding='same', kernel_initializer='he_normal')(d) d = BatchNormalization()(d) d = LeakyReLU(alpha=0.2)(d) d = AveragePooling2D()(d) block_new = d # skip the input, 1x1 and activation for the old model for i in range(n_input_layers, len(old_model.layers)): d = old_model.layers[i](d) # define straight-through model model1 = Model(in_image, d) # compile model model1.compile(loss='mse', optimizer=Adam(lr=0.001, beta_1=0, beta_2=0.99, epsilon=10e-8)) # downsample the new larger image downsample = AveragePooling2D()(in_image) # connect old input processing to downsampled new input block_old = old_model.layers[1](downsample) block_old = old_model.layers[2](block_old) # fade in output of old model input layer with new input d = WeightedSum()([block_old, block_new]) # skip the input, 1x1 and activation for the old model for i in range(n_input_layers, len(old_model.layers)): d = old_model.layers[i](d) # define straight-through model model2 = Model(in_image, d) # compile model model2.compile(loss='mse', optimizer=Adam(lr=0.001, beta_1=0, beta_2=0.99, epsilon=10e-8)) return [model1, model2] # define the discriminator models for each image resolution def define_discriminator(n_blocks, input_shape=(4,4,3)): model_list = list() # base model input in_image = Input(shape=input_shape) # conv 1x1 d = Conv2D(64, (1,1), padding='same', kernel_initializer='he_normal')(in_image) d = LeakyReLU(alpha=0.2)(d) # conv 3x3 (output block) d = Conv2D(128, (3,3), padding='same', kernel_initializer='he_normal')(d) d = BatchNormalization()(d) d = LeakyReLU(alpha=0.2)(d) # conv 4x4 d = Conv2D(128, (4,4), padding='same', kernel_initializer='he_normal')(d) d = BatchNormalization()(d) d = LeakyReLU(alpha=0.2)(d) # dense output layer d = Flatten()(d) out_class = Dense(1)(d) # define model model = Model(in_image, out_class) # compile model model.compile(loss='mse', optimizer=Adam(lr=0.001, beta_1=0, beta_2=0.99, epsilon=10e-8)) # store model model_list.append([model, model]) # create submodels for i in range(1, n_blocks): # get prior model without the fade-on old_model = model_list[i - 1][0] # create new model for next resolution models = add_discriminator_block(old_model) # store model model_list.append(models) return model_list # add a generator block def add_generator_block(old_model): # get the end of the last block block_end = old_model.layers[-2].output # upsample, and define new block upsampling = UpSampling2D()(block_end) g = Conv2D(64, (3,3), padding='same', kernel_initializer='he_normal')(upsampling) g = BatchNormalization()(g) g = LeakyReLU(alpha=0.2)(g) g = Conv2D(64, (3,3), padding='same', kernel_initializer='he_normal')(g) g = BatchNormalization()(g) g = LeakyReLU(alpha=0.2)(g) # add new output layer out_image = Conv2D(3, (1,1), padding='same', kernel_initializer='he_normal')(g) # define model model1 = Model(old_model.input, out_image) # get the output layer from old model out_old = old_model.layers[-1] # connect the upsampling to the old output layer out_image2 = out_old(upsampling) # define new output image as the weighted sum of the old and new models merged = WeightedSum()([out_image2, out_image]) # define model model2 = Model(old_model.input, merged) return [model1, model2] # define generator models def define_generator(latent_dim, n_blocks, in_dim=4): model_list = list() # base model latent input in_latent = Input(shape=(latent_dim,)) # linear scale up to activation maps g = Dense(128 * in_dim * in_dim, kernel_initializer='he_normal')(in_latent) g = Reshape((in_dim, in_dim, 128))(g) # conv 4x4, input block g = Conv2D(128, (3,3), padding='same', kernel_initializer='he_normal')(g) g = BatchNormalization()(g) g = LeakyReLU(alpha=0.2)(g) # conv 3x3 g = Conv2D(128, (3,3), padding='same', kernel_initializer='he_normal')(g) g = BatchNormalization()(g) g = LeakyReLU(alpha=0.2)(g) # conv 1x1, output block out_image = Conv2D(3, (1,1), padding='same', kernel_initializer='he_normal')(g) # define model model = Model(in_latent, out_image) # store model model_list.append([model, model]) # create submodels for i in range(1, n_blocks): # get prior model without the fade-on old_model = model_list[i - 1][0] # create new model for next resolution models = add_generator_block(old_model) # store model model_list.append(models) return model_list # define composite models for training generators via discriminators def define_composite(discriminators, generators): model_list = list() # create composite models for i in range(len(discriminators)): g_models, d_models = generators[i], discriminators[i] # straight-through model d_models[0].trainable = False model1 = Sequential() model1.add(g_models[0]) model1.add(d_models[0]) model1.compile(loss='mse', optimizer=Adam(lr=0.001, beta_1=0, beta_2=0.99, epsilon=10e-8)) # fade-in model d_models[1].trainable = False model2 = Sequential() model2.add(g_models[1]) model2.add(d_models[1]) model2.compile(loss='mse', optimizer=Adam(lr=0.001, beta_1=0, beta_2=0.99, epsilon=10e-8)) # store model_list.append([model1, model2]) return model_list # define models discriminators = define_discriminator(3) # define models generators = define_generator(100, 3) # define composite models composite = define_composite(discriminators, generators)

Now that we know how to define all of the models, we can review how the models might be updated during training.

Pre-defining the generator, discriminator, and composite models was the hard part; training the models is straight forward and much like training any other GAN.

Importantly, in each training iteration the alpha variable in each *WeightedSum* layer must be set to a new value. This must be set for the layer in both the generator and discriminator models and allows for the smooth linear transition from the old model layers to the new model layers, e.g. alpha values set from 0 to 1 over a fixed number of training iterations.

The *update_fadein()* function below implements this and will loop through a list of models and set the alpha value on each based on the current step in a given number of training steps. You may be able to implement this more elegantly using a callback.

# update the alpha value on each instance of WeightedSum def update_fadein(models, step, n_steps): # calculate current alpha (linear from 0 to 1) alpha = step / float(n_steps - 1) # update the alpha for each model for model in models: for layer in model.layers: if isinstance(layer, WeightedSum): backend.set_value(layer.alpha, alpha)

We can define a generic function for training a given generator, discriminator, and composite model for a given number of training epochs.

The *train_epochs()* function below implements this where first the discriminator model is updated on real and fake images, then the generator model is updated, and the process is repeated for the required number of training iterations based on the dataset size and the number of epochs.

This function calls helper functions for retrieving a batch of real images via *generate_real_samples()*, generating a batch of fake samples with the generator *generate_fake_samples()*, and generating a sample of points in latent space *generate_latent_points()*. You can define these functions yourself quite trivially.

# train a generator and discriminator def train_epochs(g_model, d_model, gan_model, dataset, n_epochs, n_batch, fadein=False): # calculate the number of batches per training epoch bat_per_epo = int(dataset.shape[0] / n_batch) # calculate the number of training iterations n_steps = bat_per_epo * n_epochs # calculate the size of half a batch of samples half_batch = int(n_batch / 2) # manually enumerate epochs for i in range(n_steps): # update alpha for all WeightedSum layers when fading in new blocks if fadein: update_fadein([g_model, d_model, gan_model], i, n_steps) # prepare real and fake samples X_real, y_real = generate_real_samples(dataset, half_batch) X_fake, y_fake = generate_fake_samples(g_model, latent_dim, half_batch) # update discriminator model d_loss1 = d_model.train_on_batch(X_real, y_real) d_loss2 = d_model.train_on_batch(X_fake, y_fake) # update the generator via the discriminator's error z_input = generate_latent_points(latent_dim, n_batch) y_real2 = ones((n_batch, 1)) g_loss = gan_model.train_on_batch(z_input, y_real2) # summarize loss on this batch print('>%d, d1=%.3f, d2=%.3f g=%.3f' % (i+1, d_loss1, d_loss2, g_loss))

The images must be scaled to the size of each model. If the images are in-memory, we can define a simple scale_dataset() function to scale the loaded images.

In this case, we are using the skimage.transform.resize function from the scikit-image library to resize the NumPy array of pixels to the required size and use nearest neighbor interpolation.

# scale images to preferred size def scale_dataset(images, new_shape): images_list = list() for image in images: # resize with nearest neighbor interpolation new_image = resize(image, new_shape, 0) # store images_list.append(new_image) return asarray(images_list)

First, the baseline model must be fit for a given number of training epochs, e.g. the model that outputs 4×4 sized images.

This will require that the loaded images be scaled to the required size defined by the shape of the generator models output layer.

# fit the baseline model g_normal, d_normal, gan_normal = g_models[0][0], d_models[0][0], gan_models[0][0] # scale dataset to appropriate size gen_shape = g_normal.output_shape scaled_data = scale_dataset(dataset, gen_shape[1:]) print('Scaled Data', scaled_data.shape) # train normal or straight-through models train_epochs(g_normal, d_normal, gan_normal, scaled_data, e_norm, n_batch)

We can then process each level of growth, e.g. the first being 8×8.

This involves first retrieving the models, scaling the data to the appropriate size, then fitting the fade-in model followed by training the straight-through version of the model for fine tuning.

We can repeat this for each level of growth in a loop.

# process each level of growth for i in range(1, len(g_models)): # retrieve models for this level of growth [g_normal, g_fadein] = g_models[i] [d_normal, d_fadein] = d_models[i] [gan_normal, gan_fadein] = gan_models[i] # scale dataset to appropriate size gen_shape = g_normal.output_shape scaled_data = scale_dataset(dataset, gen_shape[1:]) print('Scaled Data', scaled_data.shape) # train fade-in models for next level of growth train_epochs(g_fadein, d_fadein, gan_fadein, scaled_data, e_fadein, n_batch) # train normal or straight-through models train_epochs(g_normal, d_normal, gan_normal, scaled_data, e_norm, n_batch)

We can tie this together and define a function called *train()* to train the progressive growing GAN function.

# train the generator and discriminator def train(g_models, d_models, gan_models, dataset, latent_dim, e_norm, e_fadein, n_batch): # fit the baseline model g_normal, d_normal, gan_normal = g_models[0][0], d_models[0][0], gan_models[0][0] # scale dataset to appropriate size gen_shape = g_normal.output_shape scaled_data = scale_dataset(dataset, gen_shape[1:]) print('Scaled Data', scaled_data.shape) # train normal or straight-through models train_epochs(g_normal, d_normal, gan_normal, scaled_data, e_norm, n_batch) # process each level of growth for i in range(1, len(g_models)): # retrieve models for this level of growth [g_normal, g_fadein] = g_models[i] [d_normal, d_fadein] = d_models[i] [gan_normal, gan_fadein] = gan_models[i] # scale dataset to appropriate size gen_shape = g_normal.output_shape scaled_data = scale_dataset(dataset, gen_shape[1:]) print('Scaled Data', scaled_data.shape) # train fade-in models for next level of growth train_epochs(g_fadein, d_fadein, gan_fadein, scaled_data, e_fadein, n_batch, True) # train normal or straight-through models train_epochs(g_normal, d_normal, gan_normal, scaled_data, e_norm, n_batch)

The number of epochs for the normal phase is defined by the *e_norm* argument and the number of epochs during the fade-in phase is defined by the *e_fadein* argument.

The number of epochs must be specified based on the size of the image dataset and the same number of epochs can be used for each phase, as was used in the paper.

We start with 4×4 resolution and train the networks until we have shown the discriminator 800k real images in total. We then alternate between two phases: fade in the first 3-layer block during the next 800k images, stabilize the networks for 800k images, fade in the next 3-layer block during 800k images, etc.

— Progressive Growing of GANs for Improved Quality, Stability, and Variation, 2017.

We can then define our models as we did in the previous section, then call the training function.

# number of growth phase, e.g. 3 = 16x16 images n_blocks = 3 # size of the latent space latent_dim = 100 # define models d_models = define_discriminator(n_blocks) # define models g_models = define_generator(100, n_blocks) # define composite models gan_models = define_composite(d_models, g_models) # load image data dataset = load_real_samples() # train model train(g_models, d_models, gan_models, dataset, latent_dim, 100, 100, 16)

This section provides more resources on the topic if you are looking to go deeper.

- Progressive Growing of GANs for Improved Quality, Stability, and Variation, 2017.
- Progressive Growing of GANs for Improved Quality, Stability, and Variation, Official.
- progressive_growing_of_gans Project (official), GitHub.
- Progressive Growing of GANs for Improved Quality, Stability, and Variation. Open Review.
- Progressive Growing of GANs for Improved Quality, Stability, and Variation, YouTube.
- Progressive growing of GANs for improved quality, stability and variation, KeyNote, YouTube.

- Keras Datasets API.
- Keras Sequential Model API
- Keras Convolutional Layers API
- How can I “freeze” Keras layers?
- Keras Contrib Project
- skimage.transform.resize API

- Keras-progressive_growing_of_gans Project, GitHub.
- Hands-On-Generative-Adversarial-Networks-with-Keras Project, GitHub.

In this tutorial, you discovered how to develop progressive growing generative adversarial network models from scratch with Keras.

Specifically, you learned:

- How to develop pre-defined discriminator and generator models at each level of output image growth.
- How to define composite models for training the generator models via the discriminator models.
- How to cycle the training of fade-in version and normal versions of models at each level of output image growth.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How to Implement Progressive Growing GAN Models in Keras appeared first on MachineLearningMastery.com.

]]>The post A Gentle Introduction to the Progressive Growing GAN appeared first on MachineLearningMastery.com.

]]>It involves starting with a very small image and incrementally adding blocks of layers that increase the output size of the generator model and the input size of the discriminator model until the desired image size is achieved.

This approach has proven effective at generating high-quality synthetic faces that are startlingly realistic.

In this post, you will discover the progressive growing generative adversarial network for generating large images.

After reading this post, you will know:

- GANs are effective at generating sharp images, although they are limited to small image sizes because of model stability.
- Progressive growing GAN is a stable approach to training GAN models to generate large high-quality images that involves incrementally increasing the size of the model during training.
- Progressive growing GAN models are capable of generating photorealistic synthetic faces and objects at high resolution that are remarkably realistic.

**Kick-start your project** with my new book Generative Adversarial Networks with Python, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

This tutorial is divided into five parts; they are:

- GANs Are Generally Limited to Small Images
- Generate Large Images by Progressively Adding Layers
- How to Progressively Grow a GAN
- Images Generated by the Progressive Growing GAN
- How to Configure Progressive Growing GAN Models

Generative Adversarial Networks, or GANs for short, are an effective approach for training deep convolutional neural network models for generating synthetic images.

Training a GAN model involves two models: a generator used to output synthetic images, and a discriminator model used to classify images as real or fake, which is used to train the generator model. The two models are trained together in an adversarial manner, seeking an equilibrium.

Compared to other approaches, they are both fast and result in crisp images.

A problem with GANs is that they are limited to small dataset sizes, often a few hundred pixels and often less than 100-pixel square images.

GANs produce sharp images, albeit only in fairly small resolutions and with somewhat limited variation, and the training continues to be unstable despite recent progress.

— Progressive Growing of GANs for Improved Quality, Stability, and Variation, 2017.

Generating high-resolution images is believed to be challenging for GAN models as the generator must learn how to output both large structure and fine details at the same time.

The high resolution makes any issues in the fine detail of generated images easy to spot for the discriminator and the training process fails.

The generation of high-resolution images is difficult because higher resolution makes it easier to tell the generated images apart from training images …

— Progressive Growing of GANs for Improved Quality, Stability, and Variation, 2017.

Large images, such as 1024-pixel square images, also require significantly more memory, which is in relatively limited supply on modern GPU hardware compared to main memory.

As such, the batch size that defines the number of images used to update model weights each training iteration must be reduced to ensure that the large images fit into memory. This, in turn, introduces further instability into the training process.

Large resolutions also necessitate using smaller minibatches due to memory constraints, further compromising training stability.

— Progressive Growing of GANs for Improved Quality, Stability, and Variation, 2017.

Additionally, the training of GAN models remains unstable, even in the presence of a suite of empirical techniques designed to improve the stability of the model training process.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

A solution to the problem of training stable GAN models for larger images is to progressively increase the number of layers during the training process.

This approach is called Progressive Growing GAN, Progressive GAN, or PGGAN for short.

The approach was proposed by Tero Karras, et al. from Nvidia in the 2017 paper titled “Progressive Growing of GANs for Improved Quality, Stability, and Variation” and presented at the 2018 ICLR conference.

Our primary contribution is a training methodology for GANs where we start with low-resolution images, and then progressively increase the resolution by adding layers to the networks.

— Progressive Growing of GANs for Improved Quality, Stability, and Variation, 2017.

Progressive Growing GAN involves using a generator and discriminator model with the same general structure and starting with very small images, such as 4×4 pixels.

During training, new blocks of convolutional layers are systematically added to both the generator model and the discriminator models.

The incremental addition of the layers allows the models to effectively learn coarse-level detail and later learn ever finer detail, both on the generator and discriminator side.

This incremental nature allows the training to first discover large-scale structure of the image distribution and then shift attention to increasingly finer scale detail, instead of having to learn all scales simultaneously.

— Progressive Growing of GANs for Improved Quality, Stability, and Variation, 2017.

This approach allows the generation of large high-quality images, such as 1024×1024 photorealistic faces of celebrities that do not exist.

Progressive Growing GAN requires that the capacity of both the generator and discriminator model be expanded by adding layers during the training process.

This is much like the greedy layer-wise training process that was common for developing deep learning neural networks prior to the development of ReLU and Batch Normalization.

For example, see the post:

Unlike greedy layer-wise pretraining, progressive growing GAN involves adding blocks of layers and phasing in the addition of the blocks of layers rather than adding them directly.

When new layers are added to the networks, we fade them in smoothly […] This avoids sudden shocks to the already well-trained, smaller-resolution layers.

— Progressive Growing of GANs for Improved Quality, Stability, and Variation, 2017.

Further, all layers remain trainable during the training process, including existing layers when new layers are added.

All existing layers in both networks remain trainable throughout the training process.

— Progressive Growing of GANs for Improved Quality, Stability, and Variation, 2017.

The phasing in of a new block of layers involves using a skip connection to connect the new block to the input of the discriminator or output of the generator and adding it to the existing input or output layer with a weighting. The weighting controls the influence of the new block and is achieved using a parameter alpha (a) that starts at zero or a very small number and linearly increases to 1.0 over training iterations.

This is demonstrated in the figure below, taken from the paper.

It shows a generator that outputs a 16×16 image and a discriminator that takes a 16×16 pixel image. The models are grown to the size of 32×32.

Let’s take a closer look at how to progressively add layers to the generator and discriminator when going from 16×16 to 32×32 pixels.

For the generator, this involves adding a new block of convolutional layers that outputs a 32×32 image.

The output of this new layer is combined with the output of the 16×16 layer that is upsampled using nearest neighbor interpolation to 32×32. This is different from many GAN generators that use a transpose convolutional layer.

… doubling […] the image resolution using nearest neighbor filtering

— Progressive Growing of GANs for Improved Quality, Stability, and Variation, 2017.

The contribution of the upsampled 16×16 layer is weighted by (1 – alpha), whereas the contribution of the new 32×32 layer is weighted by alpha.

Alpha is small initially, giving the most weight to the scaled-up version of the 16×16 image, although slowly transitions to giving more weight and then all weight to the new 32×32 output layers over training iterations.

During the transition we treat the layers that operate on the higher resolution like a residual block, whose weight alpha increases linearly from 0 to 1.

— Progressive Growing of GANs for Improved Quality, Stability, and Variation, 2017.

For the discriminator, this involves adding a new block of convolutional layers for the input of the model to support image sizes with 32×32 pixels.

The input image is downsampled to 16×16 using average pooling so that it can pass through the existing 16×16 convolutional layers. The output of the new 32×32 block of layers is also downsampled using average pooling so that it can be provided as input to the existing 16×16 block. This is different from most GAN models that use a 2×2 stride in the convolutional layers to downsample.

… halving the image resolution using […] average pooling

— Progressive Growing of GANs for Improved Quality, Stability, and Variation, 2017.

The two downsampled versions of the input are combined in a weighted manner, starting with a full weighting to the downsampled raw input and linearly transitioning to a full weighting for the interpreted output of the new input layer block.

In this section, we can review some of the impressive results achieved with the Progressive Growing GAN described in the paper.

Many example images are provided in the appendix of the paper and I recommend reviewing it. Additionally, a YouTube video was also created summarizing the impressive results of the model.

Perhaps the most impressive accomplishment of the Progressive Growing GAN is the generation of large 1024×1024 pixel photorealistic generated faces.

The model was trained on a high-quality version of the celebrity faces dataset, called CELEBA-HQ. As such, the faces look familiar as they contain elements of many real celebrity faces, although none of the people actually exist.

Interestingly, the model required to generate the faces was trained on 8 GPUs for 4 days, perhaps out of the range of most developers.

We trained the network on 8 Tesla V100 GPUs for 4 days, after which we no longer observed qualitative differences between the results of consecutive training iterations. Our implementation used an adaptive minibatch size depending on the current output resolution so that the available memory budget was optimally utilized.

— Progressive Growing of GANs for Improved Quality, Stability, and Variation, 2017.

The model was also demonstrated on generating 256×256-pixel photorealistic synthetic objects from the LSUN dataset, such as bikes, buses, and churches.

The paper describes the configuration details of the model used to generate the 1024×1024 synthetic photographs of celebrity faces.

Specifically, the details are provided in Appendix A.

Although we may not be interested or have the resources to develop such a large model, the configuration details may be useful when implementing a Progressive Growing GAN.

Both the discriminator and generator models were grown using blocks of convolutional layers, each using a specific number of filters with the size 3×3 and the LeakyReLU activation layer with the slope of 0.2. Upsampling was achieved via nearest neighbor sampling and downsampling was achieved using average pooling.

Both networks consist mainly of replicated 3-layer blocks that we introduce one by one during the course of the training. […] We use leaky ReLU with leakiness 0.2 in all layers of both networks, except for the last layer that uses linear activation.

— Progressive Growing of GANs for Improved Quality, Stability, and Variation, 2017.

The generator used a 512-element latent vector of Gaussian random variables. It also used an output layer with a 1×1-sized filters and a linear activation function, instead of the more common hyperbolic tangent activation function (tanh). The discriminator also used an output layer with 1×1-sized filters and a linear activation function.

The Wasserstein GAN loss was used with the gradient penalty, so-called WGAN-GP as described in the 2017 paper titled “Improved Training of Wasserstein GANs.” The least squares loss was tested and showed good results, but not as good as WGAN-GP.

The models start with a 4×4 input image and grow until they reach the 1024×1024 target.

Tables were provided that list the number of layers and number of filters used in each layer for the generator and discriminator models, reproduced below.

Batch normalization is not used; instead, two other techniques are added, including minibatch standard deviation pixel-wise normalization.

The standard deviation of activations across images in the mini-batch is added as a new channel prior to the last block of convolutional layers in the discriminator model. This is referred to as “*Minibatch standard deviation*.”

We inject the across-minibatch standard deviation as an additional feature map at 4×4 resolution toward the end of the discriminator

— Progressive Growing of GANs for Improved Quality, Stability, and Variation, 2017.

A pixel-wise normalization is performed in the generator after each convolutional layer that normalizes each pixel value in the activation map across the channels to a unit length. This is a type of activation constraint that is more generally referred to as “*local response normalization*.”

The bias for all layers is initialized as zero and model weights are initialized as a random Gaussian rescaled using the He weight initialization method.

We initialize all bias parameters to zero and all weights according to the normal distribution with unit variance. However, we scale the weights with a layer-specific constant at runtime …

— Progressive Growing of GANs for Improved Quality, Stability, and Variation, 2017.

The models are optimized using the Adam version of stochastic gradient descent with a small learning rate and low momentum.

We train the networks using Adam with a = 0.001, B1=0, B2=0.99, and eta = 10^−8.

— Progressive Growing of GANs for Improved Quality, Stability, and Variation, 2017.

Image generation uses a weighted average of prior models rather a given model snapshot, much like a horizontal ensemble.

… visualizing generator output at any given point during the training, we use an exponential running average for the weights of the generator with decay 0.999

— Progressive Growing of GANs for Improved Quality, Stability, and Variation, 2017.

This section provides more resources on the topic if you are looking to go deeper.

- Progressive Growing of GANs for Improved Quality, Stability, and Variation, 2017.
- Progressive Growing of GANs for Improved Quality, Stability, and Variation, Official.
- progressive_growing_of_gans Project (official), GitHub.
- Progressive Growing of GANs for Improved Quality, Stability, and Variation. Open Review.
- Progressive Growing of GANs for Improved Quality, Stability, and Variation, YouTube.

In this post, you discovered the progressive growing generative adversarial network for generating large images.

Specifically, you learned:

- GANs are effective at generating sharp images, although they are limited to small image sizes because of model stability.
- Progressive growing GAN is a stable approach to training GAN models to generate large high-quality images that involves incrementally increasing the size of the model during training.
- Progressive growing GAN models are capable of generating photorealistic synthetic faces and objects at high resolution that are remarkably realistic.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to the Progressive Growing GAN appeared first on MachineLearningMastery.com.

]]>