Generative Adversarial Networks, or GANs for short, are effective at generating large high-quality images.
Most improvements have been made to discriminator models in an effort to train more effective generator models, while less effort has gone into improving the generator models themselves.
The Style Generative Adversarial Network, or StyleGAN for short, is an extension to the GAN architecture that proposes large changes to the generator model, including the use of a mapping network to map points in latent space to an intermediate latent space, the use of the intermediate latent space to control style at each point in the generator model, and the introduction of noise as a source of variation at each point in the generator model.
The resulting model is capable not only of generating impressively photorealistic high-quality photos of faces, but also offers control over the style of the generated image at different levels of detail through varying the style vectors and noise.
In this post, you will discover the Style Generative Adversarial Network that gives control over the style of generated synthetic images.
After reading this post, you will know:
- The lack of control over the style of synthetic images generated by traditional GAN models.
- The architecture of StyleGAN model that introduces control over the style of generated images at different levels of detail.
- Impressive results achieved with the StyleGAN architecture when used to generate synthetic human faces.
Let’s get started.
Overview
This tutorial is divided into four parts; they are:
- Lacking Control Over Synthesized Images
- Control Style Using New Generator Model
- What Is the StyleGAN Model Architecture
- Examples of StyleGAN Generated Images
Lacking Control Over Synthesized Images
Generative adversarial networks are effective at generating high-quality and large-resolution synthetic images.
The generator model takes as input a point from latent space and generates an image. This model is trained by a second model, called the discriminator, that learns to differentiate real images from the training dataset from fake images generated by the generator model. As such, the two models compete in an adversarial game and find a balance or equilibrium during the training process.
Many improvements to the GAN architecture have been achieved through enhancements to the discriminator model. These changes are motivated by the idea that a better discriminator model will, in turn, lead to the generation of more realistic synthetic images.
As such, the generator has been somewhat neglected and remains a black box. For example, the source of randomness used in the generation of synthetic images is not well understood, including both the amount of randomness in the sampled points and the structure of the latent space.
Yet the generators continue to operate as black boxes, and despite recent efforts, the understanding of various aspects of the image synthesis process, […] is still lacking. The properties of the latent space are also poorly understood …
— A Style-Based Generator Architecture for Generative Adversarial Networks, 2018.
This limited understanding of the generator is perhaps most exemplified by the general lack of control over the generated images. There are few tools to control the properties of generated images, e.g. the style. This includes high-level features such as background and foreground, and fine-grained details such as the features of synthesized objects or subjects.
This requires both disentangling features or properties in images and adding controls for these properties to the generator model.
Control Style Using New Generator Model
The Style Generative Adversarial Network, or StyleGAN for short, is an extension to the GAN architecture to give control over the disentangled style properties of generated images.
Our generator starts from a learned constant input and adjusts the “style” of the image at each convolution layer based on the latent code, therefore directly controlling the strength of image features at different scales
— A Style-Based Generator Architecture for Generative Adversarial Networks, 2018.
The StyleGAN is an extension of the progressive growing GAN, an approach for training generator models capable of synthesizing very large high-quality images via the incremental expansion of both discriminator and generator models from small to large images during the training process.
In addition to the incremental growing of the models during training, the StyleGAN changes the architecture of the generator significantly.
The StyleGAN generator no longer takes a point from the latent space as input; instead, there are two new sources of randomness used to generate a synthetic image: a standalone mapping network and noise layers.
The output from the mapping network is a vector that defines the styles and is integrated at each point in the generator model via a new layer called adaptive instance normalization. The use of this style vector gives control over the style of the generated image.
Stochastic variation is introduced through noise added at each point in the generator model. The noise is added to entire feature maps, which allows the model to interpret the style in a fine-grained, per-pixel manner.
This per-block incorporation of style vector and noise allows each block to localize both the interpretation of style and the stochastic variation to a given level of detail.
The new architecture leads to an automatically learned, unsupervised separation of high-level attributes (e.g., pose and identity when trained on human faces) and stochastic variation in the generated images (e.g., freckles, hair), and it enables intuitive, scale-specific control of the synthesis
— A Style-Based Generator Architecture for Generative Adversarial Networks, 2018.
What Is the StyleGAN Model Architecture
The StyleGAN is described as a progressive growing GAN architecture with five modifications, each of which was added and evaluated incrementally in an ablative study.
The incremental list of changes to the generator is:
- Baseline Progressive GAN.
- Addition of tuning and bilinear upsampling.
- Addition of mapping network and AdaIN (styles).
- Removal of latent vector input to generator.
- Addition of noise to each block.
- Addition of mixing regularization.
The image below summarizes the StyleGAN generator architecture.
We can review each of these changes in more detail.
1. Baseline Progressive GAN
The StyleGAN generator and discriminator models are trained using the progressive growing GAN training method.
This means that both models start with small images, in this case, 4×4 images. The models are fit until stable, then both discriminator and generator are expanded to double the width and height (quadruple the area), e.g. 8×8.
A new block is added to each model to support the larger image size, which is faded in slowly over training. Once faded-in, the models are again trained until reasonably stable and the process is repeated with ever-larger image sizes until the desired target image size is met, such as 1024×1024.
For more on the progressive growing GAN, see the paper: Progressive Growing of GANs for Improved Quality, Stability, and Variation, 2017.
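The growth schedule described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the linear blend used for the fade-in is an assumption about how a new block is "faded in slowly", not code taken from the paper.

```python
def resolution_schedule(start=4, target=1024):
    """Return the image sizes visited when doubling from start to target."""
    sizes = [start]
    while sizes[-1] < target:
        sizes.append(sizes[-1] * 2)
    return sizes


def fade_in(alpha, old_value, new_value):
    """Blend the old (upsampled) output with the output of the new block.

    alpha ramps from 0.0 (only the old path) to 1.0 (only the new block).
    A linear blend is assumed here for illustration.
    """
    return (1.0 - alpha) * old_value + alpha * new_value


sizes = resolution_schedule()
print(sizes)  # [4, 8, 16, 32, 64, 128, 256, 512, 1024]
```

Each entry in the list corresponds to one growth phase, so reaching 1024×1024 from 4×4 takes eight doublings.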
2. Bilinear Sampling
The progressive growing GAN uses nearest neighbor layers for upsampling instead of transpose convolutional layers that are common in other generator models.
The first point of deviation in the StyleGAN is that bilinear upsampling layers are used instead of nearest neighbor.
We replace the nearest-neighbor up/downsampling in both networks with bilinear sampling, which we implement by lowpass filtering the activations with a separable 2nd order binomial filter after each upsampling layer and before each downsampling layer.
— A Style-Based Generator Architecture for Generative Adversarial Networks, 2018.
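To make the quoted description more concrete, here is a minimal NumPy sketch of upsampling followed by a low-pass filter. It assumes the "2nd order binomial filter" is the kernel [1, 2, 1] / 4 applied along each axis, and it uses edge padding to keep the image size; both choices, and the function names, are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np


def upsample_nearest(image):
    """Double height and width by repeating each pixel (nearest neighbor)."""
    return image.repeat(2, axis=0).repeat(2, axis=1)


def binomial_blur(image):
    """Low-pass filter a 2D image with the separable kernel [1, 2, 1] / 4."""
    kernel = np.array([1.0, 2.0, 1.0]) / 4.0
    # Repeat the edge pixels so the output keeps the input size.
    padded = np.pad(image, 1, mode="edge")
    # Filter horizontally first ...
    horizontal = (kernel[0] * padded[:, :-2]
                  + kernel[1] * padded[:, 1:-1]
                  + kernel[2] * padded[:, 2:])
    # ... then vertically, which is what makes the filter separable.
    return (kernel[0] * horizontal[:-2, :]
            + kernel[1] * horizontal[1:-1, :]
            + kernel[2] * horizontal[2:, :])


image = np.array([[0.0, 1.0],
                  [1.0, 0.0]])
smooth = binomial_blur(upsample_nearest(image))
print(smooth.shape)  # (4, 4)
```

The blur softens the blocky edges that nearest neighbor upsampling leaves behind, while a constant image passes through unchanged because the kernel weights sum to one.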
3. Mapping Network and AdaIN
Next, a standalone mapping network is used that takes a randomly sampled point from the latent space as input and generates a style vector.
The mapping network is comprised of eight fully connected layers, i.e. it is a standard deep neural network.
For simplicity, we set the dimensionality of both [the latent and intermediate latent] spaces to 512, and the mapping f is implemented using an 8-layer MLP …
— A Style-Based Generator Architecture for Generative Adversarial Networks, 2018.
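A rough NumPy sketch of such a mapping network is shown here. The layer count and the 512-element vectors follow the quote above; the random placeholder weights, the leaky ReLU activation and its slope of 0.2, and all function names are assumptions made for illustration only.

```python
import numpy as np

LATENT_DIM = 512  # size of both latent spaces, as stated in the paper
NUM_LAYERS = 8    # depth of the mapping MLP, as stated in the paper


def leaky_relu(x, slope=0.2):
    """Leaky ReLU; the activation and the slope are assumptions."""
    return np.where(x > 0, x, slope * x)


def init_mapping_weights(rng):
    """Random placeholder weights; a real model would learn these."""
    scale = 1.0 / np.sqrt(LATENT_DIM)
    return [
        (rng.normal(0.0, scale, (LATENT_DIM, LATENT_DIM)), np.zeros(LATENT_DIM))
        for _ in range(NUM_LAYERS)
    ]


def mapping_network(z, weights):
    """Map a latent point z to an intermediate latent (style) vector w."""
    h = z
    for kernel, bias in weights:
        h = leaky_relu(h @ kernel + bias)
    return h


rng = np.random.default_rng(0)
weights = init_mapping_weights(rng)
z = rng.standard_normal(LATENT_DIM)  # point sampled from the latent space
w = mapping_network(z, weights)
print(w.shape)  # (512,)
```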
The style vector is then transformed and incorporated into each block of the generator model after the convolutional layers via an operation called adaptive instance normalization or AdaIN.
The AdaIN layers involve first standardizing each feature map to zero mean and unit variance, then scaling and shifting it with a scale and a bias term computed from the style vector.
Learned affine transformations then specialize [the intermediate latent vector] to styles y = (ys, yb) that control adaptive instance normalization (AdaIN) operations after each convolution layer of the synthesis network g.
— A Style-Based Generator Architecture for Generative Adversarial Networks, 2018.
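The operation can be sketched with NumPy as follows. The normalize, scale, and shift steps follow the description of AdaIN in the paper; the array shapes, the placeholder affine weights, and the function names are assumptions chosen to keep the example small.

```python
import numpy as np


def adain(features, scale, bias, eps=1e-8):
    """Adaptive instance normalization for the feature maps of one image.

    features: array of shape (channels, height, width)
    scale, bias: arrays of shape (channels,), i.e. y_s and y_b
    """
    mean = features.mean(axis=(1, 2), keepdims=True)
    std = features.std(axis=(1, 2), keepdims=True)
    normalized = (features - mean) / (std + eps)
    return scale[:, None, None] * normalized + bias[:, None, None]


def style_from_w(w, kernel, offset):
    """Affine transform that specializes w into the pair (y_s, y_b)."""
    y = w @ kernel + offset
    channels = y.shape[0] // 2
    return y[:channels], y[channels:]


rng = np.random.default_rng(0)
features = rng.standard_normal((4, 8, 8))           # 4 feature maps of size 8x8
w = rng.standard_normal(512)                        # stand-in for a style vector
kernel = rng.standard_normal((512, 8)) * 0.05       # placeholder affine weights
offset = np.concatenate([np.ones(4), np.zeros(4)])  # assumed start: scale 1, bias 0
y_s, y_b = style_from_w(w, kernel, offset)
out = adain(features, y_s, y_b)
print(out.shape)  # (4, 8, 8)
```

After the operation, the mean of each feature map equals its bias and the standard deviation equals the magnitude of its scale, so the statistics of every feature map are set by the style rather than by the incoming activations.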
The addition of the new mapping network to the architecture also results in the renaming of the generator model to a “synthesis network.”
4. Removal of Latent Point Input
The next change involves modifying the generator model so that it no longer takes a point from the latent space as input.
Instead, the model has a constant 4×4×512 input in order to start the image synthesis process.
5. Addition of Noise
The output of each convolutional layer in the synthesis network is a block of activation maps.
Gaussian noise is added to each of these activation maps prior to the AdaIN operations. A different sample of noise is generated for each layer and is scaled using learned per-feature scaling factors.
These are single-channel images consisting of uncorrelated Gaussian noise, and we feed a dedicated noise image to each layer of the synthesis network. The noise image is broadcasted to all feature maps using learned per-feature scaling factors and then added to the output of the corresponding convolution …
— A Style-Based Generator Architecture for Generative Adversarial Networks, 2018.
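A small NumPy sketch of this noise injection is given here. The function name and the array shapes are assumptions, and the scaling factors are fixed by hand, whereas in a trained model they would be learned.

```python
import numpy as np


def add_noise(features, noise_scale, rng):
    """Add one single-channel noise image to every feature map.

    features: (channels, height, width) output of a convolution
    noise_scale: (channels,) per-feature scaling factors
    """
    _, height, width = features.shape
    noise = rng.standard_normal((1, height, width))  # one noise image per layer
    # Broadcasting shares the same noise image across all channels,
    # with each channel applying its own scaling factor.
    return features + noise_scale[:, None, None] * noise


rng = np.random.default_rng(0)
features = np.zeros((2, 4, 4))
noise_scale = np.array([1.0, 2.0])
noisy = add_noise(features, noise_scale, rng)
print(noisy.shape)  # (2, 4, 4)
```

A scaling factor of zero switches the noise off for that feature map, which gives the model a way to decide where the noise matters.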
This noise is used to introduce stochastic variation at a given level of detail.
6. Mixing Regularization
Mixing regularization involves first generating two style vectors from the mapping network.
A split point in the synthesis network is chosen and all AdaIN operations prior to the split point use the first style vector and all AdaIN operations after the split point get the second style vector.
… we employ mixing regularization, where a given percentage of images are generated using two random latent codes instead of one during training.
— A Style-Based Generator Architecture for Generative Adversarial Networks, 2018.
This encourages the layers and blocks to localize the style to specific parts of the model and corresponding level of detail in the generated image.
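The split described above can be sketched in plain Python. The function names, the number of layers, and the uniform choice of the split point are hypothetical, and strings stand in for the style vectors to keep the example readable.

```python
import random


def mix_styles(w1, w2, num_layers, crossover):
    """Use w1 for layers before the crossover point and w2 from it onwards."""
    return [w1 if layer < crossover else w2 for layer in range(num_layers)]


def styles_for_training_sample(w1, w2, num_layers, mix_probability, rng):
    """With probability mix_probability, mix two styles at a random split."""
    if rng.random() < mix_probability:
        # Keep the split strictly inside the network so both styles are used.
        crossover = rng.randint(1, num_layers - 1)
        return mix_styles(w1, w2, num_layers, crossover)
    return [w1] * num_layers


print(mix_styles("w1", "w2", num_layers=6, crossover=2))
# ['w1', 'w1', 'w2', 'w2', 'w2', 'w2']

rng = random.Random(0)
per_layer = styles_for_training_sample("w1", "w2", 6, mix_probability=0.9, rng=rng)
```

Each entry in the returned list is the style that the AdaIN operation of the corresponding layer would receive.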
Examples of StyleGAN Generated Images
The StyleGAN is effective both at generating large high-quality images and at controlling the style of the generated images.
In this section, we will review some examples of generated images.
A video demonstrating the capability of the model was released by the authors of the paper, providing a useful overview.
High-Quality Faces
The image below taken from the paper shows synthetic faces generated with the StyleGAN with the sizes 4×4, 8×8, 16×16, and 32×32.
Varying Style by Level of Detail
The use of different style vectors at different points of the synthesis network gives control over the styles of the resulting image at different levels of detail.
For example, blocks of layers in the synthesis network at lower resolutions (e.g. 4×4 and 8×8) control high-level styles such as pose and hairstyle. Blocks of layers in the middle of the network (e.g. 16×16 and 32×32) control hairstyles and facial expression. Finally, blocks of layers closer to the output end of the network (e.g. 64×64 to 1024×1024) control color schemes and very fine details.
The image below, taken from the paper, shows generated images on the left and across the top. The two rows of intermediate images are examples of style mixing: each is generated using the style vectors of the image on the left, except in the lower-resolution (coarse) levels, where the style vectors of the image on the top are used instead. This allows the images on the left to adopt high-level styles such as pose and hairstyle from the images on the top in each column.
Copying the styles corresponding to coarse spatial resolutions (4^2 – 8^2) brings high-level aspects such as pose, general hair style, face shape, and eyeglasses from source B, while all colors (eyes, hair, lighting) and finer facial features resemble A.
— A Style-Based Generator Architecture for Generative Adversarial Networks, 2018.
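The grouping of resolutions into levels of detail, and the copying of styles from one source to another, can be sketched as a simple lookup. The function names and the use of a dictionary keyed by resolution are assumptions for illustration; the resolution ranges follow the description above.

```python
def detail_level(resolution):
    """Group a block's resolution into the coarse, middle, or fine range."""
    if resolution <= 8:
        return "coarse"  # e.g. pose, general hairstyle
    if resolution <= 32:
        return "middle"  # e.g. facial expression
    return "fine"        # e.g. color scheme, very fine details


def copy_styles(styles_a, styles_b, level):
    """Take source B's style at the chosen level and source A's elsewhere."""
    return {
        res: styles_b[res] if detail_level(res) == level else styles_a[res]
        for res in styles_a
    }


resolutions = [4, 8, 16, 32, 64, 128, 256, 512, 1024]
styles_a = {res: "A" for res in resolutions}  # placeholders for style vectors
styles_b = {res: "B" for res in resolutions}
mixed = copy_styles(styles_a, styles_b, "coarse")
print(mixed)
```

With the level set to "coarse", only the 4×4 and 8×8 blocks take their style from source B, which matches the case described in the quote.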
Use of Noise to Control Level of Detail
The authors varied the use of noise at different levels of detail in the model (e.g. fine, middle, coarse), much like the previous example of varying style.
The result is that noise gives control over the generation of detail, from broader structure when noise is used in the coarse blocks of layers to the generation of fine detail when noise is added to the layers closer to the output of the network.
We can see that the artificial omission of noise leads to featureless “painterly” look. Coarse noise causes large-scale curling of hair and appearance of larger background features, while the fine noise brings out the finer curls of hair, finer background detail, and skin pores.
— A Style-Based Generator Architecture for Generative Adversarial Networks, 2018.
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
- A Style-Based Generator Architecture for Generative Adversarial Networks, 2018.
- Progressive Growing of GANs for Improved Quality, Stability, and Variation, 2017.
- StyleGAN – Official TensorFlow Implementation, GitHub.
- StyleGAN Results Video, YouTube.
Summary
In this post, you discovered the Style Generative Adversarial Network that gives control over the style of generated synthetic images.
Specifically, you learned:
- The lack of control over the style of synthetic images generated by traditional GAN models.
- The architecture of StyleGAN model that introduces control over the style of generated images at different levels of detail.
- Impressive results achieved with the StyleGAN architecture when used to generate synthetic human faces.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
Hey Jason, I really love your work. Can you please help me by clarifying the internal architecture of AdaIN? (Just give me a brief overview). Thanks in advance.
Thanks for the suggestion, I may cover it in the future.
How can you validate if the image is real or fake programmatically ?
The discriminator will make this prediction.
Hi Mr.Jason could you please make a tutorial for styleGAN and style mixing using keras and tensorflow
Thanks for the suggestion.
Hey Jason, I appreciate your introduction to GANs. In the StyleGAN, I am confused about latent space disentanglement. What is that? Can you give us a more detailed explanation?
Thanks for the suggestion.
Thanks for the effort of the high-level explanation of many GAN-papers.
A minor question. You say “The mapping network is comprised of eight fully connected layers, e.g. it is a standard deep convolutional neural network.” I think the mapping network is a standard feed-forward network (MLP) without convolutional layers so I think there is a small mistake here?
Thanks. Yes, looks like a typo. Fixed.
Anyone know what activation is used in this MLP?
It’s not an MLP, it’s a GAN.
ReLU remain popular in GANs, you can get started here:
https://machinelearningmastery.com/start-here/#gans
The mapping network (which converts the latent space to w) is basically an MLP. From StyleGAN code it looks like they’re using Leaky-ReLU for its activations (if anyone is looking for the same).
Thanks.
Hi, thank you for the good tutorial. I have a question about the images that are generated using source A and source B. For example, if I want to use my own source images A to generate B, how can I input source A? Or are sources A and B taken from the training data?
Perhaps this will help you load your images:
https://machinelearningmastery.com/how-to-load-and-manipulate-images-for-deep-learning-in-python-with-pil-pillow/
hello
thank you for great tutorial
i have two questions about adding input noise to the output of the conv layers
i understand that we generate Gaussian noise for each conv layer
1-
do we use different noise (new random samples) during training
or keep it the same and just train the B variables (per-channel scale factors)?
2-
and what about inference time (using random or deterministic noise)?
Good question, from the tutorial:
Can you tell me how to test custom image on stylemixing using styleGAN architecture.
I don’t have a tutorial on this topic, perhaps in the future.
Topic “Varying Style by Level of Detail” is duplicated, i guess second one should be “level of noise” 🙂
Thanks, fixed!
Hi Jason
Thank for the article. Any chances of you showing how to implement this from scratch?
Perhaps in the future.
Hi Jason,
Pretty interesting article unveiling the StyleGAN network. Question, if one had thousands of real aircraft trajectory data and were to generate hyperrealistic (and diverse) synthetic aircraft trajectories (altitude, latitude, longitude; multivariate time series), would you use:
1 – Wasserstein GAN with GP using LSTMs/GRUs.
2 – Try to modify the StyleGAN architecture to use LSTM/GRU cells and generate sequences?
3 – other..
It would be great to have your opinion,
Thank you and good job!
Thanks.
For time series, I would not recommend a GAN as they are for images, I’d recommend checking the literature for generative models specific to time series.
I’m trying to come up with a way to use a GAN to generate textures for 3D models.
Additionally, it should be possible to build 3D shapes the same way, as a 3D shape can be encoded in a 2D image using vector displacement.
That sounds like a fun project, let me know how you go!
Thanks for the lovely article on styleGAN. Learned a lot
You’re welcome.
Hi Jason, are samples A and B both taken from previous fake generated images or does either one of them have to come from a real life human? I’m just curious to know if the AI could potentially create a face out of nothing if it just knew, by help of the discriminator, what a real face should look like.
Thanks,
Samples A and B are real photos from two different domains.
Thanks for the great tutorial. What is the “Latent Z” in styleGAN architecture and how we can obtain it. Is it sampled from Gaussian distribution as in normal GANs or obtained by feeding images to some pre-trained network?
You’re welcome.
From the tutorial “The StyleGAN generator no longer takes a point from the latent space as input”
I recommend re-reading.
Is it possible to direct StyleGAN to make a specific sheaf of images, eg., ‘males, about age 25’ rather than random faces?
If so, how would one specify these parameters?
Perhaps – I believe so. It really depends on the specific framing of the problem and the training data you have available – where you can associate images with the specific input variables. A straight stylegan might not be the best fit, it might be better to use a variation that gives you trainable control variables.
I don’t have much on this, you may need to dive into the literature to discover the latest.
Hello Jason, thanks for the tutorial. In the paper, the authors have mentioned that they have used a learned constant, rather than a random latent vector. Is this learned constant, a randomly initialized vector which is later updated during the training time through backpropagation(like weights of a convolution layer) and then treated as a constant during the inference time?
Not sure off the cuff, I assume it is a learned/adapted vector input.
Are the Learned affine transformations(before the AdaIN block) just fully connected layers, whose weights are learned over the training period or something else?
From memory, I believe so – you can check the paper and linked code project to be sure.
Hi Jason,
I wonder if styleGAN can be used for feature extraction to be used for feature rehearsal in incremental learning scenarios? Generally, is styleGAN the best choice among variations of GANs for feature extraction?
Perhaps try it and see?
how conv 3*3 works or what is its algorithm used in synthesis network of styleGAN?
This tutorial explains how convolutional layers work:
https://machinelearningmastery.com/convolutional-layers-for-deep-learning-neural-networks/
Hej Jason,
I am trying to build a GAN that is transferring a style from a piece of art to a fashion item like a t-shirt in a picture. What type of GAN would you recommend for it?
Sir, can you please give a brief explanation of how to make a custom dataset so we can train on it? It would help with my project work.
Hi deepa…The following resource may be of interest:
https://machinelearningmastery.com/a-guide-to-getting-datasets-for-machine-learning-in-python/