<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Casual GAN Papers</title>
    <description>Easy to read summaries of popular AI papers</description>
    <link>https://casualganpapers.com/</link>
    <atom:link href="https://casualganpapers.com/feed.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Wed, 27 Sep 2023 21:42:25 +0000</pubDate>
    <lastBuildDate>Wed, 27 Sep 2023 21:42:25 +0000</lastBuildDate>
    <generator>Jekyll v3.9.3</generator>
    
      <item>
        <title>87: DALL-E 2 (unCLIP)</title>
        <description>&lt;h4 id=&quot;hierarchical-text-conditional-image-generation-with-clip-latents-by-aditya-ramesh-prafulla-dhariwal-alex-nichol-casey-chu-et-al-explained-in-5-minutes&quot;&gt;Hierarchical Text-Conditional Image Generation with CLIP Latents by Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu et al. explained in 5 minutes&lt;/h4&gt;

&lt;h5 id=&quot;️paper-difficulty-&quot;&gt;⭐️Paper difficulty: 🌕🌕🌕🌕🌑&lt;/h5&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/dalle2_preview.jpeg&quot; alt=&quot;DALL-E 2 Model&quot; title=&quot;DALL-E 2 Teaser&quot; /&gt;&lt;/p&gt;

&lt;h5 id=&quot;-at-a-glance&quot;&gt;🎯 At a glance:&lt;/h5&gt;

&lt;p&gt;Does this paper even need an introduction? This is DALL-E 2 and if you have somehow not heard of it yet, sit down. Your mind is about to be blown.&lt;/p&gt;

&lt;p&gt;Enough said. Let’s dive in, shall we?&lt;/p&gt;

&lt;script type=&quot;text/javascript&quot; src=&quot;https://storage.ko-fi.com/cdn/widget/Widget_2.js&quot;&gt;&lt;/script&gt;
&lt;script type=&quot;text/javascript&quot;&gt;kofiwidget2.init('Tip Casual GAN Papers', '#2b2f3c', 'V7V7BXBHV');kofiwidget2.draw();&lt;/script&gt;

&lt;hr /&gt;

&lt;h5 id=&quot;️-prerequisites&quot;&gt;⌛️ Prerequisites:&lt;/h5&gt;

&lt;p&gt;&lt;strong&gt;(Highly recommended reading to understand the core contributions of this paper):&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;&lt;a href=&quot;/zero-shot-contrastive-loss-image-text-pretraining/CLIP-explained.html&quot;&gt;CLIP&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/guided_diffusion_langevin_dynamics_classifier_guidance/Guided-Diffusion-explained.html&quot;&gt;Diffusion&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/faster-diffusion-models-text-to-image-classifier-free-guidance/GLIDE-explained.html&quot;&gt;GLIDE&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/high-res-faster-diffusion-democratizing-diffusion/Latent-Disffusion-Models-explained.html&quot;&gt;LDMs&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h5 id=&quot;-motivation&quot;&gt;🚀 Motivation:&lt;/h5&gt;

&lt;p&gt;DALL-E 2 feels like the pinnacle of the evolution of text-to-image models, the final boss, and if you have been reading Casual GAN Papers, you are in luck, because we have covered all of the prerequisites in previous posts. No, seriously, if you missed any, go back and read them. Now! Done? Good. The pitch is simple: let’s make the ultimate text-to-image model using all of the recent breakthroughs in diffusion, combined with a monstrous dataset of 650 million text-image pairs pooled from CLIP and DALL-E.&lt;/p&gt;

&lt;hr /&gt;

&lt;h5 id=&quot;-main-ideas&quot;&gt;🔍 Main Ideas:&lt;/h5&gt;

&lt;p&gt;&lt;strong&gt;1. Overview:&lt;/strong&gt;&lt;br /&gt;
DALL-E 2, or unCLIP as it is referred to here, consists of a prior that maps the CLIP text embedding to a CLIP image embedding and a diffusion decoder that outputs the final image, conditioned on the predicted CLIP image embedding.&lt;/p&gt;
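
&lt;p&gt;To make the two-stage structure concrete, here is a minimal sketch of the sampling pipeline (all module names are hypothetical placeholders, not the authors’ code):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def unclip_sample(caption, clip, prior, decoder):
    # Hypothetical two-stage unCLIP pipeline; every name here is a
    # placeholder for illustration, not the paper's actual API.
    z_text = clip.encode_text(caption)       # CLIP text embedding
    z_image = prior.sample(z_text, caption)  # predicted CLIP image embedding
    return decoder.sample(z_image)           # diffusion decoder output
&lt;/code&gt;&lt;/pre&gt;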

&lt;p&gt;&lt;strong&gt;2. Decoder:&lt;/strong&gt;&lt;br /&gt;
The decoder is based on GLIDE with classifier-free guidance. It additionally receives projected CLIP embeddings added to the timestep embedding, and four more tokens of separately projected CLIP embeddings concatenated as context to the outputs of the GLIDE text encoder. The original text conditioning pathway from GLIDE is retained, although the authors found it to be of little help.&lt;/p&gt;
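
&lt;p&gt;Here is a rough sketch of what that conditioning pathway might look like (dimensions and layer names are my own assumptions, purely for illustration):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import torch
import torch.nn as nn

class DecoderConditioning(nn.Module):
    # Sketch of the pathway described above: the CLIP image embedding is
    # projected into the timestep embedding and into four extra context
    # tokens. Dimensions are illustrative assumptions.
    def __init__(self, clip_dim=768, time_dim=512, ctx_dim=2048, n_extra=4):
        super().__init__()
        self.to_time = nn.Linear(clip_dim, time_dim)
        self.to_ctx = nn.Linear(clip_dim, n_extra * ctx_dim)
        self.n_extra, self.ctx_dim = n_extra, ctx_dim

    def forward(self, clip_img_emb, t_emb, text_ctx):
        t_emb = t_emb + self.to_time(clip_img_emb)
        extra = self.to_ctx(clip_img_emb).view(-1, self.n_extra, self.ctx_dim)
        ctx = torch.cat([text_ctx, extra], dim=1)  # appended to GLIDE encoder outputs
        return t_emb, ctx
&lt;/code&gt;&lt;/pre&gt;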

&lt;p&gt;In addition to the base decoder, two upsampling diffusion models are trained to take images from 64 to 256 and from 256 to 1024 resolution respectively. These models are unconditional and do not use attention layers, only convolutions. This helps speed up training and reduce the compute requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Prior:&lt;/strong&gt;&lt;br /&gt;
Two types of prior models were considered: autoregressive and diffusion. The first converts a CLIP embedding into a sequence of discrete codes predicted autoregressively conditioned on the caption. The second models the CLIP image embedding vector directly with a Gaussian diffusion model conditioned on the caption.&lt;/p&gt;

&lt;p&gt;To train the autoregressive prior, the authors use PCA on the CLIP image embeddings to select the 319 most important dimensions and quantize each into 1024 discrete buckets. The transformer prior then learns to model the sequence of these tokens. The conditions are prepended to the sequence along with a quantized token that represents how well the input caption matches the image. This enables explicit control over how much leeway the model has when generating an image for a caption at inference.&lt;/p&gt;
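
&lt;p&gt;A minimal sketch of that PCA-plus-quantization step, assuming uniform bins (the paper does not spell out the exact bucketing scheme):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import torch

def quantize_clip_embedding(z_img, pca_components, pca_mean, n_bins=1024):
    # Project onto the top 319 principal directions, then bucket each
    # coordinate into one of 1024 bins. Bin edges are my assumption.
    z = (z_img - pca_mean) @ pca_components.T  # (batch, 319)
    z = torch.clamp(z, -1.0, 1.0)
    tokens = ((z + 1.0) / 2.0 * (n_bins - 1)).round().long()
    return tokens  # the sequence modeled by the autoregressive prior
&lt;/code&gt;&lt;/pre&gt;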

&lt;p&gt;The diffusion prior is a decoder-only transformer that is trained on a sequence consisting of the encoded text, the CLIP text embedding, the diffusion timestep embedding, the noised CLIP image embedding, and a final token whose output is used to predict the unnoised CLIP image embedding. The diffusion prior is not conditioned on the text-image matching score. Instead, two samples are produced at inference, and the one with the better score is selected.&lt;/p&gt;
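
&lt;p&gt;Assembling the diffusion prior’s input sequence might look roughly like this (tensor names are mine, purely for illustration):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import torch

def prior_input_sequence(text_tokens_emb, z_text, t_emb, z_img_noised, query):
    # Build the decoder-only transformer input in the order described
    # above; the final learned query token is where the model reads out
    # its prediction of the unnoised CLIP image embedding.
    return torch.cat([
        text_tokens_emb,            # encoded caption tokens
        z_text.unsqueeze(1),        # CLIP text embedding
        t_emb.unsqueeze(1),         # diffusion timestep embedding
        z_img_noised.unsqueeze(1),  # noised CLIP image embedding
        query.unsqueeze(1),         # output slot for the prediction
    ], dim=1)
&lt;/code&gt;&lt;/pre&gt;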

&lt;p&gt;&lt;strong&gt;4. Image Manipulations:&lt;/strong&gt;&lt;br /&gt;
DALL-E 2 is not a boring model: it can create variations of input images, interpolate between images, and perform “text diffs”, among other things. Such a feat is achieved by encoding images into the latent space of the decoder, represented as two vectors: one describing what the CLIP model sees, the other holding all of the residual information needed to reconstruct the image. The first is obtained from the CLIP image encoder, the other by inverting the decoder (obtaining the latent vector that best reconstructs the image).&lt;/p&gt;
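
&lt;p&gt;A sketch of that bipartite encoding, assuming a hypothetical ddim_invert helper for running the decoder backwards:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def encode_bipartite(image, clip, decoder):
    # z_img captures what the CLIP model sees; x_T holds the residual
    # information and is recovered by inverting the (deterministic) DDIM
    # decoder. ddim_invert is a hypothetical helper, not the paper's API.
    z_img = clip.encode_image(image)
    x_T = decoder.ddim_invert(image, cond=z_img)
    return z_img, x_T
&lt;/code&gt;&lt;/pre&gt;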

&lt;p&gt;Circling back to the beautiful use cases for these embeddings we have:&lt;/p&gt;

&lt;p&gt;Semantically similar variations of the input image with remixed content are obtained by freezing the CLIP image embedding and resampling the later steps of the diffusion decoder for the other latent.&lt;/p&gt;

&lt;p&gt;Interpolations between two images are obtained by applying SLERP (spherical linear interpolation) to both types of embeddings.&lt;/p&gt;
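
&lt;p&gt;For reference, here is the textbook SLERP formula as code (a standard implementation, not code from the paper):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import torch

def slerp(a, b, t):
    # Spherical linear interpolation between two embedding vectors.
    a_n = a / a.norm(dim=-1, keepdim=True)
    b_n = b / b.norm(dim=-1, keepdim=True)
    cos = (a_n * b_n).sum(-1, keepdim=True).clamp(-1.0 + 1e-7, 1.0 - 1e-7)
    omega = torch.acos(cos)  # angle between the two vectors
    return (torch.sin((1 - t) * omega) * a + torch.sin(t * omega) * b) / torch.sin(omega)
&lt;/code&gt;&lt;/pre&gt;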

&lt;p&gt;Text diffs enable text-guided image editing: subtract the original CLIP text embedding from the CLIP text embedding of the new caption, normalize the difference, and SLERP the CLIP image embedding of the input image towards it to make the image more like the edit prompt.&lt;/p&gt;
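
&lt;p&gt;A text diff might then look something like this sketch, reusing the slerp() helper from above (theta is the interpolation weight):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def text_diff_edit(z_img, z_text_old, z_text_new, theta):
    # Normalize the difference between the two caption embeddings and
    # SLERP the image embedding towards it; theta controls edit strength.
    diff = z_text_new - z_text_old
    diff = diff / diff.norm(dim=-1, keepdim=True)
    return slerp(z_img, diff, theta)
&lt;/code&gt;&lt;/pre&gt;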

&lt;p&gt;&lt;strong&gt;5. Probing the CLIP space:&lt;/strong&gt;&lt;br /&gt;
Another set of insightful, albeit less fun (in my opinion) experiments consists of applying PCA to the CLIP embedding to see how much and what type of information each component contains, and generating variations of input images containing typographic attacks (the word “iPod” is written over an apple). Interestingly, the neighboring images still contain an apple even though the class label “iPod” has a much higher probability. I believe this is also the source of the insane &lt;a href=&quot;https://twitter.com/giannis_daras/status/1531693093040230402?s=20&amp;amp;t=HSm5WI_POT7zrLOf4yroOA&quot;&gt;“secret language” of DALL-E 2&lt;/a&gt;.&lt;/p&gt;

&lt;h5 id=&quot;-experiment-insights--key-takeaways&quot;&gt;📈 Experiment insights / Key takeaways:&lt;/h5&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;em&gt;Baselines:&lt;/em&gt; GLIDE, AttnGAN, DM-GAN, DF-GAN, XMC-GAN, LAFITE, Make-A-Scene&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Datasets:&lt;/em&gt; MS-COCO&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Metrics:&lt;/em&gt; FID, User Study&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Qualitative:&lt;/em&gt; Yes, the images are as awesome as the authors claim they are&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Quantitative:&lt;/em&gt; The diffusion prior beats the autoregressive prior by several points, while GLIDE wins on photorealism and caption similarity, and unCLIP takes home the diversity trophy from the user study. Playing with the classifier guidance scale lets unCLIP reach GLIDE’s levels of photorealism at the cost of diversity and caption similarity.&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Additional:&lt;/em&gt; The model can be trained without the CLIP text to CLIP image prior, although with a significant drop in quality.&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h5 id=&quot;️-paper-poster&quot;&gt;🖼️ Paper Poster:&lt;/h5&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/dalle2.jpg&quot; alt=&quot;DALL-E 2 poster&quot; title=&quot;DALL-E 2 Poster&quot; /&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;h5 id=&quot;-possible-improvements&quot;&gt;🛠 Possible Improvements:&lt;/h5&gt;

&lt;ul&gt;
  &lt;li&gt;unCLIP struggles with binding different attributes to different objects in the prompt. Supposedly, using an attribute-focused CLIP might help.&lt;/li&gt;
  &lt;li&gt;unCLIP struggles to produce coherent text, which is worsened by the BPE encoding that obscures spelling.&lt;/li&gt;
  &lt;li&gt;unCLIP is not very good at producing details in complex scenes due to the multiple upsampling stages.&lt;/li&gt;
&lt;/ul&gt;

&lt;h5 id=&quot;️my-notes&quot;&gt;✏️My Notes:&lt;/h5&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;em&gt;(Naming: 5/5)&lt;/em&gt; DALL-E 2 or unCLIP, both are great and very memeable!&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;em&gt;(Reader Experience - RX: 3/5)&lt;/em&gt; The figures are great, the captions tell a story, the architecture overview is clear, the samples are diverse, the contributions are obvious, the text is pithy and well written, the math is … tbh I don’t know, I am not reading that crap again. Intuitively understanding diffusion is enough for my satisfaction.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;Can’t wait to compare this to Imagen, obviously&lt;/li&gt;
  &lt;li&gt;I did not know that GPT could be used for diffusion, had to do a double take there&lt;/li&gt;
  &lt;li&gt;I think the most interesting takeaway here is that CLIP is far from perfect when it comes to aligning the image and text latent spaces. I wonder if a better/reworked CLIP model is the next step to improving this pipeline.&lt;/li&gt;
  &lt;li&gt;Not really sure what to take away from the fact that GLIDE achieves more or less the same results as unCLIP. Is it just better marketing this time around?&lt;/li&gt;
  &lt;li&gt;I think that DALL-E 2 is best suited for illustrations and crazy prompts, not for photorealism. I can’t wait for something like AI-dungeon to incorporate DALL-E 2 into it.&lt;/li&gt;
&lt;/ul&gt;

&lt;h5 id=&quot;-links&quot;&gt;🔗 Links:&lt;/h5&gt;

&lt;p&gt;&lt;a href=&quot;https://arxiv.org/pdf/2204.06125.pdf&quot;&gt;DALL-E 2 arxiv&lt;/a&gt; / &lt;a href=&quot;https://github.com/lucidrains/DALLE2-pytorch&quot;&gt;DALL-E 2 GitHub (unofficial)&lt;/a&gt; / &lt;a href=&quot;https://labs.openai.com/waitlist&quot;&gt;DALL-E 2 Demo&lt;/a&gt;&lt;/p&gt;

&lt;h5 id=&quot;-if-you-enjoy-casual-gan-papers-consider-tipping-on-kofi&quot;&gt;💸 If you enjoy Casual GAN Papers, consider tipping on KoFi:&lt;/h5&gt;

&lt;script type=&quot;text/javascript&quot; src=&quot;https://storage.ko-fi.com/cdn/widget/Widget_2.js&quot;&gt;&lt;/script&gt;
&lt;script type=&quot;text/javascript&quot;&gt;kofiwidget2.init('Tip Casual GAN Papers', '#e02863', 'V7V7BXBHV');kofiwidget2.draw();&lt;/script&gt;

&lt;hr /&gt;

&lt;h5 id=&quot;-read-more-popular-ai-paper-summaries&quot;&gt;🔥 Read More Popular AI Paper Summaries:&lt;/h5&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;/text-to-image-vqvae-scene-generation/Make-A-Scene-explained.html&quot;&gt;Mask-conditioned text-to-image - Make-A-Scene Explained&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/high-res-faster-diffusion-democratizing-diffusion/Latent-Disffusion-Models-explained.html&quot;&gt;Democratizing Diffusion Models - Latent Diffusion Models Explained&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/faster-diffusion-models-text-to-image-classifier-free-guidance/GLIDE-explained.html&quot;&gt;Improved Text-to-Image Diffusion Models - GLIDE Explained &lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h5 id=&quot;-thanks-for-reading&quot;&gt;👋 Thanks for reading!&lt;/h5&gt;
&lt;p&gt;&lt;em&gt;If you found this paper digest useful, &lt;strong&gt;subscribe&lt;/strong&gt; and &lt;strong&gt;share&lt;/strong&gt; the post with your friends and colleagues to support Casual GAN Papers!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href=&quot;https://t.me/casual_gans_chat&quot;&gt;Discuss&lt;/a&gt; the paper&lt;/em&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;By: &lt;a href=&quot;https://t.me/joinchat/KeutnzlvetRkZGZi&quot;&gt;@casual_gan&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;P.S. Send me paper suggestions for future posts
&lt;a href=&quot;mailto:kdemochkin@gmail.com&quot;&gt;@KirillDemochkin&lt;/a&gt;!&lt;/p&gt;
</description>
        <pubDate>Tue, 07 Jun 2022 16:00:00 +0000</pubDate>
        <link>https://casualganpapers.com/sota-text-to-image-openai-clip-dalle/DALL-E-2-explained.html</link>
        <guid isPermaLink="true">https://casualganpapers.com/sota-text-to-image-openai-clip-dalle/DALL-E-2-explained.html</guid>
        
        
        <category>sota-text-to-image-openai-clip-dalle</category>
        
      </item>
    
      <item>
        <title>86: LDMs</title>
        <description>&lt;h4 id=&quot;high-resolution-image-synthesis-with-latent-diffusion-models-by-robin-rombach-andreas-blattmann-et-al&quot;&gt;High-Resolution Image Synthesis with Latent Diffusion Models by Robin Rombach, Andreas Blattmann et al.&lt;/h4&gt;

&lt;h5 id=&quot;️paper-difficulty-&quot;&gt;⭐️Paper difficulty: 🌕🌕🌕🌑🌑&lt;/h5&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/ldms_preview.gif&quot; alt=&quot;LDMs Model&quot; title=&quot;LDMs Teaser&quot; /&gt;&lt;/p&gt;

&lt;h5 id=&quot;-at-a-glance&quot;&gt;🎯 At a glance:&lt;/h5&gt;

&lt;p&gt;One of the cleanest pitches for a paper I have seen: diffusion models are way too expensive to train in terms of memory, time, and compute, so let’s make them lighter, faster, and cheaper.&lt;/p&gt;

&lt;p&gt;As for the details, let’s dive in, shall we?&lt;/p&gt;

&lt;script type=&quot;text/javascript&quot; src=&quot;https://storage.ko-fi.com/cdn/widget/Widget_2.js&quot;&gt;&lt;/script&gt;
&lt;script type=&quot;text/javascript&quot;&gt;kofiwidget2.init('Tip Casual GAN Papers', '#2b2f3c', 'V7V7BXBHV');kofiwidget2.draw();&lt;/script&gt;

&lt;hr /&gt;

&lt;h5 id=&quot;️-prerequisites&quot;&gt;⌛️ Prerequisites:&lt;/h5&gt;

&lt;p&gt;&lt;strong&gt;(Highly recommended reading to understand the core contributions of this paper):&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;&lt;a href=&quot;/guided_diffusion_langevin_dynamics_classifier_guidance/Guided-Diffusion-explained.html&quot;&gt;Diffusion Models - ADM&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/vqvae-discrete-vision-transformer/VQGAN.html&quot;&gt;VQGAN&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h5 id=&quot;-motivation&quot;&gt;🚀 Motivation:&lt;/h5&gt;

&lt;p&gt;Diffusion models (DMs) have a more stable training phase than GANs and fewer parameters than autoregressive models, yet they are just really resource intensive. The most powerful DMs require up to 1000 V100-days to train (that’s a lot of $$$ for compute) and about a day per 1000 inference samples. The authors of Latent Diffusion Models (LDMs) pinpoint this problem to the high dimensionality of the pixel space in which the diffusion process occurs, and propose to perform it in a more compact latent space instead. In short, they achieve this feat by pretraining an autoencoder model that learns an efficient, compact latent space that is perceptually equivalent to the pixel space. A DM sandwiched between the convolutional encoder and decoder is then trained inside the latent space in a more computationally efficient way.&lt;/p&gt;

&lt;p&gt;In other words, this is a VQGAN with a DM instead of a transformer (and without a discriminator).&lt;/p&gt;

&lt;hr /&gt;

&lt;h5 id=&quot;-main-ideas&quot;&gt;🔍 Main Ideas:&lt;/h5&gt;

&lt;p&gt;&lt;strong&gt;1. Perceptual Image Compression:&lt;/strong&gt;&lt;br /&gt;
The authors train an autoencoder that outputs a tensor of latent codes. This latent embedding is regularized with vector quantization within the decoder. This is a slight but important change from VQGAN: it means the underlying diffusion model works with continuous latent codes, and the quantization happens afterwards.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Latent Diffusion Models:&lt;/strong&gt;&lt;br /&gt;
As the second part of the two-stage training approach, a diffusion model is trained inside the learned latent space of the autoencoder. I won’t go into details about how the diffusion itself works, as I have covered it in a previous post. What you need to know here is that the denoising model is a UNet that predicts the noise that was added to the latent codes in the previous step of the diffusion process.&lt;/p&gt;
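
&lt;p&gt;For intuition, a single LDM training step might look like this minimal sketch (the noise schedule and call signatures are my assumptions):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import torch
import torch.nn.functional as F

def ldm_training_step(x, encoder, unet, alphas_cumprod):
    # Encode the image into the latent space, noise it at a random
    # timestep, and regress the added noise with the UNet.
    z = encoder(x)  # frozen first-stage encoder
    t = torch.randint(0, len(alphas_cumprod), (z.shape[0],), device=z.device)
    noise = torch.randn_like(z)
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    z_t = a.sqrt() * z + (1 - a).sqrt() * noise  # forward diffusion in latent space
    return F.mse_loss(unet(z_t, t), noise)       # simple noise-prediction objective
&lt;/code&gt;&lt;/pre&gt;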

&lt;p&gt;&lt;strong&gt;3. Conditioning Mechanisms:&lt;/strong&gt;&lt;br /&gt;
The authors utilize domain-specific encoders and cross-attention layers to control the generative model with additional information. Conditions of various modalities, such as text, are passed through their own encoders. The results are incorporated into the generative process via cross-attention with flattened features from the intermediate layers of the UNet.&lt;/p&gt;
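
&lt;p&gt;A rough sketch of such a cross-attention layer, with illustrative dimensions (not the authors’ implementation):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    # Flattened UNet features attend to the encoded condition
    # (e.g. text); keys and values come from the condition encoder.
    def __init__(self, dim, cond_dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, kdim=cond_dim,
                                          vdim=cond_dim, batch_first=True)

    def forward(self, feats, cond):
        b, c, h, w = feats.shape
        q = feats.flatten(2).transpose(1, 2)  # (b, h*w, c) queries
        out, _ = self.attn(q, cond, cond)
        return out.transpose(1, 2).view(b, c, h, w)
&lt;/code&gt;&lt;/pre&gt;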

&lt;h5 id=&quot;-experiment-insights--key-takeaways&quot;&gt;📈 Experiment insights / Key takeaways:&lt;/h5&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;em&gt;Baselines:&lt;/em&gt; LSGM, ADM, StyleGAN, ProjectedGAN, DC-VAE&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Datasets:&lt;/em&gt; ImageNet, CelebA-HQ, LSUN-Churches, LSUN-Bedrooms, MS-COCO&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Metrics:&lt;/em&gt; FID, Perception-Recall&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Qualitative:&lt;/em&gt; x4-x8 compression is the sweet spot for ImageNet&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Quantitative:&lt;/em&gt; LDMs &amp;gt; LSGM, new SOTA FID on CelebA-HQ - 5.11, and all scores are better than other diffusion models (with half the model size and a quarter of the compute), except on LSUN-Bedrooms, where ADM is better&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Additional:&lt;/em&gt; the model can get up to 1024x1024, can be used for inpainting, super-resolution, and semantic synthesis. There are a lot of details about the experiments, but that is the 5-minute gist.&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h5 id=&quot;️-paper-poster&quot;&gt;🖼️ Paper Poster:&lt;/h5&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/ldms.jpg&quot; alt=&quot;LDMs poster&quot; title=&quot;LDMs Poster&quot; /&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;h5 id=&quot;-possible-improvements&quot;&gt;🛠 Possible Improvements:&lt;/h5&gt;

&lt;ul&gt;
  &lt;li&gt;LDMs are still much slower than GANs&lt;/li&gt;
  &lt;li&gt;Pixel-perfect accuracy is a bottleneck for LDMs in certain tasks (Which ones?).&lt;/li&gt;
&lt;/ul&gt;

&lt;h5 id=&quot;️my-notes&quot;&gt;✏️My Notes:&lt;/h5&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;em&gt;(Naming: 3.5/5)&lt;/em&gt; The name “LDM” is as straightforward as the problem the paper addresses. It is an easy-to-pronounce acronym, but not a word and definitely not a meme.&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;em&gt;(Reader Experience - RX: 3/5)&lt;/em&gt; Right away, kudos for explicitly listing all of the core contributions of the paper right where they belong - at the end of the introduction. I am going to dock a point for visually inconsistent figures. They are all over the place. Moreover, the small font size in the tables is very hard to read, especially with how packed the tables appear. And why are the images so tiny? Can you even make out what is in Figure 8? What is the purpose of putting in figures that you can’t read? It would probably be better to cut one or two out to make the rest more readable. Finally, the results tables are very hard to read, because different baselines in a different order are used for different datasets.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;I can’t help but draw parallels between Latent Diffusion and StyleNeRF papers - sandwiching an expensive operation (Diffusion &amp;amp; Ray Marching) between a convolutional encoder-decoder to reduce computational costs and memory requirements by performing the operation in spatially-condensed latent space.&lt;/li&gt;
  &lt;li&gt;Let’s think for a second: what other ideas from DNR &amp;amp; StyleNeRF could further improve diffusion models? One idea I can see being useful is the “NeRF path regularization”, which means, in terms of DMs, training a low-resolution DM alongside a high-resolution LDM and adding a loss that matches subsampled pixels of the LDM to the pixels in the DM&lt;/li&gt;
  &lt;li&gt;It should be possible to interpolate between codes in the learned latent space. Not sure how exactly this could be used, but it is probably worth looking into&lt;/li&gt;
&lt;/ul&gt;

&lt;h5 id=&quot;-links&quot;&gt;🔗 Links:&lt;/h5&gt;

&lt;p&gt;&lt;a href=&quot;https://arxiv.org/pdf/2112.10752.pdf&quot;&gt;Latent Diffusion Models arxiv&lt;/a&gt; / &lt;a href=&quot;https://github.com/CompVis/latent-diffusion&quot;&gt;Latent Diffusion Models GitHub&lt;/a&gt;&lt;/p&gt;

&lt;h5 id=&quot;-if-you-enjoy-casual-gan-papers-consider-tipping-on-kofi&quot;&gt;💸 If you enjoy Casual GAN Papers, consider tipping on KoFi:&lt;/h5&gt;

&lt;script type=&quot;text/javascript&quot; src=&quot;https://storage.ko-fi.com/cdn/widget/Widget_2.js&quot;&gt;&lt;/script&gt;
&lt;script type=&quot;text/javascript&quot;&gt;kofiwidget2.init('Tip Casual GAN Papers', '#e02863', 'V7V7BXBHV');kofiwidget2.draw();&lt;/script&gt;

&lt;hr /&gt;

&lt;h5 id=&quot;-read-more-popular-ai-paper-summaries&quot;&gt;🔥 Read More Popular AI Paper Summaries:&lt;/h5&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;/text-to-image-vqvae-scene-generation/Make-A-Scene-explained.html&quot;&gt;Mask-conditioned text-to-image - Make-A-Scene Explained&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/motion-interpolation-image-animation-implicit-model/FILM-explained.html&quot;&gt;How to generate videos from static images - FILM Explained&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/faster-diffusion-models-text-to-image-classifier-free-guidance/GLIDE-explained.html&quot;&gt;Improved Text-to-Image Diffusion Models - GLIDE Explained &lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h5 id=&quot;-thanks-for-reading&quot;&gt;👋 Thanks for reading!&lt;/h5&gt;
&lt;p&gt;&lt;em&gt;If you found this paper digest useful, &lt;strong&gt;subscribe&lt;/strong&gt; and &lt;strong&gt;share&lt;/strong&gt; the post with your friends and colleagues to support Casual GAN Papers!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href=&quot;https://t.me/casual_gans_chat&quot;&gt;Discuss&lt;/a&gt; the paper&lt;/em&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;By: &lt;a href=&quot;https://t.me/joinchat/KeutnzlvetRkZGZi&quot;&gt;@casual_gan&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;P.S. Send me paper suggestions for future posts
&lt;a href=&quot;mailto:kdemochkin@gmail.com&quot;&gt;@KirillDemochkin&lt;/a&gt;!&lt;/p&gt;
</description>
        <pubDate>Tue, 03 May 2022 16:00:00 +0000</pubDate>
        <link>https://casualganpapers.com/high-res-faster-diffusion-democratizing-diffusion/Latent-Disffusion-Models-explained.html</link>
        <guid isPermaLink="true">https://casualganpapers.com/high-res-faster-diffusion-democratizing-diffusion/Latent-Disffusion-Models-explained.html</guid>
        
        
        <category>high-res-faster-diffusion-democratizing-diffusion</category>
        
      </item>
    
      <item>
        <title>85: GLIDE</title>
        <description>&lt;h4 id=&quot;glide-towards-photorealistic-image-generation-and-editing-with-text-guided-diffusion-models-by-alex-nichol-et-al-explained-in-5-minutes&quot;&gt;GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models by Alex Nichol et al. explained in 5 minutes&lt;/h4&gt;

&lt;h5 id=&quot;️paper-difficulty-&quot;&gt;⭐️Paper difficulty: 🌕🌕🌕🌕🌑&lt;/h5&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/GLIDE_preview.png&quot; alt=&quot;GLIDE Model&quot; title=&quot;GLIDE Teaser&quot; /&gt;&lt;/p&gt;

&lt;h5 id=&quot;-at-a-glance&quot;&gt;🎯 At a glance:&lt;/h5&gt;

&lt;p&gt;“Diffusion models beat GANs”. While true, the statement comes with several ifs and buts, not to mention that the math behind diffusion models is not for the faint of heart. Well, GLIDE, an OpenAI paper from last December, took a big step towards making it true in every sense. Specifically, it introduced a new guidance method for diffusion models that produces higher quality images than even DALL-E, which uses expensive CLIP reranking. And if that wasn’t impressive enough, GLIDE models can be fine-tuned for various downstream tasks such as inpainting and text-based editing.&lt;/p&gt;

&lt;p&gt;As for the details, let’s dive in, shall we?&lt;/p&gt;

&lt;script type=&quot;text/javascript&quot; src=&quot;https://storage.ko-fi.com/cdn/widget/Widget_2.js&quot;&gt;&lt;/script&gt;
&lt;script type=&quot;text/javascript&quot;&gt;kofiwidget2.init('Tip Casual GAN Papers', '#2b2f3c', 'V7V7BXBHV');kofiwidget2.draw();&lt;/script&gt;

&lt;hr /&gt;

&lt;h5 id=&quot;️-prerequisites&quot;&gt;⌛️ Prerequisites:&lt;/h5&gt;

&lt;p&gt;&lt;strong&gt;(Highly recommended reading to understand the core contributions of this paper):&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;&lt;a href=&quot;/guided_diffusion_langevin_dynamics_classifier_guidance/Guided-Diffusion-explained.html&quot;&gt;Diffusion Models - ADM&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/zero-shot-contrastive-loss-image-text-pretraining/CLIP-explained.html&quot;&gt;CLIP&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h5 id=&quot;-motivation&quot;&gt;🚀 Motivation:&lt;/h5&gt;

&lt;p&gt;It used to be that with diffusion models you could boost quality at the cost of some diversity with the classifier guidance technique. However, vanilla classifier guidance requires a pretrained classifier that outputs class labels, which is not very suitable for text. Recently though, a new classifier-free guidance approach was introduced. It comes with two advantages: the model uses its own knowledge for guidance instead of relying on an external classifier, and it greatly simplifies guidance when it isn’t possible to directly predict a label, which should sound familiar to fans of text-to-image models.&lt;/p&gt;

&lt;hr /&gt;

&lt;h5 id=&quot;-main-ideas&quot;&gt;🔍 Main Ideas:&lt;/h5&gt;

&lt;p&gt;&lt;strong&gt;1. Model:&lt;/strong&gt;&lt;br /&gt;
The authors take the ADM (the standard diffusion model) architecture and extend it with text conditioning information. This is done by passing the caption through a transformer model and using the encoded vector in place of the class embedding. Additionally, the last layer of token embeddings is projected to the dimensionality of every attention layer of the ADM model, and concatenated to the attention context of that layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Classifier-free guidance:&lt;/strong&gt;&lt;br /&gt;
After the initial training routine, the model is fine-tuned for classifier-free guidance with 20% of the captions replaced with empty sequences. This enables the model to synthesize both conditional and unconditional samples. Hence, during inference two outputs are sampled from the model in parallel. One is conditioned on the text prompt, while the other is unconditional and gets extrapolated towards the conditional sample at each diffusion step with a predetermined magnitude.&lt;/p&gt;
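
&lt;p&gt;In code, one guided denoising step might look like this sketch (call signatures are assumed; this is not GLIDE’s actual implementation):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def guided_eps(unet, x_t, t, text_emb, empty_emb, scale):
    # Classifier-free guidance: extrapolate the unconditional noise
    # prediction towards the conditional one with a fixed magnitude.
    eps_cond = unet(x_t, t, text_emb)
    eps_uncond = unet(x_t, t, empty_emb)  # caption replaced by an empty sequence
    return eps_uncond + scale * (eps_cond - eps_uncond)
&lt;/code&gt;&lt;/pre&gt;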

&lt;p&gt;&lt;strong&gt;3. Image Inpainting:&lt;/strong&gt;&lt;br /&gt;
The simplest way to do inpainting with diffusion models is to replace the known region of the image with a noised version of the input after each sampling step. However, this approach is prone to edge artifacts, since the model cannot see the entire context during the sampling process. To combat this effect, the authors of GLIDE fine-tune the model by erasing random regions of the input images and concatenating the remaining regions with a mask channel.&lt;/p&gt;
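
&lt;p&gt;The naive replace-after-each-step loop that this fine-tuning improves upon might look like this sketch (sample_fn and noise_fn are hypothetical helpers for one reverse and one forward diffusion step):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def inpaint_step(x_t, t, x_known, mask, sample_fn, noise_fn):
    # mask is 1 where the image should be generated, 0 where it is known.
    x_t = sample_fn(x_t, t)  # one reverse (denoising) step
    x_t = mask * x_t + (1 - mask) * noise_fn(x_known, t)  # paste back noised known region
    return x_t
&lt;/code&gt;&lt;/pre&gt;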

&lt;p&gt;&lt;strong&gt;4. Noised CLIP:&lt;/strong&gt;&lt;br /&gt;
The authors of GLIDE noticed that CLIP-guided diffusion cannot handle intermediary samples very well, since they are noised and most likely fall outside the distribution of the pretrained, publicly available CLIP model. As a pretty simple fix, they train a new CLIP model on noised images to make the diffusion-guiding process more robust.&lt;/p&gt;

&lt;h5 id=&quot;-experiment-insights--key-takeaways&quot;&gt;📈 Experiment insights / Key takeaways:&lt;/h5&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;em&gt;Baselines:&lt;/em&gt; DALL-E, LAFITE, XMC-GAN (second best), DF-GAN, DM-GAN, AttnGAN&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Datasets:&lt;/em&gt; MS-COCO&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Metrics:&lt;/em&gt; Human perception, CLIP score, FID, Precision-Recall&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Qualitative:&lt;/em&gt; Classifier-free guided samples look visually more appealing than CLIP-guided images. GLIDE has compositional and object-centric properties.&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Quantitative:&lt;/em&gt; Classifier-free guidance is nearly Pareto optimal in terms of FID vs IS, Precision vs Recall, and CLIP score vs FID. The takeaway is that CLIP guidance finds adversarial samples for CLIP instead of the most realistic ones.&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h5 id=&quot;️-paper-poster&quot;&gt;🖼️ Paper Poster:&lt;/h5&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/GLIDE.jpg&quot; alt=&quot;GLIDE poster&quot; title=&quot;GLIDE Poster&quot; /&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;h5 id=&quot;-possible-improvements&quot;&gt;🛠 Possible Improvements:&lt;/h5&gt;

&lt;ul&gt;
  &lt;li&gt;From the model card: “Despite the dataset filtering applied before training, GLIDE (filtered) continues to exhibit biases that extend beyond those found in images of people.”&lt;/li&gt;
  &lt;li&gt;Unrealistic and out-of-distribution prompts are not handled well, meaning that GLIDE samples are limited by the concepts present in the training data.&lt;/li&gt;
&lt;/ul&gt;

&lt;h5 id=&quot;️my-notes&quot;&gt;✏️My Notes:&lt;/h5&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;em&gt;(Naming: 4/5)&lt;/em&gt; Memorable but not a meme.&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;em&gt;(Reader Experience - RX: 3/5)&lt;/em&gt; While the samples are presented in a very clean and consistent manner (except for Figure 5, which does not fit on the screen - an issue because the models are arranged row-wise, so you need to scroll back and forth to compare samples across models), the strange order and naming of the paper sections and the lack of an architecture overview figure threw me for a loop. Moreover, the structure of the paper is quite unorthodox, as most of the information about the proposed method is actually hidden in the background section rather than in the typical “The Proposed Method” section, which is simply called “Training” here and contains configuration details I would expect to see at the beginning of the “Experiments” section.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;Classifier-free guidance reminds me of the good ol’ truncation trick from StyleGAN&lt;/li&gt;
  &lt;li&gt;Props to the authors for citing Katherine Crowson&lt;/li&gt;
  &lt;li&gt;TBH I wonder how the heck 64x64 CLIP even works. I don’t think I could compare images to captions at that resolution with my eyes, not to mention a model&lt;/li&gt;
  &lt;li&gt;Not sure how I feel about the whole “this model is not safe, hence we won’t release it” narrative that OpenAI is trying to spin since they clearly intend to monetize these huge AI models.&lt;/li&gt;
&lt;/ul&gt;

&lt;h5 id=&quot;-links&quot;&gt;🔗 Links:&lt;/h5&gt;

&lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/2112.10741&quot;&gt;GLIDE arxiv&lt;/a&gt; / &lt;a href=&quot;https://github.com/openai/glide-text2im&quot;&gt;GLIDE GitHub&lt;/a&gt;&lt;/p&gt;

&lt;h5 id=&quot;-if-you-enjoy-casual-gan-papers-consider-tipping-on-kofi&quot;&gt;💸 If you enjoy Casual GAN Papers, consider tipping on KoFi:&lt;/h5&gt;

&lt;script type=&quot;text/javascript&quot; src=&quot;https://storage.ko-fi.com/cdn/widget/Widget_2.js&quot;&gt;&lt;/script&gt;
&lt;script type=&quot;text/javascript&quot;&gt;kofiwidget2.init('Tip Casual GAN Papers', '#e02863', 'V7V7BXBHV');kofiwidget2.draw();&lt;/script&gt;

&lt;hr /&gt;

&lt;h5 id=&quot;-read-more-popular-ai-paper-summaries&quot;&gt;🔥 Read More Popular AI Paper Summaries:&lt;/h5&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;/text-to-image-vqvae-scene-generation/Make-A-Scene-explained.html&quot;&gt;Mask-conditioned text-to-image - Make-A-Scene Explained&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/motion-interpolation-image-animation-implicit-model/FILM-explained.html&quot;&gt;How to generate videos from static images - FILM Explained&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/fast-vqgan-clip-text-to-image-generation/CLIP-GEN-explained.html&quot;&gt;How to do text-to-image without training on texts - CLIP-GEN Explained &lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h5 id=&quot;-thanks-for-reading&quot;&gt;👋 Thanks for reading!&lt;/h5&gt;
&lt;p&gt;&lt;em&gt;If you found this paper digest useful, &lt;strong&gt;subscribe&lt;/strong&gt; and &lt;strong&gt;share&lt;/strong&gt; the post with your friends and colleagues to support Casual GAN Papers!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href=&quot;https://t.me/casual_gans_chat&quot;&gt;Discuss&lt;/a&gt; the paper&lt;/em&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;By: &lt;a href=&quot;https://t.me/joinchat/KeutnzlvetRkZGZi&quot;&gt;@casual_gan&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;P.S. Send me paper suggestions for future posts
&lt;a href=&quot;mailto:kdemochkin@gmail.com&quot;&gt;@KirillDemochkin&lt;/a&gt;!&lt;/p&gt;
</description>
        <pubDate>Mon, 25 Apr 2022 16:00:00 +0000</pubDate>
        <link>https://casualganpapers.com/faster-diffusion-models-text-to-image-classifier-free-guidance/GLIDE-explained.html</link>
        <guid isPermaLink="true">https://casualganpapers.com/faster-diffusion-models-text-to-image-classifier-free-guidance/GLIDE-explained.html</guid>
        
        
        <category>faster-diffusion-models-text-to-image-classifier-free-guidance</category>
        
      </item>
    
      <item>
        <title>84: Make-A-Scene</title>
        <description>&lt;h4 id=&quot;make-a-scene-scene-based-text-to-image-generation-with-human-priors-by-oran-gafni-et-al-explained-in-5-minutes&quot;&gt;Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors by Oran Gafni et al. explained in 5 minutes&lt;/h4&gt;

&lt;h5 id=&quot;️paper-difficulty-&quot;&gt;⭐️Paper difficulty: 🌕🌕🌕🌕🌑&lt;/h5&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/Make-A-Scene-Preview.png&quot; alt=&quot;Make-A-Scene Model&quot; title=&quot;Make-A-Scene Teaser&quot; /&gt;&lt;/p&gt;

&lt;h5 id=&quot;-at-a-glance&quot;&gt;🎯 At a glance:&lt;/h5&gt;

&lt;p&gt;The authors of Make-A-Scene propose a novel text-to-image method that leverages the information from an additional input condition called a “scene” in the form of segmentation tokens to improve the quality of generated images and enable scene editing, out-of-distribution prompts, and text-editing of anchor scenes.&lt;/p&gt;

&lt;p&gt;As for the details, let’s dive in, shall we?&lt;/p&gt;

&lt;script type=&quot;text/javascript&quot; src=&quot;https://storage.ko-fi.com/cdn/widget/Widget_2.js&quot;&gt;&lt;/script&gt;
&lt;script type=&quot;text/javascript&quot;&gt;kofiwidget2.init('Tip Casual GAN Papers', '#2b2f3c', 'V7V7BXBHV');kofiwidget2.draw();&lt;/script&gt;

&lt;hr /&gt;

&lt;h5 id=&quot;️-prerequisites&quot;&gt;⌛️ Prerequisites:&lt;/h5&gt;

&lt;p&gt;&lt;strong&gt;(Highly recommended reading to understand the core contributions of this paper):&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;&lt;a href=&quot;/vqvae-discrete-vision-transformer/VQGAN.html&quot;&gt;VQGAN&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/faster-diffusion-models-text-to-image-classifier-free-guidance/GLIDE-explained.html&quot;&gt;GLIDE&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h5 id=&quot;-motivation&quot;&gt;🚀 Motivation:&lt;/h5&gt;

&lt;p&gt;Existing methods for text-to-image synthesis are good, yet they lack in several key areas. First, controllability: only so much information about the desired image can be passed to the model through a text prompt alone. For example, structural composition is notoriously difficult to control with text. Second, existing models are not very good at generating faces and coherent humans in general. Finally, previous SOTA models are limited to 256 by 256 pixels, whereas the proposed approach works at 512 resolution.&lt;/p&gt;

&lt;hr /&gt;

&lt;h5 id=&quot;-main-ideas&quot;&gt;🔍 Main Ideas:&lt;/h5&gt;

&lt;p&gt;&lt;strong&gt;1. Overview:&lt;/strong&gt;&lt;br /&gt;
The model is essentially three encoders that produce discrete tokens, a transformer that learns to sample sequences of tokens conditioned on the provided context, and a decoder that generates images from a sequence of tokens sampled from the transformer. The entire pipeline works in a feed-forward manner with classifier-free guidance to avoid expensive reranking.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Scene Representation and Tokenization:&lt;/strong&gt;&lt;br /&gt;
The input semantic scene map is encoded by a VQ-VAE-like model called VQ-SEG that works on segmentation maps. The scene map is made up of 3 groups of channels: panoptic (objects), humans, and faces, as well as 1 extra channel for the edges between classes and instances.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Face-aware VQ:&lt;/strong&gt;&lt;br /&gt;
The authors duly note that text-to-image models struggle to synthesize coherent human faces, hence additional attention is required to smooth things out a bit. On the one hand, the authors use a feature matching loss in the regions of the image containing faces, comparing the activations of a pretrained face-embedding network. On the other hand, a weighted binary cross-entropy loss is added to the segmentation predictions of the face regions along with the edges to promote the importance of small facial details.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Scene Transformer:&lt;/strong&gt;&lt;br /&gt;
The regular objects in the panoptic regions also get the feature-matching treatment similar to the face regions. The tokens for the text, the panoptic and human region embeddings along with the face embeddings are passed to an autoregressive transformer (GPT) that outputs a sequence of tokens for the decoder.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Classifier-free Guidance:&lt;/strong&gt;&lt;br /&gt;
This neat little trick is used to improve the fidelity and diversity of generated samples. Essentially, instead of using a classifier to “guide” the sample towards a cluster of samples from a class, which tends to reduce the diversity of generated samples, a technique similar to truncation is used. During training, some of the text inputs are replaced with blanks, representing the unconditional samples. During inference, two token sequences are generated in parallel: one with the blank text, and the other with the text prompt as a condition. The unconditioned token is then pushed towards the conditioned one with a predefined weight, then both streams receive the new token and the process is repeated until all tokens have been generated. More on this in the GLIDE post.&lt;/p&gt;
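
&lt;p&gt;Applied to transformer logits, one guided decoding step might look like this sketch (call signatures are assumed, not the paper’s code):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def guided_token_logits(transformer, seq_cond, seq_uncond, scale):
    # Push the unconditional stream's next-token logits towards the
    # conditional stream's before sampling the next token.
    logits_c = transformer(seq_cond)[:, -1]    # conditioned on the text prompt
    logits_u = transformer(seq_uncond)[:, -1]  # conditioned on blank text
    return logits_u + scale * (logits_c - logits_u)
&lt;/code&gt;&lt;/pre&gt;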

&lt;h5 id=&quot;-experiment-insights--key-takeaways&quot;&gt;📈 Experiment insights / Key takeaways:&lt;/h5&gt;

&lt;ul&gt;
  &lt;li&gt;Baselines: DALL-E, GLIDE, CogView, LAFITE, XMC-GAN, DF-GAN, DM-GAN, AttnGAN&lt;/li&gt;
  &lt;li&gt;Datasets: CC12M, CC, YFCC100m, Redcaps, MS-COCO&lt;/li&gt;
  &lt;li&gt;Metrics: FID, human evaluation&lt;/li&gt;
  &lt;li&gt;Quantitative: Lowest FID, second best is GLIDE with 0.4 difference&lt;/li&gt;
  &lt;li&gt;Qualitative: Generating images out of distribution, Scene editing and resampling with different prompts&lt;/li&gt;
  &lt;li&gt;Ablations: biggest contribution to the score is from classifier-free guidance and object-aware vector quantization&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h5 id=&quot;️-paper-poster&quot;&gt;🖼️ Paper Poster:&lt;/h5&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/Make-A-Scene.jpg&quot; alt=&quot;Make-A-Scene poster&quot; title=&quot;Make-A-Scene Poster&quot; /&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;h5 id=&quot;-possible-improvements&quot;&gt;🛠 Possible Improvements:&lt;/h5&gt;

&lt;ul&gt;
  &lt;li&gt;It would be nice to provide an image as a style reference in addition to the text and a scene segmentation&lt;/li&gt;
&lt;/ul&gt;

&lt;h5 id=&quot;️my-notes&quot;&gt;✏️My Notes:&lt;/h5&gt;

&lt;ul&gt;
  &lt;li&gt;(Naming: 4.5/5) - Funny name, easy to pronounce&lt;/li&gt;
  &lt;li&gt;(Reader Experience - RX: 3.5/5) My main issue with this paper as a reader is the absence of a detailed model architecture diagram and the choice of colors throughout the paper. The chosen color scheme is all over the place and just doesn’t look clean and professional. The figure captions are good. The math is fairly easy to follow. Overall, the paper has a nice flow.&lt;/li&gt;
  &lt;li&gt;This approach would probably mix well with Duplex Attention from GANsformer&lt;/li&gt;
  &lt;li&gt;I still find it befuddling that for image-based tasks authors insist on processing tokens in scan order without any regard to their spatial positions. In other words, check out MaskGIT.&lt;/li&gt;
&lt;/ul&gt;

&lt;h5 id=&quot;-links&quot;&gt;🔗 Links:&lt;/h5&gt;

&lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/2203.13131v1&quot;&gt;Make-A-Scene arxiv&lt;/a&gt; / &lt;a href=&quot;https://github.com/CasualGANPapers/Make-A-Scene&quot;&gt;Make-A-Scene GitHub&lt;/a&gt; - by Casual GAN Papers Community&lt;/p&gt;

&lt;h5 id=&quot;-if-you-enjoy-casual-gan-papers-consider-tipping-on-kofi&quot;&gt;💸 If you enjoy Casual GAN Papers, consider tipping on KoFi:&lt;/h5&gt;

&lt;script type=&quot;text/javascript&quot; src=&quot;https://storage.ko-fi.com/cdn/widget/Widget_2.js&quot;&gt;&lt;/script&gt;
&lt;script type=&quot;text/javascript&quot;&gt;kofiwidget2.init('Tip Casual GAN Papers', '#e02863', 'V7V7BXBHV');kofiwidget2.draw();&lt;/script&gt;

&lt;hr /&gt;

&lt;h5 id=&quot;-read-more-popular-ai-paper-summaries&quot;&gt;🔥 Read More Popular AI Paper Summaries:&lt;/h5&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;/improved-vqgan-inpainting-outpainting-conditional-editing/MaskGIT-explained.html&quot;&gt;Improved VQGAN - MaskGIT explained&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/motion-interpolation-image-animation-implicit-model/FILM-explained.html&quot;&gt;How to generate videos from static images - FILM explained&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/fast-vqgan-clip-text-to-image-generation/CLIP-GEN-explained.html&quot;&gt;How to do text-to-image without training on texts - CLIP-GEN explained &lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h5 id=&quot;-thanks-for-reading&quot;&gt;👋 Thanks for reading!&lt;/h5&gt;
&lt;p&gt;&lt;em&gt;If you found this paper digest useful, &lt;strong&gt;subscribe&lt;/strong&gt; and &lt;strong&gt;share&lt;/strong&gt; the post with your friends and colleagues to support Casual GAN Papers!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href=&quot;https://t.me/casual_gans_chat&quot;&gt;Discuss&lt;/a&gt; the paper&lt;/em&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;By: &lt;a href=&quot;https://t.me/joinchat/KeutnzlvetRkZGZi&quot;&gt;@casual_gan&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;P.S. Send me paper suggestions for future posts
&lt;a href=&quot;mailto:kdemochkin@gmail.com&quot;&gt;@KirillDemochkin&lt;/a&gt;!&lt;/p&gt;
</description>
        <pubDate>Tue, 12 Apr 2022 16:00:00 +0000</pubDate>
        <link>https://casualganpapers.com/text-to-image-vqvae-scene-generation/Make-A-Scene-explained.html</link>
        <guid isPermaLink="true">https://casualganpapers.com/text-to-image-vqvae-scene-generation/Make-A-Scene-explained.html</guid>
        
        
        <category>text-to-image-VQVAE-scene-generation</category>
        
      </item>
    
      <item>
        <title>Reader Experience Design - 1</title>
        <description>&lt;h4 id=&quot;rx1-how-to-write-papers-that-are-easy-to-read-by-casual-gan-papers&quot;&gt;RX1: How to write papers that are easy to read? by Casual GAN Papers&lt;/h4&gt;

&lt;h5 id=&quot;tutorial-difficulty-&quot;&gt;⭐Tutorial difficulty: 🌕🌕🌑🌑🌑&lt;/h5&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/paper.gif&quot; alt=&quot;Write better papers&quot; title=&quot;Write better papers&quot; /&gt;&lt;/p&gt;

&lt;h5 id=&quot;-at-a-glance&quot;&gt;🎯 At a glance:&lt;/h5&gt;

&lt;p&gt;There exist only two types of papers: ones that are fun to read, and ones that are a chore to get to the end of. You can imagine that I very much prefer one to the other, and from what I can see, so does everyone else.&lt;/p&gt;

&lt;p&gt;Now, whether you have published multiple papers or you are a first-time author working on your thesis, you can follow some simple tips to make your paper much easier on the readers. You already got 80% of the way there by doing the research; however, nailing the last 20% and producing a well-written paper is crucial to getting the attention your research deserves.&lt;/p&gt;

&lt;p&gt;Here are 5 easy tips that anyone can follow to write better papers.&lt;/p&gt;

&lt;hr /&gt;

&lt;h5 id=&quot;-motivation&quot;&gt;🚀 Motivation:&lt;/h5&gt;

&lt;p&gt;I’ll be blunt. Reading even a couple of pages from a poorly written paper is agonizing. Don’t torture the reviewers who have to go through your paper, make their job easy. A happy reviewer is much more likely to give you a good score, and get you one step closer to getting that “A” on a school project, defending your thesis or getting accepted to a conference.&lt;/p&gt;

&lt;hr /&gt;

&lt;h5 id=&quot;-main-ideas&quot;&gt;🔍 Main Ideas:&lt;/h5&gt;

&lt;p&gt;&lt;strong&gt;1) Check your spelling and grammar.&lt;/strong&gt;&lt;br /&gt;
There are simply no excuses for lousy grammar and spelling errors in papers. They make your text look unprofessional, lacking in polish, and most importantly - hard for the reader to follow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2) Explicitly list your contributions in the introduction and conclusion.&lt;/strong&gt;&lt;br /&gt;
Don’t make the reader guess what the point of your paper is. Make it obvious by clearly listing out the main contributions at the end of the introduction and again in the conclusion. Focus on the 2-3 biggest problems you solved. This should not be a boring changelog of all the things you tried and the minor issues you overcame; it should be a knockout punch that makes you look like David vanquishing Goliath.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3) Assume your reader knows nothing.&lt;/strong&gt;&lt;br /&gt;
The easiest way to confuse your reader is by throwing around model names and abbreviations without first introducing them. Unless you are talking about a convolutional layer or matrix multiplication assume that the reader does not know what any of the acronyms mean. Even the good old MLP should be first explicitly written out as “multi-layer perceptron (MLP)” before defaulting to the short-hand notation. It goes without saying that this rule is even more important for more exotic concepts and acronyms. Remember that unlike you, the reader most likely did not spend the last three to six months going on a deep dive into the subject of the paper.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4) Write self-sufficient descriptions for figures.&lt;/strong&gt;&lt;br /&gt;
Every figure and table should tell a story. If the reader can understand every figure just by reading the descriptions, you are doing a good job. Writing descriptions that are self-sufficient is essential to hook the reader, since people tend to start with the figures when reading papers. However, writing a good description for each figure and table alone is not enough, always reference each figure and table in the text at least once and reiterate what is written in the description. Do not assume that the reader will remember what is on “Figure 4” and “Table 3” when they are mentioned in the text just because you wrote it in the corresponding description.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5) Explain every formula in the text.&lt;/strong&gt;&lt;br /&gt;
Just dropping raw math on your reader is a sure way to lose her attention, and it is especially bad if it involves multiple new variables. Explain every formula and the meaning of all of the new Greek letters in it like you are talking to a five-year-old. Put it in the text right before or after introducing the math. Surprisingly, omitting the mathematical notation altogether and explaining every concept and loss function with words is also a bad idea, as the entire purpose of formulas is to give a concise representation of an idea in a canonical notation that is agreed upon.&lt;/p&gt;

&lt;h5 id=&quot;️key-takeaways&quot;&gt;✏️Key takeaways:&lt;/h5&gt;

&lt;p&gt;Check your spelling, clearly state your contributions, explain things, write good captions, support formulas with text, and you will make any paper you write that much better!&lt;/p&gt;

&lt;h5 id=&quot;-if-you-enjoy-my-posts-consider-donating-eth-to-support-casual-gan-papers&quot;&gt;💸 If you enjoy my posts, consider donating ETH to support Casual GAN Papers:&lt;/h5&gt;
&lt;p&gt;0x7c2a650437f58664BED719D680F2BE120bE5623b&lt;/p&gt;

&lt;hr /&gt;

&lt;h5 id=&quot;-read-more-popular-ai-paper-summaries&quot;&gt;🔥 Read More Popular AI Paper Summaries:&lt;/h5&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;/improved-vqgan-inpainting-outpainting-conditional-editing/MaskGIT-explained.html&quot;&gt;Improved VQGAN - MaskGIT explained&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/motion-interpolation-image-animation-implicit-model/FILM-explained.html&quot;&gt;How to generate videos from static images - FILM explained&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/fast-vqgan-clip-text-to-image-generation/CLIP-GEN-explained.html&quot;&gt;One step CLIP+VQGAN - CLIP-GEN explained&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h5 id=&quot;-thanks-for-reading&quot;&gt;👋 Thanks for reading!&lt;/h5&gt;
&lt;p&gt;&lt;a href=&quot;https://www.patreon.com/bePatron?u=53448948&quot; data-patreon-widget-type=&quot;become-patron-button&quot;&gt;Join Patreon for Exclusive Perks!&lt;/a&gt;&lt;script async=&quot;&quot; src=&quot;https://c6.patreon.com/becomePatronButton.bundle.js&quot;&gt;&lt;/script&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you found this paper digest useful, &lt;strong&gt;subscribe&lt;/strong&gt; and &lt;strong&gt;share&lt;/strong&gt; the post with your friends and colleagues to support Casual GAN Papers!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href=&quot;https://t.me/joinchat/KeutnzlvetRkZGZi&quot;&gt;Join the Casual GAN Papers telegram channel&lt;/a&gt; to stay up to date with new AI Papers!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href=&quot;https://discord.gg/hCRMGRZkC6&quot;&gt;Join the Casual GAN Papers Discord&lt;/a&gt; to become part of the Casual GAN Papers community&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href=&quot;https://t.me/casual_gans_chat&quot;&gt;Discuss&lt;/a&gt; the paper&lt;/em&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;By: &lt;a href=&quot;https://t.me/joinchat/KeutnzlvetRkZGZi&quot;&gt;@casual_gan&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;P.S. Send me paper suggestions for future posts
&lt;a href=&quot;mailto:kdemochkin@gmail.com&quot;&gt;@KirillDemochkin&lt;/a&gt;!&lt;/p&gt;
</description>
        <pubDate>Thu, 31 Mar 2022 16:00:00 +0000</pubDate>
        <link>https://casualganpapers.com/improve-writing-better-papers-clear-communication/How-To-Write-Better-Papers.html</link>
        <guid isPermaLink="true">https://casualganpapers.com/improve-writing-better-papers-clear-communication/How-To-Write-Better-Papers.html</guid>
        
        
        <category>improve-writing-better-papers-clear-communication</category>
        
      </item>
    
      <item>
        <title>83: CLIP-GEN</title>
        <description>&lt;h4 id=&quot;clip-gen-language-free-training-of-a-text-to-image-generator-with-clip-by-zihao-wang-et-al-explained-in-5-minutes&quot;&gt;CLIP-GEN: Language-Free Training of a Text-to-Image Generator with CLIP by Zihao Wang et al. explained in 5 minutes&lt;/h4&gt;

&lt;h5 id=&quot;️paper-difficulty-&quot;&gt;⭐️Paper difficulty: 🌕🌕🌕🌑🌑&lt;/h5&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/clipgen_preview.png&quot; alt=&quot;CLIP-GEN Model&quot; title=&quot;CLIP-GEN Teaser&quot; /&gt;&lt;/p&gt;

&lt;h5 id=&quot;-at-a-glance&quot;&gt;🎯 At a glance:&lt;/h5&gt;

&lt;p&gt;Text-to-image generation models have been in the spotlight since last year, with the VQGAN+CLIP combo garnering perhaps the most attention from the generative art community. Zihao Wang and the team at ByteDance present a clever twist on that idea. Instead of doing iterative optimization, the authors leverage CLIP’s shared text-image latent space to generate an image from text with a VQGAN decoder guided by CLIP in just a single step! The resulting images are diverse and on par with SOTA text-to-image generators such as DALL-E and CogView.&lt;/p&gt;

&lt;p&gt;As for the details, let’s dive in, shall we?&lt;/p&gt;

&lt;h5 id=&quot;️-prerequisites&quot;&gt;⌛️ Prerequisites:&lt;/h5&gt;

&lt;p&gt;&lt;strong&gt;(Highly recommended reading to understand the core contributions of this paper):&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;&lt;a href=&quot;/vqvae-discrete-vision-transformer/VQGAN.html&quot;&gt;VQGAN&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/zero-shot-contrastive-loss-image-text-pretraining/CLIP-explained.html&quot;&gt;CLIP&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;hr /&gt;

&lt;h5 id=&quot;-motivation&quot;&gt;🚀 Motivation:&lt;/h5&gt;

&lt;p&gt;As demonstrated on Twitter by some very smart people, the generative capabilities of VQGAN+CLIP are limited only by the artist’s imagination. That, and the fact that the iterative update process is very slow, which is exactly the problem that the authors of CLIP-GEN attempt to solve. Their pitch is simple: encode a text query with CLIP, turn the encoded feature vector into a series of latent codes from a pretrained VQGAN codebook, and decode these tokens into an image with a pretrained generator (the VQGAN decoder). There are two main caveats with this approach: how to map CLIP embeddings to VQGAN tokens, and where to find a big enough text-image dataset to learn this mapping?&lt;/p&gt;

&lt;p&gt;Luckily, the first can be easily solved with a transformer model, well known for its sequence-modeling ability, and the second with CLIP’s shared latent space. This means the entire pipeline is trained as an autoencoder, using the image encoder from CLIP and teaching the transformer to map the image embeddings to VQGAN tokens. During inference, the input to CLIP is switched to text, yet from the perspective of the latent transformer nothing changes, since the text embeddings share the same latent space as the image embeddings. What an elegant and clever idea, very impressive.&lt;/p&gt;
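
&lt;p&gt;A minimal sketch of that inference path (module names are mine, not from the paper):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def clip_gen_inference(text, clip, latent_transformer, vqgan):
    # The transformer was trained on CLIP image embeddings, but the
    # shared latent space lets us feed a text embedding at test time.
    z = clip.encode_text(text)               # shared CLIP latent space
    tokens = latent_transformer.generate(z)  # autoregressive VQGAN codes
    return vqgan.decode(tokens)              # single feed-forward decode
&lt;/code&gt;&lt;/pre&gt;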

&lt;hr /&gt;

&lt;h5 id=&quot;-main-ideas&quot;&gt;🔍 Main Ideas:&lt;/h5&gt;

&lt;p&gt;This is a bit strange, because the “Approach” section of the paper basically summarizes the CLIP and VQGAN papers, both of which I have already covered, hence I will leave just a single point here explaining the novelty and ask you to refresh the ideas from the “Prerequisites” section above.&lt;/p&gt;

&lt;p&gt;The authors for some reason decided to train the VQGAN from scratch, which they refer to as “the first stage”, whereas the conditional transformer and the pretrained CLIP model are introduced in “the second stage”. The conditional transformer learns from the sum of the maximum likelihood loss on the predicted VQGAN tokens and the embedding reconstruction loss. The second loss is somewhat unclear from the text, but from what I understand it is some sort of a log-distance between the CLIP embedding of the input image and the CLIP embedding of the generated image. Sorry, I honestly have no idea what “log s of (thing1, thing2)” could mean, or why they didn’t just use an MSE loss like everybody else.&lt;/p&gt;

&lt;h5 id=&quot;-experiment-insights--key-takeaways&quot;&gt;📈 Experiment insights / Key takeaways:&lt;/h5&gt;

&lt;ul&gt;
  &lt;li&gt;Baselines: DF-GAN, DM-GAN, AttnGAN, CogView, VQGAN-CLIP, BigGAN-CLIP&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Datasets: MS-COCO, ImageNet&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;Not sure what is different between FID-0, FID-1, and FID-2, but CLIP-GEN beats all other baselines in terms of FID-0 and FID-1 on MS-COCO, and in terms of FID on ImageNet&lt;/li&gt;
  &lt;li&gt;CLIP-GEN captures semantic concepts from text but fails to understand numeric concepts&lt;/li&gt;
  &lt;li&gt;CLIP-GEN can generalize to various drawing styles without any augmentations during training&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h5 id=&quot;️-paper-poster&quot;&gt;🖼️ Paper Poster:&lt;/h5&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/clipgen.jpg&quot; alt=&quot;CLIP-GEN poster&quot; title=&quot;CLIP-GEN Poster&quot; /&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;h5 id=&quot;-possible-improvements&quot;&gt;🛠 Possible Improvements:&lt;/h5&gt;

&lt;ul&gt;
  &lt;li&gt;Nothing from the authors, unfortunately.&lt;/li&gt;
&lt;/ul&gt;

&lt;h5 id=&quot;️my-notes&quot;&gt;✏️My Notes:&lt;/h5&gt;

&lt;ul&gt;
  &lt;li&gt;(Naming: 3/5) - Not bad, but not memorable either.&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;(Paper UX: 3.5/5) - The figures are mostly well designed, even if the two architecture diagrams are redundant. I could have used a stronger pitch in the abstract and introduction. The setup lacks punch and reads more like a “we did this fun thing, and we are surprised it even works at all, but check it out anyway!”. I do appreciate the clear contribution statement, however. The required reading is summarized concisely; while not comprehensive, it is enough to remind oneself what the VQGAN and CLIP papers are about. The amount of different variables, on the other hand, is overwhelming. I have to hand it to the authors for trying to unify the naming across various papers, but there is a lowercase “e” that stands for the embeddings of CLIP and an uppercase “E” that represents the encoder in VQGAN. You gotta think to keep track of everything, and that breaks the first rule of good UX - don’t make people think. Tables are well formatted and easy to follow. Overall, the paper reads smoothly without confusing wording or strange word choices.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;What an awesome idea to use CLIP instead of a paired dataset; I think using pretrained cross-modal models for tasks where paired data is hard to come by could become a big trend.&lt;/li&gt;
  &lt;li&gt;As a fan of deep learning, I am always happy to see these “fun” ideas that make you go “what, how has nobody tried this before”.&lt;/li&gt;
  &lt;li&gt;I wish the authors included some more fun and artistic samples in addition to the photorealistic images.&lt;/li&gt;
&lt;/ul&gt;

&lt;h5 id=&quot;-links&quot;&gt;🔗 Links:&lt;/h5&gt;

&lt;p&gt;&lt;a href=&quot;https://arxiv.org/pdf/2203.00386v1.pdf&quot;&gt;CLIP-GEN arxiv&lt;/a&gt; / CLIP-GEN GitHub (Not released at the time of writing)&lt;/p&gt;

&lt;h5 id=&quot;-if-you-enjoy-my-posts-consider-donating-eth-to-support-casual-gan-papers&quot;&gt;💸 If you enjoy my posts, consider donating ETH to support Casual GAN Papers:&lt;/h5&gt;
&lt;p&gt;0x7c2a650437f58664BED719D680F2BE120bE5623b&lt;/p&gt;

&lt;hr /&gt;

&lt;h5 id=&quot;-read-more-popular-ai-paper-summaries&quot;&gt;🔥 Read More Popular AI Paper Summaries:&lt;/h5&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;/improved-vqgan-inpainting-outpainting-conditional-editing/MaskGIT-explained.html&quot;&gt;Improved VQGAN - MaskGIT explained&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/motion-interpolation-image-animation-implicit-model/FILM-explained.html&quot;&gt;How to generate videos from static images - FILM explained&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/self-supervised-image-animation-image-driving/First-Order-Motion-Model-explained.html&quot;&gt;Best Facial Animation Tool - First Order Motion Model&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h5 id=&quot;-thanks-for-reading&quot;&gt;👋 Thanks for reading!&lt;/h5&gt;
&lt;p&gt;&lt;a href=&quot;https://www.patreon.com/bePatron?u=53448948&quot; data-patreon-widget-type=&quot;become-patron-button&quot;&gt;Join Patreon for Exclusive Perks!&lt;/a&gt;&lt;script async=&quot;&quot; src=&quot;https://c6.patreon.com/becomePatronButton.bundle.js&quot;&gt;&lt;/script&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you found this paper digest useful, &lt;strong&gt;subscribe&lt;/strong&gt; and &lt;strong&gt;share&lt;/strong&gt; the post with your friends and colleagues to support Casual GAN Papers!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href=&quot;https://t.me/joinchat/KeutnzlvetRkZGZi&quot;&gt;Join the Casual GAN Papers telegram channel&lt;/a&gt; to stay up to date with new AI Papers!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href=&quot;https://t.me/casual_gans_chat&quot;&gt;Discuss&lt;/a&gt; the paper&lt;/em&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;By: &lt;a href=&quot;https://t.me/joinchat/KeutnzlvetRkZGZi&quot;&gt;@casual_gan&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;P.S. Send me paper suggestions for future posts
&lt;a href=&quot;mailto:kdemochkin@gmail.com&quot;&gt;@KirillDemochkin&lt;/a&gt;!&lt;/p&gt;
</description>
        <pubDate>Wed, 09 Mar 2022 16:00:00 +0000</pubDate>
        <link>https://casualganpapers.com/fast-vqgan-clip-text-to-image-generation/CLIP-GEN-explained.html</link>
        <guid isPermaLink="true">https://casualganpapers.com/fast-vqgan-clip-text-to-image-generation/CLIP-GEN-explained.html</guid>
        
        
        <category>fast-VQGAN-CLIP-text-to-image-generation</category>
        
      </item>
    
      <item>
        <title>82: FILM</title>
        <description>&lt;h4 id=&quot;film-frame-interpolation-for-large-motion-by-fitsum-reda-et-al-et-al-explained-in-5-minutes&quot;&gt;FILM: Frame Interpolation for Large Motion by Fitsum Reda et al. et al. explained in 5 minutes&lt;/h4&gt;

&lt;h5 id=&quot;️paper-difficulty-&quot;&gt;⭐️Paper difficulty: 🌕🌕🌕🌑🌑&lt;/h5&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/film_preview.gif&quot; alt=&quot;FILM Model&quot; title=&quot;FILM Teaser&quot; /&gt;&lt;/p&gt;

&lt;h5 id=&quot;-at-a-glance&quot;&gt;🎯 At a glance:&lt;/h5&gt;

&lt;p&gt;Motion interpolation between two images surely sounds like an exciting task to solve, not least because it has many real-world applications, from framerate upsampling in TVs and gaming to image animations derived from near-duplicate images in the user’s gallery. Specifically, Fitsum Reda and the team at Google Research propose a model that can handle large scene motion, which is a common point of failure for existing methods. Additionally, existing methods often rely on multiple networks for depth or motion estimation, for which the training data is often hard to come by. FILM, on the contrary, learns directly from frames with a single multi-scale model. Last but not least, the results produced by FILM look quite a bit sharper and more visually appealing than those of existing alternatives.&lt;/p&gt;

&lt;p&gt;As for the details, let’s dive in, shall we?&lt;/p&gt;

&lt;h5 id=&quot;️-prerequisites&quot;&gt;⌛️ Prerequisites:&lt;/h5&gt;

&lt;p&gt;&lt;strong&gt;(Highly recommended reading to understand the core contributions of this paper):&lt;/strong&gt;&lt;br /&gt;
Relax!&lt;/p&gt;

&lt;hr /&gt;

&lt;h5 id=&quot;-motivation&quot;&gt;🚀 Motivation:&lt;/h5&gt;

&lt;p&gt;The authors identify three significant vectors of improvement over existing baselines: using a single unified model that learns in an end-to-end fashion, handling large motion between frames, and improving image quality for the interpolated frames. More specifically, FILM does not rely on any additional priors, such as depth estimation for occlusion detection or motion estimation from an external model. Hence, FILM consists of just a single module that achieves state-of-the-art results. Moreover, existing models struggle to generalize to novel data when the range of motion differs from the training data, so the authors use a multi-scale approach to learn both large and small motion. Finally, instead of using the more obvious adversarial (GAN) loss, the team picked the Gram matrix loss, which is an evolution of everyone’s favorite perceptual loss.&lt;/p&gt;

&lt;hr /&gt;

&lt;h5 id=&quot;-main-ideas&quot;&gt;🔍 Main Ideas:&lt;/h5&gt;

&lt;p&gt;The model consists of three distinct sequential modules.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1) The Feature Extractor:&lt;/strong&gt;&lt;br /&gt;
is a convolutional encoder that processes a 7-scale input image pyramid. The crux of the method is in the way the weights are shared between scales, and how the extracted features are stacked across scales in a grid-like manner. The same encoder is applied to each scale, and the resulting features are concatenated with the other feature maps of the same spatial size from all scales in a clever twist on the sliding window idea. Interestingly, the resulting stacks have a different number of feature maps in them. For example, the first stack will have just the output of the first layer of the encoder applied to the largest scale image; the second stack will have the outputs of the second layer of the encoder applied to the largest scale and the feature map from the first layer of the encoder applied to the second-largest scale image, and so on. This part is applied to both input frames to obtain two feature pyramids.&lt;/p&gt;
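
&lt;p&gt;Here is how I picture the grid-like stacking, assuming an encoder that returns one feature map per stage and halves the resolution at every stage (a sketch of the idea, not the official FILM code):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch
import torch.nn.functional as F

def extract_pyramid_features(frame, encoder, num_scales=7):
    # Build the input image pyramid.
    pyramid = [frame]
    for _ in range(num_scales - 1):
        pyramid.append(F.avg_pool2d(pyramid[-1], kernel_size=2))
    # Run the SAME shared-weight encoder on every scale.
    per_scale = [encoder(img) for img in pyramid]  # per_scale[s][d]
    depth = len(per_scale[0])
    # Stage d at scale s has the same resolution as stage d-1 at scale s+1,
    # so feature maps are grouped along the diagonal index s + d.
    stacks = []
    for level in range(num_scales + depth - 1):
        maps = [per_scale[s][level - s] for s in range(num_scales)
                if (level - s) in range(depth)]
        stacks.append(torch.cat(maps, dim=1))
    return stacks  # stacks[0] holds a single map, later ones hold more
&lt;/code&gt;&lt;/pre&gt;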

&lt;p&gt;&lt;strong&gt;2) The Flow Estimator (this part is awesome but slightly tricky, so pay attention!):&lt;/strong&gt;&lt;br /&gt;
the driving idea of this module is that in order for large motion to work in frame interpolation, the motion flow should be similar across scales. How can we ensure this? The authors take a recursive approach by computing the flow at each pyramid level as the sum of the upscaled flow from the next smaller (coarser) level and a predicted residual vector. This residual is computed with a small convolutional model that takes in the input feature map for one frame and the feature map extracted from the second frame, backward-warped using the same upsampled flow estimate from the next smaller scale. (And no, this is not a leaked synopsis of the Tenet sequel.) Note that this process is repeated for the other input frame as well.&lt;/p&gt;
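
&lt;p&gt;A sketch of the recursion under my reading of the paper, built on PyTorch’s standard grid_sample backward warp; &lt;code&gt;residual_nets&lt;/code&gt; stands in for the small per-level convolutional models:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch
import torch.nn.functional as F

def backward_warp(feat, flow):
    # Sample feat at positions shifted by flow (B, 2, H, W) via grid_sample.
    b, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=feat.device),
                            torch.arange(w, device=feat.device), indexing='ij')
    base = torch.stack((xs, ys)).float()           # (2, H, W), x channel first
    coords = base.unsqueeze(0) + flow              # absolute sampling positions
    gx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0  # normalize to [-1, 1]
    gy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)           # (B, H, W, 2)
    return F.grid_sample(feat, grid, align_corners=True)

def estimate_flow(feats_a, feats_b, residual_nets):
    # flow at a level = upsampled coarser flow + a predicted residual.
    flow, flows = None, []
    for level in reversed(range(len(feats_a))):    # start at the coarsest level
        if flow is None:
            up = torch.zeros_like(feats_a[level][:, :2])
        else:
            up = 2.0 * F.interpolate(flow, scale_factor=2,
                                     mode='bilinear', align_corners=False)
        warped_b = backward_warp(feats_b[level], up)
        flow = up + residual_nets[level](torch.cat([feats_a[level], warped_b], dim=1))
        flows.append(flow)
    return flows[::-1]                             # finest level first
&lt;/code&gt;&lt;/pre&gt;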

&lt;p&gt;&lt;strong&gt;3) The Fused Decoder:&lt;/strong&gt;&lt;br /&gt;
finally, a UNet-like decoder is used to upsample and concatenate warped features of matching spatial dimensions from both frames and predict the intermediate frame. Note that this process can be applied recursively to insert more frames between the input and generated frames.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4) Loss Functions:&lt;/strong&gt;&lt;br /&gt;
One, two, three, no GANs to see. Anyway, the authors use a simple combination of L1, perceptual, and Gram matrix losses, with the latter computed as the L2 norm of the autocorrelation matrix of VGG-19 features from various layers. Think of it as the more snobbish perceptual loss. Apparently, this loss helps significantly with inpainting disoccluded areas of the image.&lt;/p&gt;
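
&lt;p&gt;For reference, a generic Gram matrix loss over a list of VGG-19 feature maps might look like this (the paper’s exact layer selection and weighting are not reproduced here):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch

def gram_loss(feats_pred, feats_true):
    # feats_*: lists of VGG-19 feature maps, each of shape (B, C, H, W).
    def gram(f):
        b, c, h, w = f.shape
        f = f.flatten(2)                            # (B, C, H*W)
        return f @ f.transpose(1, 2) / (c * h * w)  # autocorrelation matrix
    return sum(torch.mean((gram(p) - gram(t)) ** 2)
               for p, t in zip(feats_pred, feats_true))
&lt;/code&gt;&lt;/pre&gt;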

&lt;h5 id=&quot;-experiment-insights--key-takeaways&quot;&gt;📈 Experiment insights / Key takeaways:&lt;/h5&gt;

&lt;ul&gt;
  &lt;li&gt;Baselines: DAIN, AdaCoF, BMBC, SoftSplat, ABME&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Datasets: Vimeo-90k, UCF101, Middlebury, Xiph&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;Kind of a cheat here, the reported results say that quantitatively ABME outperforms FILM by a small margin (which is still an impressive result, considering all of the additional data and external modules the other models have), but! FILM is better when trained with perceptual quality in mind.&lt;/li&gt;
  &lt;li&gt;FILM puts all other models to shame (qualitatively) on the large motion dataset Xiph&lt;/li&gt;
  &lt;li&gt;FILM produces sharper images, without ghosting, and with better-preserved structure&lt;/li&gt;
  &lt;li&gt;Ablations look good&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h5 id=&quot;️-paper-poster&quot;&gt;🖼️ Paper Poster:&lt;/h5&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/film.jpg&quot; alt=&quot;FILM poster&quot; title=&quot;FILM Poster&quot; /&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;h5 id=&quot;-possible-improvements&quot;&gt;🛠 Possible Improvements:&lt;/h5&gt;

&lt;ul&gt;
  &lt;li&gt;Reduce the number of artifacts in images (Lame)&lt;/li&gt;
&lt;/ul&gt;

&lt;h5 id=&quot;️my-notes&quot;&gt;✏️My Notes:&lt;/h5&gt;

&lt;ul&gt;
  &lt;li&gt;(Naming: 5/5) An excellent example of a model/paper name: catchy, easy to pronounce, thematically relevant, it has it all! The name is honestly on the level of CLIP in terms of catchiness.&lt;/li&gt;
  &lt;li&gt;(Paper UX: 4/5) The math is surprisingly easy to follow, the big diagram with the model architecture is a great visual aid, and I especially love the color-coding to show which parts of the model share weights. Additionally, the diagram has this almost skeuomorphic feel to it, very stylish! The samples all have a zoom-in right next to the big image, which is important, since it is not always obvious what part of the image to pay attention to when studying the qualitative results. One improvement I would suggest is to give a “zoomed-in” view of the flow estimation module, because it was not immediately clear to me how the math for backward flow estimation connected with the big diagram.&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;All of the figure and table descriptions are self-sufficient, which is essential for the first glance-through read of the paper. That is it for the visual aspect of the paper; as for the text, I appreciate the clearly defined contributions at the end of section one, and how section two elaborates on each of the main contributions. This is vital to give the reader the necessary context to understand the goal of the paper and how each architecture design decision helps to achieve this goal. Now for some criticism on the text: the future work section is total BS (which is a shame, as it is one of the best ways to get traction for your paper, IMO). Moreover, I think the ablations section is pretty lazy. Yes, the authors check every part of their pipeline, but what is the point of taking up the valuable page space if all you are going to say is that each part of the model significantly contributes to the final result? Obviously, the parts that didn’t contribute got cut out of the final model. At least provide some conjectures as to why the contribution is significant, or just leave a table with the results and summarize them in a single paragraph.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;I wonder how this would work with a GANsformer: it would generate the layouts for frames, and then FILM would be used to animate them&lt;/li&gt;
  &lt;li&gt;Honestly, I could not think of a way to incorporate CLIP into FILM, write your suggestions in the comments!&lt;/li&gt;
  &lt;li&gt;I wonder if it is somehow possible to use two frames from different videos so that the interpolated frame is the first frame, driven by the second frame.&lt;/li&gt;
&lt;/ul&gt;

&lt;h5 id=&quot;-links&quot;&gt;🔗 Links:&lt;/h5&gt;

&lt;p&gt;&lt;a href=&quot;https://arxiv.org/pdf/2202.04901.pdf&quot;&gt;FILM arxiv&lt;/a&gt; / &lt;a href=&quot;https://github.com/google-research/frame-interpolation&quot;&gt;FILM Github&lt;/a&gt;&lt;/p&gt;

&lt;h5 id=&quot;-if-you-enjoy-my-posts-consider-donating-eth-here-to-support-the-website&quot;&gt;💸 If you enjoy my posts, consider donating ETH here to support the website:&lt;/h5&gt;
&lt;p&gt;0x7c2a650437f58664BED719D680F2BE120bE5623b&lt;/p&gt;

&lt;hr /&gt;

&lt;h5 id=&quot;-check-out-these-popular-ai-paper-summaries&quot;&gt;🔥 Check Out These Popular AI Paper Summaries:&lt;/h5&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;/improved-vqgan-inpainting-outpainting-conditional-editing/MaskGIT-explained.html&quot;&gt;Improved VQGAN - MaskGIT explained&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/alias-free-gan-stylegan3-inversion-video-editing/Third-Time-Is-The-Charm-explained.html&quot;&gt;Is StyleGAN Good? - Third Time is the Charm explained&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/self-supervised-image-animation-image-driving/First-Order-Motion-Model-explained.html&quot;&gt;Best Facial Animation Tool - First Order Motion Model&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h5 id=&quot;-thanks-for-reading&quot;&gt;👋 Thanks for reading!&lt;/h5&gt;
&lt;p&gt;&lt;a href=&quot;https://www.patreon.com/bePatron?u=53448948&quot; data-patreon-widget-type=&quot;become-patron-button&quot;&gt;Join Patreon for Exclusive Perks!&lt;/a&gt;&lt;script async=&quot;&quot; src=&quot;https://c6.patreon.com/becomePatronButton.bundle.js&quot;&gt;&lt;/script&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you found this paper digest useful, &lt;strong&gt;subscribe&lt;/strong&gt; and &lt;strong&gt;share&lt;/strong&gt; the post with your friends and colleagues to support Casual GAN Papers!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href=&quot;https://t.me/joinchat/KeutnzlvetRkZGZi&quot;&gt;Join the Casual GAN Papers telegram channel&lt;/a&gt; to stay up to date with new AI Papers!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href=&quot;https://t.me/casual_gans_chat&quot;&gt;Discuss&lt;/a&gt; the paper&lt;/em&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;By: &lt;a href=&quot;https://t.me/joinchat/KeutnzlvetRkZGZi&quot;&gt;@casual_gan&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;P.S. Send me paper suggestions for future posts
&lt;a href=&quot;mailto:kdemochkin@gmail.com&quot;&gt;@KirillDemochkin&lt;/a&gt;!&lt;/p&gt;
</description>
        <pubDate>Fri, 18 Feb 2022 16:00:00 +0000</pubDate>
        <link>https://casualganpapers.com/motion-interpolation-image-animation-implicit-model/FILM-explained.html</link>
        <guid isPermaLink="true">https://casualganpapers.com/motion-interpolation-image-animation-implicit-model/FILM-explained.html</guid>
        
        
        <category>motion-interpolation-image-animation-implicit-model</category>
        
      </item>
    
      <item>
        <title>81: MaskGIT</title>
        <description>&lt;h4 id=&quot;maskgit-masked-generative-image-transformer-by-huiwen-chang-et-al-explained-in-5-minutes&quot;&gt;MaskGIT: Masked Generative Image Transformer by Huiwen Chang et al. explained in 5 minutes&lt;/h4&gt;

&lt;h5 id=&quot;️paper-difficulty-&quot;&gt;⭐️Paper difficulty: 🌕🌕🌕🌑🌑&lt;/h5&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/maskgit_preview.png&quot; alt=&quot;MaskGIT Model&quot; title=&quot;MaskGIT Teaser&quot; /&gt;&lt;/p&gt;

&lt;h5 id=&quot;-at-a-glance&quot;&gt;🎯 At a glance:&lt;/h5&gt;

&lt;p&gt;This is one of those papers with an idea that is so simple yet powerful that it really makes you wonder how nobody has tried it yet! What I am talking about is of course changing the strange and completely unintuitive way that image transformers handle the token sequence to one that makes much more sense logically. First introduced in ViT for token processing, and later used for generation in VQGAN (the second part of that training pipeline: the transformer prior that generates the latent code sequence from the codebook for the decoder to synthesize an image from), the left-to-right, line-by-line order just worked and sort of became the norm.&lt;/p&gt;

&lt;p&gt;The authors of MaskGIT say that generating two-dimensional images in this way makes little to no sense, and I could not agree more with them. What they propose instead is to start with a sequence of MASK tokens and process the entire sequence with a bidirectional transformer by iteratively predicting which MASK tokens should be replaced with which latent vector from the pretrained codebook. The proposed approach greatly speeds up inference and improves performance on various image editing tasks.&lt;/p&gt;

&lt;p&gt;As for the details, let’s dive in, shall we?&lt;/p&gt;

&lt;h5 id=&quot;️-prerequisites&quot;&gt;⌛️ Prerequisites:&lt;/h5&gt;

&lt;p&gt;&lt;strong&gt;(Highly recommended reading to understand the core contributions of this paper):&lt;/strong&gt;&lt;br /&gt;
1) &lt;a href=&quot;/vqvae-discrete-vision-transformer/VQGAN.html&quot;&gt;VQGAN&lt;/a&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;h5 id=&quot;-motivation&quot;&gt;🚀 Motivation:&lt;/h5&gt;

&lt;p&gt;The motivation here concerns the second part of the VQGAN, i.e. the token sampling prior. If you recall, VQGAN first learns an encoder-decoder model along with a codebook of latent vectors by reconstructing input images, and after the codebook is frozen a transformer is trained to predict sequences of tokens one-by-one, left-to-right, line-by-line, that are passed to the decoder to form the final image. Crudely, this is equivalent to predicting the next patch of the image from all of the previous patches.&lt;/p&gt;

&lt;p&gt;Does this approach make much sense? The authors certainly do not think so (and I concur). Why should the patches be generated sequentially in this manner, when it makes much more sense to start with a rough sketch and fill in the details iteratively (StyleGAN says “Hi!”)? Apparently, there are no prominent papers that explore the upgrade to a bidirectional transformer, despite the apparent advantages such as the ability to regenerate parts of an image, perform inpainting and outpainting, and do class-conditioned editing.&lt;/p&gt;

&lt;hr /&gt;

&lt;h5 id=&quot;-main-ideas&quot;&gt;🔍 Main Ideas:&lt;/h5&gt;

&lt;p&gt;&lt;strong&gt;1) MVTM in Training:&lt;/strong&gt;&lt;br /&gt;
During each training iteration the input images are passed through the pretrained VQGAN encoder to obtain a set of tokens that are treated as the ground truth. Some of the tokens are left intact, while a portion of them is replaced with a MASK token. The probability that a token is replaced changes over time, as discussed in Masking Design below. The entire sequence of tokens is processed by a bidirectional transformer that learns to predict the probabilities for each masked token with a simple negative log-likelihood (multiclass cross-entropy) loss.&lt;/p&gt;
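
&lt;p&gt;A single MVTM training step could look roughly like this; &lt;code&gt;tokenizer&lt;/code&gt; and &lt;code&gt;transformer&lt;/code&gt; are stand-ins for the paper’s components, and the schedule sampling is simplified:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import math
import torch
import torch.nn.functional as F

def mvtm_train_step(images, tokenizer, transformer, mask_id, step_ratio):
    # step_ratio in [0, 1] is drawn uniformly per batch and fed through
    # the cosine schedule (see Masking Design below).
    with torch.no_grad():
        gt_tokens = tokenizer.encode(images)       # (B, L) codebook indices
    mask_ratio = math.cos(math.pi * step_ratio / 2)
    mask = torch.rand_like(gt_tokens.float()).lt(mask_ratio)  # True = masked
    inputs = gt_tokens.masked_fill(mask, mask_id)
    logits = transformer(inputs)                   # (B, L, V)
    # Cross-entropy only on the masked positions.
    return F.cross_entropy(logits[mask], gt_tokens[mask])
&lt;/code&gt;&lt;/pre&gt;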

&lt;p&gt;&lt;strong&gt;2) Iterative Decoding:&lt;/strong&gt;&lt;br /&gt;
Unlike the regular transformer from VQGAN that uses information from previously generated tokens and selects the next tokens one-by-one, the bidirectional transformer used in MaskGIT is able to generate the entire image in a single forward pass. However, this does not really work, since the model was trained to fill in a portion of the tokens, not generate all of them at once. Hence, a progressive sampling scheme is employed: the model starts with all MASK tokens, generates the probability distribution, and samples a value for each of them. Then, only the ones above a certain confidence threshold are kept; the rest are replaced with the MASK token, and the process repeats until all of the tokens are kept. The authors find that just 7-8 iterations are enough to produce high-quality results. After the first iteration only a single token is kept, and for the rest of the iterations the required confidence level is decreased until all of the tokens are filled in.&lt;/p&gt;
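
&lt;p&gt;My sketch of one way the progressive sampling loop could be implemented, with the cosine schedule from the next section inlined:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import math
import torch

@torch.no_grad()
def iterative_decode(transformer, mask_id, seq_len=256, steps=8):
    # Start from all MASK tokens, commit the most confident samples each
    # round, and re-mask the rest according to the schedule.
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long)
    for t in range(1, steps + 1):
        probs = transformer(tokens).softmax(dim=-1)              # (1, L, V)
        sampled = torch.multinomial(probs[0], 1).squeeze(-1).unsqueeze(0)
        conf = probs.gather(-1, sampled.unsqueeze(-1)).squeeze(-1)
        masked = tokens.eq(mask_id)
        sampled = torch.where(masked, sampled, tokens)           # keep committed tokens
        conf = torch.where(masked, conf, torch.full_like(conf, float('inf')))
        n_mask = int(seq_len * math.cos(math.pi * t / (2 * steps)))
        if n_mask == 0:
            return sampled                                       # schedule hit zero
        remask = conf.topk(n_mask, largest=False).indices        # least confident
        tokens = sampled.clone()
        tokens[0, remask[0]] = mask_id
    return tokens
&lt;/code&gt;&lt;/pre&gt;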

&lt;p&gt;&lt;strong&gt;3) Masking Design:&lt;/strong&gt;&lt;br /&gt;
Unlike BERT’s constant 15% masking, the authors propose to vary the masking ratio during training according to a schedule. Multiple approaches to mask schedule design are considered (linear, concave, and convex functions), but the important thing is that the authors settled on the cosine function, which worked best in all of their experiments.&lt;/p&gt;
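
&lt;p&gt;The schedule itself is a one-liner per variant; in this hedged sketch the concrete concave and convex alternatives are illustrative, only the cosine matches what the authors settled on:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import math

def mask_schedule(r, kind='cosine'):
    # Fraction of tokens still masked at progress r in [0, 1].
    if kind == 'linear':
        return 1.0 - r
    if kind == 'cosine':                # the winner in the paper's experiments
        return math.cos(math.pi * r / 2)
    if kind == 'square':                # one possible alternative shape
        return 1.0 - r ** 2
    raise ValueError(kind)
&lt;/code&gt;&lt;/pre&gt;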

&lt;h5 id=&quot;-experiment-insights--key-takeaways&quot;&gt;📈 Experiment insights / Key takeaways:&lt;/h5&gt;

&lt;ul&gt;
  &lt;li&gt;Datasets: ImageNet 256, 512; Places2&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Baselines: BigGAN, VQGAN, DCTransformer, VQVAE-2, improved DDPM, ADM&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;MaskGIT significantly outperforms both VQGAN and BigGAN on ImageNet at 256 and 512 resolutions. New SOTA FID for 512x512 - 7.32&lt;/li&gt;
  &lt;li&gt;MaskGIT accelerates VQGAN by 30-64x&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;MaskGIT establishes a new SOTA Classification Accuracy Score for ImageNet generation&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;MaskGIT achieves impressive results for class-conditioned editing (resampling parts of an image with a different class), inpainting, and outpainting, with all of these tasks being trivial to set up with the masking-token approach.&lt;/li&gt;
  &lt;li&gt;From ablations: Doing too many iterations at inference leads to less diversity in the generated images, because the model is discouraged from keeping less confident tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h5 id=&quot;️-paper-poster&quot;&gt;🖼️ Paper Poster:&lt;/h5&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/maskgit.jpg&quot; alt=&quot;MaskGIT poster&quot; title=&quot;MaskGIT Poster&quot; /&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;h5 id=&quot;-possible-improvements&quot;&gt;🛠 Possible Improvements:&lt;/h5&gt;

&lt;ul&gt;
  &lt;li&gt;When outpainting, the model has trouble coordinating distant parts of the image, which causes strange semantic and color shifts between different edges of the image.&lt;/li&gt;
  &lt;li&gt;There are cases of oversmoothing and various artifacts that can be fixed&lt;/li&gt;
  &lt;li&gt;Further design on mask design&lt;/li&gt;
&lt;/ul&gt;

&lt;h5 id=&quot;️my-notes&quot;&gt;✏️My Notes:&lt;/h5&gt;

&lt;ul&gt;
  &lt;li&gt;(Naming: 4.5/5) A solid shorthand way to refer to the model and a “mask it” pun&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;(5/5) for the Paper UX, the figures deserve a separate shoutout! (Yay a new rubric! 🎉)&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;Love these simple-yet-elegant papers that don’t require me to connect to the cosmos to parse through a mountain of complex formulas&lt;/li&gt;
  &lt;li&gt;After reading this paper I felt as if I finally scratched an itch that was bothering me for a while&lt;/li&gt;
  &lt;li&gt;No CLIP experiments?! I feel slightly offended by that XD&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;I like how this idea echoes diffusion models with the progressive sampling scheme that gradually removes the noise (MASK tokens) from the image&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;Well, we started with transformers in CV last year, now we got a sort of a vision-BERT, what NLP paper will migrate to CV next? GPTImage, anyone? Not sure how it would work, but it does sound cool&lt;/li&gt;
  &lt;li&gt;What do you think about MaskGIT? Let’s discuss in the comments!&lt;/li&gt;
&lt;/ul&gt;

&lt;h5 id=&quot;-links&quot;&gt;🔗 Links:&lt;/h5&gt;

&lt;p&gt;&lt;a href=&quot;https://arxiv.org/pdf/2202.04200.pdf&quot;&gt;MaskGIT arxiv&lt;/a&gt; / &lt;a href=&quot;https://github.com/dome272/MaskGIT-pytorch&quot;&gt;MaskGIT Github (Unofficial)&lt;/a&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;h5 id=&quot;-check-out-these-popular-ai-paper-summaries&quot;&gt;🔥 Check Out These Popular AI Paper Summaries:&lt;/h5&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;/fastest_nerf_3d_neural_rendering/Instant-Neural-Graphics-Primitives-explained.html&quot;&gt;Train a NeRF in Seconds - Instant Neural Graphics explained&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/alias-free-gan-stylegan3-inversion-video-editing/Third-Time-Is-The-Charm-explained.html&quot;&gt;Is StyleGAN Good? - Third Time is the Charm explained&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/self-supervised-image-animation-image-driving/First-Order-Motion-Model-explained.html&quot;&gt;Best Facial Animation Tool - First Order Motion Model&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h5 id=&quot;-thanks-for-reading&quot;&gt;👋 Thanks for reading!&lt;/h5&gt;
&lt;p&gt;&lt;a href=&quot;https://www.patreon.com/bePatron?u=53448948&quot; data-patreon-widget-type=&quot;become-patron-button&quot;&gt;Join Patreon for Exclusive Perks!&lt;/a&gt;&lt;script async=&quot;&quot; src=&quot;https://c6.patreon.com/becomePatronButton.bundle.js&quot;&gt;&lt;/script&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you found this paper digest useful, &lt;strong&gt;subscribe&lt;/strong&gt; and &lt;strong&gt;share&lt;/strong&gt; the post with your friends and colleagues to support Casual GAN Papers!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href=&quot;https://t.me/joinchat/KeutnzlvetRkZGZi&quot;&gt;Join the Casual GAN Papers telegram channel&lt;/a&gt; to stay up to date with new AI Papers!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href=&quot;https://t.me/casual_gans_chat&quot;&gt;Discuss&lt;/a&gt; the paper&lt;/em&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;By: &lt;a href=&quot;https://t.me/joinchat/KeutnzlvetRkZGZi&quot;&gt;@casual_gan&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;P.S. Send me paper suggestions for future posts
&lt;a href=&quot;mailto:kdemochkin@gmail.com&quot;&gt;@KirillDemochkin&lt;/a&gt;!&lt;/p&gt;
</description>
        <pubDate>Fri, 18 Feb 2022 16:00:00 +0000</pubDate>
        <link>https://casualganpapers.com/improved-vqgan-inpainting-outpainting-conditional-editing/MaskGIT-explained.html</link>
        <guid isPermaLink="true">https://casualganpapers.com/improved-vqgan-inpainting-outpainting-conditional-editing/MaskGIT-explained.html</guid>
        
        
        <category>improved-vqgan-inpainting-outpainting-conditional-editing</category>
        
      </item>
    
      <item>
        <title>80: FOMM</title>
        <description>&lt;h4 id=&quot;first-order-motion-model-for-image-animation-by-aliaksandr-siarohin-et-al-explained-in-5-minutes&quot;&gt;First Order Motion Model for Image Animation by Aliaksandr Siarohin et al. explained in 5 minutes&lt;/h4&gt;

&lt;h5 id=&quot;️paper-difficulty-&quot;&gt;⭐️Paper difficulty: 🌕🌕🌕🌕🌕&lt;/h5&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/fomm_teaser.gif&quot; alt=&quot;First Order Motion Model&quot; title=&quot;First Order Motion Model Teaser&quot; /&gt;&lt;/p&gt;

&lt;h5 id=&quot;-at-a-glance&quot;&gt;🎯 At a glance:&lt;/h5&gt;

&lt;p&gt;If you have ever used a face animation app, you have probably interacted with First Order Motion Model. Perhaps the reason that this method became ubiquitous is its ability to animate arbitrary objects. Aliaksandr Siarohin and the team from DISI, University of Trento, and Snap leverage a self-supervised approach to learn, from a set of videos, a specialized keypoint detector for a class of similar objects, which is then used to warp the source frame according to a motion field derived from a driving frame.&lt;/p&gt;

&lt;p&gt;From a bird’s-eye view, the pipeline works like this: first, a set of keypoints is predicted for each of the two frames, along with local affine transforms around the keypoints (this was the most confusing part for me; luckily, we will cover it in detail later in the post). The information from the two frames is combined to predict the motion field that tells where each pixel in the source frame should move to line up with the driving frame, along with an occlusion mask that shows the image areas that need to be inpainted.&lt;/p&gt;

&lt;p&gt;Let’s dive in, and learn, shall we?&lt;/p&gt;

&lt;h5 id=&quot;️-prerequisites&quot;&gt;⌛️ Prerequisites:&lt;/h5&gt;

&lt;p&gt;&lt;strong&gt;(Highly recommended reading to understand the core contributions of this paper):&lt;/strong&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;h5 id=&quot;-motivation&quot;&gt;🚀 Motivation:&lt;/h5&gt;

&lt;p&gt;Animating still images has many applications in various industries, yet most existing approaches are object-specific, meaning they require a specialized annotated dataset that might be difficult or costly to obtain. On the other end of the spectrum are object-agnostic methods that animate any type of object the model is trained on. The only data required by object-agnostic models is a set of videos of objects from that category, without any need for annotations. The main weakness of such existing approaches is their poor quality, since they lack the complexity to model the object deformations precisely.&lt;/p&gt;

&lt;p&gt;Another key contribution of the proposed approach is the explicit occlusion model that indicates which areas of the image can be warped, and what missing regions require inpainting.&lt;/p&gt;

&lt;hr /&gt;

&lt;h5 id=&quot;-main-ideas&quot;&gt;🔍 Main Ideas:&lt;/h5&gt;

&lt;p&gt;&lt;strong&gt;1) Overview:&lt;/strong&gt;&lt;br /&gt;
The method, composed of the motion estimation module and the image generation module, learns from pairs of frames extracted from the same video in a self-supervised manner by learning to “encode motion as a combination of motion-specific keypoint displacements and local affine transformations”. An important detail in the proposed approach is the assumption that some sort of a reference frame exists that both the source and the target frames are warped to. Even though the reference frame is never explicitly rendered, its existence allows the source and driver frames to be processed independently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2) Local Affine Transformations for Approximate Motion Description:&lt;/strong&gt;&lt;br /&gt;
Alright, if you have been following Casual GAN Papers for a while (if not, welcome!), you know that I don’t even attempt to explain the math in the papers I cover, hence we will just pretend that this entire setup works automagically.&lt;/p&gt;

&lt;p&gt;With that out of the way, the authors pitch this idea that there is some intermediary reference frame that helps connect the source and driving frames. In practice though, they simply apply the self-supervised keypoint detector (a U-Net) that predicts a normalized heatmap for each keypoint along with 4 parameters for an affine transform corresponding to that keypoint. Intuitively, the affine transform contains information on how moving the corresponding keypoint moves the pixels near it, while the heatmap shows which pixels are more or less affected by moving each keypoint. In other words, FOMM assumes that the animated object is made of a number of rigid parts that move together, e.g. the entire arm of an animated human moves as a single object, as does each of the legs, the torso, and the head. These local motions are combined with a convolutional network into a single motion field that maps where each pixel in the source image should move to line up with the pose in the driving frame.&lt;/p&gt;
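
&lt;p&gt;To make the intuition concrete, here is a deliberately simplified sketch of how K local motions might be blended into a single dense field; &lt;code&gt;dense_net&lt;/code&gt; is a hypothetical convolutional refiner standing in for FOMM’s dense motion network:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch

def combine_local_motions(heatmaps, local_flows, dense_net):
    # heatmaps: (B, K, H, W) per-keypoint maps; local_flows: (B, K, 2, H, W)
    # pixel displacements induced by each locally-affine keypoint motion.
    weights = heatmaps.softmax(dim=1)              # soft pixel-to-keypoint assignment
    coarse = (weights.unsqueeze(2) * local_flows).sum(dim=1)  # (B, 2, H, W)
    flow, occlusion_logits = dense_net(coarse, heatmaps)      # refine + mask
    return flow, occlusion_logits.sigmoid()        # occlusion mask in (0, 1)
&lt;/code&gt;&lt;/pre&gt;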

&lt;p&gt;&lt;strong&gt;3) Occlusion-aware Image Generation:&lt;/strong&gt;&lt;br /&gt;
Since the source image gets deformed to resemble the driving frame, some discontinuities are bound to happen, for example, if the face moves to the side slightly, the background behind it needs to be inpainted. FOMM handles these occlusions via an explicit occlusion mask that is predicted along with the dense motion field. The mask is used to block out the regions that the final generator needs to inpaint from scratch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4) Training Losses:&lt;/strong&gt;&lt;br /&gt;
Standard stuff (LPIPS) and an equivariance constraint, which simply means that the motion deformations should have the transitive property (the deformation X -&amp;gt; Y should be the same as the composition of transformations X -&amp;gt; R -&amp;gt; Y)&lt;/p&gt;
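
&lt;p&gt;The equivariance constraint fits in a few lines; &lt;code&gt;apply_to_image&lt;/code&gt; and &lt;code&gt;apply_to_points&lt;/code&gt; are hypothetical helpers for a known random deformation (a thin-plate spline in the paper):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch

def equivariance_loss(detector, frame, random_warp):
    # Keypoints detected on a randomly warped frame should match the same
    # warp applied to the keypoints of the original frame.
    kp = detector(frame)                           # (B, K, 2) coordinates
    kp_warped = detector(random_warp.apply_to_image(frame))
    return torch.abs(random_warp.apply_to_points(kp) - kp_warped).mean()
&lt;/code&gt;&lt;/pre&gt;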

&lt;h5 id=&quot;-experiment-insights--key-takeaways&quot;&gt;📈 Experiment insights / Key takeaways:&lt;/h5&gt;

&lt;ul&gt;
  &lt;li&gt;Datasets: VoxCeleb, UvA-Nemo, BAIR Robot, Tai-Chi-HD&lt;/li&gt;
  &lt;li&gt;The models are evaluated on the task of video reconstruction from the first frame and the sparse motion data as well as a user study for comparing the full animated videos&lt;/li&gt;
  &lt;li&gt;Reported Metrics: L1, Average Keypoint Distance (with a pretrained keypoint detector), Missing Keypoint Rate, and the average Euclidean distance between the embeddings of the target and source frames&lt;/li&gt;
  &lt;li&gt;Ablations: the addition of the pyramid loss and the occlusion mask boosts almost all reported metrics; without the equivariance constraint, the Jacobians become unstable, which leads to poor motion estimation&lt;/li&gt;
  &lt;li&gt;FOMM is miles ahead of X2Face and Monkey-Net on all considered datasets&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h5 id=&quot;️-paper-poster&quot;&gt;🖼️ Paper Poster:&lt;/h5&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/fomm.jpg&quot; alt=&quot;FOMM poster&quot; title=&quot;FOMM Poster&quot; /&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;h5 id=&quot;-possible-improvements&quot;&gt;🛠 Possible Improvements:&lt;/h5&gt;

&lt;ul&gt;
  &lt;li&gt;None mentioned in the paper, hence we will need to study the new paper by the authors, which seems to be focused on animating more complex articulated objects&lt;/li&gt;
  &lt;li&gt;I sorta-kinda feel like there is a self-attention-shaped hole somewhere in this method. Maybe it could be possible to swap the softmax heatmaps for the duplex attention from GANsformer to predict which pixels are affected by which keypoints. Although a point could be made that the locality assumption (that the motion of each keypoint only affects the pixels in the direct vicinity of the keypoint) is enough for adequate animation.&lt;/li&gt;
&lt;/ul&gt;

&lt;h5 id=&quot;️my-notes&quot;&gt;✏️My Notes:&lt;/h5&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;(3/5) So-so model name, FOMM is really awkward to pronounce and is not meme-able at all. Try harder!&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;Does a better image animation model exist? Because I could not find anything besides the new paper from the same team&lt;/li&gt;
  &lt;li&gt;Overall, the core intuition turned out to be very elegant, but I spent two days trying to dig through the math and confusing figures to figure it out. The paper UX could really use some work.&lt;/li&gt;
  &lt;li&gt;To be honest, the image depicting the pipeline did more harm than good in helping me understand the intuition when I read the paper. The rectangles representing the affine transforms confused the heck out of me.&lt;/li&gt;
  &lt;li&gt;It still blows my mind how good the results look even a couple of years after the paper came out&lt;/li&gt;
  &lt;li&gt;I was particularly surprised by how little FOMM cares about alignment between the target and source frames, it is curious that this self-supervised keypoint detector idea is not used in GANs that aim to decouple position from style (hello Alias-free GAN)&lt;/li&gt;
  &lt;li&gt;I am curious how well FOMM handles out-of-domain samples. Can it animate animal faces or other images that look like a face but aren’t a photorealistic depiction of one? It seems that the animated cartoons are somewhat of a cheat, as they simply use inversion and StyleGAN blending to transform the animated frames into their cartoon versions.&lt;/li&gt;
&lt;/ul&gt;

&lt;h5 id=&quot;-links&quot;&gt;🔗 Links:&lt;/h5&gt;

&lt;p&gt;&lt;a href=&quot;https://papers.nips.cc/paper/2019/file/31c0b36aef265d9221af80872ceb62f9-Paper.pdf&quot;&gt;First Order Motion Model arxiv&lt;/a&gt; / &lt;a href=&quot;https://aliaksandrsiarohin.github.io/first-order-model-website/&quot;&gt;First Order Motion Model Github&lt;/a&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;h5 id=&quot;-check-out-these-popular-ai-paper-summaries&quot;&gt;🔥 Check Out These Popular AI Paper Summaries:&lt;/h5&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;/fastest_nerf_3d_neural_rendering/Instant-Neural-Graphics-Primitives-explained.html&quot;&gt;Train a NeRF in Seconds - Instant Neural Graphics explained&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/alias-free-gan-stylegan3-inversion-video-editing/Third-Time-Is-The-Charm-explained.html&quot;&gt;Is StyleGAN Good? - Third Time is the Charm explained&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/the-best-machine-learning-ai-papers-2021-ranked/Casual-GAN-Papers-Awards.html&quot;&gt;Casual GAN Paper Awards - Best Papers of 2021&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h5 id=&quot;-thanks-for-reading&quot;&gt;👋 Thanks for reading!&lt;/h5&gt;
&lt;p&gt;&lt;a href=&quot;https://www.patreon.com/bePatron?u=53448948&quot; data-patreon-widget-type=&quot;become-patron-button&quot;&gt;Join Patreon for Exclusive Perks!&lt;/a&gt;&lt;script async=&quot;&quot; src=&quot;https://c6.patreon.com/becomePatronButton.bundle.js&quot;&gt;&lt;/script&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you found this paper digest useful, &lt;strong&gt;subscribe&lt;/strong&gt; and &lt;strong&gt;share&lt;/strong&gt; the post with your friends and colleagues to support Casual GAN Papers!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href=&quot;https://t.me/joinchat/KeutnzlvetRkZGZi&quot;&gt;Join the Casual GAN Papers telegram channel&lt;/a&gt; to stay up to date with new AI Papers!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href=&quot;https://t.me/casual_gans_chat&quot;&gt;Discuss&lt;/a&gt; the paper&lt;/em&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;By: &lt;a href=&quot;https://t.me/joinchat/KeutnzlvetRkZGZi&quot;&gt;@casual_gan&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;P.S. Send me paper suggestions for future posts
&lt;a href=&quot;mailto:kdemochkin@gmail.com&quot;&gt;@KirillDemochkin&lt;/a&gt;!&lt;/p&gt;
</description>
        <pubDate>Thu, 10 Feb 2022 16:00:00 +0000</pubDate>
        <link>https://casualganpapers.com/self-supervised-image-animation-image-driving/First-Order-Motion-Model-explained.html</link>
        <guid isPermaLink="true">https://casualganpapers.com/self-supervised-image-animation-image-driving/First-Order-Motion-Model-explained.html</guid>
        
        
        <category>self-supervised-image-animation-image-driving</category>
        
      </item>
    
      <item>
        <title>79: StyleGAN3 Inversion &amp; Editing</title>
        <description>&lt;h4 id=&quot;third-times-the-charm-image-and-video-editing-with-stylegan3-by-yuval-alaluf-or-patashnik-et-al-explained-in-5-minutes&quot;&gt;Third Time’s the Charm? Image and Video Editing with StyleGAN3 by Yuval Alaluf, Or Patashnik et al. explained in 5 minutes&lt;/h4&gt;

&lt;h5 id=&quot;️paper-difficulty-&quot;&gt;⭐️Paper difficulty: 🌕🌕🌕🌕🌑&lt;/h5&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/third_time_preview.gif&quot; alt=&quot;Third Time’s the Charm&quot; title=&quot;Third Time’s the Charm Teaser&quot; /&gt;&lt;/p&gt;

&lt;h5 id=&quot;-at-a-glance&quot;&gt;🎯 At a glance:&lt;/h5&gt;

&lt;p&gt;Alias-free GAN, more commonly known as StyleGAN3, the successor to the legendary StyleGAN2, came out last year, and… well, nothing much happened. Despite the initial spike of interest and promising first results, StyleGAN3 did not set the world on fire, and the research community pretty quickly went back to the good old StyleGAN2 with its well-known latent space disentanglement and numerous other killer features, leaving its successor mostly in the shrinkwrap up on the bookshelf as an interesting, yet confusing toy.&lt;/p&gt;

&lt;p&gt;Now, some 6 months later the team at the Tel-Aviv University, Hebrew University of Jerusalem, and Adobe Research finally released a comprehensive study of StyleGAN3’s applications in popular inversion and editing tasks, its pitfalls, and potential solutions, as well as highlights of the power of the Alias-free generator in tasks, where traditional image generators commonly underperform.&lt;/p&gt;

&lt;p&gt;Let’s dive in, and learn, shall we?&lt;/p&gt;

&lt;h5 id=&quot;️-prerequisites&quot;&gt;⌛️ Prerequisites:&lt;/h5&gt;

&lt;p&gt;&lt;strong&gt;(Highly recommended reading to understand the core contributions of this paper):&lt;/strong&gt;&lt;br /&gt;
1) &lt;a href=&quot;/alias-free-stylegan-2-without-texture-sticking/AliasFreeGAN.html&quot;&gt;Alias-free GAN&lt;/a&gt;&lt;br /&gt;
2) &lt;a href=&quot;/iterative-stylegan-inversion-residual-encoding/ReStyle-explained.html&quot;&gt;ReStyle&lt;/a&gt;&lt;br /&gt;
3) &lt;a href=&quot;/clip-guided-text-based-semantic-image-editing/StyleCLIP-explained.html&quot;&gt;StyleCLIP&lt;/a&gt;&lt;br /&gt;
4) &lt;a href=&quot;/sota-fine-tuning-stylegan-inversion/PTI.html&quot;&gt;PTI&lt;/a&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;h5 id=&quot;-motivation&quot;&gt;🚀 Motivation:&lt;/h5&gt;

&lt;p&gt;StyleGAN3 promised to solve numerous issues plaguing the StyleGAN 1 and 2 generators, namely texture-sticking, a nasty artifact that makes high-frequency textures like hair stay in one place, while the rest of the generated face moves around the image. Moreover, Alias-free GAN has a built-in explicit way to control the translation and rotation of generated faces in a way that is disentangled from the latent style vector that determines the semantic attributes of the synthesized faces. Yet, despite these improvements, StyleGAN3 is not a trivial drop-in replacement for StyleGAN2 in most tasks such as image and video inversion and editing.&lt;/p&gt;

&lt;hr /&gt;

&lt;h5 id=&quot;-main-ideas&quot;&gt;🔍 Main Ideas:&lt;/h5&gt;

&lt;p&gt;&lt;strong&gt;1) StyleGAN3 Overview:&lt;/strong&gt;&lt;br /&gt;
At this point, stop and read the Alias-free GAN digest if you have not done so yet.&lt;/p&gt;

&lt;p&gt;Key differences from StyleGAN2: a fixed 16-layer generator, and Fourier features instead of the constant 4x4 input; the Fourier features can be rotated and translated with four parameters predicted from the first style vector via a learned affine layer. The generator can be forced to output an aligned, FFHQ-like image by setting the rotation and translation to 0 and the first style code to the average style code.&lt;/p&gt;
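
&lt;p&gt;The pseudo-alignment trick is simple enough to sketch; a generator wrapper that accepts explicit rotation and translation inputs is my assumption, not the official API:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch

@torch.no_grad()
def pseudo_align(generator, w_codes, w_avg):
    # Zero out the four Fourier-feature transform parameters and swap the
    # first style code for the average one to force an aligned output.
    w = w_codes.clone()
    w[:, 0] = w_avg
    rotation = torch.zeros(w.shape[0])
    translation = torch.zeros(w.shape[0], 2)
    return generator(w, rotation=rotation, translation=translation)
&lt;/code&gt;&lt;/pre&gt;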

&lt;p&gt;&lt;strong&gt;2) Analysis:&lt;/strong&gt;&lt;br /&gt;
At this point the paper goes into great detail about a series of experiments that I will cover in the next section. Aaaand, let’s get to it!&lt;/p&gt;

&lt;h5 id=&quot;-experiment-insights--key-takeaways&quot;&gt;📈 Experiment insights / Key takeaways:&lt;/h5&gt;

&lt;p&gt;&lt;strong&gt;1) Rotation Control:&lt;/strong&gt;&lt;br /&gt;
The rotation of the synthesized image is primarily controlled by the first two style codes. Note that the first two latent vectors also determine various other entangled attributes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2) Disentanglement Analysis:&lt;/strong&gt;&lt;br /&gt;
To compute attribute scores for generated images a pseudo-alignment procedure is used as mentioned earlier. Using the Style Space (outputs of the learned affine transforms at each of the generator’s layers) for editing StyleGAN3 images is preferred to both the noise and W spaces in terms of the DCI (disentanglement, completeness, informativeness) metrics. Unfortunately, using the popular W+ style codes with StyleGAN3 generates unnatural images.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3) Image Editing:&lt;/strong&gt;&lt;br /&gt;
It is not obvious how to use an unaligned generator for image editing since the learned latent space directions are only available for an aligned generator, and attribute classifiers typically work best with aligned images. While using the pseudo-aligned images is one way to solve this issue, a better way is to generate all images with an aligned generator and apply user-defined rotation and translation to change the orientation of the image. Linear directions appear to be more entangled for the unaligned generator.&lt;/p&gt;

&lt;p&gt;Using the style space for editing images with StyleCLIP results in disentangled edits for both the aligned and unaligned generators, as opposed to edits performed in the noise and W spaces.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4) StyleGAN3 Inversion:&lt;/strong&gt;&lt;br /&gt;
The authors leverage the insight that unaligned images can be edited with an aligned generator and train an encoder on aligned images only. At inference, they align the input image using an off-the-shelf facial landmarks detector and predict the translation and rotation between the unaligned input image and its pseudo-aligned variant. The aligned image is encoded, inverted, and reconstructed into the unaligned version using the predicted translation and rotation parameters. The encoder is either the pSp or the e4e version of ReStyle, with the latter producing more editable reconstructions.&lt;/p&gt;
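
&lt;p&gt;Put together, the inversion recipe reads roughly like this (all helper names are hypothetical):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch

@torch.no_grad()
def invert_unaligned(image, landmarks_model, encoder, generator):
    # Align with facial landmarks, invert the aligned crop, then re-apply
    # the recorded rotation/translation through the generator's input layer.
    aligned, rotation, translation = landmarks_model.align(image)
    w = encoder(aligned)                        # ReStyle-style encoder
    return generator(w, rotation=rotation, translation=translation)
&lt;/code&gt;&lt;/pre&gt;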

&lt;p&gt;The authors notice a quick deterioration in the quality of StyleGAN3-generated images as they move away from the W space towards W+, which is not as well behaved as in StyleGAN2; hence the challenges in training inversion encoders for StyleGAN3.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5) Inverting and Editing Videos:&lt;/strong&gt;&lt;br /&gt;
The pipeline for inverting and editing videos is as follows: all video frames are cropped and aligned before being inverted with a trained encoder. Additionally, the rotation and translation parameters are saved for each frame’s unaligned image. Optionally, the edits are performed at this stage. The obtained vectors and input Fourier features are temporally smoothed to remove jittering between subsequent video frames.&lt;/p&gt;
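
&lt;p&gt;The smoothing step could be as simple as an exponential moving average over the per-frame vectors; the exact filter is my assumption, the paper only says the latents and transform parameters are smoothed temporally:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch

def smooth_latents(ws, alpha=0.9):
    # ws: (T, ...) stacked per-frame latent vectors (or transform params).
    out = [ws[0]]
    for w in ws[1:]:
        out.append(alpha * out[-1] + (1 - alpha) * w)  # EMA over time
    return torch.stack(out)
&lt;/code&gt;&lt;/pre&gt;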

&lt;p&gt;Next, Pivotal Tuning Inversion is performed on the original unaligned images to further improve the reconstructions for each frame.&lt;/p&gt;

&lt;p&gt;As if that was not impressive enough, the authors suggest a way to expand the field of view of videos where a subject’s head moves out of the frame. They do this by predicting the offset transforms to fit the entire head into the frame, generating a second image with the missing parts of the head inside the frame, and combining the two images into a single one with an expanded FOV. This is not possible with StyleGAN2, as it is unable to handle unaligned images. Furthermore, coherent edits on the expanded images are possible with StyleGAN3 without boundary artifacts.&lt;/p&gt;

&lt;hr /&gt;

&lt;h5 id=&quot;️-paper-poster&quot;&gt;🖼️ Paper Poster:&lt;/h5&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/third_time.jpg&quot; alt=&quot;Third Time’s the Charm poster&quot; title=&quot;Third Time’s the Charm Poster&quot; /&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;h5 id=&quot;-possible-improvements&quot;&gt;🛠 Possible Improvements:&lt;/h5&gt;

&lt;ul&gt;
  &lt;li&gt;Figure out why the latent space of StyleGAN3 is entangled, and how to make it more disentangled&lt;/li&gt;
  &lt;li&gt;Do more experiments with non-facial domains&lt;/li&gt;
  &lt;li&gt;Try to design encoder architectures around 1x1 convolutions and Fourier Features.&lt;/li&gt;
&lt;/ul&gt;

&lt;h5 id=&quot;️my-notes&quot;&gt;✏️My Notes:&lt;/h5&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;(5/5) Awesome, I love the wordplay around the number three! I usually take off a point here for not including a simple acronym to refer to the proposed method, but that does not make much sense in this paper&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;An interesting paper format, you don’t see this exploration-style research too often&lt;/li&gt;
  &lt;li&gt;Kinda sucks that there is not a definitive answer on whether StyleGAN3 is better than StyleGAN2 or not&lt;/li&gt;
  &lt;li&gt;If you look closely at the samples provided on the website, you can see a weird “bouncing” effect on some of the images, as if the texture is moving at different speeds at different coordinates, which makes some of the images appear to be made of Jell-O. I guess this could be a side effect of using 1x1 convolutions in the generator.&lt;/li&gt;
  &lt;li&gt;At the rate this team is releasing papers, they might as well just do an official takeover of Casual GAN Papers soon&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;“I am very intrigued, what the next couple of papers from this lab will be (besides the obvious fast video editing follow up), something in 3D maybe?” - well, this aged well in two weeks&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;What do you think about Third Time is the Charm? Leave a comment, and let’s discuss!&lt;/li&gt;
&lt;/ul&gt;

&lt;h5 id=&quot;-links&quot;&gt;🔗 Links:&lt;/h5&gt;

&lt;p&gt;&lt;a href=&quot;https://arxiv.org/pdf/2201.13433v1.pdf&quot;&gt;Third Time’s the Charm arxiv&lt;/a&gt; / &lt;a href=&quot;https://github.com/yuval-alaluf/stylegan3-editing&quot;&gt;Third Time’s the Charm Github&lt;/a&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;h5 id=&quot;-check-out-these-popular-ai-paper-summaries&quot;&gt;🔥 Check Out These Popular AI Paper Summaries:&lt;/h5&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;/fastest_nerf_3d_neural_rendering/Instant-Neural-Graphics-Primitives-explained.html&quot;&gt;Train a NeRF in Seconds - Instant Neural Graphics explained&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/text_guided_video_editing_hd_video_generation/StyleGAN-V-explained.html&quot;&gt;Infinite Video Generation - StyleGAN-V explained&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/the-best-machine-learning-ai-papers-2021-ranked/Casual-GAN-Papers-Awards.html&quot;&gt;Casual GAN Paper Awards - Best Papers of 2021&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h5 id=&quot;-thanks-for-reading&quot;&gt;👋 Thanks for reading!&lt;/h5&gt;
&lt;p&gt;&lt;a href=&quot;https://www.patreon.com/bePatron?u=53448948&quot; data-patreon-widget-type=&quot;become-patron-button&quot;&gt;Join Patreon for Exclusive Perks!&lt;/a&gt;&lt;script async=&quot;&quot; src=&quot;https://c6.patreon.com/becomePatronButton.bundle.js&quot;&gt;&lt;/script&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you found this paper digest useful, &lt;strong&gt;subscribe&lt;/strong&gt; and &lt;strong&gt;share&lt;/strong&gt; the post with your friends and colleagues to support Casual GAN Papers!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href=&quot;https://t.me/joinchat/KeutnzlvetRkZGZi&quot;&gt;Join the Casual GAN Papers telegram channel&lt;/a&gt; to stay up to date with new AI Papers!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href=&quot;https://t.me/casual_gans_chat&quot;&gt;Discuss&lt;/a&gt; the paper&lt;/em&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;By: &lt;a href=&quot;https://t.me/joinchat/KeutnzlvetRkZGZi&quot;&gt;@casual_gan&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;P.S. Send me paper suggestions for future posts
&lt;a href=&quot;mailto:kdemochkin@gmail.com&quot;&gt;@KirillDemochkin&lt;/a&gt;!&lt;/p&gt;
</description>
        <pubDate>Wed, 02 Feb 2022 16:00:00 +0000</pubDate>
        <link>https://casualganpapers.com/alias-free-gan-stylegan3-inversion-video-editing/Third-Time-Is-The-Charm-explained.html</link>
        <guid isPermaLink="true">https://casualganpapers.com/alias-free-gan-stylegan3-inversion-video-editing/Third-Time-Is-The-Charm-explained.html</guid>
        
        
        <category>alias-free-gan-stylegan3-inversion-video-editing</category>
        
      </item>
    
  </channel>
</rss>
