Casual GAN Papers: VQGAN Explained

24: VQGAN Explained

Taming Transformers for High-Resolution Image Synthesis by Patrick Esser et al. explained in 5 minutes.

⭐️Paper difficulty: 🌕🌕🌕🌕🌑

VQGAN: Taming Transformers for High-Resolution Image Synthesis teaser

🎯 At a glance:

It is a lucrative idea to combine the effectiveness of the inductive bias of CNNs with the expressiveness of transformers, yet only recently such an approach was proven to be not only possible but extremely powerful as well. I am of course talking about “Taming Transformers” - a paper from 2020 that proposes a novel generator architecture where a CNN learns a context rich vocabulary of discrete codes and a transformer learns to model their composition as high resolution images in both conditional and unconditional generation settings.

⌛️ Prerequisites:

(Highly recommended reading to understand the core ideas in this paper):

  1. Transformer
  2. VQVAE

🔍 Main Ideas:

1) Learning an effective codebook of image components for transformers: Transformers require sequences of tokens to generate images. In Taming Transformers the authors use a CNN encoder-decoder model with an element-wise quantization of the encoded image that learns a discrete spatial codebook. See VQVAE for more details.

2) Learning a perceptually rich codebook: To keep good perceptual quality at increased compression rate the authors propose VQGAN, a variant of VQVAE with an added patch discriminator and perceptual loss. The weight for the GAN loss is computed as the gradient of the perceptual loss divided by the gradient of the GAN loss. A single attention layer is used on the lowest resolution to aggregate global context. This training procedure reduces the memory requirements for using transformers (which are quadratic in the number of input tokens) by shortening the length of the code sequence in the next step.

3) Latent transformers: The transformer works with a sequence of quantized tokens from the learned discrete codebook in an autoregressive way, i.e. it learns to predict the distribution for next possible tokens given a sequence of previous tokens. The transformer is trained with a cross entropy loss.

4) Conditional synthesis: To pass spatial conditioning information to the transformer a second VQGAN is learned to obtain additional tokens that are simply prepended to the main tokens before going into the transformer.

5) High resolution synthesis: Since transformers are extremely memory intensive, they are not straightforward to use at high resolutions. To circumvent this limitation and generate megapixel-sized images the authors of VQGAN propose to use the transformer in a sliding window manner by reshaping the sequence of tokens to HxW and processing patches of tokens at each step instead of the whole sequence (only above and to the left of the current token). It is claimed in the paper that the VQGAN insures that the available context is sufficient to model the high range interactions vital for high-resolution image synthesis.

📈 Experiment insights / Key takeaways:
  • the latent codes are 1024 elements long, and transformers predict patches of 16x16 tokens during training
  • The transformer consistently outperforms PixelSNAIL on all considered tasks
  • VQGAN can be reused across different tasks
  • Context-rich vocabulary is absolutely necessary for high quality image synthesis.
  • VQGAN provides a unified approach (with l10x less parameters than VQVAE2) that is does well across a number of various tasks, only marginally outperformed by task specific models on some of them.

🖼️ Paper Poster:

VQGAN: Taming Transformers for High-Resolution Image Synthesis paper poster

✏️My Notes:
  • 4/5 VQGAN sounds sleek, not very tongue-in-cheek though :( BTW, still waiting for someone to make a “BakuGAN” paper
  • I have recently seen a ton of VQGAN+CLIP images generated from text prompts that look insanely cool, want to look into that
  • Interesting, how this can work together with the whole MLP-only trend in computer vision right now
  • I think this was an important paper to cover for learning about Dall-e, CogView, VQGAN+CLIP, etc. Comment bellow, which of those papers you want to see first!

VQGAN arxiv / VQGAN github

👋 Thanks for reading!

Join Patreon for Exclusive Perks!

If you found this paper digest useful, subscribe and share the post with your friends and colleagues to support Casual GAN Papers!

Join the Casual GAN Papers telegram channel to stay up to date with new AI Papers!

By: @casual_gan

P.S. Send me paper suggestions for future posts @KirillDemochkin!