StyleGAN-V: A Continuous Video Generator with the Price, Image Quality and Perks of StyleGAN2 by Ivan Skorokhodov et al. explained in 5 minutes
⭐️Paper difficulty: 🌕🌕🌕🌕🌑
🎯 At a glance:
While we have seen several new SOTA image generation models pop up over the last year, video generation still remains lackluster, to say the least. But does it have to be? The authors of StyleGAN-V certainly don’t think so! By adapting the generator from StyleGAN2 to work with motion conditions, developing a hypernetwork-based discriminator, and designing a clever acyclic positional encoding, Ivan Skorokhodov and the team at KAUST and Snap Inc. deliver a model that generates videos of arbitrary length at arbitrary framerates, is just 5% more expensive to train than a vanilla StyleGAN2, and beats multiple baseline models at 256 and 1024 resolution. Oh, and it only needs to see about 2 frames per video during training to do so!
And if that wasn’t impressive enough, StyleGAN-V is CLIP-compatible, enabling first-ever text-based consistent video editing.
(Highly recommended reading to understand the core contributions of this paper):
Conv3D layers, commonly used for video generation, come with many limitations, chief among them high computational cost. One way to circumvent the need for 3D convolutions is to treat the video as a continuous signal with a time coordinate. For such an approach to work, several issues must be solved first: existing sine/cosine-based positional encodings are cyclic and do not depend on the input, which is detrimental for videos, since we want different motion between frames for different videos, and the videos should not loop. Next, training on full videos is computationally expensive, hence the generator must be able to learn from sparse inputs of just a couple of frames per clip. Finally, the discriminator needs to handle frames sampled at varying time distances in order to process those sparse inputs.
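The payoff of the continuous-signal view is that clip length and framerate become free parameters at sampling time: you just evaluate the same generator at different real-valued timestamps. A toy numpy sketch of that idea (the generator here is a made-up stand-in, not StyleGAN-V):

```python
import numpy as np

def render_video(G, z_content, t_grid):
    """Video as a continuous signal: any length/framerate comes from
    evaluating the same per-frame generator G at real-valued timestamps."""
    return np.stack([G(z_content, t) for t in t_grid])

# hypothetical toy generator: a 4x4 "frame" that drifts smoothly with time
toy_G = lambda z, t: np.outer(np.sin(z[:4] + t), np.cos(z[4:8] + 0.5 * t))

z = np.linspace(0.0, 1.0, 8)                            # one content code
clip_24fps = render_video(toy_G, z, np.arange(48) / 24)   # 2 s at 24 fps
clip_60fps = render_video(toy_G, z, np.arange(120) / 60)  # same 2 s at 60 fps
```

Both clips come from the same content code and the same "generator"; only the sampling grid changes, which is exactly what arbitrary-framerate generation amounts to in this framing.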
To summarize, unlike prior works, StyleGAN-V is not autoregressive, does not use Conv3D, trains on sparse inputs, and uses a single discriminator instead of two separate ones for images and videos.
🔍 Main Ideas:
1) Generator structure:
The generator consists of three submodules: the mapping network and the synthesis network, both lifted from StyleGAN2 with one slight modification (the constant input to the synthesis network is concatenated with motion codes), and a third submodule, the motion mapping network, that produces those motion codes.
A video is generated by sampling content noise and passing it through the mapping network to obtain the style code for the entire video. Then, for each target timestep, a sequence of noise vectors corresponding to equidistant timesteps (long enough to cover the target timestep) is sampled and passed through two paddingless Conv1D layers. The acyclic positional encoding is then computed from the two vectors in the output sequence that correspond to the timesteps to the left and right of the target timestep, and the resulting motion code is plugged into the generator.
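The sampling procedure above can be sketched roughly like this (the shapes, the spacing of the noise grid, and the toy valid-mode Conv1D are all illustrative stand-ins, not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d_valid(x, w):
    """Paddingless ("valid") Conv1D over time; x: (T, C), w: (K, C, C)."""
    K = w.shape[0]
    T = x.shape[0] - K + 1  # valid convolution shortens the sequence
    return np.stack([sum(x[t + k] @ w[k] for k in range(K)) for t in range(T)])

def motion_codes(t_target, period=8.0, dim=16, K=3):
    # 1) sample equidistant motion noise vectors, enough to cover t_target
    #    even after two valid convs eat (K - 1) steps each
    n_steps = int(np.ceil(t_target / period)) + 2 * (K - 1) + 2
    noise = rng.standard_normal((n_steps, dim))
    # 2) two paddingless Conv1D layers turn noise into "raw" motion codes
    w1, w2 = rng.standard_normal((2, K, dim, dim)) * 0.1
    h = conv1d_valid(conv1d_valid(noise, w1), w2)
    # 3) pick the two codes bracketing t_target; the positional encoding is
    #    computed from these plus the fractional position between them
    anchors = np.arange(h.shape[0]) * period
    left = np.searchsorted(anchors, t_target, side="right") - 1
    return h[left], h[left + 1], (t_target - anchors[left]) / period

left_code, right_code, frac = motion_codes(t_target=20.0)
```

The key point is that the code sequence is finite and cheap to produce, yet any real-valued timestep lands between two codes, which is what the acyclic positional encoding consumes next.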
2) Acyclic positional encoding:
The proposed positional encoding is essentially a transformed sine function with a learnable amplitude, period, and phase that first predicts “raw” motion codes from the left side of the target timestep. This alone, however, leads to disjoint motion codes, which is why they are stitched together by subtracting the linear interpolation between the left and right codes, zeroing out the embeddings at each discrete timestep (0, 1, 2, …). This somewhat limits the expressiveness of the positional encoding, so to compensate, a linear interpolation between the left and right motion vectors, multiplied by a learnable matrix, is added back. Just think of this as BatchNorm’s weird cousin: it normalizes the vectors and then denormalizes them with learned parameters, nothing more complicated intuitively.
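Here is a minimal numpy sketch of the stitching trick for a single segment between two consecutive motion codes (all parameters are random stand-ins for learned ones; an illustration of the idea, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 8
# "learnable" parameters (random stand-ins): wave amplitude/frequency/phase,
# and the matrix W that projects the interpolated motion codes back in
amp, freq, phase = rng.standard_normal((3, dim))
W = 0.1 * rng.standard_normal((dim, dim))
left_code, right_code = rng.standard_normal((2, dim))  # codes at steps i, i+1

def raw_wave(t):
    """Sine wave with learnable amplitude, period, and phase (element-wise)."""
    return amp * np.sin(freq * t + phase)

def acyclic_pe(t):
    """t in [0, 1] is the position inside one segment between two codes."""
    # subtract the lerp of the wave's own endpoint values, so the wave term
    # is exactly zero at the discrete timesteps t = 0 and t = 1 ...
    wave = raw_wave(t) - ((1 - t) * raw_wave(0.0) + t * raw_wave(1.0))
    # ... then add back a learned projection of the lerp between the two
    # motion codes ("BatchNorm's weird cousin": normalize, then re-scale)
    return wave + W @ ((1 - t) * left_code + t * right_code)
```

At t = 0 the output is exactly `W @ left_code` and at t = 1 it is `W @ right_code`, so consecutive segments share their boundary values: the encoding is continuous across segments, yet never forced into a global repeating cycle.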
3) Discriminator structure:
The discriminator independently extracts features from each frame, concatenates the results, and predicts a single real/fake logit from that tensor. To handle sparse inputs, the discriminator is conditioned on the time distances between frames: the distances are preprocessed by a positional encoding followed by an MLP, and concatenated into a single vector that modulates the weights in the first layer of each discriminator block, along with a projection condition (dot product) in the very last layer.
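The conditioning path can be sketched as follows (the fixed frequencies, the single-matrix stand-in for the MLP, and the toy weight shapes are all illustrative assumptions, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(2)

def time_delta_condition(frame_times, dim=8):
    """Encode the distances between sampled frames into one condition vector."""
    deltas = np.diff(frame_times)            # e.g. [3., 5.] for frames at 0, 3, 8
    freqs = 2.0 ** np.arange(dim // 2)       # fixed sine/cosine frequencies
    enc = np.concatenate([np.sin(np.outer(deltas, freqs)),
                          np.cos(np.outer(deltas, freqs))], axis=1)
    W = rng.standard_normal((enc.shape[1], dim)) * 0.1  # stand-in for the MLP
    return np.tanh(enc @ W).reshape(-1)      # one vector per delta, concatenated

def modulate(conv_weight, cond, proj):
    """Scale conv weights per output channel by a projection of the condition,
    in the spirit of StyleGAN2 weight modulation."""
    scale = 1.0 + proj @ cond                # per-output-channel scales
    return conv_weight * scale[:, None]

cond = time_delta_condition(np.array([0.0, 3.0, 8.0]))
conv_w = rng.standard_normal((4, 3))                   # toy (out, in) weight
proj = 0.1 * rng.standard_normal((4, cond.shape[0]))
modulated = modulate(conv_w, cond, proj)
```

The same frame-level feature extractor thus behaves differently depending on how far apart the sampled frames are, which is what lets a single discriminator judge sparsely sampled clips.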
4) Implicit assumptions of sparse training:
The intuition is very simple: frames do not change much within a single video (faces, timelapses, etc.). Hence, just a few frames contain enough information to get an idea of the entire video. If you have seen two frames, you have seen them all.
📈 Experiment insights / Key takeaways:
- Datasets: FaceForensics, SkyTimelapse, UCF101, RainbowJelly, MEAD 1024
- Baselines: MoCoGAN, VideoGPT, DIGAN
- The main metric is Fréchet Video Distance (FVD), which the authors implement from scratch and will release with the code
- Replacing motion codes with an LSTM hurts performance due to unnaturally abrupt transitions between frames; removing any of the conditioning also hurts performance; adding more frames during training does not improve quality
- StyleGAN-V has the same latent space properties as StyleGAN2
- StyleGAN-V is the first video generator that can be trained directly at 1024×1024 resolution
- StyleGAN-V has almost the same training efficiency and quality as StyleGAN2
- StyleGAN-V separates content and motion with the ability to change either one without affecting the other
🖼️ Paper Poster:
🛠 Possible Improvements:
- Get more data
- Use more datasets with more complex motion patterns
- Remove periodic artifacts (apparently, they still happen sometimes)
- Improve handling of new content moving into a video frame
- Reduce sensitivity to hyperparameters
- Improve texture sticking (the teeth look really creepy on 1024x1024)
(4/5) While StyleGAN-V sounds cyberpunk-esque, I cannot in good faith give it a 5 for a one-letter name change
- CLIP guided videos coming soon? I sure hope so! BTW, bonus Casual GAN points for including CLIP-related experiments in the paper.
- Truth be told, I have never worked with videos, but this paper looked pretty interesting, hence it got the inaugural 2022 first paper post of the year spot.
- I definitely want to see if layer swapping will work with StyleGAN-V, preferably with layers from a finetuned StyleGAN2, or maybe just finetune the video generator on clips from cartoons… I see AnimeGAN, ArcaneGAN, etc. popping up on Twitter, and this idea is somewhere in the ballpark of that trend.
- TBH, StyleGAN-V, at least on paper, seems like a killer combo with GANformer2, as the latter can handle and generate complex layouts with multiple objects, while the former takes care of cohesively moving the objects around a scene between frames.
- What do you think about StyleGAN-V? Share your thoughts in the chat!
StyleGAN-V arxiv / StyleGAN-V Github (Coming Soon)
🔥 Check Out These Popular AI Paper Summaries:
- 100x Faster than NeRF - Plenoxels explained
- NeRF + CLIP = Insanity - DreamFields explained
- Guided Diffusion explained
👋 Thanks for reading!
Join Patreon for Exclusive Perks!
If you found this paper digest useful, subscribe and share the post with your friends and colleagues to support Casual GAN Papers!
Join the Casual GAN Papers telegram channel to stay up to date with new AI Papers!
Discuss the paper
P.S. Send me paper suggestions for future posts @KirillDemochkin!