Neural Body: Implicit Neural Representations with Structured Latent Codes for Novel View Synthesis of Dynamic Humans by Sida Peng et al explained in 5 minutes.
⭐️Paper difficulty: 🌕🌕🌕🌑🌑
🎯 At a glance:
Want to dance like a pro? Just fit a neural body to a sparse set of shots from different camera poses and animate it to your heart’s desire! This new human body representation is proposed in a CVPR 2021 best paper candidate work by Sida Peng and his teammates. At the core of the paper is the insight that the neural representations of different frames share the same set of latent codes anchored to a deformable mesh. Neural Body outperforms prior works by a wide margin.
(Check these out if you have extra time):
Previously, free view video required a specialized array of depth sensors and cameras. This work, however focuses on the problem of novel view synthesis from a sparse multi-view video with just a few cameras. Traditional rendering methods require dense input views, while reconstruction-based ones struggle with occlusion. The recently proposed NeRF works alright, but does not handle well highly sparse input views. The authors pitch that the key to solving the sparse view problem lies in aggregating all observations over different video frames. Instead of learning different implicit fields for each frame, Neural Body anchors latent vectors to keypoints on a defformable human model (SMPL) and learns to predict denstity and color of each point based on the corresponding latent codes, and their spatial location. Everything is learned jointly from all video frames.
🔍 Main Ideas:
Neural body starts from a bunch of latent codes attached to the surface of a deformable SMPL. The code for any location is obtained with diffusion, decoded into density and color by two models, and the entire image is generated with volume rendering. The model as well as the codes are jointly learned by minimizing the difference between the input and rendered images.
2) Structured latent codes:
SMPL outputs a posed 3D mesh with 6890 vertices, and Neural Body has a latent vector for each vertex. The spatial locations of the codes are transformed based on the mesh pose.
3) Code diffusion:
Since implicit fields require sampling codes at continuous 3d locations, and trilinear interpolation leads to 0 vectors in most points due to sparsity, the Neural Body latent codes are diffused to nearby 3d space. The bounding box around the human body mesh is split into small voxels, and the codes inside voxels are averaged for non-empty voxels. Next, a SparseConvNet uses 3D convolutions to output 2x, 4x, 8x, 16x downsampled volumes, and interpolated codes from across all scales are concatenated into the vectors that form the latent code volume. The code for any 3D point is trilinearly interpolated from the latent code volume after transforming its coordinates to SMPL coordinate system.
4) Density and color regression:
Density and Color fields are represented by MLP networks NeRF style. In addition to using positional encoding a temporal encoding is used for each fram in a video to account for small changes in lighting.
5) Volume rendering:
Standard volume rendering to a 2D image to compute a pixel-wise loss.
📈 Experiment insights / Key takeaways:
- All experiments are conducted on ZJU-Mocap
- Neural Body (NB) outperforms Neural Texture, Neural Volume, and NHR on novel view synthesis
- NB beats PIFuHD on 3D reconstruction
🖼️ Paper Poster:
- (4/5) Neural body and Latent Code Volume give me strong Evangelion vibes for some reason 😂. Definitely a like!
- Have you tried Neural Body? Let me know in the comments!
Neural Body arxiv / Neural Body github
👋 Thanks for reading!
< Join Patreon for Exclusive Perks!
If you found this paper digest useful, subscribe and share the post with your friends and colleagues to support Casual GAN Papers!
Join the Casual GAN Papers telegram channel to stay up to date with new AI Papers!
Discuss the paper
P.S. Send me paper suggestions for future posts @KirillDemochkin!