MetaCap: Meta-learning Priors from Multi-View Imagery for Sparse-view Human Performance Capture and Rendering
| Main Authors | , , , , |
|---|---|
| Format | Journal Article |
| Language | English |
| Published | 27.03.2024 |
| Subjects | |
| DOI | 10.48550/arxiv.2403.18820 |
| Summary: | Faithful human performance capture and free-view rendering from sparse RGB observations is a long-standing problem in Vision and Graphics. The main challenges are the lack of observations and the inherent ambiguities of the setting, e.g. occlusions and depth ambiguity. As a result, radiance fields, which have shown great promise in capturing high-frequency appearance and geometry details in dense setups, perform poorly when naively supervised on sparse camera views, as the field simply overfits to the sparse-view inputs. To address this, we propose MetaCap, a method for efficient and high-quality geometry recovery and novel view synthesis given very sparse or even a single view of the human. Our key idea is to meta-learn the radiance field weights solely from potentially sparse multi-view videos, which can serve as a prior when fine-tuning them on sparse imagery depicting the human. This prior provides a good network weight initialization, thereby effectively addressing ambiguities in sparse-view capture. Due to the articulated structure of the human body and motion-induced surface deformations, learning such a prior is non-trivial. Therefore, we propose to meta-learn the field weights in a pose-canonicalized space, which reduces the spatial feature range and makes feature learning more effective. Consequently, our field parameters can be fine-tuned to quickly generalize to unseen poses, novel illumination conditions, as well as novel and sparse (even monocular) camera views. To evaluate our method under different scenarios, we collect a new dataset, WildDynaCap, which contains subjects captured in both a dense camera dome and in-the-wild sparse camera rigs, and demonstrate superior results compared to recent state-of-the-art methods on both the public and WildDynaCap datasets. |
|---|---|
| DOI: | 10.48550/arxiv.2403.18820 |
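
The abstract describes meta-learning radiance-field weights so that they serve as an initialization which is then fine-tuned on a few sparse views. Below is a minimal, hedged sketch of that general idea using a Reptile-style outer loop on a toy MLP. All names (`RadianceFieldMLP`, `meta_learn_prior`, `finetune_sparse`), the synthetic data, and the choice of Reptile are illustrative assumptions, not the authors' implementation; the actual method additionally works in a pose-canonicalized space and supervises via volume rendering, which this sketch omits.

```python
# Hedged sketch: learn a weight prior for a small radiance-field-like MLP via a
# Reptile-style meta-learning loop, then fine-tune it on sparse observations.
# Everything here (model, task construction, hyperparameters) is a toy assumption.
import copy
import torch
import torch.nn as nn

class RadianceFieldMLP(nn.Module):
    """Toy stand-in for a radiance field: maps 3D points to (RGB, density)."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # 3 color channels + 1 density
        )

    def forward(self, x):
        return self.net(x)

def inner_fit(model, points, targets, steps=8, lr=1e-2):
    """Fit a copy of the model to one 'task' (e.g. one frame's observations)."""
    task_model = copy.deepcopy(model)
    opt = torch.optim.SGD(task_model.parameters(), lr=lr)
    for _ in range(steps):
        loss = nn.functional.mse_loss(task_model(points), targets)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return task_model

def meta_learn_prior(tasks, meta_steps=100, meta_lr=0.1):
    """Reptile outer loop: nudge the shared weights toward each task's solution."""
    prior = RadianceFieldMLP()
    for step in range(meta_steps):
        points, targets = tasks[step % len(tasks)]
        adapted = inner_fit(prior, points, targets)
        # Reptile update: theta <- theta + meta_lr * (theta_task - theta)
        with torch.no_grad():
            for p, q in zip(prior.parameters(), adapted.parameters()):
                p.add_(meta_lr * (q - p))
    return prior

def finetune_sparse(prior, points, targets, steps=20, lr=5e-3):
    """Few-step fine-tuning on sparse-view observations, starting from the prior."""
    model = copy.deepcopy(prior)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        loss = nn.functional.mse_loss(model(points), targets)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model

if __name__ == "__main__":
    # Synthetic 'tasks': each is (sampled 3D points, pseudo radiance targets).
    tasks = [(torch.randn(256, 3), torch.randn(256, 4)) for _ in range(10)]
    prior = meta_learn_prior(tasks)
    sparse_pts, sparse_tgt = torch.randn(64, 3), torch.randn(64, 4)
    adapted = finetune_sparse(prior, sparse_pts, sparse_tgt)
    print("fine-tuned loss:",
          nn.functional.mse_loss(adapted(sparse_pts), sparse_tgt).item())
```

The point of the sketch is the division of labor the abstract describes: the outer loop only produces an initialization, and all sparse-view adaptation happens in a short fine-tuning stage that starts from those meta-learned weights rather than from scratch.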