GeoWizard: Unleashing the Diffusion Priors for 3D Geometry Estimation from a Single Image

1 CUHK 2 The University of Adelaide 3 HKUST 4 ShanghaiTech University 5 HKU 6 Light Illusions

* Equal contribution; † Corresponding authors


GeoWizard generates state-of-the-art depth and normal maps with rich high-frequency detail on in-the-wild images.

Depth & Normal Comparison Gallery

3D Reconstruction Comparison

We use the BiNI (bilateral normal integration) algorithm to reconstruct a 3D mesh from the estimated normal map.
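BiNI integrates predicted normals into a surface while preserving depth discontinuities. As a much simpler illustration of the underlying idea (turning a normal map into a height field), the sketch below uses the classical Frankot-Chellappa Fourier-domain integrator instead of BiNI. It assumes an orthographic camera, normals whose z component points toward the viewer, and periodic boundaries; the function name `frankot_chellappa` is our own.

```python
import numpy as np

def frankot_chellappa(normals: np.ndarray) -> np.ndarray:
    """Integrate an (H, W, 3) unit normal map into an (H, W) height map.

    Assumes an orthographic camera and normals with n_z > 0. The result is
    defined only up to an additive constant, which is fixed to zero mean.
    """
    nx, ny, nz = normals[..., 0], normals[..., 1], normals[..., 2]
    nz = np.clip(nz, 1e-6, None)           # avoid division by zero at grazing angles
    p = -nx / nz                           # surface gradient dz/dx
    q = -ny / nz                           # surface gradient dz/dy (rows treated as y)
    h, w = p.shape
    wx = np.fft.fftfreq(w) * 2 * np.pi     # angular frequencies along columns
    wy = np.fft.fftfreq(h) * 2 * np.pi     # angular frequencies along rows
    WX, WY = np.meshgrid(wx, wy)
    P, Q = np.fft.fft2(p), np.fft.fft2(q)
    denom = WX**2 + WY**2
    denom[0, 0] = 1.0                      # DC term is unconstrained by gradients
    Z = (-1j * WX * P - 1j * WY * Q) / denom
    Z[0, 0] = 0.0                          # pin the free constant: zero-mean height
    return np.real(np.fft.ifft2(Z))
```

Because Fourier integration assumes a smooth, periodic surface, it smears depth discontinuities across object boundaries; avoiding exactly that is the motivation for a discontinuity-aware integrator such as BiNI when meshing real predictions.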

Novel View Synthesis Comparison

We use depth-guided 3D Photo Inpainting to render novel views.
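The geometric core of depth-guided view synthesis is reprojecting pixels through the estimated depth. The sketch below shows only that warping step, under assumptions of our own: a pinhole intrinsic matrix `K` shared by both views, a known relative pose (`R`, `t`), and nearest-pixel splatting with a z-buffer. It is not the 3D Photo Inpainting pipeline, which additionally builds a layered depth image and inpaints color and depth in disoccluded regions; the returned `filled` mask marks the holes such a stage would have to complete. `forward_warp` is a hypothetical helper name.

```python
import numpy as np

def forward_warp(image, depth, K, R, t):
    """Forward-warp `image` (H, W, C) into a target camera using `depth` (H, W).

    K: (3, 3) pinhole intrinsics, assumed identical for both views.
    R, t: (3, 3) rotation and (3,) translation from source to target camera.
    Returns the warped image and a boolean mask of pixels that received a value.
    """
    h, w = depth.shape
    c = image.shape[-1]
    v, u = np.mgrid[0:h, 0:w]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(float)
    # Unproject every pixel to a 3D point in the source camera frame.
    pts = (pix @ np.linalg.inv(K).T) * depth.reshape(-1, 1)
    # Rigidly move the points into the target camera frame.
    pts = pts @ R.T + t
    z = pts[:, 2]
    src = np.flatnonzero(z > 1e-6)         # keep points in front of the target camera
    proj = pts[src] @ K.T
    u2 = np.round(proj[:, 0] / proj[:, 2]).astype(int)
    v2 = np.round(proj[:, 1] / proj[:, 2]).astype(int)
    inside = (u2 >= 0) & (u2 < w) & (v2 >= 0) & (v2 < h)
    src, u2, v2 = src[inside], u2[inside], v2[inside]
    z2 = z[src]
    # Z-buffer: when several points land on one pixel, keep the nearest one.
    lin = v2 * w + u2
    order = np.lexsort((z2, lin))          # primary key: target pixel, secondary: depth
    _, first = np.unique(lin[order], return_index=True)
    keep = order[first]
    out = np.zeros((h * w, c), dtype=image.dtype)
    out[lin[keep]] = image.reshape(-1, c)[src[keep]]
    filled = np.zeros(h * w, dtype=bool)
    filled[lin[keep]] = True
    return out.reshape(h, w, c), filled.reshape(h, w)
```

Resolving collisions with an explicit sort rather than relying on repeated fancy-index writes keeps the "nearest surface wins" rule deterministic, since NumPy does not guarantee which duplicate assignment is applied last.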

Overall Framework

During fine-tuning, GeoWizard encodes the image, the ground-truth (GT) depth, and the GT normal map into latent space with a frozen VAE and forms two concatenated geometric groups, one pairing the image latent with the depth latent and one pairing it with the normal latent. Each group is fed into the U-Net, and a geometry switcher guides the U-Net to produce its output in either the depth or the normal domain. In addition, a scene prompt conditions the model on one of three scene layouts (indoor, outdoor, or object). During inference, given an image, a scene prompt, and initial depth and normal noise, GeoWizard jointly generates high-quality depth and normal maps.
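To make the data flow concrete, the shape-level sketch below assembles the two geometric groups for one fine-tuning step. It rests on assumptions that the text above does not spell out: Stable-Diffusion-style 4-channel latents, channel-wise concatenation of the image latent with each noisy geometry latent, standard DDPM forward noising, and one-hot vectors for the geometry switcher and the scene prompt. The VAE and U-Net are omitted, and `build_geometric_groups` is a hypothetical name, not GeoWizard's code.

```python
import numpy as np

SCENES = ("indoor", "outdoor", "object")   # the three scene layouts named in the text
DOMAINS = ("depth", "normal")              # targets selected by the geometry switcher

def one_hot(index: int, size: int) -> np.ndarray:
    v = np.zeros(size, dtype=np.float32)
    v[index] = 1.0
    return v

def build_geometric_groups(z_img, z_depth, z_normal, alpha_bar_t, scene, rng):
    """Form the two U-Net inputs for one training step (shape-level sketch).

    z_img, z_depth, z_normal: (C, h, w) latents, assumed to come from a frozen VAE.
    alpha_bar_t: cumulative noise-schedule value at the sampled timestep (DDPM assumed).
    Returns a (2, 2C, h, w) input batch, per-group switcher one-hots,
    the scene one-hot, and the noise that the U-Net would learn to predict.
    """
    eps = rng.standard_normal((2,) + z_img.shape).astype(np.float32)
    clean = np.stack([z_depth, z_normal])                       # (2, C, h, w)
    # Assumed DDPM forward process: sqrt(a) * x0 + sqrt(1 - a) * eps.
    noisy = np.sqrt(alpha_bar_t) * clean + np.sqrt(1 - alpha_bar_t) * eps
    cond = np.broadcast_to(z_img, noisy.shape)                  # same image for both groups
    x = np.concatenate([cond, noisy], axis=1)                   # assumed channel-wise concat
    switch = np.stack([one_hot(i, len(DOMAINS)) for i in range(len(DOMAINS))])
    scene_vec = one_hot(SCENES.index(scene), len(SCENES))
    return x, switch, scene_vec, eps
```

Stacking both groups in one batch lets a single shared U-Net serve both domains, with the switcher vector indicating which output each sample should produce; at inference the same layout would be used with pure noise in place of the noised GT latents.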



  title={GeoWizard: Unleashing the Diffusion Priors for 3D Geometry Estimation from a Single Image},
  author={Fu, Xiao and Yin, Wei and Hu, Mu and Wang, Kaixuan and Ma, Yuexin and Tan, Ping and Shen, Shaojie and Lin, Dahua and Long, Xiaoxiao},
  journal={arXiv},
  year={2024}