During fine-tuning, GeoWizard encodes the image, the ground-truth (GT) depth, and the GT normal into latent space through a frozen VAE and forms two concatenated geometric groups. Each group is fed into the U-Net, which generates output in the depth or normal domain under the guidance of a geometry switcher. In addition, a scene prompt is introduced so that results follow one of three possible scene layouts (indoor, outdoor, or object). During inference, given an image, a scene prompt, and initial depth and normal noise, GeoWizard jointly generates high-quality depth and normal maps.
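The fine-tuning data flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `encode` is a hypothetical stand-in for the frozen VAE, each group is assumed to be the image latent concatenated channel-wise with a noised geometry latent, and the one-hot switcher and scene vectors, the function names, and all shapes are illustrative choices.

```python
import numpy as np

LATENT_CHANNELS = 4                        # assumed latent width
DOMAINS = ("depth", "normal")              # geometry switcher options
SCENES = ("indoor", "outdoor", "object")   # scene prompt options


def encode(x, factor=8):
    """Hypothetical stand-in for the frozen VAE encoder (not the real model).

    Average-pools a (C, H, W) array by `factor` and cycles its channels
    so the result always has LATENT_CHANNELS channels.
    """
    c, h, w = x.shape
    pooled = x.reshape(c, h // factor, factor, w // factor, factor).mean(axis=(2, 4))
    return pooled[np.arange(LATENT_CHANNELS) % c]


def one_hot(name, options):
    """Encode a categorical choice as a one-hot vector (an assumed encoding)."""
    return np.eye(len(options))[options.index(name)]


def build_groups(image, depth, normal, rng, noise_scale=1.0):
    """Form the two geometric groups: image latent + noised geometry latent."""
    z_img = encode(image)
    groups = {}
    for domain, target in zip(DOMAINS, (depth, normal)):
        z_geo = encode(target)
        # Simplified noising step; a real diffusion noise schedule is omitted.
        noisy = z_geo + noise_scale * rng.standard_normal(z_geo.shape)
        groups[domain] = np.concatenate([z_img, noisy], axis=0)
    return groups


def conditioning(domain, scene):
    """Combine the geometry switcher and scene prompt into one vector."""
    return np.concatenate([one_hot(domain, DOMAINS), one_hot(scene, SCENES)])


rng = np.random.default_rng(0)
groups = build_groups(
    image=np.zeros((3, 64, 64)),
    depth=np.zeros((1, 64, 64)),
    normal=np.zeros((3, 64, 64)),
    rng=rng,
)
cond = conditioning("normal", "outdoor")
```

In this sketch each group would be passed to the U-Net together with the matching conditioning vector. At inference time the geometry half of each group would presumably start from pure noise rather than from a noised GT latent.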
@inproceedings{fu2024geowizard,
title={GeoWizard: Unleashing the Diffusion Priors for 3D Geometry Estimation from a Single Image},
author={Fu, Xiao and Yin, Wei and Hu, Mu and Wang, Kaixuan and Ma, Yuexin and Tan, Ping and Shen, Shaojie and Lin, Dahua and Long, Xiaoxiao},
booktitle={ECCV},
year={2024}
}