NU-MCC: Multiview Compressive Coding with Neighborhood Decoder and Repulsive UDF

Abstract

Remarkable progress has been made in 3D reconstruction from single-view RGB-D inputs. MCC, the current state-of-the-art method in this field, achieves unprecedented success by combining vision Transformers with large-scale training. However, we identified two key limitations of MCC: 1) the Transformer decoder is inefficient at handling a large number of query points; 2) the 3D representation struggles to recover high-fidelity details. In this paper, we propose a new approach called NU-MCC that addresses these limitations. NU-MCC includes two key innovations: a Neighborhood decoder and a Repulsive Unsigned Distance Function (Repulsive UDF). First, our Neighborhood decoder introduces center points as an efficient proxy for the input visual features, allowing each query point to attend only to a small neighborhood. This design not only results in much faster inference but also enables the exploitation of finer-scale visual features for improved recovery of 3D textures. Second, our Repulsive UDF is a novel alternative to the occupancy field used in MCC, significantly improving the quality of 3D object reconstruction. Compared to standard UDFs, whose results suffer from holes, our proposed Repulsive UDF achieves more complete surface reconstruction. Experimental results demonstrate that NU-MCC is able to learn a strong 3D representation, significantly advancing the state of the art in single-view 3D reconstruction. In particular, it outperforms MCC by 9.7% in terms of the F1-score on the CO3D-v2 dataset while running more than 5x faster.

Approach

Overview

Given an input single-view RGB-D image, we first unproject the pixels into the 3D world frame, resulting in a textured partial point cloud. We then employ a standard ViT to extract visual features from the partial point cloud. Next, we introduce our Neighborhood decoder, which utilizes the extracted visual features to estimate the UDF and RGB values of each query point in 3D space. The Neighborhood decoder allows each query point (blue star in the figure) to attend to only a small set of features in its neighborhood and incorporate fine-scale information directly from the input, significantly improving the efficiency and reconstruction quality. During inference, the predicted UDF is used to shift the query points to the surface of the 3D object, leading to high-quality 3D reconstruction. Repulsive forces between query points are employed to fix the hole artifacts in the standard UDF.
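
The unprojection step at the start of this pipeline can be illustrated with a minimal NumPy sketch. It assumes a pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy` and an optional 4x4 camera-to-world pose; these parameter names, the function name `unproject_rgbd`, and the validity test on depth are illustrative assumptions, not details taken from NU-MCC.

```python
import numpy as np

def unproject_rgbd(rgb, depth, fx, fy, cx, cy, cam_to_world=None):
    """Lift an RGB-D image to a textured partial point cloud.

    rgb:   (H, W, 3) colour image.
    depth: (H, W) depth along the camera z-axis.
    fx, fy, cx, cy: pinhole intrinsics (assumed camera model).
    cam_to_world: optional (4, 4) pose; points stay in the camera frame if None.
    Returns (N, 3) points and (N, 3) colours for pixels with valid depth.
    """
    h, w = depth.shape
    # u indexes columns (image x), v indexes rows (image y); both have shape (H, W).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    colors = rgb.reshape(-1, 3)

    # Drop pixels without a usable depth reading (assumed convention: <= 0 or non-finite).
    valid = np.isfinite(points[:, 2]) & (points[:, 2] > 0)
    points, colors = points[valid], colors[valid]

    if cam_to_world is not None:
        rotation, translation = cam_to_world[:3, :3], cam_to_world[:3, 3]
        points = points @ rotation.T + translation
    return points, colors
```

Each returned point keeps the colour of the pixel it came from, which is what makes the partial point cloud "textured".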

Repulsive UDF

The standard UDF formulation allows the generation of dense points around surfaces by iteratively shifting query points guided by their UDF values and UDF gradients. However, the standard formulation favors regions of high curvature and thus results in hole artifacts. We introduce repulsion forces among the query points to produce a uniform point distribution on the surface.
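
To make the idea concrete, here is a rough sketch of alternating UDF-guided shifting with a repulsion step. It is not the NU-MCC formulation: the analytic unit-sphere UDF stands in for the learned field, and the Gaussian-weighted k-nearest-neighbor repulsion, along with the parameters `k`, `step`, and `radius`, are assumptions chosen for illustration.

```python
import numpy as np

def sphere_udf(p):
    """Unsigned distance to the unit sphere (stand-in for a learned UDF)."""
    return np.abs(np.linalg.norm(p, axis=1) - 1.0)

def sphere_udf_grad(p):
    """Gradient of sphere_udf with respect to the query points."""
    r = np.linalg.norm(p, axis=1, keepdims=True)
    return np.sign(r - 1.0) * p / np.maximum(r, 1e-12)

def shift_to_surface(p, udf, grad):
    """Move each point against the normalised UDF gradient by its UDF value."""
    g = grad(p)
    g = g / np.maximum(np.linalg.norm(g, axis=1, keepdims=True), 1e-12)
    return p - udf(p)[:, None] * g

def repulsion(p, k=8, radius=0.2):
    """Mean push away from the k nearest neighbours, weighted by a Gaussian falloff."""
    diff = p[:, None, :] - p[None, :, :]      # (N, N, 3), O(N^2) for clarity
    dist = np.linalg.norm(diff, axis=-1)      # (N, N)
    np.fill_diagonal(dist, np.inf)            # ignore self-distances
    idx = np.argsort(dist, axis=1)[:, :k]     # (N, k) nearest neighbours
    rows = np.arange(p.shape[0])[:, None]
    nd = dist[rows, idx]                      # (N, k) neighbour distances
    unit = diff[rows, idx] / np.maximum(nd, 1e-12)[..., None]
    weight = np.exp(-(nd / radius) ** 2)[..., None]
    return (weight * unit).mean(axis=1)       # (N, 3), norm at most 1

def repulsive_udf_sampling(p, udf, grad, iters=10, k=8, step=0.05, radius=0.2):
    """Alternate surface shifting and repulsion, ending on the surface."""
    for _ in range(iters):
        p = shift_to_surface(p, udf, grad)
        p = p + step * repulsion(p, k=k, radius=radius)
    return shift_to_surface(p, udf, grad)
```

In this sketch the repulsion acts in full 3D and the following shift pulls the points back onto the surface; restricting the repulsion to the tangent plane would be another reasonable design.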

Results

We show our 3D reconstruction results from single-view RGB-D images and compare them with MCC.

Trained on the CO3D dataset, NU-MCC generalizes to in-the-wild novel object categories not seen during training. Here, the RGB-D images were captured with an iPhone 14 Pro using the Record3D app.

	Input	Seen	MCC	Ours

We show 3D reconstructions of images generated with Stable Diffusion (via DreamStudio). The depth is estimated using an off-the-shelf depth prediction model and the mask is obtained using Segment Anything.

	Input (prompt)	Seen	MCC	Ours
	a cavachon sitting on a park in singapore
	a gudetama on a street in zurich
	an otter posing full body
	a marshmallow resembling a face on a table

Here, we show the results from the CO3D validation set.

	Input	Seen	MCC	Ours

NU-MCC generalizes to novel object categories from ImageNet. We show that challenging object categories, such as lawnmower and cannon, can be reconstructed reasonably well. The depth is estimated using an off-the-shelf depth prediction model and the mask is obtained using Segment Anything.

	Input	Seen	MCC	Ours

We demonstrate NU-MCC's capability to reconstruct scenes. Here, the model is trained on the Hypersim dataset and we show the reconstructions on novel scenes not seen during training.

Note: In the turntable animations, we trim the walls so that two sides of each scene are open, giving a better view of the interior. Additionally, unlike MCC, we do not superimpose the seen points on the reconstructions.

	Input	Seen	MCC	Ours

Here, we show the generalization capability of NU-MCC, trained on the Hypersim dataset, to novel scenes from Taskonomy.

Note: In the turntable animations, we trim the walls so that two sides of each scene are open, giving a better view of the interior. Additionally, unlike MCC, we do not superimpose the seen points on the reconstructions.