DUSt3R: Geometric 3D Vision Made Easy

Which base task does DUSt3R perform, from which it can directly solve downstream tasks such as camera pose estimation, depth estimation, and 3D reconstruction?


Dense and Unconstrained Stereo 3D Reconstruction of arbitrary image collections. Given a pair of images, the network regresses pointmaps, where a pointmap is a dense 2D field of 3D points associated with its corresponding RGB image.


Give an overview of the DUSt3R architecture.


Two views of a scene $(I^1, I^2)$ are first encoded in a Siamese manner with a **shared ViT encoder**. The resulting token representations $F^1$ and $F^2$ are then passed to **two transformer decoders** that constantly exchange information **via cross-attention**. Finally, **two regression heads output the two corresponding pointmaps** and associated **confidence maps**. Importantly, the two pointmaps are expressed in the same coordinate frame of the first image $I^1$.
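The data flow above can be sketched at the shape level. This is a toy NumPy stand-in, not the actual model: the real encoder and decoders are multi-block ViTs, and all dimensions, weight initializations, and the single-head attention here are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 32                       # toy image size
P = 8                            # patch size
N = (H // P) * (W // P)          # 16 tokens per view
d = 16                           # toy token dimension

W_enc = rng.standard_normal((P * P * 3, d)) * 0.02   # encoder weights, shared by both views
W_head = rng.standard_normal((d, P * P * 4)) * 0.02  # head weights (toy)

def encode(img):
    # Patchify (H, W, 3) -> (N, P*P*3), then project to token dimension.
    p = img.reshape(H // P, P, W // P, P, 3).transpose(0, 2, 1, 3, 4)
    return p.reshape(N, P * P * 3) @ W_enc           # (N, d)

def cross_attend(q, kv):
    # Single-head cross-attention: each branch reads the other view's tokens.
    s = q @ kv.T / np.sqrt(d)
    a = np.exp(s - s.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)
    return q + a @ kv                                # residual update

def head(tokens):
    # Map tokens back to a dense (H, W, 4) grid: 3D point + confidence logit.
    out = (tokens @ W_head).reshape(H // P, W // P, P, P, 4)
    out = out.transpose(0, 2, 1, 3, 4).reshape(H, W, 4)
    return out[..., :3], 1.0 + np.exp(out[..., 3])   # pointmap, confidence > 1

I1, I2 = rng.random((H, W, 3)), rng.random((H, W, 3))
F1, F2 = encode(I1), encode(I2)                      # Siamese, shared weights
G1, G2 = cross_attend(F1, F2), cross_attend(F2, F1)  # decoders exchange info
X11, C11 = head(G1)                                  # both pointmaps are
X21, C21 = head(G2)                                  # expressed in I1's frame
```

The point of the sketch is the symmetry: both views pass through the same encoder weights, while the two decoder branches differ only in which view they query and which they attend to.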

What are the inputs and outputs of the DUSt3R network and what data do you need to setup this input/output?


The input is two RGB images that correspond to two views of a scene: $I^1, I^2 \in \mathbb{R}^{W\times H \times 3}$. The outputs are the two corresponding pointmaps, expressed in the coordinate frame of $I^1$: $X^{1,1}, X^{2,1} \in \mathbb{R}^{W\times H \times 3}$, with associated confidence maps $C^{1,1}, C^{2,1} \in \mathbb{R}^{W\times H}$.

To construct these outputs, you need the camera intrinsics $K \in \mathbb{R}^{3 \times 3}$, the camera extrinsics (world-to-camera) $P \in \mathbb{R}^{4 \times 4}$, and the depthmap $D \in \mathbb{R}^{W\times H}$. The pointmap $X$, expressed in the camera's own coordinate frame, is obtained by unprojecting the depthmap: $X_{i,j} = K^{-1}\,[i D_{i,j},\ j D_{i,j},\ D_{i,j}]^\top$. To express pointmap $X^n$ from camera $n$ in camera $m$'s coordinate frame: $X^{n,m} = P_m P_n^{-1}\, h(X^n)$, where $h:(x,y,z)\mapsto(x,y,z,1)$ is the homogeneous mapping.
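The two formulas above translate directly to NumPy. The function names (`make_pointmap`, `to_frame`) and the toy intrinsics in the usage example are illustrative, not from the paper.

```python
import numpy as np

def make_pointmap(D, K):
    """Unproject depthmap D (H, W) into a pointmap X (H, W, 3) in the
    camera's own frame: X_{i,j} = K^{-1} [i*D_{i,j}, j*D_{i,j}, D_{i,j}]^T."""
    H, W = D.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))   # pixel coordinates (i, j)
    hom = np.stack([u * D, v * D, D], axis=-1)       # (H, W, 3)
    # Row-vector convention: hom @ inv(K).T == inv(K) @ hom per pixel.
    return hom @ np.linalg.inv(K).T

def to_frame(X_n, P_n, P_m):
    """Re-express pointmap X^n (camera n's frame) in camera m's frame:
    X^{n,m} = P_m P_n^{-1} h(X^n), with h the homogeneous lift."""
    Xh = np.concatenate([X_n, np.ones(X_n.shape[:2] + (1,))], axis=-1)
    T = P_m @ np.linalg.inv(P_n)                     # (4, 4) relative transform
    return (Xh @ T.T)[..., :3]

# Toy example: f = 2, principal point (1, 1), constant depth 2.
K = np.array([[2.0, 0.0, 1.0], [0.0, 2.0, 1.0], [0.0, 0.0, 1.0]])
D = np.full((4, 4), 2.0)
X = make_pointmap(D, K)          # z-component equals the depth everywhere
```

With identity extrinsics on both sides, `to_frame(X, np.eye(4), np.eye(4))` returns `X` unchanged, which is a quick sanity check for the convention.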

Which loss function is used to train DUSt3R?


**Confidence-aware 3D regression loss.** Given the ground-truth pointmaps $\bar{X}^{1,1}$ and $\bar{X}^{2,1}$, along with the two corresponding sets of valid pixels $\mathcal{D}^1, \mathcal{D}^2 \subseteq \{1\ldots W\}\times\{1\ldots H\}$ on which the ground-truth is defined, the regression loss for a valid pixel $i \in \mathcal{D}^v$ in view $v \in \{1,2\}$ is simply defined as the Euclidean distance

$$\ell_{\text{reg}}(v,i) = \left\| \frac{1}{z} X^{v,1}_{i} - \frac{1}{\bar{z}} \bar{X}^{v,1}_{i} \right\|.$$

To handle the scale ambiguity between prediction and ground-truth, the predicted and ground-truth pointmaps are normalized by the scaling factors $z = \operatorname{norm}(X^{1,1}, X^{2,1})$ and $\bar{z} = \operatorname{norm}(\bar{X}^{1,1}, \bar{X}^{2,1})$, respectively, which simply represent the average distance of all valid points to the origin:

$$\operatorname{norm}(X^1, X^2) = \frac{1}{|\mathcal{D}^1| + |\mathcal{D}^2|} \sum_{v \in \{1,2\}} \sum_{i \in \mathcal{D}^v} \| X^v_{i} \|.$$

As some parts of the image are harder to predict than others, the network also predicts a score for each pixel representing the confidence the network has about that particular pixel. The final training objective is the confidence-weighted regression loss over all valid pixels:

$$\mathcal{L}_{\text{conf}} = \sum_{v \in \{1,2\}} \sum_{i \in \mathcal{D}^v} C^{v,1}_i\, \ell_{\text{reg}}(v,i) - \alpha \log C^{v,1}_i,$$

where $C^{v,1}_i$ is the confidence score for pixel $i$ and $\alpha$ is a hyper-parameter controlling the regularization. To ensure a strictly positive confidence, they define $C^{v,1}_i = 1 + \exp \widetilde{C}^{v,1}_i > 1$. This has the effect of forcing the network to extrapolate in harder areas, e.g. those covered by a single view.
Training the network with this objective allows it to estimate confidence scores without explicit supervision.
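The loss can be sketched in NumPy. Function names and the $\alpha$ value in the example are illustrative; a real training loop would use an autodiff framework so gradients flow to both the pointmaps and the confidence logits.

```python
import numpy as np

def norm_factor(X1, X2, M1, M2):
    """Average distance of all valid points to the origin (scale z)."""
    d = np.concatenate([np.linalg.norm(X1[M1], axis=-1),
                        np.linalg.norm(X2[M2], axis=-1)])
    return d.mean()

def conf_loss(X_pred, X_gt, C_tilde, masks, alpha=0.2):
    """L_conf = sum over views and valid pixels of C_i * l_reg(v, i)
    - alpha * log C_i, with C_i = 1 + exp(C_tilde_i) > 1."""
    X1p, X2p = X_pred
    X1g, X2g = X_gt
    M1, M2 = masks
    z = norm_factor(X1p, X2p, M1, M2)        # scale of the prediction
    z_bar = norm_factor(X1g, X2g, M1, M2)    # scale of the ground truth
    loss = 0.0
    for Xp, Xg, Ct, M in [(X1p, X1g, C_tilde[0], M1),
                          (X2p, X2g, C_tilde[1], M2)]:
        C = 1.0 + np.exp(Ct[M])              # strictly positive confidence
        l_reg = np.linalg.norm(Xp[M] / z - Xg[M] / z_bar, axis=-1)
        loss += np.sum(C * l_reg - alpha * np.log(C))
    return loss
```

A quick consistency check: with a perfect prediction and zero confidence logits, $\ell_{\text{reg}}$ vanishes and the loss reduces to the regularizer $-\alpha \log 2$ summed over valid pixels, which is why the confidence term rewards high confidence only where the regression error is small.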

Machine Learning Research Flashcards is a collection of flashcards associated with scientific research papers in the field of machine learning. Best used with Anki or Obsidian.