Multi-View Stereo by Temporal Nonparametric Fusion

https://aaltoml.github.io/GP-MVS/

Problem: Accurate, real-time dense depth estimation from a monocular sequence

Idea: Similar views should have similar representations in latent-space

Solution: Soft-constrain the bottleneck layer of the depth estimation network using Gaussian Processes (GPs) with an appropriate pose kernel

First, consider the problem of estimating depth for a single reference view (here using multiview constraints from auxiliary frames)

Network Architecture for Single Frame Depth Estimation

Exact same architecture as in MVDepthNet: Real-time Multiview Depth Estimation Neural Network

Problem: Leverage multiview geometry in real-time dense depth network (assuming known poses of images from odometry source)

Idea: Cost volumes can directly encode geometric constraints so that networks don't have to

Solution: Feed cost volume along with input image into network

Methods prior to this work:

Single-view methods do not generalize well due to lack of constraints

Difficult to parameterize epipolar geometry into network

Cannot easily augment dataset due to multiview constraints

Cost Volume Construction

Construct cost-volume with respect to a reference frame and auxiliary frames

Discretize inverse depth (here using 64 values)

Note: No aggregation window is used; the cost is computed from single pixels so that fine details are preserved
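As a rough illustration of the construction above, the following NumPy sketch builds a per-pixel cost volume for one auxiliary frame. It is not the MVDepthNet implementation: the depth range (`d_min`, `d_max`), the nearest-neighbour sampling, the sum of absolute differences as the cost, and the zero cost for pixels that project outside the auxiliary image are all assumptions made for this example.

```python
import numpy as np

def inverse_depth_samples(num=64, d_min=0.5, d_max=50.0):
    """Depths whose inverses are evenly spaced (d_min and d_max are assumed values)."""
    inv = np.linspace(1.0 / d_max, 1.0 / d_min, num)
    return 1.0 / inv

def cost_volume(ref, aux, K, R, t, depths):
    """Per-pixel absolute difference between ref and aux for each depth hypothesis.

    ref, aux: (H, W, 3) float images; K: (3, 3) intrinsics.
    R, t: rotation and translation taking reference-camera points to the auxiliary camera.
    Returns an array of shape (len(depths), H, W).
    """
    H, W, _ = ref.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # (3, H*W)
    rays = np.linalg.inv(K) @ pix                                      # back-projected rays
    ref_flat = ref.reshape(-1, 3)
    volume = np.zeros((len(depths), H, W))
    for i, d in enumerate(depths):
        proj = K @ (R @ (rays * d) + t[:, None])   # project 3D points into the auxiliary view
        z = proj[2]
        in_front = z > 1e-6
        z_safe = np.where(in_front, z, 1.0)
        x = np.round(proj[0] / z_safe).astype(int)  # nearest-neighbour lookup (assumption)
        y = np.round(proj[1] / z_safe).astype(int)
        valid = in_front & (x >= 0) & (x < W) & (y >= 0) & (y < H)
        warped = np.zeros_like(ref_flat)
        warped[valid] = aux[y[valid], x[valid]]
        cost = np.abs(ref_flat - warped).sum(axis=1)  # single-pixel cost, no window
        cost[~valid] = 0.0                            # out-of-view pixels (assumption)
        volume[i] = cost.reshape(H, W)
    return volume
```

With an identity relative pose and identical images, every depth hypothesis yields zero cost, which is a quick sanity check on the warping.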

Network Architecture

Feed RGB reference frame (depth 3) and cost volume (depth 64) into encoder-decoder network with skip connections

Data augmentation is now straightforward because all inputs and ground-truth are with respect to the same frame

Notable Ablation Studies

Example of how adding RGB image to inputs provides finer details as compared to cost-volume alone:

Impact of number of auxiliary frames on accuracy of depth estimation:

Problems

While MVDepthNet uses a window of frames, it cannot leverage information beyond this window

If a place is revisited, depth is estimated independently each time, so the estimates lack consistency

To address this issue, latent representations can enforce temporal consistency between different frames

Pose-Kernel Gaussian Process Prior

Gaussian Process Reminder

Nonparametric method often used for regression

Observed function values can be used to infer posterior distribution over functions

Condition on training data to retrieve predictions for test data

Figure generated using https://distill.pub/2019/visual-exploration-gaussian-processes/
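To make the reminder concrete, here is a minimal GP regression sketch in NumPy. It uses a squared-exponential kernel on scalar inputs purely for illustration (the paper uses a Matern kernel on poses), and the function names and default hyperparameters are assumptions.

```python
import numpy as np

def rbf(a, b, ell=1.0, amp=1.0):
    """Squared-exponential kernel between 1-D input arrays a and b."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return amp ** 2 * np.exp(-0.5 * d2 / ell ** 2)

def gp_posterior(x_train, y_train, x_test, noise_std=0.1):
    """Posterior mean and covariance at x_test, conditioned on noisy training data."""
    K = rbf(x_train, x_train) + noise_std ** 2 * np.eye(len(x_train))
    Ks = rbf(x_test, x_train)
    Kss = rbf(x_test, x_test)
    L = np.linalg.cholesky(K)                       # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = Ks @ alpha
    v = np.linalg.solve(L, Ks.T)
    cov = Kss - v.T @ v
    return mean, cov
```

Near the training inputs the posterior mean follows the observations; far from them it reverts to the prior (zero mean, prior variance).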

Pose Kernel

For the kernel to assess similarity between poses, we need a distance metric between rigid-body poses:

Matern kernel used:

Two learnable parameters here: amplitude $\gamma$ and length-scale $\ell$.
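The distance and kernel equations are not reproduced in these notes. As best recalled from the GP-MVS paper (worth verifying against it), the pose distance is $D[P_i, P_j] = \sqrt{\|t_i - t_j\|^2 + \tfrac{2}{3}\operatorname{tr}(I - R_i^\top R_j)}$ and the kernel is a Matern 3/2 function of that distance, $\kappa(P, P') = \gamma^2 (1 + \sqrt{3} D / \ell) \exp(-\sqrt{3} D / \ell)$. A small NumPy sketch under that assumption, with illustrative function names:

```python
import numpy as np

def pose_distance(R_i, t_i, R_j, t_j):
    """Distance between two rigid-body poses (formula assumed from the paper)."""
    trans = np.sum((t_i - t_j) ** 2)
    rot = (2.0 / 3.0) * np.trace(np.eye(3) - R_i.T @ R_j)
    return np.sqrt(max(trans + rot, 0.0))  # clamp tiny negative round-off

def matern32(dist, gamma=1.0, ell=1.0):
    """Matern 3/2 kernel value for a given distance."""
    r = np.sqrt(3.0) * dist / ell
    return gamma ** 2 * (1.0 + r) * np.exp(-r)

def pose_kernel_matrix(rotations, translations, gamma=1.0, ell=1.0):
    """Kernel matrix over a list of poses given as (R, t) pairs."""
    n = len(rotations)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            d = pose_distance(rotations[i], translations[i],
                              rotations[j], translations[j])
            K[i, j] = matern32(d, gamma, ell)
    return K
```

Identical poses give distance zero and kernel value $\gamma^2$; the kernel decays as either the translation or the rotation difference grows.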

Gaussian Process Formulation

Independent GP priors are placed on all latent space values $z_j$ for $j = 1, 2, \ldots, 512 \times 8 \times 10$

The third learnable GP parameter is $\sigma$, which is the noise standard deviation.

Graphical model overview (online method):

Estimation

The batch method accounts for relationships between all pairs of poses, while the online method has only linear complexity in the number of frames

Batch (Offline)

Solve GP for all N frames

Efficient because matrix $\mathbf{C}$ only depends on kernel, which only depends on poses, so inverse can be shared across all $512 \times 8 \times 10$ subproblems.

Posterior mean passed through decoder to retrieve the depth map

Matrix inversion scales cubically with the number of poses and becomes too expensive beyond a few hundred frames
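The sharing argument above can be sketched in a few lines of NumPy. Here `K_pose` is the N x N pose-kernel matrix and `Y` stacks the encoder outputs as an N x D matrix (D would be $512 \times 8 \times 10$ in the paper). Taking the shared matrix to be the kernel matrix plus the noise term $\sigma^2 I$ is an assumption about the notation, and the function name is illustrative.

```python
import numpy as np

def batch_gp_posterior_mean(K_pose, Y, noise_std):
    """Posterior mean of every latent dimension at the N observed poses.

    K_pose: (N, N) kernel matrix over poses.
    Y: (N, D) encoder outputs, one row per frame.
    One linear solve is shared by all D independent GPs because the
    system matrix depends only on the poses.
    """
    n = K_pose.shape[0]
    C = K_pose + noise_std ** 2 * np.eye(n)   # assumed form of the shared matrix
    return K_pose @ np.linalg.solve(C, Y)     # (N, D) smoothed latent codes
```

Each row of the result would then be reshaped and passed through the decoder. The cost of the solve grows cubically in N, which is the limitation noted above.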

Online

Relax GP graphical model to be a Markov chain

Reduce the complexity of the GP from $O(n^3)$ to $O(n)$ by reformulating it as a Kalman filtering problem

Convert the problem to a state-space model

Latent space encoding conditioned on all image-pose pairs up until current pair
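A hedged sketch of the online idea: a Matern 3/2 GP over a one-dimensional "time" axis has an exact two-dimensional state-space form, so it can be run as a Kalman filter. The sketch below assumes the step between consecutive frames is their pose distance (this is a reading of the paper, not confirmed by these notes) and uses illustrative names throughout. The covariance recursion does not depend on the observations, so it is shared across all latent dimensions.

```python
import numpy as np

def matern32_transition(delta, gamma, ell):
    """Discrete-time transition A, process noise Q, stationary covariance P_inf."""
    lam = np.sqrt(3.0) / ell
    A = np.exp(-lam * delta) * np.array([[1.0 + lam * delta, delta],
                                         [-lam ** 2 * delta, 1.0 - lam * delta]])
    P_inf = np.diag([gamma ** 2, lam ** 2 * gamma ** 2])
    Q = P_inf - A @ P_inf @ A.T
    return A, Q, P_inf

def online_gp_filter(Y, deltas, gamma, ell, noise_std):
    """Filtered latent codes, each conditioned on frames up to and including itself.

    Y: (N, D) encoder outputs; deltas: (N,) step sizes, deltas[0] is unused.
    """
    N, D = Y.shape
    h = np.array([1.0, 0.0])                 # observe the function value only
    _, _, P = matern32_transition(0.0, gamma, ell)
    m = np.zeros((2, D))                     # state mean for every latent dimension
    out = np.empty((N, D))
    for i in range(N):
        if i > 0:                            # predict step
            A, Q, _ = matern32_transition(deltas[i], gamma, ell)
            m = A @ m
            P = A @ P @ A.T + Q
        s = h @ P @ h + noise_std ** 2       # innovation variance (scalar)
        k = (P @ h) / s                      # Kalman gain, shape (2,)
        m = m + np.outer(k, Y[i] - h @ m)    # update all D dimensions at once
        P = P - np.outer(k, h @ P)
        out[i] = h @ m
    return out
```

For the last frame, the filtered mean coincides with the batch GP posterior mean under a Matern 3/2 kernel on cumulative step distance; earlier frames differ because the filter does not see later observations.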

Experiments

Adding the GP prior generally improves results

Mistakes made early in online method can take more observations to correct since they are propagated forward

Comparison of different kernels on results

This method with 2 frames outperforms other methods with 5 frames

Discussion

Online Method Limitations

No mention of offline runtime?

Switching to online method loses many of the benefits proposed by this formulation (online method has an implicit window-size)

Generalization

While the method is very efficient and improves results in indoor scenes, generalization should be discussed further

Does this method help for some motions more than others? (Rotation vs. translation)
- MVDepthNet cost volume will lack information for pure rotation?
- This method can propagate future information back into past, while MVDepthNet cannot

Gaussian Process lacks information about size of scene
- Length-scale parameter learned, but may not generalize to larger scenes
- Influence of poses within 1m very different depending on indoor vs. outdoor scene
- Kernels could be learned, but pose information alone may not be sufficient

Computation may be severely limited if more information was included, from Section 3.3:
- "Because the likelihood is Gaussian and all the GPs share the same poses at which the covariance function is evaluated, we may solve all the 512×8×10 GP regression problems with one matrix inversion. This is due to the posterior covariance only being a function of the input poses, not values of learnt representations of images (i.e., y does not appear in the posterior variance terms in Eq. 5)"

Graphical Models and Refinement of Latent Space

Interesting to think about ways of utilizing latent space in sequential estimation

CodeSLAM propagates observed errors to refine latent space (geometric and photometric errors should be minimized)

This work implicitly constrains the latent space according to a hand-crafted kernel (similar poses should have similar latent codes)

Uncertainty Estimation

Uncertainty estimation is especially useful for the downstream tasks we are interested in

Is the uncertainty estimation from GPs useful for certain tasks?
- Useful for determining if similar data was previously seen
- Maybe not possible to propagate due to skip connections?

Are there benefits to using GP uncertainty instead of having networks directly predict the uncertainty as an output? (https://arxiv.org/pdf/1703.04977.pdf)
- To model uncertainty, some networks predict a variance alongside each output and train with a Gaussian negative log-likelihood loss
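The loss formula itself is missing from these notes. One common form, from the linked Kendall and Gal paper, is the heteroscedastic regression loss $\frac{1}{N}\sum_i \frac{1}{2\sigma_i^2}\|y_i - f(x_i)\|^2 + \frac{1}{2}\log\sigma_i^2$, where the network predicts $s_i = \log\sigma_i^2$ for numerical stability. A minimal NumPy sketch (function and argument names are illustrative):

```python
import numpy as np

def heteroscedastic_loss(pred, target, log_var):
    """Gaussian negative log-likelihood with a predicted per-element log-variance.

    pred, target, log_var: arrays of the same shape.
    A large predicted variance down-weights the residual but is penalised
    by the 0.5 * log_var term.
    """
    residual = (pred - target) ** 2
    return np.mean(0.5 * np.exp(-log_var) * residual + 0.5 * log_var)
```

With `log_var` fixed at zero this reduces to half the mean squared error, so the variance head only changes the loss where the network chooses to express uncertainty.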

GPs can be viewed as an infinitely wide neural network layer with an i.i.d. prior on the weights

GPs give more flexible models, while Bayesian NNs give more scalable models

With enough data, is it sufficient to use neural networks?

Examples of neural network uncertainty usage in dense SLAM
- D3VO: https://openaccess.thecvf.com/content_CVPR_2020/papers/Yang_D3VO_Deep_Depth_Deep_Pose_and_Deep_Uncertainty_for_Monocular_CVPR_2020_paper.pdf
- CodeSLAM: https://arxiv.org/pdf/1804.00874.pdf