# Multi-View Stereo by Temporal Nonparametric Fusion

https://aaltoml.github.io/GP-MVS/

Problem: Accurate, real-time dense depth estimation from a monocular sequence

Idea: Similar views should have similar representations in latent-space

Solution: Soft-constrain the bottleneck layer of the depth estimation network using Gaussian processes (GPs) with an appropriate pose kernel

First, we go into the problem of estimating depth for a single view (using multiview constraints in this case)

# Network Architecture for Single Frame Depth Estimation

Exact same architecture as in MVDepthNet: Real-time Multiview Depth Estimation Neural Network

Problem: Leverage multiview geometry in real-time dense depth network (assuming known poses of images from odometry source)

Idea: Cost volumes can directly encode geometric constraints so that networks don't have to

Solution: Feed cost volume along with input image into network

Methods prior to this work:

- Single-view methods do not generalize well due to lack of constraints

- Difficult to parameterize epipolar geometry into network

- Cannot easily augment dataset due to multiview constraints

### Cost Volume Construction

- Construct cost-volume with respect to a reference frame and auxiliary frames

- Discretize inverse depth (here using 64 values)

- Note: No window size used for cost aggregation (only single pixel used) so that details are preserved
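A minimal NumPy sketch of the construction above, for a single auxiliary frame. Only the 64 inverse-depth samples and the single-pixel cost come from these notes. The depth range (0.5 m to 50 m), the sum of absolute RGB differences as the cost, the nearest-neighbour lookup, the zero cost for pixels that project outside the auxiliary image, and all function and argument names are assumptions made for illustration.

```python
import numpy as np

def inverse_depth_samples(n=64, d_min=0.5, d_max=50.0):
    """n depth hypotheses spaced uniformly in inverse depth (range is assumed)."""
    return np.linspace(1.0 / d_max, 1.0 / d_min, n)

def cost_volume(ref, aux, K, R, t, inv_depths):
    """Per-pixel matching cost of `ref` against `aux` for each depth hypothesis.

    ref, aux : (H, W, 3) float images
    K        : (3, 3) camera intrinsics shared by both frames
    R, t     : rotation (3, 3) and translation (3,) taking points from the
               reference camera frame to the auxiliary camera frame
    Returns an array of shape (len(inv_depths), H, W).
    """
    H, W, _ = ref.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pixels = np.stack([u.ravel(), v.ravel(), np.ones(H * W)]).astype(float)
    rays = np.linalg.inv(K) @ pixels            # back-projected rays with z = 1
    ref_flat = ref.reshape(-1, 3)
    volume = np.zeros((len(inv_depths), H, W))
    for i, inv_d in enumerate(inv_depths):
        points = rays / inv_d                   # 3D points at depth 1 / inv_d
        proj = K @ (R @ points + t[:, None])    # project into the auxiliary view
        z = proj[2]
        in_front = z > 1e-6
        safe_z = np.where(in_front, z, 1.0)
        x = np.rint(proj[0] / safe_z).astype(int)
        y = np.rint(proj[1] / safe_z).astype(int)
        valid = in_front & (x >= 0) & (x < W) & (y >= 0) & (y < H)
        warped = aux[np.clip(y, 0, H - 1), np.clip(x, 0, W - 1)]
        # Single-pixel cost, no aggregation window
        cost = np.abs(ref_flat - warped).sum(axis=1)
        volume[i] = np.where(valid, cost, 0.0).reshape(H, W)
    return volume
```

With an identity pose and `aux` equal to `ref`, every slice of the volume is zero. With a sideways translation, the slice whose depth matches the true scene depth has the lowest cost.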

### Network Architecture

Feed RGB reference frame (depth 3) and cost volume (depth 64) into encoder-decoder network with skip connections

Data augmentation is now straightforward because all inputs and ground-truth are with respect to the same frame

### Notable Ablation Studies

Example of how adding RGB image to inputs provides finer details as compared to cost-volume alone:

Impact of number of auxiliary frames on accuracy of depth estimation:

### Problems

- While MVDepthNet uses a window of frames, it cannot leverage information beyond this window

- If a place is revisited, depth is estimated independently each time, so the estimates lack consistency

- To address this issue, latent representations can enforce temporal consistency between different frames

# Pose-Kernel Gaussian Process Prior

### Gaussian Process Reminder

- Nonparametric method often used for regression

- Observed function values can be used to infer posterior distribution over functions

- Condition on training data to retrieve predictions for test data
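As a reminder of what conditioning looks like in practice, here is a small sketch of GP regression on 1-D inputs. The squared-exponential kernel, the default hyperparameters, and the function names are assumptions chosen to keep the reminder short. They are not the pose kernel used later in these notes.

```python
import numpy as np

def rbf(a, b, length_scale=1.0, amplitude=1.0):
    """Squared-exponential kernel between two 1-D input arrays."""
    d = a[:, None] - b[None, :]
    return amplitude**2 * np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(x_train, y_train, x_test, noise_std=0.1):
    """Posterior mean and covariance at x_test given noisy observations."""
    K = rbf(x_train, x_train) + noise_std**2 * np.eye(len(x_train))
    K_s = rbf(x_test, x_train)
    K_ss = rbf(x_test, x_test)
    L = np.linalg.cholesky(K)
    # alpha = K^{-1} y via two triangular solves
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s @ alpha
    v = np.linalg.solve(L, K_s.T)
    cov = K_ss - v.T @ v
    return mean, cov
```

Close to the training inputs the posterior mean follows the observations and the variance shrinks. Far from them the posterior falls back to the prior (zero mean, variance equal to the squared amplitude).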

Figure generated using https://distill.pub/2019/visual-exploration-gaussian-processes/

### Pose Kernel

- For the kernel to assess similarity between poses, we need a distance metric between rigid-body poses:

- Matern kernel used:

- Two learnable parameters here: amplitude $\gamma$ and length-scale $\ell$.
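The two equations referred to above did not survive in these notes. The sketch below assumes the distance $D[P_i, P_j] = \sqrt{\lVert t_i - t_j \rVert^2 + \tfrac{2}{3}\,\mathrm{tr}(I - R_i^\top R_j)}$ and the Matérn 3/2 kernel $\kappa = \gamma^2 (1 + \sqrt{3} D / \ell) \exp(-\sqrt{3} D / \ell)$. Both forms should be checked against the paper. The function names are hypothetical.

```python
import math

def pose_distance(R_i, t_i, R_j, t_j):
    """Assumed pose distance: translation term plus a rotation term.

    R_* are 3x3 rotation matrices (nested lists), t_* are length-3 translations.
    """
    trans_sq = sum((a - b) ** 2 for a, b in zip(t_i, t_j))
    # tr(R_i^T R_j) equals the sum of elementwise products of the two matrices
    trace = sum(R_i[r][c] * R_j[r][c] for r in range(3) for c in range(3))
    # Clamp at zero to guard against rounding when the rotations coincide
    rot_sq = (2.0 / 3.0) * max(0.0, 3.0 - trace)
    return math.sqrt(trans_sq + rot_sq)

def matern32(distance, amplitude, length_scale):
    """Matern 3/2 covariance as a function of a non-negative distance."""
    s = math.sqrt(3.0) * distance / length_scale
    return amplitude ** 2 * (1.0 + s) * math.exp(-s)
```

Identical poses have distance zero and covariance $\gamma^2$. The covariance decays as poses move apart, at a rate set by $\ell$.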

### Gaussian Process Formulation

- Independent GP priors are placed on all latent space values $z_j$ for $j = 1, 2, \ldots, 512 \times 8 \times 10$

- The third learnable GP parameter is $\sigma$, which is the noise standard deviation.
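Written out for each latent dimension $j$, the model can be sketched as follows. The symbols $y_{j,i}$, $t_i$ and $\varepsilon_{j,i}$ are not defined elsewhere in these notes, so treat the notation as an assumption to check against the paper.

$$
z_j(t) \sim \mathrm{GP}\big(0,\ \kappa(P[t], P[t'])\big), \qquad
y_{j,i} = z_j(t_i) + \varepsilon_{j,i}, \qquad
\varepsilon_{j,i} \sim \mathcal{N}(0, \sigma^2)
$$

Here $y_{j,i}$ is the $j$-th encoder output for frame $i$, $P[t]$ is the camera pose at time $t$, and $\kappa$ is the pose kernel.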

Graphical model overview (online method):

# Estimation

Batch method can account for relationships between all poses while online method only has linear complexity

### Batch (Offline)

- Solve GP for all N frames

- Efficient because matrix $\mathbf{C}$ only depends on kernel, which only depends on poses, so inverse can be shared across all $512 \times 8 \times 10$ subproblems.

- Posterior mean passed through decoder to retrieve the depth map

- Cost of the matrix inversion grows with the number of poses ($O(N^3)$) and becomes too expensive after hundreds of frames
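A sketch of why the batch solve is cheap per latent dimension: the weights that map observations to posterior means depend only on the pose covariance matrix and the noise level, so they are computed once and applied to every column of the encoder outputs. This assumes the posterior mean has the standard GP form $\mathbf{C}(\mathbf{C} + \sigma^2 \mathbf{I})^{-1} \mathbf{y}_j$. The function name and array layout are hypothetical.

```python
import numpy as np

def batch_gp_latents(C, Y, noise_std):
    """Posterior mean and variance of the latents at the N observed poses.

    C : (N, N) pose-kernel covariance matrix (depends on poses only)
    Y : (N, D) encoder outputs, one row per frame, one column per latent
        dimension (D would be 512 * 8 * 10 in the setting of these notes)
    """
    N = C.shape[0]
    A = C + noise_std**2 * np.eye(N)
    # W = C A^{-1}; C and A are symmetric, so this is the transpose of A^{-1} C.
    # It does not involve Y, which is why one solve serves all D columns.
    W = np.linalg.solve(A, C).T
    mean = W @ Y
    var = np.diag(C - W @ C)   # identical for every latent dimension
    return mean, var
```

The posterior means, reshaped back to the bottleneck shape, are what would then be passed through the decoder.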

### Online

- Relax GP graphical model to be a Markov chain

- Reduce the complexity of GP from $O(n^3)$ to $O(n)$, by reformulating it as a Kalman filtering problem

- Convert the problem to a state-space model

- Latent space encoding conditioned on all image-pose pairs up until current pair
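The steps above can be sketched as a Kalman filter over the standard state-space form of a Matérn 3/2 GP, where the state holds the latent value and its derivative. The sketch assumes that the "time step" between consecutive frames is their pose distance and that only forward filtering is used. The function name and array layout are hypothetical.

```python
import numpy as np

def online_gp_filter(dists, Y, amplitude, length_scale, noise_std):
    """Filtered latent means for a chain of frames.

    dists : (N,) distance between frame i and frame i-1 (dists[0] is ignored)
    Y     : (N, D) encoder outputs, one row per frame
    Returns (N, D) filtered means; row i uses observations 0..i only.
    """
    lam = np.sqrt(3.0) / length_scale
    P_inf = np.diag([amplitude**2, lam**2 * amplitude**2])  # stationary covariance
    N, D = Y.shape
    m = np.zeros((2, D))   # state mean: (latent value, derivative) per dimension
    P = P_inf.copy()       # state covariance, shared by all D dimensions
    out = np.zeros((N, D))
    for i in range(N):
        if i > 0:
            # Predict: closed-form transition matrix for the Matern 3/2 model
            dt = dists[i]
            A = np.exp(-lam * dt) * np.array([[1.0 + lam * dt, dt],
                                              [-lam**2 * dt, 1.0 - lam * dt]])
            Q = P_inf - A @ P_inf @ A.T
            m = A @ m
            P = A @ P @ A.T + Q
        # Update: only the first state component is observed
        S = P[0, 0] + noise_std**2
        gain = P[:, 0] / S
        m = m + np.outer(gain, Y[i] - m[0])
        P = P - np.outer(gain, gain) * S
        out[i] = m[0]
    return out
```

Each frame costs a constant amount of work, so the total is linear in the number of frames. When the inputs lie on a line and are visited in order, the last filtered mean matches the batch GP posterior mean at the last input. For general camera trajectories the chain is an approximation, because only consecutive poses are linked.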

# Experiments

Adding the GP prior generally improves the depth estimation results

Mistakes made early in online method can take more observations to correct since they are propagated forward

Comparison of different kernels on results

This method with 2 frames outperforms other methods with 5 frames

# Discussion

### Online Method Limitations

- No mention of offline runtime?

- Switching to online method loses many of the benefits proposed by this formulation (online method has an implicit window-size)

### Generalization

- While the method is very efficient and improves results in indoor scenes, generalization should be discussed further

- Does this method help for some motions more than others? (Rotation vs. translation)
  - MVDepthNet cost volume will lack information for pure rotation?

- This method can propagate future information back into past, while MVDepthNet cannot

- Gaussian Process lacks information about size of scene
  - Length-scale parameter learned, but may not generalize to larger scenes

- Influence of poses within 1m very different depending on indoor vs. outdoor scene

- Kernels could be learned, but pose information alone may not be sufficient

- Computation may be severely limited if more information was included, from Section 3.3:
  - "Because the likelihood is Gaussian and all the GPs share the same poses at which the covariance function is evaluated, we may solve all the 512×8×10 GP regression problems with one matrix inversion. This is due to the posterior covariance only being a function of the input poses, not values of learnt representations of images (i.e., y does not appear in the posterior variance terms in Eq. 5)"

### Graphical Models and Refinement of Latent Space

Interesting to think about ways of utilizing latent space in sequential estimation

- CodeSLAM propagates observed errors to refine latent space (geometric and photometric errors should be minimized)

- This work implicitly constrains the latent space according to hand-crafted kernel (similar poses should have similar latent spaces)

### Uncertainty Estimation

- Uncertainty estimation especially useful for the downstream tasks we are interested in

- Is the uncertainty estimation from GPs useful for certain tasks?
  - Useful for determining if similar data was previously seen

- Maybe not possible to propagate due to skip connections?

- Are there benefits to using GP uncertainty instead of having networks directly predict the uncertainty as an output? (https://arxiv.org/pdf/1703.04977.pdf)
  - To model uncertainty, some networks are trained with a loss in which the network also predicts a variance for each output

- A GP can be viewed as the limit of an infinitely wide neural network layer with an i.i.d. prior on the weights

- GPs give more flexible models, while Bayesian NNs give more scalable models

- With enough data, is it sufficient to use neural networks?

- Examples of neural network uncertainty usage in dense SLAM
  - CodeSLAM: https://arxiv.org/pdf/1804.00874.pdf
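For the question above about networks that predict their own uncertainty, a common choice is the Gaussian negative log-likelihood in which the network outputs a log-variance alongside each prediction, as in the arXiv 1703.04977 reference linked in the list. The sketch below is a plain-Python version of that loss for scalar outputs. The function name is hypothetical, and CodeSLAM-style systems may use a different likelihood.

```python
import math

def heteroscedastic_l2_loss(targets, means, log_vars):
    """Mean Gaussian negative log-likelihood with predicted log-variances.

    Each term is 0.5 * exp(-s) * (y - mu)^2 + 0.5 * s, with s = log(sigma^2).
    A large predicted variance down-weights the residual but pays a penalty
    through the 0.5 * s term.
    """
    total = 0.0
    for y, mu, s in zip(targets, means, log_vars):
        total += 0.5 * math.exp(-s) * (y - mu) ** 2 + 0.5 * s
    return total / len(targets)
```

For a fixed residual the loss is minimized when the predicted variance equals the squared residual, which is what lets the network learn where its predictions are unreliable.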