Given an input RGB stream, we first track and then map every keyframe. The camera pose is initially estimated with local bundle adjustment (BA) through frame-to-frame tracking driven by recurrent optical flow estimation. This is done with our novel DSPO (Disparity, Scale and Pose Optimization) layer, which combines pose and depth estimation with scale and depth refinement by leveraging a monocular depth prior. The DSPO layer also refines the poses globally via online loop closure and global BA.

For mapping at the estimated pose, a proxy depth map is first built by combining the noisy keyframe depths from the tracking module with the monocular depth prior, which fills in missing observations. The proxy depth and the input RGB keyframe are then mapped into a deformable neural point cloud using depth-guided volumetric rendering. A re-rendering loss against the input RGB and the proxy depth optimizes the neural features and the color decoder weights. Importantly, the neural point cloud deforms to account for global updates of the poses and the proxy depth before each mapping phase.
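The proxy-depth step can be sketched as follows. This is a minimal illustration under assumptions rather than the released implementation: it presumes the tracked depth comes with a per-pixel validity mask and that the monocular prior is aligned to it with a single least-squares scale and shift before filling the holes; the function name, signature, and alignment choice are hypothetical.

```python
import numpy as np

def build_proxy_depth(tracked_depth: np.ndarray,
                      mono_depth: np.ndarray,
                      valid_mask: np.ndarray) -> np.ndarray:
    """Fuse noisy tracked keyframe depth with a monocular depth prior.

    tracked_depth : depth from the tracking module (noisy, with holes)
    mono_depth    : dense monocular depth prior for the same keyframe
    valid_mask    : boolean mask of pixels where tracked_depth is reliable
    """
    d_t = tracked_depth[valid_mask]
    d_m = mono_depth[valid_mask]

    # Align the scale-ambiguous mono prior to the tracked depth:
    # solve min_{s,o} || s * d_m + o - d_t ||^2 on reliable pixels.
    A = np.stack([d_m, np.ones_like(d_m)], axis=1)
    (scale, offset), *_ = np.linalg.lstsq(A, d_t, rcond=None)
    aligned_mono = scale * mono_depth + offset

    # Keep tracked depth where it is reliable; fall back to the aligned
    # prior where the tracker produced no (or unreliable) observations.
    return np.where(valid_mask, tracked_depth, aligned_mono)
```

A global scale-and-shift alignment is the simplest way to reconcile a scale-ambiguous monocular prior with the tracked depth; the actual system may filter the tracked depths and weight the two sources differently.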
[Figure: DSPO layer — Input: optical flow between keyframes and the monocular depth prior; Output: camera poses ω and disparity maps d; Step A: reprojection error.]
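For context, the reprojection error referenced as Step A is the dense bundle adjustment objective that recurrent optical-flow trackers minimize jointly over the camera poses ω and disparities d. The formulation below is a sketch in that style; the keyframe co-visibility graph $\mathcal{E}$, the flow-induced correspondences $\tilde{p}_{ij}$, and the confidence weights $\Sigma_{ij}$ are assumed notation rather than taken from this document:

$$
E(\omega, d) \;=\; \sum_{(i,j)\in\mathcal{E}} \left\| \tilde{p}_{ij} - \Pi\!\left(\omega_j\,\omega_i^{-1}\,\Pi^{-1}(p_i, d_i)\right) \right\|^{2}_{\Sigma_{ij}},
$$

where $p_i$ are pixel coordinates in keyframe $i$, $\tilde{p}_{ij}$ their correspondences in keyframe $j$ predicted by the optical flow network, and $\Pi$, $\Pi^{-1}$ denote projection and back-projection given the disparity $d_i$.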
@article{zhang2024glorie,
title={Glorie-slam: Globally optimized rgb-only implicit encoding point cloud slam},
author={Zhang, Ganlin and Sandstr{\"o}m, Erik and Zhang, Youmin and Patel, Manthan and Van Gool, Luc and Oswald, Martin R},
journal={arXiv preprint arXiv:2403.19549},
year={2024}
}