GlORIE-SLAM: Globally Optimized RGB-only Implicit Encoding Point Cloud SLAM

1 ETH Zurich, 2 University of Bologna, 3 Rock Universe, 4 KU Leuven, 5 INSAIT, 6 University of Amsterdam
*Authors contributed equally to this work.
We show the online reconstruction process with scene deformation and online bundle adjustment on scene0000 of the ScanNet dataset. The ground-truth trajectory is shown in light blue and the estimated trajectory in blue. The left part shows a moving camera view; the right part shows a top-down view of the reconstructed scene.

Abstract

Recent advancements in RGB-only dense Simultaneous Localization and Mapping (SLAM) have predominantly utilized grid-based neural implicit encodings and/or struggled to efficiently realize global map and pose consistency. To this end, we propose an efficient RGB-only dense SLAM system using a flexible neural point cloud scene representation that adapts to keyframe poses and depth updates, without needing costly backpropagation. Another critical challenge of RGB-only SLAM is the lack of geometric priors. To alleviate this issue, with the aid of a monocular depth estimator, we introduce a novel DSPO layer for bundle adjustment which optimizes the pose and depth of keyframes along with the scale of the monocular depth. Finally, our system benefits from loop closure and online global bundle adjustment, and performs better than or competitively with existing dense neural RGB SLAM methods in tracking, mapping, and rendering accuracy on the Replica, TUM-RGBD, and ScanNet datasets.
GlORIE-SLAM uses a deformable point cloud as the scene representation and achieves lower trajectory error and higher rendering accuracy than competing approaches, e.g., GO-SLAM. Geometric accuracy is evaluated qualitatively. The light-blue trajectory is the ground truth and the blue one is the estimate. PSNR is evaluated over all keyframes.

Method Overview

Given an input RGB stream, we first track and then map every keyframe. The pose is initially estimated with local bundle adjustment (BA) via frame-to-frame tracking based on recurrent optical flow estimation. This is done with our novel DSPO (Disparity, Scale and Pose Optimization) layer, which combines pose and depth estimation with scale and depth refinement by leveraging a monocular depth prior. The DSPO layer also refines the poses globally via online loop closure and global BA. For mapping at the estimated pose, a proxy depth map is computed by combining the noisy keyframe depths from the tracking module with the monocular depth prior, which fills in missing observations. Mapping then fuses the proxy depth and the input RGB keyframe into a deformable neural point cloud, leveraging depth-guided volumetric rendering. A re-rendering loss against the input RGB and the proxy depth optimizes the neural features and the color decoder weights. Importantly, the neural point cloud deforms to account for global updates of the poses and the proxy depths before each mapping phase.
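As an illustration of the proxy depth step, below is a minimal, runnable sketch (ours, not the paper's code) of fusing noisy tracked depth with a monocular prior: the prior is aligned to the trusted pixels with a least-squares scale and shift, then used to fill the unobserved pixels. In the actual system the alignment comes from the DSPO layer; the function and variable names here are hypothetical.

import numpy as np

def fuse_proxy_depth(tracked_depth, mono_depth, valid_mask):
    # Fit tracked ~ s * mono + o on the pixels the tracker trusts,
    # then fill the remaining holes with the aligned prior.
    m = mono_depth[valid_mask]
    t = tracked_depth[valid_mask]
    A = np.stack([m, np.ones_like(m)], axis=1)      # (N, 2) design matrix
    (s, o), *_ = np.linalg.lstsq(A, t, rcond=None)  # scale s, shift o
    proxy = tracked_depth.copy()
    proxy[~valid_mask] = s * mono_depth[~valid_mask] + o
    return proxy

# Toy usage: a 4x4 depth map with a 2x2 unobserved hole.
mono = np.linspace(0.5, 1.5, 16).reshape(4, 4)
tracked = 2.0 * mono + 0.3
tracked[1:3, 1:3] = 0.0                             # missing observations
print(fuse_proxy_depth(tracked, mono, tracked > 0))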

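The point-cloud deformation admits a similarly compact sketch. Assuming (our assumption, not spelled out on this page) that each neural point stores the keyframe it was spawned from together with its anchor pixel depth, a global pose update T_old -> T_new and a proxy-depth update z_old -> z_new can be applied in closed form: unproject with the old pose, rescale along the viewing ray, and reproject with the new pose, with no backpropagation involved.

import numpy as np

def deform_points(points_h, z_old, z_new, T_old, T_new):
    # points_h: (4, N) homogeneous world points anchored to one keyframe;
    # T_old/T_new: (4, 4) camera-to-world poses before/after global BA;
    # z_old/z_new: (N,) anchor depths before/after the proxy depth update.
    pts_cam = np.linalg.inv(T_old) @ points_h    # back into the old camera frame
    pts_cam[:3] *= (z_new / z_old)[None, :]      # rescale along the viewing ray
    return T_new @ pts_cam                       # re-anchor with the updated pose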

DSPO (Disparity, Scale and Pose Optimization)

Input: Optical flow between keyframes & monocular depth prior
Output: Camera poses ω and disparity maps d

Step A: Reprojection Error
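No formula survives the page conversion here, so as a reference we write down a plausible form of the reprojection objective, following the DROID-SLAM-style dense bundle adjustment that this kind of tracking front end builds on (the notation is our assumption, with camera-to-world poses \omega):

E(\omega, d) \;=\; \sum_{(i,j)\in\mathcal{E}} \big\| \tilde{p}_{ij} - \Pi\big(\omega_j^{-1}\,\omega_i\,\Pi^{-1}(p_i, d_i)\big) \big\|^2_{\Sigma_{ij}}

Here \tilde{p}_{ij} are the pixel correspondences predicted by the recurrent optical flow, \Pi and \Pi^{-1} denote pinhole projection and back-projection of pixels p_i with disparities d_i, \Sigma_{ij} are per-pixel confidence weights from the flow network, and \mathcal{E} is the keyframe co-visibility graph.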


Step B: Multi-view Filter
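The filter itself is not spelled out on this page; below is a minimal, runnable sketch (ours) of one standard multi-view consistency test: back-project keyframe i's depth, reproject it into keyframe j, and keep pixels whose depth agrees with keyframe j's depth map. In the actual system the test would run over several neighboring keyframes, with a pixel passing if enough of them agree.

import numpy as np

def multiview_consistent(depth_i, depth_j, K, T_ji, thresh=0.05):
    # T_ji: (4, 4) relative pose mapping camera-i points into camera j.
    H, W = depth_i.shape
    v, u = np.mgrid[0:H, 0:W]
    pix = np.stack([u, v, np.ones_like(u)]).reshape(3, -1).astype(float)
    pts_i = np.linalg.inv(K) @ pix * depth_i.reshape(1, -1)   # back-project
    pts_j = T_ji[:3, :3] @ pts_i + T_ji[:3, 3:4]              # into camera j
    z = pts_j[2]
    with np.errstate(divide="ignore", invalid="ignore"):
        proj = (K @ pts_j)[:2] / z                            # reproject
        u_j = np.round(proj[0]).astype(int)
        v_j = np.round(proj[1]).astype(int)
    ok = (z > 1e-6) & (u_j >= 0) & (u_j < W) & (v_j >= 0) & (v_j < H)
    good = np.zeros(H * W, dtype=bool)
    idx = np.flatnonzero(ok)
    ref = depth_j[v_j[idx], u_j[idx]]
    good[idx] = np.abs(ref - z[idx]) / np.maximum(ref, 1e-6) < thresh
    return good.reshape(H, W)

# Toy check: identical views are fully self-consistent.
K = np.array([[100.0, 0, 2], [0, 100.0, 2], [0, 0, 1]])
D = np.ones((4, 4))
print(multiview_consistent(D, D, K, np.eye(4)).all())  # True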

Step C: Optimize Scale, Shift and Disparity Jointly
Only optimize disparities that failed the multi-view filter
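No formula is given on the slide; one plausible formulation (our assumption, not the paper's exact equation) aligns the mono prior in disparity space with a per-keyframe scale \theta_k and shift \gamma_k, and treats only the filtered-out disparities as free variables:

\min_{\{\theta_k, \gamma_k\},\; \{d_k(p)\,:\,p \notin \mathcal{V}_k\}} \;\sum_k \sum_{p} \big( d_k(p) - \theta_k\, \hat{d}_k(p) - \gamma_k \big)^2

For p \in \mathcal{V}_k (pixels that passed the filter) the disparity d_k(p) is held fixed, so these residuals anchor the scale and shift; for p \notin \mathcal{V}_k the disparity itself is optimized and pulled toward the rescaled prior \hat{d}_k. Alternating this objective with the reprojection objective of Step A keeps the free disparities consistent with multi-view geometry as well.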





Qualitative Results

Textured Meshes


Rendered Images

GlORIE-SLAM / GO-SLAM / Ground Truth

BibTeX

@article{zhang2024glorie,
    title={{GlORIE-SLAM}: Globally Optimized {RGB}-only Implicit Encoding Point Cloud {SLAM}},
    author={Zhang, Ganlin and Sandstr{\"o}m, Erik and Zhang, Youmin and Patel, Manthan and Van Gool, Luc and Oswald, Martin R.},
    journal={arXiv preprint arXiv:2403.19549},
    year={2024}
}