Given sequential video frames without camera intrinsics as input, our frontend model takes view pairs and predicts local pointmaps and the relative pose within each pair. We then use these pairwise predictions to construct a Sim(3) pose graph with loop closures and optimize it with the Levenberg-Marquardt algorithm. The frontend model employs a fully symmetric design, which keeps the model lightweight and supports more flexible pose graph optimization. In the pose graph and the final results, the blue edges correspond to connections between neighboring nodes (views), the orange edges correspond to loop closures, and the light blue frustums represent the estimated camera poses.
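To make the Sim(3) pose graph more concrete, here is a minimal sketch of how one edge constrains two nodes. It is an illustration under stated assumptions, not the ViSTA-SLAM implementation: the names `Sim3` and `edge_error` are hypothetical, a pose is assumed to map view coordinates to world coordinates, and the error is a simplified (scale, angle, translation) triple rather than the Lie-algebra log map a real optimizer would typically use. A Levenberg-Marquardt solver would minimize errors of this kind summed over all neighbor and loop-closure edges.

```python
import math
from dataclasses import dataclass


def matvec(R, v):
    """Multiply a 3x3 matrix (tuple of rows) by a 3-vector."""
    return tuple(sum(R[i][k] * v[k] for k in range(3)) for i in range(3))


def matmul(A, B):
    """Multiply two 3x3 matrices stored as tuples of rows."""
    return tuple(
        tuple(sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3))
        for i in range(3)
    )


def transpose(R):
    return tuple(tuple(R[j][i] for j in range(3)) for i in range(3))


def rot_z(theta):
    """Rotation by theta radians about the z-axis."""
    c, s = math.cos(theta), math.sin(theta)
    return ((c, -s, 0.0), (s, c, 0.0), (0.0, 0.0, 1.0))


@dataclass(frozen=True)
class Sim3:
    """Similarity transform x -> s * R @ x + t (hypothetical helper class)."""
    s: float
    R: tuple
    t: tuple

    def __matmul__(self, other):
        # Composition: (self @ other)(x) == self(other(x)).
        Rt = matvec(self.R, other.t)
        t = tuple(self.s * Rt[i] + self.t[i] for i in range(3))
        return Sim3(self.s * other.s, matmul(self.R, other.R), t)

    def inverse(self):
        Rt = transpose(self.R)
        v = matvec(Rt, self.t)
        return Sim3(1.0 / self.s, Rt, tuple(-x / self.s for x in v))


def edge_error(T_i, T_j, Z_ij):
    """Mismatch between a measured relative transform and two node poses.

    Assumptions: T_i and T_j map view coordinates to world coordinates, and
    Z_ij maps view-j coordinates to view-i coordinates. Returns a simplified
    (|log scale|, rotation angle, translation norm) triple, all zero when the
    nodes agree with the measurement.
    """
    E = Z_ij.inverse() @ (T_i.inverse() @ T_j)
    trace = E.R[0][0] + E.R[1][1] + E.R[2][2]
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return (abs(math.log(E.s)), math.acos(cos_angle), math.hypot(*E.t))


# Usage: one edge between two views.
identity = Sim3(1.0, rot_z(0.0), (0.0, 0.0, 0.0))
Z_01 = Sim3(2.0, rot_z(0.3), (1.0, 0.0, 0.0))   # pairwise prediction
T_0 = identity
T_1 = T_0 @ Z_01                                 # pose consistent with Z_01
T_1_drifted = Sim3(2.2, rot_z(0.3), (1.0, 0.0, 0.0))  # scale drift only

consistent = edge_error(T_0, T_1, Z_01)
drifted = edge_error(T_0, T_1_drifted, Z_01)
```

Here `consistent` is numerically zero in all three components, while `drifted` is nonzero only in the scale component. This is why the graph is built over Sim(3) rather than SE(3): without intrinsics or metric depth, each pairwise prediction has its own scale, so scale has to be an optimized variable on every node.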
Example scenes: office, redkitchen, apt1, office0, floor, room.
@misc{zhang2025vista,
  title={{ViSTA-SLAM}: Visual {SLAM} with Symmetric Two-view Association},
  author={Ganlin Zhang and Shenhan Qian and Xi Wang and Daniel Cremers},
  year={2025},
  eprint={2509.01584},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2509.01584},
}