Deep Learning-Based Object Tracking for Augmented Reality: A System-Level Survey of Methods, Constraints, and Challenges

Authors

  • Duoduo Mou, Faculty of Computing, Universiti Teknologi Malaysia, Skudai, Johor 81310, Malaysia

DOI:

https://doi.org/10.53104/j.acad.res.adv.2026.03001

Keywords:

Augmented reality; Object tracking; Deep learning; Transformer; Lightweight models; Multimodal fusion; 6-DoF pose estimation

Abstract

The immersiveness and usability of augmented reality (AR) systems rely on accurate, temporally stable, and computationally efficient object tracking. Recent advances in deep learning have reshaped visual tracking and enabled increasingly complex AR applications on mobile and edge platforms. As AR progresses toward large-scale consumer and industrial deployment, tracking has become a system-critical perception module that directly affects visual stability, interaction latency, and user trust. This survey reviews deep learning-based object tracking for AR from 2018 to 2025, focusing on algorithmic paradigms and system-level constraints. We analyze AR-specific requirements such as tight latency budgets, limited energy, long-term operation, and perceptual stability, and examine four representative paradigms (Siamese networks, deep discriminative correlation filters, Transformer-based models, and long-term frameworks) together with their design rationales and deployment challenges. We further discuss lightweight architectures, state-space temporal models, and diffusion-based approaches, along with integration strategies involving efficiency optimization, hardware-aware design, 6-DoF pose tracking, SLAM coupling, neural scene representations, and multimodal fusion. Representative datasets and evaluation protocols are analyzed from an AR deployment viewpoint, and open challenges and future research directions are identified. We argue that AR-oriented tracking constitutes a distinct research domain in which algorithmic accuracy, perceptual stability, and system efficiency must be jointly optimized to support trustworthy and immersive next-generation AR experiences.

Published

2026-04-23

How to Cite

Mou, D. (2026). Deep Learning-Based Object Tracking for Augmented Reality: A System-Level Survey of Methods, Constraints, and Challenges. Journal of Academic Research and Advances, 2(1), 1–14. https://doi.org/10.53104/j.acad.res.adv.2026.03001

Section

ARTICLES