awesome-spatial-intelligence
🌐 Forging Spatial Intelligence: A Roadmap of Multi-Modal Data Pre-Training for Autonomous Systems
Forging Spatial Intelligence
A Roadmap of Multi-Modal Data Pre-Training for Autonomous Systems
Figure 1: Taxonomy of Multi-Modal Representation Learning for Spatial Intelligence.
This repository serves as the official resource collection for the paper "Forging Spatial Intelligence: A Roadmap of Multi-Modal Data Pre-Training for Autonomous Systems".
In this work, we establish a systematic taxonomy for the field, unifying terminology, scope, and evaluation benchmarks. We organize existing methodologies into three complementary paradigms based on information flow and abstraction level:
- 📷 Single-Modality Pre-Training: The Bedrock of Perception. Focuses on extracting foundational features from individual sensor streams (Camera or LiDAR) via self-supervised learning techniques such as Contrastive Learning, Masked Modeling, and Forecasting. This paradigm establishes the fundamental representations for sensor-specific tasks (see the contrastive-learning sketch after this list).
- 🔄 Multi-Modality Pre-Training: Bridging the Semantic-Geometric Gap. Leverages cross-modal synergy to fuse heterogeneous sensor data. This category includes LiDAR-Centric (distilling visual semantics into geometry), Camera-Centric (injecting geometric priors into vision), and Unified frameworks that jointly learn modality-agnostic representations.
- 🌍 Open-World Perception and Planning: The Frontier of Embodied Autonomy. Represents the evolution from passive perception to active decision-making. This paradigm encompasses Generative World Models (e.g., video/occupancy generation), Embodied Vision-Language-Action (VLA) models, and systems capable of Open-World reasoning.
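To make the single-modality paradigm concrete, below is a minimal, illustrative sketch of an InfoNCE-style contrastive objective between two augmented views of the same batch of samples. The encoder outputs, batch size, and temperature are placeholder assumptions and do not correspond to any specific method surveyed here.

```python
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """z1, z2: (B, D) embeddings of two augmented views of the same B samples."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature                    # (B, B) cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)  # positives sit on the diagonal
    return F.cross_entropy(logits, targets)

if __name__ == "__main__":
    # Stand-in for encoder outputs of two augmentations of the same batch.
    z_view1, z_view2 = torch.randn(8, 128), torch.randn(8, 128)
    print(info_nce(z_view1, z_view2).item())
```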
Citation
If you find this work helpful for your research, please kindly consider citing our paper:
```bibtex
@article{wang2026forging,
  title   = {Forging Spatial Intelligence: A Roadmap of Multi-Modal Data Pre-Training for Autonomous Systems},
  author  = {Song Wang and Lingdong Kong and Xiaolu Liu and Hao Shi and Wentong Li and Jianke Zhu and Steven C. H. Hoi},
  journal = {arXiv preprint arXiv:2512.24385},
  year    = {2025}
}
```
Table of Contents
- 1. Benchmarks & Datasets
- Vehicle-Based Datasets
- Drone-Based Datasets
- Other Robotic Platforms
- 2. Single-Modality Pre-Training
- LiDAR-Only
- Camera-Only
- 3. Multi-Modality Pre-Training
- LiDAR-Centric (Vision-to-LiDAR)
- Camera-Centric (LiDAR-to-Vision)
- Unified Frameworks
- 4. Open-World Perception and Planning
- Text-Grounded Understanding
- Unified World Representation for Action
- 5. Acknowledgements
1. Benchmarks & Datasets
Vehicle-Based Datasets
| Dataset | Venue | Sensor | Task | Website |
|---|---|---|---|---|
| KITTI | CVPR'12 | 2 Cam(RGB), 2 Cam(Gray), 1 LiDAR(64) | 3D Det, Stereo, Optical Flow, SLAM | |
| ApolloScape | TPAMI'19 | 2 Cam, 2 LiDAR | 3D Det, HD Map | |
| nuScenes | CVPR'20 | 6 Cam(RGB), 1 LiDAR(32), 5 Radar | 3D Det, Seg, Occ, Map | |
| SemanticKITTI | ICCV'19 | 4 Cam, 1 LiDAR(64) | 3D Det, Occ | |
| Waymo | CVPR'20 | 5 Cam(RGB), 5 LiDAR | Perception (Det, Seg, Track), Motion | |
| Argoverse | CVPR'19 | 7 Cam(RGB), 2 LiDAR(32) | 3D Tracking, Forecasting, Map | |
| Lyft L5 | CoRL'20 | 7 Cam(RGB), 3 LiDAR, 5 Radar | 3D Det, Motion Forecasting/Planning | |
| A*3D | ICRA'20 | 2 Cam, 1 LiDAR(64) | 3D Det | |
| KITTI-360 | TPAMI'22 | 4 Cam, 1 LiDAR(64) | 3D Det, Occ | |
| A2D2 | arXiv'20 | 6 Cam, 5 LiDAR(16) | 3D Det | |
| PandaSet | ITSC'21 | 6 Cam(RGB), 2 LiDAR(64) | 3D Det, LiDAR Seg | |
| Cirrus | ICRA'21 | 1 Cam, 2 LiDAR(64) | 3D Det | |
| ONCE | NeurIPS'21 | 7 Cam(RGB), 1 LiDAR(40) | 3D Det (Self-supervised/Semi-supervised) | |
| Shifts | arXiv'21 | - | 3D Det, HD Map | |
| nuPlan | arXiv'21 | 8 Cam, 5 LiDAR | 3D Det, HD Map, E2E Plan | |
| Argoverse2 | NeurIPS'21 | 7 Cam, 2 LiDAR(32) | 3D Det, Occ, HD Map, E2E Plan | |
| MONA | ITSC'22 | 3 Cam | 3D Det, HD Map | |
| Dual Radar | Sci. Data'25 | 1 Cam, 1 LiDAR(80), 2 Radar | 3D Det | |
| MAN TruckScenes | NeurIPS'24 | 4 Cam, 6 LiDAR(64), 6 Radar | 3D Det | |
| OmniHD-Scenes | arXiv'24 | 6 Cam, 1 LiDAR(128), 6 Radar | 3D Det, Occ, HD Map | |
| AevaScenes | 2025 | 6 Cam, 6 LiDAR | 3D Det, HD Map | |
| PhysicalAI-AV | 2025 | 7 Cam, 1 LiDAR, 11 Radar | E2E Plan | |
Drone-Based Datasets
| Dataset | Venue | Sensor | Task | Website |
|---|---|---|---|---|
| Campus | ECCV'16 | 1 Cam | Target Forecasting / Tracking | |
| UAV123 | ECCV'16 | 1 Cam | UAV Tracking | |
| CarFusion | CVPR'18 | 22 Cam | 3D Vehicle Reconstruction | |
| UAVDT | ECCV'18 | 1 Cam | 2D Object Detection / Tracking | |
| DOTA | CVPR'18 | Multi-Source | 2D Object Detection | |
| VisDrone | TPAMI'21 | 1 Cam | 2D Object Detection / Tracking | |
| DOTA V2.0 | TPAMI'21 | Multi-Source | 2D Object Detection | |
| MOR-UAV | MM'20 | 1 Cam | Moving Object Recognition | |
| AU-AIR | ICRA'20 | 1 Cam | 2D Object Detection | |
| UAVid | ISPRS JPRS'20 | 1 Cam | Semantic Segmentation | |
| MOHR | Neuro'21 | 3 Cam | 2D Object Detection | |
| SensatUrban | CVPR'21 | 1 Cam | 2D Object Detection | |
| UAVDark135 | TMC'22 | 1 Cam | 2D Object Tracking | |
| MAVREC | CVPR'24 | 1 Cam | 2D Object Detection | |
| BioDrone | IJCV'24 | 1 Cam | 2D Object Tracking | |
| PDT | ECCV'24 | 1 Cam, 1 LiDAR | 2D Object Detection | |
| UAV3D | NeurIPS'24 | 5 Cam | 3D Object Detection / Tracking | |
| IndraEye | arXiv'24 | 1 Cam | 2D Object Detection / Semantic Segmentation | |
| UAVScenes | ICCV'25 | 1 Cam, 1 LiDAR | Semantic Segmentation, Visual Localization | |
Other Robotic Platforms
| Dataset | Venue | Platform | Sensors | Website |
|---|---|---|---|---|
| RailSem19 | CVPRW'19 | Railway | 1× Camera | |
| FRSign | arXiv'20 | Railway | 2× Camera (Stereo) | |
| RAWPED | TVT'20 | Railway | 1× Camera | |
| SRLC | AutCon'21 | Railway | 1× LiDAR | |
| Rail-DB | MM'22 | Railway | 1× Camera | |
| RailSet | IPAS'22 | Railway | 1× Camera | |
| OSDaR23 | ICRAE'23 | Railway | 9× Camera, 6× LiDAR, 1× Radar | |
| Rail3D | Infra'24 | Railway | 4× Camera, 1× LiDAR | |
| WHU-Railway3D | TITS'24 | Railway | 1× LiDAR | |
| FloW | ICCV'21 | USV (Water) | 2× Camera, 1× 4D Radar | |
| DartMouth | IROS'21 | USV (Water) | 3× Camera, 1× LiDAR | |
| MODS | TITS'21 | USV (Water) | 2× Camera, 1× LiDAR | |
| SeaSAW | CVPRW'22 | USV (Water) | 5× Camera | |
| WaterScenes | TITS'24 | USV (Water) | 1× Camera, 1× 4D Radar | |
| MVDD13 | Appl. Ocean Res.'24 | USV (Water) | 1× Camera | |
| SeePerSea | TFR'25 | USV (Water) | 1× Camera, 1× LiDAR | |
| WaterVG | TITS'25 | USV (Water) | 1× Camera, 1× 4D Radar | |
| Han et al. | NMI'24 | Legged Robot | 1× Depth Camera | |
| Luo et al. | CVPR'25 | Legged Robot | 1× Panoramic Camera | |
| QuadOcc | arXiv'25 | Legged Robot | 1× Panoramic Camera, 1× LiDAR | |
| M3ED | CVPRW'23 | Multi-Robot | 3× Camera, 2× Event Camera, 1× LiDAR | |
| Pi3DET | ICCV'25 | Multi-Robot | 3× Camera, 2× Event Camera, 1× LiDAR | |
2. Single-Modality Pre-Training
LiDAR-Only
Methods utilizing Point Cloud Contrastive Learning, Masked Autoencoders (MAE), or Forecasting.
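As a concrete illustration of the masked-modeling flavor mentioned above, here is a minimal sketch that hides a random subset of points and reconstructs them under a symmetric Chamfer loss. The tiny MLP "encoder-decoder" and the masking ratio are placeholder assumptions for illustration, not the design of any particular paper in this list.

```python
import torch
import torch.nn as nn

def chamfer(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Symmetric Chamfer distance between point sets pred (N, 3) and gt (M, 3)."""
    d = torch.cdist(pred, gt)                    # (N, M) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

points = torch.rand(2048, 3)                     # toy stand-in for one LiDAR sweep
keep = torch.rand(2048) > 0.75                   # hide roughly 75% of the points
visible, masked = points[keep], points[~keep]

# Placeholder "encoder-decoder": maps each visible point to a predicted point.
model = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 3))
pred = model(visible)

loss = chamfer(pred, masked)                     # reconstruct only the hidden geometry
loss.backward()
```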
Camera-Only
Self-supervised learning from image sequences for driving/robotics.
| Model | Paper | Venue | GitHub |
|---|---|---|---|
| INoD | Injected Noise Discriminator for Self-Supervised Representation | RA-L 2023 | |
| TempO | Self-Supervised Representation Learning From Temporal Ordering | RA-L 2024 | |
| LetsMap | Unsupervised Representation Learning for Label-Efficient Semantic BEV Mapping | ECCV 2024 | |
| NeRF-MAE | Masked AutoEncoders for Self-Supervised 3D Representation Learning | ECCV 2024 | |
| VisionPAD | A Vision-Centric Pre-training Paradigm for Autonomous Driving | arXiv 2024 | |
3. Multi-Modality Pre-Training
LiDAR-Centric Pre-Training
Enhancing LiDAR representations using Vision foundation models (Knowledge Distillation).
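For intuition, the sketch below shows the common distillation recipe in simplified form: project LiDAR points into the camera, sample features from a frozen 2D backbone at the projected pixels, and pull the 3D point features toward them with a cosine loss. All tensors, intrinsics, and names are illustrative assumptions rather than the pipeline of any listed method.

```python
import torch
import torch.nn.functional as F

def project(points_cam: torch.Tensor, K: torch.Tensor) -> torch.Tensor:
    """points_cam: (N, 3) in the camera frame; K: (3, 3) intrinsics -> (N, 2) pixel coords."""
    uvw = points_cam @ K.t()
    return uvw[:, :2] / uvw[:, 2:3].clamp(min=1e-6)

def sample_image_features(feat_2d: torch.Tensor, uv: torch.Tensor, hw) -> torch.Tensor:
    """feat_2d: (1, C, H, W); uv: (N, 2) pixel coords -> (N, C) per-point image features."""
    h, w = hw
    grid = torch.stack([uv[:, 0] / (w - 1) * 2 - 1,          # normalize x to [-1, 1]
                        uv[:, 1] / (h - 1) * 2 - 1], dim=-1).view(1, 1, -1, 2)
    return F.grid_sample(feat_2d, grid, align_corners=True)[0, :, 0].t()

# Toy tensors standing in for real sensor data and backbones.
C, H, W, N = 64, 48, 160, 512
image_feat = torch.randn(1, C, H, W)                 # frozen 2D backbone features
point_feat = torch.randn(N, C, requires_grad=True)   # learnable 3D point features
points_cam = torch.rand(N, 3) * torch.tensor([10., 5., 1.]) + torch.tensor([0., 0., 2.])
K = torch.tensor([[100., 0., W / 2], [0., 100., H / 2], [0., 0., 1.]])

uv = project(points_cam, K)
target = sample_image_features(image_feat, uv, (H, W)).detach()  # teacher is frozen
loss = 1 - F.cosine_similarity(point_feat, target, dim=-1).mean()
loss.backward()
```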
Camera-Centric Pre-Training
Learning 3D Geometry from Camera inputs using LiDAR supervision.
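A minimal sketch of this idea, assuming a pinhole camera and LiDAR points already expressed in the camera frame: project the points to pixels to obtain sparse depth targets and supervise the dense depth prediction only at those pixels. The "prediction" tensor is a stand-in for a real depth head.

```python
import torch
import torch.nn.functional as F

H, W, N = 48, 160, 1024
pred_depth = torch.rand(H, W, requires_grad=True) * 50.0   # stand-in for a dense depth prediction

# Toy LiDAR points in the camera frame (x right, y down, z forward).
points_cam = torch.rand(N, 3) * torch.tensor([20., 10., 40.]) + torch.tensor([-10., -5., 1.])
K = torch.tensor([[100., 0., W / 2], [0., 100., H / 2], [0., 0., 1.]])

uvw = points_cam @ K.t()
depth = uvw[:, 2]
u = (uvw[:, 0] / depth).round().long()
v = (uvw[:, 1] / depth).round().long()

# Keep only points that land inside the image and in front of the camera.
valid = (depth > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
u, v, depth = u[valid], v[valid], depth[valid]

loss = F.l1_loss(pred_depth[v, u], depth)   # sparse supervision at projected pixels
loss.backward()
```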
Unified Pre-Training
Joint optimization of multi-modal encoders for unified representations.
Incorporating Additional Sensors: With Radar
Incorporating additional modalities into pre-training frameworks for representation learning.
| Model | Paper | Venue | GitHub |
|---|---|---|---|
| RadarContrast | Self-Supervised Contrastive Learning for Camera-to-Radar Knowledge Distillation | DCOSS-IoT 2024 | |
| AssociationNet | Radar Camera Fusion via Representation Learning in Autonomous Driving | CVPRW 2021 | |
| MVRAE | Multi-View Radar Autoencoder for Self-Supervised Automotive Radar Representation Learning | IV 2024 | |
| SSRLD | Self-Supervised Representation Learning for the Object Detection of Marine Radar | ICCAI 2022 | |
| U-MLPNet | Learning Omni-Dimensional Spatio-Temporal Dependencies for Millimeter-Wave Radar Perception | Remote Sens 2024 | |
| 4D-ROLLS | 4D-ROLLS: 4D Radar Occupancy Learning via LiDAR Supervision | arXiv 2025 | |
| SS-RODNet | Pre-Training For mmWave Radar Object Detection Through Masked Image Modeling | - | |
| Radical | Bootstrapping Autonomous Driving Radars with Self-Supervised Learning | CVPR 2024 | |
| RiCL | Leveraging Self-Supervised Instance Contrastive Learning for Radar Object Detection | arXiv 2024 | |
| RSLM | Radar Spectra-Language Model for Automotive Scene Parsing | RADAR 2024 | |
Incorporating Additional Sensors: With Event Camera
| Model | Paper | Venue | GitHub |
|---|---|---|---|
| ECDP | Event Camera Data Pre-training | ICCV 2023 | |
| MEM | Masked Event Modeling: Self-Supervised Pretraining for Event Cameras | WACV 2024 | |
| DMM | Data-Efficient Event Camera Pre-training via Disentangled Masked Modeling | arXiv 2024 | |
| STP | Enhancing Event Camera Data Pretraining via Prompt-Tuning with Visual Models | - | |
| ECDDP | Event Camera Data Dense Pre-training | ECCV 2024 | |
| EventBind | EventBind: Learning a Unified Representation to Bind Them All for Event-Based Open-World Understanding | ECCV 2024 | |
| EventFly | EventFly: Event Camera Perception from Ground to the Sky | CVPR 2025 | |
4. Open-World Perception and Planning
Text-Grounded Understanding
Unified World Representation for Action
5. Acknowledgements
We thank the authors of the referenced papers for their open-source contributions.
