| 1 |
|
| 2 |
HU Z T, SHU T M. Language models, agent models, and world models: the LAW for machine reasoning and planning[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2312.05230.
|
| 3 |
|
| 4 |
|
| 5 |
ZHU H, WANG Y, ZHOU J, et al. Aether: geometric-aware unified world modeling[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Washington D.C., USA: IEEE Press, 2025: 8535-8546.
|
| 6 |
WANG X F, ZHU Z, HUANG G, et al. DriveDreamer: towards real-world-drive world models for autonomous driving[C]//Proceedings of European Conference on Computer Vision. Berlin, Germany: Springer, 2024: 55-72.
|
| 7 |
HU Y C, GUO Y J, WANG P C, et al. Video prediction policy: a generalist robot policy with predictive visual representations[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2412.14803.
|
| 8 |
ASSRAN M, BARDES A, FAN D, et al. V-JEPA2: self-supervised video models enable understanding, prediction and planning[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2506.09985.
|
| 9 |
ZHANG X N, LIN J Y, MOU X Y, et al. SocioVerse: a world model for social simulation powered by LLM agents and a pool of 10 million real-world users[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2504.10157.
|
| 10 |
WANG L, GAO H Y, BO X H, et al. YuLan-OneSim: towards the next generation of social simulator with large language models[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2505.07581.
|
| 11 |
DING J T, ZHANG Y K, SHANG Y, et al. Understanding world or predicting future? A comprehensive survey of world models[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2411.14499.
|
| 12 |
|
| 13 |
TU S F, ZHOU X, LIANG D K, et al. The role of world models in shaping autonomous driving: a comprehensive survey[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2502.10498.
|
| 14 |
|
| 15 |
|
| 16 |
HAO S, GU Y, MA H, et al. Reasoning with language model is planning with world model[C]//Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Singapore: Association for Computational Linguistics, 2023: 8154-8173.
|
| 17 |
LONG Y X, LI X Q, CAI W Z, et al. Discuss before moving: visual language navigation via multi-expert discussions[C]//Proceedings of the IEEE International Conference on Robotics and Automation (ICRA). Washington D.C., USA: IEEE Press, 2024: 17380-17387.
|
| 18 |
ZHAO G L, LI G B, CHEN W K, et al. OVER-NAV: elevating iterative vision-and-language navigation with open-vocabulary detection and structured representation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2024: 16296-16306.
|
| 19 |
YANG Z Y, LIN J G, CHEN P H, et al. RILA: reflective and imaginative language agent for zero-shot semantic audio-visual navigation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2024: 16251-16261.
|
| 20 |
CHEN H J, CHEN X, DENG S M, et al. Agent planning with world knowledge model[C]//Proceedings of the Advances in Neural Information Processing Systems 37. Vancouver, Canada: Neural Information Processing Systems Foundation, Inc. (NeurIPS), 2024: 114843-114871.
|
| 21 |
CHAE H, KIM N, ONG K T, et al. Web agents with world models: learning and leveraging environment dynamics in web navigation[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2410.13232.
|
| 22 |
HU M K, ZHAO P, XU C, et al. AgentGen: enhancing planning abilities for large language model based agent via environment and task generation[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2408.00764.
|
| 23 |
WANG R Y, TODD G, YUAN X D, et al. ByteSized32: a corpus and challenge task for generating task-specific world models expressed as text games[C]//Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Singapore: Association for Computational Linguistics, 2023: 13455-13471.
|
| 24 |
|
| 25 |
FENG J, LIU T H, DU Y W, et al. CityGPT: empowering urban spatial cognition of large language models[C]//Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining. New York, USA: ACM Press, 2025: 591-602.
|
| 26 |
|
| 27 |
FENG J, DU Y W, ZHAO J, et al. AgentMove: a large language model based agentic framework for zero-shot next location prediction[C]//Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). Philadelphia, USA: Association for Computational Linguistics, 2025: 1322-1338.
|
| 28 |
LI L, ZHOU Y, LIANG Y X, et al. Recognition through reasoning: reinforcing image geo-localization with large vision-language models[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2506.14674.
|
| 29 |
|
| 30 |
GAO Q Y, PI X Y, LIU K, et al. Do vision-language models have internal world models? Towards an atomic evaluation[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2506.21876.
|
| 31 |
CHEN R R, JIANG W F, QIN C W, et al. Theory of mind in large language models: assessment and enhancement[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2505.00026.
|
| 32 |
SAP M, LE BRAS R, FRIED D, et al. Neural theory-of-mind? On the limits of social intelligence in large LMs[C]//Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Philadelphia, USA: Association for Computational Linguistics, 2022: 3762-3780.
|
| 33 |
STRACHAN J W A, ALBERGO D, BORGHINI G, et al. Testing theory of mind in large language models and humans. Nature Human Behaviour, 2024, 8(7): 1285-1295.
doi: 10.1038/s41562-024-01882-z
|
| 34 |
PARK J S, O'BRIEN J, CAI C J, et al. Generative agents: interactive simulacra of human behavior[C]//Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology. New York, USA: ACM Press, 2023: 1-22.
|
| 35 |
BAKLASHKIN M, BODISHTIANU V, GLUSHANINA M, et al. EAI: emotional decision-making of LLMs in strategic games and ethical dilemmas[C]//Proceedings of the Advances in Neural Information Processing Systems 37. Vancouver, Canada: Neural Information Processing Systems Foundation, Inc. (NeurIPS), 2024: 53969-54002.
|
| 36 |
VAYANI A, DISSANAYAKE D, WATAWANA H, et al. All languages matter: evaluating LMMs on culturally diverse 100 languages[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2025: 19565-19575.
|
| 37 |
|
| 38 |
|
|
Kling AI: next-generation AI creative productivity platform[EB/OL]. [2026-02-27]. https://klingai.com/cn/. (in Chinese)
|
| 39 |
|
| 40 |
VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need. Advances in Neural Information Processing Systems, 2017, 30: 6000-6010.
|
| 41 |
|
| 42 |
YANG Z Y, TENG J Y, ZHENG W D, et al. CogVideoX: text-to-video diffusion models with an expert Transformer[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2408.06072.
|
| 43 |
|
| 44 |
|
| 45 |
|
|
|
| 46 |
YIN S M, WU C F, YANG H, et al. NUWA-XL: diffusion over diffusion for extremely long video generation[C]//Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg, USA: Association for Computational Linguistics, 2023: 1309-1320.
|
| 47 |
|
| 48 |
|
| 49 |
WANG X F, ZHU Z, HUANG G, et al. WorldDreamer: towards general world models for video generation via predicting masked tokens[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2401.09985.
|
| 50 |
|
| 51 |
|
| 52 |
|
| 53 |
ANANDKUMAR A, FAN L X, HUANG D A, et al. MineDojo: building open-ended embodied agents with Internet-scale knowledge[C]//Proceedings of the Advances in Neural Information Processing Systems 35. New Orleans, USA: Neural Information Processing Systems Foundation, Inc. (NeurIPS), 2022: 18343-18362.
|
| 54 |
|
| 55 |
|
| 56 |
XIANG J N, LIU G Y, GU Y, et al. Pandora: towards general world model with natural language actions and video states[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2406.09455.
|
| 57 |
|
| 58 |
DENG B Y, TUCKER R, LI Z Q, et al. Streetscapes: large-scale consistent street view generation using autoregressive video diffusion[C]//Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference. New York, USA: ACM Press, 2024: 1-11.
|
| 59 |
|
| 60 |
HU M, CHEN T, ZOU Y, et al. Text2World: benchmarking large language models for symbolic world model generation[C]//Proceedings of Findings of the Association for Computational Linguistics: ACL 2025. Philadelphia, USA: Association for Computational Linguistics, 2025: 26043-26066.
|
| 61 |
|
| 62 |
IVANOVA A A, SATHE A, LIPKIN B, et al. Elements of World Knowledge (EWoK): a cognition-inspired framework for evaluating basic world knowledge in language models[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2405.09605.
|
| 63 |
|
| 64 |
FEI H, WU S Q, ZHANG M S, et al. Enhancing video-language representations with structural spatio-temporal alignment. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024, 46(12): 7701-7719.
doi: 10.1109/TPAMI.2024.3393452
|
| 65 |
CHERIAN A, PAUL S, ROY-CHOWDHURY A. AVLEN: audio-visual-language embodied navigation in 3D environments[C]//Proceedings of the Advances in Neural Information Processing Systems 35. New Orleans, USA: Neural Information Processing Systems Foundation, Inc. (NeurIPS), 2022: 6236-6249.
|
| 66 |
SHEN B K, XIA F, LI C S, et al. iGibson 1.0: a simulation environment for interactive tasks in large realistic scenes[C]//Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Washington D.C., USA: IEEE Press, 2021: 7520-7527.
|
| 67 |
BOGDOLL D, YANG Y T, JOSEPH T, et al. MUVO: a multimodal generative world model for autonomous driving with geometric representations[C]//Proceedings of the IEEE Intelligent Vehicles Symposium (IV). Washington D.C., USA: IEEE Press, 2025: 2243-2250.
|
| 68 |
WANG L R, LING Y Y, YUAN Z C, et al. GenSim: generating robotic simulation tasks via large language models[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2310.01361.
|
| 69 |
LIN K, AGIA C, MIGIMATSU T, et al. Text2Motion: from natural language instructions to feasible plans. Autonomous Robots, 2023, 47(8): 1345-1365.
|
| 70 |
WANG Z H, CAI S F, CHEN G Z, et al. Describe, explain, plan and select: interactive planning with large language models enables open-world multi-task agents[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2302.01560.
|
| 71 |
MAO Y S, ZHONG J H, FANG C, et al. SpatialLM: training large language models for structured indoor modeling[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2506.07491.
|
| 72 |
LIU B, JIANG Y Q, ZHANG X H, et al. LLM+P: empowering large language models with optimal planning proficiency[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2304.11477.
|
| 73 |
GUAN L, KAMBHAMPATI S, SREEDHARAN S, et al. Leveraging pre-trained large language models to construct and utilize world models for model-based task planning[C]//Proceedings of the Advances in Neural Information Processing Systems 36. New Orleans, USA: Neural Information Processing Systems Foundation, Inc. (NeurIPS), 2023: 79081-79094.
|
| 74 |
ABBEEL P, ADENIJI A, ESCONTRELA A, et al. Video prediction models as rewards for reinforcement learning[C]//Proceedings of the Advances in Neural Information Processing Systems 36. New Orleans, USA: Neural Information Processing Systems Foundation, Inc. (NeurIPS), 2023: 68760-68783.
|
| 75 |
CHEANG C L, CHEN G Z, JING Y, et al. GR-2: a generative video-language-action model with Web-scale knowledge for robot manipulation[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2410.06158.
|
| 76 |
ZHOU S Y, DU Y L, CHEN J B, et al. RoboDreamer: learning compositional world models for robot imagination[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2404.12377.
|
| 77 |
ABBEEL P, DAI B, DAI H J, et al. Learning universal policies via text-guided video generation[C]//Proceedings of the Advances in Neural Information Processing Systems 36. New Orleans, USA: Neural Information Processing Systems Foundation, Inc. (NeurIPS), 2023: 9156-9172.
|
| 78 |
|
| 79 |
|
| 80 |
MAJUMDAR A, AJAY A, ZHANG X H, et al. OpenEQA: embodied question answering in the era of foundation models[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2024: 16488-16498.
|
| 81 |
GAO C, LAN X C, LU Z H, et al. S3: social-network simulation system with large language model-empowered agents[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2307.14984.
|
| 82 |
|
| 83 |
LI N, GAO C, LI M Y, et al. EconAgent: large language model-empowered agents for simulating macroeconomic activities[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2310.10436.
|
| 84 |
JI J, LI Y, LIU H, et al. SRAP-Agent: simulating and optimizing scarce resource allocation policy with LLM-based agent[C]//Proceedings of Findings of the Association for Computational Linguistics: EMNLP 2024. Miami, USA: Association for Computational Linguistics, 2024: 267-293.
|
| 85 |
|
| 86 |
PIAO J H, YAN Y W, ZHANG J, et al. AgentSociety: large-scale simulation of LLM-driven generative agents advances understanding of human behaviors and society[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2502.08691.
|
| 87 |
QIAN C, LIANG S H, QIN Y J, et al. Investigate-consolidate-exploit: a general strategy for inter-task agent self-evolution[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2401.13996.
|
| 88 |
ZHANG J T, XU X, ZHANG N Y, et al. Exploring collaboration mechanisms for LLM agents: a social psychology view[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2310.02124.
|
| 89 |
ZHANG W Q, TANG K, WU H, et al. Agent-pro: learning to evolve via policy-level reflection and optimization[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2402.17574.
|
| 90 |
|
|
World models reshape the future of cities: how generative AI is building a new paradigm for smart planning[EB/OL]. [2026-02-07]. https://www.aigc.cn/91864.html. (in Chinese)
|
| 91 |
WEI T C, GUO Z, YANG Y L. Research on the paths and strategies of empowering smart city construction with large language models. Information and Communications Technology and Policy, 2025, 51(8): 91-96. (in Chinese)
|
| 92 |
CHEN Z X, WANG G C, LIU Z W. SceneDreamer: unbounded 3D scene generation from 2D image collections. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(12): 15562-15576.
doi: 10.1109/TPAMI.2023.3321857
|
| 93 |
LIN C H, LEE H Y, MENAPACE W, et al. InfiniCity: infinite-scale city synthesis[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). Washington D.C., USA: IEEE Press, 2024: 22751-22761.
|
| 94 |
XIE H Z, CHEN Z X, HONG F Z, et al. CityDreamer: compositional generative model of unbounded 3D cities[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2024: 9666-9675.
|
| 95 |
DENG J, CHAI W H, DENG J, et al. CityGen: infinite and controllable city layout generation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). Washington D.C., USA: IEEE Press, 2025: 1986-1996.
|
| 96 |
FENG C, CHEN Z Y, HOŁYŃSKI A, et al. GPS as a control signal for image generation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2025: 2766-2778.
|
| 97 |
|
| 98 |
|
| 99 |
ZHAO G S, WANG X F, ZHU Z, et al. DriveDreamer-2: LLM-enhanced world models for diverse driving video generation[C]//Proceedings of the AAAI Conference on Artificial Intelligence. Palo Alto, USA: AAAI Press, 2025: 10412-10420.
|
| 100 |
ZHAO G S, NI C J, WANG X F, et al. DriveDreamer4D: world models are effective data machines for 4D driving scene representation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2025: 12015-12026.
|
| 101 |
WANG Y Q, HE J W, FAN L, et al. Driving into the future: multiview visual forecasting and planning with world model for autonomous driving[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2024: 14749-14759.
|
| 102 |
CHEN L, CHITTA K, GAO S Y, et al. Vista: a generalizable driving world model with high fidelity and versatile controllability[C]//Proceedings of the Advances in Neural Information Processing Systems 37. Vancouver, Canada: Neural Information Processing Systems Foundation, Inc. (NeurIPS), 2024: 91560-91596.
|
| 103 |
|
| 104 |
RUSSELL L, HU A, BERTONI L, et al. GAIA-2: a controllable multi-view generative world model for autonomous driving[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2503.20523.
|
| 105 |
ZHENG W Z, CHEN W L, HUANG Y H, et al. OccWorld: learning a 3D occupancy world model for autonomous driving[C]//Proceedings of European Conference on Computer Vision. Berlin, Germany: Springer, 2024: 55-72.
|
| 106 |
WANG L N, ZHENG W Z, REN Y L, et al. OccSora: 4D occupancy generation models as world simulators for autonomous driving[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2405.20337.
|
| 107 |
JIANG C K, ZHOU D S, LIU J M, et al. VectorWorld: efficient streaming world model via diffusion flow on vector graphs[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2603.17652.
|
| 108 |
CIPOLLA R, CORRADO G, GRIFFITHS N, et al. Model-based imitation learning for urban driving[C]//Proceedings of the Advances in Neural Information Processing Systems 35. New Orleans, USA: Neural Information Processing Systems Foundation, Inc. (NeurIPS), 2022: 20703-20716.
|
| 109 |
LOGG A, MARDAL K A, WELLS G. Automated solution of differential equations by the finite element method: the FEniCS book. Berlin, Germany: Springer, 2012.
|
| 110 |
COUMANS E, BAI Y P. PyBullet, a Python module for physics simulation for games, robotics and machine learning[EB/OL]. [2026-02-27]. http://pybullet.org.
|
| 111 |
TODOROV E, EREZ T, TASSA Y. MuJoCo: a physics engine for model-based control[C]//Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems. Washington D.C., USA: IEEE Press, 2012: 5026-5033.
|
| 112 |
HU Y M, LIU J C, SPIELBERG A, et al. ChainQueen: a real-time differentiable physical simulator for soft robotics[C]//Proceedings of the International Conference on Robotics and Automation (ICRA). Washington D.C., USA: IEEE Press, 2019: 6265-6271.
|
| 113 |
BILLARD A, ALBU-SCHAEFFER A, BEETZ M, et al. A roadmap for AI in robotics. Nature Machine Intelligence, 2025, 7(6): 818-824.
|
| 114 |
LIU J F, SHI H Y, ZHANG S Y, et al. Automatic quantization for physics-based simulation. ACM Transactions on Graphics, 2022, 41(4): 1-16.
|
| 115 |
ANTYPAS D, AYELE A, BORKAKOTY H, et al. BLEnD: a benchmark for LLMs on everyday knowledge in diverse cultures and languages[C]//Proceedings of the Advances in Neural Information Processing Systems 37. Vancouver, Canada: Neural Information Processing Systems Foundation, Inc. (NeurIPS), 2024: 78104-78146.
|
| 116 |
STREET W, SIY J O, KEELING G, et al. LLMs achieve adult human performance on higher-order theory of mind tasks. Frontiers in Human Neuroscience, 2025, 19: 1633272.
|
| 117 |
YANG J H, YANG S S, GUPTA A W, et al. Thinking in space: how multimodal large language models see, remember, and recall spaces[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2025: 10632-10643.
|
| 118 |
GUO X Y, HUO J Y, SHI Z M, et al. T2VPhysBench: a first-principles benchmark for physical consistency in text-to-video generation[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2505.00337.
|
| 119 |
|
| 120 |
|
| 121 |
HUANG Z Q, HE Y N, YU J S, et al. VBench: comprehensive benchmark suite for video generative models[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2024: 21807-21818.
|
| 122 |
ZHENG D, HUANG Z Q, LIU H B, et al. VBench-2.0: advancing video generation benchmark suite for intrinsic faithfulness[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2503.21755.
|
| 123 |
|
| 124 |
FENG J, ZHANG J, LIU T H, et al. CityBench: evaluating the capabilities of large language models for urban tasks[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2406.13945.
|
| 125 |
ZHAO B N, FANG J J, DAI Z C, et al. UrbanVideo-bench: benchmarking vision-language models on embodied intelligence with video data in urban spaces[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2503.06157.
|
| 126 |
MA Y S, CUI C, CAO X, et al. LaMPilot: an open benchmark dataset for autonomous driving with language model programs[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2024: 15141-15151.
|
| 127 |
DARGAHI NOBARI K, BERTRAM T. A multimodal driver monitoring benchmark dataset for driver modeling in assisted driving automation. Scientific Data, 2024, 11: 327.
|