Survey of World Models Based on Large Models (Invited)

Computer Engineering, 2026, Vol. 52, Issue (6): 1-16. doi: 10.19678/j.issn.1000-3428.0260356

ZHAO Xiang¹,*, HEI Mengzhe², LI Jiaxu¹, PANG Ning³, CHEN Ziyang¹

1. National Key Laboratory of Big Data and Decision, National University of Defense Technology, Changsha 410005, Hunan, China
2. National Key Laboratory of Information Systems Engineering, National University of Defense Technology, Changsha 410005, Hunan, China
3. Aviation University of Air Force, Changchun 130000, Jilin, China

Received: 2026-03-20  Revised: 2026-04-22  Online: 2026-06-15  Published: 2026-06-02
Corresponding author: ZHAO Xiang
About the authors:
ZHAO Xiang, male, professor, Ph.D.; main research interest: big data knowledge engineering
HEI Mengzhe (co-first author), male, Ph.D. candidate; main research interests: world models and situation prediction for trending events
LI Jiaxu, Ph.D. candidate
PANG Ning, lecturer, Ph.D.
CHEN Ziyang, Ph.D. candidate
Funding:
National Natural Science Foundation of China (U25B2047); National Natural Science Foundation of China (62272469); National Natural Science Foundation of China (72501299)

Abstract:

World models are generally regarded as understanding and representing the external world while predicting its future states from the current world state and actions. Large models, such as the large language models GPT-4 and LLaMA, rely on massive training data and very large parameter counts to achieve outstanding capabilities in learning, understanding, representing, and generating textual knowledge. In recent years, world models have attracted considerable attention from both industry and academia, yielding a large body of research and commercial results in autonomous driving, social simulation, embodied intelligence, and video generation. Researchers have also applied advances in various large models to world models, further improving their performance. This paper comprehensively reviews world models built on large models across domains, covering approaches based on large language models and on vision large models (VLMs), and introduces representative models in several important application areas: embodied intelligence, smart cities, social simulation, and physical environment simulation. The paper first classifies world models by the modality of the underlying large model and points out the functional differences between world models of different modalities. It then presents important open-source resources and benchmarks to help researchers in related fields quickly understand and use world models. Finally, it summarizes the survey and discusses future research directions.
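
As a minimal illustration of the definition used in the abstract (a world model maps the current state and an action to a predicted future state), the following Python sketch uses a hypothetical toy environment. The names `GridState`, `predict_next`, and `rollout` are illustrative assumptions and do not come from any of the surveyed works; a large-model-based world model would replace the hand-written transition rule with a learned predictor.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class GridState:
    """Toy world state (hypothetical): an agent position on a bounded 1-D line."""
    position: int
    size: int = 5


def predict_next(state: GridState, action: int) -> GridState:
    """Transition function: predict the next state from the current state and an action.

    The action is a step of -1, 0, or +1; the position is clamped to the line.
    """
    new_position = min(max(state.position + action, 0), state.size - 1)
    return GridState(position=new_position, size=state.size)


def rollout(
    state: GridState,
    actions: Sequence[int],
    step: Callable[[GridState, int], GridState] = predict_next,
) -> List[GridState]:
    """Imagine a trajectory by applying the transition function once per action."""
    trajectory = [state]
    for action in actions:
        state = step(state, action)
        trajectory.append(state)
    return trajectory


if __name__ == "__main__":
    imagined = rollout(GridState(position=0), [1, 1, -1, 1])
    print([s.position for s in imagined])
```

Passing a different `step` function to `rollout` is the point of the design: planning code that consumes imagined trajectories does not need to know whether the predictor is a hand-written rule or a learned model.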

Key words: world model, large model, generative large model, simulation, embodied intelligence

ZHAO Xiang, HEI Mengzhe, LI Jiaxu, PANG Ning, CHEN Ziyang. Survey of World Models Based on Large Models (Invited)[J]. Computer Engineering, 2026, 52(6): 1-16.

https://www.ecice06.com/CN/Y2026/V52/I6/1

Figures/Tables (9)

Fig.1 Classification and application of large model-based world models

Fig.2 Relationship among world models, large language models, and agent models [2]

References (127)

1	XU K, ZHAO H, HU R, et al. From specialist to generalist: a comprehensive survey on world models[EB/OL]. [2026-02-27]. https://zenodo.org/records/18050668.
2	HU Z T, SHU T M. Language models, agent models, and world models: the LAW for machine reasoning and planning[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2312.05230.
3	HA D, SCHMIDHUBER J. World models[EB/OL]. [2026-02-27]. https://arxiv.org/abs/1803.10122.
4	LIN J, DU Y Q, WATKINS O, et al. Learning to model the world with language[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2308.01399.
5	ZHU H, WANG Y, ZHOU J, et al. Aether: geometric-aware unified world modeling[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Washington D.C., USA: IEEE Press, 2025: 8535-8546.
6	WANG X F, ZHU Z, HUANG G, et al. DriveDreamer: towards real-world-drive world models for autonomous driving[C]//Proceedings of European Conference on Computer Vision. Berlin, Germany: Springer, 2024: 55-72.
7	HU Y C, GUO Y J, WANG P C, et al. Video prediction policy: a generalist robot policy with predictive visual representations[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2412.14803.
8	ASSRAN M, BARDES A, FAN D, et al. V-JEPA2: self-supervised video models enable understanding, prediction and planning[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2506.09985.
9	ZHANG X N, LIN J Y, MOU X Y, et al. SocioVerse: a world model for social simulation powered by LLM agents and a pool of 10 million real-world users[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2504.10157.
10	WANG L, GAO H Y, BO X H, et al. YuLan-OneSim: towards the next generation of social simulator with large language models[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2505.07581.
11	DING J T, ZHANG Y K, SHANG Y, et al. Understanding world or predicting future? A comprehensive survey of world models[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2411.14499.
12	GUAN Y C, LIAO H C, LI Z N, et al. World models for autonomous driving: an initial survey[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2403.02622.
13	TU S F, ZHOU X, LIANG D K, et al. The role of world models in shaping autonomous driving: a comprehensive survey[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2502.10498.
14	KONG L D, YANG W, MEI J B, et al. 3D and 4D world modeling: a survey[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2509.07996.
15	GURNEE W, TEGMARK M. Language models represent space and time[EB/OL]. [2026-02-27]. https://www.semanticscholar.org/paper/Language-Models-Represent-Space-and-Time-Gurnee-Tegmark/740c783ac07039cf30b6d8a8f95e775b3297c79e.
16	HAO S, GU Y, MA H, et al. Reasoning with language model is planning with world model[C]//Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Singapore: Association for Computational Linguistics, 2023: 8154-8173.
17	LONG Y X, LI X Q, CAI W Z, et al. Discuss before moving: visual language navigation via multi-expert discussions[C]//Proceedings of the IEEE International Conference on Robotics and Automation (ICRA). Washington D.C., USA: IEEE Press, 2024: 17380-17387.
18	ZHAO G L, LI G B, CHEN W K, et al. OVER-NAV: elevating iterative vision-and-language navigation with open-vocabulary detection and structured representation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2024: 16296-16306.
19	YANG Z Y, LIN J G, CHEN P H, et al. RILA: reflective and imaginative language agent for zero-shot semantic audio-visual navigation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2024: 16251-16261.
20	CHEN H J, CHEN X, DENG S M, et al. Agent planning with world knowledge model[C]//Proceedings of the Advances in Neural Information Processing Systems 37. Vancouver, Canada: Neural Information Processing Systems Foundation, Inc. (NeurIPS), 2024: 114843-114871.
21	CHAE H, KIM N, ONG K T, et al. Web agents with world models: learning and leveraging environment dynamics in web navigation[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2410.13232.
22	HU M K, ZHAO P, XU C, et al. AgentGen: enhancing planning abilities for large language model based agent via environment and task generation[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2408.00764.
23	WANG R Y, TODD G, YUAN X D, et al. ByteSized32: a corpus and challenge task for generating task-specific world models expressed as text games[C]//Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Singapore: Association for Computational Linguistics, 2023: 13455-13471.
24	TOUVRON H, LAVRIL T, IZACARD G, et al. LLaMA: open and efficient foundation language models[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2302.13971.
25	FENG J, LIU T H, DU Y W, et al. CityGPT: empowering urban spatial cognition of large language models[C]//Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining. New York, USA: ACM Press, 2025: 591-602.
26	ROBERTS J, LÜDDECKE T, DAS S, et al. GPT4GEO: how a language model sees the world's geography[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2306.00020.
27	FENG J, DU Y W, ZHAO J, et al. AgentMove: a large language model based agentic framework for zero-shot next location prediction[C]//Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). Philadelphia, USA: Association for Computational Linguistics, 2025: 1322-1338.
28	LI L, ZHOU Y, LIANG Y X, et al. Recognition through reasoning: reinforcing image geo-localization with large vision-language models[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2506.14674.
29	JIN C, RINARD M. Emergent representations of program semantics in language models trained on programs[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2305.11169.
30	GAO Q Y, PI X Y, LIU K, et al. Do vision-language models have internal world models? Towards an atomic evaluation[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2506.21876.
31	CHEN R R, JIANG W F, QIN C W, et al. Theory of mind in large language models: assessment and enhancement[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2505.00026.
32	SAP M, LE BRAS R, FRIED D, et al. Neural theory-of-mind? On the limits of social intelligence in large LMs[C]//Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Philadelphia, USA: Association for Computational Linguistics, 2022: 3762-3780.
33	STRACHAN J W A, ALBERGO D, BORGHINI G, et al. Testing theory of mind in large language models and humans[J]. Nature Human Behaviour, 2024, 8(7): 1285-1295. doi: 10.1038/s41562-024-01882-z
34	PARK J S, O'BRIEN J, CAI C J, et al. Generative agents: interactive simulacra of human behavior[C]//Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology. New York, USA: ACM Press, 2023: 1-22.
35	BAKLASHKIN M, BODISHTIANU V, GLUSHANINA M, et al. EAI: emotional decision-making of LLMs in strategic games and ethical dilemmas[C]//Proceedings of the Advances in Neural Information Processing Systems 37. Vancouver, Canada: Neural Information Processing Systems Foundation, Inc. (NeurIPS), 2024: 53969-54002.
36	VAYANI A, DISSANAYAKE D, WATAWANA H, et al. All languages matter: evaluating LMMs on culturally diverse 100 languages[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2025: 19565-19575.
37	Sora | OpenAI[EB/OL]. [2026-02-27]. https://openai.com/zh-Hans-CN/sora/.
38	Kling AI: next-generation AI creative productivity platform[EB/OL]. [2026-02-27]. https://klingai.com/cn/. (in Chinese)
39	Runway | AI image and video generator[EB/OL]. [2026-02-27]. https://runwayml.com/product.
40	VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[J]. Advances in Neural Information Processing Systems, 2017, 30: 6000-6010.
41	ZHENG Z W, PENG X Y, YANG T J, et al. Open-Sora: democratizing efficient video production for all[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2412.20404.
42	YANG Z Y, TENG J Y, ZHENG W D, et al. CogVideoX: text-to-video diffusion models with an expert Transformer[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2408.06072.
43	AGARWAL N, ALI A, BALA M, et al. Cosmos world foundation model platform for physical AI[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2501.03575.
44	PARKER-HOLDER J, BALL P, BRUCE J, et al. Genie2: a large-scale foundation world model[EB/OL]. [2026-02-27]. https://deepmind.google/discover/blog/genie-2-a-large-scale-foundation-world-model.
45	Genie3: a new frontier of world models toward AGI | Google DeepMind[EB/OL]. [2026-02-27]. https://www.genie3.cloud/zh#feature. (in Chinese)
46	YIN S M, WU C F, YANG H, et al. NUWA-XL: diffusion over diffusion for extremely long video generation[C]//Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg, USA: Association for Computational Linguistics, 2023: 1309-1320.
47	BRUCE J, DENNIS M, EDWARDS A, et al. Genie: generative interactive environments[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2402.15391.
48	YANG S, WALKER J, PARKER-HOLDER J, et al. Video as the new language for real-world decision making[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2402.17139.
49	WANG X F, ZHU Z, HUANG G, et al. WorldDreamer: towards general world models for video generation via predicting masked tokens[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2401.09985.
50	YU W, QIAN R J, LI Y M, et al. MosaicMem: hybrid spatial memory for controllable video world models[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2603.17117.
51	CHANG A, DAI A, FUNKHOUSER T, et al. Matterport3D: learning from RGB-D data in indoor environments[EB/OL]. [2026-02-27]. https://arxiv.org/abs/1709.06158.
52	SHANG Y, LIN Y M, ZHENG Y, et al. UrbanWorld: an urban world model for 3D city generation[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2407.11965.
53	ANANDKUMAR A, FAN L X, HUANG D A, et al. MineDojo: building open-ended embodied agents with Internet-scale knowledge[C]//Proceedings of the Advances in Neural Information Processing Systems 35. New Orleans, USA: Neural Information Processing Systems Foundation, Inc. (NeurIPS), 2022: 18343-18362.
54	YANG S, DU Y L, GHASEMIPOUR K, et al. Learning interactive real-world simulators[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2310.06114.
55	CHI X W, FAN C K, ZHANG H Y, et al. EVA: an embodied world model for future video anticipation[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2410.15461.
56	XIANG J N, LIU G Y, GU Y, et al. Pandora: towards general world model with natural language actions and video states[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2406.09455.
57	ZHEN H Y, SUN Q, ZHANG H X, et al. TesserAct: learning 4D embodied world models[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2504.20995.
58	DENG B Y, TUCKER R, LI Z Q, et al. Streetscapes: large-scale consistent street view generation using autoregressive video diffusion[C]//Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference. New York, USA: ACM Press, 2024: 1-11.
59	RIGTER M, GUPTA T, HILMKIL A, et al. AVID: adapting video diffusion models to world models[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2410.12822.
60	HU M, CHEN T, ZOU Y, et al. Text2World: benchmarking large language models for symbolic world model generation[C]//Proceedings of Findings of the Association for Computational Linguistics: ACL 2025. Philadelphia, USA: Association for Computational Linguistics, 2025: 26043-26066.
61	GE Z Q, HUANG H Z, ZHOU M Z, et al. WorldGPT: empowering LLM as multimodal world model[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2404.18202.
62	IVANOVA A A, SATHE A, LIPKIN B, et al. Elements of World Knowledge (EWoK): a cognition-inspired framework for evaluating basic world knowledge in language models[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2405.09605.
63	YU J F, WANG X Z, TU S Q, et al. KoLA: carefully benchmarking world knowledge of large language models[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2306.09296.
64	FEI H, WU S Q, ZHANG M S, et al. Enhancing video-language representations with structural spatio-temporal alignment[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024, 46(12): 7701-7719. doi: 10.1109/TPAMI.2024.3393452
65	CHERIAN A, PAUL S, ROY-CHOWDHURY A. AVLEN: audio-visual-language embodied navigation in 3D environments[C]//Proceedings of the Advances in Neural Information Processing Systems 35. New Orleans, USA: Neural Information Processing Systems Foundation, Inc. (NeurIPS), 2022: 6236-6249.
66	SHEN B K, XIA F, LI C S, et al. iGibson 1.0: a simulation environment for interactive tasks in large realistic scenes[C]//Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Washington D.C., USA: IEEE Press, 2021: 7520-7527.
67	BOGDOLL D, YANG Y T, JOSEPH T, et al. MUVO: a multimodal generative world model for autonomous driving with geometric representations[C]//Proceedings of the IEEE Intelligent Vehicles Symposium (IV). Washington D.C., USA: IEEE Press, 2025: 2243-2250.
68	WANG L R, LING Y Y, YUAN Z C, et al. GenSim: generating robotic simulation tasks via large language models[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2310.01361.
69	LIN K, AGIA C, MIGIMATSU T, et al. Text2Motion: from natural language instructions to feasible plans[J]. Autonomous Robots, 2023, 47(8): 1345-1365.
70	WANG Z H, CAI S F, CHEN G Z, et al. Describe, explain, plan and select: interactive planning with large language models enables open-world multi-task agents[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2302.01560.
71	MAO Y S, ZHONG J H, FANG C, et al. SpatialLM: training large language models for structured indoor modeling[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2506.07491.
72	LIU B, JIANG Y Q, ZHANG X H, et al. LLM+P: empowering large language models with optimal planning proficiency[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2304.11477.
73	GUAN L, KAMBHAMPATI S, SREEDHARAN S, et al. Leveraging pre-trained large language models to construct and utilize world models for model-based task planning[C]//Proceedings of the Advances in Neural Information Processing Systems 36. New Orleans, USA: Neural Information Processing Systems Foundation, Inc. (NeurIPS), 2023: 79081-79094.
74	ABBEEL P, ADENIJI A, ESCONTRELA A, et al. Video prediction models as rewards for reinforcement learning[C]//Proceedings of the Advances in Neural Information Processing Systems 36. New Orleans, USA: Neural Information Processing Systems Foundation, Inc. (NeurIPS), 2023: 68760-68783.
75	CHEANG C L, CHEN G Z, JING Y, et al. GR-2: a generative video-language-action model with Web-scale knowledge for robot manipulation[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2410.06158.
76	ZHOU S Y, DU Y L, CHEN J B, et al. RoboDreamer: learning compositional world models for robot imagination[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2404.12377.
77	ABBEEL P, DAI B, DAI H J, et al. Learning universal policies via text-guided video generation[C]//Proceedings of the Advances in Neural Information Processing Systems 36. New Orleans, USA: Neural Information Processing Systems Foundation, Inc. (NeurIPS), 2023: 9156-9172.
78	ZHU F, WU H, GUO S, et al. IRASim: learning interactive real-robot action simulators[EB/OL]. [2026-02-27]. https://arxiv.org/html/2406.14540v1.
79	SHANG Y, ZHANG X, TANG Y Z, et al. RoboScape: physics-informed embodied world model[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2506.23135.
80	MAJUMDAR A, AJAY A, ZHANG X H, et al. OpenEQA: embodied question answering in the era of foundation models[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2024: 16488-16498.
81	GAO C, LAN X C, LU Z H, et al. S³: social-network simulation system with large language model-empowered agents[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2307.14984.
82	PAPACHRISTOU M, YUAN Y. Network formation and dynamics among multi-LLMs[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2402.10659.
83	LI N, GAO C, LI M Y, et al. EconAgent: large language model-empowered agents for simulating macroeconomic activities[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2310.10436.
84	JI J, LI Y, LIU H, et al. SRAP-Agent: simulating and optimizing scarce resource allocation policy with LLM-based agent[C]//Proceedings of Findings of the Association for Computational Linguistics: EMNLP 2024. Miami, USA: Association for Computational Linguistics, 2024: 267-293.
85	ALTERA.AL, AHN A, BECKER N, et al. Project Sid: many-agent simulations toward AI civilization[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2411.00114.
86	PIAO J H, YAN Y W, ZHANG J, et al. AgentSociety: large-scale simulation of LLM-driven generative agents advances understanding of human behaviors and society[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2502.08691.
87	QIAN C, LIANG S H, QIN Y J, et al. Investigate-consolidate-exploit: a general strategy for inter-task agent self-evolution[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2401.13996.
88	ZHANG J T, XU X, ZHANG N Y, et al. Exploring collaboration mechanisms for LLM agents: a social psychology view[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2310.02124.
89	ZHANG W Q, TANG K, WU H, et al. Agent-pro: learning to evolve via policy-level reflection and optimization[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2402.17574.
90	World models reshape the future of cities: how generative AI is building a new paradigm for smart planning[EB/OL]. [2026-02-07]. https://www.aigc.cn/91864.html. (in Chinese)
91	WEI T C, GUO Z, YANG Y L. Research on the paths and strategies of empowering smart city construction with large language models[J]. Information and Communications Technology and Policy, 2025, 51(8): 91-96. (in Chinese)
92	CHEN Z X, WANG G C, LIU Z W. SceneDreamer: unbounded 3D scene generation from 2D image collections[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(12): 15562-15576. doi: 10.1109/TPAMI.2023.3321857
93	LIN C H, LEE H Y, MENAPACE W, et al. InfiniCity: infinite-scale city synthesis[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). Washington D.C., USA: IEEE Press, 2024: 22751-22761.
94	XIE H Z, CHEN Z X, HONG F Z, et al. CityDreamer: compositional generative model of unbounded 3D cities[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2024: 9666-9675.
95	DENG J, CHAI W H, DENG J, et al. CityGen: infinite and controllable city layout generation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). Washington D.C., USA: IEEE Press, 2025: 1986-1996.
96	FENG C, CHEN Z Y, HOŁYŃSKI A, et al. GPS as a control signal for image generation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2025: 2766-2778.
97	MANVI R, KHANNA S, MAI G C, et al. GeoLLM: extracting geospatial knowledge from large language models[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2310.06213.
98	FENG T, WANG W G, YANG Y. A survey of world models for autonomous driving[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2501.11260.
99	ZHAO G S, WANG X F, ZHU Z, et al. DriveDreamer-2: LLM-enhanced world models for diverse driving video generation[C]//Proceedings of the AAAI Conference on Artificial Intelligence. Palo Alto, USA: AAAI Press, 2025: 10412-10420.
100	ZHAO G S, NI C J, WANG X F, et al. DriveDreamer4D: world models are effective data machines for 4D driving scene representation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2025: 12015-12026.
101	WANG Y Q, HE J W, FAN L, et al. Driving into the future: multiview visual forecasting and planning with world model for autonomous driving[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2024: 14749-14759.
102	CHEN L, CHITTA K, GAO S Y, et al. Vista: a generalizable driving world model with high fidelity and versatile controllability[C]//Proceedings of the Advances in Neural Information Processing Systems 37. Vancouver, Canada: Neural Information Processing Systems Foundation, Inc. (NeurIPS), 2024: 91560-91596.
103	HU A, RUSSELL L, YEO H, et al. GAIA-1: a generative world model for autonomous driving[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2309.17080.
104	RUSSELL L, HU A, BERTONI L, et al. GAIA-2: a controllable multi-view generative world model for autonomous driving[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2503.20523.
105	ZHENG W Z, CHEN W L, HUANG Y H, et al. OccWorld: learning a 3D occupancy world model for autonomous driving[C]//Proceedings of European Conference on Computer Vision. Berlin, Germany: Springer, 2024: 55-72.
106	WANG L N, ZHENG W Z, REN Y L, et al. OccSora: 4D occupancy generation models as world simulators for autonomous driving[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2405.20337.
107	JIANG C K, ZHOU D S, LIU J M, et al. VectorWorld: efficient streaming world model via diffusion flow on vector graphs[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2603.17652.
108	CIPOLLA R, CORRADO G, GRIFFITHS N, et al. Model-based imitation learning for urban driving[C]//Proceedings of the Advances in Neural Information Processing Systems 35. New Orleans, USA: Neural Information Processing Systems Foundation, Inc. (NeurIPS), 2022: 20703-20716.
109	LOGG A, MARDAL K A, WELLS G. Automated solution of differential equations by the finite element method: the FEniCS book[M]. Berlin, Germany: Springer, 2012.
110	COUMANS E, BAI Y P. PyBullet, a Python module for physics simulation for games, robotics and machine learning[EB/OL]. [2026-02-27]. http://pybullet.org.
111	TODOROV E, EREZ T, TASSA Y. MuJoCo: a physics engine for model-based control[C]//Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems. Washington D.C., USA: IEEE Press, 2012: 5026-5033.
112	HU Y M, LIU J C, SPIELBERG A, et al. ChainQueen: a real-time differentiable physical simulator for soft robotics[C]//Proceedings of the International Conference on Robotics and Automation (ICRA). Washington D.C., USA: IEEE Press, 2019: 6265-6271.
113	BILLARD A, ALBU-SCHAEFFER A, BEETZ M, et al. A roadmap for AI in robotics[J]. Nature Machine Intelligence, 2025, 7(6): 818-824.
114	LIU J F, SHI H Y, ZHANG S Y, et al. Automatic quantization for physics-based simulation[J]. ACM Transactions on Graphics, 2022, 41(4): 1-16.
115	ANTYPAS D, AYELE A, BORKAKOTY H, et al. BLEnD: a benchmark for LLMs on everyday knowledge in diverse cultures and languages[C]//Proceedings of the Advances in Neural Information Processing Systems 37. Vancouver, Canada: Neural Information Processing Systems Foundation, Inc. (NeurIPS), 2024: 78104-78146.
116	STREET W, SIY J O, KEELING G, et al. LLMs achieve adult human performance on higher-order theory of mind tasks[J]. Frontiers in Human Neuroscience, 2025, 19: 1633272.
117	YANG J H, YANG S S, GUPTA A W, et al. Thinking in space: how multimodal large language models see, remember, and recall spaces[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2025: 10632-10643.
118	GUO X Y, HUO J Y, SHI Z M, et al. T2VPhysBench: a first-principles benchmark for physical consistency in text-to-video generation[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2505.00337.
119	QIN Y R, SHI Z L, YU J W, et al. WorldSimBench: towards video generation models as world simulators[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2410.18072.
120	DUAN H Y, YU H X, CHEN S R, et al. WorldScore: a unified evaluation benchmark for world generation[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2504.00983.
121	HUANG Z Q, HE Y N, YU J S, et al. VBench: comprehensive benchmark suite for video generative models[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2024: 21807-21818.
122	ZHENG D, HUANG Z Q, LIU H B, et al. VBench-2.0: advancing video generation benchmark suite for intrinsic faithfulness[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2503.21755.
123	Evaluating robot policies in a world model[EB/OL]. [2026-02-27]. https://arxiv.org/html/2506.00613v1.
124	FENG J, ZHANG J, LIU T H, et al. CityBench: evaluating the capabilities of large language models for urban tasks[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2406.13945.
125	ZHAO B N, FANG J J, DAI Z C, et al. UrbanVideo-bench: benchmarking vision-language models on embodied intelligence with video data in urban spaces[EB/OL]. [2026-02-27]. https://arxiv.org/abs/2503.06157.
126	MA Y S, CUI C, CAO X, et al. LaMPilot: an open benchmark dataset for autonomous driving with language model programs[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2024: 15141-15151.
127	DARGAHI NOBARI K, BERTRAM T. A multimodal driver monitoring benchmark dataset for driver modeling in assisted driving automation[J]. Scientific Data, 2024, 11: 327.
