[1] ACHIAM J, ADLER S, AGARWAL S, et al. GPT-4
technical report[J]. arXiv preprint arXiv:2303.08774,
2023.
[2] 李博, 季佰军, 段湘煜. 基于译文易错词纠正机制的大语言模型机器翻译[J/OL]. 计算机工程, 1-13[2025-03-22]. https://doi.org/10.19678/j.issn.1000-3428.0069767.
LI B, JI B J, DUAN X Y. Large language model machine
translation based on a mechanism for correcting easily
mistaken words in translations[J/OL]. Computer Engineering:
1-13[2025-03-22]. https://doi.org/10.19678/j.issn.1000-3428.0069767.
[3] XI Z, CHEN W, GUO X, et al. The rise and potential of
large language model based agents: A survey[J]. Science
China Information Sciences, 2025, 68(2): 121101.
[4] 罗焕坤, 葛一烽, 刘帅. 大语言模型在数学推理中的研究进展[J]. 计算机工程, doi: 10.19678/j.issn.1000-3428.0069590.
LUO H K, GE Y F, LIU S. Research progress of large
language models in mathematical reasoning[J]. Computer Engineering, doi:
10.19678/j.issn.1000-3428.0069590.
[5] ZHANG X, TIAN C, YANG X, et al. AlpaCare: Instruction-tuned large language models for medical application[J]. arXiv preprint arXiv:2310.14558, 2023.
[6] 沈晨晨, 岳盛斌, 刘书隽, 等. 面向法律领域的大模型微调与应用[J]. 大数据, 2024, 10(5): 12-27.
SHEN C C, YUE S B, LIU S J, et al. Fine-tuning and applications of large models for legal domain[J]. Big Data
Research, 2024, 10(5): 12-27.
[7] JI Z, LEE N, FRIESKE R, et al. Survey of hallucination in
natural language generation[J]. ACM Computing Surveys,
2023, 55(12): 1-38.
[8] ZHANG Y, LI Y, CUI L, et al. Siren's song in the AI ocean:
a survey on hallucination in large language models[J].
arXiv preprint arXiv:2309.01219, 2023.
[9] RAWTE V, SHETH A, DAS A. A survey of hallucination
in large foundation models[J]. arXiv preprint
arXiv:2309.05922, 2023.
[10] HUANG L, YU W, MA W, et al. A survey on hallucination
in large language models: Principles, taxonomy, challenges, and open questions[J]. ACM Transactions on Information Systems, 2025, 43(2): 1-55.
[11] 刘泽垣, 王鹏江, 宋晓斌, 等. 大语言模型的幻觉问题研究综述[J]. 软件学报, 2025, 36(3): 1152-1185. DOI: 10.13328/j.cnki.jos.007242.
LIU Z Y, WANG P J, SONG X B, et al. A survey on hallucination problem in large language models[J]. Journal of
Software, 2025, 36(3): 1152-1185.
DOI: 10.13328/j.cnki.jos.007242.
[12] LINDLEY D V. On a measure of the information provided
by an experiment[J]. The Annals of Mathematical Statistics, 1956, 27(4): 986-1005.
[13] FARQUHAR S, KOSSEN J, KUHN L, et al. Detecting
hallucinations in large language models using semantic
entropy[J]. Nature, 2024, 630(8017): 625-630.
[14] THORNE J, VLACHOS A, CHRISTODOULOPOULOS
C, et al. FEVER: a large-scale dataset for fact extraction
and verification[C]//Proceedings of the 2018 Conference
of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, 2018: 809-819.
[15] LIN S, HILTON J, EVANS O. TruthfulQA: Measuring
how models mimic human falsehoods[C]//Proceedings of
the 60th Annual Meeting of the Association for Computational Linguistics, 2022: 3214-3252.
[16] HENDRYCKS D, BURNS C, BASART S, et al. Measuring Massive Multitask Language Understanding[C]//Proceedings of the International Conference on
Learning Representations, 2021.
[17] DU Y, LI S, TORRALBA A, et al. Improving Factuality
and Reasoning in Language Models through Multiagent
Debate[C]//International Conference on Machine Learning. PMLR, 2024: 11733-11763.
[18] TONMOY S M, ZAMAN S M, JAIN V, et al. A comprehensive survey of hallucination mitigation techniques in
large language models[J]. arXiv preprint
arXiv:2401.01313, 2024.
[19] WHITE J, FU Q, HAYS S, et al. A prompt pattern catalog
to enhance prompt engineering with ChatGPT[J]. arXiv
preprint arXiv:2302.11382, 2023.
[20] SHUSTER K, POFF S, CHEN M, et al. Retrieval augmentation reduces hallucination in conversation[C]//Findings of the Association for Computational
Linguistics: EMNLP 2021. 2021: 3784-3803.
[21] DHULIAWALA S, KOMEILI M, XU J, et al.
Chain-of-Verification Reduces Hallucination in Large Language Models[C]//Findings of the Association for
Computational Linguistics: ACL 2024, 2024: 3563-3578.
[22] SHI W, HAN X, LEWIS M, et al. Trusting your evidence:
hallucinate less with context-aware decoding[C]//Proceedings of the 2024 Conference of the North
American Chapter of the Association for Computational
Linguistics: Human Language Technologies, 2024:
783-791.
[23] BAYAT F F, QIAN K, HAN B, et al. FLEEK: Factual
error detection and correction with evidence retrieved
from external knowledge[C]//Proceedings of the 2023
Conference on Empirical Methods in Natural Language
Processing: System Demonstrations, 2023: 124-130.
[24] CHRISTIANO P F, LEIKE J, BROWN T B, et al. Deep
reinforcement learning from human preferences[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017:
4302-4310.
[25] BRADLEY R A, TERRY M E. Rank analysis of incomplete block designs: the method of paired comparisons[J].
Biometrika, 1952, 39(3-4): 324-345.
[26] SCHULMAN J, WOLSKI F, DHARIWAL P, et al. Proximal policy optimization algorithms[J]. arXiv preprint
arXiv:1707.06347, 2017.
[27] OUYANG L, WU J, JIANG X, et al. Training language
models to follow instructions with human feedback[J].
Advances in Neural Information Processing Systems, 2022,
35: 27730-27744.
[28] SCHULMAN J, MORITZ P, LEVINE S, et al.
High-dimensional continuous control using generalized
advantage estimation[J]. arXiv preprint arXiv:1506.02438,
2015.
[29] SUTTON R S. Learning to predict by the methods of
temporal differences[J]. Machine Learning, 1988, 3: 9-44.
[30] MALON C. Team Papelo: Transformer networks at
FEVER[C]//Proceedings of the First Workshop on Fact
Extraction and VERification (FEVER), 2018: 109-113.
[31] VASWANI A, SHAZEER N, PARMAR N, et al. Attention
is all you need[C]//Proceedings of the 31st International
Conference on Neural Information Processing Systems,
2017:5998-6008.
[32] LIN C Y. ROUGE: A package for automatic evaluation of
summaries[C]//Text Summarization Branches Out, 2004:
74-81.
[33] PAPINENI K, ROUKOS S, WARD T, et al. BLEU: A
method for automatic evaluation of machine translation[C]
//Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002: 311-318.
[34] ZHANG T, KISHORE V, WU F, et al. BERTScore: Evaluating Text Generation with BERT[C]//International
Conference on Learning Representations, 2020.
[35] LI J, GALLEY M, BROCKETT C, et al. A Diversity-Promoting Objective Function for Neural Conversation
Models[C]//Proceedings of the 2016 Conference of the
North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2016:
110-119.
[36] YOFFE L, AMAYUELAS A, WANG W Y. DebUnc: mitigating hallucinations in large language model agent
communication with uncertainty estimations[J]. arXiv
preprint arXiv:2407.06426, 2024.
[37] WANG X, WEI J, SCHUURMANS D, et al.
Self-consistency improves chain of thought reasoning in
language models[J]. arXiv preprint arXiv:2203.11171,
2022.
[38] WEI J, WANG X, SCHUURMANS D, et al.
Chain-of-thought prompting elicits reasoning in large
language models[J]. Advances in Neural Information Processing Systems, 2022, 35: 24824-24837.
[39] SHINN N, CASSANO F, GOPINATH A, et al. Reflexion:
Language agents with verbal reinforcement learning[J].
Advances in Neural Information Processing Systems,
2023, 36: 8634-8652.
[40] CHUANG Y, XIE Y, LUO H, et al. DoLa: Decoding by
Contrasting Layers Improves Factuality in Large Language Models[C]//Proceedings of the Twelfth International Conference on Learning Representations, 2024.
[41] WANG A, SONG L, PENG B, et al. Fine-Grained
Self-Endorsement Improves Factuality and Reasoning[J].
CoRR, 2024.
[42] PENEDO G, MALARTIC Q, HESSLOW D, et al. The
RefinedWeb dataset for Falcon LLM: Outperforming curated
corpora with web data only[J]. Advances in Neural Information Processing Systems, 2023, 36: 79155-79172.
[43] GLM T, ZENG A, XU B, et al. ChatGLM: A family of large
language models from GLM-130B to GLM-4 All Tools[J].
arXiv preprint arXiv:2406.12793, 2024.
[44] BAI J, BAI S, CHU Y, et al. Qwen technical report[J].
arXiv preprint arXiv:2309.16609, 2023.
[45] CUI Y, YANG Z, YAO X. Efficient and effective text encoding for Chinese LLaMA and Alpaca[J]. arXiv preprint
arXiv:2304.08177, 2023.
[46] TOUVRON H, MARTIN L, STONE K, et al. Llama 2:
Open foundation and fine-tuned chat models[J]. arXiv
preprint arXiv:2307.09288, 2023.
[47] HU E J, SHEN Y, WALLIS P, et al. LoRA: Low-rank adaptation of large language models[C]//International Conference on Learning Representations, 2022.
[48] WILLIAMS A, NANGIA N, BOWMAN S. A
Broad-coverage challenge corpus for sentence understanding through inference[C]//Proceedings of the 2018
Conference of the North American Chapter of the Association for Computational Linguistics: Human Language
Technologies, 2018: 1112-1122.
[49] HE P, LIU X, GAO J, et al. DeBERTa: Decoding-enhanced
BERT with disentangled attention[J]. arXiv preprint
arXiv:2006.03654, 2020.