[1] ZHAI C X.Interactive information retrieval:models,algorithms,and evaluation[C]//Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval.New York,USA:ACM Press,2020:2444-2447.
[2] 张明悦,金芝,赵海燕,等.机器学习赋能的软件自适应性综述[J].软件学报,2020,31(8):2404-2431. ZHANG M Y,JIN Z,ZHAO H Y,et al.Survey of machine learning enabled software self-adaptation[J].Journal of Software,2020,31(8):2404-2431.(in Chinese)
[3] 刘全,翟建伟,章宗长,等.深度强化学习综述[J].计算机学报,2018,41(1):1-27. LIU Q,ZHAI J W,ZHANG Z C,et al.A survey on deep reinforcement learning[J].Chinese Journal of Computers,2018,41(1):1-27.(in Chinese)
[4] 宋健,王子磊.基于值分解的多目标多智能体深度强化学习方法[J].计算机工程,2023,49(1):31-40. SONG J,WANG Z L.Multi-goal multi-agent deep reinforcement learning method based on value decomposition[J].Computer Engineering,2023,49(1):31-40.(in Chinese)
[5] 朱斐,吴文,伏玉琛,等.基于双深度网络的安全深度强化学习方法[J].计算机学报,2019,42(8):1812-1826. ZHU F,WU W,FU Y C,et al.A dual deep network based secure deep reinforcement learning method[J].Chinese Journal of Computers,2019,42(8):1812-1826.(in Chinese)
[6] 刘成浩,朱斐,刘全.基于优化子目标数的Option-Critic算法[J].计算机学报,2021,44(9):1922-1933. LIU C H,ZHU F,LIU Q.Option-Critic algorithm based on sub-goal quantity optimization[J].Chinese Journal of Computers,2021,44(9):1922-1933.(in Chinese)
[7] HA D,SCHMIDHUBER J.World models[EB/OL].[2022-10-25].https://arxiv.org/abs/1803.10122.
[8] ZHANG S T,YAO H S,WHITESON S.Breaking the deadly triad with a target network[EB/OL].[2022-10-25].https://arxiv.org/abs/2101.08862v4.
[9] ZOU L X,XIA L,DING Z Y,et al.Reinforcement learning to optimize long-term user engagement in recommender systems[C]//Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.New York,USA:ACM Press,2019:2810-2818.
[10] IE E,HSU C W,MLADENOV M,et al.RecSim:a configurable simulation platform for recommender systems[EB/OL].[2022-10-25].https://arxiv.org/abs/1909.04847v2.
[11] SHI B,OZSOY M G,HURLEY N,et al.PyRecGym:a reinforcement learning gym for recommender systems[C]//Proceedings of the 13th ACM Conference on Recommender Systems.New York,USA:ACM Press,2019:491-495.
[12] HUANG J,OOSTERHUIS H,DE RIJKE M,et al.Keeping dataset biases out of the simulation:a debiased simulator for reinforcement learning based recommender systems[C]//Proceedings of the 14th ACM Conference on Recommender Systems.New York,USA:ACM Press,2020:190-199.
[13] FUJIMOTO S,MEGER D,PRECUP D.Off-policy deep reinforcement learning without exploration[EB/OL].[2022-10-25].https://arxiv.org/pdf/1812.02900.pdf.
[14] ZHANG Y,FENG F L,HE X N,et al.Causal intervention for leveraging popularity bias in recommendation[C]//Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval.New York,USA:ACM Press,2021:11-20.
[15] KINGMA D P,WELLING M.Auto-encoding variational Bayes[EB/OL].[2022-10-25].https://arxiv.org/pdf/1312.6114v1.pdf.
[16] LOUIZOS C,SHALIT U,MOOIJ J,et al.Causal effect inference with deep latent-variable models[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems.Red Hook,USA:Curran Associates Inc.,2017:6449-6459.
[17] DULAC-ARNOLD G,EVANS R,HASSELT H V,et al.Deep reinforcement learning in large discrete action spaces[EB/OL].[2022-10-25].https://arxiv.org/pdf/1512.07679.pdf.
[18] XIAO T,WANG D L.A general offline reinforcement learning framework for interactive recommendation[C]//Proceedings of the 35th AAAI Conference on Artificial Intelligence.[S.l.]:AAAI Press,2021:4512-4520.
[19] 周运腾,张雪英,李凤莲,等.Q-learning算法优化的SVDPP推荐算法[J].计算机工程,2021,47(2):46-51. ZHOU Y T,ZHANG X Y,LI F L,et al.SVDPP recommendation algorithm optimized by Q-learning algorithm[J].Computer Engineering,2021,47(2):46-51.(in Chinese)
[20] 金志军,王浩,方宝富.稀疏场景下基于理性好奇心的多智能体强化学习[J].计算机工程,2023,49(5):302-309. JIN Z J,WANG H,FANG B F.Multi-agent reinforcement learning based on rational curiosity in sparse scenarios[J].Computer Engineering,2023,49(5):302-309.(in Chinese)
[21] 周瑞朋,秦进.基于最佳子策略记忆的强化探索策略[J].计算机工程,2022,48(2):106-112. ZHOU R P,QIN J.Reinforcement exploration strategy based on best sub-strategy memory[J].Computer Engineering,2022,48(2):106-112.(in Chinese)
[22] ZOU L X,XIA L,DU P,et al.Pseudo Dyna-Q:a reinforcement learning framework for interactive recommendation[C]//Proceedings of the 13th International Conference on Web Search and Data Mining.New York,USA:ACM Press,2020:816-824.
[23] 梁星星,冯旸赫,黄金才,等.基于自回归预测模型的深度注意力强化学习方法[J].软件学报,2020,31(4):948-966. LIANG X X,FENG Y H,HUANG J C,et al.Novel deep reinforcement learning algorithm based on attention-based value function and autoregressive environment model[J].Journal of Software,2020,31(4):948-966.(in Chinese)
[24] 韦炜,全渝娟,卓奕涛,等.基于多阶马尔可夫预测的个性化推荐算法[J].计算机工程,2015,41(11):59-66. WEI W,QUAN Y J,ZHUO Y T,et al.Personalized recommendation algorithm based on multi-order Markov prediction[J].Computer Engineering,2015,41(11):59-66.(in Chinese)
[25] CHEN J W,DONG H D,WANG X,et al.Bias and debias in recommender system:a survey and future directions[EB/OL].[2022-10-25].https://arxiv.org/abs/2010.03240v2.
[26] MARLIN B M,ZEMEL R S,ROWEIS S,et al.Collaborative filtering and the missing at random assumption[EB/OL].[2022-10-25].https://arxiv.org/ftp/arxiv/papers/1206/1206.5267.pdf.
[27] ROHDE D,BONNER S,DUNLOP T,et al.RecoGym:a reinforcement learning environment for the problem of product recommendation in online advertising[EB/OL].[2022-10-25].https://arxiv.org/pdf/1808.00720.pdf.
[28] RENDLE S,FREUDENTHALER C,GANTNER Z,et al.BPR:Bayesian personalized ranking from implicit feedback[C]//Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence.Arlington,USA:AUAI Press,2009:452-461.
[29] ZHAO X Y,ZHANG L,DING Z Y,et al.Recommendations with negative feedback via pairwise deep reinforcement learning[C]//Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.New York,USA:ACM Press,2018:1040-1048.
[30] VAN HASSELT H,GUEZ A,SILVER D.Deep reinforcement learning with double Q-learning[EB/OL].[2022-10-25].https://arxiv.org/pdf/1509.06461.pdf.
[31] HO J,ERMON S.Generative adversarial imitation learning[C]//Proceedings of the 30th International Conference on Neural Information Processing Systems.Red Hook,USA:Curran Associates Inc.,2016:4572-4580.
[32] SHANG W J,YU Y,LI Q Y,et al.Environment reconstruction with hidden confounders for reinforcement learning based recommendation[C]//Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.New York,USA:ACM Press,2019:566-576.
[33] HE X N,DENG K,WANG X,et al.LightGCN:simplifying and powering graph convolution network for recommendation[C]//Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval.New York,USA:ACM Press,2020:639-648.
[34] SCHNABEL T,SWAMINATHAN A,SINGH A,et al.Recommendations as treatments:debiasing learning and evaluation[EB/OL].[2022-10-25].https://arxiv.org/pdf/1602.05352.pdf.
[35] GUO S Y,ZOU L X,LIU Y D,et al.Enhanced doubly robust learning for debiasing post-click conversion rate estimation[C]//Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval.New York,USA:ACM Press,2021:275-284.