作者投稿和查稿 主编审稿 专家审稿 编委审稿 远程编辑

计算机工程 ›› 2024, Vol. 50 ›› Issue (5): 167-181. doi: 10.19678/j.issn.1000-3428.0067967

• 网络空间安全 • 上一篇    下一篇

基于贝叶斯网络的差分隐私高维数据发布技术研究

卢晓天1,2, 朴春慧1,2, 杨兴雨1,2, 白英杰2,3   

  1. 1. 石家庄铁道大学信息科学与技术学院, 河北 石家庄 050043;
    2. 河北省电磁环境效应与信息处理重点实验室, 河北 石家庄 050043;
    3. 北京全路通信信号研究设计院集团有限公司, 北京 100070
  • 收稿日期:2023-06-29 修回日期:2023-09-25 出版日期:2024-05-15 发布日期:2024-05-21
  • 通讯作者: 朴春慧,E-mail:pchls2011@126.com E-mail:pchls2011@126.com
  • 基金资助:
    河北省重点研发计划(21355902D)。

Research on Differential Privacy High Dimensional Data Publishing Technology Based on Bayesian Networks

LU Xiaotian1,2, PIAO Chunhui1,2, YANG Xingyu1,2, BAI Yingjie2,3   

  1. 1. School of Information Science and Technology, Shijiazhuang Tiedao University, Shijiazhuang 050043, Hebei, China;
    2. Hebei Key Laboratory for Electromagnetic Environmental Effects and Information Processing, Shijiazhuang 050043, Hebei, China;
    3. CRSC Research & Design Institute Group Co., Ltd., Beijing 100070, China
  • Received:2023-06-29 Revised:2023-09-25 Online:2024-05-15 Published:2024-05-21
  • Contact: 朴春慧,E-mail:pchls2011@126.com E-mail:pchls2011@126.com

摘要: 在实现隐私保护的同时提高数据可用性是高维结构化数据发布研究中的挑战性问题,经典算法PrivBayes针对该问题提供了一种解决方案。为进一步减少计算开销、提高数据可用性,提出基于贝叶斯网络的差分隐私数据发布算法ELPrivBayes。分析贝叶斯网络结构学习阶段的理论计算开销,构建存储属性之间互信息的相关矩阵,避免结构学习算法迭代过程中互信息的冗余计算,降低了时间复杂度。基于平均互信息优化了节点进入贝叶斯网络的顺序,提高结构学习迭代过程中指数机制贡献的互信息期望值,进而提高生成数据集与原始数据集的统计近似度,并实证分析网络结构质量对首节点选择的低敏感性。在4个典型数据集上的实验结果表明,与经典算法PrivBayes及其改进方案相比较,结构学习阶段的计算开销降低了97%~99%,基于指数机制捕获的互信息提高了14%~67%,生成数据集与原始数据集的平均变差距离降低了32%~40%,构建的支持向量机(SVM)分类器的准确率提高了4%~5%,并且当ε≤0.8时,采用ELPrivBayes算法生成数据的可用性提升更为显著。

关键词: 数据发布, 贝叶斯网络, 差分隐私, 隐私保护, 相关矩阵, 平均互信息

Abstract: Improving data availability while implementing privacy protection is challenging in high-dimensional structured data publishing; however, the classic PrivBayes algorithm can solve this issue. To further reduce computational costs and improve data availability, a differential privacy data-publishing algorithm based on Bayesian networks, ELPrivBayes, is proposed. It analyzes the theoretical computational cost of the Bayesian network structure in the learning stage, constructs a correlation matrix for storing Mutual Information(MI) between attributes, avoids redundant calculations of MI in the iterative process of structural learning algorithms, and reduces time complexity. Based on the Average MI(AMI), the order in which nodes enter the Bayesian network is optimized, and the expected mutual information contribution of the exponential mechanism in the iterative process of structural learning increases, thereby improving the statistical approximation between the generated and original datasets. The low sensitivity of the network structure quality to the selection of the first node is analyzed empirically. Experimental results on four typical datasets show that, compared with the classical PrivBayes algorithm and its improved solutions, the computational cost in the structural learning stage is reduced by 97%-99%, the MI captured based on the exponential mechanism is improved by 14%-67%, the average variation distance between the generated and original datasets is reduced by 32%-40%, and the accuracy of the constructed Support Vector Machine(SVM) classifier is improved by 4%-5%. Moreover, when ε≤0.8, the availability improvement of data generated using the ELPrivBayes algorithm is more significant.

Key words: data publishing, Bayesian network, differential privacy, privacy protection, correlation matrix, Average Mutual Information(AMI)

中图分类号: