作者投稿和查稿 主编审稿 专家审稿 编委审稿 远程编辑

计算机工程

• •    

基于贝叶斯网络的差分隐私高维数据发布技术研究

  • 发布日期:2023-11-14

Research on Differential Privacy High Dimensional Data Publishing Technology via Bayesian Networks

  • Published:2023-11-14

摘要: 在实现隐私保护的同时提高数据可用性是高维结构化数据发布研究中的一个挑战性问题,经典算法PrivBayes针对该问题提供了一种解决方案。为进一步减少计算开销、提高数据可用性,提出了基于贝叶斯网络的差分隐私数据发布算法ELPrivBayes。分析贝叶斯网络结构学习阶段的理论计算开销,构建存储属性之间互信息的相关矩阵,避免结构学习算法迭代过程中互信息的冗余计算,降低了时间复杂度。基于平均互信息优化了节点进入贝叶斯网络的顺序,提高结构学习迭代过程中指数机制贡献的互信息期望值,进而提高生成数据集与原始数据集的统计近似度,并实证分析了网络结构质量对首节点选择的低敏感性。在四个典型数据集上的实验结果表明:与经典算法PrivBayes及其改进方案相比较,结构学习阶段的计算开销降低了97%-99%,基于指数机制捕获的互信息提高了14%-67%,生成数据集与原始数据集的平均变差距离降低了32%-40%,构建的SVM分类器准确率提高了4%-5%,并且当ε≤0.8时,采用ELPrivBayes算法生成数据的可用性提升更为显著。

Abstract: Improving data availability while achieving privacy protection is a challenging problem in the research of high-dimensional structured data publishing. The classical algorithm PrivBayes provides a solution to this problem. In order to further reduce the computational overhead and improve the data availability, a differential privacy data publishing algorithm ELPrivBayes based on Bayesian network is proposed. The theoretical calculation cost of the Bayesian network structure learning stage is analyzed, and the correlation matrix of mutual information between attributes is constructed to avoid the redundant calculation of mutual information in the iterative process of the structure learning algorithm and reduce the time complexity. Based on the average mutual information, the order of nodes entering the Bayesian network is optimized, the expectation of mutual information contributed by the exponential mechanism in the iterative process of structural learning is improved, and the statistical approximation between the generated dataset and the original dataset is improved. The low sensitivity of the network structure quality to the first node selection is empirically analyzed. The experimental results on four typical datasets show that compared with the classical algorithm PrivBayes and its improved scheme, the computational cost of the structure learning phase is reduced by 97%-99%, the mutual information captured based on the exponential mechanism is increased by 14%-67%, the average variation distance between the generated dataset and the original dataset is reduced by 32%-40%, and the accuracy of the constructed SVM classifier is increased by 4%-5%. When ε≤0.8, the availability of data generated by ELPrivBayes algorithm is improved more significantly.