Computer Engineering, 2022, Vol. 48, Issue (10): 123-129. doi: 10.19678/j.issn.1000-3428.0062524

• Artificial Intelligence and Pattern Recognition •

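  • About the authors: CHEN Mingjie (b. 1999), male, master's student, whose main research interests are machine learning and causal inference; ZHANG Hao (corresponding author), lecturer, Ph.D.; PENG Yuzhong, professor, Ph.D.; XIE Feng and PANG Yue, postdoctoral researchers.
  • Funding:
    National Natural Science Foundation of China (62006051); China Postdoctoral Science Foundation (2020M680225); Guangdong Provincial Young Innovative Talents Project for Universities (2020KQNCX049).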

Recursive Causal Inference Algorithm Based on Partial Correlation Test

CHEN Mingjie1,2, ZHANG Hao2,3, PENG Yuzhong4, XIE Feng5, PANG Yue3,6   

  1. School of Computer Science and Technology, Dongguan University of Technology, Dongguan, Guangdong 523808, China;
    2. School of Computer, Guangdong University of Petrochemical Technology, Maoming, Guangdong 525099, China;
    3. School of Computer Science, Fudan University, Shanghai 200433, China;
    4. School of Computer and Information Engineering, Nanning Normal University, Nanning 530001, China;
    5. School of Mathematical Sciences, Peking University, Beijing 100871, China;
    6. China UnionPay Post-Doctoral Research Station, Shanghai 201201, China
  • Received: 2021-08-28 Revised: 2021-10-29 Published: 2021-11-05

Abstract: Causal inference is an important way to uncover relationships in observed data. In high-dimensional settings, however, the conditional independence (CI) tests run by causal inference algorithms involve many redundant tests and are inefficient, which limits the application of causal inference to high-dimensional datasets. This study proposes a recursive causal inference algorithm based on partial correlation tests. A divide-and-conquer strategy recursively partitions the variable set by causal relationships into lower-dimensional sub-datasets that are easier to handle, which improves processing efficiency. Local causal inference is then performed on each sub-dataset, reducing the computation required by each inference and speeding up the algorithm. The sub-results are integrated by a merging strategy that compares significance values, yielding the complete causal structure while preserving its overall accuracy. Throughout the divide-and-conquer process, an efficient partial correlation test is used in place of computationally expensive kernel density estimation, which further improves efficiency. Experiments on ten classical datasets show that, with accuracy on par with the classical inference algorithm CAPA, the proposed algorithm runs 2 to 10 times faster, and the speedup grows with the sample size. These results indicate that the recursive causal inference algorithm can handle high-dimensional datasets effectively, improving computational efficiency while maintaining accuracy.

Key words: causal inference, causal network, conditional independence (CI) test, partial correlation test, recursive algorithm
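
The abstract names a partial correlation test as the CI test but does not give its exact form. As a minimal sketch, assuming linear-Gaussian data and the standard Fisher z-transform (an assumption, not necessarily the authors' statistic), such a test could look like the following. The function name `partial_corr_test` and the simulated chain X → Z → Y are hypothetical and chosen only for illustration.

```python
import math
import numpy as np


def partial_corr_test(data, i, j, cond=()):
    """Test X_i independent of X_j given X_cond via partial correlation.

    Assumes roughly linear-Gaussian data (illustrative assumption).
    Returns (partial correlation, two-sided p-value from Fisher's z).
    """
    idx = [i, j, *cond]
    n = data.shape[0]
    # Correlation matrix of the variables involved in this test only.
    corr = np.corrcoef(data[:, idx], rowvar=False)
    # The partial correlation is read off the precision (inverse) matrix.
    prec = np.linalg.pinv(corr)
    r = -prec[0, 1] / math.sqrt(prec[0, 0] * prec[1, 1])
    r = max(min(r, 0.999999), -0.999999)  # keep the log below finite
    # Fisher z-transform; approximately normal under the null hypothesis.
    z = 0.5 * math.log((1 + r) / (1 - r))
    stat = math.sqrt(n - len(cond) - 3) * abs(z)
    # Two-sided normal tail: 2 * (1 - Phi(stat)) == erfc(stat / sqrt(2)).
    p_value = math.erfc(stat / math.sqrt(2))
    return r, p_value


# Hypothetical data: a causal chain X -> Z -> Y.
rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
z = 0.8 * x + rng.normal(size=n)
y = 0.8 * z + rng.normal(size=n)
data = np.column_stack([x, y, z])  # columns: 0 = X, 1 = Y, 2 = Z

r_marg, p_marg = partial_corr_test(data, 0, 1)            # X vs Y
r_cond, p_cond = partial_corr_test(data, 0, 1, cond=(2,))  # X vs Y given Z
print(f"marginal:    r = {r_marg:.3f}, p = {p_marg:.3g}")
print(f"given Z:     r = {r_cond:.3f}, p = {p_cond:.3g}")
```

In this simulated chain, X and Y are strongly correlated marginally, while their partial correlation given Z is close to zero, which is the pattern a CI test uses to remove the X-Y edge. The test needs only a small matrix inversion per query, which is consistent with the abstract's point that it avoids kernel density estimation.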

CLC number: