Computer Engineering ›› 2025, Vol. 51 ›› Issue (12): 189-201. doi: 10.19678/j.issn.1000-3428.0069804

• Advanced Computing and Data Processing •

Compliance Detection of App Privacy Policies Based on the Personal Information Protection Law

SUN Wenqian, XU Tianchen, YU Peihou, CHEN Yunfang, ZHANG Wei*

  1. School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing 210023, Jiangsu, China
  • Received: 2024-04-29 Revised: 2024-06-25 Published: 2025-12-15 Online: 2024-08-19
  • Corresponding author: ZHANG Wei
  • Funding:
    National Natural Science Foundation of China (62202406)

Abstract:

Data privacy protection has become a focus of public concern, and countries and regions are progressively enacting relevant laws and regulations. However, because the privacy policies published for Apps tend to be long and highly specialized, automatically detecting whether they comply with these laws remains a pressing technical problem. Machine learning models, the mainstream approach, depend on label-annotated datasets, yet no such dataset of App privacy policies is currently available in China. Building on an analysis of prior work on compliance with the EU General Data Protection Regulation (GDPR), this study designs a labeling scheme suited to China's Personal Information Protection Law, comprising 15 requirement labels. A Web crawler is then used to collect the Chinese privacy policies of 363 Apps across 10 categories. These policies are segmented and annotated at the sentence level, yielding a Chinese privacy policy corpus of 104 134 sentences with their labels. ERNIE, Baidu's latest open-source pretrained language model, is trained and tested on the corpus, and the experimental results show that the scheme achieves a detection accuracy of 85.75%.

Key words: privacy policy, Personal Information Protection Law, compliance analysis, corpus, Natural Language Processing (NLP)
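
The sentence-level segmentation step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the segmentation rules the authors used, so the punctuation set, the function name `split_sentences`, and the sample text below are all assumptions made purely for illustration.

```python
import re

# Assumption: a sentence ends at one of these full-width Chinese punctuation
# marks. The rules used to build the actual corpus are not given in the abstract.
_SENTENCE = re.compile(r"[^。!?;]+[。!?;]?")


def split_sentences(policy_text):
    """Split a Chinese privacy policy into sentences (hypothetical helper).

    Each returned sentence keeps its terminating punctuation; surrounding
    whitespace and empty fragments are dropped.
    """
    pieces = _SENTENCE.findall(policy_text)
    return [p.strip() for p in pieces if p.strip()]


if __name__ == "__main__":
    sample = "我们会收集您的位置信息。您可以随时撤回同意!\n如有疑问,请联系我们"
    for sentence in split_sentences(sample):
        print(sentence)
```

Each resulting sentence would then be the unit to which one or more of the 15 requirement labels is attached before training a classifier.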