
Computer Engineering ›› 2024, Vol. 50 ›› Issue (6): 188-196. doi: 10.19678/j.issn.1000-3428.0067926

• Cyberspace Security •

Research on Data Sharing of Federated Semi-Supervised Learning with Non-IID Data

GU Yonggen, GAO Lingxuan, WU Xiaohong, TAO Jie   

  1. School of Information Engineering, Huzhou University, Huzhou 313000, Zhejiang, China
  • Received: 2023-06-25  Revised: 2023-09-13  Published: 2024-06-11
  • Corresponding author: TAO Jie, E-mail: taojie@zjhu.edu.cn
  • Funding: Supported by the Key Laboratory of Smart Management and Application of Modern Agricultural Resources of Zhejiang Province (2020E10017).

Abstract: Federated Learning (FL) is a distributed machine-learning method in which decentralized devices jointly train a shared model while keeping local data private. FL is typically performed when all data are labeled; in reality, however, the availability of labeled data cannot be guaranteed, which has motivated Federated Semi-Supervised Learning (FSSL). FSSL faces two major challenges: exploiting unlabeled data to improve system performance and mitigating the negative effects of data heterogeneity. For the scenario in which labeled data exist only on the server, a sharing-based method called Share&Mark, applicable to FSSL systems, is designed: data shared by the clients are annotated by experts and then participate in federated training. In addition, to fully exploit the shared data, a ServerLoss aggregation algorithm is proposed that dynamically adjusts each client model's weight during federated aggregation according to its loss on the server dataset. Experimental results under different sharing ratios are analyzed with respect to three factors: privacy sacrifice, communication overhead, and manual annotation cost. A sharing ratio of approximately 3% balances these factors; at this ratio, the Share&Mark method improves the accuracy of models trained by the FSSL system FedMatch by more than 8% on both the CIFAR-10 and Fashion-MNIST datasets, while also providing strong robustness.
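To make the two mechanisms described above concrete, the following is a minimal PyTorch sketch, not the authors' code: all function and variable names (select_shared_data, evaluate_loss, server_loss_aggregate, server_loader) are illustrative assumptions. It shows each client sharing a small fraction of its local data for expert labeling, and the server weighting client models during aggregation by their loss on its labeled dataset, as the ServerLoss idea prescribes.

    import copy
    import torch
    import torch.nn.functional as F

    # Illustrative sketch only: names and the exact loss-to-weight mapping
    # are assumptions, not taken from the paper's implementation.

    def select_shared_data(local_dataset, share_ratio=0.03):
        # Each client uploads a small random fraction of its local data
        # (the abstract reports ~3% as balancing privacy, communication,
        # and annotation cost); experts then label it on the server.
        n_share = max(1, int(len(local_dataset) * share_ratio))
        idx = torch.randperm(len(local_dataset))[:n_share].tolist()
        return torch.utils.data.Subset(local_dataset, idx)

    def evaluate_loss(model, server_loader, device="cpu"):
        # Average cross-entropy of one client model on the server's labeled set.
        model.eval()
        total, n = 0.0, 0
        with torch.no_grad():
            for x, y in server_loader:
                x, y = x.to(device), y.to(device)
                total += F.cross_entropy(model(x), y, reduction="sum").item()
                n += y.size(0)
        return total / max(n, 1)

    def server_loss_aggregate(client_models, server_loader, device="cpu"):
        # Weighted FedAvg: the smaller a client's loss on the server data,
        # the larger its share in the aggregated model. A softmax over
        # negative losses is one plausible mapping; the paper only states
        # that the weights are adjusted dynamically from these losses.
        losses = torch.tensor([evaluate_loss(m, server_loader, device)
                               for m in client_models])
        weights = torch.softmax(-losses, dim=0)
        global_state = copy.deepcopy(client_models[0].state_dict())
        for key in global_state:
            global_state[key] = sum(w * m.state_dict()[key].float()
                                    for w, m in zip(weights, client_models))
        return global_state, weights

The softmax over negative losses is only one way to turn server-side losses into aggregation weights; any normalized, monotonically decreasing mapping (for example, inverse-loss weights) would express the same ServerLoss principle.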

Key words: Federated Semi-Supervised Learning (FSSL), Federated Learning (FL), non-Independent and Identically Distributed (non-IID) data, robustness, aggregation algorithm, data sharing

CLC Number: