
Computer Engineering ›› 2024, Vol. 50 ›› Issue (6): 48-55. doi: 10.19678/j.issn.1000-3428.0067378

• Hot Topics and Reviews •

Intelligent Single-Cell Classification Based on Multisource Domain Adaptation

WEI Zhuoyi (魏琢艺)1,2, LUO Mai (罗迈)1,2, LI Wenbing (李文兵)1,2, ZENG Yuansong (曾远松)1,2, YU Weijiang (余伟江)1,2, YANG Yuedong (杨跃东)1,2

  1. School of Computer Science and Engineering, Sun Yat-Sen University, Guangzhou 510000, Guangdong, China;
  2. National Supercomputer Center in Guangzhou, Sun Yat-Sen University, Guangzhou 510000, Guangdong, China
  • Received: 2023-04-10 Revised: 2023-10-15 Published: 2023-11-14
  • Corresponding author: WEI Zhuoyi, E-mail: weizhy8@mail2.sysu.edu.cn
  • Funding: National Key Research and Development Program of China (2022YFF1203100); National Natural Science Foundation of China (12126610).

Abstract: Single-cell ribonucleic acid (RNA) sequencing has been used successfully to build high-resolution cell atlases of human tissues and organs, deepening researchers' understanding of cellular heterogeneity in diseased human tissues. Cell annotation is a crucial step in single-cell RNA sequencing data analysis. Many typical models annotate a target dataset using a single labeled reference dataset, but some cell types in the target dataset may be absent from that reference. Integrating multiple reference datasets covers the target's cell types more fully; however, batch effects arise between the reference datasets and the target dataset because of differences in sequencing technology and other factors. To address this, the study proposes a single-cell classification model based on multisource domain adaptation. The model adversarially trains each cell-type-annotated reference dataset against the unlabeled target dataset, thereby mitigating batch effects. Virtual adversarial training is additionally applied to make the model's predictions more robust to small local perturbations or noise around each data point, which helps prevent overfitting. Experiments on multiple single-cell datasets show that the model improves cell identification accuracy by at least 5 percentage points over current mainstream models, offering a new option and reference for identifying newly sequenced single cells.

Key words: single-cell ribonucleic acid (RNA) sequencing, single-cell classification, multisource domain adaptation, adversarial training, deep learning
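
The virtual adversarial training step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses a toy linear softmax classifier as a stand-in for the real network, and the names `predict` and `vat_loss` and the values of `eps` and `xi` are assumptions made here for illustration. It follows the usual formulation of virtual adversarial training, in which one power-iteration step estimates the small perturbation that changes the prediction most, and the KL divergence between the clean and perturbed predictions serves as a smoothness penalty.

```python
import math
import random


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def predict(W, b, x):
    """Class probabilities of a linear softmax classifier (hypothetical stand-in model)."""
    logits = [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]
    return softmax(logits)


def kl(p, q):
    """KL divergence KL(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def unit(v):
    """Scale v to unit Euclidean norm (a zero vector is returned unchanged)."""
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v] if n > 0 else v


def vat_loss(W, b, x, eps=0.5, xi=1e-3, n_power=1, rng=None):
    """Return (smoothness penalty, adversarial perturbation) for one unlabeled sample x.

    eps is the perturbation radius and xi the probe size; both values are assumptions.
    """
    rng = rng or random.Random(0)
    p = predict(W, b, x)  # clean prediction, treated as a fixed target
    d = unit([rng.gauss(0.0, 1.0) for _ in x])  # random starting direction
    for _ in range(n_power):
        q = predict(W, b, [xj + xi * dj for xj, dj in zip(x, d)])
        # For a linear softmax model, the gradient of KL(p || q) with respect
        # to the perturbation is W^T (q - p).
        g = [sum(W[k][j] * (q[k] - p[k]) for k in range(len(W))) for j in range(len(x))]
        d = unit(g)
    r_adv = [eps * dj for dj in d]
    q_adv = predict(W, b, [xj + rj for xj, rj in zip(x, r_adv)])
    return kl(p, q_adv), r_adv


if __name__ == "__main__":
    # Toy weights and a toy 3-gene expression vector; values are made up.
    W = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.0], [0.1, 0.1, -0.6]]
    b = [0.0, 0.1, -0.1]
    x = [1.0, 2.0, 0.5]
    loss, r_adv = vat_loss(W, b, x)
    print("VAT penalty:", loss)
    print("adversarial perturbation:", r_adv)
```

Because the penalty needs no labels, it can be computed on unannotated target cells as well as on reference cells. With a deep network the analytic gradient used here would be replaced by automatic differentiation.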
