Computer Engineering (计算机工程)

An Intelligent Single-Cell Classification Method Based on Multi-source Domain Adaptation

  • Published: 2023-11-14

Abstract: Single-cell ribonucleic acid (RNA) sequencing has been used successfully to generate high-resolution cell atlases of human tissues and organs, deepening researchers' understanding of cellular heterogeneity in diseased human tissues. Cell annotation is a critical step in the analysis of single-cell RNA sequencing data. Many typical methods use a single labeled reference dataset to annotate the target dataset, but some cell types in the target dataset may be absent from that reference. Integrating multiple reference datasets covers the cell types in the target dataset better; however, batch effects exist between the reference datasets and the target dataset, caused by differences in sequencing technology and other factors. To address this, this paper proposes a single-cell classification model based on multi-source domain adaptation: each of several reference datasets with labeled cell types is trained adversarially against the unlabeled target dataset, which removes the batch effects. In addition, the authors use virtual adversarial training to make the model's predictions more robust to small local perturbations or noise around each data point and to prevent overfitting. In comparisons on multiple single-cell datasets, the proposed method improves cell identification accuracy by at least 5% over the current state-of-the-art methods. This offers a new option and reference for identifying cell types in newly sequenced single-cell data.
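
The virtual adversarial training step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a linear softmax classifier as a stand-in for the cell-type model, estimates the gradient by finite differences instead of backpropagation, and uses one power-iteration step. All function names and the parameters `eps`, `xi`, and `h` are hypothetical choices made for illustration.

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def predict(W, x):
    # Assumed stand-in model: one weight row per class, softmax over linear scores.
    return softmax([sum(w * v for w, v in zip(row, x)) for row in W])

def kl(p, q):
    # KL(p || q) between two discrete distributions.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def normalize(v):
    n = math.sqrt(sum(a * a for a in v)) or 1.0
    return [a / n for a in v]

def vat_perturbation(W, x, eps=0.5, xi=1e-2, h=1e-5, seed=0):
    """Approximate the norm-eps perturbation that changes the prediction most."""
    rng = random.Random(seed)
    p = predict(W, x)                       # prediction at the clean point
    d = normalize([rng.gauss(0.0, 1.0) for _ in x])
    r = [xi * di for di in d]               # small random probe

    def divergence(r_vec):
        return kl(p, predict(W, [a + b for a, b in zip(x, r_vec)]))

    # One power-iteration step: gradient of the divergence with respect to r,
    # evaluated at the probe by central finite differences.
    grad = []
    for j in range(len(x)):
        r_plus = [rv + (h if k == j else 0.0) for k, rv in enumerate(r)]
        r_minus = [rv - (h if k == j else 0.0) for k, rv in enumerate(r)]
        grad.append((divergence(r_plus) - divergence(r_minus)) / (2.0 * h))
    return [eps * g for g in normalize(grad)]

def vat_loss(W, x, eps=0.5):
    """Smoothness penalty: divergence between clean and perturbed predictions."""
    r_adv = vat_perturbation(W, x, eps=eps)
    return kl(predict(W, x), predict(W, [a + b for a, b in zip(x, r_adv)]))
```

In training, a penalty of this kind would be added to the classification loss for both labeled and unlabeled cells, since it needs no labels; how the paper weights or combines it with the adversarial domain loss is not stated in the abstract.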