
Computer Engineering (计算机工程)


TGMM: Combining Parse Tree with GPU for Scalable Multilingual and Multi-Granularity Code Clone Detection

  • Published: 2025-05-09

Abstract: Existing code clone detection tools fall short in multi-language support and large-scale clone analysis. To address this, the paper proposes TGMM, a large-scale code clone detection method based on parse trees and graphics processing unit (GPU) acceleration. TGMM analyzes clones in three stages. First, it generates a standardized parse tree from each programming language's lexical and syntactic rules and extracts the subtrees that meet a given granularity requirement. Second, it prunes the subtrees and applies semantic-equivalence transformations, which simplifies them and eliminates non-functional differences. Finally, it builds a global suffix array in parallel on the GPU to compute the similarity of code blocks quickly at scale. TGMM is evaluated along two dimensions: clone detection performance and language extensibility. On the public benchmark BigCloneBench, TGMM reaches a precision of 97%, significantly higher than the seven mainstream tools it is compared with, and its average execution time is more than 50% shorter than that of the second-best tool, while its recall on each clone type remains comparable to theirs. In the language extensibility test, TGMM successfully parses 25 of 30 mainstream programming languages. In addition, a multi-granularity clone analysis of the top 45 GitHub projects (covering 9 programming languages) reveals, for the first time, significant differences in clone density across languages; the paper analyzes the underlying causes in detail, providing a practical reference for software maintenance.
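To make the last two stages concrete, here is a minimal CPU-only sketch in Python of the general idea: normalize code into token sequences, join them into one global sequence, and use a suffix array to find blocks that share a long token run. It is not the authors' implementation. The regex tokenizer stands in for parse-tree extraction and pruning, the suffix array is built by a naive sort rather than in parallel on a GPU, and every name here (`normalize`, `find_clone_pairs`, `KEYWORDS`, `min_len`) is a hypothetical chosen for illustration.

```python
import re

# Hypothetical keyword set; a real tool would take this from each language's grammar.
KEYWORDS = {"if", "else", "for", "while", "return", "int"}


def normalize(code):
    """Toy stand-in for subtree normalization: identifiers -> ID, numbers -> NUM."""
    out = []
    for tok in re.findall(r"[A-Za-z_]\w*|\d+|\S", code):
        if tok in KEYWORDS:
            out.append(tok)
        elif re.fullmatch(r"[A-Za-z_]\w*", tok):
            out.append("ID")
        elif tok.isdigit():
            out.append("NUM")
        else:
            out.append(tok)
    return out


def build_suffix_array(seq):
    """Naive suffix array: sort start positions by the suffix they begin."""
    return sorted(range(len(seq)), key=lambda i: seq[i:])


def common_prefix_len(seq, i, j):
    """Length of the longest common prefix of the suffixes starting at i and j."""
    n = 0
    while i + n < len(seq) and j + n < len(seq) and seq[i + n] == seq[j + n]:
        n += 1
    return n


def _emit(group, owner, pairs):
    """Record every pair of distinct blocks that owns a suffix in this group."""
    names = sorted({owner[i] for i in group if owner[i] is not None})
    for k, a in enumerate(names):
        for b in names[k + 1:]:
            pairs.add((a, b))


def find_clone_pairs(blocks, min_len=3):
    """Return pairs of block names sharing a run of at least min_len tokens."""
    vocab, seq, owner = {}, [], []
    for idx, (name, tokens) in enumerate(blocks.items()):
        for tok in tokens:
            seq.append(vocab.setdefault(tok, len(vocab)))  # token -> integer id
            owner.append(name)
        seq.append(-(idx + 1))  # unique separator, so matches never span two blocks
        owner.append(None)

    sa = build_suffix_array(seq)
    pairs = set()
    group = [sa[0]] if sa else []
    # Neighbouring suffixes whose common prefix is >= min_len form one group;
    # all suffixes in a group start with the same min_len tokens.
    for prev, cur in zip(sa, sa[1:]):
        if common_prefix_len(seq, prev, cur) >= min_len:
            group.append(cur)
        else:
            _emit(group, owner, pairs)
            group = [cur]
    _emit(group, owner, pairs)
    return pairs


if __name__ == "__main__":
    demo = {
        "a": normalize("int sum = x + 1; return sum;"),
        "b": normalize("int total = y + 2; return total;"),
        "c": normalize("while (n) { n = n - 1; }"),
    }
    print(find_clone_pairs(demo, min_len=5))
```

In the demo, blocks `a` and `b` differ only in identifier names and literal values, so after normalization they are identical and are reported as a clone pair, while `c` shares no run of five tokens with either. The naive sort is quadratic or worse in the worst case and is only suitable for small inputs.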