TGMM: Combining Parse Tree with GPU for Scalable Multilingual and Multi-Granularity Code Clone Detection

doi:10.19678/j.issn.1000-3428.0252041

Abstract

To address the limitations of existing code clone detection tools in multi-language adaptation and large-scale clone analysis, this paper proposes TGMM, a large-scale code clone detection method based on parse trees and graphics processing unit (GPU) acceleration. The method analyzes clones in three stages. First, it generates standardized parse trees according to the lexical and syntactic rules of each programming language and extracts the subtrees that meet a specified granularity requirement. Second, it simplifies the subtrees through pruning and eliminates non-functional differences through semantically equivalent transformations. Finally, it builds a global suffix array in parallel on the GPU to compute the similarity of large numbers of code blocks quickly. TGMM is evaluated along two dimensions: clone detection performance and language extensibility. On the public benchmark BigCloneBench, TGMM achieves a precision of 97%, significantly higher than the seven mainstream tools it is compared with, and its average execution time is more than 50% shorter than that of the second-best tool, while its recall on each clone type remains comparable to that of the compared tools. In the language extensibility test, TGMM successfully parses 25 of 30 mainstream programming languages. In addition, TGMM is applied to a multi-granularity clone analysis of the top 45 GitHub projects (covering 9 programming languages), which reveals for the first time significant differences in clone density across languages; the underlying causes are analyzed in detail, providing a practical reference for software maintenance.
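The normalization and suffix-array stages described in the abstract can be illustrated with a small CPU-only sketch. It is not the TGMM implementation: a regular-expression tokenizer stands in for parse-tree extraction, the suffix array is built by naive sorting rather than in parallel on a GPU, and the "similarity" reported is simply the longest shared run of normalized tokens. The names `normalize`, `build_suffix_array`, `find_clone_pairs`, the keyword set, and the `min_len` threshold are all assumptions made for illustration.

```python
import re

# Hypothetical keyword set; a real front end would take this from the grammar.
KEYWORDS = {"int", "return", "for", "if", "else", "while"}
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def normalize(code):
    """Tokenize and replace identifiers and literals with placeholders,
    a rough stand-in for removing non-functional differences."""
    out = []
    for tok in TOKEN_RE.findall(code):
        if tok in KEYWORDS:
            out.append(tok)
        elif tok[0].isalpha() or tok[0] == "_":
            out.append("ID")
        elif tok[0].isdigit():
            out.append("LIT")
        else:
            out.append(tok)
    return out


def build_suffix_array(seq):
    """Naive suffix array: sort suffix start positions by suffix content."""
    return sorted(range(len(seq)), key=lambda i: seq[i:])


def common_prefix_len(seq, i, j):
    """Length of the common prefix of the suffixes starting at i and j."""
    n = 0
    while i + n < len(seq) and j + n < len(seq) and seq[i + n] == seq[j + n]:
        n += 1
    return n


def find_clone_pairs(blocks, min_len=5):
    """Return {(block_a, block_b): longest shared token run >= min_len}."""
    seq, owner = [], []
    for k, code in enumerate(blocks):
        # A unique sentinel per block keeps matches from crossing blocks.
        toks = normalize(code) + [f"\x00{k}"]
        seq.extend(toks)
        owner.extend([k] * len(toks))

    sa = build_suffix_array(seq)
    # lcp[t] is the common prefix length of suffixes sa[t] and sa[t + 1].
    lcp = [common_prefix_len(seq, sa[t], sa[t + 1]) for t in range(len(sa) - 1)]

    pairs = {}
    for r in range(len(sa)):
        run = None
        for s in range(r + 1, len(sa)):
            # The LCP of sa[r] and sa[s] is the minimum of lcp[r..s-1].
            run = lcp[s - 1] if run is None else min(run, lcp[s - 1])
            if run < min_len:
                break
            a, b = owner[sa[r]], owner[sa[s]]
            if a != b:
                key = (min(a, b), max(a, b))
                pairs[key] = max(pairs.get(key, 0), run)
    return pairs


if __name__ == "__main__":
    blocks = [
        "int sum(int a, int b) { int c = a + b; return c; }",
        "int add(int x, int y) { int z = x + y; return z; }",
        "while (n > 0) { n = n - 1; }",
    ]
    print(find_clone_pairs(blocks))
```

In this sketch the first two blocks differ only in identifier names, so they normalize to the same token sequence and are reported as a pair, while the third block shares only short runs with them and falls below the threshold. The scan over neighbouring suffixes is quadratic in the worst case; the parallel construction mentioned in the abstract addresses exactly this scaling problem.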

YE Yuhang, REN Xiaoning, WU Yueming. TGMM: Combining Parse Tree with GPU for Scalable Multilingual and Multi-Granularity Code Clone Detection[J]. Computer Engineering, doi: 10.19678/j.issn.1000-3428.0252041.

