
计算机工程 (Computer Engineering)


Research on Automatic Naming of Source Code Methods Based on Information Retrieval

Published: 2023-10-30

Abstract: Automatic naming of source code methods refers to predicting a meaningful name that reflects the functionality of a given method body, making the code easier to read and understand and improving software development efficiency. Traditional naming approaches use only a single source of information, such as the lexical or syntactic information of the code, while deep learning-based approaches usually ignore similar code in the corpus; both limitations reduce the accuracy of method naming. To address these problems, this paper proposes an information retrieval-based approach to the automatic naming of source code methods. First, a pre-trained model and the BERT-whitening method are used to extract effective features of the input code and of the code in the corpus, and the semantic similarity between them is computed with the Euclidean distance. Second, the corpus code ranked highest in semantic similarity to the input code is selected as a candidate library, and the Jaccard index and the longest common subsequence are used to compute the lexical and syntactic similarity, respectively, between the input code and the candidate code. Finally, a weighted sum of these similarities is used to match the code snippet in the candidate library that is most similar to the input code, and the method name of that snippet is reused as the method name of the input code. Experimental results on the public Java-small dataset show that, compared with the traditional VSM approach and the deep learning model Code2vec, the F1 value of the proposed approach improves by 6.93% and 1.22%, respectively, demonstrating better predictive performance.
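To make the first stage of the pipeline concrete, the following Python sketch shows how embeddings produced by a pre-trained code encoder could be post-processed with BERT-whitening and then ranked by Euclidean distance to build the candidate library. This is only a minimal illustration under our own assumptions: the embedding source, the function names, the value of k, and the optional dimensionality reduction are not taken from the paper.

import numpy as np


def bert_whitening(corpus_embeddings, n_components=None):
    # Standard BERT-whitening: center the vectors, then rotate/scale with the
    # SVD of the covariance matrix so the embedding distribution becomes
    # approximately isotropic.
    mu = corpus_embeddings.mean(axis=0, keepdims=True)   # (1, d) mean vector
    cov = np.cov((corpus_embeddings - mu).T)             # (d, d) covariance
    u, s, _ = np.linalg.svd(cov)
    w = u @ np.diag(1.0 / np.sqrt(s))                    # whitening matrix
    if n_components is not None:
        w = w[:, :n_components]                          # optional dimension reduction
    return mu, w


def whiten(embeddings, mu, w):
    # Apply the whitening transform to a batch of embedding vectors.
    return (embeddings - mu) @ w


def top_k_semantic_candidates(query_vec, corpus_vecs, k=10):
    # Rank corpus snippets by Euclidean distance to the query;
    # a smaller distance means higher semantic similarity.
    dists = np.linalg.norm(corpus_vecs - query_vec, axis=1)
    return np.argsort(dists)[:k]

In this sketch the query and corpus vectors are assumed to come from the same encoder after whitening; the k nearest snippets returned here would form the candidate library for the lexical and syntactic matching stage.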
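The matching stage can be illustrated in the same spirit. The sketch below computes the Jaccard index over code token sets (lexical similarity), a longest-common-subsequence ratio over a serialized syntax sequence (syntactic similarity), and a weighted sum that selects the candidate whose method name is reused. The tokenization, the syntax serialization, and the equal weights w_lex = w_syn = 0.5 are illustrative assumptions, not the paper's settings.

def jaccard_index(tokens_a, tokens_b):
    # Lexical similarity: |A ∩ B| / |A ∪ B| over the two token sets.
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b) if (a or b) else 0.0


def lcs_ratio(seq_a, seq_b):
    # Syntactic similarity: longest common subsequence length,
    # normalized by the longer of the two sequences.
    m, n = len(seq_a), len(seq_b)
    if m == 0 or n == 0:
        return 0.0
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if seq_a[i - 1] == seq_b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n] / max(m, n)


def predict_method_name(query, candidates, w_lex=0.5, w_syn=0.5):
    # Each item is a dict with 'tokens' (lexical token list) and 'ast_seq'
    # (serialized syntax sequence); candidates also carry a 'name' field.
    # The candidate with the highest weighted similarity lends its method name.
    best_name, best_score = None, -1.0
    for cand in candidates:
        score = (w_lex * jaccard_index(query["tokens"], cand["tokens"])
                 + w_syn * lcs_ratio(query["ast_seq"], cand["ast_seq"]))
        if score > best_score:
            best_name, best_score = cand["name"], score
    return best_name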