
计算机工程 (Computer Engineering)


Research on Automatic Naming of Source Code Methods Based on Information Retrieval

Published: 2023-10-30

Abstract: Automatic naming of source code methods refers to predicting a meaningful name that reflects the functionality of a given method body, making the code easier to read and understand and improving software development efficiency. Traditional naming approaches use only a single source of information, such as the lexical or syntactic information of the code, while deep learning-based approaches usually ignore similar code in the corpus; both limitations reduce the accuracy of method naming. To address these problems, this paper proposes an information retrieval-based approach to the automatic naming of source code methods. First, a pre-trained model and the BERT-whitening method are used to extract effective features of the input code and of the code in the corpus, and the semantic similarity between them is computed with the Euclidean distance. Second, the corpus code ranked highest in semantic similarity to the input code is selected as a candidate library, and the Jaccard index and the longest common subsequence are used to compute the lexical and syntactic similarity, respectively, between the input code and the candidate code. Finally, a weighted sum of these similarities is used to match the code snippet in the candidate library that is most similar to the input code, and the method name of that snippet is reused as the method name of the input code. Experimental results on the public Java-small dataset show that, compared with the traditional VSM approach and the deep learning model Code2vec, the F1 value of the proposed approach improves by 6.93% and 1.22%, respectively, demonstrating better predictive performance.
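To make the first stage of the pipeline concrete, the following Python sketch shows how embeddings produced by a pre-trained code encoder could be post-processed with BERT-whitening and then ranked by Euclidean distance to build the candidate library. This is only a minimal illustration under our own assumptions: the embedding source, the function names, the value of k, and the optional dimensionality reduction are not taken from the paper.

import numpy as np


def bert_whitening(corpus_embeddings, n_components=None):
    # Standard BERT-whitening: center the vectors, then rotate/scale with the
    # SVD of the covariance matrix so the embedding distribution becomes
    # approximately isotropic.
    mu = corpus_embeddings.mean(axis=0, keepdims=True)   # (1, d) mean vector
    cov = np.cov((corpus_embeddings - mu).T)             # (d, d) covariance
    u, s, _ = np.linalg.svd(cov)
    w = u @ np.diag(1.0 / np.sqrt(s))                    # whitening matrix
    if n_components is not None:
        w = w[:, :n_components]                          # optional dimension reduction
    return mu, w


def whiten(embeddings, mu, w):
    # Apply the whitening transform to a batch of embedding vectors.
    return (embeddings - mu) @ w


def top_k_semantic_candidates(query_vec, corpus_vecs, k=10):
    # Rank corpus snippets by Euclidean distance to the query;
    # a smaller distance means higher semantic similarity.
    dists = np.linalg.norm(corpus_vecs - query_vec, axis=1)
    return np.argsort(dists)[:k]

In this sketch the query and corpus vectors are assumed to come from the same encoder after whitening; the k nearest snippets returned here would form the candidate library for the lexical and syntactic matching stage.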
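The matching stage can be illustrated in the same spirit. The sketch below computes the Jaccard index over code token sets (lexical similarity), a longest-common-subsequence ratio over a serialized syntax sequence (syntactic similarity), and a weighted sum that selects the candidate whose method name is reused. The tokenization, the syntax serialization, and the equal weights w_lex = w_syn = 0.5 are illustrative assumptions, not the paper's settings.

def jaccard_index(tokens_a, tokens_b):
    # Lexical similarity: |A ∩ B| / |A ∪ B| over the two token sets.
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b) if (a or b) else 0.0


def lcs_ratio(seq_a, seq_b):
    # Syntactic similarity: longest common subsequence length,
    # normalized by the longer of the two sequences.
    m, n = len(seq_a), len(seq_b)
    if m == 0 or n == 0:
        return 0.0
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if seq_a[i - 1] == seq_b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n] / max(m, n)


def predict_method_name(query, candidates, w_lex=0.5, w_syn=0.5):
    # Each item is a dict with 'tokens' (lexical token list) and 'ast_seq'
    # (serialized syntax sequence); candidates also carry a 'name' field.
    # The candidate with the highest weighted similarity lends its method name.
    best_name, best_score = None, -1.0
    for cand in candidates:
        score = (w_lex * jaccard_index(query["tokens"], cand["tokens"])
                 + w_syn * lcs_ratio(query["ast_seq"], cand["ast_seq"]))
        if score > best_score:
            best_name, best_score = cand["name"], score
    return best_name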