
Computer Engineering ›› 2023, Vol. 49 ›› Issue (6): 292-299, 313. doi: 10.19678/j.issn.1000-3428.0064892

• Development Research and Engineering Application •

Research on Uyghur Text Classification Based on Prompt Learning

ZHANG Boxu, PU Zhi, CHENG Xi

  1. School of Computer and Information Engineering, Xinjiang Agricultural University, Urumqi 830000, China
  • Received: 2022-06-02 Revised: 2022-07-29 Published: 2022-09-30
  • About the authors: ZHANG Boxu (b. 1998), male, M.S. candidate, whose main research interest is natural language processing; PU Zhi (corresponding author), associate professor, Ph.D.; CHENG Xi, lecturer, Ph.D.
  • Funding: National Natural Science Foundation of China (62161048).


Abstract: Uyghur is a low-resource, agglutinative language, and existing Uyghur text classification methods lack a sufficient corpus for training a Uyghur pre-training model; as a result, effective sentence-vector information cannot be extracted for Uyghur from pre-training models. Existing text classification methods use deep learning models to extract word vectors; however, Uyghur's inherently sparse, high-dimensional features cause these methods to underperform on text classification. To address this, a Uyghur text classification method based on prompt learning is proposed. Building on prompt learning, the multilingual pre-training model Cino is used to construct varied templates, and the model's mask-prediction ability is employed to predict different masked positions. To avoid the diversity of the lexical information predicted at the mask, the word vector at the position masked by the template stands in for the whole sentence vector: through the mask model's predictive ability, the semantic information of the current sentence is represented by a vector of limited size. This brings the downstream task closer to the model's pre-training task, reducing the impact of the mismatch between the two during fine-tuning. Text classification experiments on a news dataset built by crawling Uyghur websites show that, compared with the fine-tuned Cino pre-training model, the Cino model that incorporates prompt learning achieves an F1 value of up to 92.53% and improves precision and recall by 1.79 and 1.04 percentage points, respectively, demonstrating better Uyghur text classification performance.

Key words: text classification, Uyghur language, prompt learning, pre-training model, deep learning
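The cloze-style prompting described in the abstract — predicting a label word at a masked slot and letting the vector at that masked position stand in for the sentence representation — can be sketched as follows. The template wording, label words, and toy vectors here are hypothetical illustrations, not the paper's actual templates; in the paper's setting, the hidden vector at the mask would come from the Cino masked language model.

```python
# Minimal sketch of prompt-based classification at a masked position.
# Hypothetical: in practice hidden_at_mask would be the Cino encoder's
# hidden state at the [MASK] token, and label_embeddings the embeddings
# of the verbalizer's label words.

MASK = "[MASK]"

def build_prompt(text):
    # Cloze template: the model must fill in the masked category slot.
    return f"{text} This news is about {MASK}."

LABEL_WORDS = ["sports", "politics", "economy"]  # hypothetical verbalizer

def classify(hidden_at_mask, label_embeddings):
    # Only the vector at the masked position represents the sentence;
    # the predicted class is the label word with the highest dot-product score.
    scores = [sum(h * w for h, w in zip(hidden_at_mask, row))
              for row in label_embeddings]
    return scores.index(max(scores))

# Toy usage: identity "embeddings" make the scores easy to read off.
prompt = build_prompt("The team won the final match.")
pred = classify([0.9, 0.1, 0.2], [[1, 0, 0], [0, 1, 0], [0, 0, 1]])
print(LABEL_WORDS[pred])  # -> sports
```

Because the mask-filling step mirrors the masked-language-modeling objective Cino was pre-trained on, the downstream classifier reuses the pre-training task rather than bolting a new head onto a pooled sentence vector, which is the alignment benefit the abstract describes.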

CLC number: