基于提示学习的维吾尔语文本分类研究

doi:10.19678/j.issn.1000-3428.0064892

摘要/Abstract

摘要： 维吾尔语属于低资源语言和黏着性语言，现有维吾尔语文本分类方法缺少足够的语料来训练维吾尔语预训练模型。因此，维吾尔语无法基于预训练模型提取有效的句向量信息。现有的文本分类方法利用深度学习模型提取词向量，然而，维吾尔语具有特征稀疏且维度偏高的特点，使得其在文本分类上的效果较差。为此，提出基于提示学习的维吾尔语文本分类方法。基于提示学习，采用多语言预训练模型Cino构造不同的模板，利用模型的掩码预测能力对不同的掩码位置进行预测。为避免掩码预测的词汇信息具有多样性，将模板掩盖掉的词向量代替整体的句向量，利用掩码模型的预测能力，以有限大小的向量表示当前句子的语义信息，将下游任务靠近模型的预训练任务，减少在微调阶段两者不同所造成的影响。在爬取维吾尔语网站所构建新闻数据集上进行的文本分类实验结果表明，相比Cino微调预训练模型，融合提示学习的Cino模型的F1值最高可达到92.53%，精准率和召回率分别提升了1.79、1.04个百分点，具有更优的维吾尔语文本分类效果。

关键词: 文本分类, 维吾尔语, 提示学习, 预训练模型, 深度学习

Abstract: Uyghur，a low-resource and agglutinative language，suffers from insufficient corpus for training pre-existing Uyghur models.This lack hinders the extraction of effective sentence vector information based on pre-training models.Current text classification methods utilize deep learning models to extract word vectors.However，due to the Uyghur language's inherent sparse features and high dimensionality，these methods underperform in text classification tasks.As a response，a Uyghur text classification method based on prompt learning is introduced.Leveraging prompt learning，this paper utilize a multilingual pre-training model，Cino，to construct varied templates and employ the model's mask prediction ability for predicting different mask positions.To counteract the diversity of lexical information predicted by the mask，the word vector masked by the template is replaced by the entire sentence vector.The predictive ability of the mask model is then used to represent the current sentence's semantic information with a finite size vector.This approach aligns downstream tasks more closely with the model's pre-training tasks，thereby minimizing discrepancies during the fine-tuning stage.Text classification experiments conducted on news datasets，derived from crawled Uyghur websites，demonstrate superior classification performance in Uyghur language texts.Compared to the Cino fine-tuning pre-training model，the fusion prompt learning Cino model yielded the highest F1 value of 92.53%，enhancing accuracy and recall rates by 1.79 and 1.04 percentage points，respectively.

Key words: text classification, Uyghur language, prompt learning, pre-training model, deep learning

中图分类号:

TP391

张博旭, 蒲智, 程曦. 基于提示学习的维吾尔语文本分类研究[J]. 计算机工程, 2023, 49(6): 292-299,313.

ZHANG Boxu, PU Zhi, CHENG Xi. Research on Uyghur Text Classification Based on Prompt Learning[J]. Computer Engineering, 2023, 49(6): 292-299,313.

https://www.ecice06.com/CN/Y2023/V49/I6/292

图/表 10

20230615171016

20230615171020

20230615171024

20230615171028

20230615171032

20230615171035

20230615171038

20230615171044

20230615171047

20230615171050

参考文献

[1] DEVLIN J,CHANG M W,LEE K,et al.BERT:pre-training of deep bidirectional transformers for language understanding[EB/OL].[2022-04-28].https://arxiv.org/pdf/1810.04805.pdf.
[2] 吐尔地·托合提,维尼拉·木沙江,艾斯卡尔·艾木都拉.基于语义串抽取及主题相似度度量的维吾尔文文本分类[J].中文信息学报,2017,31(4):100-107.Turdi Tohti,Winira Musajan,Askar Hamdulla.Semantic string-based topic similarity measuring approach for uyghur text classification[J].Journal of Chinese Information Processing,2017,31(4):100-107.(in Chinese)
[3] 阿力甫·阿不都克里木,李晓.基于TextRank算法和互信息相似度的维吾尔文关键词提取及文本分类[J].计算机科学,2016,43(12):36-40.Ghalip Abdukerim,LI X.Uyghur keyword extraction and text classification based on TextRank algorithm and mutual information similarity[J].Computer Science,2016,43(12):36-40.(in Chinese)
[4] YANG Z Q,XU Z H,CUI Y M,et al.Cino:a Chinese minority pre-trained language model[EB/OL].[2022-04-28].https://arxiv.org/abs/2202.13558.
[5] LAMPLE G,CONNEAU A.Cross-lingual language model pretraining[EB/OL].[2022-04-28].https://arxiv.org/pdf/1901.07291.pdf.
[6] LIU Y H,OTT M,GOYAL N,et al.RoBERTa:a robustly optimized BERT pretraining approach[EB/OL].[2022-04-28].https://arxiv.org/abs/1907.11692.
[7] CONNEAU A,KHANDELWAL K,GOYAL N,et al.Unsupervised cross-lingual representation learning at scale[EB/OL].[2022-04-28].https://arxiv.org/abs/1911.02116v2.
[8] LIU Y H,GU J T,GOYAL N,et al.Multilingual denoising pre-training for neural machine translation[J].Transactions of the Association for Computational Linguistics,2020,8:726-742.
[9] LEWIS M,LIU Y H,GOYAL N,et al.BART:denoising sequence-to-sequence pre-training for natural language generation,translation,and comprehension[EB/OL].[2022-04-28].https://arxiv.org/abs/1910.13461.
[10] WU S J,DREDZE M.Are all languages created equal in multilingual BERT?[EB/OL].[2022-04-28].https://arxiv.org/abs/2005.09093.
[11] BROWN T B,MANN B,RYDER N,et al.Language models are few-shot learners[J].Advances in Neural Information Processing Systems,2020,33:1877-1901.
[12] SCHICK T,SCHÜTZE H.It's not just size that matters:small language models are also few-shot learners[EB/OL].[2022-04-28].https://arxiv.org/abs/2009.07118?utm_medium=email&_hsenc=p2ANqtz-_QwAkpWYd5cbmMTX5gb9_GYEBsWkI_vi0WyIti1i3vzXI7Qw0zTGiLe6VfcuW-v15PRAlZ.
[13] ZHANG S H,HUANG H R,LIU J C,et al.Spelling error correction with soft-masked BERT[EB/OL].[2022-04-28].https://arxiv.org/abs/2005.07421v1.
[14] LIU P,YUAN W,FU J,et al.Pre-train,prompt,and predict:a systematic survey of prompting methods in natural language processing[EB/OL].[2022-04-28].https://arxiv.org/abs/2107.13586.
[15] QI K,WAN H,DU J,et al.Enhancing cross-lingual natural language inference by prompt-learning from cross-lingual templates[C]//Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics.Stroudsburg,USA:Association for Computational Linguistics,2022:1910-1923.
[16] GRIVAS A,BOGOYCHEV N,LOPEZ A.Low-rank softmax can have unargmaxable classes in theory but rarely in practice[EB/OL].[2022-04-28].https://arxiv.org/abs/2203.06462v2.
[17] GAO T Y,FISCH A,CHEN D.Making pre-trained language models better few-shot learners[EB/OL].[2022-04-28].https://arxiv.org/abs/2012.15723.
[18] JIANG Z B,XU F F,ARAKI J,et al.How can we know what language models know?[J].Transactions of the Association for Computational Linguistics,2020,8:423-438.
[19] DAVISON J,FELDMAN J,RUSH A.Commonsense knowledge mining from pretrained models[C]//Proceedings of Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing.Stroudsburg,USA:Association for Computational Linguistics,2019:1173-1178.
[20] LIU X,JI K X,FU Y C,et al.P-Tuning v2:prompt tuning can be comparable to fine-tuning universally across scales and tasks[EB/OL].[2022-04-28].https://arxiv.org/abs/2110.07602v2.
[21] SHIN T,RAZEGHI Y,LOGAN IV R L,et al.Autoprompt:eliciting knowledge from language models with automatically generated prompts[EB/OL].[2022-04-28].https://arxiv.org/abs/2010.15980v1.
[22] 沙尔旦尔·帕尔哈提,米吉提·阿不里米提,艾斯卡尔·艾木都拉.基于稳健词素序列和LSTM的维吾尔语短文本分类[J].中文信息学报,2020,34(1):63-70.Sardar Parhat,Mijit Ablimit,Askar Hamdulla.Uyghur short text classification based on robust morpheme sequence and LSTM[J].Journal of Chinese Information Processing,2020,34(1):63-70.(in Chinese)
[23] LIU X,ZHENG Y N,DU Z X,et al.GPT understands,too[EB/OL].[2022-04-28].https://arxiv.org/abs/2103.10385v1.
[24] 加米拉·吾守尔,吴迪,王路路,等.基于多卷积核DPCNN的维吾尔语文本分类联合模型[J].中文信息学报,2021,35(7):63-71.Jiamila Wushouer,WU D,WANG L L,et al.Uyghur text categorization joint model based on multi-convolution kernel DPCNN[J].Journal of Chinese Information Processing,2021,35(7):63-71.(in Chinese)
[25] YOON K.Convolutional neural networks for sentence classification[EB/OL].[2022-04-28].http://de.arxiv.org/pdf/1408.5882.
[26] XU H W,CHEN Y J,DU Y L,et al.Zeroprompt:scaling prompt-based pretraining to 1,000 tasks improves zero-shot generalization[EB/OL].[2022-04-28].https://arxiv.org/abs/2201.06910.

选择文件类型/文献管理软件名称

选择包含的内容