Classification of Harmful Information on Internet Based on Long-Tailed Classification Algorithm

doi:10.19678/j.issn.1000-3428.0067003

Abstract

Most existing methods for classifying harmful information on the Internet overlook data imbalance and long-tailed distributions, so the model is biased towards classes with many samples and fails to identify classes with few samples, which lowers overall recognition accuracy. To address this issue, LTIC, a classification method for long-tailed harmful-information datasets, is proposed. The method combines few-shot learning with a knowledge transfer strategy. A BERT model learns the weights of the head classes, and a Prototyper network designed specifically for few-shot learning produces the head-class prototypes. Head and tail data are thus processed separately, which avoids the imbalance that arises when they are trained together. A mapping from prototypes to weights is then learned and used to convert the tail-class prototypes into weights, and the head-class and tail-class weights are concatenated to obtain the final classification result. In experiments, LTIC achieves classification accuracies of 82.7% and 83.5% on the Twitter and THUCNews datasets, respectively, and its F1 value is significantly higher than that of non-long-tailed models. Compared with recent long-tailed classification methods such as BNN and OLTR, it performs better, with an average accuracy improvement of 3%. When new categories of harmful information emerge, LTIC can predict them with little additional computation, reaching an accuracy of 70%, which indicates good scalability.

Key words: classification of harmful information, data imbalance, long-tailed dataset, few-shot learning, knowledge transfer
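
The weight-transfer step described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the embeddings are toy 3-dimensional vectors rather than BERT features, the head-class weights are a stand-in (twice the head prototypes), and the prototype-to-weight mapping is assumed to be a linear map fitted by least squares. All function names are hypothetical.

```python
import numpy as np

def class_prototypes(embeddings, labels, classes):
    """Mean embedding per class, as in prototypical networks."""
    return np.stack([embeddings[labels == c].mean(axis=0) for c in classes])

def fit_prototype_to_weight_map(head_protos, head_weights):
    """Least-squares linear map M with head_protos @ M ~ head_weights.
    The linear form is an assumption made for this sketch."""
    M, *_ = np.linalg.lstsq(head_protos, head_weights, rcond=None)
    return M

def build_classifier(head_weights, tail_protos, M):
    """Map tail prototypes to weights, then concatenate with head weights."""
    tail_weights = tail_protos @ M
    return np.concatenate([head_weights, tail_weights], axis=0)

def predict(embeddings, weights):
    """Pick the class whose weight vector has the largest dot product."""
    return (embeddings @ weights.T).argmax(axis=1)

# Toy head data: three head classes (0, 1, 2), two examples each.
head_emb = np.array([[1.0, 0.0, 0.0], [0.8, 0.2, 0.0],
                     [0.0, 1.0, 0.0], [0.2, 0.8, 0.0],
                     [0.0, 0.0, 1.0], [0.0, 0.2, 0.8]])
head_lab = np.array([0, 0, 1, 1, 2, 2])
head_protos = class_prototypes(head_emb, head_lab, [0, 1, 2])

# Stand-in for head weights that a trained classifier would provide (assumed).
head_weights = 2.0 * head_protos

# Learn the prototype-to-weight mapping on head classes only.
M = fit_prototype_to_weight_map(head_protos, head_weights)

# One tail class (3) with only two examples: its weight comes from its prototype.
tail_emb = np.array([[0.5, 0.5, 0.6], [0.6, 0.4, 0.6]])
tail_lab = np.array([3, 3])
tail_protos = class_prototypes(tail_emb, tail_lab, [3])

weights = build_classifier(head_weights, tail_protos, M)

queries = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0], [0.5, 0.5, 0.6]])
predictions = predict(queries, weights)
```

In this sketch a newly emerging category would be handled the same way as the tail class: compute its prototype from a few examples, map it through `M`, and append the resulting row to `weights`, without retraining the head classifier.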


Jinshuo LIU, Daichen WANG, Juan DENG, Lina WANG. Classification of Harmful Information on Internet Based on Long-Tailed Classification Algorithm[J]. Computer Engineering, 2023, 49(8): 13-19, 28.



URL: http://www.ecice06.com/EN/10.19678/j.issn.1000-3428.0067003

http://www.ecice06.com/EN/Y2023/V49/I8/13



Dataset	Model	Accuracy	Precision	Recall	F1
Twitter	BERT	0.7314	0.7057	0.7231	0.7143
Twitter	Prototyper	0.7061	0.7533	0.7122	0.7322
Twitter	BNN	0.7689	0.8033	0.7811	0.7920
Twitter	OLTR	0.7721	0.8023	0.7742	0.7800
Twitter	MW-Net	0.7703	0.7902	0.7809	0.7855
Twitter	Hybrid	0.7882	0.8109	0.7903	0.8005
Twitter	MetaSAug	0.7905	0.8051	0.7956	0.8003
Twitter	LTIC	0.8269	0.8152	0.8021	0.8086
THUCNews	BERT	0.7612	0.7235	0.7185	0.7209
THUCNews	Prototyper	0.7133	0.7632	0.7136	0.7376
THUCNews	BNN	0.7921	0.8021	0.8062	0.8041
THUCNews	OLTR	0.8021	0.8125	0.8083	0.8103
THUCNews	MW-Net	0.7981	0.8108	0.7885	0.7994
THUCNews	Hybrid	0.8160	0.8179	0.8059	0.8118
THUCNews	MetaSAug	0.8093	0.8077	0.8179	0.8127
THUCNews	LTIC	0.8354	0.8251	0.8336	0.8293
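
The F1 column can be related to the precision and recall columns with a short check. The sketch below assumes F1 is the usual harmonic mean of precision and recall (the page itself does not state the formula) and recomputes it for two Twitter rows of the table, BERT and LTIC. The function name is hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Twitter rows from the table: (precision, recall)
bert_f1 = f1_score(0.7057, 0.7231)
ltic_f1 = f1_score(0.8152, 0.8021)

print(round(bert_f1, 4))  # 0.7143
print(round(ltic_f1, 4))  # 0.8086
```

Both recomputed values agree with the reported F1 entries for those rows to four decimal places.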

Number of new categories	Accuracy
1	0.7051
3	0.7133
5	0.7101

Model	Twitter Accuracy	Twitter F1	THUCNews Accuracy	THUCNews F1
LSTM	0.7799	0.7896	0.7937	0.7983
HAN	0.7908	0.7957	0.8125	0.8066
BERT	0.8269	0.8086	0.8354	0.8293
