Discovery of Nuisance Website Domain Name Generation Based on Domain Name Semantic Information and Similarity

doi:10.19678/j.issn.1000-3428.0069949

Abstract

Using domain name generation to discover nuisance website domains offers broad coverage, yields substantial research data, and allows dissemination to be blocked and prevented in a timely manner. Existing generation models based on domain similarity, however, make insufficient use of domain features, produce highly redundant domains, and yield a low proportion of nuisance website domains among the generated names. To address these issues, this study proposes a nuisance website domain name generation and discovery model based on domain name semantic information and domain similarity. The model first uses a Transformer encoder to extract the semantic features of domain names and uses them as feature vectors to guide generation, which improves feature utilization. It then improves the Sequence Generative Adversarial Network (SeqGAN) so that generation attends to the semantic features of domain names and discrimination attends to their contextual information, which raises both the quality of the generated domains and the accuracy of the discriminator. Finally, the generated domains are checked through initial filtering, multi-tool rechecking, and final selection. Experimental results show that, compared with existing domain similarity-based generation models, the proposed model discovers more nuisance website domain names and performs better on generation quality, expansion rate, and active monitoring ability.

Key words: nuisance website domain name, generation algorithm, semantic feature, Transformer encoder, attention mechanism
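
The three-stage check described in the abstract (initial filtering, multi-tool rechecking, final selection) can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the syntax rule used for filtering, the vote-counting scheme, and the `min_votes` threshold are hypothetical and are not taken from the paper, which does not describe these steps at this level of detail here.

```python
import re
from typing import Callable, Dict, Iterable, List, Set

# Hypothetical syntax rule: a label is 1-63 chars of letters, digits, or
# hyphens, and may not start or end with a hyphen.
_LABEL = re.compile(r"(?!-)[a-z0-9-]{1,63}(?<!-)")

Checker = Callable[[str], bool]  # stand-in for one external detection tool


def is_valid_domain(name: str) -> bool:
    """Return True if the name looks like a syntactically valid domain."""
    labels = name.split(".")
    return (
        len(name) <= 253
        and len(labels) >= 2
        and all(_LABEL.fullmatch(label) for label in labels)
    )


def initial_filter(candidates: Iterable[str], known: Set[str]) -> List[str]:
    """Stage 1 (assumed): drop malformed, duplicate, and already-known names."""
    seen: Set[str] = set()
    kept: List[str] = []
    for raw in candidates:
        name = raw.strip().lower()
        if name in seen or name in known or not is_valid_domain(name):
            continue
        seen.add(name)
        kept.append(name)
    return kept


def multi_tool_recheck(domains: List[str], checkers: List[Checker]) -> Dict[str, int]:
    """Stage 2 (assumed): count how many tools flag each domain."""
    return {d: sum(1 for check in checkers if check(d)) for d in domains}


def final_selection(votes: Dict[str, int], min_votes: int) -> List[str]:
    """Stage 3 (assumed): keep domains flagged by at least min_votes tools."""
    return [d for d, v in votes.items() if v >= min_votes]


def detect(candidates, known, checkers, min_votes=2) -> List[str]:
    """Run the three stages in order on a batch of generated domains."""
    filtered = initial_filter(candidates, known)
    votes = multi_tool_recheck(filtered, checkers)
    return final_selection(votes, min_votes)
```

In practice each `Checker` would wrap a real detection tool or blocklist lookup; here it is simply any function that maps a domain string to a boolean, so the pipeline can be exercised with toy predicates.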

YU Jie, ZHAO Chunlei, DONG Guozhong, REN Huaishuo, YOU Wei. Discovery of Nuisance Website Domain Name Generation Based on Domain Name Semantic Information and Similarity[J]. Computer Engineering, 2025, 51(10): 238-249.

URL: https://www.ecice06.com/EN/10.19678/j.issn.1000-3428.0069949

https://www.ecice06.com/EN/Y2025/V51/I10/238

Figures/Tables 13

Fig.1 Address-publishing page of a nuisance website

Fig.2 Overall architecture of the model

Fig.3 Architecture of Transformer encoder

Fig.4 Architecture of domain name generation model

Fig.5 Architecture of generator

Fig.6 Architecture of end convolutional layer

Fig.7 Trends in the number of domain names for newly launched nuisance websites of the proposed model
