
15 February 2026, Volume 52 Issue 2
    

    Frontier Perspectives and Reviews
  • WANG Limin, ZHU Guanghui, WU Tao
    Computer Engineering. 2026, 52(2): 1-6. https://doi.org/10.19678/j.issn.1000-3428.0253281
    Large Language Models (LLMs) have propelled artificial intelligence into an era of natural language-centric interaction; however, they remain significantly limited in terms of physical world modeling and complex decision-making. To address these limitations, this paper considers the world model as its core paradigm and systematically analyzes the key technical pathways for the evolution of LLMs into decision-making agents. First, the capability boundaries of LLMs are delineated, highlighting their intrinsic limitations in structured knowledge representation, real-world perception, and applications that require high reliability. Subsequently, the core essence and key characteristics of world models are summarized in terms of dynamic prediction, task-driven selective modeling, multimodal fusion, and physical consistency. Building on this, data-driven generative modeling and physics-prior-driven simulation modeling are systematically reviewed and compared. Additionally, common technical challenges, including acquisition of high-quality interactive data, long-term prediction consistency, unified multimodal representation, and real-time inference efficiency, are analyzed. Furthermore, the potential and limitations of world models in bridging common-sense gaps, enhancing planning and decision-making capabilities, and supporting embodied intelligence on the path toward Artificial General Intelligence (AGI) are discussed. Finally, considering current technological trends, a forward-looking perspective on future research directions, including LLM-world model integration, data and algorithm co-optimization, fusion of physics priors with generative modeling, tight integration with embodied intelligence, and ethical and safety governance, is provided. 
This paper systematically analyzes the current status and future development of world-model technologies and provides theoretical and practical guidance for advancing artificial intelligence from perception-driven to decision-driven capabilities.
  • WANG Tian, LI Yuting, WANG Wenhua
    Computer Engineering. 2026, 52(2): 7-12. https://doi.org/10.19678/j.issn.1000-3428.0253308
    Centralized computing exhibits diminishing returns under latency, bandwidth, energy, and privacy constraints in large-scale sensing and intelligent applications. Consequently, the architectural focus shifts from "everything to the cloud" to near-source computing combined with cloud-edge collaboration. This paper reviews the stage-specific advantages and limitations of centralization. It characterizes edge computing as a near-data layer situated between endpoints and the cloud that uses local processing and closed-loop control to satisfy deterministic latency and resilience requirements. From this perspective, the paper outlines a sensor-cloud-edge-device collaborative framework. This framework adopts upload-when-necessary data paths, Service Level Agreement (SLA)-aware task placement with two-tier scheduling, and a division of labor in which the edge closes loops instantly while the cloud performs policy and model governance. The paper then discusses the trajectory toward edge intelligence, including lightweight and on-device learning; federated learning and knowledge distillation; AIOps for Edge with multilevel degradation; and an evaluation regime oriented to end-to-end closed-loop efficiency, resilience, and auditability. Evidence from educational scenarios and current industrial practice demonstrates the practical effectiveness of near-source computing and cloud-edge collaboration in ensuring deterministic latency, enhancing overall system resilience, and achieving cross-domain consistency, and accordingly identifies the inevitable evolution of the computing paradigm from edge computing toward cloud-edge intelligent collaboration.
  • FANG Yihao, ZOU Danping
    Computer Engineering. 2026, 52(2): 13-23. https://doi.org/10.19678/j.issn.1000-3428.0070059
    The continuous integration of artificial intelligence and robotics technology has facilitated the widespread adoption of multi-rotor Unmanned Aerial Vehicles (UAVs) across various fields, demonstrating their flexibility and efficiency. However, when developing and validating flight control algorithms or solutions for multi-rotor UAVs, researchers face high costs and significant risks. To mitigate these risks and enhance the efficiency of algorithm testing and optimization, simulation platforms for multi-rotor UAVs provide a safe and controlled environment. This paper first introduces conventional models of multi-rotor UAVs, selecting the commonly used quadrotor as the representative model, and then elaborates on dynamic models at different levels of simulation fidelity. Subsequently, it provides an overview of the conventional system framework of multi-rotor UAV simulation platforms and discusses their evaluation methods and classification approaches. The evaluation of simulation platforms is detailed from the perspectives of function and performance, and the platforms are classified based on whether they support an interactive learning environment and on their focus areas: dynamics, sensors, and multi-UAV coordination. This paper also reviews the main solutions for existing UAV flight missions, analyzing typical multi-rotor UAV simulation platforms within the context of traditional and learning-based methods. Finally, the paper outlines future directions for multi-rotor UAV simulation platforms.
  • WANG Zi, WANG Hongqiang, YANG Xiaoyi, LAN Yuqing
    Computer Engineering. 2026, 52(2): 24-45. https://doi.org/10.19678/j.issn.1000-3428.0069799
    Operating Systems (OSs), which form a critical infrastructure in the information age, are widely used in core fields such as medical care, industry, and the military. Their reliability and security directly determine the operational stability of these key fields, while vulnerabilities can lead to serious consequences, including system crashes and data leakage. Hence, the construction of a systematic security assurance system is of great theoretical and engineering value. This paper systematically reviews the research achievements in this field over the past decade within the framework of "formal specification-formal verification-engineering implementation" and analyzes technical paths and practical applications. At the formal specification level, it clarifies the differences between model specifications, which describe system functions based on mathematical structures such as transition systems, and property specifications, which define safety and liveness requirements based on Linear Temporal Logic (LTL). This analysis focuses on two key aspects: functional correctness and security attributes. Functional correctness covers task management and scheduling, memory allocation and recycling, exception and interrupt handling, inter-task communication, and file system read-write consistency, while security attributes focus on the BLP and BIBA models for access control, multidomain isolation in separation kernels, and the noninterference and nonleakage theories of information flow. For formal verification, three core methods are considered: deductive proof, which verifies program consistency using Hoare logic; model checking, which verifies temporal properties based on LTL and Computation Tree Logic (CTL); and standardized processes for property verification.
Taking the seL4 microkernel, which is the first to achieve functional correctness and information flow non-interference through machine proof, as a case study, this discussion reveals the transformation from theory to engineering. With regard to engineering applications, the achievements in Controller Area Network (CAN) bus communication verification within the automotive field and robustness detection of intercomponent communication in the Android system of smartphones are summarized. This systematic review aims to establish a foundation for future research in related fields, provide dataset support for large language models, and provide a reference for the engineering implementation of these technologies.
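To make the specification side concrete, the safety/liveness distinction above can be illustrated with a toy finite-trace checker (a didactic sketch only; the state fields and properties are hypothetical and unrelated to seL4's machine-checked proofs):

```python
# Toy illustration of property specification over a finite execution trace:
# a safety property ("a task only runs while its memory is allocated") and a
# bounded liveness-style property ("every request is eventually granted").
# All state fields are hypothetical examples, not from any verified kernel.

def always(trace, pred):
    """Safety: pred holds in every state of the trace."""
    return all(pred(s) for s in trace)

def eventually_after(trace, trigger, goal):
    """Bounded liveness: whenever trigger holds, goal holds in some
    later (or the same) state of the finite trace."""
    return all(
        any(goal(t) for t in trace[i:])
        for i, s in enumerate(trace) if trigger(s)
    )

trace = [
    {"alloc": True, "running": False, "req": True,  "granted": False},
    {"alloc": True, "running": True,  "req": False, "granted": True},
    {"alloc": True, "running": False, "req": False, "granted": False},
]

safety = always(trace, lambda s: s["running"] <= s["alloc"])  # running implies alloc
liveness = eventually_after(trace, lambda s: s["req"], lambda s: s["granted"])
print(safety, liveness)  # True True
```

Real model checkers evaluate such properties over all reachable states rather than one trace; this sketch only shows why safety ("nothing bad ever happens") and liveness ("something good eventually happens") need different checks.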
  • QIN Yingxin, ZHANG Kejia, PAN Haiwei, JU Yahao
    Computer Engineering. 2026, 52(2): 46-68. https://doi.org/10.19678/j.issn.1000-3428.0069826
    Deep learning has driven the development of artificial intelligence and is widely used in computer vision. It provides breakthroughs and remarkable results in complex tasks such as image recognition, object detection, object tracking, and face recognition, demonstrating excellent recognition and prediction capabilities. However, vulnerabilities and loopholes in deep learning models have gradually been exposed. Deep learning techniques, represented by convolutional neural networks, are extremely sensitive to well-designed adversarial examples, which can easily compromise the security and privacy of the models. This paper first summarizes the concept of adversarial attacks, the reasons adversarial examples arise, and related terminology. It outlines several types of classical adversarial attack strategies in the digital and physical domains and analyzes their advantages and disadvantages. Second, it focuses on computer vision and summarizes the latest research on adversarial attacks in tasks such as object detection, face recognition, object tracking, monocular depth estimation, and optical flow estimation, from both the digital and physical domains, as well as the various datasets commonly used in such studies. It also briefly introduces the current stage of adversarial example defense and detection methods, summarizes the advantages and disadvantages of these methods, and describes examples of applications of adversarial example defense for various visual tasks. Finally, based on this summary of adversarial attack methods, it explores and analyzes the deficiencies and challenges of existing computer vision adversarial attacks.
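As a concrete instance of the "well-designed adversarial examples" discussed above, the classic fast-gradient-sign step can be sketched on a hand-written logistic model (illustrative only: the weights and budget are made up, and real attacks differentiate through a deep network):

```python
# Minimal FGSM-style sketch: perturb an input in the direction of the sign
# of the loss gradient. The "model" is a hand-written logistic classifier so
# the gradient can be computed in closed form.
import math

w = [0.8, -0.5, 0.3]          # fixed model weights (illustrative)
x = [1.0, 2.0, -1.0]          # clean input
y = 1.0                       # true label
eps = 0.1                     # perturbation budget

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(inp):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, inp)))

# Cross-entropy gradient w.r.t. the input of a logistic model: (p - y) * w
p = predict(x)
grad_x = [(p - y) * wi for wi in w]

# FGSM step: x_adv = x + eps * sign(grad)
sign = lambda g: (g > 0) - (g < 0)
x_adv = [xi + eps * sign(gi) for xi, gi in zip(x, grad_x)]

print(predict(x), predict(x_adv))  # confidence in the true class drops
```

The same one-step recipe, applied through automatic differentiation of a deep network, is the FGSM attack family the survey covers; iterative and physical-domain attacks refine this basic step.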
  • Computational Intelligence and Pattern Recognition
  • GUO Tiansheng, XIE Jinkui
    Computer Engineering. 2026, 52(2): 69-78. https://doi.org/10.19678/j.issn.1000-3428.0070167
    Collaborative Filtering (CF) is an effective recommendation method that predicts user preferences by learning the representations of users and items. Recent research on CF has improved representation quality and enhanced recommendation performance from the perspective of hypersphere alignment and uniformity. This line of work promotes alignment to increase the similarity between the representations of interacting users and items and enhances uniformity, resulting in a more evenly distributed representation of users and items on the hypersphere. However, using only supervised data for alignment and uniformity optimization ignores issues such as behavioral noise, data sparsity, and differences in popularity, which inevitably damage the generalization performance and structural characteristics of the representation. To address these issues, a more accurate adaptive alignment and uniformity recommendation model is proposed. The data are modeled as a bipartite graph of user-item interactions, and a Graph Neural Network (GNN) is applied to learn user and item representations. The model performs self-supervised contrastive learning on user and item representations to capture additional graph structure patterns not covered by the supervised data. During optimization, the alignment and uniformity objectives are adaptively adjusted based on popularity, thereby achieving more generalized alignment and uniformity. Extensive experiments are conducted on three real-world datasets, and the results demonstrate the superiority and robustness of the proposed model over the baseline models.
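The alignment and uniformity objectives referenced above can be sketched directly from their standard hypersphere definitions (a minimal illustration; the paper's adaptive, popularity-aware weighting is not reproduced, and the embeddings are made up):

```python
# Sketch of the two objectives on the unit hypersphere: alignment pulls
# interacting user/item embeddings together; uniformity spreads all
# embeddings apart (log of the mean pairwise Gaussian potential).
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def alignment(pairs):
    """Mean squared distance between interacting user/item embeddings."""
    return sum(sq_dist(normalize(u), normalize(v)) for u, v in pairs) / len(pairs)

def uniformity(embs, t=2.0):
    """Lower values mean the embeddings are spread more evenly."""
    embs = [normalize(e) for e in embs]
    pots = [math.exp(-t * sq_dist(u, v))
            for i, u in enumerate(embs) for v in embs[i + 1:]]
    return math.log(sum(pots) / len(pots))

users = [[1.0, 0.1], [0.2, 1.0]]   # toy user embeddings
items = [[0.9, 0.2], [0.1, 1.1]]   # toy item embeddings of interacted items
print(alignment(list(zip(users, items))), uniformity(users + items))
```

Minimizing both terms jointly is the baseline objective this paper builds on; the proposed model adapts the trade-off per user/item popularity rather than weighting all pairs equally.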
  • CHEN Zhenqing, WAN Jiafu, ZHANG Rui
    Computer Engineering. 2026, 52(2): 79-88. https://doi.org/10.19678/j.issn.1000-3428.0069787
    Compression algorithms struggle to maintain a high compression ratio when handling the complex and diverse patterns in time series data, so selecting an appropriate compression algorithm tailored to each pattern is an urgent requirement. Existing adaptive compression schemes have low accuracy when determining the optimal compression algorithm. To address this issue, this paper proposes an Adaptive Lossless Segmented Compression method integrating Temporal Dependencies and data Features (ALSC-TDF). This method performs segmented compression of time series data and selects the most suitable compression algorithm based on the pattern of each segment. ALSC-TDF converts the compression algorithm selection problem into a time series classification task; utilizes a Gated Recurrent Unit (GRU) to capture temporal dependencies; and considers compression efficiency features closely related to the data compression ratio, including basic statistical features, permutation and variation features, and compression degree features. Temporal dependencies and the proposed features are analyzed using a modified GRU-Fully Convolutional Network (GRU-FCN) to improve classification accuracy and robustness, thereby improving the overall data compression ratio. The effectiveness and advantages of ALSC-TDF are verified on multiple datasets: it outperforms comparison models in classification accuracy and F1 value, with an accuracy of 88.86%, and achieves a significantly better compression ratio than existing compression algorithms, with a 15.62% improvement in overall data compression ratio compared to the Elf algorithm. The experimental results indicate that comprehensively analyzing the data features and temporal dependencies of time series can greatly improve the accuracy and robustness of adaptive compression algorithm selection, thereby achieving a higher compression ratio.
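The per-segment idea behind ALSC-TDF can be sketched as follows (hedged: the real method classifies segments with a GRU-FCN, whereas the feature threshold and candidate encoders here are hypothetical stand-ins):

```python
# Sketch of segmented adaptive compression: compute simple statistical
# features for each segment, then pick an encoding per segment. The
# threshold rule below is an illustrative stand-in for a learned classifier.
import statistics
import struct
import zlib

def features(seg):
    diffs = [b - a for a, b in zip(seg, seg[1:])]
    return {
        "std": statistics.pstdev(seg),
        "mean_abs_diff": sum(abs(d) for d in diffs) / len(diffs),
        "changes": sum(1 for d in diffs if d != 0),
    }

def delta_encode(seg):
    out = [seg[0]] + [b - a for a, b in zip(seg, seg[1:])]
    return struct.pack(f"{len(out)}d", *out)

def compress_segment(seg):
    f = features(seg)
    # smooth segments: delta-encode first, then deflate; noisy: deflate raw
    if f["mean_abs_diff"] < 1.0:
        raw = delta_encode(seg)
    else:
        raw = struct.pack(f"{len(seg)}d", *seg)
    return zlib.compress(raw)

smooth = [i * 0.5 for i in range(64)]   # slowly rising segment
payload = compress_segment(smooth)
print(len(payload) < len(smooth) * 8)   # compressed payload beats raw doubles
```

A smooth segment delta-encodes into a highly repetitive byte stream that deflate shrinks well, which is exactly why per-segment algorithm selection can beat a single global compressor.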
  • XUE Yang, QIN Yao, ZHANG Shuxiang
    Computer Engineering. 2026, 52(2): 89-100. https://doi.org/10.19678/j.issn.1000-3428.0070127
    A recommendation system based on Graph Neural Networks (GNNs) can extract high-order connectivity between users and items. Collaborative Filtering (CF) is a classic recommendation algorithm, but GNN-based CF suffers from over-smoothing when multiple graph convolutional layers are stacked, as user and item embeddings become increasingly similar. To address this issue, a graph neural network collaborative filtering recommendation algorithm named DAC-GCN, which generates subgraphs using a dual graph attention mechanism, is proposed. Users with common interests are clustered to generate subgraphs, preventing negative information from high-order neighbors from spreading into embedding learning. The graph attention mechanism is first used to preprocess node embeddings, increasing attention to important nodes and improving subgraph generation. In addition, the graph attention mechanism is reintroduced during subgraph propagation to enhance node discrimination within the subgraph, thereby improving the propagation of embedded information, reducing the impact of over-smoothing, and enhancing recommendation performance. Finally, the proposed algorithm is tested on three publicly available datasets using Normalized Discounted Cumulative Gain (NDCG) and recall as evaluation metrics. The experimental results validate the effectiveness and superiority of the proposed algorithm.
  • MA Manfu, YANG Xin, LI Yong, LIU Zezheng
    Computer Engineering. 2026, 52(2): 101-109. https://doi.org/10.19678/j.issn.1000-3428.0069882
    The accurate recognition of rumor sources can help suppress the spread of rumors and reduce their impact on the public. Existing rumor source recognition models often overlook the differences in mutual influence between nodes, which leads to equal weighting when aggregating neighboring feature information, thereby reducing the accuracy of rumor source recognition. This paper proposes a multiple-rumor-source recognition model based on Graph Attention Networks (GATs), called MRSDGAT. First, in a social network where a rumor has already spread, user status, rumor source prominence, and centrality are used to represent user nodes as vectors, which are then used to construct a feature matrix for the nodes. Subsequently, a GAT is employed to explore the mutual influence between nodes, calculate the influence weights, and aggregate node feature information according to the weight of the influence between nodes. Simultaneously, residual connections are introduced between the attention layers to resolve the issue of vanishing gradients and improve the ability to identify multiple rumor sources. Finally, the model outputs the probability that each node is a source node: the larger the probability value, the greater the possibility that the node is a source. The experimental results show that on the Karate dataset, the F1 value of the MRSDGAT model improves by 14.09, 13.32, and 13.10 percentage points compared to the baseline GCNSI model, and by 23.41, 22.59, and 24.21 percentage points compared to the baseline LPSI model, indicating better recognition performance.
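The core GAT aggregation step the model builds on can be sketched in a few lines (simplified: a dot-product score replaces the learned attention function, and all vectors are illustrative):

```python
# Sketch of graph-attention aggregation: score each neighbor against the
# target node, softmax the scores into influence weights, and aggregate
# neighbor features by those weights instead of weighting them equally.
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(node_feat, neigh_feats):
    scores = [sum(a * b for a, b in zip(node_feat, nf)) for nf in neigh_feats]
    weights = softmax(scores)
    dim = len(node_feat)
    agg = [sum(w * nf[d] for w, nf in zip(weights, neigh_feats))
           for d in range(dim)]
    return weights, agg

node = [1.0, 0.0]
neighbors = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights, agg = attend(node, neighbors)
print(weights, agg)  # the neighbor most similar to the node dominates
```

Replacing this uniform-weight averaging with learned attention weights is precisely the difference the abstract highlights between MRSDGAT and models that aggregate all neighbors equally.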
  • WU Zixuan, LIU Yinhua
    Computer Engineering. 2026, 52(2): 110-124. https://doi.org/10.19678/j.issn.1000-3428.0069495
    In recent years, emotion recognition research based on physiological signal measurements has gradually gained traction. In particular, Pupil Diameter (PD) is considered a promising physiological indicator that can intuitively reflect changes in an individual's emotional state. However, challenges persist in denoising pupil signals and in the accuracy of emotion recognition. To address these issues, this study proposes a dual-filter denoising method and a machine learning-based emotion classification method. The study aims to effectively denoise the PD signal while retaining subtle emotion-related features and to improve the accuracy of assessing subjects' different emotional states. First, an emotion induction experiment is designed based on auditory and visual stimuli to guide subjects through emotional states ranging from calm to startled, stressed, and pleasant, while eye-tracking devices collect continuous PD data. To mitigate noise in the data, cubic spline interpolation is employed to compensate for signal loss caused by blinking and for system noise from the equipment. Subsequently, a dual preprocessing step using Kalman filtering and wavelet denoising is applied to the raw data. Then, using four key features extracted from the pupil data, the emotional states of the subjects are classified and compared across five classification algorithms, achieving an average accuracy of 84.38%. The performance of each model is evaluated; the Multilayer Perceptron (MLP) demonstrates the best performance, achieving the highest accuracy of 87.07%. Finally, the performance of the four features in distinguishing different emotional states is compared using Receiver Operating Characteristic (ROC) curves.
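The Kalman-filtering stage of the dual-filter pipeline can be sketched for a scalar pupil-diameter trace (illustrative noise settings; the wavelet stage and the paper's actual tuning are omitted):

```python
# Sketch of scalar Kalman filtering on a noisy pupil-diameter trace:
# predict, compute the gain, then blend the prediction with each new
# measurement. q and r are illustrative process/measurement noise values.
import random

def kalman_1d(zs, q=1e-3, r=0.04):
    x, p = zs[0], 1.0           # initial state estimate and covariance
    out = []
    for z in zs:
        p += q                  # predict: covariance grows by process noise
        k = p / (p + r)         # Kalman gain
        x += k * (z - x)        # update toward the measurement z
        p *= (1 - k)
        out.append(x)
    return out

random.seed(0)
truth = [3.0 + 0.001 * t for t in range(200)]        # slowly dilating pupil, mm
noisy = [v + random.gauss(0, 0.2) for v in truth]    # measurement noise
smooth = kalman_1d(noisy)

err_noisy = sum((a - b) ** 2 for a, b in zip(noisy, truth)) / len(truth)
err_smooth = sum((a - b) ** 2 for a, b in zip(smooth, truth)) / len(truth)
print(err_smooth < err_noisy)   # the filtered trace tracks the truth more closely
```

In the paper's pipeline this stage runs after spline interpolation over blink gaps and before wavelet denoising, which targets residual high-frequency components.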
  • WANG Hailing, JIANG Tingwei, FANG Zhijun, GAO Yufei
    Computer Engineering. 2026, 52(2): 125-135. https://doi.org/10.19678/j.issn.1000-3428.0069633
    Emotion recognition is one of the most important frontier research topics in the field of Human-Computer Interaction (HCI) emotional intelligence. However, current Electroencephalogram (EEG)-based emotion recognition extracts static features and cannot mine the dynamic characteristics of emotions, making it difficult to improve emotion recognition performance. In current research on constructing dynamic Brain Functional Networks (dBFNs) from EEG signals, a sliding window is usually used to form a dBFN by sequentially constructing a functional connectivity network in each window. However, this approach depends on a subjectively chosen window length and cannot extract the connection pattern of the emotional state at each time point, so the loss of temporal information also discards brain connectivity information. To solve these problems, this study proposes a dynamic Phase Linearity Measurement (dyPLM) method that can adaptively construct an emotion-related brain network at each time point without sliding windows, characterizing the dynamic properties of emotions. In addition, a Convolutional Neural Gate Recurrent Unit (CNGRU) emotion recognition model is proposed that can further extract deep features of the dynamic brain network and effectively improve the accuracy of emotion recognition. In experiments on the public emotion recognition EEG dataset, the Database for Emotion Analysis using Physiological signals (DEAP), the four-class classification accuracy reaches 99.71%, an improvement of 3.51 percentage points over MFBPST-3D-DRLF. On the SJTU Emotion EEG Dataset (SEED), the three-class classification accuracy is 99.99%, an improvement of 3.32 percentage points over MFBPST-3D-DRLF. These results demonstrate the effectiveness and practicality of the proposed methods.
  • Computer Vision and Image Processing
  • DAN Chonghong, WEI Honglei, HE Zhou, WU Guanfeng
    Computer Engineering. 2026, 52(2): 136-147. https://doi.org/10.19678/j.issn.1000-3428.0070157
    Human keypoint detection is increasingly being applied in fields such as motion behavior recognition and human-computer interaction. Taking the long jump as a case study, this study proposes a multi-scale feature extraction keypoint detection algorithm to improve the accuracy of human keypoint detection while reducing computational and parameter complexity, and combines it with an implementation of intelligent distance detection. First, the study constructs the LJDataset dataset to fill the gap in current long jump datasets. Then, based on the YOLOv8 training framework, it proposes a new model, SRMpose, with a low parameter count and low computational complexity. The model uses StarBlock to build the backbone network; designs the Multi-channel Residual Block (MRB) and the semi-coupled detection head, SRMhead, to extract features; and introduces the lightweight sampling operators ADown and DySample to improve the processing efficiency of feature maps. The model is validated on three datasets: LJDataset, MPII, and COCO. Compared with YOLOv8n-pose, SRMpose performs better on all three datasets, with mAP@0.5 and mAP@0.5:0.95 increasing by 2.2 and 1.4, 3.6 and 2.6, and 1.9 and 1.2 percentage points, respectively, while on average the parameter count increases by 3.3% and GFLOPs decrease by 21.7%. In addition, on the COCO and LJDataset datasets, compared with YOLOv8s, SRMpose's parameter count decreases by an average of 48.3%, GFLOPs decrease by an average of 59.6%, and mAP@0.5 decreases by 1.4 percentage points and increases by 0.3 percentage points, respectively, proving that SRMpose effectively reduces the number of parameters and computations while maintaining model performance. In a further experiment on LJDataset, the validation set was switched to the COCO validation set. The results show that the performance gap between SRMpose and YOLOv8s remains less than 1 percentage point, proving the comprehensive performance advantage and generalization ability of SRMpose. Moreover, the LJDataset dataset has a certain level of complexity and covers most human-body keypoint recognition features.
  • ZHANG Xinjia, WANG Fang
    Computer Engineering. 2026, 52(2): 148-157. https://doi.org/10.19678/j.issn.1000-3428.0069729
    Object detection in Unmanned Aerial Vehicle (UAV) aerial images is prone to incorrect or missed detections when targets are small, occluded, or densely distributed at varying scales. To address these challenges, this paper proposes the SNA-YOLOv5s algorithm for small target detection, based on YOLOv5s. First, the strided convolution layers in the original model are replaced with the Spatial Depth Transformation Convolution (SPD-Conv) module, eliminating the detail loss caused by strided convolution operations and enhancing the model's ability to extract features from small objects. Second, a novel Average Pyramid Pooling-Fast (AGSPPF) module is designed, introducing an average pooling layer to address the information loss that occurs during feature extraction, thereby improving the model's feature extraction capability. Third, a new large-scale detection branch specifically for small targets is added to capture rich details in shallow features and enhance small target detection. Finally, the Normalized Attention Mechanism (NAM) is embedded in the backbone network, where feature information is weighted to suppress invalid features. The proposed algorithm is trained and tested on the VisDrone2019 and NWPU VHR-10 datasets, on which it achieves mean Average Precision (mAP) of 42.3% and 96.5%, respectively, 8.4 and 2.6 percentage points higher than the baseline YOLOv5s model. The robustness and accuracy of the proposed model are validated by comparisons with other mainstream deep learning models.
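The space-to-depth rearrangement at the heart of SPD-Conv can be sketched on nested lists (a minimal illustration of the idea, not the module's actual implementation):

```python
# Sketch of space-to-depth: instead of discarding pixels with a stride-2
# convolution, each 2x2 spatial block is moved into the channel dimension,
# halving height/width while quadrupling channels and losing no information.
def space_to_depth(img, block=2):
    """img: H x W x C nested lists -> (H/block) x (W/block) x (C*block*block)."""
    h = len(img)
    w = len(img[0])
    out = []
    for i in range(0, h, block):
        row = []
        for j in range(0, w, block):
            cell = []
            for di in range(block):
                for dj in range(block):
                    cell.extend(img[i + di][j + dj])
            row.append(cell)
        out.append(row)
    return out

img = [[[float(i * 4 + j)] for j in range(4)] for i in range(4)]  # 4x4x1
out = space_to_depth(img)
print(len(out), len(out[0]), len(out[0][0]))  # 2 2 4
```

Because every input value survives the rearrangement, a following non-strided convolution can still see small-object detail that a strided convolution would have skipped over.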
  • WEN Lang, GOU Guanglei, BAI Ruifeng, MIAO Wanyu
    Computer Engineering. 2026, 52(2): 158-166. https://doi.org/10.19678/j.issn.1000-3428.0070136
    Currently, fine-grained image classification faces challenges such as labeling difficulty, scarce samples, and subtle category differences. To address these issues, a few-shot fine-grained image classification method based on neighborhood fusion and feature enhancement is proposed. First, the Discrete Cosine Transform (DCT) and channel attention mechanisms are used to capture global and local information from images, respectively, and these features are concatenated along the channel dimension. Combining spatial- and frequency-domain feature extraction in this way enhances the diversity of sample features and improves model generalization. Second, a feature enhancement module is introduced to compute the correlation between query samples and support class prototypes, generating adaptive weights that guide query information to complement the detailed learning of support sample images. This process effectively captures the differences between images of the same class and suppresses local similarities between different classes. Finally, a dual-similarity measurement module assesses the correlation scores between the support class prototypes and the images to be classified, improving classification accuracy. The experimental results show that this method achieves accuracies of 79.22%, 87.47%, 79.23%, and 83.71% on the 5-shot tasks of the Mini-ImageNet, CUB-200-2011, Stanford Dogs, and Stanford Cars datasets, respectively, outperforming comparative methods.
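The DCT branch above relies on low-index DCT coefficients summarizing global structure; a minimal 1-D DCT-II makes this concrete (illustrative only; the method applies the transform to 2-D feature maps):

```python
# Sketch of the Type-II DCT: coefficient k=0 captures the global average of
# the signal, higher k capture progressively finer oscillations, which is
# why truncated DCT coefficients serve as compact global features.
import math

def dct2(xs):
    """Unnormalized Type-II DCT of a 1-D sequence."""
    n = len(xs)
    return [
        sum(x * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
            for i, x in enumerate(xs))
        for k in range(n)
    ]

const = [5.0] * 8
coeffs = dct2(const)
# A constant signal has all its energy in the k=0 (global average) term.
print(coeffs[0], max(abs(c) for c in coeffs[1:]) < 1e-9)  # 40.0 True
```

Keeping only the first few coefficients of such a transform yields the global, frequency-domain descriptor that is concatenated with the channel-attention (local) features in the method above.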
  • SONG Chaoqi, LIU Ying, HE Jinglu, LI Daxiang
    Computer Engineering. 2026, 52(2): 167-176. https://doi.org/10.19678/j.issn.1000-3428.0070134
    Image classification, a fundamental task in computer vision, has achieved remarkable results on large-scale datasets. However, traditional deep learning methods tend to overfit under low sample size conditions, which affects a model's generalization ability. To address this issue, this study presents a novel small sample image classification method to improve classification performance when sample data are scarce. The method is based on a significant-position interaction Transformer and a target classifier, leveraging the structure and advantages of the Vision Transformer (ViT) model. An Interaction Multi-Head Self-Attention (HI-MHSA) module with significant position selection is introduced; it increases the interaction between attention heads in the multi-head self-attention module, strengthens the model's attention to significant regions of the input image, and saves computational resources, while the supervision and guidance of the target classifier further improve the learning efficiency and accuracy of the model. Experimental results show that on the miniImageNet, tieredImageNet, and CUB datasets, the proposed method achieves classification accuracies of approximately 67.09%, 72.07%, and 79.82% in the 5-way 1-shot task and approximately 83.54%, 85.62%, and 90.35% in the 5-way 5-shot task, respectively. Therefore, the proposed method performs well and is highly practical for small sample image classification tasks.
  • WANG Shaojun, WANG Ting, WANG Chao, YANG Wankou, LU Keyu
    Computer Engineering. 2026, 52(2): 177-185. https://doi.org/10.19678/j.issn.1000-3428.0070004
    Using U-Net as the backbone, a novel medical image segmentation network called SEHC-Net is proposed for melanoma segmentation. A new structure named the Sense and Edge Boost Module (SEBM) is designed specifically to address the challenges of segmenting melanoma images with irregular shapes, diverse sizes, and blurry boundaries. SEBM expands the receptive fields of features, which enhances the model's ability to extract target edge information and further capture the connections between pixels. Additionally, a hierarchical compensation module is proposed to solve the information redundancy caused by long connections during information concatenation. This compensates for the inability of mainstream segmentation networks to fully balance spatial contextual information and high-level semantic information in the feature extraction stage. An Inception structure from GoogLeNet is used to limit parameter growth by reducing kernel sizes while increasing model depth. The segmentation algorithm is verified on the ISIC2018 melanoma dataset. Experimental results show that the Intersection over Union (IoU), sensitivity, precision, Dice coefficient, and accuracy are 79.54%, 86.29%, 90.92%, 84.39%, and 94.83%, respectively. Therefore, the proposed algorithm can effectively improve melanoma segmentation performance.
  • SONG Quanzhen, CHEN Zuojun, QIN Pinle, ZENG Jianchao
    Computer Engineering. 2026, 52(2): 186-196. https://doi.org/10.19678/j.issn.1000-3428.0070426
    Existing low-light image denoising methods mainly use the feature extraction and denoising mechanisms of Transformers and Convolutional Neural Networks (CNNs). They face two problems: the self-attention mechanism based on local windows fails to fully capture the nonlocal self-similarity in images, and the calculation of self-attention in the channel dimension does not fully utilize the spatial correlation of images. To address these issues, this study proposes a superpixel-guided strategy for a window partition-based visual Transformer; the strategy can adaptively select relevant windows for global interaction. First, a Top-N Cross Attention mechanism (TNCA) is designed based on window interactions: the top-N windows most similar to the target image window are selected dynamically, and the information related to these windows in the channel dimension is aggregated, fully exploiting the nonlocal self-similarity of the image. Second, superpixel segmentation guidance significantly improves the expressive power of local features within each window while enhancing the correlation of spatial features in the channel dimension. Finally, a hierarchical Adaptive Interaction Superpixel Guide Transformer (AISGFormer) is constructed. Experimental results show that AISGFormer achieves a Peak Signal-to-Noise Ratio (PSNR) of 39.98 dB and 40.06 dB on the SIDD and DND real-image datasets, respectively. Compared with other advanced networks, the PSNR improves by 0.02-14.33 dB and 0.02-7.63 dB, respectively. AISGFormer interacts with local and global information and details more effectively and adaptively utilizes self-similarity to suppress region-similarity noise.
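The window-selection step of TNCA can be sketched as a top-N search by similarity over flattened windows (hedged: the real mechanism scores windows inside the attention computation itself; the vectors here are illustrative):

```python
# Sketch of top-N window selection: flatten each candidate window, score it
# against the target window by cosine similarity, and keep the N best for
# attention, rather than a fixed local neighborhood.
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def top_n_windows(target, windows, n=2):
    order = sorted(range(len(windows)),
                   key=lambda i: cosine(target, windows[i]), reverse=True)
    return order[:n]

target = [1.0, 2.0, 3.0, 4.0]
windows = [
    [1.0, 2.0, 3.0, 4.1],   # nearly identical patch elsewhere in the image
    [4.0, 3.0, 2.0, 1.0],   # reversed pattern
    [2.0, 4.0, 6.0, 8.0],   # same pattern at a different brightness
]
print(top_n_windows(target, windows))  # indices of the two most similar windows
```

Because cosine similarity is scale-invariant, the brightness-scaled copy of the pattern scores highest, which is the nonlocal self-similarity the abstract argues local-window attention misses.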
  • LIU Huilin, FANG Qiong, WANG Yansi, ZHANG Shunxiang, SU Shuzhi
    Computer Engineering. 2026, 52(2): 197-208. https://doi.org/10.19678/j.issn.1000-3428.0069830
    Existing photorealistic image style transfer algorithms typically do not fully consider model size and computational efficiency while pursuing improvements in image realism and stylization intensity, making them difficult to apply on devices with low computing power. To address this issue, this study proposes a lightweight photorealistic image style transfer algorithm. VGG19 is replaced with the lightweight ShuffleNet V2 network as the feature extractor, with block-wise training and skip-connection techniques introduced to significantly reduce the number of parameters and improve the speed of image style transfer. To better balance the content and style of the transferred images, the study also proposes a Shuffle Gated Channel Attention Mechanism (SGCAM) and a Channel Alignment Whitening and Coloring Transform (CAWCT). SGCAM efficiently combines channel shuffling with gating mechanisms, which not only enhances the realism of generated images but also preserves the lightweight advantage of the algorithm. CAWCT significantly boosts the stylization intensity of the generated images by introducing binary operations that match the whitened content features to the style features by similarity. Experimental results show that the parameter size of the proposed algorithm is only 14.8% of that of PhotoWCT2. It takes only 4.22 s to transfer an image with a resolution of 1 000×750 pixels, which is 0.79 s faster than PhotoWCT2. Simultaneously, the quality and stylization strength of the generated images are significantly improved. In performance evaluations, the Structural Similarity Index Measure (SSIM) increases by 0.031 and the Peak Signal-to-Noise Ratio (PSNR) increases by 0.066 dB, while the Content loss, Gram loss, and Style loss metrics decrease by 0.227, 0.138×10⁻⁵, and 0.116, respectively.
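CAWCT builds on the classic Whitening and Coloring Transform (WCT). A minimal numpy sketch of plain WCT, with CAWCT's channel-alignment and binary matching steps omitted, features shaped (channels, pixels):

```python
import numpy as np

def wct(content, style, eps=1e-5):
    """Classic WCT: whiten the content features (remove their covariance),
    then color them with the style covariance and mean."""
    def center(f):
        mu = f.mean(axis=1, keepdims=True)
        return f - mu, mu
    fc, _ = center(content)
    fs, mu_s = center(style)
    # Whitening: multiply by the inverse square root of the content covariance
    Ec, Dc, _ = np.linalg.svd(fc @ fc.T / (fc.shape[1] - 1) + eps * np.eye(len(fc)))
    white = Ec @ np.diag(Dc ** -0.5) @ Ec.T @ fc
    # Coloring: impose the style covariance and restore the style mean
    Es, Ds, _ = np.linalg.svd(fs @ fs.T / (fs.shape[1] - 1) + eps * np.eye(len(fs)))
    return Es @ np.diag(Ds ** 0.5) @ Es.T @ white + mu_s

rng = np.random.default_rng(1)
content = rng.normal(size=(4, 500))
style = rng.normal(size=(4, 500)) * 3.0
out = wct(content, style)
# The transferred features now carry (approximately) the style covariance
print(np.allclose(np.cov(out), np.cov(style), atol=0.1))  # True
```

The two SVDs are what make full WCT expensive at high resolution, which is one motivation for lightweight variants such as the one proposed here.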
  • XU Xiaoyang, WEI Wei, GAO Chongyang
    Computer Engineering. 2026, 52(2): 209-220. https://doi.org/10.19678/j.issn.1000-3428.0069919
    An improved YOLOv7-tiny-based lightweight infrared ship target detection model is proposed to address the low accuracy and high computational load of ship detection in infrared images. First, the lightweight model PP-LCNet is employed in the backbone network, which significantly reduces both the number of parameters and the computational requirements. Second, an improved Fused-MBConv module and a Coordinate Attention (CA) mechanism are incorporated to construct the ELAN-FM-C module, which is then integrated into the feature fusion layer to comprehensively focus on the spatial and channel information of the feature layer and obtain a large receptive field. Subsequently, the Minimum Distance Points Intersection over Union (MDPIoU) loss function, which compares bounding box similarity based on the minimum point distance, is adopted to simplify the computation process and improve the detection capability of the lightweight model for infrared targets. Based on this, an R-BiFPN structure is proposed to fuse more effective features, thereby improving the detection performance of the lightweight model across targets of different scales. Finally, a knowledge distillation technique is used to further improve the detection accuracy of the model. The improved model is validated on the Iray Optoelectronics infrared offshore ship dataset, achieving a mean Average Precision (mAP) that is 3.3 percentage points higher than that of the original YOLOv7-tiny model. Simultaneously, the parameter and computational complexities are reduced by 23.0% and 30.3%, respectively, and the model size is reduced by 21.7%. Experiments on the publicly available ship datasets SeaShips and Ship Images reveal that, compared with other mainstream and recent detection models, the improved model demonstrates excellent generalization and robustness and outperforms them in terms of both detection accuracy and lightweight design.
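The MDPIoU loss compares boxes via IoU penalized by the distances between corresponding corner points. A sketch under that reading (axis-aligned boxes in x1, y1, x2, y2 form; normalizing by the squared image diagonal is an assumption):

```python
def mdpiou_loss(box_a, box_b, img_w, img_h):
    """IoU minus the normalized squared distances between the two boxes'
    top-left and bottom-right corners; the loss is 1 minus that score."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union
    d1 = (ax1 - bx1) ** 2 + (ay1 - by1) ** 2   # top-left corner distance
    d2 = (ax2 - bx2) ** 2 + (ay2 - by2) ** 2   # bottom-right corner distance
    norm = img_w ** 2 + img_h ** 2
    return 1.0 - (iou - d1 / norm - d2 / norm)

# Identical boxes: IoU is 1 and both corner distances are 0, so the loss is 0
print(mdpiou_loss((10, 10, 50, 50), (10, 10, 50, 50), 640, 640))  # 0.0
```

Because two corner points fully determine an axis-aligned box, the corner-distance penalty keeps a gradient signal even when the boxes do not overlap, which plain IoU lacks.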
  • Cyberspace Security
  • CUI Jingsong, GUO Mengwei, GUO Chi
    Computer Engineering. 2026, 52(2): 221-235. https://doi.org/10.19678/j.issn.1000-3428.0069858
    Current network device identification methods based on hardware fingerprints are inefficient in collecting and extracting features, and device classification methods based on traffic characteristics only consider existing device types and cannot detect abnormal devices. To address these problems, this study proposes a method that extracts the processing time-delay features of network devices based on Global Navigation Satellite System (GNSS) high-precision timing technology. A Bayesian convolutional autoencoder model, called BCNN-AE, is constructed to efficiently identify known types and detect unknown types; the model comprises feature extraction, feature reconstruction, and composite prediction modules. First, the proposed method uses GNSS high-precision timing technology to achieve nanosecond-level measurement of network traffic processing time-delays and constructs a device time-delay distribution feature vector. Next, the feature extraction module uses Bayesian convolution to extract time-delay distribution features, and the feature reconstruction module uses an Autoencoder (AE) to learn a compressed reconstruction representation of the time-delay vector. Finally, the composite prediction module makes a comprehensive judgment based on uncertainty and reconstruction-error thresholds to identify known types and detect unknown or abnormal device types. Experiments conducted on a dataset collected in a laboratory simulation environment and on the public Aalto dataset show that device time-delays can accurately characterize different network device types. The results show that the proposed method achieves higher recognition accuracy than the baseline models and can effectively detect unknown or abnormal device types.
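The composite prediction step flags a device as unknown when its reconstruction error exceeds a threshold calibrated on known devices. A sketch of that decision rule, with PCA standing in for the paper's Bayesian convolutional autoencoder:

```python
import numpy as np

def fit_pca(X, k):
    """Fit a k-component PCA as a stand-in for the autoencoder's
    compress-and-reconstruct step (illustrative, not the paper's BCNN-AE)."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:k]

def recon_error(x, mu, comps):
    z = (x - mu) @ comps.T          # compress
    x_hat = z @ comps + mu          # reconstruct
    return np.linalg.norm(x - x_hat)

rng = np.random.default_rng(2)
# Known devices: time-delay feature vectors lying on a low-dimensional pattern
known = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 16))
mu, comps = fit_pca(known, k=2)
threshold = max(recon_error(x, mu, comps) for x in known)
unknown = rng.normal(size=16) * 5.0   # off-manifold sample: an unseen device type
print(recon_error(unknown, mu, comps) > threshold)  # True: flagged as unknown
```

The same idea underlies the paper's composite judgment, with Bayesian uncertainty used as a second threshold alongside the reconstruction error.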
  • QI Fengyi, ZHANG Xinyou, FENG Li, XING Huanlai
    Computer Engineering. 2026, 52(2): 236-244. https://doi.org/10.19678/j.issn.1000-3428.0070198
    In recent years, wireless networks have been widely used in healthcare, industry, education, and military applications; however, security threats to these networks are increasing. Traditional cryptographic authentication methods have several limitations, including restricted computational resources, vulnerability to quantum computing, and susceptibility to tampering. To address these challenges, a device fingerprint verification scheme based on physical-layer information is proposed. This scheme leverages fingerprint features derived from Channel State Information (CSI) for device identification to prevent malicious Wi-Fi connections. The proposed scheme considers both stationary and mobile devices with the aim of improving terminal identification accuracy and stability. For stationary devices, where interference in the authentication scenario is minimal, the CSI amplitude information matrix is used as the authentication fingerprint. For mobile devices, where the CSI varies with device movement, direct extraction of fingerprint information is infeasible; instead, the fingerprint features are constructed by extracting the I/Q phase errors for device identification. Self-designed One-Class Support Vector Machine (SVM) based on Confidence Level (OSCL) and Isolation Forest (iForest) based on Confidence Level (IFCL) models are employed to train the fingerprints generated by the two schemes, enabling accurate identification of the target devices. The scheme achieves identification accuracies of 99% and 74% for stationary and mobile devices, respectively, effectively complementing cryptography-based device identification methods. Additionally, during the training phase, only positive data are utilized to address the unpredictability of abnormal device fingerprint information and to enhance robustness.
  • JIA Jianghao, ZHANG Ziwei, GAO Liting, WEN Juan, XUE Yiming
    Computer Engineering. 2026, 52(2): 245-252. https://doi.org/10.19678/j.issn.1000-3428.0069385
    Existing text steganalysis models experience difficulty in learning and extracting the multilayer effective information that truly exists in steganographic data. To address this issue, a text steganalysis method based on hierarchy-aware matching, HAM-Stega, is proposed. This method utilizes the matching relationship between the relative distance between text information and label information in steganographic data to obtain a feature-matching relationship between text and coarse- and fine-grained labels in a hierarchy-aware manner. Based on this, joint embedding and matching learning loss functions are designed to guide the classification of text feature representations and obtain the final hierarchical classification information. The experimental results show that HAM-Stega's detection accuracy on the Large multidistribution mixed dataset, which is similar to real-world scenarios, improves by approximately 1.25 to 7.42 percentage points compared with the comparison models, indicating that the proposed model has an effective steganalysis detection capability on mixed datasets. Simultaneously, HAM-Stega can extract and detect other layers of effective information present in the steganographic data, such as the steganographic algorithm, embedding rate, and corpus type of the encrypted text. It improves the hierarchical classification metrics Macro-F1 and Micro-F1 by 5.41 and 4.36 percentage points, respectively, compared with the pretrained BERT model.
  • CHEN Xianyi, MI Hui, HE Junjie, FU Zhangjie
    Computer Engineering. 2026, 52(2): 253-264. https://doi.org/10.19678/j.issn.1000-3428.0070029
    Owing to the risk of copyright leakage in Federated Learning (FL) models caused by untrustworthy clients participating in joint training, current watermark embedding methods used by the central server face several challenges, such as incompatibility with secure FL architectures, insufficient traceability, and excessive server computational burden. Therefore, this study proposes a traceable and secure FL copyright protection scheme based on orthogonal constraints, abbreviated as FedSOW. Initially, the server replicates the convolutional layer in which the watermark is embedded to form a dual-channel layer and selects this dual-channel layer as the initial watermark layer. Subsequently, forward constraint rules are designed based on the principle of Schmidt orthogonalization, guiding the output features of the watermark layer of the client model using the orthogonal constraint. Finally, each client trains the watermark layer to form traceable local models with different orthogonal structures. Experimental results show that, compared with existing watermarking schemes, FedSOW demonstrates strong watermark persistence, ensuring copyright verification in any training round within the secure FL framework. Moreover, FedSOW exhibits excellent performance in terms of traceability, fidelity, and attack resistance.
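The forward constraint is grounded in classical (Gram-)Schmidt orthogonalization; its core, stripped of the FL plumbing, is that each client's watermark-layer output is pushed toward a direction orthogonal to the others, which is what makes the clients distinguishable (traceable):

```python
import numpy as np

def gram_schmidt(vectors):
    """Classical Gram-Schmidt: turn a list of vectors into an
    orthonormal basis by subtracting each vector's projection onto
    the directions already accepted."""
    basis = []
    for v in vectors:
        w = v - sum((v @ b) * b for b in basis)
        if np.linalg.norm(w) > 1e-10:       # skip linearly dependent vectors
            basis.append(w / np.linalg.norm(w))
    return basis

rng = np.random.default_rng(6)
clients = [rng.normal(size=4) for _ in range(3)]  # one feature vector per client
basis = gram_schmidt(clients)
print(round(abs(basis[0] @ basis[1]), 12))  # 0.0 -- pairwise orthogonal
```

In FedSOW the orthogonality is enforced softly through a training loss rather than computed exactly, but the distinguishing property it buys is the same.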
  • CAO Tianya, ZHANG Yujing, JIA Junjie, ZHANG Yufan, DENG Xiaofei
    Computer Engineering. 2026, 52(2): 265-274. https://doi.org/10.19678/j.issn.1000-3428.0069644
    Federated learning, the most commonly used privacy protection framework in deep learning, is widely applied by many institutions. The participants in this framework achieve the goal of sharing data by uploading model parameters while the raw data never leave the local device. However, in federated learning, privacy leakage occurs when the parties frequently upload and receive parameters. To address this issue, a personalized gradient clipping-based federated learning privacy-preserving algorithm (AADP-FL) is proposed. This algorithm calculates a clipping threshold for each layer based on the L1 norm of historical data from the different network layers of the participants. The gradient data are then clipped to limit the gradient range and prevent exploding and vanishing gradients. Simultaneously, the contribution of each layer is calculated, privacy budgets are allocated to each layer based on its contribution, and personalized noise is then added. Participants add an appropriate amount of noise when uploading data to conceal the specific content, thereby hiding the contribution rate of each participant and improving data security. A series of experiments reveals that the accuracy of this algorithm is superior to that of commonly used personalized gradient clipping methods, with an accuracy increase of over 3.5 percentage points. The algorithm also maintains a high accuracy compared with traditional federated learning frameworks. It can effectively protect the privacy of participant data while maintaining high accuracy, achieving a balance between model performance and data privacy.
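The per-layer clip-then-noise step can be sketched as follows (the L1-norm clipping rule and the noise scale c/ε are illustrative assumptions, not the paper's exact calibration):

```python
import numpy as np

def clip_l1(g, c):
    """Scale gradient g so its L1 norm is at most the threshold c."""
    return g * min(1.0, c / (np.abs(g).sum() + 1e-12))

def privatize(grads, l1_history, budgets, rng):
    """AADP-FL-style sketch: derive each layer's clipping threshold from
    its historical L1 norms, clip, then add Gaussian noise scaled
    inversely to that layer's allocated privacy budget."""
    out = []
    for g, hist, eps in zip(grads, l1_history, budgets):
        c = float(np.mean(hist))                          # per-layer threshold
        out.append(clip_l1(g, c) + rng.normal(scale=c / eps, size=g.shape))
    return out

rng = np.random.default_rng(3)
grads = [rng.normal(size=10) * 4, rng.normal(size=5)]     # two layers' gradients
noisy = privatize(grads, [[2.0, 2.2, 1.8], [1.0, 1.1, 0.9]], [4.0, 1.0], rng)
print(np.abs(clip_l1(grads[0], 2.0)).sum() <= 2.0 + 1e-9)  # True: clipped
```

Giving a high-contribution layer a larger budget means less noise where the model is most sensitive, which is the personalization the abstract describes.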
  • DONG Fanghe, SHI Qiong, SHI Zhibin
    Computer Engineering. 2026, 52(2): 275-286. https://doi.org/10.19678/j.issn.1000-3428.0069846
    In recent years, deep learning technology has been increasingly used for malicious traffic detection. However, adversarial example attacks pose challenges to deep learning-based malicious traffic detection. To address this problem, this study proposes an adversarial traffic detection method based on ensemble learning and anomaly detection to detect adversarial example attacks against malicious traffic detection. First, a binary ensemble learner is trained for each malicious traffic category. For each base model, different data and feature subsets are used during training to increase the differences between the base models and thus the difficulty of adversarial examples crossing the decision boundaries of all models. Second, the proportion of base models that predict the input sample as normal traffic is used as the confidence score of the ensemble learner; the confidence scores from the different binary ensemble learners are then input into an isolation forest model, which conducts anomaly detection to obtain an anomaly score. Finally, the obtained anomaly score is compared with the anomaly-score threshold obtained from normal examples to determine whether the example is adversarial. The experimental results show that the proposed method achieves the highest Area Under the Receiver Operating Characteristic Curve (AUC) values: 0.986 9 and 0.989 6 in the feature and restricted spaces of the NSL-KDD dataset, respectively, and 0.999 1 and 0.999 8 in the corresponding spaces of the CICIDS2017 dataset, outperforming the comparison methods.
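The confidence score and the final thresholding decision can be sketched as follows (a fixed threshold stands in for the isolation forest stage):

```python
def confidence_score(base_predictions):
    """Fraction of base models in one binary ensemble that predict the
    input as normal traffic -- the per-category confidence described above."""
    return sum(1 for p in base_predictions if p == "normal") / len(base_predictions)

def is_adversarial(conf_scores, normal_threshold):
    """Flag an example whose mean confidence across the per-category
    ensembles falls below a threshold calibrated on normal examples
    (the paper feeds the scores to an isolation forest instead; a plain
    threshold is a simplification)."""
    return sum(conf_scores) / len(conf_scores) < normal_threshold

# Five diverse base models vote on one input sample
votes = ["normal", "normal", "attack", "normal", "normal"]
print(confidence_score(votes))                  # 0.8
print(is_adversarial([0.2, 0.1, 0.3], 0.5))     # True: unusually low confidence
```

Because the base models see different data and feature subsets, an adversarial example that fools one decision boundary tends to leave the others intact, depressing this score.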
  • Multimodal Information Fusion
  • RAN Tongyu, TANG Mengzi, XIE Qing, LIU Yongjian
    Computer Engineering. 2026, 52(2): 287-298. https://doi.org/10.19678/j.issn.1000-3428.0069814
    Ordinal classification, in which class labels have a natural order, has been widely studied in various fields, such as movie rating and age estimation. Most existing methods assume that all samples are labeled; however, the unique nature of the data often makes collecting extensive labeled data challenging, thereby affecting the performance of ordinal classification. This study proposes a semi-supervised ordinal classification framework that incorporates additional information. The framework starts by generating partial-order information from the relationships among unlabeled samples and constructing a directed graph network. Then, it uses a Graph Neural Network (GNN) to aggregate neighbor information, enrich node representations, and capture the order between nodes, thereby recovering global rankings from the partial-order information. Subsequently, the method applies a Gaussian mixture model for feature weighting according to the global rankings and employs clustering to assign pseudo labels by integrating the ordered information. Finally, the framework uses supervised learning models for ordinal classification tasks such as age estimation. Experiments on the FGNET, Adience, and UTKFace datasets show that the framework achieves reliable performance with fewer labeled data. It performs better than semi-supervised learning baselines in terms of Mean Absolute Error (MAE) and Accuracy: MAE decreases by 0.05, 0.04, and 0.04, and Accuracy increases by 4.8, 4.5, and 3.5 percentage points on the three datasets, respectively.
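Recovering a global ranking from pairwise partial-order edges is, at its combinatorial core, a topological sort over the directed graph; a sketch of the role the GNN plays, reduced to that core:

```python
from collections import defaultdict, deque

def global_ranking(pairs, nodes):
    """Recover a global ordering from partial-order edges (lo, hi)
    meaning 'lo ranks before hi', via Kahn's topological sort."""
    succ, indeg = defaultdict(list), {n: 0 for n in nodes}
    for lo, hi in pairs:
        succ[lo].append(hi)
        indeg[hi] += 1
    queue = deque(n for n in nodes if indeg[n] == 0)
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for m in succ[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                queue.append(m)
    return order

# Partial orders among unlabeled samples, e.g. "apparent age of u1 < u3"
print(global_ranking([("u1", "u3"), ("u3", "u2"), ("u1", "u2")],
                     ["u1", "u2", "u3"]))  # ['u1', 'u3', 'u2']
```

The GNN version is needed in practice because the pairwise comparisons are noisy and incomplete; message passing lets the model interpolate an ordering where a strict topological sort would be ambiguous or contradictory.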
  • LI Jianlang, WU Xindian, CHEN Ling, YANG Bo, TANG Wensheng
    Computer Engineering. 2026, 52(2): 299-310. https://doi.org/10.19678/j.issn.1000-3428.0070113
    This study proposes a Common and Differential Cross-Attention Module-Bird's-Eye View (CDCAM-BEV) algorithm that fuses 4D millimeter-wave radar and vision to improve target detection accuracy for pedestrian and vehicle recognition and localization in autonomous driving scenarios. First, a radar cylinder network is designed to encode the 4D radar point cloud into a pseudo image, and the monocular image is converted into a Bird's-Eye View (BEV) feature through Orthogonal Feature Transformation (OFT). Second, based on the cross-attention mechanism, a Common Information Extraction Module (CICAM) and a Differential Information Extraction Module (DICAM) are used to fully explore the common and differential information between radar and images. Finally, a BEV feature fusion module is designed based on CICAM and DICAM to achieve feature-level fusion of image and radar information in the BEV space. Experiments are conducted on the VOD dataset, and the CDCAM-BEV algorithm is compared with five other 3D object detection algorithms. The experimental results show that CDCAM-BEV achieves better detection performance in multiple modes: in the 3D mode, its average detection accuracy is 3.65 percentage points higher than that of the second-ranked Part-A2; in the BEV mode, it is 5.04 percentage points higher than that of the second-ranked PointPillars; and in the Average Orientation Similarity (AOS) mode, it is 2.62 percentage points higher than that of the second-ranked Part-A2. These results show that CDCAM-BEV performs excellently in all modes, effectively fusing image and 4D radar point cloud features and significantly improving the accuracy and reliability of object detection.
  • YANG Yuxue, HE Tian, FAN Jinghang, LIU Ruiying, LI Teng
    Computer Engineering. 2026, 52(2): 311-321. https://doi.org/10.19678/j.issn.1000-3428.0070119
    Image-text retrieval has become an important research direction in cross-modal fields. However, existing methods for aggregating features from multiple modalities face two major challenges: insufficient feature alignment between modalities and semantic representation loss within modalities. A cross-modal image-text retrieval model based on cross-attention and feature aggregation is proposed to address the problem of representing feature information within modalities. This model includes image and text feature extraction, cross-attention, feature pooling, and feature fusion modules. It combines the triplet loss function to mine local information in images and text, obtaining image and text feature representations with deep semantic relationships. The model adopts an attention fusion strategy, which regulates the fusion of fine-grained features between images and texts using learnable weight parameters. A feature pooling module is designed that aggregates image region features and text sequence features separately, learns weight parameters through neural networks, and combines multiple similarities to guide model learning. This module can flexibly handle variable-length sequence features of images and text, enhancing the model's ability to capture cross-modal information. Comparative experiments conducted on the public datasets MS COCO and Flickr30k reveal that, compared with various image-text retrieval models, this model achieves higher retrieval performance. It has advantages in semantic feature pooling and dimensionality reduction, providing new ideas for cross-modal feature fusion.
  • SUN Yuan, WANG Kangping, ZHAO Mingbo
    Computer Engineering. 2026, 52(2): 322-330. https://doi.org/10.19678/j.issn.1000-3428.0069773
    With the continuous development of multimodal learning, the field of image retrieval faces new opportunities and challenges. Most existing clothing retrieval models are based on unimodal retrieval with convolutional neural networks or Transformers, ignoring the rich textual information corresponding to the images; moreover, the features such models can learn tend to be relatively limited. This study proposes a clothing retrieval method based on multiple prompts and contrastive image-text learning. It introduces image and text multiprompt learning to guide a multimodal large model, FashionCLIP, in learning the multidimensional, highly semantic, multimodal features of clothing. To improve the retrieval ability of the model and fully mine its multimodal potential, the model is optimized in two stages. In the first stage, the image and text encoders are frozen and the text prompt is optimized using image and text cross-entropy loss functions. In the second stage, the text prompt and text encoder are frozen, and the image prompt and image encoder are optimized using triplet loss, classification loss, and image and text cross-entropy loss functions. Both intra- and cross-domain retrieval experiments were conducted on the Taobao Live multimodal video product retrieval dataset, known as WAB. The experimental results show that, compared with traditional models, the mean Average Precision (mAP) of this method improves by at least 6.1 percentage points and Rank-1 by at least 3.5 percentage points for intra-domain retrieval, while mAP improves by at least 8.4 percentage points and Rank-1 by at least 6.4 percentage points for cross-domain retrieval. The retrieval results are significantly improved, demonstrating the potential of contrastive learning in the field of clothing retrieval.
  • WANG Qingrong, HAO Fule, ZHU Changfeng, WANG Junjie
    Computer Engineering. 2026, 52(2): 331-341. https://doi.org/10.19678/j.issn.1000-3428.0070065
    In response to the problems of insufficient vehicle feature extraction and single-scenario prediction in existing models, this paper proposes a vehicle trajectory prediction model, MTF-GRU-MTSHMA, that integrates multiple features across multiple scenarios. The proposed model consists of an encoder module, a multifeature extraction module, a multifeature fusion module, and a trajectory prediction module. In the encoder module, a Gated Recurrent Unit (GRU) is used to encode the historical information of the vehicle to obtain its historical status. In the multifeature extraction module, considering the spatial correlation between surrounding vehicles in the target vehicle's area, a multidimensional spatial attention mechanism is proposed to mine the deep features of surrounding vehicles; additionally, a triple attention mechanism is introduced to extract features from the encoded state vector. In the multifeature fusion module, the extracted features are linearly concatenated and input into the multifeature fusion network for fusion. In the trajectory prediction module, the GRU is improved by proposing a Mixed Teaching Force Gated Recurrent Unit (MTF-GRU) as the decoder, which controls the decoding mode by introducing a teaching rate to improve decoding performance. The fused features are input into the decoder to generate future trajectories. The proposed model is evaluated on the NGSIM dataset. The results show that the average Root Mean Square Error (RMSE) of the proposed model for straight road, intersection, and roundabout scenarios is 8.16%, 10.31%, and 8.37% lower, respectively, than that of the best benchmark model, proving the effectiveness of the proposed model.
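Mixed teacher forcing alternates, per decoding step, between feeding the ground truth and the model's own previous prediction according to a teaching rate; a toy sketch (the one-step decoder below is a stand-in, not the MTF-GRU cell):

```python
import numpy as np

def decode(step, h0, targets, teach_rate, rng):
    """Sketch of mixed teacher forcing: at each decoding step, feed the
    ground-truth value with probability teach_rate, otherwise feed the
    model's own previous prediction. `step` is the one-step decoder."""
    h, prev, outputs = h0, targets[0], []
    for t in range(1, len(targets)):
        h, pred = step(h, prev)
        outputs.append(pred)
        prev = targets[t] if rng.random() < teach_rate else pred
    return outputs

# Toy one-step decoder: next position = current position + velocity held in h
step = lambda h, x: (h, x + h)
rng = np.random.default_rng(4)
targets = [0.0, 1.0, 2.0, 3.0]
outs = decode(step, 1.0, targets, teach_rate=1.0, rng=rng)
print(outs)  # full teacher forcing reproduces the trajectory: [1.0, 2.0, 3.0]
```

Setting the teaching rate below 1 exposes the decoder to its own errors during training, which is what improves long-horizon trajectory rollouts at inference time.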
  • LIU Chang, LIANG Bingxue, TIAN Rongkun, QIN Yuhua
    Computer Engineering. 2026, 52(2): 342-355. https://doi.org/10.19678/j.issn.1000-3428.0069817
    In the field of healthcare, existing methods for problem classification suffer from weak text feature representation and often overlook the varying weights of different keywords in multi-class scenarios, thereby affecting classification accuracy. To address these issues, a Medical Problem Classification method based on Multi-Feature Fusion and a Hybrid Neural Network (MPC-MFF-HNN) is proposed to enhance the accuracy of healthcare problem classification. First, the approach combines the RoBERTa-wwm-ext and Word2Vec models to represent text information at both the character and word levels, obtaining rich multi-feature information. This compensates for the limitations of single-feature representation methods and enables the model to comprehensively understand and characterize complex healthcare texts. Second, a hybrid neural network model named MHA-APTC-BiGRU is designed, incorporating multi-head attention mechanisms with an enhanced Text Convolutional Neural Network (TextCNN) and a Bidirectional Gated Recurrent Unit (BiGRU). This model uses multi-level feature extraction to effectively capture deep text features, including keyword weights. Finally, the classifier uses these semantically enhanced feature vectors for problem category classification. Experiments on real-world public datasets reveal significant improvements in precision, recall, and F1 score compared with other baseline algorithms, demonstrating superior performance in healthcare problem classification.
  • Large Language Models and Generative Artificial Intelligence
  • ZHANG Qiwei, LIN Bin, LIU Yunlong
    Computer Engineering. 2026, 52(2): 356-371. https://doi.org/10.19678/j.issn.1000-3428.0069967
    Sepsis is a critical condition caused by infection and is a leading cause of death in Intensive Care Units (ICUs). However, in the context of sepsis treatment, actual clinical data are challenging to obtain. To address this challenge, a Sequentially Coupled medical Wasserstein Generative Adversarial Network (SC-med WGAN) with a gradient penalty is proposed in this study. In contrast to existing models that focus on single-step generation, this model emphasizes the sequential generation of sepsis patient statuses and drug doses to better simulate the process that generates clinical data. The SC-med WGAN consists of two coupled generators that coordinate the generation of patient status and drug dose in a unified model. Moreover, the model employs a mixed-loss technique that introduces a feature-matching loss and Pearson's correlation coefficient as additional terms to account for the actual distribution of individual variables and the correlation between variables over time. Finally, the model is tested on the Medical Information Mart for Intensive Care-III (MIMIC-III) dataset, which contains 17 898 sepsis patient records. Additionally, the model is validated on anemia data, further demonstrating its accuracy and robustness. The experimental results show that the data generated sequentially by the proposed model are superior in quality and authenticity to those generated by other models. The proposed method reveals a significant interaction between the generation of patient status and drug dose data.
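The sequential coupling of the two generators, one producing the next patient status and the other producing the next drug dose conditioned on it, can be sketched with toy stand-in maps (the real SC-med WGAN generators are neural networks trained adversarially):

```python
import numpy as np

def generate_trajectory(gen_state, gen_dose, s0, steps, rng):
    """Sequential coupling sketch: at each step the status generator
    produces the next patient state from (previous state, previous dose),
    then the dose generator produces the next dose from the new state."""
    s, d = s0, 0.0
    traj = []
    for _ in range(steps):
        s = gen_state(s, d, rng)
        d = gen_dose(s, rng)
        traj.append((s, d))
    return traj

rng = np.random.default_rng(5)
gen_state = lambda s, d, r: 0.9 * s - 0.5 * d + r.normal(scale=0.01)
gen_dose = lambda s, r: max(0.0, 0.3 * s)     # dose responds to current status
traj = generate_trajectory(gen_state, gen_dose, s0=1.0, steps=24, rng=rng)
print(len(traj))  # 24 (state, dose) pairs
```

The interleaving is the point: each dose depends on the state it treats, and each state depends on the dose it received, so the generated sequences preserve the status-dose interaction the abstract highlights.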
  • LI Bo, JI Baijun, DUAN Xiangyu
    Computer Engineering. 2026, 52(2): 372-382. https://doi.org/10.19678/j.issn.1000-3428.0069767
    Large Language Models (LLMs) demonstrate a certain level of performance in machine translation tasks: they can generate translations upon receiving a translation prompt. However, owing to limitations imposed by the quality of pre-training corpora and the distribution of languages, translations generated by LLMs still show quality issues such as mistranslations, omissions, hallucinations, and off-target translations. To mitigate the issue of low-quality translations generated by LLMs, this paper proposes a machine translation method for LLMs based on a correction mechanism for error-prone words in translations. Initially, error-prone words for a particular language direction are identified using model translations and reference translations from the original training set. Subsequently, a dataset for correcting these error-prone words is constructed from the error-prone words in the model translations and their corresponding corrections. The correction model is then obtained by fine-tuning a small pre-trained model on the correction dataset. During the inference phase, the correction model is employed to rectify error-prone words in the translations generated by the LLM; the LLM then performs autoregressive decoding to produce a higher-quality translation. Experiments were conducted using the Llama2-7B model across six language directions (Chinese↔English, German↔English, and Russian↔English) on the WMT2022 test set. The results indicate that, compared with translations without correction, the average Crosslingual Optimized Metric for Evaluation of Translation (COMET) and SacreBLEU scores for the X-English translation directions improved by 0.018 7 and 1.26 points, respectively, while those for the English-X translation directions improved by 0.087 9 and 7.67 points, respectively. These experiments substantiate the effectiveness of the error-prone word correction mechanism in enhancing the quality of LLM translation.
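The error-prone word mining step, comparing model translations against references, can be crudely sketched with a word-level set difference (the paper's actual alignment procedure is more involved than this):

```python
def error_prone_words(model_translation, reference):
    """Words in the model translation absent from the reference, paired
    with reference words the model failed to produce -- a crude stand-in
    for the error-prone-word mining step."""
    model_words = model_translation.split()
    ref_words = reference.split()
    wrong = [w for w in model_words if w not in ref_words]
    missing = [w for w in ref_words if w not in model_words]
    return wrong, missing

wrong, missing = error_prone_words(
    "the cat sat on a carpet", "the cat sat on the mat")
print(wrong, missing)  # ['a', 'carpet'] ['mat']
```

Aggregating such mismatches over an entire training set yields the per-direction inventory of error-prone words from which the correction dataset is built.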
  • WANG Heqing, WEI Jie, JING Hongyu, SONG Hui, XU Bo
    Computer Engineering. 2026, 52(2): 383-392. https://doi.org/10.19678/j.issn.1000-3428.0070415
    Large Language Models (LLMs) have made significant progress in dialogue, reasoning, and knowledge retention. However, they still face challenges in factual accuracy, knowledge updating, and a lack of high-quality domain datasets when handling knowledge-intensive tasks in the electricity sector. This study addresses these challenges by introducing an improved Retrieval-Augmented Generation (RAG) strategy that combines hybrid retrieval with a fine-tuned generative model for efficient knowledge capture and updating. The Metadata-driven RAG framework (Meta-RAG) is proposed for knowledge Question Answering (QA) tasks in the electricity domain; it comprises data preparation, model fine-tuning, and retrieval-reasoning stages. The data preparation stage involves document conversion, metadata extraction and enhancement, and document parsing, ensuring efficient indexing and structured processing of power regulation documents. The Electricity Question Answering (EleQA) dataset, consisting of 19 560 QA pairs, is constructed specifically for this sector. The model fine-tuning stage uses multi-question generation, chain-of-thought prompting, and supervised instruction fine-tuning to optimize reasoning abilities on specific tasks. The retrieval-reasoning stage employs mixed encoding and re-ranking strategies, combining the retrieval and generation modules to improve answer accuracy and relevance. Experiments validate the effectiveness of Meta-RAG: compared with baseline models such as Self-RAG, Corrective-RAG, Adaptive-RAG, and RA-ISF, Meta-RAG shows higher answer accuracy and retrieval hit rates. Meta-RAG with the Qwen1.5-14B-Chat model achieves an overall accuracy of 0.804 3, surpassing the other methods. Ablation and document recall experiments indicate that document retrieval significantly impacts framework performance, with a 0.292 8 drop in accuracy when the retrieval capability is removed.
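A mixed-encoding retrieval stage typically blends normalized sparse (keyword) and dense (embedding) scores before handing candidates to the re-ranker; a sketch under that assumption (the mixing weight alpha and the min-max normalization are illustrative choices, not Meta-RAG's published configuration):

```python
def hybrid_scores(sparse, dense, alpha=0.5):
    """Blend min-max-normalized sparse and dense retrieval scores per
    document; the re-ranking model that would follow is omitted."""
    def norm(s):
        lo, hi = min(s.values()), max(s.values())
        return {d: (v - lo) / (hi - lo + 1e-12) for d, v in s.items()}
    ns, nd = norm(sparse), norm(dense)
    return {d: alpha * ns[d] + (1 - alpha) * nd[d] for d in sparse}

sparse = {"doc_a": 12.0, "doc_b": 3.0, "doc_c": 7.0}    # e.g. BM25 scores
dense = {"doc_a": 0.31, "doc_b": 0.88, "doc_c": 0.92}   # e.g. cosine scores
fused = hybrid_scores(sparse, dense)
top = max(fused, key=fused.get)
print(top)  # doc_c: strong on both channels beats strong on one
```

Normalizing before mixing matters because BM25 and cosine scores live on incompatible scales; without it one channel silently dominates the fusion.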
  • CHEN Shihang, SUN Yubao
    Computer Engineering. 2026, 52(2): 393-403. https://doi.org/10.19678/j.issn.1000-3428.0069992
    Generating speaking face videos from speech, which involves processing both audio and visual modalities, is a current research hotspot. A key challenge is achieving precise alignment between the lip movements in the video and the input audio. To address this problem, this study proposes an end-to-end, speech-controlled adversarial model for speaking face video generation, consisting mainly of a modal affine fusion-based generator, a visual quality discriminator, and a lip synchronization discriminator. The affine fusion-based generator injects audio information during face feature decoding through the Modal Affine Fusion Block (MAFBlock), effectively fusing audio information with face information and enabling the audio to better control speaking face video generation. Spatial and channel attention mechanisms are incorporated to enhance the model's focus on local facial regions. The model employs a dual-discriminator strategy to improve both visual quality and lip synchronization accuracy. The lip synchronization discriminator constrains lip movements by evaluating the similarity between the audio and the generated lip shapes without changing the overall contour and face details, thereby providing finer control over lip movement generation. The visual quality discriminator assesses the realism of the generated frames to improve image quality. A comparative experimental analysis is conducted against several existing representative models on two audiovisual datasets. On the LRS2 validation set, the proposed model achieves an LSE-C score of 8.128 and an LSE-D score of 6.112, representing improvements of 4.3% and 4.4% over the baseline, respectively. On the LRS3 validation set, it achieves LSE-C and LSE-D scores of 7.963 and 6.259, representing improvements of 6.2% and 6.9% over the baseline scores, respectively.
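    The affine fusion idea can be illustrated with a small sketch. This assumes (the paper does not specify the exact formulation) that the MAFBlock follows a FiLM-style design in which the audio embedding predicts per-channel scale and shift parameters that modulate the visual feature map; the layer sizes and the linear projections are invented for illustration.

    ```python
    import numpy as np

    def affine_fusion(visual_feat, audio_emb, w_scale, w_shift):
        """Modulate a visual feature map of shape (C, H, W) with
        audio-conditioned per-channel affine parameters."""
        scale = audio_emb @ w_scale          # (C,) per-channel scale
        shift = audio_emb @ w_shift          # (C,) per-channel shift
        return visual_feat * (1.0 + scale[:, None, None]) + shift[:, None, None]

    rng = np.random.default_rng(0)
    C, H, W, A = 8, 4, 4, 16                 # channels, spatial dims, audio dim
    visual = rng.standard_normal((C, H, W))  # decoder feature map
    audio = rng.standard_normal(A)           # audio embedding for one frame
    w_scale = rng.standard_normal((A, C)) * 0.1
    w_shift = rng.standard_normal((A, C)) * 0.1

    fused = affine_fusion(visual, audio, w_scale, w_shift)
    print(fused.shape)  # (8, 4, 4)
    ```

    The `1.0 +` residual formulation keeps the block close to identity when the predicted scale is small, which is a common stabilizing choice in conditional modulation layers.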
  • ZHANG Chenghui, LUO Jing, TU Xinhui, CHEN Yulin
    Computer Engineering. 2026, 52(2): 404-412. https://doi.org/10.19678/j.issn.1000-3428.0070118
    Corpus Query Language (CQL) is a specialized tool for searching and analyzing linguistic corpora. Automating the conversion of natural language queries into CQL statements significantly lowers the entry barrier for corpus users. Although Large Language Models (LLMs) excel in many natural language generation tasks, their performance in generating CQL statements has been suboptimal. To address this issue, a method for automatic corpus query generation based on in-context learning with LLMs, called T2CQL, is proposed. First, the method distills CQL writing rules into a comprehensive yet concise set of Text-to-CQL grammar knowledge standards, which serve as the basis for the LLMs to perform automatic Text-to-CQL conversion and compensate for their lack of domain-specific knowledge. Subsequently, the top-k Text-CQL sample pairs most relevant to the current natural language query are selected using an embedding model; these samples serve as reference points that help the LLMs understand the grammar rules. Finally, a calibration strategy is implemented to mitigate biases in the LLMs' CQL generation, further enhancing performance. The proposed method is evaluated using multiple LLMs on a test set of 1 177 samples. The results demonstrate that T2CQL significantly improves the performance of LLMs on Text-to-CQL conversion tasks, achieving a best Execution Accuracy (EX) of 85.13%.
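    The example-selection step can be sketched as follows. This is a simplified illustration, not the paper's pipeline: a bag-of-words cosine stands in for the embedding model, and the Text-CQL sample pairs, the `select_examples`/`build_prompt` helpers, and the prompt layout are invented for the sketch.

    ```python
    from collections import Counter
    from math import sqrt

    def cosine(a, b):
        """Bag-of-words cosine similarity, a stand-in for embedding similarity."""
        va, vb = Counter(a.lower().split()), Counter(b.lower().split())
        dot = sum(va[t] * vb[t] for t in va)
        norm = (sqrt(sum(x * x for x in va.values()))
                * sqrt(sum(x * x for x in vb.values())))
        return dot / norm if norm else 0.0

    def select_examples(query, pairs, k=2):
        """Return the k (text, cql) pairs most similar to the query."""
        return sorted(pairs, key=lambda p: cosine(query, p[0]), reverse=True)[:k]

    def build_prompt(query, pairs, k=2):
        """Assemble an in-context prompt from the selected demonstrations."""
        demos = "\n".join(f"Text: {t}\nCQL: {c}"
                          for t, c in select_examples(query, pairs, k))
        return f"{demos}\nText: {query}\nCQL:"

    pairs = [
        ("find all plural nouns", '[pos="NNS"]'),
        ("find verbs followed by a noun", '[pos="V.*"] [pos="N.*"]'),
        ("find sentences containing the word bank", '[word="bank"]'),
    ]
    print(build_prompt("find all singular nouns", pairs, k=1))
    ```

    In T2CQL the resulting prompt would additionally be prefixed with the distilled grammar knowledge standards before being sent to the LLM.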
  • MA Jing, CHE Jin, SUN Moxian
    Computer Engineering. 2026, 52(2): 413-422. https://doi.org/10.19678/j.issn.1000-3428.0069611
    To address the failure of text encoders to deeply mine textual information in text-to-image generation tasks, which leads to semantic inconsistency in the generated images, a text-to-image generation method named DXC-GAN is proposed. The method replaces the original text encoder with the XLNet pretrained model from the Transformer family, enabling it to capture prior knowledge from large amounts of text and deeply mine contextual information. A Convolutional Block Attention Module (CBAM) is added to increase the generator's focus on important image information, addressing incomplete image details and incorrect spatial structure. In the discriminator, a contrastive loss is introduced and combined with a match-aware gradient penalty and unidirectional output, drawing images with the same semantics closer and pushing those with different semantics further apart, thereby enhancing the semantic consistency between text and generated images. The experimental results show that, compared with the DF-GAN model, the proposed model improves the Inception Score (IS) and Fréchet Inception Distance (FID) on the CUB dataset by 4.42% and 17.96%, respectively. On the Oxford-102 dataset, the IS is 3.97 and the FID is 37.82. Compared with DF-GAN, DXC-GAN effectively avoids deformities such as multiple heads and missing feet in generated bird images and significantly reduces quality issues such as missing petals in generated flower images. Furthermore, it strengthens the alignment between text and images, significantly improving the completeness and overall quality of the generated images.
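    The contrastive objective in the discriminator can be sketched numerically. This assumes an InfoNCE-style formulation over matched text-image embedding pairs, which is one common way to realize "same semantics closer, different semantics further apart"; the embeddings, batch size, and temperature `tau` are illustrative assumptions, not values from the paper.

    ```python
    import numpy as np

    def contrastive_loss(text_emb, img_emb, tau=0.1):
        """InfoNCE over a batch: each text matches the image at the same index."""
        t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
        v = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
        logits = (t @ v.T) / tau                     # pairwise cosine similarities
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
        return float(-np.log(np.diag(probs)).mean())

    rng = np.random.default_rng(1)
    base = rng.standard_normal((4, 8))               # 4 text embeddings, dim 8

    # Images nearly matching their texts vs. images paired with wrong texts.
    aligned = contrastive_loss(base, base + 0.01 * rng.standard_normal((4, 8)))
    shuffled = contrastive_loss(base, np.roll(base, 1, axis=0))
    print(aligned < shuffled)
    ```

    Minimizing this loss pulls each image embedding toward its own caption and away from the other captions in the batch, which is the semantic-consistency pressure the abstract describes.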