Just accepted

  • WANG Kaiyuan, SHI Caijuan, GAO Weixiang, ZHANG Yiqiong, ZHANG Yinan
    Accepted: 2026-06-17
    Few-Shot Object Detection (FSOD) aims to detect novel objects using only a few annotated samples. Although existing meta-learning-based FSOD methods have improved performance through the collaboration of query and support branches, they still encounter three primary bottlenecks. First, fixed multi-scale feature fusion strategies overlook the relative importance of features across different resolutions, making it difficult to handle multi-scale objects; second, class-level prototypes generated via simple average pooling fail to capture complex intra-class structures and are susceptible to noise interference; third, the semantic scarcity of the support set leads to semantic bias during query-prototype interactions, resulting in false positives or missed detections. To address these challenges, this paper proposes a Feature Fusion and Semantic Enhancement (FFSE) model for few-shot object detection. Built upon the Meta R-CNN framework, FFSE enhances detection performance through three synergistic core modules that address feature fusion, prototype representation, and feature modulation, respectively. First, the Dynamic Weight-based Feature Fusion (DWFF) module adaptively assigns weights to features of different scales, effectively integrating local textures with global semantics to strengthen the model's perception of multi-scale objects. Second, to improve class-level prototype quality, the Prototype Graph Network (PGN) mechanism is introduced. By leveraging the message-passing mechanism of graph neural networks, PGN achieves higher-order semantic enhancement, producing refined prototypes with stronger discriminative power and robustness. Finally, inspired by feature-wise linear modulation, the Feature Modulation Driven by Support set (FMDS) module decomposes the fused query features across multiple receptive fields. It then uses the refined prototypes to generate dynamic scaling and shifting factors for channel-wise affine transformations. 
The scaling factors amplify target-related features, while the shifting factors guide the query feature distribution toward the support semantic space, effectively correcting semantic biases and enhancing object saliency. Quantitative evaluations have been conducted on the PASCAL VOC and MS COCO benchmarks. On PASCAL VOC, FFSE outperforms the baseline method across all three novel-class splits; specifically, for the 5-shot and 10-shot settings, nAP50 increases by at least 2.2%. On the challenging MS COCO dataset, FFSE achieves at least a 5% improvement in nAP over the baseline. Results from multiple experimental runs (mean and standard deviation) demonstrate that, compared with several existing methods, FFSE improves accuracy while maintaining low performance fluctuation and superior robustness. Qualitative comparisons with several methods on the PASCAL VOC dataset further indicate that FFSE can effectively handle heavy occlusion, diverse tiny objects, and high-similarity background interference, significantly reducing cross-category misidentification. In conclusion, the extensive experimental results validate the effectiveness of the proposed FFSE model. In the future, we will explore advanced pixel-level attention mechanisms that suppress background noise in order to further improve FSOD performance.
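The channel-wise affine modulation described for FMDS can be illustrated with a minimal sketch. The abstract does not say how the scaling and shifting factors are computed from the refined prototypes, so they are passed in directly here; the names `film_modulate`, `gamma`, and `beta` are hypothetical and are not taken from the paper.

```python
from typing import List

# One row per channel; spatial positions are flattened into each row.
FeatureMap = List[List[float]]


def film_modulate(query: FeatureMap, gamma: List[float], beta: List[float]) -> FeatureMap:
    """Channel-wise affine transform: out[c][i] = gamma[c] * query[c][i] + beta[c]."""
    if not (len(query) == len(gamma) == len(beta)):
        raise ValueError("one scale and one shift are needed per channel")
    return [[g * v + b for v in channel] for channel, g, b in zip(query, gamma, beta)]


# Toy input: 2 channels, 2 spatial positions each (assumed values).
query = [[1.0, 2.0], [3.0, 4.0]]
gamma = [2.0, 0.5]  # scaling factors: amplify channel 0, damp channel 1
beta = [0.0, 1.0]   # shifting factors: move channel 1 by a constant offset
modulated = film_modulate(query, gamma, beta)
print(modulated)
```

In this toy setting the scale amplifies or suppresses a whole channel while the shift moves its distribution, which is the role the abstract assigns to the two factors.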
  • Lin Junkai, Yu Jinghu, Wang Qimeng, Zhu Fangyong, Xu Haifeng
    Accepted: 2026-06-17
    Oral diseases seriously affect public health, and timely and effective diagnosis and treatment are of great significance for reducing the risk of disease progression. Conventional diagnosis of oral diseases mainly relies on manual interpretation of imaging data by experienced clinicians, which is often time-consuming and may overlook lesions with blurred boundaries. Therefore, image segmentation techniques are needed to assist the clinical diagnosis of dental diseases. Dental panoramic radiographs can present the overall morphology of teeth and jawbone structures in a single image and are commonly used in clinical dental diagnosis. However, due to low gray-level contrast, blurred lesion boundaries, noise, and artifact interference commonly present in dental panoramic radiographs, multi-class dental disease segmentation, including dental caries, periapical periodontitis, furcation involvement, and impacted teeth, remains highly challenging. To address these issues, this paper proposes Teeth-Net, a network for multi-class dental disease segmentation in dental panoramic radiographs. Based on the TransUNet architecture, Teeth-Net introduces targeted improvements in three key stages: feature extraction, feature reconstruction, and skip connections. In the feature extraction stage, a Cross-Scale Pyramid Fusion Module (CPFM) is introduced to optimize the original encoder. Multi-scale features are extracted through parallel group convolutions with different receptive fields, and the correlations among features at different scales are modeled using a cross-scale attention mechanism, thereby enhancing the model’s ability to capture small lesions and alleviating the loss of detailed features. In the feature reconstruction stage, a Parallel Multi-Kernel Pooling Module (PMKP) is designed to extract local details and global contextual information in parallel through multi-scale max pooling and average pooling. 
Channel compression and feature fusion are then performed to provide richer semantic information for the decoder. At each skip connection, a Spatial-Channel Collaborative Attention module (SCCA) is embedded to adaptively filter shallow features transmitted from the encoder through spatial and channel attention mechanisms, suppress background noise interference, and improve the quality of cross-layer feature fusion between the encoder and decoder. Comparative and ablation experiments are conducted on a self-built dental panoramic radiograph dataset. The experimental results show that Teeth-Net achieves a mean Dice coefficient, Hausdorff Distance (HD), precision, and recall of 84.22%, 18.546 mm, 94.13%, and 95.96%, respectively. Compared with the baseline TransUNet model, the mean Dice coefficient, precision, and recall are improved by 3.34, 2.89, and 4.21 percentage points, respectively, while the HD value is reduced by 6.869 mm. These results indicate that the proposed method achieves significant improvements in overall segmentation accuracy, boundary consistency, and lesion detection capability. To further evaluate the generalization ability and cross-dataset adaptability of the model, external tests are conducted on two public-source datasets. On the re-annotated MICCAI 2023 STS external test set, Teeth-Net achieves a mean Dice coefficient, HD value, precision, and recall of 80.26%, 19.520 mm, 92.58%, and 93.41%, respectively. Compared with the baseline TransUNet model, the mean Dice coefficient, precision, and recall are improved by 3.32, 4.33, and 3.89 percentage points, respectively, while the HD value is reduced by 6.705 mm. On the public Multi-Center Dental Panoramic Radiography Image (MCDP) dataset, Teeth-Net achieves a mean Dice coefficient, HD value, precision, and recall of 88.99%, 12.126 mm, 90.61%, and 92.45%, respectively. 
Compared with the baseline TransUNet model, the mean Dice coefficient, precision, and recall are improved by 3.83, 4.03, and 3.33 percentage points, respectively, while the HD value is reduced by 7.222 mm. The results on the self-built dataset and the two external test datasets demonstrate that Teeth-Net achieves better segmentation accuracy, boundary delineation ability, and cross-domain adaptability than the baseline TransUNet model under different data sources and imaging conditions. The proposed method can provide effective technical support for the assisted diagnosis of multi-class dental diseases in dental panoramic radiographs.
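For readers unfamiliar with the headline metric, the following minimal sketch shows how a Dice coefficient is commonly computed for binary masks and then averaged over classes. The masks are made-up toy data, and the function names are illustrative assumptions; this is not the authors' evaluation code.

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice = 2 * |P and T| / (|P| + |T|) for flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Two empty masks agree perfectly, so Dice is defined as 1.0 in that case.
    return 1.0 if total == 0 else 2.0 * intersection / total


def mean_dice(per_class: Sequence[float]) -> float:
    """Unweighted average of per-class Dice scores."""
    return sum(per_class) / len(per_class)


# Toy 6-pixel masks for a single lesion class (assumed values).
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
score = dice_coefficient(pred, target)
print(f"Dice = {score:.4f}")
```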
  • Yang Benchen, Yao Jia, Jin Haibo, Ren Zhecong, Liu Shiqi
    Accepted: 2026-06-16
    Image steganography embeds secret data into cover images for covert communication and is an important topic in information and multimedia security. With social-media compression, format conversion, image resampling, and active steganalysis, traditional methods face more complex scenarios. Existing deep steganography methods mainly focus on visual imperceptibility and embedding capacity, but pay insufficient attention to message confidentiality, integrity authentication, and error tolerance after extraction. Thus, covert transmission, content protection, and robust recovery are still difficult to unify. To address these problems, this paper proposes an information-encryption-driven high-security image steganography model. It jointly designs authenticated encryption, error-correction coding, key-controlled scrambling, and a deep steganographic network to achieve secure, covert, and reliable transmission over complex channels. At the payload generation stage, an “encryption-error correction-scrambling” defense scheme is built. HKDF-SHA256 is used to derive encryption and scrambling keys. AES-GCM provides authenticated encryption and generates ciphertext with confidentiality and integrity verification. Reed-Solomon coding is introduced to provide symbol-level error correction for the steganographic channel. If the number of erroneous symbols after inverse scrambling is within the RS correction radius, the correct data packet can be recovered. If the error exceeds the correction ability or AES-GCM authentication fails, decryption is stopped to avoid incorrect plaintext output. In addition, CSPRNG-based position and bit scrambling reduce payload correlation and statistical bias, while a sparse bitmap controls embedding positions and reduces structural clues exploitable by steganalyzers. At the embedding stage, a hybrid U-Net combining MS-DiSpAC and ViT is designed. 
MS-DiSpAC extracts texture, edge, and local structural features through multi-scale convolution, and uses dilated spatial attention to enlarge the receptive field while preserving resolution. It guides high-entropy payloads into complex texture regions. ViT supplements global context modeling and improves long-range dependency representation. The network generates stego images through a residual perturbation map and an intensity map, balancing image fidelity and recovery stability under high payloads. A WGAN discriminator with Wasserstein distance is further used for adversarial distribution alignment, making stego images statistically closer to cover images and reducing detection by SRNet, ZhuNet, and other steganalyzers. Experiments are conducted on ImageNet, COCO, and Visual Genome, including performance, generalization, payload whitening, robustness, and ablation tests. Metrics include PSNR, MS-SSIM, LPIPS, BER, ESR, ACC.1, ACC.2, and Dacc. At 0.4 bpp, the proposed method achieves 38.65 dB PSNR, 0.975 MS-SSIM, 0.036 LPIPS, and 99.14% bit recovery accuracy on ImageNet. Payload whitening results show that, after AES-GCM encryption, RS coding, and dual scrambling, single-bit entropy increases from 0.8932 to 0.9998, and average absolute autocorrelation decreases from 0.1285 to 0.0028. The final payload is close to random. Compared with representative methods, the proposed model achieves a better balance among visual fidelity, information recovery, and anti-steganalysis capability. It also maintains high recovery success under complex distortions within the RS correction range, providing a feasible solution for high-security image steganography in real network environments.
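The key-derivation and scrambling steps of the payload stage can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions: HKDF follows RFC 5869, but the salt, the `info` labels, and the Fisher-Yates shuffle driven by an HKDF keystream are hypothetical stand-ins for the paper's unspecified CSPRNG. AES-GCM and Reed-Solomon coding are omitted because they require third-party libraries.

```python
import hashlib
import hmac
from typing import List


def hkdf_sha256(ikm: bytes, salt: bytes, info: bytes, length: int) -> bytes:
    """HKDF (RFC 5869) with SHA-256: extract a PRK, then expand it to `length` bytes."""
    if length > 255 * 32:
        raise ValueError("requested output is too long for HKDF-SHA256")
    prk = hmac.new(salt or b"\x00" * 32, ikm, hashlib.sha256).digest()
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def keyed_permutation(n: int, key: bytes) -> List[int]:
    """Fisher-Yates shuffle driven by an HKDF keystream (toy: slight modulo bias, n <= 2040)."""
    stream = hkdf_sha256(key, b"", b"position-scramble", 4 * n)
    perm = list(range(n))
    for i in range(n - 1, 0, -1):
        r = int.from_bytes(stream[4 * i:4 * i + 4], "big") % (i + 1)
        perm[i], perm[r] = perm[r], perm[i]
    return perm


def scramble(data: bytes, perm: List[int]) -> bytes:
    """Output position i receives the byte from input position perm[i]."""
    return bytes(data[p] for p in perm)


def unscramble(data: bytes, perm: List[int]) -> bytes:
    """Invert `scramble` using the same permutation."""
    out = bytearray(len(data))
    for i, p in enumerate(perm):
        out[p] = data[i]
    return bytes(out)


master = b"shared secret (demo only)"
enc_key = hkdf_sha256(master, b"demo-salt", b"encryption", 32)  # would key AES-GCM
scr_key = hkdf_sha256(master, b"demo-salt", b"scrambling", 32)  # keys the permutation
payload = bytes(range(16))
perm = keyed_permutation(len(payload), scr_key)
restored = unscramble(scramble(payload, perm), perm)
print(restored == payload)
```

Deriving the two keys from one master secret with different `info` labels keeps the encryption and scrambling keys independent, and the receiver can regenerate the same permutation from the shared secret to invert the scrambling before error correction.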
  • Chunyan Shuai, Shunyuan Zheng, Xiaoqi Zhang, Xin Ouyang
    Accepted: 2026-06-16
    Highway traffic flow during holidays exhibits significant spatiotemporal heterogeneity, making accurate short-term Origin-Destination (OD) flow prediction a key technology for enhancing the intelligent level of road network management. To address issues such as the high-dimensional sparsity of OD data, complex spatiotemporal dependencies, and holiday pattern shifts, this paper proposes a short-term highway OD flow prediction method based on spatiotemporal fusion and holiday adjustment, and constructs a Dual-stage Spatio-Temporal Fusion Network (DSTF) model. First, a spatiotemporal feature extraction architecture for multi-source data fusion is designed: a dual-branch Graph Attention Network (GAT) is used to extract and fuse spatial features representing macroscopic travel correlations from the OD perspective and microscopic node state dependencies from the entrance and exit flow perspectives. Then, a gated fusion module combining a Temporal Convolutional Network (TCN) and a Convolutional Long Short-Term Memory network (CNN-LSTM) collaboratively captures both the short-term fluctuations and long-term periodic trends of traffic flow. Simultaneously, a Cross-Attention mechanism is introduced to achieve multi-task collaborative prediction of entrance flow, exit flow, and baseline OD flow. To adapt to the special travel patterns during holidays, the model adopts a two-stage training strategy: the first stage trains the baseline prediction model using sufficient and stable non-holiday data; the second stage introduces a lightweight Sequence-to-Sequence (Seq2Seq) holiday adjustment module, focusing on learning the deviation of holiday patterns from the baseline, and performs adaptive fine-tuning on the baseline OD flow predictions. 
Experimental results based on real highway toll data show that the proposed DSTF model significantly outperforms various baseline models across multiple evaluation metrics in holiday short-term OD prediction tasks, achieving reductions of 11.7% in MAE and 12.2% in RMSE compared to the best baseline model STGCN in 1-step prediction, demonstrating higher prediction accuracy, stronger robustness, and superior scenario adaptability.
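The two reported error metrics and the notion of a relative reduction can be made concrete with a short sketch. The flow values below are invented for illustration and are unrelated to the paper's toll data; `relative_reduction` is a hypothetical helper name.

```python
import math
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def relative_reduction(baseline: float, proposed: float) -> float:
    """Percentage by which `proposed` lowers an error metric relative to `baseline`."""
    return 100.0 * (baseline - proposed) / baseline


# Toy OD flows for four origin-destination pairs (assumed values).
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 40.0]
print(mae(observed, predicted), rmse(observed, predicted))
print(relative_reduction(2.0, 1.75))  # an error falling from 2.0 to 1.75
```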
  • WU Guodong, ZHENG Yang, XIE Dongchen
    Accepted: 2026-06-16
    Hypergraph Neural Networks (HGNNs) have emerged as a prominent research direction in recommender systems due to their capability to model high-order interactions and integrate multi-source heterogeneous information. Unlike traditional Graph Neural Networks (GNNs), which are limited to pairwise relationships, HGNNs employ hyperedges to capture high-order associations among an arbitrary number of nodes, thereby preserving complex semantics in user–item interactions, such as many-to-many relationships, group structures, and multimodal information. This paper first outlines the general pipeline of HGNN-based recommendation from four aspects: data input, hypergraph construction, representation learning, and recommendation generation. Furthermore, recent advances in HGNN-based recommendation are systematically reviewed from two perspectives: hypergraph construction strategies and feature propagation mechanisms. These developments are analyzed across multiple application scenarios, including sequential recommendation, multi-behavior recommendation, social recommendation, multimodal recommendation, and group recommendation. In the context of sequential recommendation, existing studies have explored various hypergraph construction strategies, including local dependency modeling based on session interactions, global co-occurrence modeling, cross-session collaborative modeling, and multi-scale spatiotemporal dynamic modeling. Correspondingly, feature propagation mechanisms that integrate hypergraph attention-based denoising and self-supervised contrastive learning have been investigated to enhance temporal representation learning. 
These approaches help overcome the “neighborhood limitation” inherent in conventional GNNs and enable more accurate modeling of users’ evolving interests and long-range dependencies. For multi-behavior recommendation, hypergraph construction strategies are categorized into behavior-specific modeling, unified behavior modeling, and temporal behavior modeling. Feature propagation mechanisms, such as cascaded dependency propagation, behavior-aware attention, and cross-view contrastive learning-based denoising, have been developed to address data sparsity in target behaviors, facilitate semantic alignment across behaviors, and support knowledge transfer. In social recommendation, existing works focus on hypergraph construction methods based on homophily-driven dual views, heterogeneous semantic relationships, and privacy-preserving mechanisms. Feature propagation strategies incorporating trust-aware attention and dual-channel gated fusion have been proposed, which extend beyond traditional pairwise social modeling and contribute to capturing complex group influence and high-order social structures. For multimodal recommendation, hypergraph construction strategies include modality-specific separation, collaborative semantic association, and multimodal hypergraph optimization. Feature propagation mechanisms based on modality-specific convolutional aggregation and cross-modal contrastive alignment have demonstrated effectiveness in reducing modality noise and enabling high-order reasoning within a unified semantic space, thereby improving representation quality. In group recommendation, hypergraph construction approaches involve multi-view hierarchical alignment, group structure-aware optimization, and tripartite relationship modeling. 
Feature propagation mechanisms that incorporate cross-level feedback and attention-based aggregation better align with the inherent “inclusion” relationships within groups and provide an effective solution for alleviating cold-start issues in dynamic group scenarios. Despite these advancements, several challenges remain in HGNN-based recommendation. Dynamic hypergraph models often face difficulties in meeting the requirements of real-time recommendation. High-order aggregation may introduce information resolution loss, while noisy pseudo-hyperedges can adversely affect model robustness. In addition, the computational and storage complexity of hypergraphs poses scalability challenges in large-scale applications. Furthermore, issues related to interpretability and fairness in recommendation results remain insufficiently addressed. To address these challenges, this paper discusses several promising research directions for future HGNN-based recommendation systems, including representation learning based on generative self-supervised disentanglement, lightweight and efficient training and inference frameworks, robustness enhancement via causal inference, scenario-aware multimodal fusion, and collaborative integration with large language models. These directions are expected to provide valuable insights for advancing research in this field.
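To make the contrast with pairwise graphs concrete, the sketch below runs one round of node-to-hyperedge-to-node propagation on a tiny hypergraph. It uses plain mean aggregation with no learned weights or attention, which is a deliberate simplification rather than any specific model from the survey; the node names, hyperedge names, and feature values are invented.

```python
from typing import Dict, List

Vector = List[float]


def mean(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def hypergraph_propagate(
    features: Dict[str, Vector], hyperedges: Dict[str, List[str]]
) -> Dict[str, Vector]:
    """One node -> hyperedge -> node round using mean aggregation."""
    # Step 1: each hyperedge summarises all of its member nodes at once.
    edge_messages = {
        edge: mean([features[node] for node in members])
        for edge, members in hyperedges.items()
    }
    # Step 2: each node averages the messages of the hyperedges it belongs to.
    updated: Dict[str, Vector] = {}
    for node, feature in features.items():
        incident = [edge_messages[e] for e, members in hyperedges.items() if node in members]
        updated[node] = mean(incident) if incident else list(feature)
    return updated


features = {"u1": [1.0, 0.0], "u2": [0.0, 1.0], "i1": [1.0, 1.0], "i2": [3.0, 3.0]}
hyperedges = {
    "session_a": ["u1", "i1", "i2"],  # one session linking a user and two items
    "group_b": ["u1", "u2"],          # a two-user group
}
updated = hypergraph_propagate(features, hyperedges)
print(updated["u2"])
```

A single hyperedge such as `session_a` connects three nodes in one step, whereas a pairwise graph would need three separate edges to express the same co-occurrence.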
  • YanJie Pan, Chi Mingmin, PENG Bo
    Accepted: 2026-06-16
    Video virtual try-on technology aims to accurately transfer target garments onto human subjects in videos while maintaining high consistency between body motion and garment appearance, serving as a core technology in fields such as e-commerce, virtual reality, and short-video creation. However, existing technical frameworks still face significant challenges in balancing generation quality and computational efficiency. Traditional Generative Adversarial Network (GAN)-based methods often rely on optical flow estimation for garment warping, which is highly prone to texture distortion and visual artifacts when handling complex motions. In recent years, U-Net-based diffusion models have achieved high-fidelity generation by introducing a garment reference branch. However, when such dual-branch architectures are migrated to larger and more expressive Diffusion Transformer (DiT) backbones, they introduce substantial parameter redundancy and VRAM overhead. Furthermore, existing methods typically inject static garment features repeatedly during the denoising process of each frame. This not only significantly increases the computational burden but also, because static features lack natural temporal correlation, makes it difficult for models to maintain spatiotemporal coherence during non-rigid deformations, resulting in severe flickering artifacts. To address these challenges regarding the adaptability, training efficiency, and resource consumption of DiT architectures in video virtual try-on tasks, this study proposes a lightweight framework named OIE (Once is Enough). The OIE framework adopts a novel single-branch strategy featuring first-frame guidance and one-time injection, effectively decoupling garment editing from temporal generation tasks. 
First, during the garment appearance injection stage, a pre-trained high-fidelity image virtual try-on model, FiT-DiT, is utilized to precisely edit the video's initial frame, yielding results integrated with fine-grained garment textures. Second, to maximally preserve the temporal priors of the DiT model, only the edited first frame is embedded as the starting token into the latent feature sequence of the backbone network. This avoids the dense cross-branch feature interaction modules typical of traditional dual-branch architectures, achieving zero structural modification to the backbone. Additionally, to address the loss of background layout information caused by human motion, this method designs a lightweight background encoder that smoothly accumulates background information into the backbone features via a mask guider. Finally, during the fine-tuning stage, Low-Rank Adaptation (LoRA) is applied to all self-attention, cross-attention, and feed-forward network (FFN) modules of the DiT, enabling dynamic regulation of the large-scale parameter model with an extremely low number of trainable parameters. Experiments conducted on the ViViD and VVT datasets yield quantitative evaluation results demonstrating that, in terms of efficiency, OIE introduces only a 0.50% additional parameter overhead, with FLOPs and FPS remaining virtually unchanged. Its performance significantly surpasses dual-branch methods such as MagicTryOn (15.11% parameter increase) and ViViD (157.10% parameter increase). Regarding quality metrics, OIE achieves competitive video quality scores under both paired and unpaired settings on the ViViD dataset, attaining a VFIDp of 9.3983 and a VFIDu of 17.0831, significantly leading existing mainstream methods. Ablation studies confirm that high-quality first-frame guidance effectively suppresses error generation in the early stages of synthesis, improving the SSIM metric to 0.8466. 
Through its decoupling strategy, the OIE framework effectively resolves the computational burden of DiT architectures in video generation, achieving an excellent balance among garment fidelity, temporal coherence, and computational efficiency. This method demonstrates that leveraging strong temporal priors within a single-branch architecture can replace high-frequency feature injection, offering a highly valuable lightweight pathway for high-resolution and real-time video editing tasks.
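The low trainable-parameter count attributed to LoRA follows from its low-rank update, sketched below in its standard form: the frozen weight W is adjusted by (alpha / r) · B · A. The matrices and layer sizes are toy assumptions chosen for illustration; they are not OIE's configuration, and the printed fraction is unrelated to the 0.50% overhead reported in the abstract.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (m x k) and b (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w: Matrix, a: Matrix, b: Matrix, alpha: float) -> Matrix:
    """W_eff = W + (alpha / r) * B @ A, with B of shape d_out x r and A of shape r x d_in."""
    rank = len(a)
    delta = matmul(b, a)
    scale = alpha / rank
    return [[wv + scale * dv for wv, dv in zip(w_row, d_row)] for w_row, d_row in zip(w, delta)]


def lora_param_fraction(d_out: int, d_in: int, rank: int) -> float:
    """Trainable LoRA parameters divided by the parameters of the frozen weight."""
    return rank * (d_out + d_in) / (d_out * d_in)


w = [[1.0, 0.0], [0.0, 1.0]]  # frozen 2 x 2 weight (toy)
b = [[1.0], [2.0]]            # d_out x r, with r = 1
a = [[0.5, 0.0]]              # r x d_in
w_eff = lora_effective_weight(w, a, b, alpha=2.0)
print(w_eff)
print(lora_param_fraction(4096, 4096, 8))  # hypothetical layer size and rank
```

Only `a` and `b` would be trained; `w` stays fixed, which is why the added parameter count grows with the rank rather than with the size of the backbone layer.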
  • Zhao Chengjun, Xu Xian
    Accepted: 2026-06-15
    Industrial steel surface defects exhibit pronounced anisotropic texture characteristics with large intra-class variation, yet existing real-time detection methods lack effective perception mechanisms for such directional local patterns in feature pyramid networks. This paper proposes a direction-aware heterogeneous convolution feature enhancement method based on RT-DETR, incorporating three key technical contributions. First, a Direction-Aware Sparse Convolution (DASC) kernel is designed, which partitions input channels into five directional groups with fixed sparse spatial masks to achieve parallel perception of right, left, down, up, and center directional textures at approximately 11.5% of the computational cost of equivalent standard convolutions. Second, a Direction-aware Interaction and Refinement (DIR) bottleneck is constructed using an expand-activate-compress dual-layer DASC structure to realize hierarchical fusion of directional features across channels, forming the complete Lightweight Feature Enhancement module with Cross-stage 3 modules for RT-DETR (LFEC3-RT). Third, a Cross-scale FPN Consistent Deployment (CFPD) strategy is introduced, globally deploying LFEC3-RT across all four fusion positions in the feature pyramid to eliminate cross-scale feature style inconsistency caused by selective deployment. Experiments on the NEU and GC10-DET steel surface defect benchmarks demonstrate that the proposed method achieves 76.3% mAP@0.5 on NEU (+2.2% over RT-DETR-R18 baseline) and 64.4% mAP@0.5 on GC10-DET (+3.3% over baseline), achieving competitive or superior performance over YOLOv11m on both datasets while requiring only 56.0 GFLOPs and 19.8M parameters. Ablation studies confirm that increasing direction count from 1 to 5 raises mAP from 74.4% to 76.3%, expansion ratio λ=4 is optimal, and CFPD global deployment outperforms selective deployment by +0.9% mAP.
  • Fan Xinggang, Shi Xuegang, Liao Siteng, Zhao Yiyi, Liang Yuzhu, Wang Tian
    Computer Engineering. https://doi.org/260431
    Accepted: 2026-06-15
    There is a profound structural contradiction between the surge in the parameter scale of large language models and the limited physical resources of edge terminals, which restricts their large-scale implementation. Traditional cloud-centralized inference depends heavily on network transmission and faces high communication latency, making it difficult to meet the dual requirements of extremely low latency and strict data privacy in scenarios such as autonomous driving and intelligent healthcare. However, edge physical hardware, ranging from microcontrollers to edge gateways, is highly heterogeneous, and general cloud-side compression schemes are difficult to apply directly. Therefore, based on the heterogeneous physical constraints of edge devices, this paper systematically reviews the technical system of efficient compression and software-hardware collaborative deployment of large models for the edge side. First, this paper analyzes the underlying mechanisms of three core compression technologies, namely model quantization, parameter pruning, and knowledge distillation, in edge scenarios. In terms of quantization, although post-training quantization offers deployment agility, it faces the problem of representation collapse caused by the abnormal long-tail activations of large language models. Although quantization-aware training has a certain degree of robustness, it is limited by the lack of retraining computing power at the edge. In terms of pruning, this paper demonstrates the actual energy-efficiency advantages of structured pruning on hardware with limited memory access bandwidth and points out that the high theoretical compression rate of unstructured pruning is easily offset by the index addressing overhead of general-purpose edge chips. In terms of distillation, traditional shallow parameter alignment carries the risk of feature loss and bias amplification when crossing the capacity gap between the teacher and edge student models. 
Overall, single compression technologies show an obvious diminishing marginal return effect under extreme constraints. Second, to alleviate the performance bottleneck of single technologies, this paper summarizes a multi-level hybrid compression paradigm driven by both model architecture and physical scenarios. Three core optimization links are systematically sorted out: the serial pipeline strategy aiming at a high physical compression rate, which is suitable for real-time inference at edge gateways; the deeply coupled joint optimization flow for a strict trade-off between energy efficiency and accuracy, which synchronously updates quantization, pruning, and low-rank decomposition within the same framework and is suitable for mobile terminals with limited power consumption; and the distillation-driven mechanism for the deployment of large-parameter models, which uses teacher priors to guide structure reshaping and quantization. This multi-level paradigm effectively expands the multi-dimensional trade-off space among model scale, computing power consumption, and fidelity. Furthermore, in the face of a wide range of computing power and energy consumption levels, this paper constructs a four-layer software-hardware collaborative design mechanism of "system-model-operator-instruction". It clearly points out that the focus of collaborative optimization needs to be dynamically shifted according to physical base constraints: at the system level, it focuses on resource-aware scheduling and task distribution in the cloud-edge environment; at the model level, it relies on hardware-aware architecture search to achieve structural adaptation; at the operator level, it promotes cross-layer fusion and memory access locality reconstruction; at the instruction level, it focuses on custom-extended instructions for specific micro-architectures (such as RISC-V) to precisely control the underlying energy consumption. 
Combined with the full-chain deployment process of model conversion, compilation reconstruction, and memory management (such as SwapNet), this mechanism effectively maps compression algorithms to underlying physical execution and improves the comprehensive utilization efficiency of heterogeneous computing power. Finally, this paper outlines the future research challenges in the field of edge-intelligent lightweighting. It emphasizes that robust compensation mechanisms for ultra-low bit-widths (4-bit and below), hardware-adaptive dynamic semi-structured pruning, and effective knowledge transfer for the deep-level logical reasoning of large models are the core directions for overcoming the current lightweighting bottlenecks. At the same time, it is urgent to build a hardware-agnostic unified toolchain based on deep-learning compilers to eliminate the deployment barriers of fragmented heterogeneous devices. Through a systematic review of technologies, this paper provides solid theoretical support and a reference guide for the development of an edge-intelligent ecosystem with low latency and strong privacy.
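Why long-tail activations hurt post-training quantization can be seen in a toy example. The sketch below applies plain symmetric per-tensor quantization at 4 bits, a common baseline scheme chosen here for illustration rather than a method from the review; the activation values are invented.

```python
from typing import List, Tuple


def quantize_symmetric(values: List[float], bits: int) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization: code = round(x / scale), scale = max|x| / qmax."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]


activations = [0.1, -0.2, 0.3, 0.25]  # typical small activations (toy)
with_outlier = activations + [60.0]   # one long-tail outlier added

codes_clean, scale_clean = quantize_symmetric(activations, bits=4)
codes_outlier, scale_outlier = quantize_symmetric(with_outlier, bits=4)
print(codes_clean)
print(codes_outlier)
```

With the outlier present, the scale is set by the single large value, so every ordinary activation rounds to the same code and its information is lost. This is one simple way to picture the collapse the abstract refers to.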
  • Tong Songsong, Yang Kuiwu, Zhou Gang, Ding Mengd
    Accepted: 2026-06-12
    To address the difficulty of deploying backdoor defenses in Machine Learning as a Service (MLaaS) black-box scenarios, this paper proposes an adaptive image preprocessing defense framework that relies solely on natural image statistics priors. The framework performs multi-dimensional feature analysis on input images to construct a backdoor risk quantification metric. According to the risk level, it dynamically selects and combines multi-level processing operations—including compression–reconstruction, geometric transformations, color perturbations, and dynamic random sequences—to disrupt the activation conditions of potential backdoor triggers. A quality feedback mechanism is introduced to balance defense effectiveness and visual usability. Experiments on the GTSRB, CIFAR-10, and MINI-ImageNet datasets evaluate five representative attacks, namely BadNets, Blended, WaNet, reflection attacks, and WaveAttack, which cover explicit patches, global blending, geometric warping, physical reflection, and frequency-domain perturbations. The results show that the proposed method reduces the average attack success rate to below 10% while preserving the model’s normal classification performance (with an average accuracy drop of no more than 3.5%). Notably, the suppression effect on WaveAttack is significant, achieving a success rate as low as 2.38%. Ablation studies confirm the critical role of the adaptive strategy and the quality feedback mechanism in performance improvement, and the framework exhibits stable performance across three datasets of varying scales, demonstrating strong generalization. This research provides an efficient and practical adaptive backdoor defense solution for MLaaS black-box services.
  • ZHAO Yijing, QIN Na, LIU Yuan, SONG Menghao
    Accepted: 2026-06-12
    Remote sensing image change detection aims to precisely localize land cover changes by comparatively analyzing the spatiotemporal evolution information contained in bi-temporal imagery, and has become a core task in fields such as dynamic monitoring of land resources, urban expansion assessment, and disaster emergency response. However, influenced by multiple factors including complex terrain interference, variations in illumination conditions, seasonal vegetation succession, and sensor imaging noise, change regions often exhibit characteristics such as substantial scale variations, discrete spatial distribution, and ambiguous boundary delineation. Existing change detection models suffer from insufficient exploitation of multi-scale information and inadequate extraction of deep global semantic correlations, rendering it challenging for these models to effectively discriminate genuine land surface changes from pseudo-changes, thereby constraining their discrimination accuracy in open-scene scenarios. To address the aforementioned limitations, a Multi-level Loss-assisted Siamese Network (MLLA_SiaNet) for remote sensing image change detection is proposed. The model adopts a weight-sharing Siamese architecture to extract multi-dimensional features from bi-temporal images separately, and generates hierarchical feature maps through a multi-level differential encoder. To overcome the linear limitations inherent in conventional differencing methods, we introduce a multi-angle difference representation strategy coupled with a channel-spatial hybrid attention mechanism, and design a Differential Fusion Module (DFM) to acquire high-quality difference features, thereby achieving adaptive suppression of background interference and precise focusing on genuine change characteristics. 
To compensate for the deficiency in global semantic representation, we integrate a spatial pooling pyramid with a Gaussian pyramid and propose a Deep Semantic Pyramid (DSP) module to construct multi-level semantic aggregation features, effectively expanding the receptive field and strengthening long-range contextual dependency modeling. During the decoding stage, the model employs a progressive upsampling strategy combined with a feature fusion mechanism to hierarchically restore spatial details, thereby enabling the reconstruction of high-resolution prediction maps. Furthermore, we introduce a deeply supervised Multi-level Loss-assisted (MLA) strategy to optimize the training process; by imposing auxiliary constraints on the outputs of each decoder layer, this strategy ensures consistency between local edge information and global contextual semantics, thereby constructing an end-to-end feature learning framework. To systematically validate the effectiveness of the proposed model, comparative experiments are conducted and results are comprehensively analyzed on two publicly available benchmark datasets, namely SYSU-CD and LEVIR-CD. On the SYSU-CD dataset, MLLA_SiaNet achieves an F1-score of 82.13%, outperforming seven other comparative methods and surpassing the second-best method, SFEARNet, by 1.3 percentage points; its precision and recall attain optimal values of 83.42% and 80.88%, respectively, achieving simultaneous improvement in both precision and recall metrics. 
On the LEVIR-CD dataset, MLLA_SiaNet achieves a precision of 89.48%, fully demonstrating the effectiveness of the proposed method in suppressing pseudo-change factors such as illumination variations, shadow effects, and seasonal vegetation changes; the F1-score of our model on the LEVIR-CD dataset reaches 85.87%, outperforming other state-of-the-art methods including SFEARNet (precision 84.89%), BIT (precision 82.80%), and IFN (precision 82.29%). Both quantitative and qualitative analyses of the experimental results demonstrate that the model exhibits superior robustness under varying spatial resolutions and complex land cover conditions. Ablation studies further corroborate the advantages of the DFM, DSP, and MLA modules in enhancing overall model performance, and the effectiveness of each architectural stage is empirically verified through analysis of the visualized response feature maps. In summary, this study mitigates the impacts of several critical challenges in remote sensing image change detection tasks, including insufficient multi-scale feature interaction, weak correlation modeling of global semantic information, and difficulties in suppressing pseudo-change interference. Future work will focus on lightweight model deployment, multi-temporal sequence modeling, and self-supervised pre-training techniques, as well as expanding systematic evaluations of model robustness across diverse application scenarios.
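As a quick consistency check on the SYSU-CD figures reported in this abstract, the F1-score is the harmonic mean of precision and recall. The sketch below recomputes it from the reported precision (83.42%) and recall (80.88%); the function name `f1_score` is ours and is not taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both in the same units, e.g. percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported SYSU-CD precision and recall for MLLA_SiaNet, in percent.
sysu_f1 = f1_score(83.42, 80.88)
print(f"{sysu_f1:.2f}")  # 82.13, matching the reported F1-score
```

The recomputed value agrees with the 82.13% stated in the abstract.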
  • ZOU Shengpeng, MA Fuli, LI Yunlong, YU Qinsi, HU Xiaoyan, ZOU Ziming
    Accepted: 2026-06-12
    With the increasing number of space science satellites, the types of onboard scientific payloads have become increasingly diverse, and the volume of downlinked scientific data has grown continuously. However, the available computational resources of ground data processing systems for space science satellites remain limited. Consequently, data processing tasks generated by satellites during in-orbit operations must be completed under constrained resource conditions. Meanwhile, different tasks exhibit significant heterogeneity in terms of timeliness requirements and computational resource consumption characteristics, and the system workload and resource states vary dynamically over time. Therefore, scheduling strategies need to dynamically adjust the execution order of data processing tasks and resource allocation schemes based on real-time system states (including task loads and computational resource utilization) to improve overall processing efficiency and system responsiveness. To address these challenges, we propose an online decision-making deep reinforcement learning–based resource scheduling algorithm, DeepRL-Sched, which is built upon Proximal Policy Optimization (PPO) and models the satellite data processing task scheduling problem as a Markov Decision Process (MDP). To mitigate the short-sighted decision-making issue caused by reinforcement learning methods relying solely on the current system state, as well as the challenges of slow convergence and unstable training, we design two key components: a computational resource demand prediction module and an imitation learning module. The former predicts future task workloads and resource demands to provide the scheduling policy with foresight information, thereby alleviating short-sighted decisions caused by partial observability. 
The latter employs imitation learning to extract prior knowledge from high-quality expert scheduling strategies, guiding the training of the policy network and significantly improving convergence speed and training stability. Experimental results demonstrate that the proposed algorithm effectively enhances the scheduling efficiency of space science satellite ground data processing systems, reduces the overall task completion time, and significantly improves the timeliness of processing high-priority tasks.
  • Zhenxiong Li, Tingyu Huang, Min Cao, Jing Yang, Linghua Xu, Bo Deng
    Accepted: 2026-06-11
    UAV object detection technology holds great potential for ecological restoration monitoring in photovoltaic (PV) power stations. However, practical applications face challenges such as complex background interference, blurred features, and small object sizes. To address these issues, this paper proposes MDS-DETR, an improved object detection model based on RT-DETR. First, an improved backbone network named CSP-MambaVision is designed. By synergizing the gradient shunt characteristics of CSP with the linear global modeling capabilities of MambaVision, and introducing SFS-Conv and EMA for progressive feature optimization, this backbone significantly enhances the visual semantic modeling capacity in complex environments. Second, a lightweight DTAB is introduced to replace the native AIFI module. Relying on grouped channel control and masked spatial constraints, DTAB expands the receptive field to capture multi-scale contextual information while optimizing the model's perception and discriminative abilities for objects with ambiguous features. Finally, a small object detection module, SOEP-MFM, is proposed. Utilizing cross-scale feature recombination and a dynamic weight adjustment mechanism, this module achieves multi-level preservation of small object features within the network, effectively strengthening their representation and improving the detection accuracy of small objects. Experiments on public datasets demonstrate the significant advantages of MDS-DETR. Compared to the baseline model, Precision, Recall, mAP50, and mAP50-95 increase by 4.96%, 3.04%, 4.09%, and 3.58%, respectively. The model outperforms other mainstream algorithms. Furthermore, the study applies the optimized MDS-DETR to PV ecological restoration monitoring. The results align closely with actual measurements, indicating that the model provides reliable support for ecological restoration planning.
  • XU Jing-wen, TANG Kun, YANG Meng-long, WANG Li-hui
    Accepted: 2026-06-11
    Multi-modal medical image registration aims to achieve accurate spatial alignment of anatomical structures across different imaging modalities. However, due to inherent differences in imaging mechanisms, significant inconsistencies exist in intensity distribution and texture characteristics among modalities, which lead existing methods to suffer from limited accuracy and robustness in complex scenarios. Recently, unsupervised feature disentanglement approaches have partially alleviated the reliance on registration labels. Nevertheless, the lack of explicit constraints often results in insufficient suppression of modality-specific information and potential degradation of key anatomical structures. Therefore, effectively eliminating modality discrepancies while preserving structural integrity remains a fundamental challenge in multi-modal medical image registration. To address this issue, this paper proposes a Feature Decoupling and Structural Reconstruction Network (FDR-Net), which establishes a closed-loop framework consisting of feature disentanglement, deformation estimation, and reconstruction verification. Specifically, a feature encoder with global self-attention is employed to explicitly decompose input images into modality-related style representations and modality-invariant structural representations. A modality discrimination constraint is further introduced to encourage effective removal of style information from structural features. Moreover, a cross-modal feature mixing strategy is designed to artificially introduce modality perturbations, thereby enhancing the robustness of structural representations against modality variations. In the registration stage, a U-Net-based architecture is adopted to predict dense deformation fields from the disentangled structural features. 
Feature-level and image-level similarity constraints are jointly imposed, together with a smoothness regularization term to ensure spatial continuity and physical plausibility of the deformation field. In addition, a cycle-consistent reconstruction module is incorporated, where reconstructed targets are dynamically generated based on predicted deformation fields. A composite reconstruction loss, consisting of structural similarity (SSIM) and mean squared error (MSE), is used to back-propagate supervision signals to the feature learning process. This design further strengthens structural consistency while suppressing modality discrepancies. Extensive experiments are conducted on two public datasets, SR-Reg and BraTS2021, to validate the effectiveness of the proposed method. On the SR-Reg dataset, the Dice score without registration is 62.24%, while the proposed FDR-Net achieves 79.58%, outperforming the second-best method BSF_Fusion (77.86%) by 1.72 percentage points. The HD95 and ASSD are 2.89 mm and 0.90 mm, respectively, and the deformation fields show smoother and more stable performance in critical anatomical regions such as ventricles. On the more challenging BraTS2021 dataset, which includes complex tumor-induced deformations, FDR-Net achieves a Dice score of 86.85%, outperforming BSF_Fusion (84.98%) by 1.87 percentage points, with HD95 and ASSD reduced to 4.12 mm and 1.79 mm, respectively. Notably, these improvements are achieved with only approximately 1.0M additional parameters. Ablation studies further demonstrate that removing the cross-modal mixing strategy, modality discrimination constraint, or cycle-consistent reconstruction module leads to Dice drops of 5.3, 4.8, and 6.1 percentage points, respectively. Feature analysis also confirms that the proposed method effectively reduces modality separability in structural representations, enabling stable modality-invariant feature learning. 
In conclusion, the proposed FDR-Net effectively disentangles modality-specific style information from anatomical structure representations through explicit feature decoupling, cross-modal mixing, multi-level discrimination constraints, and cycle-consistent reconstruction. It significantly improves registration accuracy and robustness while preserving structural integrity. Without relying on generative image translation or handcrafted similarity metrics, the proposed method provides an efficient and generalizable solution for multi-modal medical image registration in complex clinical scenarios.
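To make the loss terms named in this abstract more concrete, here is a minimal NumPy sketch of a smoothness regularizer on a deformation field and a composite SSIM plus MSE reconstruction loss. It is an illustration under stated assumptions, not the FDR-Net implementation: the field is 2-D, SSIM is computed over a single global window rather than local windows, and the function names and the `ssim_weight` parameter are hypothetical.

```python
import numpy as np


def smoothness(flow: np.ndarray) -> float:
    """Mean squared forward difference of a 2-D displacement field of shape (2, H, W)."""
    dy = np.diff(flow, axis=1)  # differences between vertically adjacent pixels
    dx = np.diff(flow, axis=2)  # differences between horizontally adjacent pixels
    return float(np.mean(dy ** 2) + np.mean(dx ** 2))


def global_ssim(a: np.ndarray, b: np.ndarray, data_range: float = 1.0) -> float:
    """SSIM over one global window (a simplification of the usual local-window SSIM)."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    num = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    den = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return float(num / den)


def reconstruction_loss(recon: np.ndarray, target: np.ndarray,
                        ssim_weight: float = 0.5) -> float:
    """Weighted sum of (1 - SSIM) and MSE; the weighting is an assumed choice."""
    mse = float(np.mean((recon - target) ** 2))
    return ssim_weight * (1.0 - global_ssim(recon, target)) + (1.0 - ssim_weight) * mse


rng = np.random.default_rng(0)
image = rng.random((8, 8))
print(reconstruction_loss(image, image))   # close to zero for a perfect reconstruction
print(smoothness(np.zeros((2, 8, 8))))     # a constant field incurs no penalty
```

A perfect reconstruction drives both terms toward zero, and a constant displacement field has zero smoothness penalty, which is the behaviour such regularizers are meant to reward.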
  • LI Bo, LIU Shouwen, YUAN Mengting
    Accepted: 2026-06-11
    Deploying Mixture-of-Experts (MoE) networks on resource-constrained edge FPGAs faces severe memory wall and load imbalance challenges. Existing dynamic scheduling or batch processing solutions struggle to meet the strict real-time requirements of streaming inference. To address these issues, a load-aware hardware-software co-optimization method is proposed. Leveraging the long-tail distribution characteristics of expert activations, a Probability-Aware Static Locking (PASL) strategy is designed to minimize memory access latency under limited capacity via a hierarchical storage mechanism. Simultaneously, a statistics-driven automated Design Space Exploration (DSE) engine is constructed to achieve the optimal non-uniform allocation of computational resources. Furthermore, to tackle the macro distribution drift challenge prevalent in real-world edge scenarios, a load-evolution-oriented hysteretic hardware-software co-reconfiguration mechanism is proposed, which effectively filters out micro-semantic noise and prevents cache thrashing. Experimental results demonstrate that in single-frame streaming inference scenarios, the proposed method achieves up to a 2.22× throughput improvement over the uniform allocation strategy and up to a 1.52× improvement over the state-of-the-art Edge-MoE solution. In terms of energy efficiency, it surpasses CPU and GPU baselines by up to 2.9× and 3.1×, respectively, while achieving an end-to-end latency as low as 16.33 ms when processing complex Vision Transformers. When confronted with dynamic distribution drift, the proposed mechanism delivers a 17.3% throughput improvement over the static baseline while maintaining zero additional overhead in steady-state random scenarios. Ultimately, this approach effectively resolves the bottlenecks of real-time performance, energy efficiency and dynamic environmental adaptability in edge MoE network deployments.
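The idea of exploiting a long-tail expert activation distribution can be illustrated with a small sketch. The abstract does not specify how PASL chooses which experts to lock, so the greedy top-k rule below, the function names, and the activation counts are all assumptions made for illustration.

```python
def select_locked_experts(activation_counts: dict, capacity: int) -> set:
    """Pick the `capacity` most frequently activated experts (ties broken by expert id)."""
    ranked = sorted(activation_counts, key=lambda e: (-activation_counts[e], e))
    return set(ranked[:capacity])


def expected_hit_rate(activation_counts: dict, locked: set) -> float:
    """Fraction of activations served by locked (fast-memory) experts."""
    total = sum(activation_counts.values())
    if total == 0:
        return 0.0
    return sum(activation_counts[e] for e in locked) / total


# Hypothetical long-tail profile: expert id -> number of activations observed offline.
counts = {0: 70, 1: 20, 2: 6, 3: 4}
locked = select_locked_experts(counts, capacity=2)
print(sorted(locked))                     # [0, 1]
print(expected_hit_rate(counts, locked))  # 0.9
```

With a skewed profile like this one, locking only half of the experts covers most activations, which is the intuition behind keeping hot experts in the fastest storage tier.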
  • Wu Yongqing, Zhang Han
    Accepted: 2026-06-05
    Named Entity Recognition (NER) aims to accurately identify entities with predefined semantic categories and clear boundaries from text. In Chinese NER, the absence of explicit word boundaries, the complexity of semantic expressions, and the widespread presence of polyphonic and visually similar characters often lead to semantic ambiguity. Existing methods predominantly rely on character- or word-level information and make insufficient use of key linguistic features such as pinyin and radicals. In addition, multi-source heterogeneous feature fusion is typically performed via simple concatenation or weighting, which fails to capture deep semantic correlations among different features and thus limits further performance improvements. To address these issues, this paper proposes a Chinese NER method based on Multi-Feature Hierarchical Fusion (MFHF) to achieve collaborative modeling and deep semantic integration of multi-dimensional linguistic features. Specifically, in the feature representation stage, four types of embeddings (character, pinyin, radical, and lexical) are constructed. Character embeddings are derived from a pre-trained language model to capture contextual semantic information and long-range dependencies. Pinyin embeddings encode phonetic sequences to model pronunciation differences and alleviate polyphonic ambiguity. Radical embeddings employ a convolutional neural network to model character structures and extract fine-grained semantic features at the glyph level. Lexical embeddings incorporate word-level information via a lexicon matching mechanism to enhance the model’s ability to detect multi-character entity boundaries. Together, these embeddings improve character representations from phonetic, glyph, and lexical semantic perspectives. 
To address insufficient interaction and coarse granularity in multi-source feature fusion, a hierarchical cross-attention mechanism is designed. At the local level, two groups of cross-attention (pinyin–radical and character–lexical) are constructed to model the intrinsic relationships between phonetic and glyph information, as well as the structural dependencies between character-level and word-level semantics, through bidirectional attention interactions; this enables fine-grained alignment and complementarity among heterogeneous features. At the global level, the locally enhanced multi-source features are concatenated and further modeled using a multi-head self-attention mechanism to capture long-range dependencies across features, achieving deep semantic integration and generating semantically enriched representations. On this basis, a joint optimization strategy combining multi-task learning and adversarial training is introduced: auxiliary tasks of pinyin prediction and radical prediction are designed to strengthen feature learning, and gradient-based adversarial perturbations are applied in the embedding space to improve robustness and generalization under complex conditions. Finally, the fused representations are fed into a Bidirectional Long Short-Term Memory (BiLSTM) network for sequence modeling, and a Conditional Random Field (CRF) layer is employed for global decoding to obtain entity recognition results. Experiments conducted on three public Chinese NER datasets, MSRA, Weibo, and Resume, demonstrate that the MFHF model achieves F1 scores of 96.78%, 96.14%, and 71.80%, respectively, outperforming several representative baseline models, with improvements of 1.09, 1.55, and 1.68 percentage points over CPL-NER, GS-Lexicon, and Lattice-LSTM on the respective datasets. 
In summary, the proposed approach effectively enhances semantic modeling capability and model robustness for Chinese NER through multi-feature hierarchical fusion and joint optimization strategies.
  • Zhao Chao, Wen Jin Hui, Yu Guo, Zhao Yan Nan, Du Xia Wei, Hu Chen, Liu Wei, Yin Ze Ming, Liu Yu Hai
    Accepted: 2026-06-05
    Low-precision training for large language models helps reduce training cost and improve hardware utilization. However, existing high-efficiency low-precision training frameworks mostly rely on native FP8 hardware support, making them difficult to migrate directly to domestic AI accelerators that lack FP8 execution capability. Therefore, a key challenge is how to reconstruct a low-precision training path suitable for domestic accelerators without relying on dedicated FP8 hardware units, while still maintaining training stability and achieving practical end-to-end performance gains. To address this issue, this paper proposes an INT8 dynamic-quantization-based efficient Transformer Engine training scheme for domestic hardware. The proposed scheme redesigns the original FP8 linear-layer computation flow around the integer matrix multiplication capability already available on domestic accelerators, thereby enabling low-precision pretraining of large language models without dedicated FP8 hardware support. In terms of method design, the proposed scheme preserves the dynamic scaling management principle of Transformer Engine and reconstructs the original FP8-dependent linear-layer computation flow into a cross-precision execution path consisting of dynamic quantization, INT8 matrix multiplication, INT32 accumulation, and fused dequantization recovery. This design maps the most computation-intensive matrix multiplication operations onto the underlying integer compute units. To balance implementation feasibility and execution efficiency, a tensorwise dynamic quantization strategy is adopted, in which activations and weights are scaled online according to the dynamic range of each tensor. Combined with the native support of domestic SIMT accelerators for INT8×INT8 integer matrix multiplication and INT32 accumulation, this design enables the domestic adaptation of the core linear-layer operators in Transformer Engine. 
Furthermore, to address the problems of activation–gradient scale mismatch, quantization error amplification, and convergence degradation that easily arise in numerically sensitive modules such as the input embedding layer and output layer under uniform INT8 quantization, this paper analyzes the numerical characteristics of these layers from the perspectives of gradient propagation and error propagation, and accordingly proposes a hierarchical precision quantization strategy. Specifically, the input embedding layer and output layer remain in BF16 precision to ensure stable gradient propagation and reliable parameter updates; computation-intensive intermediate modules, including attention projection layers and feed-forward networks, adopt an INT8 low-precision path to fully exploit the throughput of integer compute units; scaling factors and some critical intermediate variables are retained in higher precision to balance numerical stability and practical acceleration. On this basis, the proposed scheme is integrated into the Megatron-LM distributed training framework and validated through multi-model pretraining experiments on domestic accelerators. The experiments evaluate Llama2-7B, Llama2-13B, Llama3.1-8B, Qwen3-4B, Qwen3-8B, and Mixtral-8x7B-8L, the last of which is an 8-layer pruned version based on the Mixtral-8x7B architecture. Under the same number of training iterations, the proposed INT8 scheme is compared with the BF16 baseline. The results show that the proposed method maintains training loss curves close to those of the BF16 baseline across different models, without obvious oscillation, divergence, or convergence stagnation, indicating that the reconstructed INT8 training path can effectively preserve convergence stability during large-model pretraining. 
In terms of end-to-end training efficiency, the achieved speedups for Llama2-7B, Llama2-13B, Llama3.1-8B, Qwen3-4B, Qwen3-8B, and Mixtral-8x7B-8L are 1.21, 1.16, 1.17, 1.07, 1.20, and 1.12, respectively, demonstrating stable efficiency gains across models of different scales and architectures. Overall, the proposed method effectively reconstructs the low-precision training path of Transformer Engine on domestic accelerators without native FP8 hardware support. Through the coordinated design of dynamic quantization, an INT8 computation path, and a hierarchical precision quantization strategy, the method achieves stable end-to-end acceleration while maintaining convergence stability. The experimental results indicate that, under current hardware conditions, software-level computation-path reconstruction combined with model-structure-aware precision configuration can effectively unlock the potential of integer compute units, providing a practical solution for efficient pretraining of large language models on domestic platforms.
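The cross-precision path described in this abstract (dynamic quantization, INT8 matrix multiplication, INT32 accumulation, dequantization) can be sketched for a forward linear layer as follows. This is a minimal NumPy illustration, not the paper's kernels: symmetric rounding to the range [-127, 127], the function names, and the use of NumPy in place of accelerator integer units are all assumptions, and the backward pass is not shown.

```python
import numpy as np


def quantize_tensorwise(x: np.ndarray):
    """Symmetric per-tensor INT8 quantization; the scale comes from the tensor's own range."""
    amax = float(np.max(np.abs(x)))
    scale = amax / 127.0 if amax > 0.0 else 1.0  # avoid division by zero for all-zero tensors
    q = np.clip(np.rint(x / scale), -127, 127).astype(np.int8)
    return q, scale


def int8_linear(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Compute x @ w.T through an INT8 path with INT32 accumulation."""
    qx, sx = quantize_tensorwise(x)  # activations, scaled online
    qw, sw = quantize_tensorwise(w)  # weights, scaled online
    acc = qx.astype(np.int32) @ qw.astype(np.int32).T    # integer matmul, INT32 accumulator
    return acc.astype(np.float32) * np.float32(sx * sw)  # dequantize with the combined scale


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16)).astype(np.float32)
w = rng.standard_normal((8, 16)).astype(np.float32)
err = np.max(np.abs(int8_linear(x, w) - x @ w.T))
print(err)  # small relative to the magnitude of the full-precision output
```

Because a single scale covers each tensor, one multiplication by `sx * sw` recovers the floating-point result, which is what makes the dequantization step easy to fuse with the matrix multiplication.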
  • Anran Fang, Lemen Chao
    Accepted: 2026-06-03
    This study aims to investigate the degradation risks of Generative Artificial Intelligence (GAI) models in self-training loops, with a focus on two core phenomena: content homogenization and the widening divergence between human and machine-generated texts. We select two representative generative models with distinct architectures and build an iterative self-training framework, using the proportion of human data in the training set (α) as the key hyperparameter. Under different initial values of α, we conduct controlled experiments combining two typical dynamic strategies—linear decay and exponential decay—and systematically evaluate the quality, diversity, and human-likeness of generated content using multidimensional performance metrics. The results show that, during self-training, GAI models exhibit a persistent decline in performance, a marked reduction in output diversity, and a gradual increase in the gap between human and machine-generated texts. The linear decay strategy can effectively slow down the decline of information entropy and help maintain content diversity, but it becomes increasingly vulnerable to the cumulative impact of model-generated data pollution in later stages. In contrast, although the exponential decay strategy leads to more pronounced performance fluctuations in the early phase, it achieves superior stability in the long run. Moreover, lightweight unidirectional language models (GPT2) are more prone to falling into a vicious cycle of noise amplification during self-training, whereas bidirectional encoder models (BART), endowed with stronger global modeling capacity, demonstrate greater robustness in the presence of synthetic data contamination. These findings provide important empirical support for optimizing dynamic data-mixing strategies in GAI self-training.
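The two dynamic strategies for the human-data proportion α can be written down as simple schedules. The abstract does not give the exact formulas or constants, so the forms below, the parameter names (`alpha_min`, `rate`, `total_steps`), and the sample sizes are assumptions chosen only to show how the two shapes differ.

```python
def linear_decay(alpha0: float, step: int, total_steps: int, alpha_min: float = 0.0) -> float:
    """Human-data proportion falling in equal increments from alpha0 to alpha_min."""
    frac = min(step, total_steps) / total_steps
    return alpha_min + (alpha0 - alpha_min) * (1.0 - frac)


def exponential_decay(alpha0: float, step: int, rate: float = 0.5) -> float:
    """Human-data proportion multiplied by `rate` at every self-training iteration."""
    return alpha0 * (rate ** step)


def mix_counts(alpha: float, n_total: int):
    """Split a training set of n_total samples into (human, model-generated) counts."""
    n_human = round(alpha * n_total)
    return n_human, n_total - n_human


for step in range(5):
    a_lin = linear_decay(0.8, step, total_steps=4)
    a_exp = exponential_decay(0.8, step)
    print(step, mix_counts(a_lin, 1000), mix_counts(a_exp, 1000))
```

Under these assumed forms, the exponential schedule removes human data fastest in the first iterations and then flattens, while the linear schedule removes it at a constant pace until none remains.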
  • Wang Lihui, Li Yuan, Liu Zefeng, Wei Yachuan
    Accepted: 2026-06-03

    Unmanned aerial vehicle (UAV) power inspection images often contain cluttered backgrounds and variable target scales. These factors limit the image retrieval accuracy. To solve these problems, this paper proposes a power image retrieval network named Swin-FMG. The network is based on frequency-domain coordinate synergy and multi-scale gating. The method uses Swin Transformer as the backbone architecture. First, it proposes a Frequency-domain Coordinate Collaborative Attention (FCCA) mechanism. FCCA combines global spectrum filtering and orthogonal space projection. It effectively suppresses environmental noise and restores the physical continuity of target geometric features. Second, the method designs a Semantic-Guided Multi-Scale Convolutional Gated Fusion (MSCGF) module. MSCGF uses deep semantics to adaptively filter shallow multi-scale textures. It also constructs a dual-stream retrieval representation. This module greatly enhances the perception ability of the model to cope with cross-view scale changes. Finally, the method introduces Low-Rank Adaptation (LoRA) fine-tuning and a joint loss function with hard-sample triplets. These strategies mitigate the overfitting risk on small samples. They also optimize the inter-class separability of the feature metric space. The method is evaluated on a self-built power inspection image retrieval dataset. Experimental results show that the mean Average Precision (mAP) of Swin-FMG reaches 63.15%. The Recall@1 reaches 71.04%. Compared with the baseline Swin Transformer, the mAP of Swin-FMG increases by 4.19%. In conclusion, Swin-FMG effectively strips complex environmental interference and captures scale-invariant features. It significantly improves the image retrieval performance of power equipment while maintaining computational efficiency. The experimental results verify the effectiveness of the proposed method.

  • HU Yunfei, GU Fei, HAN Puyu
    Accepted: 2026-06-03
    In the process of compiler optimization for dynamically typed languages, a large number of runtime check nodes must be inserted due to the uncertainty of runtime types. Existing Redundant Code Elimination (RCE) algorithms based on reachability analysis generally treat all control flow nodes as potentially having side effects. As a result, computations and control flow structures associated with these check nodes are preserved during analysis, making it difficult to safely remove semantically redundant computations and control flows. To address this issue, a semantics-driven RCE method for the Ark runtime is proposed based on a systematic analysis of its compilation workflow and Intermediate Representation (IR) structure. The method begins with observable program semantics, where program behavior is abstracted as a sequence of observable events, including input/output operations, exception throwing, system calls, and termination with specific return values. Based on this abstraction, the RCE problem is formulated as the removal of IR subgraphs under the constraint that observable program semantics remain unchanged. On this basis, a criterion for eliminating runtime check nodes is introduced: a check node and its dependent computations can be safely removed if they produce no side effects and their results are not used by any node that affects observable program behavior. This criterion overcomes the interference introduced by runtime checks and allows the removal of computations that are preserved by traditional approaches despite being semantically redundant. Around this criterion, a semantics-constrained live node propagation mechanism is designed. The process starts by initializing a live node set containing side-effect nodes, followed by expanding this set along data dependency relations. 
Only nodes that may influence observable program behavior are retained in the set, enabling the identification and elimination of redundant computations and their associated check nodes. Furthermore, to address redundant control flows that cannot be handled by existing methods, control flow graph construction, dominance analysis, and loop structure identification algorithms are incorporated. A method for detecting and eliminating redundant loops and redundant branches is proposed, enabling the overall removal of structures such as empty loops and branches. The proposed method has been integrated into the Ark runtime compilation framework, achieving optimization of redundant computations and control flows at the IR level. Experimental results demonstrate that, in terms of executed instructions, the average reduction across all test cases reaches 3.4%, while representative cases containing redundant control flows achieve an average reduction of 27.4%, with a maximum of 98.26%. In terms of execution time, an average reduction of 3.4% is observed across all test cases, with representative cases achieving an average reduction of 26.4%, and up to 99.99% reduction in loop-intensive programs. Regarding compilation overhead, the execution time of the proposed algorithm accounts for only 2.28% of the total compilation time on average, indicating low additional cost. In overall performance evaluation, the total time of compilation and execution decreases in most representative cases, with a maximum reduction of 94.55%. In addition, validation on 913 runtime unit test cases and 19749 test262 standard test cases shows that no semantic deviation is introduced. Compared with source-level redundancy elimination approaches, further performance gains are achieved at the fine-grained computation and control flow levels, demonstrating the unique advantages of IR-level optimization. 
The proposed method effectively overcomes the limitations imposed by runtime checks on RCE in dynamically typed languages, significantly improves execution efficiency while preserving semantic equivalence, and maintains low compilation overhead, providing a new approach for compiler optimization in dynamic language runtimes.
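    The live node propagation described above can be illustrated with a small worklist sketch. This is not the Ark runtime implementation: the graph encoding, the node names, and the assumption that the example check node is free of side effects are all hypothetical and chosen only to show the idea of seeding the live set with side-effect nodes and expanding it along data dependencies.

```python
from collections import deque


def propagate_live_nodes(inputs, side_effect_nodes):
    """Return the set of live nodes in a toy IR graph.

    inputs: dict mapping each node id to the node ids whose values it consumes
            (hypothetical encoding of data dependencies).
    side_effect_nodes: node ids with observable effects (I/O, throw, return...).
    """
    live = set(side_effect_nodes)      # seed: nodes that affect observable behavior
    worklist = deque(live)
    while worklist:
        node = worklist.popleft()
        for dep in inputs.get(node, ()):
            if dep not in live:        # a value feeding a live node is live too
                live.add(dep)
                worklist.append(dep)
    return live


def removable_nodes(inputs, side_effect_nodes):
    """Nodes never reached by the propagation are candidates for removal."""
    live = propagate_live_nodes(inputs, side_effect_nodes)
    return sorted(n for n in inputs if n not in live)


# Toy graph: check_x is assumed to be a side-effect-free runtime check whose
# result only feeds an unused computation, so the whole chain is redundant.
toy_inputs = {
    "load_x": [],
    "check_x": ["load_x"],
    "square": ["check_x"],
    "const_1": [],
    "print": ["const_1"],
    "return": ["const_1"],
}
toy_side_effects = {"print", "return"}

if __name__ == "__main__":
    print(removable_nodes(toy_inputs, toy_side_effects))
```

    In this sketch a check that could throw or trigger deoptimization would have to be listed among the side-effect nodes, which keeps it and everything it depends on alive.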
  • ZHENG Cheng, TAO Wenhao
    Accepted: 2026-06-02
    Aspect-based sentiment analysis (ABSA) aims to judge the sentiment polarity of specific aspects in texts. Existing methods usually adopt graph neural networks and attention mechanisms to encode the syntactic dependency information and semantic information of sentences. However, a syntactic dependency tree captures only the dependency relations between words and cannot express phrase-level syntactic structure, which limits the model's use of phrase-level syntactic information. Moreover, when attention mechanisms are used to capture the semantic features of sentences, they are often disturbed by irrelevant context, which introduces semantic noise. Therefore, this paper proposes an aspect-based sentiment analysis model based on syntactic enhancement and semantic denoising. In the syntactic branch, a syntactic constituent tree is introduced to construct a syntactic constituent graph, so as to supplement phrase-level syntactic information. Two types of syntactic information are encoded by the syntactic constituent graph and the syntactic dependency graph respectively, and are dynamically weighted and aggregated through a syntactic fusion mechanism to obtain syntactically enhanced representations. In the semantic branch, a differential attention mechanism is introduced to reduce the attention weights of irrelevant contexts, thereby reducing semantic noise and obtaining denoised semantic representations. In addition, representations fused with external knowledge are obtained by concatenating external knowledge embeddings at the end of word embeddings, so as to help the model better understand sentence semantics. Finally, a multi-feature fusion module is used to fully integrate the three features. 
Experimental results show that compared with baseline models such as S2GSL, the proposed model improves the accuracy by at least 0.36, 0.83 and 3.13 percentage points on the Laptop, Restaurant and Twitter datasets, respectively, and boosts the F1-score by at least 0.56 and 2.96 percentage points on the Laptop and Twitter datasets, respectively, which verifies the effectiveness of the model’s syntactic enhancement and semantic denoising methods.
  • WANG Chao, WANG Yijing, DAI Cheng
    Accepted: 2026-06-02
    Continual anomaly detection focuses on incrementally learning new classes while retaining historical memory. However, the spectral bias and high-frequency artifacts encountered in generative replay severely constrain the fine-grained segmentation of subtle anomalies. To address this, this study proposes DenoiseCAD, a noise-resistant framework based on a cascaded purification architecture, aiming to eliminate feature shifts caused by generative artifacts and prevent the model from capturing spurious features unrelated to defects. First, the study proposes a feature prototype-guided latent space correction mechanism. During the reverse diffusion process, it utilizes the feature prototypes of normal classes as semantic anchors and iteratively rectifies latent variables by calculating feature metric gradients, thereby suppressing distribution shift noise from the source. Second, a task-driven frequency filter is constructed based on parameter sensitivity experiments, implementing a multi-granular spectral joint constraint strategy tailored to data source characteristics to effectively block the propagation of high-frequency artifacts. Finally, anchor-based weight consolidation is implemented. Through isotropic parameter distance constraints, it prevents the model from overfitting to residual noise, thereby establishing a full-pipeline denoising framework from source to terminal. This effectively balances the model's plasticity and stability, successfully alleviates the catastrophic forgetting dilemma, and provides a reliable new paradigm for complex intelligent industrial inspection scenarios. Extensive experiments demonstrate that DenoiseCAD achieves state-of-the-art performance on both the VisA and MVTec datasets. Notably, it yields significant improvements of 2.8% and 1.5% in P-AP over previous state-of-the-art methods.
  • Peng Yanfei, Bai Yihui, Wang Ziying, Chen Xiaozhu
    Accepted: 2026-06-02
    Unmanned aerial vehicle (UAV) object detection plays a crucial role in fields such as intelligent transportation and environmental monitoring. However, because of factors such as target size variation and diverse shooting angles, small object detection in UAV aerial imagery suffers from drastic scale changes and rapid feature attenuation. To tackle these issues, this study proposes an improved object detection algorithm for UAV aerial-view scenarios based on YOLOv11n, namely DBD-YOLO. In the feature extraction stage, the DWR multi-scale structure is introduced, which combines dilated convolutions with multiple dilation rates and adaptive channel allocation. This structure can effectively expand the receptive field with low computational overhead and enhance the contextual representation of small objects. In the neck network, a new P2 feature layer is incorporated into the feature fusion process. A Bi-directional Feature Pyramid Network (BiFPN) is adopted to realize cross-scale bidirectional weighted fusion, so as to improve the collaboration efficiency between shallow detailed features and deep semantic features. Meanwhile, traditional upsampling is replaced by Dysample point resampling, which not only reduces memory consumption and latency but also maintains fine-grained features. Finally, DynamicHead, a dynamic adaptive detection head, is introduced. It integrates scale awareness, spatial awareness, and task awareness into a unified framework, and effectively applies the attention mechanism in the object detection head, thereby comprehensively improving the classification and localization performance of small objects in aerial imagery. Experimental results on the VisDrone2019-DET dataset show that the proposed DBD-YOLO algorithm achieves 45.2% in mAP50 and 27.4% in mAP50-95, representing increases of 12.1% and 8.1% over the baseline, respectively. 
At the same time, the number of model parameters remains roughly at the same level as the baseline, realizing a dual breakthrough in both detection accuracy and computational efficiency.
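    The cross-scale weighted fusion mentioned above can be sketched as follows. The abstract does not state which weighting scheme DBD-YOLO uses, so this example assumes the fast normalized fusion from the original BiFPN formulation, in which each input gets a learnable non-negative weight and the weights are normalized by their sum plus a small epsilon. Feature maps are flattened to plain lists here, and the function name is illustrative.

```python
def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse same-shaped feature vectors with normalized non-negative weights.

    features: list of equally long lists of floats (flattened feature maps).
    weights:  one scalar per feature; in a network these would be learned.
    """
    if len(features) != len(weights):
        raise ValueError("need exactly one weight per feature")
    clipped = [max(w, 0.0) for w in weights]   # ReLU keeps weights non-negative
    total = sum(clipped) + eps                 # eps avoids division by zero
    length = len(features[0])
    fused = [0.0] * length
    for feat, w in zip(features, clipped):
        if len(feat) != length:
            raise ValueError("all features must have the same shape")
        for i, value in enumerate(feat):
            fused[i] += (w / total) * value
    return fused


if __name__ == "__main__":
    shallow = [1.0, 2.0]   # stands in for a high-resolution, detail-rich map
    deep = [3.0, 4.0]      # stands in for a resized, semantically rich map
    print(fast_normalized_fusion([shallow, deep], [1.0, 1.0]))
```

    Because the weights are normalized rather than passed through a softmax, the fusion stays cheap while still letting the network favor one resolution over another.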
  • WANG Jin, ZHANG Jiancheng, XU Cheng, XU Bingxin, ZHANG Cheng, LI Tianci
    Accepted: 2026-05-29
    To address the challenges of large variations in the scale of student behaviors, dense target distributions, and insufficient recognition accuracy for back-row students in classroom scenarios, this paper proposes an improved student behavior recognition algorithm, termed MSD-YOLO, based on the YOLO11n baseline model. First, a Multi-Scale Behavior Perception module is introduced into the backbone to enhance the network’s ability to perceive behavior features at different scales, thereby alleviating the scale inconsistency between front-row and back-row students during the feature extraction stage. Second, a Semantic–Spatial Deep Fusion module is designed in the neck to strengthen the interaction between high-level semantic information and low-level spatial details, improving the discriminative representation of features in dense classroom scenes. Finally, a Dual-Scale Context Aggregation module is embedded before each detection head. By integrating global contextual information with a feature re-calibration mechanism, the proposed module further enhances the network’s ability to distinguish small-scale student behaviors, thereby improving the recognition accuracy of back-row students during the detection stage. Experimental results demonstrate that, compared with the YOLO11n baseline, MSD-YOLO achieves improvements of 3.2% and 3.7% in mAP@0.5 and mAP@0.5:0.95, respectively, on the self-constructed dataset. On the public STBD-08 dataset, the corresponding improvements reach 2.4% and 2.6%. Moreover, with only a slight increase in computational cost and model parameters, the proposed method maintains favorable real-time performance, validating its effectiveness and practical value for classroom student behavior recognition tasks.
  • HE Yaojie, FU Xiaodong
    Accepted: 2026-05-29
    Online service reputation measurement aggregates user feedback to generate service reputation, which helps users judge service credibility in the absence of sufficient information. However, due to the dynamic evolution of the service environment, service quality, user quantity, and user preferences keep changing over time. Reputation measurement methods that focus only on a single time point cannot reflect these changes in a timely and accurate manner. In addition, service reputation measurement mechanisms that do not consider the maximization of user group satisfaction struggle to motivate users to give evaluations consistent with their real experience. This leads to some services being assigned false reputation values. To address these issues, this paper proposes an online service reputation measurement method for maximizing user group satisfaction. First, this paper models online service reputation measurement in dynamic environments as a Partially Observable Markov Decision Process (POMDP) optimization problem for maximizing user group satisfaction. Second, to handle the inconsistency of user group evaluation criteria, this paper adopts large language models to calculate the reward function and measure user group satisfaction accordingly. Finally, this paper uses the Rainbow DQN algorithm to solve the optimization problem. Experiments are conducted on two public datasets, namely MovieLens and Yelp, and multiple large language models are used for evaluation. Results show that the proposed method can generate reputation measurement results consistent with the preferences of most users, thus maximizing user group satisfaction and verifying the effectiveness of the method.
  • Wang Caizhi, Wang Yang, Yang Guanci
    Accepted: 2026-05-29
    Stochastic Configuration Networks (SCNs) incorporate randomized learning mechanisms into neural network training to enhance modeling efficiency and employ a data-dependent supervisory mechanism to ensure the universal approximation capability of the model. However, during the incremental construction process, the computation of hidden layer output weights after each newly added hidden node relies on repeated calculation of the pseudoinverse of the hidden layer output matrix, which limits the training efficiency to some extent. In addition, while randomized learning improves modeling efficiency, it inevitably introduces potential redundant hidden nodes and parameters. To address these issues, this paper proposes a group-sparse learning method for Incremental Regularized Stochastic Configuration Networks (GSL-IRSCN). First, to improve the training efficiency of regularized SCNs in the incremental modeling process, an incremental output-weight updating strategy for L2-regularized SCNs is developed based on the Woodbury matrix identity, thereby avoiding repeated computation of the inverse of the regularized normal matrix and effectively reducing the computational cost of the model. Then, to mitigate redundancy in hidden nodes induced by the randomized learning mechanism, group L1/2 regularization is introduced and optimized via the Alternating Direction Method of Multipliers (ADMM), achieving sparsity constraints on redundant nodes and simplifying the network architecture. Experimental results on four benchmark datasets from UCI and KEEL demonstrate that the proposed GSL-IRSCN outperforms existing comparative methods in terms of both training efficiency and model compactness.
  • XU Han, YE Shan, DAI Qiuju, DING Yajun, WANG Runmin
    Accepted: 2026-05-29
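    Referring camouflaged object detection (Ref-COD) aims to accurately segment specified camouflaged objects with the help of reference images or text, and is a new task in the field of camouflaged object detection. Most existing methods use reference information from only a single modality and show clear limitations in fusing multi-source reference information and adapting cross-modal features, so the guidance value of the references is not fully exploited. To this end, this paper proposes a Ref-COD network based on text-image multimodal fusion (TIFNet), which achieves efficient use of multi-source information and fine-grained detection. First, a Pyramid Vision Transformer (PVT) encoder, a frozen Salient Object Detection (SOD) encoder, and a Contrastive Language-Image Pre-training (CLIP) encoder extract multi-stage features from the input image, the reference image, and the reference text, respectively. A Multi-key-value Reference Fusion Module (MRFM) is designed to perform cross-modal feature alignment and deep fusion, strengthening the targeted guidance of the reference information. A Reference Spatial-Channel Enhancement Module (RSCM) is introduced to achieve bidirectional mutual enhancement between the fused features and the reference features along two dimensions, resolving modality discrepancies. Finally, a Reference Adaptive Normalization Module (RANM) focuses on key pixel details and improves the model's adaptability to diverse camouflage scenarios. Extensive experimental results show that, compared with recent mainstream state-of-the-art (SOTA) methods, the proposed method achieves 0.869, 0.929, 0.786, and 0.022 on the four evaluation metrics of the R2C7K dataset, demonstrating a significant advantage. It effectively improves segmentation accuracy and robustness for specified camouflaged objects in complex scenes and provides a new approach for camouflaged object detection driven by multi-source information.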
  • Fu Su, Wang Shuaiqun
    Accepted: 2026-05-29
    To address challenges in thyroid ultrasound nodule segmentation, including blurred boundaries, low contrast, and highly variable small lesions, this paper proposes an improved model named MAD-UNet. The model improves contour delineation by strengthening cross-layer feature transfer consistency and deformable context modeling. A Multi-Directional Separable Attention Module (MDSAM) is embedded in the skip connections between the encoder and the decoder. MDSAM applies direction-aware channel–spatial joint attention to reweight key edge responses. This design enhances the consistency between shallow spatial details and deep semantic information. It strengthens boundary localization and alleviates gradient attenuation during deep network training. In addition, the Transformer encoder depth is extended to 24 layers to better model long-range dependencies and global context. Furthermore, a Deformable Adaptive Multi-Scale Context Module (DAMCM) is introduced. DAMCM combines deformable modeling with multi-scale context aggregation. It enables adaptive fusion of local structure alignment and global context supplementation. It improves representation of irregular contours and small targets. On the TN3K, DDTI, and Shanghai Sixth People's Hospital THN-L datasets, the Dice scores reach 89.10%, 90.53%, and 91.17%, respectively. The overall performance exceeds the TransUNet baseline on all datasets. Complexity evaluation shows 215.27M parameters, 65.96G FLOPs, and an inference speed of 111 FPS. Visualization analysis shows stronger robustness for nodule contours under complex ultrasound conditions. The experimental results verify the effectiveness of the model in fine boundary delineation and small lesion recognition. The method provides a basis for subsequent deployment and optimization in clinical application scenarios.
  • WANG Han, LI Shen, DU Xiawei, SHU Yanjun, HU Chen, YU Guo, LIU Yuhai
    Accepted: 2026-05-29
    To address the issues of poor adaptability of static strategies, strategy space explosion, and performance jitter in collective communication on domestic GPGPU platforms, this paper proposes an offline automatic tuning, communication strategy optimization, and consolidation method for domestic heterogeneous computing platforms. The proposed method constructs a multidimensional performance space model over communication primitives, message sizes, and node scales, and obtains performance data through systematic offline benchmarking. To mitigate the impact of system noise in heterogeneous environments, a filtering mechanism based on default strategy comparison and significance thresholding is designed. Specifically, the default strategy is first used as a baseline to evaluate performance differences, and statistical analysis is then applied to identify communication strategy combinations with significant performance advantages, thereby enabling communication strategy optimization. Furthermore, an interval-based strategy model is constructed to map discrete sampling points into continuous message size ranges, and the optimized strategy mapping is embedded into the internal decision logic of the RCCL communication library. Experimental results on domestic heterogeneous clusters demonstrate that the proposed method enables automatic and accurate strategy selection without introducing any additional runtime overhead. Compared with default strategies, the average bandwidth utilization of Reduce and AllReduce operations is improved by 22.4% and 24%, respectively. By leveraging offline tuning and strategy consolidation, the proposed approach effectively avoids the overhead and instability caused by dynamic search, and provides an efficient and practical solution for improving communication efficiency and system stability in large-scale distributed training systems.
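    The interval-based strategy model can be pictured with a short lookup sketch. It is only an illustration under stated assumptions: the strategy labels are placeholders, each sampled point is assumed to stay valid up to the next sampled point, and message sizes below the smallest sample fall back to a default. Nothing here reflects RCCL's internal data structures.

```python
import bisect


def build_interval_table(samples):
    """Collapse (message_size, best_strategy) samples into interval boundaries.

    Adjacent sample points that share a strategy are merged, so strategies[i]
    covers message sizes in [boundaries[i], boundaries[i + 1]).
    """
    boundaries, strategies = [], []
    for size, strategy in sorted(samples):
        if strategies and strategies[-1] == strategy:
            continue                     # same strategy: extend current interval
        boundaries.append(size)
        strategies.append(strategy)
    return boundaries, strategies


def select_strategy(table, message_size, default="default"):
    """Pick the strategy whose interval contains message_size."""
    boundaries, strategies = table
    index = bisect.bisect_right(boundaries, message_size) - 1
    return strategies[index] if index >= 0 else default


# Hypothetical offline tuning results for one primitive at one node scale.
tuned_samples = [(1024, "A"), (4096, "A"), (65536, "B"), (1048576, "C")]
tuned_table = build_interval_table(tuned_samples)

if __name__ == "__main__":
    for size in (512, 2048, 65536, 10**9):
        print(size, select_strategy(tuned_table, size))
```

    Since the table is built offline and the lookup is a binary search over a handful of boundaries, a scheme of this kind adds essentially no work at communication time.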
  • CUI Liqun, WANG Xiaohan, JIN Haibo
    Accepted: 2026-05-26
    To address the problems of generator training confusion, insufficient image detail restoration and incomplete haze removal in existing unsupervised image dehazing methods based on CycleGAN, an unsupervised image dehazing network based on High-frequency Information Enhancement (HIE-Net) is proposed. First, a Multi-Branch Dehazing Network (MBDN) is constructed. The network implements unified encoding of the image feature space through a shared encoding module, and adopts a multi-branch decoding module to achieve differentiated adaptation and precise decoding for features corresponding to different haze densities. Meanwhile, unsupervised constraints are established based on the Atmospheric Scattering Model (ASM) to regularize the training process of the generator. Secondly, a High-Frequency Multi-scale Enhancement Module (HMEM) is designed. A bidirectional guidance mechanism is built based on a large-kernel grouped attention gate. Through the bidirectional interaction between hazy region features and enhanced high-frequency information, the module synchronously achieves the multi-scale enhancement of both hazy region features and high-frequency information including image textures and edges. Finally, a Channel Feature Purification Module (CFPM) is introduced. A channel cross-attention mechanism is adopted to accurately screen haze-sensitive channels, suppress the interference caused by haze residuals in the feature fusion stage, and optimize the allocation of channel feature space. Meanwhile, a spatial cross-attention mechanism is leveraged to capture the haze density correlation and spatial dependency across different regions, thus achieving the fine-grained purification of deep features. Experimental results demonstrate that on the BeDDE dataset, HIE-Net achieves 21.20 dB, 0.779, and 0.198 in PSNR, SSIM, and LPIPS, respectively, which provides a novel insight for the field of image dehazing.
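    The Atmospheric Scattering Model referred to above is commonly written as I(x) = J(x) t(x) + A (1 - t(x)), where I is the hazy image, J the clear scene, t the transmission, and A the atmospheric light. The abstract does not say how HIE-Net turns this model into an unsupervised constraint, so the sketch below only shows one plausible form, a per-pixel reconstruction residual, with scalar values and hypothetical function names.

```python
def apply_haze(clear, transmission, airlight):
    """Forward ASM for one pixel: I = J * t + A * (1 - t)."""
    return clear * transmission + airlight * (1.0 - transmission)


def remove_haze(hazy, transmission, airlight, t_min=0.1):
    """Invert the ASM for one pixel; t is clamped to avoid dividing by ~0."""
    t = max(transmission, t_min)
    return (hazy - airlight) / t + airlight


def reconstruction_residual(hazy, est_clear, est_transmission, est_airlight):
    """How far the re-hazed estimate is from the observed hazy pixel.

    Assumed form of an ASM consistency term: a generator whose outputs obey
    the physical model should drive this residual toward zero.
    """
    rehazed = apply_haze(est_clear, est_transmission, est_airlight)
    return abs(rehazed - hazy)


if __name__ == "__main__":
    observed = apply_haze(clear=0.5, transmission=0.6, airlight=0.9)
    print(observed, remove_haze(observed, 0.6, 0.9))
```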
  • TANG Zhi-Wen, HU Xing-Chen, HU Yi-Hui, GUO Tian-Xiang, LI Shuo-Hao, HUANG Jin-Cai
    Accepted: 2026-05-26
    In traffic monitoring and public security scenarios, relying solely on ground-view or aerial-view vehicle re-identification often fails to meet the requirements of large-scale, complex, and multi-scene perception. Ground-view images contain rich visual details but suffer from limited field of view and frequent occlusions, whereas aerial views offer wide-area coverage but usually depict vehicles with small sizes and insufficient details, leading to degraded recognition performance. Therefore, fusing ground and aerial viewpoints for cross-view vehicle ReID has become a key research direction for enhancing large-scale traffic perception. However, this task is confronted with several challenges, including severe scale variations, large cross-view appearance discrepancies, intra-class distances exceeding inter-class distances, and limited cross-scene data. To this end, we propose a large model-based semantic enhancement method for cross-view vehicle re-identification. Built upon the CLIP-ReID multimodal framework, the proposed approach first employs Qwen-VL-Plus multimodal large model to automatically generate fine-grained structured descriptions for vehicle images, and then leverages Qwen-Max language large model to fuse semantic information from ground and aerial viewpoints, yielding a unified and stable cross-view semantic representation. This representation is further injected into a two-stage image–text contrastive learning scheme to strengthen the model’s domain generalization ability under cross-scene and cross-platform conditions. To promote practical deployment and subsequent research, we also construct a cross-view ground–aerial vehicle image dataset covering multiple flight altitudes, acquisition devices, and scene conditions, and design domain-generalization-oriented data splits and evaluation protocols as a new benchmark. 
Experimental results demonstrate that the proposed method significantly outperforms pure visual baselines on multiple metrics and achieves superior performance to state-of-the-art algorithms in cross-scene domain generalization tests, validating the effectiveness of semantic enhancement for cross-view vehicle re-identification. The proposed method shows strong application potential and engineering value in intelligent traffic surveillance, UAV-based patrol, and regional security.
  • CHEN Xin, SUN Yicheng, TAN Cheng
    Accepted: 2026-05-26
    As the scale and complexity of complex intelligent systems represented by high-performance computing systems and embedded systems continue to grow, automated anomaly detection of logs, as core operational data, has become critical to ensuring reliable system operation. Traditional log anomaly detection methods driven by machine learning and deep learning focus mostly on log sequence modeling, and suffer from insufficient semantic understanding and limited generalization ability. Large language models have effectively overcome this limitation with their outstanding semantic understanding and contextual reasoning capabilities. Since the rise of large language model technology, relevant research has emerged rapidly, but achievements are scattered across multiple technical paths and lack a systematic review. This paper provides a comprehensive survey of log anomaly detection methods based on large language models, selects 35 core papers, and establishes a unified technical classification framework. Existing methods are categorized into five technical routes: prompt engineering, retrieval-augmented generation, domain fine-tuning, reinforcement learning, and large-small model collaboration. Research shows that supervised fine-tuning is the most widely used technical route at present, while the large-small model collaborative architecture, as an emerging paradigm, is shifting the research focus from pure pursuit of detection accuracy to balancing inference efficiency and industrial deployability. Current evaluation systems focus heavily on detection performance metrics, with insufficient attention to efficiency overhead and interpretability. Finally, this paper identifies the inference latency bottlenecks and data privacy challenges of large language models in processing ultra-long massive log streams, and proposes insights into frontier directions such as lightweight deployment and online continuous learning.
  • LIU Shuohan, WU Youxi, ZHANG Yajie, LIU Jingyu, LI Yan
    Accepted: 2026-05-26
    Causal relationship mining aims to reveal latent causal mechanisms from complex data. Existing studies, mostly based on Bayesian network frameworks or simple filtration of association rules, generally face challenges such as low mining efficiency and difficulty in controlling unobserved confounding variables, resulting in insufficient accuracy and robustness of causal identification. To address this, this paper proposes a fast causal rule mining algorithm. This algorithm utilizes a prefix-tree structure for frequent pattern mining and integrates multiple pruning strategies to significantly enhance mining efficiency. Furthermore, it introduces a covariate mechanism and a matched transaction pair technique to effectively control confounding factors, thereby improving the reliability of causal rules. Experimental results demonstrate that the computational efficiency of the proposed algorithm is improved by 3 to 4 orders of magnitude compared to baseline algorithms. On large-scale datasets, its execution time is further reduced by 30%–50% compared to similar variants. In terms of accuracy, compared with baseline causal methods, the proposed algorithm maintains a stable Precision in the range of 0.69–0.90 and generally achieves an improvement of over 40%–60% in F1-score. These results fully validate the efficiency and superiority of the proposed algorithm in large-scale causal rule mining tasks.
  • WANG Shengming, YANG Weiwei, MA Yan, CHEN Mao
    Accepted: 2026-05-26
    Problem understanding is a critical prerequisite for achieving automated geometric theorem proving. However, existing approaches commonly suffer from excessive reliance on feature engineering and limited generalization capabilities, making them inadequate for effectively supporting automated problem solving. To address this challenge, this paper proposes a large language model-based method for geometric problem understanding by fine-tuning the Qwen2.5 base model and integrating chain-of-thought reasoning with k-nearest neighbor (KNN) retrieval-augmented generation. Furthermore, to enhance the accuracy of semantic translation, we introduce an agent-based hallucination detection and correction mechanism, which significantly mitigates hallucination issues during problem understanding. Experimental results demonstrate that the proposed method achieves an accuracy of 88.85% and a recall of 89.12% on the intent understanding task of the self-constructed dataset, significantly outperforming the baseline model. On the Geometry3K dataset, it attains an accuracy of 94.86% and a recall of 94.18%, exhibiting superior performance compared to the Inter-GPS method. Additionally, comprehensive ablation studies and comparative analyses under various parameter configurations further validate the superior performance and adaptability of our multi-strategy hybrid approach.
  • LIU Chang, WANG Guoyu, ZHU Guoqiang, LIU Shaoyu, LI Yongchao, QIAO Junpeng
    Accepted: 2026-05-22
    The core issue in underwater optical imaging lies in the scattering effects of water, particularly backscattering, which creates an approximately uniform haze-like background during image formation, severely obscuring the structural information of targets. This significantly limits the effective application of underwater vision systems in high-turbidity environments. To address this challenge, this paper proposes an underwater imaging framework that deeply integrates physical processes with computational imaging methods. The core concept is to transform the originally difficult-to-model problem of global strong scattering into a locally separable problem with well-defined geometric and statistical characteristics, through physical scanning and light field redundancy constraints. In terms of implementation, the framework first decomposes wide-area scattering into localized scattering across sequential frames using line-structured light scanning. Subsequently, virtual aperture technology is employed to preprocess light field data based on structured light geometric priors, thereby constraining scattering regions. Furthermore, the angular redundancy of the light field is utilized to construct Epipolar Plane Images (EPI), and low-rank decomposition is applied to separate the backscattering component, which exhibits low-rank properties, from the target signal, which exhibits sparsity. Finally, high-quality underwater images are obtained through sequential frame stitching and luminance homogenization. System experiments were conducted within a turbidity range of 10–30 NTU (Nephelometric Turbidity Units). The results demonstrate that the proposed method consistently outperforms comparative approaches under various turbidity conditions, showing stable improvements in Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM), and no-reference image quality assessment metrics. 
Particularly under high-turbidity conditions, the method exhibits stronger robustness, with significantly less degradation in imaging quality as turbidity increases compared to other methods. This validates the effectiveness of the physics-computation collaborative imaging framework in complex scattering environments.
  • Sun Tanbo, Zhong Shuai, Hu Xinao, Wang Liping
    Accepted: 2026-05-22
    Digital images are now widely used on social media and have become a core carrier of information dissemination. The rise of powerful, easy-to-use image editing software and generative artificial intelligence has lowered the threshold for content creation, but it has also made malicious image tampering more covert and accelerated the spread of false information. Tampering leaves characteristic traces in an image, and these traces are the core basis of image tampering detection. Although tampering methods are becoming increasingly complex and diverse, most existing reviews focus on a single technical route and lack systematic comparison and integrated analysis of image tampering detection techniques. To this end, this paper constructs a three-dimensional classification system of "feature traceability-extraction method-detection task", divides image tampering detection techniques into two categories, those based on handcrafted features and those based on deep learning features, and carries out the following work. First, the classification framework for handcrafted-feature-based image tampering detection is systematically reconstructed, and the handcrafted features scattered across traditional research are grouped into three categories: camera system features, pixel-level features, and format-related features. The physical mechanisms and the improvements achieved by the performance optimization strategies of 14 typical image tampering detection techniques are analyzed in depth, and the shortcomings of existing reviews in systematically analyzing handcrafted-feature-based detection are examined. Second, deep-learning-based image tampering detection techniques are organized by architecture, and generative image tampering detection techniques are analyzed. 
Third, the composition, characteristics, and limitations of existing tampered-image datasets are summarized and discussed, providing a basis for dataset selection. Finally, this paper summarizes future research directions and development trends in this field and points out several key scientific issues that urgently need to be solved, in order to provide a reference for subsequent research.
  • SHEN Yixiang, SUN Yongqi, ZHAO Sicong, HU Conggang
    Accepted: 2026-05-21
    To address the issues of identity and audio consistency in existing talking face generation models, a Transformer-based diffusion model for talking face generation is proposed. First, to improve identity consistency, a global-local collaborative identity alignment module is designed. This module utilizes attention pooling to aggregate global identity representations and introduces a learnable positional encoding matrix to accurately capture local facial geometry, thus significantly enhancing the ability to preserve identity information. Second, to improve audio consistency, a multi-level feature staggered fusion method based on a diffusion Transformer is proposed. Audio and identity features are deeply fused in each Transformer layer, and a multi-stage training strategy is combined to make the generated lip movements more natural. Experimental results on the public datasets LRS3 and HDTF show that, compared with existing methods, the proposed model achieves better performance in terms of the Sync-C and CSIM metrics.
  • YANG Xinyi, MA Jianmin, MA Yupo
    Accepted: 2026-05-21
    Multi-label fuzzy data commonly exhibit feature redundancy, complex interactions between features, and unequal feature contributions, all of which affect the classification performance of multi-label learning. To address these issues, the ReliefF-β algorithm is proposed to assign feature weights, and a multi-label feature selection method based on feature weighted interaction is presented. First, feature similarity and label similarity are constructed for multi-label fuzzy data. A regulating parameter β is introduced to fuse the two similarities into a global sample similarity, on which the ReliefF-β algorithm for feature weighting is built. On this basis, a multi-label weighted fuzzy rough set is introduced based on the feature weights, and uncertainty measures such as weighted fuzzy entropy and weighted fuzzy mutual information are defined. The related properties and relationships among these measures are studied. Furthermore, a feature weighted evaluation function is defined by considering feature relevance, redundancy, and interaction, and a multi-label feature selection algorithm based on feature weighted interaction is then proposed. Finally, comparative experiments are conducted under two classifiers. The results show that, compared with the other algorithms, under ML-KNN the proposed method improves Average Precision (AP) by 8.79% on average, while Hamming Loss (HL), Ranking Loss (RL), Coverage (CV), and One-Error (OE) are reduced by 5.06%, 15.33%, 10.97%, and 23.06%, respectively. Under BRDT, AP is improved by 4.06%, and HL, RL, CV, and OE are reduced by 8.60%, 10.28%, 7.19%, and 5.89%, respectively. Ablation studies and statistical tests further verify the effectiveness of the proposed method.
  • XIE Binhong, SUN Xiaosong, ZHANG Rui
    Accepted: 2026-05-20
    Small object detection in complex scenarios has long grappled with two major technical bottlenecks: the propensity for weak object features to attenuate within deep neural networks, and the severe interference caused by environmental background noise. To address these challenges, this study proposes WF-DETR, an end-to-end real-time small object detection model. In the feature extraction stage, a Feature Weaving Network (WeaveNet) is designed. Diverging from simple hierarchical stacking, WeaveNet employs a heterogeneous feature weaving strategy. Leveraging a cross-level feature mutual correction mechanism, it tightly interweaves and bidirectionally calibrates deep semantic information with shallow geometric details. This approach effectively suppresses the attenuation of spatial information during feature transmission and mitigates small object feature loss, all while maintaining high-level semantic strength. Inspired by human visual physiological mechanisms, the neck network incorporates a FoveaFormer module. By simulating the human foveal imaging mechanism via adaptive sparse attention and gating units, this module dynamically filters redundant background noise and focuses on high-value target regions, significantly enhancing feature purity. Furthermore, a Haar Wavelet Downsample (HWD) operator is introduced to reconstruct the downsampling process. From a frequency domain perspective, this overcomes the irreversible loss of high-frequency texture details caused by traditional pooling, further augmenting the discriminability of small object features. Experimental results on the VisDrone2019 benchmark dataset demonstrate that the proposed model achieves mAP@0.5:0.95 of 23.7% and an inference speed of 166.3 FPS. These results fully validate the real-time performance and superiority of WF-DETR in small object detection tasks within complex backgrounds.
  • HE Ruiying, TIAN Youliang, XIANG Axin, ZHOU Feng, LIU Kaiqi
    Accepted: 2026-05-20
    Cloud computing offers efficient data storage and management, facilitating convenient data sharing and access. However, ensuring data security and user privacy in open cloud environments remains a critical challenge. Ciphertext-policy attribute-based encryption (CP-ABE) has been widely adopted to enforce fine-grained access control over data stored in cloud servers. Nevertheless, existing schemes still face limitations in handling hierarchical data and tracing malicious ciphertexts, making it difficult to simultaneously achieve efficient multilevel access and data provenance assurance. To address these challenges, this paper proposes a hierarchical attribute-based access control scheme with traceability over ciphertext. First, a hierarchical CP-ABE framework is employed to construct an efficient multilevel access mechanism. By integrating multiple hierarchical access trees into a unified structure, the scheme enables encryption and decryption of data at different levels under a single policy, significantly reducing computational overhead. Second, a zero-knowledge proof-based signature mechanism is introduced to securely bind ciphertexts with their creators while preserving data owner anonymity, enabling accurate tracing of malicious ciphertext sources. Finally, security analysis demonstrates that the proposed scheme can effectively resist chosen-plaintext attacks. Experimental evaluation shows that, compared with existing approaches, the scheme achieves lower encryption and decryption overhead, making it well suited for secure, efficient, and traceable data sharing in cloud environments.
  • ZHU Yanbin, ZHANG Hanling, WANG Runmin
    Accepted: 2026-05-20
    Micro-expressions are fleeting, involuntary facial muscle movements that can reveal genuine emotions individuals attempt to conceal. However, micro-expression recognition faces numerous challenges, including short duration, low intensity, prominent local features, limited scale of public datasets, and significant individual differences, which constrain the recognition accuracy and generalization capability of traditional methods. To address these issues, this study proposes a single-stream fine-grained micro-expression recognition method based on dynamic routing experts. Inspired by the mixture-of-experts model, this method replaces the traditional multi-head self-attention layer in Transformers with a dynamic routing expert mechanism. It dynamically selects expert networks through a sparse activation strategy and leverages a collaboration mechanism among experts to enhance feature representation capability, thereby improving model representational capacity while maintaining computational efficiency. Additionally, a multi-grained asymmetric aggregation module is designed, which integrates orientation-aware convolution and channel attention to effectively decouple spatial features and adaptively adjust feature granularity at different network levels, enabling more precise capture of subtle directional movements and local texture variations in micro-expressions. Experiments conducted on three public datasets, SAMM, SMIC, and CASME II, demonstrate that the proposed method significantly outperforms mainstream approaches. On the composite dataset, the method achieves an unweighted average recall of 87.65% and an unweighted F1-score of 87.21%. The experimental results validate the effectiveness of this method in capturing subtle dynamic features of micro-expressions, providing reliable technical support for emotion recognition in complex scenarios.
  • Nie Zeli, Sun Danfeng, Zhao Jianyong, Wu Huifeng
    Accepted: 2026-05-19
    The widespread use of robots and vision systems in factories has promoted mixed-line production characterized by small batches and multiple product varieties. It has also sharply increased the diversity of target sizes and the uncertainty of arrival sequences, so stacking tasks at many transition stages of production lines remain highly challenging. As the number of targets in the sequence grows, it becomes difficult to guarantee both the solution time and the solution quality of the stacking task. To address these issues, a hybrid optimization algorithm with stimulus memory for sequence stacking tasks is proposed. The algorithm decomposes the sequence stacking task into two subtasks: combinational block knowledge base construction and stacking decision optimization. First, basic target combinations satisfying quality thresholds are searched for in the initial sequence of targets to be stacked, so as to construct a combinational block knowledge base. During this process, a stimulus memory mechanism is introduced to dynamically update the existing combinational knowledge. Subsequently, each combinational block is treated as an equivalent macro-target, and the placement sequence and placement orientation of all targets are jointly optimized. Comparative experiments on datasets with different size distributions show that, compared with the baseline algorithms, the proposed algorithm reduces the solution time of stacking plans by at least 4.94% while achieving the best average filling rate of stacking space, which verifies its effectiveness in sequence stacking tasks. Ablation experiments show that the complete algorithm achieves the best performance in terms of solution time, which validates the rationality of the proposed algorithmic architecture.
  • Lu Shibo, Li Jing
    Accepted: 2026-05-19
    In radar emitter individual identification, a single continuous-pulse model cannot simultaneously capture global temporal information and fine-grained single-pulse features, while a single-pulse model lacks global dynamic information; both limitations restrict recognition performance in complex electromagnetic environments. To address this problem, this paper proposes a dual-branch lightweight fusion recognition method. First, the original pulse sequence is segmented into two types of data, namely continuous pulse sequences and single pulses, through continuous pulse segmentation. Corresponding datasets are constructed for the inter-pulse sequence branch and the single-pulse branch, and a continuous-sequence model and a single-pulse model are trained separately to extract inter-pulse temporal features and fine-grained intra-pulse features, thus enabling complementary modeling of the two types of information. Subsequently, two fusion strategies, namely feature-level fusion and decision-level fusion, are designed. In the feature-level fusion strategy, a gating mechanism is introduced to learn the importance weights of features from different branches, so that continuous-pulse features and single-pulse features can be adaptively weighted to construct a joint feature representation. In the decision-level fusion strategy, the probability outputs of the two models are integrated by soft voting to improve recognition stability. To verify the effectiveness of the proposed method, comparative experiments and ablation studies are conducted on a measured radar dataset. The results show that both fusion strategies outperform the individual models. Specifically, decision-level fusion improves the recognition accuracy by approximately 8 percentage points over the continuous-pulse model alone and by about 3 percentage points over the single-pulse model alone. Moreover, feature-level fusion achieves the best recognition performance while reducing the number of model parameters by two orders of magnitude compared with the baseline model. The results demonstrate that the proposed method can maintain high recognition accuracy while also exhibiting favorable lightweight characteristics and strong potential for engineering applications.
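    The decision-level fusion described in this abstract can be illustrated with a minimal sketch of soft voting over the probability outputs of two branches. The function name `soft_vote`, the equal default weights, and the three-class probability vectors are hypothetical choices for illustration; the abstract does not specify how the authors weight the two models.

```python
def soft_vote(prob_seq, prob_pulse, w_seq=0.5, w_pulse=0.5):
    """Fuse two per-class probability vectors by weighted averaging.

    prob_seq   -- class probabilities from the continuous-sequence branch
    prob_pulse -- class probabilities from the single-pulse branch
    Returns the fused probability vector and the index of the winning class.
    """
    if len(prob_seq) != len(prob_pulse):
        raise ValueError("both branches must score the same set of classes")
    total = w_seq + w_pulse
    # Weighted mean per class; dividing by the weight sum keeps the result normalized.
    fused = [(w_seq * a + w_pulse * b) / total
             for a, b in zip(prob_seq, prob_pulse)]
    label = max(range(len(fused)), key=fused.__getitem__)
    return fused, label


# Hypothetical branch outputs for one sample over three emitter classes.
seq_probs = [0.50, 0.40, 0.10]    # sequence branch slightly prefers class 0
pulse_probs = [0.20, 0.70, 0.10]  # single-pulse branch clearly prefers class 1
fused, label = soft_vote(seq_probs, pulse_probs)
```

    In this toy case the two branches disagree, and averaging lets the more confident single-pulse output decide the fused label, which is the stabilizing effect soft voting is meant to provide.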
  • KANG Panpan, CAO Yuecheng, TENG Liping, CHEN Junjie, LI Hongjun
    Accepted: 2026-05-19
    In recent years, although self-supervised skeleton-based action recognition has made progress, it still faces two types of training bias under strong augmentations. First, imbalance in local perturbation allocation can easily lead to over-perturbation of critical motion segments and insufficient variation in low-dynamic regions. Second, in multi-positive contrastive learning, non-target positive samples participate in normalization competition, which can easily cause target conflicts and weaken representation aggregation. To address these issues, this paper proposes Dual-end Collaborative Debiasing Contrastive Representation Learning (DCD-CLR), a self-supervised contrastive learning framework that jointly optimizes view construction and objective construction, improving the quality of skeleton representation learning through both augmentation allocation and the contrastive objective. On the view side, Continuous Dynamic Saliency Augmentation (CDSA) is designed to integrate frame-difference energy and data-level joint motion priors, construct a frame-joint dynamic intensity map, and perform continuous, region-level, and sample-adaptive scheduling of spatiotemporal perturbation magnitudes, thereby improving view diversity while preserving critical motion segments. On the objective side, Target-Isolated InfoNCE (TI-InfoNCE) is proposed as a target-isolated debiased multi-positive contrastive objective, which removes the remaining positive samples when computing the normalization term of the target positive sample, so as to reduce competition interference among positive samples and improve the boundary clarity of the representation distribution. Under the linear evaluation setting, the proposed method achieves recognition accuracies of 85.9%, 79.6%, and 92.6% on NTU60 xsub, NTU120 xset, and PKU-MMD I, respectively. Representation distribution visualization, transfer evaluation, and noise interference experiments further show that the proposed method has good stability, generalization ability, and robustness.
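    The target-isolation idea behind TI-InfoNCE can be sketched as follows. This is one reading of the abstract's description (each positive is normalized against itself plus the negatives only), not the authors' published loss; the function names, the temperature `tau`, and the similarity values are hypothetical.

```python
import math


def ti_info_nce(pos_sims, neg_sims, tau=0.1):
    """Target-isolated multi-positive loss (illustrative reading).

    For each positive similarity s_p, the normalization term contains only
    exp(s_p / tau) and the negatives; the other positives are left out.
    """
    if not pos_sims:
        raise ValueError("at least one positive similarity is required")
    neg_sum = sum(math.exp(s / tau) for s in neg_sims)
    losses = []
    for s_p in pos_sims:
        target = math.exp(s_p / tau)
        losses.append(-math.log(target / (target + neg_sum)))
    return sum(losses) / len(losses)


def multi_positive_info_nce(pos_sims, neg_sims, tau=0.1):
    """Comparison baseline: every positive stays in the normalization term."""
    if not pos_sims:
        raise ValueError("at least one positive similarity is required")
    all_sum = sum(math.exp(s / tau) for s in list(pos_sims) + list(neg_sims))
    losses = [-math.log(math.exp(s_p / tau) / all_sum) for s_p in pos_sims]
    return sum(losses) / len(losses)


# Hypothetical cosine similarities between an anchor and its candidates.
positives = [0.9, 0.8]
negatives = [0.1, -0.2]
isolated = ti_info_nce(positives, negatives)
baseline = multi_positive_info_nce(positives, negatives)
```

    With more than one positive, the isolated variant yields a smaller loss than the baseline because the positives no longer compete with each other in the denominator; with a single positive the two coincide.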
  • Chen Hong, Wang Jinwei, Jin Haibo, Wu Cong, Yang Zi
    Accepted: 2026-05-19
    With cyberattacks becoming increasingly sophisticated and covert, improving the representation and recognition of complex traffic patterns has become an important issue in intrusion detection. Although existing methods have improved detection performance, directly modeling complex network traffic still suffers from insufficient feature representation. To enhance local correlations and structural information among features, many studies transform one-dimensional traffic features into two-dimensional image-like representations for deep feature learning. However, limited by feature dimensionality and encoding schemes, such traffic images are usually small and structurally constrained, making fixed enhancement strategies insufficient for capturing differences among attack patterns. Meanwhile, class imbalance further restricts the recognition of minority attack classes. To address these issues, this paper proposes a network intrusion detection method based on dynamic selective feature enhancement. At the representation level, a multi-scale feature enhancement module adaptively fuses features with different receptive fields to alleviate the representation limitations of small traffic images. At the decision level, a dynamic adaptive module combined with minority-class attention selectively strengthens key responses to improve minority-class recognition. Experimental results show that the proposed method achieves 96.49% accuracy, 95.11% precision, 96.32% recall, and 95.50% F1-score on NSL-KDD. It also maintains good detection performance on UNSW-NB15 and shows stable performance in a simulated streaming environment built on TON-IoT-Network.
  • CAO Qi, LI Shaodong, LU Shuaiyan, ZHANG Zhehao, YANG Guokai
    Accepted: 2026-05-15
    In recent years, RGB-based hand mesh reconstruction has attracted extensive attention. Existing methods mainly rely on stacking complex visual modules to improve reconstruction accuracy, but this often incurs high computational cost and makes it difficult to satisfy the requirements of real-time applications. To address this issue, this paper introduces natural language information during training, injecting high-level prior knowledge into the network to enhance visual feature representation. Since the text branch is used only for supervision during training, it does not increase the number of parameters of the main network, thereby preserving real-time performance. To further enhance visual representation, a dual-scale text generation module is proposed to describe hand features from both global and local perspectives. Specifically, the global text prompt models the overall hand pose based on the bending degree of each finger, while the local text prompt describes local hand features according to the spatial positions of individual joints. In addition, contrastive learning is employed to enforce consistency between multi-scale text features and image features in a shared semantic space. Because the CLIP model is highly sensitive to textual formulation, manually designing prompts usually requires extensive tuning and still cannot guarantee sufficient alignment with image features. To this end, this paper adopts a combination of fixed text prompts and learnable word vectors, where the fixed prompts summarize the main semantic information and the learnable word vectors adaptively refine the prompts, thereby improving the suitability of the text descriptions for the hand mesh reconstruction task. Experimental results show that, compared with existing real-time methods, the proposed method achieves excellent reconstruction accuracy while maintaining real-time performance. On the FreiHAND dataset, the PA-MPJPE and PA-MPVPE reach 5.5 mm and 5.8 mm, respectively; on the DexYCB dataset, they reach 5.4 mm and 5.2 mm, respectively. The inference speed reaches 68 fps. Ablation studies further demonstrate that the dual-scale text prompts play a key role in hand mesh reconstruction.
  • Song Chengchen, Wu Qi, Miao Wang
    Accepted: 2026-05-15
    With the proliferation of digital platforms, the forms of offensive memes have become increasingly complex and diverse. This phenomenon has exacerbated the scarcity of high-quality annotated data, making modal semantic alignment bias under small-sample conditions a core issue constraining detection performance. To address this issue, this study proposes an offensive meme detection method via Cross-Modal Meta-Learning with Unimodal Rectification (CMML-UR). This method uses a cross-modal dual-gradient meta-learning framework, leverages hierarchical image features to provide multi-level visual semantics, and combines them with low-noise textual representations generated through multi-regularized modeling. At the decision fusion stage, by evaluating the output confidence of each modality at the sample level, the method introduces a unimodal confidence-gated rectification mechanism to dynamically calibrate the final prediction. Experimental results on the MultiOFF dataset demonstrate that the proposed method achieves a weighted F1-score of 74.6%, which is an improvement of 4.3 percentage points over the state-of-the-art (SOTA) model. In few-shot generalization tests, it maintains a weighted F1-score of 69.3% (5.6 percentage points higher than the baseline model at 63.7%), verifying its efficiency in complex cross-modal semantic understanding and robustness in noise suppression within few-shot scenarios.
  • YUN Jian, WANG Songnan, ZHANG Xueyi
    Accepted: 2026-05-15
    This paper addresses the dual challenges of system heterogeneity and data heterogeneity faced by federated learning in medical image classification, and proposes SEFedProX, an adaptive federated optimization framework based on reinforcement learning. In a heterogeneous environment, the framework employs the Soft Actor-Critic (SAC) algorithm to dynamically adjust the proximal term coefficients in a continuous action space, based on key state features such as client data distribution and performance feedback. This avoids the quantization errors and model oscillation caused by discrete action spaces and achieves precise, smooth control of the local training intensity. In addition, an EfficientNetV2B2 pre-trained on ImageNet is introduced as the feature extraction network, which improves the model's representation efficiency and discrimination ability while significantly reducing the deployment requirements for resource-constrained medical edge devices and alleviating the overfitting risk on small-scale medical data. Systematic experiments under four system heterogeneity settings, on four medical image datasets and one general dataset, show that SEFedProX significantly outperforms existing baseline methods in terms of classification accuracy, convergence speed, stability, and robustness. Ablation experiments further verify the effectiveness of the SAC continuous regulation mechanism and the EfficientNetV2B2 network, as well as their collaborative enhancement effect in the framework. This research provides a stable, efficient, and highly adaptive technical solution for building distributed intelligent diagnostic systems in heterogeneous medical environments.
  • Kedong Zhang, Xusheng Qian, Zhiyong Zhou, Yakang Dai
    Accepted: 2026-05-15
    Multimodal vision-language foundation models show great potential in the medical domain, yet face notable limitations due to complex medical semantics and challenging cross-modal modeling. Patient-level rigid alignment ignores semantic similarity, causing unreasonable negative repulsion and degrading learning, while the lack of unified hierarchical modeling between reports and images hinders fine-grained cross-modal alignment. To address these issues, this paper proposes global-local collaborative alignment (GLCA) to build an improved medical vision-language classification model. GLCA consists of two main components: semantic-driven cross-patient soft global alignment and progressive three-granularity intra-patient local alignment. The semantic-driven cross-patient soft global alignment leverages cross-patient semantic sample pair mining and a correlation-weighted contrastive penalty to construct a more continuous feature space that better reflects authentic semantic relationships. The progressive three-granularity intra-patient local alignment aligns visual and textual features at three levels, namely coarse (report-image), mid (sentence-region), and fine (word-patch), via progressive query fusion, enabling effective cross-modal interaction. GLCA first builds a semantically consistent feature space through inter-patient soft global alignment, then performs layer-wise matching via intra-patient multi-granularity alignment, ensuring continuous and precise cross-modal semantic correspondence. Extensive experiments are conducted on four chest X-ray datasets. The results demonstrate that GLCA significantly outperforms existing methods in both zero-shot prediction classification and few-shot fine-tuning classification tasks. On the public 14-class ChestXray14 dataset, the zero-shot prediction classification achieves improvements of 1.2%, 2.0%, and 2.2% over the second-best method in terms of AUC, F1, and ACC, respectively.
  • ZHONG Hang, ZHANG Qinghua, LUO Nanfang, GUO Ruili
    Accepted: 2026-05-15
    Multimodal emotion recognition in conversations integrates language, acoustic, and visual information to automatically identify the emotions in dialogues, thereby enhancing the naturalness and emotional understanding in human-computer interaction. However, existing methods have limitations in modeling multi-layer contextual dependencies of emotions. Multimodal feature fusion often introduces redundant information and noise, and these methods cannot effectively capture the uncertainty of emotions, which limits the recognition of complex emotional categories. To address these issues, this paper proposes a multimodal emotion recognition model that combines hybrid encoding and fuzzy modeling. The model uses a hybrid encoding module to capture both global dialogue context and local utterance-level dependencies, which strengthens the representation of emotional temporal features. In addition, a hierarchical gated fusion mechanism integrates features from different modalities and layers with dynamic weighting to suppress redundancy and noise and improve multimodal feature discrimination. For emotion classification, a fuzzy neural network initialized with linearly spaced parameters models the boundaries of emotion categories using fuzzy membership functions, capturing the uncertainty and fuzziness of emotional expression. Experimental results show that the proposed model outperforms baseline methods on all metrics across the IEMOCAP, MELD, and CMU-MOSEI datasets. It achieves an accuracy of 72.67% on IEMOCAP and 67.37% on MELD, and 7-class and 2-class accuracies of 54.96% and 86.78% on CMU-MOSEI, which validates the effectiveness of the proposed method in multimodal sentiment analysis.
  • LIU Xiangbin, ZHU Youhua, PENG Feng
    Accepted: 2026-05-15
    Handwritten mathematical expression recognition is an important task in computer vision and plays a significant role in intelligent education, industrial applications, and related fields. Existing encoder-decoder-based methods typically rely on standard convolutions and conventional attention mechanisms for feature extraction. However, the fixed-grid sampling of standard convolution cannot effectively adapt to the geometric deformations of handwritten symbols, which often leads to confusion between visually similar characters. In addition, traditional attention mechanisms usually involve limited cross-dimensional interaction, making it difficult to capture long-range structural dependencies in complex mathematical expressions. To address these issues, this paper proposes a handwritten mathematical expression recognition model based on an encoder-decoder architecture, termed DDTAFF, which integrates deformable dilated convolution and triplet attention feature fusion. Specifically, deformable dilated convolution incorporates learnable dilation rates into both the offset learning process and the customized convolution operation of deformable convolution, enabling more accurate offset prediction and adaptive expansion of the receptive field. Meanwhile, triplet attention feature fusion adopts a similarity-guided dynamic fusion strategy to enhance cross-dimensional feature interaction and improve the extraction of discriminative features. In the encoder, deformable dilated convolution is used to capture multi-scale features and broader contextual information, while triplet attention feature fusion effectively fuses features at different levels to strengthen the representation of critical regions. In the decoder, a Transformer-based structure is introduced to enhance long-range dependency modeling. Experimental results on the CROHME 2014, CROHME 2016, CROHME 2019, and HME100K datasets show that the proposed model achieves recognition accuracies of 59.34%, 59.77%, 59.63%, and 68.94%, respectively, representing improvements of 2.34%, 3.71%, 4.75%, and 1.63% over the baseline model. These results demonstrate the effectiveness and superiority of the proposed method.