World models are generally understood as models that represent the external world and predict future states from current world states and actions. Large models leverage massive training data and vast parameter scales to exhibit outstanding capabilities in learning, understanding, representing, and generating textual knowledge, as exemplified by large language models such as GPT-4 and LLaMA. In recent years, research on world models has attracted considerable attention from both industry and academia, leading to notable research and commercial achievements in domains such as autonomous driving, social simulation, embodied intelligence, and video generation. Moreover, researchers have applied the remarkable results of various large models to world models, further enhancing their performance. This paper comprehensively reviews world models built using large models across different domains, covering both Large Language Model (LLM)- and Vision Large Model (VLM)-based approaches. Several important application areas, including embodied intelligence, smart cities, social simulation, and physical environment simulation, are selected to introduce relevant models. The paper classifies world models by the modality of the large models used and highlights the functional differences between world models based on different modalities. Subsequently, important open-source resources and benchmarks for world models are presented to help researchers in related fields understand and utilize world models quickly. Finally, the paper concludes with a summary and an outline of future research directions.
Resource Public Key Infrastructure (RPKI) is an important mechanism for safeguarding Border Gateway Protocol (BGP) routing security, which verifies the legitimacy of BGP announcements through Route Origin Authorization (ROA) and Route Origin Validation (ROV). As RPKI continues to advance globally, its deployment status and defense effects have become prominent research hotspots. In recent years, researchers have extensively studied ROA configuration problems and ROV deployment measurements, demonstrating the operational status and protection capability of RPKI in real networks from different dimensions. Current surveys mainly focus on theoretical research on RPKI systems, emphasizing architectural vulnerabilities without systematically organizing or summarizing the key challenges and related studies encountered in actual deployment. This review systematically summarizes recent studies on the deployment issues of RPKI systems. First, it classifies common error types in ROA configurations, including benign ROA conflicts and loose ROA registrations, and systematically analyzes their causes and impacts on routing security. Then, it comprehensively summarizes and compares existing ROV deployment measurement methods and reviews evaluation methods for assessing ROV validation effectiveness and its impact on path propagation. Finally, the review outlines future research directions to address RPKI deployment issues, providing a theoretical foundation and methodological reference for subsequent research on RPKI deployment optimization, security assessment, and strategy research. The findings can promote the widespread adoption of RPKI and enhance the defense against BGP prefix hijacking.
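The ROV check mentioned above can be illustrated with a minimal sketch of the origin-validation outcome described in RFC 6811. The function name `validate_origin`, the tuple layout of a ROA, and the sample prefixes and AS numbers are assumptions made for illustration; they are not taken from the surveyed studies.

```python
import ipaddress

def validate_origin(prefix, origin_asn, roas):
    """Classify a BGP announcement as 'valid', 'invalid', or 'not-found'.

    Each ROA is a hypothetical (prefix, max_length, asn) tuple.
    """
    announced = ipaddress.ip_network(prefix)
    covered = False
    for roa_prefix, max_len, roa_asn in roas:
        roa_net = ipaddress.ip_network(roa_prefix)
        if announced.version != roa_net.version:
            continue  # never compare IPv4 against IPv6
        if announced.subnet_of(roa_net):
            covered = True  # at least one ROA covers the announced prefix
            if announced.prefixlen <= max_len and origin_asn == roa_asn:
                return "valid"
    # Covered but never matched means invalid; uncovered means not-found.
    return "invalid" if covered else "not-found"

# Illustrative ROAs: the second one is "loose" because its max length (24)
# is longer than the registered prefix length (16).
roas = [("192.0.2.0/24", 24, 64500), ("10.0.0.0/16", 24, 64501)]
print(validate_origin("192.0.2.0/24", 64500, roas))
print(validate_origin("192.0.2.0/25", 64500, roas))
print(validate_origin("10.0.5.0/24", 64501, roas))
```

In this sketch the loose ROA authorizes every sub-prefix of 10.0.0.0/16 up to /24, which is why an unannounced more-specific prefix originated by the same AS would still validate.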
Short-term action anticipation, a crucial task in video understanding, involves transforming observed physical motions into inferences about action intentions and goals by modeling the spatiotemporal and semantic features of historical actions. It enables the precise prediction of interactive behaviors within the next few seconds and has broad application prospects in human-machine collaboration, security surveillance, autonomous driving, and augmented reality. Recent advances in deep learning, particularly innovations in feature extraction models and the construction of high-quality datasets within the field of video understanding, have propelled the development of this domain. This progress has shifted short-term action anticipation from a knowledge-driven machine learning paradigm to a data-driven deep learning paradigm. This survey systematically reviews the latest advancements in deep learning methods for short-term action anticipation, providing references and insights for related research and practical application analysis. For this purpose, a classification framework is first constructed from three perspectives: model architecture innovation, training strategy application, and contextual modeling methods. Within this framework, key technologies and challenges in the field are analyzed, and the characteristics, applicable scenarios, and research progress of each method category are elaborated. Next, datasets commonly used for this task are summarized, and the performance of various methods is compared on mainstream datasets. Finally, the current challenges and future research directions are outlined, including multi-view collaborative prediction, real-time model inference verification, weakly supervised learning from untrimmed data, few-shot class-incremental generalization, dynamic open-scene adaptation, and variable time interval prediction.
With the popularization of the Internet and the diversification of applications, fine-grained classification of massive network traffic has become key to optimizing quality of service and analyzing user behavior patterns. This paper presents an overview of Machine Learning (ML)-based and pretrained model-based network traffic analysis methods to promote further research and development in this field through multidimensional comparison and analysis. First, the complete traffic classification pipeline is deconstructed, covering data acquisition, preprocessing, and feature extraction, and the practical value of data balancing techniques is examined. The data format, scale, and scene suitability of mainstream public datasets are introduced, compared, and analyzed from multiple perspectives, highlighting their data distribution, feature redundancy, and timeliness problems. Second, the limitations of traditional algorithms in handling high-dimensional data and meeting real-time requirements are summarized, and the trend of applying pretrained models in traffic analytics is outlined through a focused comparative analysis of experimental results. The review covers breakthroughs in Transformer-based pretrained models, their fusion with Deep Learning (DL) models, and advances in lightweight pretrained models for traffic classification. Finally, by considering dynamic research trends, the opportunities and challenges in future applications of pretrained models are discussed, and their limitations in terms of computational cost and privacy protection are analyzed.
Model quantization technology effectively reduces model storage and computational overhead by mapping high-precision floating-point data to low-bit discrete spaces. A core focus of model quantization research is how to rationally account for the characteristics of parameter distributions to construct superior mapping schemes. Existing Post-Training Quantization (PTQ) schemes almost universally assume that the data distribution of non-activation layers follows a symmetric bell-shaped curve, but overlook the fact that small biases introduced by the model's activation layers and inputs induce distributional asymmetry. Consequently, the resulting quantization mapping is skewed to one side by this subtle asymmetry, leading to significant approximation loss. This paper investigates quantization schemes for image Super-Resolution (SR) and proposes improvements to the widely recognized two-stage PTQ scheme. First, the max-min-based equal partitioning employed in the pre-search for quantization bounds is replaced by a sorting-based non-uniform partitioning approach. Second, a bias term is introduced during the pseudo-quantization process, where a portion of the data and its mean are adaptively shifted to mitigate estimation loss caused by data bias. The improved scheme outperforms the original counterpart across all performance metrics while maintaining comparably high compression and acceleration ratios. Compared to the original SwinIR-light model, it reduces the parameter count by more than 60% and accelerates the SR process by more than 3 times.
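The effect of distributional asymmetry on a quantization mapping can be illustrated with a small sketch. It is not the two-stage PTQ scheme of the paper: it only compares generic uniform pseudo-quantization with symmetric bounds against bounds fitted to the actual data range, and the helper names `fake_quantize` and `mse`, the 4-bit setting, and the sample data are all assumptions.

```python
def fake_quantize(values, lo, hi, bits=4):
    """Uniform pseudo-quantization: clip to [lo, hi], snap to 2**bits levels."""
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels
    out = []
    for v in values:
        clipped = min(max(v, lo), hi)
        q = round((clipped - lo) / scale)  # integer code in 0..levels
        out.append(lo + q * scale)         # de-quantized value
    return out

def mse(a, b):
    """Mean squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Hypothetical skewed data: most of the mass sits on the positive side.
data = [-0.2, -0.1, 0.0, 0.1, 0.3, 0.5, 0.8, 1.2]

m = max(abs(v) for v in data)
symmetric = fake_quantize(data, -m, m)                 # assumes symmetry around 0
asymmetric = fake_quantize(data, min(data), max(data)) # shifted to the data range

print("symmetric MSE: ", mse(data, symmetric))
print("asymmetric MSE:", mse(data, asymmetric))
```

With symmetric bounds, a large share of the quantization levels covers the empty interval below the smallest value, so the step size grows and the error rises. Shifting the bounds to the data range is the simplest form of the bias compensation that the abstract motivates.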
Entity redundancy, where multiple nodes represent the same real-world entity due to heterogeneous data sources or extraction errors, severely affects the quality and utility of Knowledge Graphs (KGs). To address the problem of Entity Canonicalization (EC) within a single knowledge graph, we propose a two-stage method whose core innovations are threefold. 1) We propose a Contrastive Representation-Guided Clustering (CRGC) method that performs contrastive learning by leveraging the dual-view information (context and definition) of entities and adaptively cuts the hierarchical clustering results using the Minimum Description Length (MDL) principle, thereby avoiding the need for manual threshold setting. 2) We design a Submodular Redundancy Minimization (SRM) algorithm that formulates the representative entity selection problem as submodular coverage maximization under partition matroid constraints. The combined method, denoted as CRGC-SRM, provides an approximation guarantee while explicitly optimizing the trade-off between the Knowledge Coverage Rate (KCR) and redundancy. 3) Tailored for the EC task, we introduce a type-consistency penalty and a hard-negative mining strategy to effectively suppress the "over-merging" problem caused by homographic (or polysemous) entities. Experiments on multiple public and internal datasets demonstrate that CRGC-SRM improves clustering quality by approximately 2.7 percentage points over the strongest baselines and reduces the Entity Redundancy Rate (ERR) from 29.7% to 7.8% on average (a 73.7% reduction in redundancy relative to the original graph) while maintaining ≥98% KCR. Furthermore, CRGC-SRM significantly improves query performance, increasing the Mean Reciprocal Rank (MRR) by approximately 15.4% and Hits@1 by approximately 18.5%, and reducing the 95th Percentile (P95) query latency by 27.7% to 35.9%. CRGC-SRM offers an efficient, theoretically grounded, and practical solution for single-graph EC.
Multi-view subspace clustering, a type of multi-view clustering, emphasizes discovering latent subspaces in multi-view data and clustering based on these subspaces. The Multi-view Subspace Clustering algorithm with Grouping Effects (MvSCGE) belongs to this family. Its basic principle is to learn the subspace representation of each view through smooth regularization while ensuring cross-view consistency, and ultimately to learn a consistent clustering index matrix from which the clustering results are obtained after processing. However, this algorithm only considers the local structure of a single view and therefore has certain limitations. To further explore the diversity between views, this paper proposes a Diversity-induced Multi-view Subspace Clustering algorithm with Grouping Effect (DiMvSCGE). It preserves the local structure of each view while using the Hilbert-Schmidt Independence Criterion (HSIC) to measure the diversity between views, and solves the resulting problem iteratively by alternating direction minimization. Based on the clustering index matrix obtained after the iterations, k-means clustering is performed to obtain the final result. Comparative experiments with several advanced algorithms on four public datasets show that this algorithm offers advantages such as low parameter sensitivity and fast convergence, and performs well on different datasets.
Unsupervised transfer learning has been widely used in cross-subject mental workload recognition studies based on physiological signals; however, the model performance is limited by the lack of labeled target domain data. To address this problem, a cross-subject mental workload recognition method that combines transfer learning with active learning is proposed. First, with Electroencephalogram (EEG) signals as the research object, source domains with distributions similar to the target domain are selected by calculating the maximum mean discrepancy between the source and target domains. Second, a one-to-one cross-subject mental workload recognition model is constructed for each selected source domain and the target domain. The feature distributions of the two domains are brought closer by an adversarial network, and a small number of target domain samples, chosen for both uncertainty and diversity, are labeled by active learning based on uncertainty-weighted clustering. These labeled samples then participate in the subsequent training of the model's classification layers. Finally, ensemble learning is utilized to synthesize the recognition results of multiple single-source domain models. Experiments on the publicly available WAUC dataset reveal that source domain selection reduces the incidence of negative transfer and the computational cost. The introduction of active learning effectively improves the performance of cross-subject transfer learning. Compared to unsupervised transfer learning, the average recognition accuracy is improved by 14.7% in the task of recognizing mental workload under different levels of physical workload. Ensemble learning overcomes the limited knowledge learned by single-source domain models, further improving the recognition performance of the model and achieving an average recognition accuracy of 86.1%.
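The source-domain selection step can be sketched with a biased estimate of the squared maximum mean discrepancy under an RBF kernel. The kernel choice, the `gamma` value, the function names, and the toy feature vectors are assumptions for illustration; the abstract does not state which kernel or features the method uses.

```python
import math

def rbf(x, y, gamma=1.0):
    """RBF kernel between two feature vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def mmd2(xs, ys, gamma=1.0):
    """Biased estimate of the squared maximum mean discrepancy."""
    kxx = sum(rbf(a, b, gamma) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(rbf(a, b, gamma) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(rbf(a, b, gamma) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2 * kxy

def select_sources(sources, target, k=1):
    """Return the names of the k source domains closest to the target."""
    ranked = sorted(sources, key=lambda name: mmd2(sources[name], target))
    return ranked[:k]

# Hypothetical 2-D features for one target subject and two source subjects.
target = [(0.0, 0.1), (0.1, 0.0), (0.05, 0.05)]
sources = {
    "subject_a": [(0.0, 0.0), (0.1, 0.1)],
    "subject_b": [(3.0, 3.0), (3.1, 2.9)],
}
print(select_sources(sources, target, k=1))
```

Discarding distant source subjects before adaptation is what lets selection reduce both negative transfer and the number of one-to-one models that have to be trained.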
Mean Teacher is a highly regarded and widely used framework for semi-supervised medical image segmentation. However, methods based on Mean Teacher do not let the student network selectively accept supervision from the teacher network during training. This implies that even if the performance of the teacher network is inferior to that of the student network, the student network is still supervised by the teacher network, which results in an accumulation of errors. Moreover, all these methods use a fixed threshold for pseudo-labels to obtain correct information from the predictions of the teacher network. Although this filters out most incorrect information, it also eliminates much of the correct information, which greatly limits the availability of pseudo-labels. To address these issues, a semi-supervised medical image segmentation model based on selective supervision and a dynamic threshold, named SSDT, is proposed. This model allows the student network to choose when to accept supervision from the teacher network, preventing the teacher network from supervising the student network when the teacher's performance is insufficient. Using the newly designed dynamic threshold module, the network selects a pseudo-label threshold suited to the current training stage, thereby maximizing the retention of correct information in the teacher network output. On the LA and ACDC datasets with 20% labeled data, SSDT achieves Dice coefficients of 90.94% and 89.93%, respectively. Extensive experiments on four medical image datasets demonstrate that SSDT has superior segmentation performance compared with several state-of-the-art methods.
Sequence recommendation utilizes users' historical behavior sequences to model user interests and provide content recommendations, and is commonly employed in sectors such as news, advertising, and e-commerce. Self-supervised sequence recommendation based on contrastive learning is a current research hotspot. However, real sequence data are dynamically uncertain, and sampling biases exist in contrastive learning, both of which limit recommendation performance. To mitigate these issues, this paper proposes a self-supervised sequence recommendation method based on stochastic self-attention and momentum contrastive learning. Stochastic self-attention is used to alleviate the uncertainty of sequence dynamics, and momentum contrastive learning is used to mitigate the sampling bias problem in contrastive learning. To validate the performance of the model, experiments are conducted on the Beauty, Office, Yelp, and Toys datasets. The results demonstrate that the proposed method outperforms other baseline models across several metrics, including HR@K and NDCG@K, indicating significant improvements in both accuracy and robustness.
This study proposes a polarization image fusion method based on multi-scale transformation and image enhancement to address the difficulty in capturing and processing defect information on reflective metal surfaces. Polarization imaging technology is used to suppress reflections, and images containing polarization information are captured using a polarization camera. The Contrast Limited Adaptive Histogram Equalization (CLAHE) algorithm is improved by means of sub-block division and a smooth mapping table. These improvements significantly enhance the contrast of the degree-of-polarization, polarization-angle, and visible images. A Laplacian Pyramid (LP) decomposition is applied to the images. Bilateral filtering and Laplacian sharpening are performed on the high-frequency layers, where defects such as scratches and pits are located, to enhance high-frequency details. In the image fusion stage, a fusion strategy based on adaptive luminance weights is proposed. The fusion weights are adjusted dynamically according to the luminance distribution of each image, ensuring that the defects are not blurred owing to luminance differences. The fused image pyramid is then reconstructed to obtain the final fused image. Experiments are conducted on the front sealed doors of washing machines. The results show that, compared with other image fusion methods, the proposed method achieves better performance in objective evaluation metrics such as Information Entropy (IE), Peak Signal-to-Noise Ratio (PSNR), and Average Gradient (AG). In particular, the PSNR and Structural Similarity Index (SSIM) reach maximum values of 65.3043 dB and 0.4727, respectively. The fused image exhibits a high Signal-to-Noise Ratio (SNR) and contrast, effectively highlighting the defects on the reflective metal surfaces.
The high similarity between space targets and background features in dense star fields can easily lead to numerous false alarms during detection. In addition, space targets often appear dim and weak under long-distance detection conditions and are occluded by bright stars during motion, which makes detection difficult and leads to high missed detection rates. In response to these issues, this study proposes FRR-YOLOv8, a model for detecting dim and weak space objects against dense stellar backgrounds. Built on the YOLOv8 framework, it uses large-kernel depthwise separable convolutions at different levels and combines grayscale and connected-domain discrimination to refine object segmentation. First, the C2f convolutional layer replaces the ordinary convolutional layer of the Spatial Pyramid Pooling Fast (SPPF) module in the original YOLOv8 network. By convolving feature maps at different levels, the model obtains more contextual information, improves the detection of small objects and low signal-to-noise ratio targets, and alleviates the high missed detection rate caused by weak target features. Second, the Real-Time Models for object Detection (RTMDet) structure is used as the head of the YOLOv8 network. Large-kernel depthwise separable convolutions are introduced into the basic units of the model structure to enlarge the receptive field, balance the computational and parameter requirements between different resolution levels, and strengthen the feature extraction ability of the basic construction units. In this module, grayscale and connected-domain judgment are combined to finely locate the target area, from the overall frame selection down to the individual neighborhood, which resolves the detection interference caused by dense background stars. The improved algorithm is tested on both simulated and real image datasets. The mAP@0.5 index reaches 94.6% on image datasets with signal-to-noise ratios ranging from 0.5 dB to 1 dB, which is 10.8% higher than that of the original YOLOv8 model. This result shows that the FRR-YOLOv8 model can detect dim and weak space targets.
To address the issue of interference factors such as seasonal changes, climate, and illumination that affect high-resolution remote sensing images of the same geographical space but different temporal phases, a remote sensing image building Change Detection (CD) method based on a multi-temporal ChangeFormer is proposed. This method uses multiple remote sensing images from different temporal phases and fuses the latest temporal remote sensing image with multiple prechange remote sensing images at different scales for feature difference extraction. Additionally, it focuses on both the comprehensive semantic features of the images and the details of the semantic information between the images. This approach helps reduce false detections caused by changes in factors such as season and illumination. Additionally, the method fuses the feature differences of multiple prechange remote sensing images from different temporal phases and introduces a regularization term as a loss function. This eliminates interference from nonbuilding changes and illumination shadows in the nonchanging areas of buildings, thereby enhancing the generalization ability of the model. A three-temporal remote sensing image dataset covering changes from agricultural land to construction land is constructed. The experimental results show that, compared to the current optimal BIT method, the multi-temporal ChangeFormer method improves the F1 value, Intersection over Union (IoU), precision, and recall by 9.04%, 9.87%, 15.27%, and 3.4%, respectively, thus significantly enhancing detection accuracy. Furthermore, it outperforms classical CD methods in terms of detailed information processing.
Reconstructing High Dynamic Range (HDR) images from multiple Low Dynamic Range (LDR) images with varying exposures is challenging, especially in scenarios involving camera and object motion, where motion regions often introduce artifacts that degrade the quality of the final reconstructed image. This issue primarily arises from the misalignment of content among multiple LDR images, where geometric differences between images significantly impact the reconstruction outcome. To address this issue, we propose an HDR image reconstruction network based on feature pre-alignment for enhancing HDR reconstruction quality. This network has two stages: feature pre-alignment and HDR reconstruction. In the feature pre-alignment stage, a Feature Pre-Alignment Network (FPAN) guides the alignment of features from input images with those from the reference image, thereby reducing artifacts caused by motion. In the HDR reconstruction stage, a selective state space model is employed to model the global context of the pre-aligned features, and a simplified HDR restoration network generates the final HDR image. Extensive experiments are conducted on two datasets to evaluate the performance of the proposed network. The results show that the proposed network outperforms the compared methods across multiple objective evaluation metrics, exhibits satisfactory subjective visual effects, and demonstrates certain generalization capabilities.
When convolutional networks are used in the image inpainting of cultural relics, the convolution kernel's limited receptive field poses challenges, which results in a weak comprehension of the global context and complex structures. Moreover, the convolution operation does not adequately handle the intricate geometrical shapes of relic surfaces owing to its translation invariance; hence, convolution-based inpainting is prone to irrelevant structures and artifacts. In the case of Transformer models with self-attention mechanisms, which process the details and local features of relic images, the insufficient attention to specific regions makes it difficult to capture the deep features necessary for precise and detailed inpainting. Additionally, Transformers cannot adequately capture long-range semantics, which results in a suboptimal visual quality of the inpainted images. This paper proposes a relic image inpainting model based on the SwinTransformer, called the Dynamic Mask on SwinTransformer (DMSWT). The model introduces several improvements to the self-attention module within the network to optimize its structure. First, layer normalization is removed, and fully connected layers are replaced with residual connections to enhance the deep feature extraction capabilities of the network. Second, a dynamic mask mechanism is introduced to mitigate the issue of reduced effective pixels caused by default sampling in the inpainting of images with large-scale missing regions. Finally, the loss function is improved with a focus on enhancing the perceptual realism, leading to an improvement in the visual quality of the inpainted images. Experimental results for different scenarios show that the DMSWT model can learn more structural prior information and generate inpainted images that align with real-world intuition. Additionally, quantitative evaluations demonstrate significant improvements in performance metrics.
Decentralized storage offers high availability and scalability. However, owing to the decentralized storage of data across multiple nodes, issues such as slow data access and complex operations arise, resulting in a poorer user experience compared to centralized storage. To address this, a data availability sampling technology is employed, which maintains the decentralized nature of the method while incorporating the advantages of centralized storage. In data availability sampling technology, multiple nodes obtain a smaller, randomly selected subset of data from a single data owner. This technology is often combined with erasure coding to enhance data availability. Based on data availability sampling technology, decentralized storage providers are introduced to serve users on a one-to-one basis, and data guarantors supervise storage providers and provide guarantees for user data. A comprehensive storage method is designed to achieve highly available data storage, and blockchain and smart contracts are employed to enhance decentralization. By supporting a repledging model and adopting a storage-proof algorithm with low computational resource consumption, the willingness of the nodes to join is increased. To resolve the contradiction between large data scales and the limited bandwidth resources of data guarantors, a delayed confirmation mechanism is proposed. Experimental and analytical results show that under this method, the probability of malicious node collusion is only 2.43×10⁻³, the probability of untrustworthy data availability sampling results is only 2.93×10⁻⁴, the number of data unavailability occurrences is 0 in 3 million simulation experiments, the number of centralized nodes is 0, and generating storage proofs for a 1 MiB file takes only 3.51 ms. This method achieves highly available data storage while improving user-friendliness and node-friendliness, providing a feasible technical path for optimizing decentralized storage.
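Why a small random sample gives strong availability evidence can be shown with a short calculation. The sketch assumes a rate-1/2 erasure code, so that an adversary must withhold more than half of the chunks to make the data unrecoverable, and it assumes sampling without replacement. The function names and the chunk counts are hypothetical, and the numbers it prints are unrelated to the probabilities reported in the abstract.

```python
from math import comb

def miss_probability(n_chunks, withheld, samples):
    """Probability that `samples` distinct random chunks all avoid the withheld ones."""
    available = n_chunks - withheld
    if samples > available:
        return 0.0  # more samples than available chunks: a withheld one must be hit
    return comb(available, samples) / comb(n_chunks, samples)

def samples_needed(n_chunks, withheld, target):
    """Smallest sample count whose miss probability is at most `target`."""
    s = 0
    while miss_probability(n_chunks, withheld, s) > target:
        s += 1
    return s

# Hypothetical setting: 256 coded chunks, 129 withheld (just over half).
n, w = 256, 129
s = samples_needed(n, w, 1e-3)
print(s, miss_probability(n, w, s))
```

Because each sample has less than a one-in-two chance of landing on an available chunk, the miss probability falls roughly geometrically with the number of samples, which is why each sampling node only needs a small subset of the data.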
Cyberattacks are becoming increasingly diverse, and traditional intrusion detection methods exhibit limitations in capturing the spatio-temporal characteristics of complex network traffic. Most traditional methods rely primarily on static feature analysis, making it difficult to adapt to the ever-changing intrusion behaviors in dynamic network environments. When analyzing network traffic, existing deep learning methods often overlook the topological structure between network nodes and the temporal dynamics of traffic. To address these issues, this paper proposes a novel intrusion detection method based on a dynamic spatio-temporal Graph Neural Network (GNN), named DSTG-IDS. By segmenting network traffic through time windows, data packets within each time period are modeled as nodes in a graph, and connections are established based on the relationship between the source and destination IP addresses, thereby constructing a sequence of dynamic graphs over time. To better capture the temporal characteristics of traffic, temporal position encoding is applied to the graph data so that nodes in different time periods express temporal information more effectively. In terms of model design, first, Graph Convolutional Networks (GCNs) are utilized to extract the spatial features of network traffic, and Graph Attention Networks (GATs) are incorporated to enhance the focus on key node information. Second, Bidirectional Gated Recurrent Units (Bidirectional GRUs) are employed to model the temporal sequence of traffic, effectively capturing the dynamic characteristics of data changes over time. Finally, a multi-head attention mechanism is utilized to fuse spatio-temporal features and perform classification. Experiments on three widely used datasets (BoT-IoT, ToN-IoT, and NF-CSE-CIC-IDS2018) demonstrate that DSTG-IDS achieves accuracies of 99.69%, 98.61%, and 93.26%, respectively. Compared with other intrusion detection methods, DSTG-IDS exhibits significant advantages in terms of accuracy, recall, False Alarm Rate (FAR), and F1 value.
With the rapid application of metaverse technology, identity security and trust mechanisms in virtual spaces are facing severe challenges. In response to the high-concurrency, low-latency interaction characteristics of the metaverse environment, as well as the need to supervise malicious behavior while protecting user privacy, this paper proposes a metaverse identity security authentication scheme based on software-hardware collaborative technology and a signcryption algorithm. First, in terms of protocol design, a traceable identity authentication protocol that combines device hardware fingerprints and national cryptographic algorithms is designed. This protocol achieves distributed authentication and identity tracking while ensuring user anonymity. Second, in terms of system implementation, to handle the computational bottlenecks caused by complex cryptographic operations, a software-hardware collaborative computing platform based on a Field Programmable Gate Array (FPGA) is constructed. By leveraging the parallel processing advantage of the FPGA, hardware acceleration is applied to the core operations of the signcryption algorithm, namely modular and point multiplications, effectively offloading computation from the Central Processing Unit (CPU) and significantly improving signcryption efficiency. Finally, the scheme is comprehensively evaluated by constructing an identity authentication protocol verification platform. Experimental results demonstrate that the proposed scheme combines high security and efficiency. Compared to a traditional CPU software implementation, the signcryption computation performance improves by 13.6 times after FPGA hardware acceleration. The scheme can balance efficiency and security in metaverse authentication and provide key technical support for building a trusted and manageable metaverse ecosystem.
Existing color image watermarking algorithms are applied to each channel independently, ignoring the intrinsic correlation between the channels. To achieve the embedding of robust reversible watermarking in color images, a color image robust reversible watermarking algorithm based on the Trinion of Exponential Fourier Moments (TEFM) is proposed. First, the trinion theory and exponential Fourier moments are used to construct the TEFM transform, and then a two-stage robust reversible watermarking algorithm is used to embed the watermark information in the TEFM domain. In the first stage, the standardized TEFM transform coefficients are modified through Quantization Index Modulation (QIM) to embed the watermark; in the second stage, the distortions caused by embedding the robust watermark are embedded into the robust watermarked image as compensatory information through the prediction error expansion method. Experimental results show that the average Peak Signal-to-Noise Ratio (PSNR) of the final watermarked image obtained using this algorithm is greater than 44 dB. The original image can be recovered losslessly under unattacked conditions. Under regular attacks on the image, such as median filtering, mean filtering, and salt-and-pepper noise, the Bit Error Rate (BER) is reduced to varying degrees compared with that of existing algorithms. Under the attack of 3×3 median filtering, average BER reductions of 16.7%, 7.6%, and 6.8% are achieved compared with three other state-of-the-art methods, respectively. These results confirm that the proposed watermarking algorithm offers invisibility, high capacity, reversibility, and high robustness.
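The first-stage embedding rule can be illustrated with textbook scalar QIM applied to a single real coefficient. The sketch does not reproduce the TEFM transform or the paper's standardization step; the step size, the function names `qim_embed` and `qim_extract`, and the sample coefficient are assumptions.

```python
def qim_embed(coeff, bit, step):
    """Move a coefficient to the nearest point of the lattice assigned to `bit`.

    Bit 0 uses multiples of `step`; bit 1 uses the same lattice shifted by step/2.
    """
    offset = 0.0 if bit == 0 else step / 2.0
    return round((coeff - offset) / step) * step + offset

def qim_extract(coeff, step):
    """Decode the bit whose lattice lies closest to the received coefficient."""
    d0 = abs(coeff - qim_embed(coeff, 0, step))
    d1 = abs(coeff - qim_embed(coeff, 1, step))
    return 0 if d0 <= d1 else 1

# Hypothetical coefficient and step size.
step = 0.5
marked = qim_embed(1.37, 1, step)
print(marked, qim_extract(marked + 0.1, step))
```

The bit survives any perturbation smaller than a quarter of the step size, so a larger step buys robustness at the price of a larger embedding distortion (at most half a step per coefficient).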
The study of adversarial examples can promote innovation in defense methods, identify gaps, and thus improve the robustness of a model. Most existing adversarial attack methods for object detection suffer from poor black-box transferability and insufficient generalization of the generated adversarial examples. To solve these problems, an algorithm called GM-DEC is proposed to enhance the transferability of adversarial examples and inhibit the correct classification of object detectors. First, GridMask, a data augmentation method, is introduced into the gradient iteration-based adversarial example generation process to obtain more generalized gradient information, thereby helping to enhance the robustness of the attack and prevent the generated adversarial examples from falling into local optima and overfitting the white-box model. Second, to further enhance the transferability of the adversarial examples, an attention-based region-of-attention suppression loss function is designed, which makes the model focus on other non-targeted regions by suppressing the size of the attention heatmap, thus leading to incorrect predictions. Finally, the momentum term in the Momentum Iterative-Fast Gradient Sign Method (MI-FGSM) is introduced during the iterative updating process to accumulate velocity vectors, thus stabilizing the updating direction and achieving faster convergence. Experiments are carried out on the Pascal VOC2007 dataset, and the results show that the proposed algorithm can effectively attack object detectors such as Faster R-CNN, YOLO, and SSD. The success rate of the black-box attack is improved by approximately 10-30 percentage points compared with current attack algorithms for object detection, accompanied by better transferability.
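The momentum accumulation borrowed from MI-FGSM can be sketched as follows. This is a minimal illustration under stated assumptions: inputs are plain lists of floats, the gradient comes from a user-supplied hypothetical `grad_fn`, and the perturbation is bounded in the L-infinity norm by `eps`. GridMask augmentation and the attention suppression loss of GM-DEC are not modeled.

```python
from typing import Callable, List


def sign(v: float) -> int:
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def mi_fgsm(
    x: List[float],
    grad_fn: Callable[[List[float]], List[float]],
    eps: float = 0.1,
    steps: int = 10,
    mu: float = 1.0,
) -> List[float]:
    """Iteratively ascend the loss while accumulating a momentum (velocity) vector."""
    alpha = eps / steps          # per-step size
    x_adv = list(x)
    g = [0.0] * len(x)           # accumulated velocity
    for _ in range(steps):
        grad = grad_fn(x_adv)
        l1 = sum(abs(v) for v in grad) or 1.0   # avoid division by zero
        # Momentum update: decayed velocity plus the L1-normalized gradient.
        g = [mu * gi + gr / l1 for gi, gr in zip(g, grad)]
        # Signed step, then projection back into the eps-ball around x.
        x_adv = [xa + alpha * sign(gi) for xa, gi in zip(x_adv, g)]
        x_adv = [min(max(xa, xo - eps), xo + eps) for xa, xo in zip(x_adv, x)]
    return x_adv


if __name__ == "__main__":
    w = [1.0, -2.0, 0.5]          # toy linear loss L(x) = w . x, so the gradient is w
    print(mi_fgsm([0.0, 0.0, 0.0], lambda x: w))
```

Normalizing each gradient before accumulation keeps the velocity from being dominated by a single iteration, which is what stabilizes the update direction.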
To reduce privacy leakage in the model aggregation process of federated learning, an effective federated learning privacy protection algorithm is proposed to improve data availability. This algorithm aims to address the unfairness of the aggregated model to each client caused by the imbalance of client data quality and the low data availability caused by incomplete server-side data aggregation. It adopts a removable random mask perturbation technique for the model parameters of the client, avoiding the risk of privacy leakage during data upload to the server without affecting the aggregation effect. Considering the uneven data quality among different clients, it dynamically adjusts the weights of clients during data aggregation on the server side to improve data availability. Simultaneously, the Shamir(t, n) threshold secret sharing method is used to distribute and reconstruct the uploaded model parameters. This prevents incomplete aggregation results caused by network delays or unsuccessful client data uploads, which can lead to a decrease in data availability. Experiments on the MNIST and CIFAR-10 datasets reveal that the proposed algorithm not only prevents client privacy leakage, reduces algorithm time overhead, and ensures accuracy but also effectively improves data availability and model performance while achieving privacy protection.
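The Shamir(t, n) threshold idea can be sketched for a single integer secret as follows. The field prime, the function names `make_shares` and `reconstruct`, and the integer encoding are assumptions made for illustration; real model parameters are floating-point values and would first need a fixed-point encoding, which the abstract does not specify.

```python
import secrets

PRIME = 2**61 - 1  # a Mersenne prime, assumed large enough for the toy secret


def make_shares(secret: int, t: int, n: int, prime: int = PRIME):
    """Split `secret` into n shares such that any t of them can reconstruct it."""
    # Random polynomial of degree t-1 whose constant term is the secret.
    coeffs = [secret % prime] + [secrets.randbelow(prime) for _ in range(t - 1)]

    def poly(x: int) -> int:
        acc = 0
        for c in reversed(coeffs):      # Horner evaluation modulo prime
            acc = (acc * x + c) % prime
        return acc

    return [(x, poly(x)) for x in range(1, n + 1)]


def reconstruct(shares, prime: int = PRIME) -> int:
    """Recover the secret by Lagrange interpolation at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % prime
                den = den * (xi - xj) % prime
        # pow(den, -1, prime) is the modular inverse (Python 3.8+).
        secret = (secret + yi * num * pow(den, -1, prime)) % prime
    return secret


if __name__ == "__main__":
    shares = make_shares(123456789, t=3, n=5)
    print(reconstruct(shares[:3]), reconstruct(shares[2:]))
```

Because any t shares suffice, the server can still recover the shared value when up to n - t clients drop out or upload late, which matches the availability argument made in the abstract.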
Multimodal sentiment recognition aims to improve the accuracy and robustness of sentiment detection by integrating information from different modalities such as text, audio, and video. However, existing methods face challenges in handling discrepancies and complementarities between modalities, as well as in capturing the dynamic features of temporal sequences, often resulting in suboptimal sentiment recognition performance. To address these issues, this paper proposes a multimodal sentiment recognition model based on cross-modal enhancement and a time-step gating mechanism. The model employs a cross-modal cross-attention mechanism to learn correlations between different modalities, thereby enhancing the complementarity of features across modalities. The model integrates information from text, audio, and video through interactions between modalities, mitigating the limitations of single-modality sentiment expressions. Subsequently, the time-step gating mechanism dynamically adjusts feature weights at each time-step, focusing on critical time-steps that contain more relevant sentiment information, thereby improving the model's temporal sequence modeling capability. Finally, fused features are fed into a classifier for sentiment prediction. Experimental evaluations on publicly available CMU-MOSEI and CMU-MOSI multimodal sentiment recognition datasets show that the proposed model achieves sentiment recognition accuracies of 82.41% and 82.60%, respectively, significantly outperforming current mainstream models such as ALMT and TETFN. These results demonstrate that cross-modal enhancement and time-step gating mechanisms effectively improve the ability to fuse multimodal features and process temporal sequences, validating the effectiveness and robustness of the method in multimodal sentiment recognition tasks.
Medical Visual Question Answering (Med-VQA) aims to accurately predict answers based on medical images and related questions. This task requires the simultaneous extraction of question features and medical image features and the fusion of the two to obtain the final answer. Existing Med-VQA methods mainly focus on the extraction and interaction of overall features, which cannot effectively capture the correlation between questions and key areas of an image and lack the ability to understand fine-grained image information. To address this problem, this study proposes a model based on context awareness and multi-level feature fusion for Med-VQA, known as CAMF, which fully focuses on fine-grained image features and performs multi-level feature interaction. The model first enhances text and image features through two types of Guided Attention (GA), then uses the context awareness module to capture key fine-grained image features, and finally realizes the mutual promotion of three features through multi-level feature fusion to obtain more effective features for answer prediction. The experimental results show that the overall accuracy of the CAMF model on the VQA-RAD dataset is 1.5 percentage points higher than that of the baseline model of the same type and that the overall accuracy on the SLAKE dataset is 0.4 percentage points higher than that of the baseline model of the same type. Moreover, it achieves a performance comparable to that of medical domain pre-training methods on both datasets. In addition, the feature map visualization results show that the CAMF model can effectively focus on key areas in the image and make full use of image information to obtain answers.
Single-modal images are easily affected by light, weather, and other environmental conditions in all-weather ship detection. This leads to a low ship detection accuracy and a high missed detection rate. To address these issues, this paper proposes a ship detection method, VIF-RTDETR, which fuses visible light and infrared image information. The method fully utilizes the rich details and color information of visible images and the stable performance of infrared images in low-light environments, and constructs a four-channel input model. The complementary fusion of the two modalities is realized by designing the fusion module VIF such that the detection network makes more reasonable use of the information from visible light and infrared images. Channel attention is incorporated into the backbone feature extraction network to further optimize the feature extraction capability by dynamically assigning different weights to the channels, thereby enhancing the feature expression capability of the channels. To further enhance the detection performance of small targets in ship detection, a weighted bounding box loss function is designed so that the model can effectively focus on the feature expression of targets of different sizes and improve the detection accuracy under different target sizes. The experimental results show that, on the visible and infrared ship datasets, the detection precision metrics AP0.5∶0.95 and AP0.5 of the model reach 78.3% and 98.5%, respectively, with AP0.5∶0.95 improving by 4.7 and 9.2 percentage points relative to the single-modal visible and infrared models, respectively. Further, the recall rate AR0.5∶0.95 reaches 85.2%, reflecting improvements by 3.1 and 7.3 percentage points relative to the single-modal visible and infrared models, respectively. Thus, the findings contribute to significantly improving the precision of ship detection and reducing the missed detection rate.
Although significant progress has been made in the design of heuristic models for 360° image quality assessment, existing methods still exhibit a substantial gap from human subjective perception owing to insufficient consideration of how humans view 360° images. To address this limitation, this paper proposes a panoramic image quality assessment method that integrates both image quality and aesthetic features. This method aims to provide a comprehensive evaluation of images from a human perception perspective and accurately reflect the overall quality of the panoramic images. The approach consists of two main stages. First, a multimodal large language model is used to analyze an image dataset and generate textual descriptions that encapsulate both image quality and aesthetic features, thus constructing an image-text pair dataset. This process combines image quality and aesthetic evaluation, enabling the model to gain a holistic understanding of the images. In the second stage, a dual-stream multimodal quality perception model is designed to effectively fuse textual and visual features and thoroughly explore the multimodal information of the image. Additionally, Triplet Loss is incorporated on top of the traditional L2 loss function to better capture the subjective quality differences between samples. Experimental results on the CVIQD and OIQA benchmark datasets demonstrate that the proposed algorithm achieves significant improvements in image quality assessment performance across the Spearman's Rank Correlation Coefficient (SRCC), Pearson Linear Correlation Coefficient (PLCC), and Root Mean Square Error (RMSE) metrics, outperforming other state-of-the-art methods.
A reasonable preventive strategy for the location and distribution of emergency materials within a logistics network is key to ensuring the supply of materials after urgent events, such as the recent public health crisis. This paper investigates a class of multi-category emergency material storage distribution models under scenarios with uncertain post-event demand distribution. Based on limited historical sample data, an ambiguity set containing information about random factors before a disaster is established, and a distributionally robust optimization model of two-stage material planning is constructed, with the optimization goal of minimizing the expected cost under all distributions defined in the ambiguity set. The model includes two stages of collaborative optimization: warehouse network planning, and material scheduling and distribution. The first stage focuses on the problem of preventive warehouse location and material pre-storage with an uncertain demand distribution. The second stage plans emergency material scheduling and distribution in the warehouse network. The nonlinear distributionally robust optimization problem is transformed into a linear optimization problem by applying a dual method, and a Lagrangian L-Shaped Method (LLSM) is designed to solve the two-stage model. The robustness of the model and algorithm is verified by constructing an example set, and the sensitivity of warehouse network location and material distribution decisions in addressing material gaps after different levels of disasters is analyzed.
The quality of bridges, as crucial transportation infrastructure, is an important consideration during their construction and usage. Open caissons are widely used as basic structures in bridge construction. During the open caisson construction process, real-time and accurate prediction of the sinking attitude helps reduce accident risks and may improve project quality. Commonly used prediction methods, such as statistical models and machine learning models, have difficulty coping with the nonlinear spatiotemporal features of time series data, such as structural stress and open caisson sinking attitude, resulting in inaccurate prediction results. Deep learning models can capture the spatiotemporal features of data and have been widely used in time-series data prediction. However, they have not yet been applied to related tasks, such as open caisson sinking attitude prediction. In this study, we propose a Multi-indicator Prediction Model (MiPM) based on a Graph Neural Network (GNN). MiPM dynamically establishes the graph adjacency matrix between time-series data sequences using the self-attention mechanism and the Gated Recurrent Unit (GRU) and combines it with a convolutional neural network to extract the temporal and spatial features of time-series data. The spatiotemporal features are fused using an interleaved network structure, and the prediction results are mapped. To verify the effectiveness of the MiPM, a real open-caisson bridge construction project is used as an empirical case study. The results show that, compared with 13 baseline models, MiPM has a better prediction performance; for example, its Root Mean Square Error (RMSE) is at least 5.6% lower.
By deeply mining medication data and clinical features of patients from electronic medical records, this study leverages deep learning models to predict drug combinations. This study aims to enhance the accuracy and safety of medication recommendations in clinical disease management and proposes a graph-enhanced attention drug recommendation model that integrates the clinical features of patients to enrich patient representations. This model uses a Graph Neural Network (GNN) to capture drug combination and Drug-Drug Interaction (DDI) knowledge. Through a two-stage attention mechanism, the model generates novel patient representations that combine historical medication information with DDI knowledge. Finally, a multi-label learning approach is employed for drug recommendation. Experiments on the MIMIC-Ⅲ public dataset demonstrate that the proposed model achieves a Jaccard similarity, Precision-Recall Area Under the Curve (PR-AUC), F1 value, and DDI rate of 0.517 2, 0.766 1, 0.673 1, and 0.070 3, respectively. Compared to recent state-of-the-art drug recommendation models, the proposed model reduces the DDI rate by at least 0.004 7, and it improves Jaccard similarity, PR-AUC, and F1 value by 0.004 5, 0.006 1, and 0.012 1 or more, respectively. Comparative experiments on real-world datasets further validate the model's performance. The model outperforms recent state-of-the-art drug recommendation models with a Jaccard similarity, PR-AUC, F1 value, and DDI rate of 0.450 2, 0.702 3, 0.612 8, and 0.085 7, respectively. These experimental results indicate that the proposed model exhibits superior performance and clinical applicability, providing significant value in assisting physicians in developing more scientifically effective medication plans.
In Printed Circuit Board Assembly (PCBA), defect detection is key to improving production line efficiency. However, after assembly, printed circuit boards are usually inspected manually, leading to labor and time wastage, as well as missed and false detections. To address these issues, this paper proposes an improved lightweight YOLOv8s network that effectively reduces model complexity while enhancing the accuracy of PCBA defect detection. First, owing to the lack of publicly available PCBA-related datasets, a dataset called PCBA-DET is constructed for post-assembly PCBA defect detection. Various data augmentation techniques are applied to PCBA-DET to simulate real-world production scenarios and improve the dataset balance. Second, the last C2f module of the YOLOv8s backbone is replaced with a Re-parameterized Large Kernel convolution Network (RepLKNet) to reduce computational cost and increase the effective receptive field of the model. In addition, in the neck network of YOLOv8s, a P2 small object detection layer and Ghost Convolution are introduced to capture more detailed information and effectively reduce the number of model parameters. On the augmented PCBA-DET dataset, the improved model achieves an increase of 2.6 and 0.1 percentage points in terms of mean Average Precision (mAP)@0.5∶0.95 and mAP@0.5, respectively, compared with the baseline model, whereas the number of parameters is reduced by 36.8%.
In women who have undergone a cesarean section, the gestational sac can get implanted at the surgical incision site during subsequent pregnancies, which can lead to uterine rupture and severe hemorrhage, threatening the fertility and even life of the patient. Currently, uterine artery embolization is the preferred treatment for such conditions; however, the procedure relies heavily on the expertise of the doctor and is challenging to customize based on individual patient differences. Therefore, a simulation scheme for uterine artery embolization is proposed. First, a semantic segmentation algorithm for uterine artery vessels is introduced, utilizing a dual-branch encoding structure and a feature focus fusion model to enhance the use of global features by the neural network, thereby achieving semantic segmentation of uterine artery vessel CT images. Second, a combined refinement and tracking centerline extraction method is used to construct a three-dimensional model of the vessels based on the centerline. Finally, a numerical simulation method combining computational fluid dynamics and discrete element methods is employed to simulate the uterine artery embolization. The experimental results show that the semantic segmentation algorithm significantly improves the segmentation accuracy of uterine artery vessel CT images. The centerline extraction-based three-dimensional vessel model reconstruction retains the true structure of the vessels while avoiding cumbersome postprocessing. The numerical simulation of the embolization of the uterine artery vessels intuitively demonstrates the embolization formation process and provides a reference for doctors in formulating surgical plans.
Precise crop row detection is crucial in intelligent agriculture because it significantly affects the navigation and harvesting capabilities of unmanned harvesters. Crop row extraction accuracy is affected by factors such as slanting, displacement, and lodging during lettuce growth. This study transforms this issue into a target detection problem focusing on the core area of mature lettuce and proposes a target detection algorithm. This algorithm is based on YOLOv5s, a widely adopted target detection framework, and incorporates a dynamic convolution module into its backbone network. By dynamically filtering out background interference from feature maps, it preserves important detail features in local areas, thereby enhancing the network's ability to learn the features of the lettuce core. Additionally, the Feature Pyramid Network (FPN) structure introduces a multiscale fusion module based on dilated convolution and weight sharing, ensuring effective retention of target structural information after multiple downsampling processes, which is beneficial for detecting small targets such as lettuce cores. Furthermore, the CARAFE upsampling operation is introduced to fully utilize the contextual information during the feature extraction process, thereby enhancing the network's ability to extract small target features. Moreover, a new loss function based on the Wasserstein distance and SIoU is proposed to address the sensitivity of traditional IoU methods to the positions of small targets and accelerate the fitting speed of the network. Experimental results demonstrate that the improved algorithm achieves an average precision and recall of 0.586 and 0.574 for lettuce core extraction, representing increases of 6.1 and 6.3 percentage points compared to those achieved by YOLOv5s, respectively. 
After detecting the coordinates of the lettuce core, the algorithm uses the least squares method to fit the coordinate points into a straight line, thereby obtaining the central line of the lettuce crop row. Experimental results indicate that this algorithm significantly improves the performance of the original YOLOv5s model in detecting lettuce cores under different lighting conditions, thereby enabling a more accurate extraction of the crop row centerline.
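The least squares fitting step can be sketched as follows. The abstract does not say how the line is parametrized; this sketch assumes image coordinates in which crop rows run roughly along the vertical axis, so it fits x = slope * y + intercept to avoid an unbounded slope for near-vertical rows. The function name `fit_row_centerline` and the sample points are hypothetical.

```python
from typing import List, Tuple


def fit_row_centerline(points: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Fit x = slope * y + intercept to detected core centers by ordinary least squares."""
    n = len(points)
    if n < 2:
        raise ValueError("at least two points are required")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    syy = sum((y - mean_y) ** 2 for _, y in points)
    if syy == 0:
        raise ValueError("all points share the same y coordinate")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / syy
    intercept = mean_x - slope * mean_y
    return slope, intercept


if __name__ == "__main__":
    # Hypothetical (x, y) centers of detected lettuce cores in one crop row.
    cores = [(10.0, 0.0), (61.0, 100.0), (109.0, 200.0)]
    print(fit_row_centerline(cores))
```

Ordinary least squares is sensitive to outliers, so a false detection far from the row can tilt the fitted line; filtering detections by confidence before fitting is one possible mitigation.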
Transmission towers support power transmission lines and are critical to the operation of power systems. However, no dataset is currently available for detecting transmission tower status. To detect and warn about dangerous towers, this study first constructs an image dataset of tower status. The dataset comprises seven categories including foreign object intrusion, animal nests, base obstruction, and external damage. Next, the study proposes a lightweight algorithm, CT-YOLO, for detecting transmission tower status. This algorithm includes: 1) a lightweight backbone network, L-ELANnet, which reduces the parameter count by 3/4 while ensuring no significant change in detection accuracy; 2) a spatial pyramid pooling module based on the Efficient Channel Attention (ECA) mechanism, which achieves feature fusion at different scales with fewer parameters; 3) k-means++ to optimize the model's prior boxes, which improves the model's ability to learn about slender targets, such as debris and cranes, in the dataset; 4) Wise-IoU, a bounding box loss function, which provides dynamic non-monotonic gradient gains for data of different qualities, thereby improving training accuracy and convergence speed. Ablation and comparative experiments are conducted to verify the effectiveness and superiority of the improved model. Experimental results show that, compared to the original model, the proposed algorithm increases mAP@0.5 from 94.9% to 95.4%, with a 21.5% increase in detection speed, reaching 113.6 Frames Per Second (FPS). Furthermore, the model size is only 14.9 MB, which is 1/5 of the original model's. Overall, the improved model has higher detection accuracy and faster detection speed. Moreover, the proposed algorithm outperforms mainstream target detection algorithms in transmission tower condition detection.
Diabetes, one of the four major global noncommunicable diseases, has seen an increase in mortality rates over the years. Patients with diabetes and chronically high blood glucose levels may experience various complications and serious adverse consequences. The accurate prediction and control of blood glucose levels are critical for the diagnosis and treatment of diabetes. Although Continuous Glucose Monitoring (CGM) technology has alleviated some challenges associated with manual detection, it remains costly and vulnerable to external interference. This study proposes the PBI-CLA model, which is based on deep learning, for predicting blood glucose concentration levels in patients. First, the Convolutional Neural Network (CNN) layer extracts data features from blood glucose concentration and insulin dose sequences through one-dimensional convolution. Subsequently, the Long Short-Term Memory (LSTM) layer learns the correlations between time series increments. Finally, the attention layer assigns different weights to the insulin dosage injected at each time node to measure the blood glucose concentration and outputs the predicted blood glucose concentration value. In extensive experiments, the model reduces the Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE) substantially. Compared with other glucose concentration prediction models, the RMSE and MAPE values achieved by PBI-CLA for one-hour glucose concentration prediction decrease by 12.82 and 10.24 percentage points, respectively.
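The three error metrics named in the abstract follow their standard definitions and can be computed as in the sketch below. The sample glucose values are made up for illustration and are not taken from the paper's experiments.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root Mean Square Error: penalizes large deviations more heavily."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean Absolute Error, in the same unit as the measurements."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean Absolute Percentage Error, in percent; y_true must not contain zeros."""
    return 100.0 * sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


if __name__ == "__main__":
    measured = [100.0, 200.0]    # illustrative glucose readings
    predicted = [110.0, 180.0]   # illustrative model outputs
    print(rmse(measured, predicted), mae(measured, predicted), mape(measured, predicted))
```

MAPE is scale-free, which makes it convenient for comparing errors across patients with different glucose ranges, while RMSE and MAE stay in the measurement unit.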
In complex farmland environments, machines used for recycling residual films may pick up plastic films, leading to low recovery efficiency. Existing deep learning models show low accuracy in identifying residual films. To address these issues, this paper proposes a farmland residual film detection model based on YOLOv5s, called YOLO-SDI. First, the Spatial Pyramid Pooling (SPP) structure is combined with the Efficient Layer Aggregation Network (ELAN) attention mechanism to better focus on key local features and improve the recognition rate of small targets. Next, the UpSample module is replaced with the DySample module, which enhances the feature information of small targets and improves the accuracy of model recognition. Subsequently, the InceptionNeXt module is introduced to capture information at different scales through parallel convolutional layers. This enhances the model's attention to global features, thus improving detection robustness. Finally, Soft Non-Maximum Suppression (Soft-NMS) is used instead of Non-Maximum Suppression (NMS) to gradually attenuate the confidence of overlapping boxes; this allows finer adjustment of the position and confidence of the target box and improves the positioning accuracy of the anchor box. The experimental results show that, compared with the YOLOv5s model, YOLO-SDI improves the precision, recall, F1 value, and mean Average Precision (mAP) by 1.2, 0.2, 0.6, and 7.2 percentage points, respectively. The findings of this study indicate that the YOLO-SDI model has potential for practical applications such as agricultural residue management and field cleanliness evaluation and can provide strong technical support for improving the recovery rate of agricultural residues.
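The confidence attenuation performed by Soft-NMS can be sketched as follows. This is a minimal version that assumes the Gaussian decay variant with a hypothetical `sigma` of 0.5 and boxes given as (x1, y1, x2, y2) tuples; the abstract does not state which decay function or parameters YOLO-SDI uses.

```python
import math
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes: List[Box], scores: List[float],
             sigma: float = 0.5, score_thresh: float = 0.001):
    """Decay the scores of overlapping boxes instead of discarding them outright."""
    remaining = list(zip(boxes, scores))
    kept = []
    while remaining:
        # Select the highest-scoring box still under consideration.
        best = max(range(len(remaining)), key=lambda i: remaining[i][1])
        best_box, best_score = remaining.pop(best)
        kept.append((best_box, best_score))
        decayed = []
        for box, score in remaining:
            # Gaussian penalty: the larger the overlap, the stronger the decay.
            score *= math.exp(-(iou(best_box, box) ** 2) / sigma)
            if score >= score_thresh:
                decayed.append((box, score))
        remaining = decayed
    return kept


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
    for box, score in soft_nms(boxes, [0.9, 0.8, 0.7]):
        print(box, round(score, 3))
```

Unlike hard NMS, the heavily overlapping second box survives with a reduced score, so closely adjacent targets are less likely to be suppressed.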
Blood pressure is a key indicator of cardiovascular health. To monitor daily blood pressure at home, researchers continue to use non-end-to-end measurement methods, such as combining multiple physiological signals, in addition to traditional blood pressure meters. These methods present some disadvantages: collecting multiple physiological signals is difficult and expensive, and maintaining the time synchronization of signal collection is challenging. Currently, the end-to-end method, which uses face videos to predict blood pressure, has widened the application areas to an extent; however, in most methods, the selection of regions of interest and limited prediction accuracy remain unresolved issues. To solve these problems, this paper proposes a remote blood pressure prediction method based on a multi-scale attention structure in visible light, called EBP-Net. First, each face video candidate window is classified, the effective skin region is extracted by regression, the remote Photoplethysmography (rPPG) signal is extracted from the continuous effective face region by optical flow-based technology, and the complete rPPG signal is filtered by wavelet transform. Robust rPPG signals are extracted using methods such as detrending. Second, EBP-Net introduces a new Efficient Multi-scale Attention (EMA) module and Multi-Scale Fusion (MSF) module, which can enhance the features of depth vision representation without reducing the channel dimension and significantly improve the model's ability to understand and predict physiological signals by capturing and hierarchically expressing multi-scale features. In experiments, Systolic Blood Pressure (SBP) is categorized as grade C according to the British Hypertension Society (BHS) on both datasets, while Diastolic Blood Pressure (DBP) is categorized as grade B. The Mean Absolute Error (MAE) values are 6.82 and 5.17 mmHg for SBP and DBP, respectively, which are lower than those in recent comparable studies.
Compared with other models, this method has better generalization ability and a lower error rate, and it provides effective methods and suggestions for the detection of blood pressure levels from face videos.
The extraction of the Aortic Dissection (AD) centerline is crucial for its quantitative diagnosis and treatment. However, owing to the complex anatomy and diverse vascular morphology of AD, this task is highly challenging, and current quantitative evaluation methods are limited. Most existing approaches require presegmentation or distance map computation, followed by extraction using minimum path or skeleton algorithms; however, these often result in broken centerlines owing to incomplete segmentation. We propose a Deep Q-Network (DQN)-based centerline tracking algorithm that integrates an attention-embedded dilated residual module with a channel attention mechanism, thereby enabling more effective vascular feature extraction and automatic tracking of complex vessel centerlines. Additionally, an improved reward function is designed to guide accurate centerline tracking. Experiments on public datasets show that our method outperforms others in terms of centerline overlap metrics, with an average extraction speed of 5 s per case, indicating its strong clinical potential.