Traditional Retrieval-Augmented Generation (RAG) methods predominantly focus on pure-text scenarios, where their retrieval and generation mechanisms struggle to effectively model the visual elements, spatial layouts, and structural semantics common in multimodal documents. This drawback restricts their performance in tasks involving mixed text and images, long documents, and cross-document reasoning. To tackle this issue, Multimodal Retrieval-Augmented Generation (MRAG), which integrates text, image, and layout structure modeling and incorporates multimodal evidence retrieval and scheduling into the generation process, has developed into a core technical paradigm for Question & Answer (Q & A) and reasoning over visually rich documents. This paper conducts a systematic review of research progress in MRAG for document Q & A tasks. Firstly, based on the practical requirements of multimodal document understanding, we analyze the key challenges in MRAG implementation, including multimodal alignment, long-context modeling, evidence traceability, and system robustness. Secondly, from the perspective of how MRAG systems support the generation process, we compare representative methods across four dimensions: embedding paradigms, document retrieval scope, layout-aware mechanisms, and multimodal retrieval strategies, focusing on how design choices influence generation stability, reasoning accuracy, and system complexity. Thirdly, we summarize the characteristics and limitations of existing multimodal document Q & A datasets and evaluation frameworks, and analyze current constraints in evidence granularity and reasoning explainability.
Finally, we point out that MRAG is evolving from static similarity-matching retrieval mechanisms toward dynamic evidence planning paradigms centered on generation and reasoning needs, and that it should continue to enhance the reliability and explainability of complex document Q & A systems through collaborative multimodal, multi-granularity modeling.
Model intellectual property protection is an issue that cannot be ignored in model security. Watermarking technology, as a core means of model traceability, provides technical support for copyright verification by embedding special identifiers into model parameters or generated content. However, a trained watermarked model can easily be copied and distributed, and attackers can then destroy or remove the watermark embedded in a Deep Neural Network (DNN) model using specific techniques such as fine-tuning, pruning, or adversarial example attacks, making verification of model ownership impossible. To gain a deeper understanding of model watermarking attack methods, this study begins by introducing model watermarking attacks and then classifies these methods into two categories, white-box and black-box watermarking attacks, based on the attacker's access rights to and information about the target model. It also sorts out and analyzes the motives, hazards, attack principles, and specific implementation methods of DNN model watermarking attacks. Moreover, it compares and summarizes existing research on model watermarking attacks from the perspectives of attacker capabilities and performance impacts. Finally, it explores the potential positive roles of neural network watermarking attacks in future research and provides suggestions for in-depth research in the fields of model security and intellectual property protection.
With the continuous advancement of education digitalization, intelligent education has developed rapidly. As a core research task in intelligent education, Knowledge Tracing (KT) aims to capture students' mastery of knowledge concepts from their historical learning data so as to provide personalized learning paths and resources, meeting the objectives of Artificial Intelligence (AI)-assisted education. Traditional KT methods rely mainly on Bayesian and logistic models, which have good scientific explanatory properties but exhibit limited performance when processing massive amounts of educational data. Because of its excellent feature extraction ability and performance, deep learning is better suited than traditional KT methods to capturing learners' knowledge states from massive data. Therefore, a comprehensive review of research on deep learning-based KT in intelligent education is conducted. First, the relevant concepts, research background, and current development status of KT in intelligent education scenarios are introduced. The deep learning-based KT methods of recent years are then analyzed and divided into four categories: Recurrent Neural Network (RNN), self-attention network, memory-augmented neural network, and Graph Neural Network (GNN) approaches. The basic ideas and algorithmic processes of these four classical and mainstream method families are systematically organized in terms of learner and exercise characteristics. Subsequently, the public educational datasets currently available to researchers are introduced, and the performance of different methods on these datasets is compared. Finally, this paper summarizes deep learning-based KT in intelligent education and discusses possible future research directions in this field.
Convolutional Neural Networks (CNNs) are widely used in the field of object detection, earning widespread acclaim in scholarly circles for their precision and scalability. This success has spawned numerous notable models, including the Region-based Convolutional Neural Network (R-CNN) family (such as Fast R-CNN and Faster R-CNN) and the You Only Look Once (YOLO) series. After the success of Transformers in natural language processing, researchers began exploring their application in computer vision, leading to visual backbone networks such as the Vision Transformer (ViT) and Swin Transformer. In 2020, a Facebook research team unveiled the DEtection TRansformer (DETR), an end-to-end object detection algorithm based on Transformers, designed to minimize the need for prior knowledge and postprocessing in object detection tasks. Despite the promise shown by DETR, it has limitations including slow convergence, relatively low accuracy, and the ambiguous physical meaning of its object queries. These issues have spurred a wave of research aimed at refining and enhancing the algorithm. This paper collates, scrutinizes, and synthesizes the various efforts to improve DETR, assessing their respective merits and demerits. Furthermore, it presents a comprehensive overview of state-of-the-art research and specialized application domains that employ DETR and concludes with a prospective analysis of DETR's future role in computer vision.
In the development of most existing gait-based emotion recognition methods, feature fusion has not been sufficiently studied; consequently, these methods fail to fully utilize the various gait features, resulting in poor performance. In this study, an emotion recognition method based on the adaptive fusion of multiple gait features is developed. In this method, spatiotemporal features, reconstructed features, and psychology-based affective features are extracted from gait data. Spatiotemporal features capture the dynamic changes in gait patterns, reconstructed features focus on the structural aspects of gait, and psychology-based affective features provide insights into an individual's emotional state. Subsequently, an adaptive fusion strategy is used to dynamically weigh the importance of the three gait features, thereby achieving a more comprehensive representation of the individual's emotional state. Finally, ten-fold cross-validation is performed on a dataset containing four emotion categories, followed by training and testing of the model on a real-world emotion-gait dataset. Experimental results show that, in multi-label classification tasks, the proposed model improves the mean Average Precision (mAP) by two percentage points compared with the state-of-the-art TAEW method. Furthermore, in multiclass classification tasks, the accuracy of the proposed model is 1.88 percentage points higher than that of the STEP method. These results indicate that the method effectively leverages the spatiotemporal, reconstructed, and psychology-based affective features of gait, providing a robust and accurate approach to emotion recognition.
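The adaptive fusion step described above can be illustrated with a minimal sketch. In the actual method the weights would come from a learned gating network; here the `gate_logits` argument and all names are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def adaptive_fuse(spatiotemporal, reconstructed, affective, gate_logits):
    """Fuse three gait feature vectors with adaptively normalized weights."""
    feats = np.stack([spatiotemporal, reconstructed, affective])  # shape (3, d)
    weights = softmax(gate_logits)                                # shape (3,), sums to 1
    return weights @ feats, weights

# With equal logits, fusion reduces to a plain average of the three features.
fused, w = adaptive_fuse(np.ones(4), np.zeros(4), np.full(4, 2.0), np.zeros(3))
```

In practice the gate would condition on the input features themselves, letting the model emphasize whichever cue is most informative for a given gait sequence.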
The path-following mechanism of fixed-wing Unmanned Aerial Vehicles (UAVs) is crucial in the UAV domain. In terms of six-Degree-of-Freedom (DOF) dynamics, a fixed-wing UAV is a nonlinear system whose high-dimensional continuous state and action spaces make it challenging to control and guide. A novel hierarchical reinforcement learning framework is proposed to address the complex issues in fixed-wing UAV path following. The basis of this framework is to decompose path following into separate control and guidance problems. For the control problem, a Proximal Policy Optimization with Differential Compensator (PPO-DC) algorithm is introduced by incorporating a differential compensator, which yields faster convergence and improved control stability. Experimental results reveal that the proposed PPO-DC algorithm improves convergence speed by approximately 2.5 times compared with the standard PPO algorithm and achieves better control accuracy. Moreover, models trained for specific control tasks exhibit strong adaptability when handling other control tasks. For the guidance problem, fixed-wing UAV guidance is modeled and an effective guidance strategy is proposed. Additionally, a cumulative reward design is proposed to address the sequential learning of multiple objectives in reinforcement learning tasks, ensuring effective convergence of training. Experimental results show that the proposed hierarchical reinforcement learning framework performs exceptionally well in various complex path-following scenarios, maintaining an average path-following error of less than 20 meters for fixed-wing UAVs.
Federated Learning (FL), a distributed machine learning technology, has achieved significant results in privacy protection. However, in practical applications, client drift phenomena occur because of the Non-Independent and Identically Distributed (Non-IID) nature of data sources, leading to slow model convergence and performance degradation. To address this issue, this study proposes a Federated Local Momentum accelerated learning (FedLM) algorithm combined with the attention mechanism. FedLM introduces a global momentum term into local model updates, utilizing the global gradient information from previous rounds to smooth the current update process and correct the divergence of parameter update directions among heterogeneous clients, thereby reducing gradient oscillations and alleviating data heterogeneity issues. The attention mechanism dynamically adjusts the weight of each client in the global model update to improve the quality of the aggregation model. Experimental results show that FedLM achieves significantly better accuracy and stability than existing federated learning algorithms such as SCAFFOLD, FedCM, and Moon in image classification tasks with different levels of data heterogeneity, model structures, and datasets.
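The abstract does not give FedLM's exact update rule; the following is a generic sketch of a local client step smoothed by a server-broadcast global momentum term, with the hyperparameter names (`lr`, `beta`) assumed for illustration:

```python
import numpy as np

def local_momentum_step(w, grad, global_momentum, lr=0.1, beta=0.9):
    """One local client update: blend the local gradient with the previous
    round's global momentum so that heterogeneous clients drift less from
    the shared update direction, damping gradient oscillations."""
    v = beta * global_momentum + (1.0 - beta) * grad
    return w - lr * v, v

# On the first round (zero global momentum) the step is a scaled SGD step.
w_new, v = local_momentum_step(np.array([1.0]), np.array([1.0]), np.array([0.0]))
```

Each client would run several such steps per round, and the server would refresh `global_momentum` from the aggregated updates before the next round.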
Repeated consumption is a common phenomenon in many recommendation scenarios, such as e-commerce repurchases and Point-of-Interest (POI) check-ins; this behavior involves both whether and when a repurchase will occur. Existing studies mainly focus on predicting a single factor (either the possibility or the timing of repurchase) and therefore do not answer the joint question of when and what will be bought again. The main challenges of this problem are as follows: the types of repurchased items are very diverse, different items have different purchase cycles, and repurchase behavior is often sparse, all of which make prediction very difficult. Furthermore, repurchase behavior spans two dimensions, time and items, and jointly exploiting the information from both dimensions for prediction is also difficult. A solution to these problems is explored from the perspective of user-personalized dynamic decay characteristics and a fusion model based on repurchase behaviors and time intervals. First, because a user's interest in an item decreases over time and recent behavior has a stronger potential correlation with repurchase behavior, the item sequence is modeled to obtain the user representation vector, and information from neighboring sequences is used to mitigate the sparsity of repurchase behavior. Second, through careful design of the neural network modules, the user's personalized repurchase cycle and the item's repurchase cycle are captured, solving the problem of fusing time and item information. Extensive experiments are conducted on multiple public datasets, and the results confirm that the proposed model is superior to existing benchmark models.
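The time-decay intuition above (recent purchases matter more) can be sketched as an exponentially decayed interest score. In the paper's model the decay would be personalized and learned; here `decay` is a fixed illustrative parameter:

```python
import math

def decayed_repurchase_score(purchase_times, now, decay=0.1):
    """Sum of exponentially decayed weights over past purchases of one item:
    a purchase made at time t contributes exp(-decay * (now - t)), so
    recent purchases dominate the predicted repurchase interest."""
    return sum(math.exp(-decay * (now - t)) for t in purchase_times)

recent = decayed_repurchase_score([10.0], now=10.0)  # purchase made just now
old = decayed_repurchase_score([0.0], now=10.0)      # purchase ten units ago
```

Scoring every candidate item this way and combining the scores with a learned per-item cycle term is one plausible reading of the fusion described above.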
Knowledge graph completion aims to address the problems of knowledge deficiency and incompleteness by predicting missing entities or relationships in a knowledge graph. Compared with traditional knowledge graphs, commonsense knowledge graphs are typically sparser, making structural information alone insufficient for representing entities. Therefore, some studies enrich commonsense knowledge graphs by utilizing semantic representations in addition to structural information. However, these methods typically focus only on the semantic representation of individual entities and ignore the semantic associations between entity sets. To address this issue, this study proposes a new method called relation-constrained contrastive learning for commonsense knowledge graph completion. First, the method uses relations to divide entities into different sets and selects positive and negative sample pairs from these sets for contrastive learning, obtaining basic representations of the entities. It then learns comprehensive entity representations by constraining the similarity between an entity's semantic representation and the central representation of the set to which the entity belongs. The completion task is performed based on these comprehensive representations. Experiments on two public datasets show that the proposed model outperforms baseline models. Compared with the second-best model, CPNC, the proposed model improves the Mean Reciprocal Rank (MRR) by 1.09 and 2.48 percentage points and Hits@1 by 1.02 and 1.55 percentage points on the CN-100K and ATOMIC datasets, respectively.
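The two training signals described above, contrastive learning within relation-defined sets plus a similarity constraint toward each set's central representation, can be sketched as follows; the exact loss forms and the temperature value are assumptions, not the paper's formulation:

```python
import numpy as np

def cos(a, b):
    """Cosine similarity between two vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def info_nce(anchor, positive, negatives, tau=0.5):
    """Standard InfoNCE over cosine similarities (one positive vs. negatives)."""
    logits = np.array([cos(anchor, positive)] + [cos(anchor, n) for n in negatives]) / tau
    logits -= logits.max()  # numerical stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

def center_constraint(entity, set_members):
    """Pull an entity toward the centroid of the relation-defined set it belongs to."""
    center = np.mean(set_members, axis=0)
    return float(np.sum((entity - center) ** 2))

# Identical anchor/positive with an orthogonal negative gives a small loss.
loss = info_nce(np.array([1.0, 0.0]), np.array([1.0, 0.0]), [np.array([0.0, 1.0])])
```

The total objective would weight the two terms, so an entity is simultaneously discriminated against negatives and kept close to its set center.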
Relation Extraction (RE) tasks in the biomedical field often face issues such as data scarcity, class imbalance, and multiple labels. To address these issues, a method combining data augmentation with a dynamic threshold strategy is proposed. First, the GPT model is fine-tuned using a custom loss function, and new data are generated based on the Word2Vec model by obtaining feature templates. Second, a BERT classifier is used to screen the generated data, combining high-quality samples with the original dataset to form a richer training set. Finally, a learnable dynamic threshold strategy is proposed that adjusts the classification threshold according to document length and the difference between model outputs and true labels, enabling the model to flexibly handle multi-label documents. Experimental results on two publicly available medical datasets show that the method achieves F1 values of 84.1% and 69.3%, which are 1.6 and 1.1 percentage points higher than those of the ATLOP method, respectively, verifying its effectiveness.
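The length-dependent thresholding can be sketched as below. In the paper the offset is learnable and also depends on the output-label gap; here it is a fixed log-length heuristic with assumed constants (`scale`, `ref_len`):

```python
import math

def dynamic_threshold(base, doc_len, scale=0.02, ref_len=128):
    """Shift the multi-label decision threshold with document length:
    documents longer than ref_len get a slightly higher threshold to
    curb spurious labels, shorter ones a slightly lower one."""
    return base + scale * math.log(doc_len / ref_len)

def predict_labels(probs, threshold):
    """Return indices of labels whose probability clears the threshold."""
    return [i for i, p in enumerate(probs) if p >= threshold]

t = dynamic_threshold(0.5, doc_len=128)        # at the reference length, t == base
labels = predict_labels([0.6, 0.4, 0.9], t)
```

Making `base` and `scale` trainable parameters updated by the task loss would recover the "learnable" aspect described in the abstract.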
By focusing on the traditional multi-Knapsack Problem (KP) in typical logistics system operations, this study abstracts a Heterogeneous Multiple Knapsack Problem (HMKP) and formulates an improved Deep Deterministic Policy Gradient (DDPG) algorithm to solve it. The DDPG algorithm tends to fall into a local optimum when solving the 0-1 KP. To address this issue, a Dynamic Randomization Mechanism (DRM) and a Dynamic Penalty Mechanism (DPM) are adopted, a tabu list is added to prevent repeated searches, and an improved Transformer module is embedded to optimize the algorithm; on this basis, the TDP-DDPG algorithm is proposed. The TDP-DDPG algorithm demonstrates efficient search capability across several test sets, finding the known optimum in all 39 classical instances of test sets 1 and 2 from low to high dimensionality, in the higher-dimensionality test set 3, and in three of the six instances of the large-scale test set 4. Experiments show that the TDP-DDPG algorithm has stronger optimization-seeking ability after incorporating the improved strategies. Next, the BPD-DDPG algorithm is designed based on the TDP-DDPG algorithm to solve the HMKP with higher complexity and is analyzed and evaluated on high-dimensional instances built from several classical 0-1 KP examples. The results show that the BPD-DDPG algorithm is more accurate than Gurobi in three small-scale cases, although its solution time is longer, and it can efficiently solve high-dimension, large-scale HMKP instances at a low computational cost within an acceptable time.
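The dynamic penalty idea for the 0-1 KP can be illustrated with a penalized fitness function; the penalty schedule and all names below are assumptions for illustration rather than the paper's exact mechanism:

```python
def penalized_value(items, selection, capacity, penalty):
    """Fitness of a 0-1 knapsack candidate: total value minus a penalty
    proportional to the capacity violation. A dynamic penalty mechanism
    would raise `penalty` over training so infeasible picks die out."""
    value = sum(v for (v, w), s in zip(items, selection) if s)
    weight = sum(w for (v, w), s in zip(items, selection) if s)
    return value - penalty * max(0, weight - capacity)

items = [(10, 5), (6, 4)]  # (value, weight) pairs
feasible = penalized_value(items, [1, 0], capacity=8, penalty=2.0)
overweight = penalized_value(items, [1, 1], capacity=8, penalty=2.0)
```

With a small penalty the overweight pick can still look attractive; raising the penalty over training flips the ordering so the feasible solution wins, which is the point of making the penalty dynamic.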
Traditional recommendation models based on contrastive learning first perform data augmentation on the original interaction graph and then strive to improve the consistency of representations encoded from different views. Although this approach has proven effective, recent research has found that graph augmentation often introduces bias owing to the power-law distribution of node degrees in graph data, and such biases are detrimental to contrastive learning. In addition, the graph structure makes processing large-scale datasets computationally intensive, limiting the flexibility of contrastive learning models. To address these challenges, this study proposes a High-Low Variance Separation feature enhancement method (HLVS), which not only avoids direct perturbations to the graph structure but also alleviates the semantic bias problem of traditional feature perturbation methods. Simultaneously, to alleviate popularity bias in recommendation systems, popularity metrics are introduced into the main task, and a new loss function, the Popularity Bayesian Personalized Ranking (PBPR) loss, is designed to balance the representation of popular and unpopular nodes. Finally, by integrating contrastive learning, HLVS, and PBPR, a lightweight and parameter-free graph contrastive learning framework, eXtremely Simple Graph Contrastive Learning (XSGCL), is designed, which can be naturally integrated into recommendation models to improve training efficiency and performance. Extensive experiments on five public datasets demonstrate that integrating XSGCL into LightGCN not only significantly improves training efficiency but also achieves performance better than or comparable to that of advanced models. For example, on the Yelp2018 dataset, the proposed model improves training efficiency by 91.2% compared with LightGCN. On the Alibaba-iFashion dataset, the Recall@10 and NDCG@10 indicators increase by 32.21% and 33.73%, respectively.
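The PBPR idea of balancing popular and unpopular items can be sketched as a popularity-reweighted BPR pairwise loss; the abstract does not give the exact formulation, so the weighting form below is an illustrative assumption:

```python
import math

def pbpr_loss(pos_score, neg_score, pos_popularity, gamma=0.5):
    """BPR pairwise loss reweighted by the positive item's popularity:
    down-weighting popular positives leaves more gradient signal for
    long-tail items, counteracting popularity bias."""
    weight = (1.0 / (1.0 + pos_popularity)) ** gamma
    return -weight * math.log(1.0 / (1.0 + math.exp(-(pos_score - neg_score))))

tail_loss = pbpr_loss(1.0, 1.0, pos_popularity=0)   # unpopular positive item
head_loss = pbpr_loss(1.0, 1.0, pos_popularity=10)  # popular positive item
```

For identical score gaps, the popular item's loss is strictly smaller, so optimization pressure shifts toward representing unpopular nodes better.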
Remote sensing image semantic segmentation has significant applications in resource management, natural disaster management, and environmental monitoring and protection. However, different remote sensing image datasets often exhibit issues such as spectral confusion between different objects and spectral variation within the same object. These issues significantly reduce the generalization performance of deep learning models, and cross-domain performance degradation poses a significant challenge for remote sensing image semantic segmentation algorithms. To address these issues, optimizations are performed from two perspectives: neural network architecture and domain adaptation strategy. First, a TransConv network based on a hierarchical multihead self-attention mechanism and multiscale feature fusion is proposed. This network effectively enhances feature extraction and fusion capabilities through sliding window patching, multilayer self-attention modules, and a lightweight feedforward neural network, thereby improving the model's generalization performance. Second, a self-training-based domain adaptation technique is introduced, which optimizes the image input, model parameters, and learning process. As a result, labeled source-domain knowledge is successfully transferred to the unlabeled target domain, significantly improving segmentation performance in the target domain. Experimental results demonstrate that the improved TransConv network significantly outperforms other algorithms in terms of generalization performance and excels in domain adaptation tasks when combined with the self-training-based domain adaptation technique. The proposed approach thus enhances the accuracy and generalization capability of remote sensing image semantic segmentation, reduces the impact of erroneous pseudo-labels, and addresses the class imbalance problem, providing more reliable technical support for practical applications.
Pavement anomaly detection holds significant practical importance for ensuring driving safety, optimizing traffic management, and enhancing the driving experience. To address the challenges posed by variations in the size, shape, and color of pavement anomalies, as well as complex environmental interference that lowers detection accuracy and efficiency, this study proposes an improved Real-Time Detection Transformer (RT-DETR)-based method for pavement anomaly detection. First, a Large Receptive Field Element-wise Multiplication Block (LRFEM_Block) is designed to replace the BasicBlock module in the original backbone network, effectively enhancing feature expression capabilities based on the element-wise multiplication principle. Next, a Generalized Efficient Layer Aggregation Network (GELAN) is introduced and combined with multi-scale LRFEM_Block modules to design a Multiplicative-based Layer Aggregation Intra-scale Feature Interaction (MLA-IFI) structure, which improves the computational efficiency and performance of the neck network on deep features and optimizes the gradient propagation path. Additionally, the Selective Boundary Aggregation (SBA) concept is employed to construct a Bidirectional Adaptive Boundary Fusion Feature Pyramid Network (BABF-FPN) multi-scale feature fusion module, which adaptively aggregates features of different resolutions in both directions and promotes the refinement of small-object boundaries. Experimental results show that the improved method achieves increases of 3.4 and 4.7 percentage points in mAP@0.5 on a self-built dataset and the RDD2022 public dataset, respectively, outperforming other models. Moreover, it reduces the number of parameters and the computational load by 24.5% and 11.2%, respectively, with a detection speed of 74 frames/s, thereby satisfying the deployment requirements for in-vehicle pavement anomaly detection.
Feature extraction from remote sensing images with complex backgrounds is challenging, and detection accuracy is low due to the high density of small targets and significant scale variations. To address these challenges, this paper proposes a multi-scale information-enhanced target detection algorithm based on YOLOv5s: Deep Learning YOLO (DL-YOLO). First, the improved algorithm employs a dilated-convolution-based fast spatial pyramid pooling module, designed on the basis of Spatial Pyramid Pooling-Fast (SPPF), at the top of the backbone network. This improves the feature extraction capability of the network by fusing the detailed information of multi-scale targets with semantic information through the Receptive Field Enhancement Block (RFEB). Second, the improved algorithm incorporates a Lightweight and Efficient Detection Head (LEDH) based on the Decoupled Head (DH) of YOLOv6. The original detection head is replaced with the LEDH, which features a lightweight dilated Global Depth Convolution (GDConv) module, to improve the correlated learning of the classification and regression tasks. The LEDH also employs lightweight convolution to reduce model weight, which enhances target detection accuracy at different scales and reduces the number of decoupled-head parameters. Experimental results on the DIOR dataset demonstrate that the proposed DL-YOLO algorithm increases precision, recall, mAP@0.5, and mAP by 1.6, 2.1, 2.1, and 4.7 percentage points, respectively, compared with YOLOv5s. The overall score of the proposed algorithm surpasses those of several current state-of-the-art target detection algorithms; hence, it is well suited to detecting targets in remote sensing images at multiple scales.
The RSD-YOLO algorithm, based on YOLOv8s, is proposed to address the challenges of poor detection performance, severe occlusion, difficult small-target feature extraction, and the large number of model parameters inherent in Unmanned Aerial Vehicle (UAV) aerial image detection. First, the Receptive Field Attention (RFA) module CSP-RFA is designed to replace the C2f module, enhancing small-target feature extraction and effectively addressing the insensitivity of traditional convolution operations to positional changes. Second, the backbone and feature fusion networks are made lightweight, a new large-size feature map detection branch is added, and a Receptive Field Pyramid Network (RFPN) is proposed to optimize the feature flow and improve feature representation. Third, the detection head module is optimized by integrating multi-scale features with a multi-level attention mechanism, and the loss function is updated to improve the model's detection performance for small targets. Finally, for model compression, the Layer-Adaptive Magnitude-based Pruning (LAMP) algorithm is employed to further reduce the number of parameters and the model size. Experimental results demonstrate that the lightweight RSD-YOLO model significantly outperforms the baseline model on the publicly available VisDrone2019 dataset, with a 10.0 percentage point increase in precision, a 9.5 percentage point increase in mAP@0.5 (equivalent to a 24.1% increase), and a 6.9 percentage point increase in mAP@0.5:0.95 (equivalent to a 29.4% increase). The number of model parameters is reduced from 11.12×10⁶ to 4.05×10⁶, a 63.6% reduction, and the computational cost is reduced from 42.7 GFLOPs to 25.5 GFLOPs, a 40% reduction. Furthermore, on a newly filtered dataset focusing on small occluded targets, RSD-YOLO achieves improvements of 9.1, 16.1, and 10.7 percentage points in precision, mAP@0.5, and mAP@0.5:0.95, respectively.
Multi-label image classification studies tend to use label semantic information and label co-occurrence probability as prior knowledge to guide the learning of multi-label classification models. However, most of these methods rely on additional semantic information, which makes it difficult to handle the information mismatch problem between different modalities. The calculation of label co-occurrence probability is also susceptible to data imbalance and noise. To address these issues, this study proposes a multi-label image classification method based on label visual prototype learning, which utilizes only the visual information of an image and constructs a multi-label classifier by generating label visual prototypes. This method reduces the reliance on prior knowledge and fully utilizes the visual information, effectively improving classification performance. First, an attention module based on class-specific activation maps is designed to guide the model to focus on image regions that are more relevant to the class and generate class-specific feature representations. Second, by capturing the visual prototype representation of each label, a label visual prototype dictionary is constructed to fully leverage the adaptability of visual feature information to image classification tasks. Finally, using this dictionary as a multi-label classifier, the visual features of the input image are reconstructed to obtain the predicted probability of the labels. Experimental results show that this method improves classification accuracy compared with similar methods on three standard multi-label image classification datasets.
Multi-view 3D reconstruction based on neural implicit surface learning suffers from inherent ambiguities in representing the geometric shape and appearance of complex objects. Consequently, fine geometric details are prone to being lost in sparse-texture areas, at boundaries, and on large smooth surfaces, making accurate recovery difficult. To address this issue, this study proposes a novel neural implicit surface reconstruction method based on multi-view mixed consistency constraints. This method uses Multi-View Stereo (MVS), multi-view photometric consistency, feature consistency, and volume rendering to optimize the implicit surface representation, enabling the reconstruction of object models with fine geometric details. First, a dense point generation module based on MVS is proposed to supplement detail information in the sparse-texture areas and at the boundaries of the object surface, achieving multi-view geometric optimization of the object surface. Second, a multi-view mixed consistency constraints module is introduced, which uses the Signed Distance Function (SDF) to locate the zero-level set. It applies multi-view photometric consistency constraints to impose geometric constraints on the smooth regions of the object, supervising the extracted implicit surface. Additionally, multi-view feature consistency constraints are applied to surface points at the zero-crossings of the linearly interpolated SDF, compensating for pixel matching errors in texture-sparse or structurally complex regions and thereby refining the reconstructed model. Finally, volume rendering is applied to produce high-quality image renderings from the implicit SDF, enabling precise surface reconstruction of objects. Experimental results show that, compared with methods such as Colmap, the proposed method increases the Peak Signal-to-Noise Ratio (PSNR) by over 40.3% on the DTU dataset and successfully achieves accurate surface reconstruction of the objects.
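The "zero-crossing of the linearly interpolated SDF" step can be written out directly: given two ray samples whose SDF values straddle the surface, the surface parameter follows from linear interpolation. A minimal sketch (function and argument names are illustrative):

```python
def sdf_zero_crossing(t1, t2, s1, s2):
    """Ray parameter t* where a locally linear SDF crosses zero, given
    samples (t1, s1) and (t2, s2) with opposite-sign SDF values:
    t* = t1 + s1 * (t2 - t1) / (s1 - s2)."""
    assert s1 * s2 < 0, "samples must straddle the surface"
    return t1 + s1 * (t2 - t1) / (s1 - s2)

# Symmetric SDF values place the surface midway between the two samples.
t_star = sdf_zero_crossing(0.0, 1.0, 1.0, -1.0)
```

Feature consistency constraints would then be evaluated at the 3D points obtained by plugging `t_star` back into the ray equation.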
To address the issues of low accuracy in vehicle paint defect detection, excessive parameters in detection algorithms, and the uneven distribution of easy and hard samples, a vehicle paint defect detection method based on an improved YOLOv8 is proposed. To enhance scratch defect detection and reduce model size, a Deformable Attention Transformer (DAT) mechanism is introduced into the backbone network, and Ghost Convolution (GhostConv) replaces the standard Convolution (Conv) modules. Subsequently, to improve feature extraction and further reduce model size, a C2f based on Efficient Multiscale Attention (EMA) (C2f-E) module is proposed by combining the FasterBlock module with the EMA attention mechanism. Moreover, to enhance detection performance for small objects, a network based on the Bidirectional Feature Pyramid Network (BiFPN) is designed. Additionally, by adding a small-object detection head and a multiscale feature fusion branch, a neck pyramid structure named BiFPN with Small Object Detection Head (BiFPN-D) is proposed. Finally, to balance difficult and easy samples and improve detection performance for small-object defects, Wise-Intersection over Union version 3 (WIoUv3) is employed as the loss function for training the network. The improved network is trained on a self-built dataset of vehicle paint defect images and subjected to comparative experiments. The results show that the improved model achieves an increase of 5.5 percentage points in mean Average Precision (mAP@0.5) and a reduction of 1.4×10⁶ in parameter count compared with YOLOv8n.
In 3D object detection from point clouds, the inherent sparsity of Light Detection And Ranging (LiDAR) data poses pronounced challenges for small objects. A small number of effective points lead to weak structural cues and blurry boundaries; limited contextual awareness hinders spatial reasoning and semantic completion, causing localization bias; and the difficulty of precise spatial localization, weak channel expressiveness, and background dominance constrain accuracy. To mitigate the impact of these issues on detection accuracy, a dynamic-aware 3D detector is proposed that integrates dynamic feature extraction with feature-enhancement mapping, targeting two critical stages of small-object detection: feature extraction and candidate generation. Specifically, a Dynamic Point Feature Prediction Network (DPFPN) is introduced that adaptively predicts and supplements sampling points to strengthen structural perception of small objects. Subsequently, a Feature Enhancement Mapping Network (FEMN) is built that deeply fuses the original features with those produced by the dynamic module to yield context-rich 2D feature maps, thereby compensating for contextual deficiency and improving localization. Finally, a Point Cloud Feature Enhancement Network (PCFEN) module is designed to sharpen focus on key small-object regions along both the channel and spatial dimensions. Experiments on the nuScenes dataset demonstrate that the proposed approach performs better than mainstream detectors. Relative to the CenterPoint baseline, the mean Average Precision (mAP) increases from 56.1% to 59.4%, and the NuScenes Detection Score (NDS) rises from 64.4 to 67.4.
Current fake comment detection models face several problems, such as insufficient mining of deep emotional features, a lack of semantic dependency modeling, and poor generalization performance. In response to these problems, a fake comment recognition model, DEBR-GAN, based on emotion-weighted BERT and multi-task adversarial learning, is proposed. First, using an emotion dictionary to assist in pretraining BERT, the potential emotional information in the comment text is extracted through an emotion weighting mechanism, thereby enhancing the ability to capture subtle emotional changes in the comments. Subsequently, a Recurrent Neural Network (RNN) is used to process the semantic features output by BERT, fully exploring the temporal dependencies and contextual relationships between words in comments to improve sensitivity to textual details. Furthermore, to enhance the robustness and generalization ability of the model in multi-domain scenarios, DEBR-GAN draws on the adversarial learning concept of the Generative Adversarial Network (GAN), treating the fake comment detector as a feature generator for extracting effective features shared across domains. Simultaneously, by setting category discriminators and rating discriminators, gradient reversal is used in the backpropagation process to engage in adversarial games with the generator. This effectively eliminates the interference of category information and user rating preferences in the feature extraction process, thereby ensuring that the detector is highly accurate in identifying fake comments. The experimental results show that, on the Dianping dataset, the F1 value of the DEBR-GAN model reaches 0.926. Compared with the model without the multi-task adversarial learning module and the current best baseline model, the classification accuracy of DEBR-GAN is higher by 5.1 and 3.51 percentage points, respectively. 
In addition, DEBR-GAN exhibits high recognition accuracy in handling comments with different emotional tendencies and semantic structures, thereby verifying the effectiveness and superiority of combining emotional enhancement and adversarial learning in fake comment detection.
Multi-signature schemes are widely used in blockchain transactions. Despite increasing demand for the localization of blockchain applications, research on multi-signature has not sufficiently focused on the secure and efficient SM2 algorithm. Additionally, most existing solutions rely on the Public Key Infrastructure (PKI) system for certificate management, which poses efficiency and scalability issues. Therefore, this study proposes a certificateless multi-signature scheme based on the SM2 algorithm. First, in the SM2 key generation stage, a certificateless cryptographic mechanism is introduced to avoid expensive certificate management, and a key-holding proof is designed to resist malicious key attacks. Second, by introducing a tree structure, an "online-offline" SM2 multi-signature algorithm is designed to achieve efficient and highly scalable signature generation. The scheme is proven to satisfy Existential UnForgeability under Chosen-Message Attacks (EUF-CMA) in the Random Oracle Model (ROM). Finally, the proposed solution is applied to the Hyperledger Fabric consortium chain to optimize the blockchain transaction process. Results of a performance analysis show that, compared with existing signature schemes, the proposed scheme is more effective in reducing computational and communication overhead while ensuring security.
Federated learning enhances data sharing and collaboration between healthcare institutions, thereby improving the accuracy and efficiency of medical diagnoses, treatments, and predictions. However, existing federated learning solutions face security and efficiency challenges. Model parameter updates during training may inadvertently disclose information about local training datasets. To ensure parameter confidentiality, researchers have proposed various solutions such as masking protocols and differential privacy. However, masking protocols often lack strong security, whereas differential privacy leads to tradeoffs between accuracy and privacy. To address these challenges, this study proposes a secure federated learning scheme for smart healthcare based on secret sharing and homomorphic encryption. This scheme effectively prevents both healthcare clouds and clients from stealing model parameters and resists collusion attacks among participants. In addition, a ciphertext verification algorithm is used to ensure that model parameters can be verified during training. Security and performance analyses demonstrate that our scheme meets the confidentiality and integrity requirements for model parameters in smart healthcare scenarios, with significant improvements in computational and transmission efficiency compared to existing solutions.
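The secret-sharing side of such a scheme can be illustrated with additive shares: each client splits its (integer-scaled) model update into random shares that individually reveal nothing, and only the combined sums expose the aggregate. This is a minimal sketch of the masking idea only — the paper's scheme additionally uses homomorphic encryption and ciphertext verification, which are not shown; all values are toy data:

```python
import random

PRIME = 2**61 - 1  # field modulus for additive secret sharing

def share(value, n_shares, rng):
    """Split an integer into n additive shares that sum to value mod PRIME."""
    shares = [rng.randrange(PRIME) for _ in range(n_shares - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    return sum(shares) % PRIME

# Each client secret-shares its integer-scaled gradient; no single share
# (or single aggregator) reveals an individual client's update.
rng = random.Random(0)
client_grads = [17, 42, 99]                       # toy scaled gradients
all_shares = [share(g, 3, rng) for g in client_grads]
# Each aggregator sums the shares it holds, component-wise ...
partial_sums = [sum(s[i] for s in all_shares) % PRIME for i in range(3)]
# ... and only combining the partial sums reveals the aggregate gradient.
aggregate = reconstruct(partial_sums)
print(aggregate)  # 158 = 17 + 42 + 99
```

Collusion resistance in the full scheme comes from the fact that all shares (or all colluding parties' views) are needed before any individual value can be recovered.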
In digital voting systems, the combination of Fully Homomorphic Encryption (FHE) and blockchain technology guarantees the security and privacy of E-voting. However, the overall performance of existing schemes is constrained by the complex computation process of the FHE algorithm, especially in terms of vote-counting efficiency and fairness. To address these issues, this paper proposes a Blockchain E-voting Scheme based on Fully Homomorphic Encryption (BCEVS-FHE). This scheme optimizes the Brakerski-Fan-Vercauteren (BFV) FHE algorithm by mitigating the impact of the noise factor to reduce the computational overhead during encryption and decryption, thereby improving vote-counting efficiency. The SM2 digital signature algorithm is used to sign ballot information generated by voters, ensuring that voters cannot deny their voting behavior and preventing identity impersonation and fraud. Furthermore, smart contracts are introduced to improve the weighting method used for vote tallying. Consequently, the unforgeability and non-tampering of voter weights are ensured, thereby guaranteeing the fairness and impartiality of the voting process. Finally, all transaction information is stored on a private blockchain, ensuring that the entire voting process is tamperproof and fully traceable. Experimental results show that BCEVS-FHE not only guarantees security attributes such as privacy, confidentiality, security, uniqueness, and verifiability but also excels in functional attributes such as fairness and mobility. Overall, BCEVS-FHE meets the security requirements of E-voting protocols and has high potential for practical application, which is of significant value for the widespread adoption of digital voting systems.
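The core property exploited for encrypted tallying — combining ciphertexts adds the underlying votes — can be demonstrated with a toy additively homomorphic scheme. Note this sketch uses textbook Paillier, not the lattice-based BFV scheme of the paper, and the tiny key size makes it insecure; it only illustrates the homomorphic-tally idea:

```python
from math import gcd

# Toy Paillier cryptosystem (stand-in for BFV; NOT secure at this key size).
p, q = 10007, 10009                             # small primes, illustration only
n = p * q
n2 = n * n
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)    # lcm(p-1, q-1)
mu = pow(lam, -1, n)                            # valid because g = n + 1

def encrypt(m, r):
    """Encrypt vote m with randomness r coprime to n."""
    assert 0 <= m < n and gcd(r, n) == 1
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return (((pow(c, lam, n2) - 1) // n) * mu) % n

# Homomorphic tally: multiplying ciphertexts adds the plaintext votes,
# so the tally is computed without decrypting any individual ballot.
ballots = [1, 0, 1, 1, 0]                       # 1 = yes, 0 = no
tally_ct = 1
for i, b in enumerate(ballots):
    tally_ct = (tally_ct * encrypt(b, 2 + i)) % n2
print(decrypt(tally_ct))  # 3 yes votes
```

In the paper's scheme the same additive structure is provided by BFV, with smart contracts applying the voter weights to the encrypted tally.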
To address the task offloading problem in a multi-base-station, multi-task Mobile Edge Computing (MEC) environment, considering parallel transmission between base stations, task offloading delay, and edge server load, a task offloading strategy balancing system delay and load is proposed. To solve the optimization problem, a task offloading method called IPSO, based on an improved Particle Swarm Optimization (PSO) algorithm, is proposed. This algorithm optimizes the initial solution space of the PSO algorithm, uses a Lévy flight strategy to update the velocity vector of each particle to effectively avoid local optima, and introduces the elite retention strategy of the genetic algorithm to obtain a task offloading policy that stably reduces edge server load. The IPSO algorithm is compared with Genetic Algorithm-Binary Particle Swarm Optimization (GA-BPSO), PSO, the Artificial Hummingbird Algorithm (AHA), the Genetic Algorithm (GA), and a random coding algorithm. The experimental results show that the delay and load standard deviation of the IPSO algorithm under different task numbers and edge server counts are lower than those of the other five algorithms. As the number of tasks increases, the system delay of IPSO is 3.04%, 4.63%, 6.79%, 8.94%, and 12.7% lower than that of the other algorithms, respectively, and the load standard deviation is 16.2%, 26.4%, 62.8%, 71.3%, and 91.5% lower, respectively.
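The Lévy-perturbed velocity update at the heart of IPSO can be sketched as follows. The inertia and acceleration coefficients, the Lévy step scale, and Mantegna's algorithm for drawing Lévy-distributed steps are standard choices assumed here, not parameters reported in the abstract:

```python
import math, random

def levy_step(rng, beta=1.5):
    """Mantegna's algorithm: draw one heavy-tailed, Levy-distributed step."""
    sigma = (math.gamma(1 + beta) * math.sin(math.pi * beta / 2) /
             (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.gauss(0, sigma)
    v = rng.gauss(0, 1)
    return u / abs(v) ** (1 / beta)

def update_velocity(v, x, pbest, gbest, rng, w=0.7, c1=1.5, c2=1.5):
    """Standard PSO velocity update plus a small Levy perturbation,
    whose occasional long jumps help particles escape local optima."""
    return [w * vi + c1 * rng.random() * (pb - xi)
                   + c2 * rng.random() * (gb - xi)
                   + 0.01 * levy_step(rng)
            for vi, xi, pb, gb in zip(v, x, pbest, gbest)]

rng = random.Random(42)
v = update_velocity([0.0, 0.0], [1.0, 2.0], [0.5, 1.5], [0.0, 1.0], rng)
print(len(v))  # one velocity component per decision dimension
```

For the offloading problem each particle would encode a candidate assignment of tasks to edge servers, with the fitness combining system delay and load standard deviation.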
This study investigates adaptive cooperative task offloading and allocation in a multiple Unmanned Aerial Vehicle (UAV) collaborative mobile edge computing network. To enhance collaboration among UAVs in a time-varying environment and improve the efficiency of task execution, this study constructs a UAV task queuing model in a time-varying environment and establishes a UAV task offloading decision model based on the Markov Decision Process (MDP). Moreover, this study proposes a Cooperative-based Deep Deterministic Policy Gradient (CODDPG) algorithm to address the multi-UAV offloading optimization problem. The CODDPG algorithm, which integrates CommNet with the traditional Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm, facilitates the sharing of environmental observations among all UAVs. This approach effectively extends the UAVs' perception of the environment and enhances their collaborative decision-making capability. It also addresses the issue of local optima in the MADDPG algorithm caused by its sole dependence on local information during agent training, thereby minimizing total computation delay. Experimental results demonstrate that the CODDPG algorithm not only significantly reduces task computation delay but also converges faster than the traditional MADDPG algorithm.
This study proposes an improved elastic scaling strategy based on a composite algorithm that combines entropy-weighted utilization and a prediction model, to address the issues of single-metric evaluation, latency, and low resource utilization in the built-in elastic scaling strategy of Kubernetes. The entropy-weighted utilization composite algorithm calculates the comprehensive load value of the Kubernetes cluster by considering both the distribution differences (information entropy method) and overall trends (average utilization weight method) of resource utilization across different nodes, thereby solving the problem of single-metric evaluation. Next, this study constructs a predictive model that combines Adaptive Variational Mode Decomposition (AVMD) and an attention-enhanced Long Short-Term Memory (LSTM) network to solve the latency and low resource utilization issues by predicting load changes. This model enables the system to respond quickly, expand its capacity at the onset of high traffic, and rapidly scale down to release resources once traffic subsides. Experimental results show that the improved elastic scaling strategy reduces the response time by 52% during the early stage of burst traffic compared with the default Kubernetes scaling strategy, and it rapidly scales down after the traffic subsides to release resources, demonstrating high practical application value.
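The information-entropy half of the composite load calculation can be sketched as below. This variant normalizes each metric column by its sum, a common formulation of the entropy weight method; the utilization matrix is hypothetical and the paper's exact normalization and the averaging half of the composite are not shown:

```python
import math

def entropy_weights(matrix):
    """Information-entropy weights for the columns (metrics) of a
    node-by-metric utilization matrix; metrics with more dispersion
    across nodes carry more information and get larger weights."""
    m = len(matrix)
    k = 1.0 / math.log(m)
    raw = []
    for j in range(len(matrix[0])):
        col = [row[j] for row in matrix]
        total = sum(col)
        ps = [c / total for c in col]
        entropy = -k * sum(p * math.log(p) for p in ps if p > 0)
        raw.append(1 - entropy)          # low entropy -> high weight
    s = sum(raw)
    return [w / s for w in raw]

# Hypothetical CPU / memory / network utilization for three nodes.
util = [[0.80, 0.40, 0.30],
        [0.20, 0.45, 0.35],
        [0.50, 0.42, 0.33]]
w = entropy_weights(util)
# Comprehensive load per node as the weighted sum of its metrics.
load = [sum(wj * row[j] for j, wj in enumerate(w)) for row in util]
print([round(x, 3) for x in w])  # CPU varies most across nodes, so it dominates
```

A scaling decision would then compare each node's comprehensive load (or the cluster aggregate) against thresholds, instead of a single CPU metric.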
With advances in communication technology, the Internet of Things (IoT) has played an increasingly important role in daily life, and the application of Unmanned Aerial Vehicle (UAV) communication in the IoT has been widely studied. UAVs are used as mobile data collectors to collect data from Sensor Nodes (SNs) in a Wireless Sensor Network (WSN). The Age of Information (AoI) is introduced as an index for evaluating network performance. This paper proposes a data collection scheme based on UAV trajectory design and a scheduling strategy for SNs. Based on this scheme, a weighted minimization model is constructed to minimize the weighted sum of the Average AoI (AAoI) and energy consumption of the SNs by optimizing the trajectory of the UAV, the scheduling of the SNs, and the transmission power. This mixed-integer nonlinear problem is difficult to solve directly. Therefore, the path discretization method is first used to discretize multiple continuous variables. Subsequently, a joint optimization algorithm based on Block Coordinate Descent (BCD) and Successive Convex Approximation (SCA) is proposed to obtain a locally optimal solution that satisfies the Karush-Kuhn-Tucker (KKT) conditions. The simulation results show an effective balance between the AoI and the energy consumption of the SNs, demonstrating the feasibility of the proposed scheme.
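The AAoI objective can be made concrete with a small simulation: each node's age grows by one per slot and resets when the UAV collects from it. The reset-to-one convention and the toy schedules are illustrative assumptions, not the paper's model details:

```python
def average_aoi(schedule, n_nodes, horizon):
    """Average Age of Information over all nodes and slots: each node's
    age grows by one per slot and resets to one when it is served.
    `schedule[t]` gives the node served in slot t (absent -> none served)."""
    age = [1] * n_nodes
    total = 0
    for t in range(horizon):
        served = schedule.get(t)
        age = [1 if i == served else a + 1 for i, a in enumerate(age)]
        total += sum(age)
    return total / (horizon * n_nodes)

# Round-robin over two sensor nodes versus always serving node 0.
rr  = average_aoi({t: t % 2 for t in range(6)}, n_nodes=2, horizon=6)
one = average_aoi({t: 0 for t in range(6)}, n_nodes=2, horizon=6)
print(rr, one)  # 1.5 2.75 -- round-robin keeps the average age lower
```

The paper's optimization trades such an AAoI term against SN transmit energy, with the UAV trajectory determining which nodes can be served in which (discretized) path segments.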
5G base stations can establish wireless networks through Integrated Access and Backhaul (IAB) technology, enabling information silos formed by disasters to rapidly restore public mobile communication. However, the link quality between base stations that have suffered blockages is severely weakened, degrading system reliability. To solve this problem, this study proposes a Game-theory-based Cooperative Routing (GCR) scheme to improve the reliability of end-to-end packet delivery by motivating base stations to cooperate. First, a coalition formation mechanism is designed based on the coalition game model, where neighboring base stations can form coalitions according to network reachability and coalition scale. Coalition members can share the responsibilities and benefits of packet delivery to protect the reliability of multi-hop paths from random blockages. A policy updating mechanism is then designed based on an evolutionary game model, which enables base stations to adaptively adjust their probability of cooperation to promote the selection of a forwarding policy. Simulation results show that this method performs better in terms of average delay, packet loss, and throughput under blockages. In particular, the end-to-end effective service rate exceeds 95%, which proves that the scheme improves system reliability considerably.
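An evolutionary-game policy update of the kind described — a station raising its cooperation probability when cooperating pays better than its current mixed strategy — can be sketched with replicator-style dynamics. The payoff values and step size are hypothetical, and this is only the update rule, not the full coalition model:

```python
def update_cooperation(p, payoff_coop, payoff_defect, step=0.1):
    """Replicator-style update: the probability of cooperating grows in
    proportion to how much cooperation beats the current average payoff."""
    avg = p * payoff_coop + (1 - p) * payoff_defect
    p_new = p + step * p * (payoff_coop - avg)
    return min(max(p_new, 0.0), 1.0)

# If forwarding packets for the coalition yields a higher payoff than
# refusing, repeated updates push the station toward cooperation.
p = 0.3
for _ in range(50):
    p = update_cooperation(p, payoff_coop=1.0, payoff_defect=0.4)
print(round(p, 3))  # climbs from 0.3 toward 1.0
```

Under random blockages, the coalition shares the forwarding benefit, which keeps `payoff_coop` above `payoff_defect` and stabilizes cooperative forwarding.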
High-resolution climate data are crucial for local- and regional-scale production and livelihoods. Deep learning-based downscaling techniques can effectively bridge the gap between existing low-resolution climate data and application requirements. However, existing methods are often constrained by fixed scaling factors, which leads to high training costs in multiscale scenarios, and their results for climate data are usually blurred and inaccurate in terms of high-frequency details. To address these limitations, this study proposes a deep learning super-resolution network that fuses implicit neural representations and adaptive feature encoding for arbitrary-scale climate downscaling. This method designs a dynamic pixel feature aggregation module to dynamically adjust the feature-encoding process using a learnable modulator, which can adapt to different scaling factors. Additionally, an implicit neural representation of the images is designed to predict continuous-domain pixel values by fusing coordinate linear difference features and neighborhood nonlinear features via an attention mechanism. Finally, combined with a high-order degradation training strategy, experiments on the ECMWF HRES and ERA5 datasets demonstrate that the proposed method achieves a Peak Signal-to-Noise Ratio (PSNR) improvement of at least 0.7 dB at a 2× scaling factor compared to fixed-ratio methods and outperforms existing arbitrary-ratio methods by at least 0.48 dB under the same scaling condition. These quantitative results demonstrate that the proposed approach is superior to existing methods, providing a more flexible and efficient solution for meteorological data processing.
As an important aspect of engine health management, predicting the health state of aero-engines can provide a quantitative basis for improving aircraft reliability and reducing engine maintenance costs. However, traditional aero-engine health state prediction methods pay insufficient attention to interpretability, resulting in decreased support for engine maintenance decision-making, particularly under condition-dependent maintenance. Considering the demand for interpretability in engine health state prediction, this study proposes an interpretable prediction method for the health state of aero-engines based on the EnsembleBRB-SHAP approach. First, a data-driven approach is used to train multiple sub-models for aero-engine health state prediction based on a Belief Rule Base (BRB). Subsequently, an EnsembleBRB model is constructed for predicting the health state of aero-engines so that it utilizes multiple sources of uncertain data while ensuring prediction accuracy. Based on the SHapley Additive exPlanations (SHAP) framework, the constructed EnsembleBRB model is analyzed and interpreted to identify key features and achieve interpretable prediction of the aero-engine health state. Finally, the feasibility and effectiveness of the proposed method are verified using experimental monitoring data of engine faults recorded with the Commercial Modular Aero-Propulsion System Simulation software. Experimental results show that the Mean Square Error (MSE) of the proposed method in predicting the health status of aero-engines is 0.0122. Through local and global interpretability analyses, the Low-Pressure Turbine (LPT) coolant bleed and physical fan speed are identified as the key parameters determining engine health status, which can better support decision-making for aero-engine health management.
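The quantity that SHAP attributes to each feature is the Shapley value: a feature's average marginal contribution over all feature subsets. For a tiny model it can be computed exactly, which makes the idea concrete; the two-feature scoring function below is entirely hypothetical, chosen only to echo the paper's key parameters:

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value_fn):
    """Exact Shapley values: each feature's marginal contribution averaged
    over all subsets of the other features, with the classic ordering
    weights. This is the quantity SHAP approximates for large models."""
    n = len(features)
    phi = {f: 0.0 for f in features}
    for f in features:
        rest = [g for g in features if g != f]
        for r in range(n):
            for subset in combinations(rest, r):
                weight = factorial(r) * factorial(n - r - 1) / factorial(n)
                phi[f] += weight * (value_fn(set(subset) | {f})
                                    - value_fn(set(subset)))
    return phi

# Hypothetical health-state score: fan speed matters twice as much as
# coolant bleed, and the two interact slightly.
def score(present):
    s = 0.0
    if "fan_speed" in present:     s += 2.0
    if "coolant_bleed" in present: s += 1.0
    if {"fan_speed", "coolant_bleed"} <= present: s += 0.5
    return s

phi = shapley_values(["fan_speed", "coolant_bleed"], score)
print(phi)  # the 0.5 interaction term is split evenly between the two
```

By construction the attributions sum to the full-model score, which is what lets SHAP decompose an individual prediction into per-feature contributions.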
In a hybrid autonomous traffic scenario, where Connected Automated Vehicles (CAVs) and Human-driven Vehicles (HVs) coexist in a 6G network environment, vehicles automatically form platoons. Reducing the distance between vehicles increases the traffic volume on the road, so the stability of the resulting platoon is worth studying. Platoon stability ensures driving safety between vehicles and alleviates traffic congestion. A hybrid autonomous traffic stability analysis method based on Digital Twin (DT) technology is proposed, using an enhanced DT model to evaluate system performance without interrupting the current traffic state. First, considering environmental and vehicle transmission system factors such as weather conditions, road conditions, loads, and transmissions, as well as communication delays between CAVs and their DTs, an accurate and interpretable enhanced DT model is constructed in a model-driven manner based on the vehicle transmission system and longitudinal dynamics. This model improves the efficiency, reliability, and safety of intelligent transportation. Subsequently, stability and string stability analyses are conducted on the constructed enhanced DT system, and the critical delay for the stability of the hybrid autonomous transportation system and the control gain conditions for the string stability of the CAVs are derived. Finally, we analyze the impact of environmental data bias on enhanced DT systems in different traffic states and determine the effective parameter range for DT predictability. The numerical simulation results show that the proposed method can quickly determine the stability of hybrid autonomous transportation systems and obtain an effective parameter range for DT predictability.
The wide application of blockchain in the medical field has gradually drawn attention to the effective supervision of diversified, sensitive, and continuously growing medical centralized purchasing data. However, owing to complex business relationships, existing supervision schemes for medical centralized purchase data are inefficient in multi-department collaborative supervision processes, and shared supervision data carry the risk of privacy disclosure. Therefore, based on multi-chain collaboration, this paper proposes a supervision scheme for medical centralized purchase data that supports secure sharing. The scheme constructs a multi-chain collaborative supervision framework based on a supervision relay chain, summarizes the supervision elements and supervised data objects of the medical centralized purchase business, and forms a comprehensive view of multi-chain collaborative supervision and cross-chain interaction. Multiple regulatory information flows are described through the multi-chain collaborative regulatory model, and the regulatory elements are organized as a structured list of cross-sectoral comprehensive regulatory matters to support multi-sector, multi-link, and cross-chain supervision. During the implementation of multi-chain supervision, the large amount of regulated data generated by the medical centralized purchase business is symmetrically encrypted and stored in the InterPlanetary File System (IPFS), reducing the storage burden on the blockchain. Proxy Re-Encryption (PRE) technology is introduced to ensure the secure sharing of symmetric keys and metadata between multiple chains, and a searchable encryption algorithm is integrated to support the retrieval of the IPFS file address ciphertext from the chain. Through an analysis of the medical centralized purchase business, a security analysis of the data supervision process flow is carried out. 
Then, the performance of chain codes such as authorization, upload, and query in the collaborative supervision process is evaluated. Experimental results show that the proposed scheme is secure and efficient, is more suitable for the medical field than similar schemes, and meets the requirements of multi-sector and multi-link collaborative supervision and data security sharing.
To improve the efficiency of underground safety management and realize safe coal mine production, a coal mine unsafe behavior corpus containing 8 entity categories and 2 359 samples is constructed using a BIO labeling strategy, based on the relevant standards and norms of the coal mining industry as well as insights into underground unsafe behavior. Aiming at the problems of insufficient utilization of semantic information, unbalanced entity distribution, and fuzzy entity boundaries in the named entity recognition task for unsafe behavior in coal mines, this study proposes a named entity recognition model based on Global Pointer and adversarial training. First, an improved hierarchical RoBERTa model is used to make full use of multi-layer semantic information to enhance the text vectorization of underground unsafe behavior, and the word embedding layer is perturbed through adversarial training to alleviate the problem of data imbalance and enhance model robustness. Second, a Bidirectional Gated Recurrent Unit (BiGRU) is used in the feature extraction layer to more effectively capture the contextual semantic features of the corpus and enhance the semantic associations of the text. Finally, Global Pointer is constructed in the decoding layer to obtain more accurate entity boundary recognition results. The effectiveness of the proposed model is evaluated on a self-built small-sample dataset of unsafe behavior in underground coal mines. The results show that the precision, recall, and F1 value of the proposed model are 78.77%, 78.20%, and 78.48%, respectively, which are 2.27, 0.63, and 1.45 percentage points higher than those of the BERT-Global Pointer model. The findings provide a basis for constructing a knowledge graph of unsafe behavior in underground mines.
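The BIO labeling strategy used to build the corpus assigns B-&lt;type&gt; to the first token of an entity, I-&lt;type&gt; to subsequent tokens, and O elsewhere. A minimal sketch, with a hypothetical English sample and entity types (the paper's 8 categories are not listed in the abstract):

```python
def bio_tags(tokens, spans):
    """BIO labeling: the first token of each entity span gets B-<type>,
    the remaining tokens get I-<type>, and everything else gets O."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:      # end index is exclusive
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags

# Hypothetical annotated sample describing an unsafe behavior.
tokens = ["worker", "enters", "blind", "roadway", "without", "gas", "check"]
spans = [(0, 1, "PERSON"), (2, 4, "LOCATION"), (5, 7, "BEHAVIOR")]
print(bio_tags(tokens, spans))
```

The Global Pointer decoder then predicts the (start, end) span pairs directly, which is why it yields sharper entity boundaries than token-by-token BIO decoding.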
Aiming at the problems at an actual rebar binding construction site, such as multilayered rebar mesh, complex operating environments and lighting, and dense components, and to realize accurate positioning of rebar binding nodes, a joint localization method for rebar binding nodes based on binocular stereoscopic matching and binding state recognition is proposed, starting from the actual needs of multilayer rebar skeleton plane binding. This method is based on joint target recognition with binocular vision. First, the feature extraction network of AnyNet is improved by introducing an Hourglass feature extraction network and an Efficient Channel Attention Network (ECANet) to improve matching accuracy in the rebar mesh region. As the multilayer rebar mesh has a complex structure and interlayer relationships, the target binding work layer is obtained by depth filtering. Second, a binding node localization model based on rebar skeleton extraction is proposed according to the characteristics of the target binding work. Additionally, the coordinates of the rebar binding nodes are obtained by extracting the rebar skeleton and fitting its equation. Finally, the state of the binding nodes is identified by a lightweight YOLOv5 to output the coordinates of the points to be bound. The experimental results show that the 3-Pixel Error (3PE) of the benchmark network AnyNet is 8.16%, whereas that of the proposed algorithm is only 3.72%, which effectively improves matching accuracy. The proposed algorithm can filter out the interference of deep-seated rebar, and the average spatial localization error of rebar binding nodes is 5.03 mm, which satisfies the demands of rebar binding against complex backgrounds.
Existing traffic congestion prediction methodologies are based on simplistic definitions of congestion indices and fail to effectively integrate static and adaptive graph information. To address these issues, this paper proposes an innovative Traffic Congestion Index (TCI) and a novel traffic congestion prediction model based on static-adaptive graph fusion, called SA-GFSTCN. The TCI is defined based on three metrics, namely average speed, traffic flow, and occupancy rate, which collectively reflect road usage and traffic conditions. The model employs a parallel architecture that processes the input data using spatiotemporal convolution and spatiotemporal attention modules to model the static road network structure and extract fixed structural information along with spatiotemporal characteristics. Concurrently, adaptive graph convolution and gated temporal convolution are used to process adaptive graph data and extract dynamic spatiotemporal associative features. Finally, a cross-attention mechanism effectively fuses the outputs of the adaptive graph convolution and gated temporal convolution. Experiments conducted on two real-world traffic datasets demonstrate that the SA-GFSTCN model outperforms the optimal baseline model in terms of Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), and Root Mean Square Error (RMSE). Specifically, it achieves reductions of 0.27 and 0.20 in MAE, 0.22 and 0.23 percentage points in MAPE, and 0.38 and 0.36 in RMSE, respectively, across the two datasets compared to the baseline model. These results validate the effectiveness of the proposed model.
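A composite index over average speed, traffic flow, and occupancy can be sketched as a normalized weighted sum. The abstract does not give the paper's actual TCI formula, so the normalization constants and weights below are hypothetical, chosen only so that slower speeds, near-capacity flow, and high occupancy all raise the index:

```python
def congestion_index(avg_speed, flow, occupancy,
                     free_speed=80.0, capacity=2000.0,
                     weights=(0.4, 0.3, 0.3)):
    """Hypothetical composite congestion index in [0, 1], combining a
    speed deficit term, a flow-to-capacity term, and occupancy.
    (Not the paper's TCI; constants and weights are illustrative.)"""
    w_s, w_f, w_o = weights
    speed_term = 1.0 - min(avg_speed / free_speed, 1.0)   # slower -> higher
    flow_term = min(flow / capacity, 1.0)                 # fuller -> higher
    return w_s * speed_term + w_f * flow_term + w_o * occupancy

free_flow = congestion_index(avg_speed=75, flow=300, occupancy=0.05)
jam = congestion_index(avg_speed=10, flow=1900, occupancy=0.85)
print(round(free_flow, 3), round(jam, 3))  # jam scores far higher
```

Sequences of such per-segment index values are what the prediction model would forecast from the fused static and adaptive graph features.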