Traditional machine learning algorithms perform well only when the training and testing sets are identically distributed. They cannot perform incremental learning for new categories or tasks that were not present in the original training set. Continual learning enables models to learn new knowledge adaptively while preventing the forgetting of old tasks. However, they still face challenges related to computation, storage overhead, and performance stability. Recent advances in pre-training models have provided new research directions for continual learning, which are promising for further performance improvements. This survey summarizes existing pre-training-based continual learning methods. According to the anti-forgetting mechanism, they are categorized into five types: methods based on prompt pools, methods with slow parameter updating, methods based on backbone branch extension, methods based on parameter regularization, and methods based on classifier design. Additionally, these methods are classified according to the number of phases, fine-tuning approaches, and use of language modalities. Subsequently, the overall challenges of continual learning methods are analyzed, and the applicable scenarios and limitations of various continual learning methods are summarized. The main characteristics and advantages of each method are also outlined. Comprehensive experiments are conducted on multiple benchmarks, followed by in-depth discussions on the performance gaps among the different methods. Finally, the survey discusses research trends in pre-training-based continual learning methods.
The generation of High-Definition (HD) environmental semantic maps is indispensable for environmental perception and decision making in autonomous driving systems. To address the modality discrepancy between cameras and LiDARs in perception tasks, this paper proposes an innovative multimodal fusion framework, HDMapFusion, which significantly improves semantic map generation accuracy via feature-level fusion. Unlike traditional methods that directly fuse raw sensor data, our approach innovatively transforms both camera images and LiDAR point cloud features into a unified Bird's-Eye-View (BEV) representation, enabling physically interpretable fusion of multimodal information within a consistent geometric coordinate system. Specifically, this method first extracts visual features from camera images and 3D structural features from LiDAR point clouds using deep learning networks. Subsequently, a differentiable perspective transformation module converts the front-view image features into a BEV space and the LiDAR point clouds are projected into the same BEV space through voxelization. Building on this, an attention-based feature fusion module is designed to adaptively integrate the two modalities using weighted aggregation. Finally, a semantic decoder generates high-precision semantic maps containing lane lines, pedestrian crossings, road boundary lines, and other key elements. Systematic experiments conducted on the nuScenes benchmark dataset demonstrate that HDMapFusion significantly outperforms existing baseline methods in terms of HD map generation accuracy. These results validate the effectiveness and superiority of the proposed method, offering a novel solution to multimodal fusion in autonomous driving perception.
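As a rough illustration of the attention-based BEV fusion idea described above, the sketch below weights two already-aligned BEV feature maps with learned per-location weights. The module name, channel count, and grid size are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of attention-weighted BEV feature fusion (hypothetical module name;
# assumes camera and LiDAR features are already projected into the same BEV grid).
import torch
import torch.nn as nn

class BEVAttentionFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Predict per-location weights for the two modalities from their concatenation.
        self.weight_net = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 2, kernel_size=1),
        )

    def forward(self, bev_cam: torch.Tensor, bev_lidar: torch.Tensor) -> torch.Tensor:
        # bev_cam, bev_lidar: (B, C, H, W) features in a shared BEV coordinate frame.
        weights = torch.softmax(self.weight_net(torch.cat([bev_cam, bev_lidar], dim=1)), dim=1)
        return weights[:, 0:1] * bev_cam + weights[:, 1:2] * bev_lidar

# Usage: the fused BEV features would feed a semantic decoder (not shown).
fusion = BEVAttentionFusion(channels=64)
fused = fusion(torch.randn(1, 64, 200, 200), torch.randn(1, 64, 200, 200))
```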
Large Language Models (LLM) have demonstrated outstanding performance in natural language processing tasks. However, their extremely large parameter scales pose a significant challenge because the limited capacity of GPU memory becomes a performance bottleneck for inference tasks. To address this issue in the context of LLM inference services, this study proposes AdaptiveLLM, which enables the adaptive selection of offloading strategies between tensor swapping and tensor recomputation based on the characteristics of inference task workloads. To evaluate the characteristics of inference task workloads, AdaptiveLLM establishes a black-box Machine Learning (ML) model through an operator-level computational complexity analysis to predict the overhead of tensor recomputation. It also predicts the overhead of tensor swapping by conducting a fine-grained analysis of KV Cache memory usage. For the adaptive selection of offloading strategies, AdaptiveLLM designs a cost-aware memory optimization strategy specifically for the pre-emption scheduling phase. When GPU memory is insufficient, it opts for the offloading approach with a lower overhead. For the initiation scheduling phase, it devises a fairness-based user-request scheduling strategy. When GPU memory is available, it schedules more user requests in accordance with the principle of fairness. Experimental results indicate that, compared with currently widely used LLM inference benchmark frameworks, AdaptiveLLM achieves an overall increase in throughput while reducing the average weighted turnaround time, thereby realizing fair scheduling.
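The cost-aware choice between swapping and recomputation can be pictured with the small sketch below. The cost models, function names, and constants are hypothetical stand-ins for AdaptiveLLM's learned predictors, shown only to make the decision rule concrete.

```python
# Illustrative sketch of a cost-aware offloading decision: when GPU memory is
# insufficient, pick the cheaper of tensor swapping and tensor recomputation for
# a KV Cache block. All names and constants are hypothetical.

def predict_swap_cost(kv_bytes: int, pcie_gb_per_s: float = 16.0) -> float:
    # Swapping overhead ~ data volume / PCIe bandwidth (simplified model), in ms.
    return kv_bytes / (pcie_gb_per_s * 1e9) * 1e3

def predict_recompute_cost(num_tokens: int, ms_per_token: float = 0.05) -> float:
    # Recomputation overhead approximated by a per-token latency estimate,
    # standing in for an operator-level complexity model.
    return num_tokens * ms_per_token

def choose_offloading(swap_ms: float, recompute_ms: float) -> str:
    """Return the offloading strategy with the lower predicted overhead."""
    return "swap" if swap_ms <= recompute_ms else "recompute"

# Example: a 512-token request holding a 256 MiB KV Cache block.
swap_ms = predict_swap_cost(kv_bytes=256 * 2**20)
recompute_ms = predict_recompute_cost(num_tokens=512)
print(choose_offloading(swap_ms, recompute_ms))
```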
Advances in computational power and network technologies have driven robots toward miniaturization, swarm intelligence, and autonomous capabilities. Robot software deployed on robotic hardware must integrate diverse modules from low-level device drivers and controls to high-level motion planning and reasoning, resulting in increasingly complex architectures. A communication and programming framework for multi-robot systems—focusing on standardization, modularization, and platformization—can alleviate the complexity of programming robotic software. The development trends in robotic software and hardware architecture show that a swarm robotic system is a multi-domain, heterogeneous, and distributed system composed of computing nodes, actuators, sensors, and other hardware devices interconnected through wired or wireless networks. The heterogeneity of hardware devices makes it difficult to integrate software components into a single framework. This survey summarizes and analyzes existing robotic communication frameworks in terms of ease of use and portability, comparing their core features, such as programming models, heterogeneous hardware support, communication and coordination mechanisms between components, and programming languages. The survey then highlights the technical trends of advanced topics such as real-time virtualization, component orchestration, and fault tolerance. Moreover, this survey focuses on building a next-generation framework on a meta Operating System (OS) foundation, aiming to build a ubiquitous and integrated multi-robot software architecture for human-machine-object interactions.
This study explores the use of Artificial Intelligence (AI) technology throughout the lifecycle of neutron scattering experiments to determine how AI technology can revolutionize key aspects such as experimental apparatus, data acquisition, and data processing. The study begins by introducing the fundamental principles and experimental procedures of neutron scattering technology before focusing on the multifaceted applications of AI technology in neutron scattering experiments. These applications include optimizing experimental infrastructure, data acquisition, and imaging preprocessing, as well as characterizing experimental samples in neutron diffraction, neutron reflection, and Inelastic Neutron Scattering (INS). This study demonstrates the importance of AI technology in increasing the intelligence level of experiments, accelerating data processing, and improving the accuracy and reliability of data analyses. In addition, the study discusses in depth the future application of AI technology in neutron scattering experiments, indicating that with the continuous advancement of technologies such as multimodal learning, interpretable models, large language models, and AI-Ready databases, AI technology is poised to bring revolutionary changes to neutron scattering experiments, opening up new avenues for revealing the microstructure and properties of complex material systems.
In urban rail transit, operators collect and analyze passenger trajectory data to identify and classify the travel patterns of individuals or groups, optimize resource allocation according to passenger travel characteristics, and improve passenger satisfaction. To obtain the characteristics of rail transit passengers, this study considers passenger travel trajectories in subway networks, treats travel trajectories whose stations largely overlap as similar, and designs a trajectory similarity evaluation algorithm. Based on this evaluation, a method for evaluating passenger characteristics is proposed, and the similarity matrix of passenger travel trajectories over a period is obtained. The similarity matrix is further optimized to obtain the travel pattern matrix for passengers. Experiments are conducted using real-world data from the Automatic Fare Collection (AFC) system in Shanghai. The results demonstrate that the proposed method is applicable to both individual passengers and groups. Among the 10 000 randomly selected passengers, 4 386 meet the travel frequency and regularity requirements, and the trips of these regular passengers account for 67.85% of all trips.
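To make the station-overlap notion of similarity concrete, here is a minimal sketch that scores two trips by the Jaccard overlap of their station sets and builds a pairwise similarity matrix. Representing a trip by its station set and using Jaccard overlap are simplifying assumptions for illustration, not the paper's exact measure.

```python
# Minimal sketch: station-overlap similarity between trips and the resulting
# similarity matrix (trip representation and overlap measure are assumptions).
import numpy as np

def trajectory_similarity(stations_a: set, stations_b: set) -> float:
    """Trips whose station sets overlap heavily are treated as similar."""
    if not stations_a or not stations_b:
        return 0.0
    return len(stations_a & stations_b) / len(stations_a | stations_b)

def similarity_matrix(trips: list) -> np.ndarray:
    n = len(trips)
    sim = np.eye(n)  # a trip is fully similar to itself
    for i in range(n):
        for j in range(i + 1, n):
            sim[i, j] = sim[j, i] = trajectory_similarity(trips[i], trips[j])
    return sim

trips = [{"S1", "S2", "S3"}, {"S2", "S3", "S4"}, {"S7", "S8"}]
print(similarity_matrix(trips))
```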
As the core module of a task-oriented dialogue system, Natural Language Understanding (NLU) aims to structurally represent user inputs in natural language and is generally decomposed into two subtasks: intent recognition and slot filling. Recently, jointly modeling the two tasks has become a common solution. However, in few-shot scenarios, the connection between the two tasks is difficult to establish from the small number of support-set samples. Moreover, owing to domain gaps, the general knowledge learned from resource-rich source domains cannot be directly transferred to target domains. Inspired by cloze tests, this paper takes the average vector of the non-slot words (labeled "O") as the sentence-pattern representation and proposes a Sentence Pattern Adaptive Prototype Network (SPAPN). In resource-rich source domains, the model fully learns the cross-domain semantic knowledge of sentence patterns and uses it as a hub to indirectly model the relationship between intents and slots. In low-resource target domains, the model adopts a meta-learning training mode and an attention mechanism to learn the correlations among the prototypes of intents, slots, and sentence patterns, thereby enhancing the semantic representations of the intent and slot prototypes; Comparative Alignment Learning (CAL) is then employed to determine the intent and slot labels based on the vector similarity between the query samples and these prototypes. Experiments conducted on Chinese and English benchmark datasets show that, with or without fine-tuning, the proposed method consistently outperforms state-of-the-art baselines in terms of intent accuracy, slot filling F1 score, and joint accuracy.
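The sentence-pattern representation itself is simple to state in code: average the contextual vectors of the tokens tagged "O". The sketch below is a minimal rendering under that reading; the tensor shapes and the BIO labels are illustrative.

```python
# Minimal sketch of the sentence-pattern representation: the mean of the
# contextual vectors of non-slot ("O") tokens (shapes and labels are illustrative).
import torch

def sentence_pattern(token_embeddings: torch.Tensor, slot_labels: list) -> torch.Tensor:
    """token_embeddings: (seq_len, hidden); slot_labels: BIO tags, one per token."""
    mask = torch.tensor([label == "O" for label in slot_labels], dtype=torch.bool)
    if mask.any():
        return token_embeddings[mask].mean(dim=0)
    return token_embeddings.mean(dim=0)  # fall back when no "O" token exists

emb = torch.randn(6, 768)
labels = ["O", "B-city", "I-city", "O", "O", "B-date"]
pattern_vec = sentence_pattern(emb, labels)  # used as a hub linking intent and slot prototypes
```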
Driver fatigue is a major cause of traffic accidents, and driver fatigue state classification based on Electroencephalograms (EEGs) is an important task in the field of artificial intelligence. In recent years, deep learning models incorporating attention mechanisms have been widely applied to EEG-based fatigue recognition. Although these approaches have shown promise, several studies disregard the inherent characteristics of the EEG data themselves. Additionally, the mechanisms and effects of attention on the classifier remain underexplored, making it difficult to explain how different attention mechanisms affect classification performance. Therefore, this study takes the SEED-VIG data as the research object and adopts the ReliefF feature selection algorithm to construct optimized Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM) network, and Support Vector Machine (SVM) models based on self-attention, multi-head attention, channel attention, and spatial attention mechanisms. Experimental results on the EEG data in the SEED-VIG dataset show that the neural network models optimized with these attention mechanisms improve in terms of accuracy, recall, F1 score, and other indicators. Among them, the Convolutional Block Attention Module (CBAM)-CNN model, which enhances both spatial and channel information, achieves the best performance, with a mean accuracy of 84.7% and a standard deviation of 0.66.
Human Pose Estimation (HPE) is an important research task in the field of computer vision and is widely used in teaching scenarios. Currently, this task faces many challenges, such as reduced accuracy in complex scenarios, including cluttered backgrounds, small human body image scales, and occluded human bodies. Simultaneously, the flexibility and variability of human body postures require the model to have a good reasoning ability. This study proposes a geometric relationship-aware human pose representation learning model to address these problems. It uses the structured information of the human body to help the model better understand the relationship between different poses, thereby improving the accuracy and robustness of complex pose predictions to achieve effective application in classroom scenarios. The model includes four modules: channel reweighting, multi-token information interaction, limb direction construction, and adaptive loss propagation. The limb direction construction module implements the modeling of the geometric structure between the human body joints. This input clue helps the model capture the relative position and direction relationship between body parts. The channel reweighting module automatically selects and emphasizes the most helpful feature information for the pose estimation task, improving the expression ability of the visual features of the input image. The multi-token information interaction module, which is based on the Transformer encoder, realizes efficient interactions among image feature clues, joint coordinate clues, and limb direction cues. Finally, this study optimizes the traditional loss function in the adaptive loss propagation module to further improve the training effect and performance of the model. The model achieves accuracy rates of 76.1% and 90.3% on two mainstream datasets, COCO and MPII, respectively, outperforming some existing SOTA (State of the Art) models. The proposed model achieves more accurate and reasonable prediction results in complex scenarios.
In the field of multi-view time series prediction, effectively fusing information from different views is an important and challenging issue. Existing multi-view time series prediction methods have limitations in capturing historical data trends and are often affected by the inconsistent distributions of multi-view information. To address these two problems, this study builds on the Functional Neural Process (FNP) framework and proposes a Consistent FNP (CFNP) framework. The CFNP framework is designed with two core modules: a view random correlation graph module and a view distribution alignment module. The view random correlation graph module assists in understanding and predicting current data by analyzing the distribution of historical data. The view distribution alignment module reduces the difference in probability distributions between views and improves the temporal responsiveness of the model by imposing constraints in the latent space. Consequently, the model can capture the intrinsic correlations of the sequences. On two public datasets, the CFNP framework reduces the Root Mean Square Error (RMSE) by 14% and 5% compared to existing methods, demonstrating that it predicts multi-view time series more accurately.
Graph neural networks can effectively aggregate information from neighboring nodes and encode the structural information of sentences and are therefore widely applied in relation extraction tasks. However, current relation extraction methods based on graph neural networks often require external parsing tools to construct dependency trees, a process that may introduce errors and lead to incorrect information propagation. To address this issue, this paper proposes a Graph Convolutional Neural Network (GCN) model based on an association adjacency matrix for relation extraction. The model first utilizes the Robustly optimized BERT approach (RoBERTa) Pre-trained Language Model (PLM) to convert each word into a vector representation and calculates the association between word vectors using their dot products. Subsequently, based on the associations between words and relative entity position features, it constructs an association adjacency matrix and utilizes the GCN to extract the semantic structural features of sentences. Finally, it mitigates the gradient vanishing problem during training using residual connections and obtains the final classification representation by fusing sentence and entity representations. The model thus avoids the error propagation caused by external parsing tools. Experimental results demonstrate that, compared to existing graph convolution-based models, the proposed model achieves good performance in relation extraction on the Temporal Action and Relation Corpus (TACRED) and Re-TACRED datasets, with precision, recall, and F1 values of 68.8%, 77.5%, and 72.8% on TACRED and 90.5%, 91.3%, and 90.9% on Re-TACRED, respectively, validating the effectiveness and feasibility of the model.
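The core construction, building an adjacency matrix from word-vector dot products and running graph convolution over it, can be sketched as follows. This is a simplified rendering that omits the relative entity position features and residual connections; the layer name and dimensions are illustrative.

```python
# Sketch: association adjacency matrix from dot products of contextual word
# vectors, followed by one graph convolution (simplified; entity position
# features and residual connections omitted).
import torch
import torch.nn as nn

class AssociationGCNLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, word_vecs: torch.Tensor) -> torch.Tensor:
        # word_vecs: (seq_len, in_dim) contextual vectors, e.g., from RoBERTa.
        scores = word_vecs @ word_vecs.T                  # pairwise dot-product associations
        adj = torch.softmax(scores, dim=-1)               # row-normalized association matrix
        return torch.relu(self.linear(adj @ word_vecs))   # graph convolution step

layer = AssociationGCNLayer(768, 256)
out = layer(torch.randn(20, 768))  # (20, 256) structure-aware word features
```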
Because most test difficulty prediction schemes are labor-intensive, time-consuming, and prone to leakage, or to some extent subjective, they seriously affect the progress of intelligent education evaluation systems. Therefore, the use of neural networks to predict question difficulty automatically is of great significance. For this purpose, this study proposes a Multi-feature Attention-based Bidirectional Recurrent Neural Network model (M-ABRNN). First, the model retrieves computer-related knowledge to enrich the question stem information based on multi-feature task learning methods. Second, it mines the logical relationships in the objective question text data, extracts statement representations through a bidirectional recurrent neural network, and uses attention mechanisms to measure the importance of associated statements to the question. Finally, the obtained features are input into the model for training, and after training, the difficulty of each new question can be predicted automatically. On a university computer fundamentals course dataset, the proposed model significantly improves the Pearson Correlation Coefficient (PCC) and Degree of Agreement (DOA). These findings show that the model can effectively predict the difficulty of objective questions and evaluate question difficulty automatically.
Accurate photovoltaic power prediction is crucial for enhancing grid stability and improving energy utilization efficiency. To address the limitations of existing methods, which struggle to simultaneously consider both the long-term dependencies and short-term variation patterns of photovoltaic power, this study proposes a novel photovoltaic power prediction method named Solarformer. This method integrates a Pyramid Attention Module (PAM) with a Temporal Convolutional Network (TCN) to optimize the Transformer architecture. First, multiple feature selection mechanisms are employed to screen the input features and enhance the model's ability to characterize photovoltaic data features. Second, a coarse-grained construction module and the PAM are utilized to optimize the Transformer encoder, capturing the long-term temporal dependencies of photovoltaic power at multiple scales. Third, a constraint mechanism based on the sunrise-sunset effect of photovoltaic power and the TCN are employed to optimize the Transformer decoder, strengthening the model's ability to capture and model the short-term variation patterns of photovoltaic power. Experimental results on the Sanyo dataset from Australia demonstrate that Solarformer effectively improves photovoltaic power forecasting accuracy. Compared with the DLinear model, it reduces the Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and Symmetric Mean Absolute Percentage Error (SMAPE) by approximately 7.45%, 6.99%, and 14.10%, respectively.
Finding the global minimum of complex functions has broad applications in engineering computation and artificial intelligence. The multi-start algorithm is a commonly used heuristic approach for this task; however, it suffers from low computational efficiency. To address this issue, a new algorithm based on rejection sampling is proposed. The algorithm improves the initial point selection strategy to reduce computation time and the number of function evaluations while enhancing global convergence. Traditional multi-start algorithms obtain initial points via independent uniform sampling, which can lead to clustered starting points, regions containing no points, and low iterative efficiency. Inspired by the way the k-means++ algorithm selects initial cluster centers, a rejection sampling method is proposed. In each sampling round, this method enforces a distance threshold between a newly sampled point and the previously sampled points, ensuring a uniform spatial distribution of the sampling points; a mathematical proof of this property is provided. The experimental results demonstrate that, compared to independent uniform sampling, rejection sampling significantly improves optimization efficiency. For high-dimensional functions, the method reduces the number of objective function evaluations by up to 28%, and for problems with multiple global minima, the reduction reaches up to 41%. A chi-square test statistically verifies that the proposed method significantly enhances computational efficiency. Compared with currently prevalent optimization algorithms, the proposed algorithm exhibits clear advantages in terms of convergence and computation time. When accelerated with parallel computing, the algorithm reaches 90% parallel efficiency on 32 cores, significantly reducing computation time and demonstrating good scalability.
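The distance-threshold rejection sampling of initial points is easy to sketch. The following minimal version accepts a uniformly drawn candidate only if it is at least a chosen distance from every previously accepted point; the threshold value, bounds, and cap on attempts are illustrative choices, not the paper's settings.

```python
# Minimal sketch of distance-threshold rejection sampling for multi-start
# initial points (threshold, bounds, and retry cap are illustrative).
import numpy as np

def rejection_sample_starts(n_points: int, dim: int, bounds, min_dist: float,
                            rng=np.random.default_rng(0), max_tries: int = 10_000):
    lo, hi = np.asarray(bounds[0], float), np.asarray(bounds[1], float)
    points, tries = [], 0
    while len(points) < n_points and tries < max_tries:
        candidate = rng.uniform(lo, hi, size=dim)
        tries += 1
        # Reject candidates that fall too close to an already accepted point.
        if all(np.linalg.norm(candidate - p) >= min_dist for p in points):
            points.append(candidate)
    return np.array(points)

starts = rejection_sample_starts(n_points=20, dim=5,
                                 bounds=([-5] * 5, [5] * 5), min_dist=2.0)
print(starts.shape)  # each accepted start then seeds a local optimizer
```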
Simulated Annealing (SA) is an effective method for Bayesian Network Structure Learning (BNSL). However, when handling large-scale data, it requires significant search time. Moreover, to maintain parallel efficiency, the traditional multi-chain SA parallelization approach often requires a reduction in the number of iterations, which leads to insufficiently thorough searches when many threads are employed. Additionally, SA employs an optimal-selection update strategy during the information exchange process, which makes it prone to becoming trapped in local optima. To address these issues, this study proposes a BNSL algorithm based on a Parallel Prediction-Based SA (PPBSA) algorithm. The algorithm ensures search thoroughness during parallelization and can escape local optima during the information-exchange phase. In the annealing stage of PPBSA, several generations of predicted solutions following the current solution, together with their scores, are generated in parallel. This guarantees search depth while substantially accelerating the search by reducing the time spent generating and scoring subsequent solutions. When threads exchange information, a tabu list is used to restrict the search of thread solutions that have fallen into local optima, thereby enhancing the ability of the solutions to escape them. Furthermore, based on the decomposability of the BDeu score, the score difference before and after a perturbation in the SA process is calculated directly, significantly reducing computational redundancy. A series of experiments on a set of benchmark Bayesian networks compares the proposed algorithm with serial SA and other algorithms. The results demonstrate that the proposed algorithm achieves speedups of more than five times in some cases while maintaining accuracy.
Virtualization resource computing and load allocation in a Cloud-Radio Access Network (C-RAN) are studied. First, based on the C-RAN architecture, a system model is proposed as a virtualization evolution that incorporates all factors influencing computing resource usage. The system model includes user and traffic models, wireless network models, computing resource usage models, and overload prevention mechanisms. Second, two advanced heuristic allocation methods are proposed to allocate User Processing (UP) to the computing unit, namely the Baseband Unit (BBU); this allocation occurs only when a user terminal arrives at the system. The impact of the spatial user distribution on the utilization of virtual computing resources is also studied. Finally, by pooling resources, long-term load balancing is achieved while adapting to short-term load fluctuations caused by traffic changes and scheduling effects. System-level simulation results based on the average processing load show that the proposed heuristic allocation methods achieve significantly better overload performance and user experience than classical heuristic static allocation and heuristic random allocation methods. Even accounting for the impact on user experience, the proposed method can conserve 57% of computing resources compared with the other methods.
Existing dynamic pricing algorithms for edge computing are based on game-theoretic models and auction mechanisms. With the optimization objective of maximizing a service provider's total revenue, existing pricing algorithms have difficulty obtaining prior information about user utility, and most auction mechanisms favor local over global optimality when selecting prices. To address these problems, this study proposes a dynamic pricing algorithm for offloading edge computing tasks based on a Contextual Multi-Armed Bandit (CMAB). First, the dynamic pricing problem of edge computing is modeled as a CMAB. Next, a dynamic pricing algorithm for task offloading based on Thompson Sampling (TS) is designed, which uses a Bayesian posterior to guide the service provider's price selection and updates the corresponding parameters with the revenue reward obtained in each round, thereby effectively reducing the loss in total revenue during the dynamic pricing process. Finally, the effectiveness of the pricing algorithm is verified by simulating a real edge environment. The proposed pricing algorithm outperforms existing Multi-Armed Bandit (MAB) algorithms and pricing algorithms in terms of both expected cumulative regret and expected cumulative revenue.
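A bare-bones Thompson Sampling price selector, without the contextual part, looks like the sketch below: each candidate price is an arm with a Beta posterior over its acceptance probability, the provider samples from the posteriors and posts the price maximizing sampled expected revenue, then updates the posterior with the observed outcome. Price values, priors, and the simulated user response are hypothetical.

```python
# Illustrative Thompson Sampling price selector (non-contextual; prices,
# priors, and the simulated acceptance model are hypothetical).
import numpy as np

class TSPricer:
    def __init__(self, prices, rng=np.random.default_rng(42)):
        self.prices = np.asarray(prices, dtype=float)
        self.alpha = np.ones(len(prices))   # Beta posterior: acceptances + 1
        self.beta = np.ones(len(prices))    # Beta posterior: rejections + 1
        self.rng = rng

    def select(self) -> int:
        theta = self.rng.beta(self.alpha, self.beta)   # sampled acceptance rates
        return int(np.argmax(theta * self.prices))     # maximize sampled expected revenue

    def update(self, arm: int, accepted: bool) -> None:
        if accepted:
            self.alpha[arm] += 1
        else:
            self.beta[arm] += 1

pricer = TSPricer(prices=[1.0, 2.0, 3.0, 4.0])
sim_rng = np.random.default_rng(7)
for _ in range(1000):
    arm = pricer.select()
    accepted = sim_rng.random() < 0.9 / pricer.prices[arm]  # toy user-utility model
    pricer.update(arm, accepted)
```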
With the development of intelligent transportation technologies, the exchange of information between vehicles has become increasingly frequent. Establishing an effective reputation management mechanism for the Internet of Vehicles (IoV) is important because it provides the foundation and guarantee for implementing various IoV applications. This study proposes a hierarchical blockchain-based reputation management system to address issues such as the limited resources of the IoV, the low throughput of reputation management solutions, and the poor recognition rates of malicious vehicles. The system considers the impact range of traffic data and designs a layered blockchain architecture to manage the data hierarchically, thereby achieving distributed storage of traffic data and reputation values and improving storage efficiency. By aggregating the credibility of messages and the reputation values of message source nodes, a Bayesian inference model is used to determine the authenticity of events. A Directed Acyclic Graph (DAG) blockchain structure is used to design reputation weight evaluation, enriching the reputation value update algorithm and strengthening the management of reputation values. For the connected vehicle scenario, the main node selection and consensus process of the traditional Practical Byzantine Fault Tolerance (PBFT) algorithm are improved, and an asset-value-based consensus committee selection strategy is proposed. Simulation results show that, compared with existing schemes, the proposed method effectively identifies false information and maintains a correct event judgment rate of over 91%, even when the proportion of malicious nodes reaches 30%. In terms of availability, the system offers significant advantages in transaction latency and throughput over the PBFT algorithm and several improved solutions, and it can further expand the network scale. The proposed system is effective and feasible for distributed trust management in the IoV.
A linkable ring signature, a special type of ring signature, can verify whether two signatures were produced by the same user without compromising anonymity. This feature enables it to play an important role in blockchains. However, most currently available linkable ring signature schemes are inefficient, and some are at risk of forgery owing to their signature labels. This study constructs a new lattice-based linkable ring signature scheme using rejection sampling and provides a formal security proof of unforgeability in the random oracle model. Unlike existing schemes, which hide user identity with multiple rounds of hash functions, this scheme places the user's identity characteristics into the verification public keys. In other words, the real signer first expands the private key according to certain rules to form an effective ring signature private key and then uses rejection sampling to make the linkable ring signature indistinguishable. Consequently, the number of matrix-vector multiplication operations in the scheme is reduced, improving its efficiency and shortening the signature size. Furthermore, the private keys of this scheme are jointly generated by the key generation center and the users, and the label is computed by multiplying the private key by the public matrix. The scheme thus solves a problem faced by some existing schemes, namely that legitimate users can maliciously forge labels when signing, while still ensuring the anonymity of ring signatures. In addition, linkability can be proven in the random oracle model. The results show that this scheme has advantages in terms of computational efficiency and signature size.
In predictive maintenance systems, the vibration sensors used during the data collection phase may be subjected to human or environmental interference, leading to data anomalies. A secure and reliable integrated pre-detection scheme is proposed to ensure the reliability of the collected data. This scheme combines random open strategies, similarity detection methods, and sound source localization technologies to enhance the accuracy and reliability of the system in terms of both spatial and temporal dimensions. First, a random open strategy is used to ensure that the sensors are not subjected to directional interference, thereby enhancing the safety redundancy of the system. Second, a similarity detection method utilizes multi-dimensional distances to calculate the similarity of consecutive acceleration data collected by vibration sensors and compares it with a threshold to increase the sensitivity of the system to equipment status. Finally, a sound source localization technology analyzes the audio corresponding to abnormal similarities to determine the source location, further enhancing the precision of pre-detection. Experimental results in targeted testing environments indicate that in non-adversarial scenarios, the non-integrated scheme improves the accuracy and precision by 4 and 4.13 percentage points, respectively, compared to the integrated scheme and the recall remains the same. Conversely, in adversarial scenarios, the integrated scheme improves the accuracy and recall by 9.5 and 9.14 percentage points, respectively, compared to the non-integrated scheme, with the precision remaining constant.
With the continuous development of Internet technology, the circulation and trading of digital assets on Web 3.0 have become an important driver for effectively releasing the value of data. However, the current digital asset trading process still faces technical challenges such as uncontrollable data circulation and ambiguous ownership boundaries; the Web 3.0 environment in particular presents key challenges such as the controllable embedding of ownership and dynamic changes in data rights. This study proposes a three-right-separation data rights confirmation scheme for Web 3.0 digital asset trading based on blockchain technology and the chameleon signature algorithm. First, a chameleon proxy signature algorithm is designed on the basis of proxy signature and chameleon signature technology to solve the problem of ownership tags not being embedded under the control of the digital asset owner during data release. Second, a data trading protocol is constructed on the basis of blockchain technology to handle dynamic changes in ownership tags during data release and to achieve public verification of data rights. Finally, security and simulation analyses show that the ownership tags of this scheme cannot be forged and can safely and reliably confirm the rights over transaction data, meeting the practical needs of digital asset trading on Web 3.0.
Using domain name generation technology to identify nuisance website domains offers benefits such as broad coverage, the provision of substantial research data, and timely prevention of dissemination. Existing domain generation algorithms based on domain similarity face issues such as insufficient feature utilization, high redundancy in the generated domains, and a low concentration of nuisance website domains. To address these issues, this study proposes a new nuisance website domain name generation model based on semantic information and domain similarity. The proposed model employs a Transformer encoder to extract the semantic features of domain names and uses them to guide the generation process and enhance feature utilization. It improves Sequence Generative Adversarial Networks (SeqGANs) by separately focusing on semantic features for generation and contextual information for discrimination, thereby increasing the quality of the generated domains and the accuracy of the discriminator. The model detects generated domains through initial filtering, multitool rechecking, and final selection. Experimental results show that, compared to existing domain similarity-based generation models, the proposed model can discover more nuisance website domain names through its domain name generation mode and is advantageous in terms of generation quality, expansion rate, and active monitoring ability.
Several discriminative problems exist in copy-move forgery detection, including the difficulty of covering smooth regions with keypoints, the lack of color descriptive capability in feature representations, and insufficient feature matching accuracy. This study proposes a highly discriminative image copy-move forgery detection method. For keypoint extraction, an image is divided into different super-pixel regions according to texture, and keypoints are adaptively extracted from these regions so that they uniformly cover the smooth regions. For feature representation, a quaternion-based feature description method is proposed that accurately describes the color information of an image. For feature matching, a Reversed generalized 2 Nearest Neighbors (Rg2NN) matching algorithm is used to improve the matching accuracy for multiple keypoints. During post-processing, the detection results are obtained using a fast mean-residual normalized product correlation (NNPROD) algorithm. Experimental results demonstrate that the proposed algorithm achieves excellent accuracy across multiple benchmarks and is robust against common geometric and signal-processing attacks.
To improve the accuracy of predicting human depth images, a video-based human depth image estimation method called BiSTNet is proposed. Additionally, to fully mine three-dimensional (3D) information from videos, a bidirectional spatio-temporal feature learning model is introduced. This model uses two sequence directions, namely past and future frames, for feature learning and employs a bidirectional spatio-temporal feature attention model to enhance the influence of effective frames. Furthermore, a multiscale feature fusion prediction module is incorporated to predict precise depth images with rich local geometric details by effectively fusing bidirectional spatio-temporal and spatial features, thereby improving the accuracy of the 3D models reconstructed from the predicted depth images. During the model training process, constraints on the relative sequential relationships of human joints and a bidirectional sequence self-supervised learning strategy are utilized to improve prediction accuracy while reducing reliance on supervised data. The experimental results demonstrate that the BiSTNet method not only effectively reduces errors during prediction of depth images but also produces depth images with abundant details.
The registration of Magnetic Resonance Imaging (MRI) images and TRansrectal UltraSound (TRUS) images aligns preoperative MRI images with intraoperative ultrasound images, combining the advantages of the two modalities to quickly locate the lesion area. This process plays an important role in assisting diagnosis, puncture, intraoperative navigation, and other medical procedures. Owing to the inherent representational differences between the two modalities, with significant intensity distortion and deformation, finding an exact dense correspondence between them remains a challenge. To address this issue, this study proposes a weakly supervised deformable registration network framework based on joint learning and a Multi-level Wavelet Feature Pyramid (MWFP) to align MRI and TRUS images. Joint learning is a framework composed of a pre-trained semi-supervised segmentation network and a registration network. The segmentation and registration networks are trained alternately, with the segmentation network providing prostate label constraints for the global registration of the registration network, which effectively alleviates the problem of insufficient labels for registration. MWFP is a registration network built on multi-resolution wavelets. The multi-scale images generated by the wavelet pyramid filter out irrelevant noise and reduce the representational differences between the two modalities, improving the ability of the registration network to learn multi-scale features. In addition, a Multi-Scale Feature Fusion Attention (MSFFM) module is designed in the registration network to further screen the features and provide local dense correspondences for registration. Moreover, the deformed segmentation images and labels produced by the registration network are mixed with the original manual labels and images, and together with the pseudo-labels and corresponding images generated by the segmentation network, they are used as additional training input for the segmentation network, further improving multimodal image segmentation performance. Results on 642 publicly available prostate MRI images and a TRUS image biopsy dataset show that the proposed registration method achieves optimal Dice Similarity Coefficient (DSC), 95% Hausdorff Distance (HD95), Mutual Information (MI), and Structural Similarity (SSIM) values of 81.05%±1.77%, 12.83±1.49 mm, 18.12%±4.63%, and 27.12%±4.63%, respectively, which are superior to those of traditional registration methods and advanced deep learning registration methods. In addition, the average registration time of the proposed method is 0.18±0.02 s, nearly 400 times faster than that of the traditional methods. The experimental results show that the proposed registration method can accurately estimate the deformation field between prostate MRI and TRUS images in real time, with higher registration accuracy and speed.
Facial Expression Recognition (FER) plays a crucial role in smart education. Current recognition systems depend heavily on single prior image features, are limited by the ineffective integration of multiple image features in FER tasks, and have poor generalizability in recognizing facial expressions under natural environmental conditions. This study utilizes the large-scale visual model DINOv2 as a pre-training model, with its pre-trained weights frozen, and leverages its learned experience from natural image datasets to acquire more universal image features, thereby enhancing the generalization performance of feature extraction. Furthermore, this study proposes a hybrid feature network-based FER model HFFER that utilizes two different pre-trained models to acquire distinct features and effectively integrates them through cross-attention mechanisms and multiple convolutions. Experimental results demonstrate that the model achieves accuracies of 92.18% and 66.76% on the RAF-DB and AffectNet datasets, respectively, surpassing or being comparable to existing models. This study introduces a novel approach to facial expression recognition, and its application to real classroom images demonstrates its feasibility and potential in practical educational settings.
Deep supervised learning has made remarkable achievements in medical image segmentation. However, it depends heavily on a large amount of high-quality labeled medical image data, which are difficult to obtain. To address this issue, this paper proposes a Semi-Supervised Multi-scale Consistency Network (SSMC-Net) for medical image lesion segmentation. The network architecture of SSMC-Net is built upon a joint training framework that learns from both labeled and unlabeled data. Moreover, to alleviate the loss of details during down-sampling and up-sampling, a Multi-scale Subtraction (MS) module, comprising a Subtraction Unit (SU) and a Multiple Feature Fusion Unit (MFFU), is incorporated to capture a broader spectrum of differential features. The SU extracts differential information from the multi-scale encoding outputs, and the MFFU selectively merges the most correlated features to provide more precise interactive representations for the decoder. Finally, the loss function is redesigned: the supervised part comprehensively calculates the pixel-level information output at various resolutions, whereas the unsupervised part introduces a multi-scale joint consistency loss and designs a distance function to diminish the impact of unreliable samples. Ablation and comparative experiments on the CPD, ATLAS, and ACDC datasets demonstrate that the proposed method achieves superior performance in terms of the Dice Similarity Coefficient (DSC) and F2 value compared to existing semi-supervised segmentation methods, even with only 50% labeled data.
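One plausible reading of the subtraction idea is an element-wise difference between encoder features at adjacent scales after upsampling the coarser map; the sketch below renders that reading. The shapes, interpolation mode, and use of an absolute difference are assumptions for illustration, not the paper's exact SU.

```python
# Minimal rendering of a subtraction-style differential feature between two
# encoder scales (shapes and the |a - b| form are illustrative assumptions).
import torch
import torch.nn.functional as F

def subtraction_unit(feat_fine: torch.Tensor, feat_coarse: torch.Tensor) -> torch.Tensor:
    """feat_fine: (B, C, H, W); feat_coarse: (B, C, H/2, W/2)."""
    up = F.interpolate(feat_coarse, size=feat_fine.shape[-2:],
                       mode="bilinear", align_corners=False)
    return torch.abs(feat_fine - up)  # differential features passed on for fusion

diff = subtraction_unit(torch.randn(1, 64, 64, 64), torch.randn(1, 64, 32, 32))
```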
Achieving accurate Magnetic Resonance Imaging (MRI) liver image segmentation is of great significance in the field of medicine. It assists doctors in rapidly locating a target region, aids treatment, and plays a key role in postoperative observation. However, MRI images contain rich semantic information and considerable abnormal noise. Traditional convolutional operations have certain limitations in image processing: limited global modeling capability, restricted receptive fields, and difficulty in capturing global information. Moreover, the hierarchy of convolution-based networks cannot be made too deep because deep networks tend to increase the number of parameters and miss important semantic information at high resolutions. To address these problems, this study applies Transformers to image processing to establish global information associations, better capture global information, and achieve accurate target localization. However, the Transformer may destroy local details when processing fine-grained image features and performs poorly in providing inductive bias. To leverage the respective advantages of the Transformer and convolution, this study proposes a cascaded feature modeling method. First, coarse segmentation of the Region of Interest (RoI) is achieved by using the Medical Transformer (MedT) network, which uses fewer parameters and requires less computation, as the upstream network. The extracted RoI is then preprocessed and fed into a downstream U-Net network for secondary segmentation, where special attention is paid to local information to obtain finer prediction results. Experiments on the CHAOS dataset demonstrate that the proposed method achieves significant results in liver segmentation, with a Dice Similarity Coefficient (DSC) of 0.922 and an Intersection over Union (IoU) score of 0.877.
Traditional face recognition systems combine various bionic algorithms with Support Vector Machines (SVM) to form a face recognition model for the final classification step, selecting the optimal SVM parameters through algorithm iteration. However, this strategy suffers from low classification accuracy, long training time, and a tendency to fall into local optimal solutions. This paper proposes a face recognition method that uses an improved Artificial Hummingbird Algorithm (AHA) to optimize the SVM. First, the AHA is improved by introducing a chaotic sequence based on Tent mapping so that the hummingbird population is initialized more uniformly and the algorithm is less likely to fall into local optimal solutions. Second, the improved AHA is incorporated into the SVM-based face recognition method: by running the algorithm for a set number of iterations, the optimal parameters for the SVM are selected to improve face recognition accuracy. The improved AHA is compared with the Grey Wolf Optimizer (GWO), Sparrow Search Algorithm (SSA), and Whale Optimization Algorithm (WOA) and converges faster on benchmark functions. In a face recognition experiment on the ORL face database, the improved AHA combined with SVM is compared with GWO, SSA, and WOA combined with SVM; it achieves higher accuracy and recall with faster inference.
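Tent-map chaotic initialization of the population can be sketched as below: each dimension is iterated through the tent map and scaled to the search bounds, producing a more evenly spread initial swarm than independent uniform draws. The map parameter, bounds, and population size are illustrative.

```python
# Sketch of Tent-map chaotic initialization of a population (parameter mu,
# bounds, and population size are illustrative choices).
import numpy as np

def tent_map_init(pop_size: int, dim: int, lower, upper, mu: float = 0.7):
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    x = np.random.default_rng(0).random(dim)   # chaotic seed per dimension
    population = np.empty((pop_size, dim))
    for i in range(pop_size):
        # Tent map: x -> x/mu if x < mu else (1 - x)/(1 - mu)
        x = np.where(x < mu, x / mu, (1.0 - x) / (1.0 - mu))
        population[i] = lower + x * (upper - lower)
    return population

pop = tent_map_init(pop_size=30, dim=10, lower=[-1] * 10, upper=[1] * 10)
```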
In recent years, Red Green Blue-Depth (RGB-D) saliency object detection technology has made significant progress with a notable improvement in performance. However, the dependency of the RGB-D saliency object detection technology on complex and resource-intensive architectures has limited its application in resource-constrained environments. Although lightweight networks have improved in terms of size and speed, they often come at the cost of sacrificing performance. To address this challenge, an innovative lightweight solution is proposed, which overcomes these limitations by streamlining network parameters and enhancing performance. An effective and universal training strategy and a sparse contrastive self-distillation technology are proposed, in order to compress and accelerate existing RGB-D saliency detection models while improving model performance. This strategy is primarily composed of two key technologies: sparse self-distillation and adversarial contrastive learning. Sparse self-distillation focuses on eliminating unnecessary parameters in saliency detection models while retaining key parameters, thereby achieving more efficient and effective saliency prediction. Adversarial contrastive learning, on the other hand, aims to correct potential errors, further refining the self-distillation process and improving the overall performance of the model. Experimental results on benchmark datasets such as NJUD, NLPR, LFSD, ReDWeb-S, and COME15K demonstrate that, compared to other State of The Art (SOTA) methods, our method can produce more accurate saliency detection results. Furthermore, comparison results of the proposed method with existing SOTA lightweight RGB-D saliency detection models further confirm that our method can achieve a balance between model size reduction and performance enhancement, without sacrificing performance.
Owing to the complexity of underwater environments and the scattering and absorption of light, underwater images often suffer from blurring, color distortion, and low visibility. To improve image quality, an image enhancement framework based on color balance and feature fusion is proposed. First, the Dark Channel Prior (DCP) parameters are optimized using a method that combines a quadtree search with light attenuation characteristics to address image blurring. Second, differential compensation is applied to the two attenuated channels of the deblurred image to obtain a color-balanced image. Subsequently, to address the detail loss and low contrast of the color-balanced image, guided filtering is employed to decompose the image and a nonlinear stretching function is introduced to improve the detail layer, yielding a detail-enhanced image. Contrast Limited Adaptive Histogram Equalization (CLAHE) combined with normalized gamma correction is then applied to obtain a contrast-enhanced image. Finally, weight maps containing different features are extracted from the detail-enhanced and contrast-enhanced images and fused using a multiscale pyramid strategy to obtain the final enhanced image. The experimental results demonstrate that, compared with the reference methods, this method improves the average underwater image quality metric, average gradient, and patch-based contrast quality index by 17.6%, 76.4%, and 11.2%, respectively. The method exhibits robust performance in improving image quality and can achieve various image enhancement effects.
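The contrast-enhancement branch (CLAHE followed by gamma correction) can be approximated with standard OpenCV primitives, as in the sketch below. The clip limit, tile size, gamma value, and input path are illustrative, and applying CLAHE on the LAB luminance channel is an assumption about the pipeline, not the paper's exact recipe.

```python
# Sketch of a CLAHE + normalized gamma contrast-enhancement branch
# (parameters and the LAB-luminance choice are illustrative assumptions).
import cv2
import numpy as np

def contrast_enhance(bgr: np.ndarray, clip: float = 2.0, gamma: float = 0.8) -> np.ndarray:
    lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=clip, tileGridSize=(8, 8))
    l = clahe.apply(l)
    # Normalized gamma correction on the equalized luminance channel.
    l = (np.power(l.astype(np.float32) / 255.0, gamma) * 255.0).astype(np.uint8)
    return cv2.cvtColor(cv2.merge([l, a, b]), cv2.COLOR_LAB2BGR)

img = cv2.imread("underwater.jpg")   # hypothetical input path
if img is not None:
    enhanced = contrast_enhance(img)
```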
Existing multiple Video Frame Interpolation (VFI) methods rely on optical flow or Convolutional Neural Networks (CNN) for implementation; however, they struggle to handle scenes with large motions effectively because of the inherent limitations of optical flow and CNN. In response to this challenge, this study proposes a multiple VFI method based on Transformer and enhanced deformable separable convolution. This method integrates attention mechanisms with both shifted and cross-scale windows, thereby enlarging the receptive field of attention. In addition, during frame synthesis, the method treats the time step as a key control variable input to the frame synthesis network, allowing interpolation at arbitrary time positions. Specifically, shallow features are extracted using embedding layers, followed by an encoder-decoder architecture to extract multi-scale deep features. Finally, a multi-scale, multi-frame synthesis network, based on enhanced deformable separable convolutions, takes multi-scale features, original video frames, and time step information as inputs to synthesize intermediate frames at any time position. Experimental results demonstrate that the proposed method achieves high interpolation performance on several commonly used multiple VFI datasets. Specifically, on the Vimeo90K septuplet dataset, the multi-frame interpolation method achieves Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity (SSIM) values of 27.98 dB and 0.912, respectively, while the performance of the single-frame interpolation method also reaches mainstream levels. Visualization results show that, compared with other methods, our method produces clearer and more reasonable intermediate frames in scenes with large motions.
A new variant of the vehicle routing problem, Vehicle Routing Problem with Occasional Drivers and Scheduled Lines (VRPOD-SL), is introduced for the rural delivery scenario. The VRPOD-SL involves selecting public transportation vehicles and occasional drivers for delivery services, identifying the customers serviced by public transport, and planning the optimal route for occasional drivers. To address the VRPOD-SL, an integer programming model is developed considering multiple factors. These factors include the distribution of the starting and ending points, service range of occasional drivers, and capacity of their vehicles. Additionally, the model considers the capacity limitations and fixed route constraints of the public vehicles involved in the problem. The objective of this model is to minimize the overall delivery cost. The VRPOD-SL exhibits high computational complexity owing to the interdependence between customer decision to select public transportation vehicles and occasional drivers. This interdependence necessitates repeatedly solving the occasional driver routing problem, which further increases the computational burden. Therefore, a heuristic algorithm based on Deep Reinforcement Learning (DRL), called a Genetic Algorithm (GA) integrated with an Attention Model (GA-AM), is proposed. The GA-AM combines the global search capability of genetic algorithms with the parallel decision-making ability of attention models, effectively reducing the computational burden of solving the VRPOD-SL. Additionally, a local search algorithm is proposed to further enhance the solution. Numerical experiments demonstrate that the proposed GA-AM outperforms the Gurobi solver, Adaptive Large Neighborhood Search (ALNS) and Variable Neighborhood Search (VNS) heuristic algorithms in terms of solution performance. Furthermore, the results validate the effectiveness of the collaborative delivery mode, which involves occasional drivers and public vehicles.
Short-term power load forecasting plays a crucial role in the optimal scheduling and safe operation of power systems. Power load data exhibit multiperiod characteristics, showing different patterns and trends at various time scales. Accurately extracting the scale size helps identify and separate these features. Current methods use a fixed patch length or a set of fixed patch lengths as steps and encode time series into segments called patches. However, these methods cannot adapt to the complex dynamic changes in real-world load series data. Therefore, this paper proposes a prediction model based on a dynamic Multi-scale and Dual Attention Transformer (MDAT). First, Successive Variational Mode Decomposition (SVMD) is used to separate different time patterns in the load series, and Fast Fourier Transform (FFT) is performed to extract the significant period of each pattern. Subsequently, based on the detected significant periods, the load series is divided into different time resolutions using patches of varying sizes, and multiple branches of a transformer are used to simultaneously model the dependencies of the sequences segmented at different scales. Next, dual attention is applied to these patches to capture the global correlations and local details. Finally, nonlinear feature fusion is performed on the outputs of each branch, and the final load prediction results are obtained by stacking multiple transformer modules. Experimental results on two public datasets demonstrate that the proposed model performs well in terms of prediction accuracy. Compared to the latest models based on Transformer and Multilayer Perceptron (MLP), the Mean Absolute Error (MAE) on the Australia and Morocco datasets is reduced by 10.26%-17.06% and 9.08%-70.25%, respectively.
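Detecting a significant period with the FFT, which then sets the patch length of a branch, follows a common recipe sketched below: find the strongest non-DC frequency bin and take its reciprocal as the period. The synthetic half-hourly example and the single-period simplification are illustrative.

```python
# Sketch of FFT-based dominant-period detection used to choose a patch size
# (single-period simplification; the synthetic example is illustrative).
import numpy as np

def dominant_period(series: np.ndarray) -> int:
    x = series - series.mean()
    spectrum = np.abs(np.fft.rfft(x))
    spectrum[0] = 0.0                        # drop the DC component
    freqs = np.fft.rfftfreq(len(x), d=1.0)
    k = int(np.argmax(spectrum))             # strongest frequency bin
    return int(round(1.0 / freqs[k])) if freqs[k] > 0 else len(x)

# Example: a daily pattern sampled half-hourly (true period = 48 steps).
t = np.arange(48 * 30)
load = np.sin(2 * np.pi * t / 48) + 0.1 * np.random.default_rng(0).standard_normal(t.size)
print(dominant_period(load))   # ~48, used as the patch length for that branch
```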
In this study, a dual-objective berth-bridge cooperative scheduling optimization model is proposed to minimize ship service cost and emission cost, and an improved algorithm, namely Reinforcement Learning based Q-learning NSGA-Ⅱ (RL-Q-NSGA-Ⅱ), based on Non-dominated Sorting Genetic Algorithm Ⅱ (NSGA-Ⅱ) is designed. Through an empirical analysis of the Chiwan container terminal, the results obtained using the improved algorithm, the original NSGA-Ⅱ algorithm, and the first-come first-served scheduling model are quantitatively compared. The results show that the RL-Q-NSGA-Ⅱ algorithm performs better in terms of the iteration speed, convergence, and Pareto front deaggregation degree. Compared with the original NSGA-Ⅱ algorithm, the ship service cost and port ship air pollution emission cost are optimized by 12.19% and 6.04%, respectively, and the total cost is optimized by 8.39%. Compared with the first-come first-served model, the service cost and emission cost are optimized by 18.68% and 3.79%, respectively, and the total cost is optimized by 9.82%. In addition, a negative correlation is observed between ship exhaust emission cost and service cost. If the port considers only the ship service efficiency or the dock operation cost, the social cost of port exhaust emission will increase significantly. The results demonstrate that the proposed model and algorithm can provide a reference for port and shipping companies to make reasonable berth and quay crane scheduling plans for different situations.
Accurate ultra-short-term and short-term multi-region power load forecasting is the key to achieving rapid response and real-time scheduling in power systems. Therefore, based on the spatiotemporal correlation of loads in different regions of a power grid, this study proposes a single-step and multi-step prediction model for ultrashort-term and short-term power load forecasting in multiple regions. This model integrates a Gate-controlled Multi-head Temporal Convolutional Network (GMTCN), a Bi-directional Long Short-Term Memory (BiLSTM), and an attention mechanism, denoted as GMTCN-BiLSTM-Attention. First, the Spearman correlation coefficient is used to analyze the spatial correlation of power loads in different regions, and the load sequences of 15 regions are combined into a multivariate time series to be used as input. Then, GMTCN and BiLSTM are employed to obtain the temporal features and spatiotemporal dependencies of different load sequences, and the attention mechanism is applied to assign higher weights to important features, ignoring unimportant information to improve the robustness of the model. Experiments on two datasets reveal a spatiotemporal correlation between the power loads in different regions. The proposed model can effectively obtain the temporal characteristics of load sequences and the spatiotemporal dependencies among load sequences, and it can simultaneously achieve single- and multi-step predictions of ultra-short-term and short-term power loads in multiple regions. Compared with other deep learning models, it has better predictive performance, stronger robustness, and improved generalization.
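The Spearman-based spatial correlation analysis amounts to computing a rank-correlation matrix over the regional load series, as in the sketch below; the synthetic 15-region data are only for illustration.

```python
# Sketch of Spearman correlation analysis across regional load series
# (synthetic regions-by-time data; scipy.stats.spearmanr over columns).
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
t = np.arange(24 * 96)
base = np.sin(2 * np.pi * t / 96)                       # shared daily pattern
loads = np.stack([base + 0.2 * rng.standard_normal(t.size) for _ in range(15)], axis=1)

corr, _ = spearmanr(loads)          # (15, 15) spatial correlation matrix
print(np.round(corr[:3, :3], 2))    # strongly correlated regions are combined into one multivariate input
```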