School of Physics and Astronomy, Shanghai Jiao Tong University, State Key Laboratory of Dark Matter Physics, Shanghai Jiao Tong University, Tsung-Dao Lee Institute, Shanghai Jiao Tong University
Abstract:In this paper, we propose a deep learning (DL)-based task-driven spectrum prediction framework, named DeepSPred. The DeepSPred comprises a feature encoder and a task predictor, where the encoder extracts spectrum usage pattern features, and the predictor configures different networks according to the task requirements to predict future spectrum. Based on the Deep- SPred, we first propose a novel 3D spectrum prediction method combining a flow processing strategy with 3D vision Transformer (ViT, i.e., Swin) and a pyramid to serve possible applications such as spectrum monitoring task, named 3D-SwinSTB. 3D-SwinSTB unique 3D Patch Merging ViT-to-3D ViT Patch Expanding and pyramid designs help the model accurately learn the potential correlation of the evolution of the spectrogram over time. Then, we propose a novel spectrum occupancy rate (SOR) method by redesigning a predictor consisting exclusively of 3D convolutional and linear layers to serve possible applications such as dynamic spectrum access (DSA) task, named 3D-SwinLinear. Unlike the 3D-SwinSTB output spectrogram, 3D-SwinLinear projects the spectrogram directly as the SOR. Finally, we employ transfer learning (TL) to ensure the applicability of our two methods to diverse spectrum services. The results show that our 3D-SwinSTB outperforms recent benchmarks by more than 5%, while our 3D-SwinLinear achieves a 90% accuracy, with a performance improvement exceeding 10%.




Abstract:Direct preference optimization (DPO), a widely adopted offline preference optimization algorithm, aims to align large language models (LLMs) with human-desired behaviors using pairwise preference data. However, the winning response and the losing response within pairwise data are generated isolatedly, leading to weak correlations between them as well as suboptimal alignment performance. To address this issue, we propose an effective framework named BMC, for bridging and modeling correlations in pairwise data. Firstly, we increase the consistency and informativeness of the pairwise preference signals by targeted modifications, synthesizing a pseudo winning response through improving the losing response based on the winning response. Secondly, we identify that DPO alone is insufficient to model these correlations and capture nuanced variations. Therefore, we propose learning token-level correlations by dynamically leveraging the policy model's confidence during training. Comprehensive experiments on QA, math, and instruction-following tasks demonstrate the effectiveness of our approach, significantly surpassing competitive baselines, including DPO. Additionally, our in-depth quantitative analysis reveals the reasons behind our method's superior performance over DPO and showcases its versatility to other DPO variants.




Abstract:Cyber Threat Intelligence (CTI) summarization task requires the system to generate concise and accurate highlights from raw intelligence data, which plays an important role in providing decision-makers with crucial information to quickly detect and respond to cyber threats in the cybersecurity domain. However, efficient techniques for summarizing CTI reports, including facts, analytical insights, attack processes, etc., have largely been unexplored, primarily due to the lack of available dataset. To this end, we present CTISum, a new benchmark for CTI summarization task. Considering the importance of attack process, a novel fine-grained subtask of attack process summarization is proposed to enable defenders to assess risk, identify security gaps, vulnerabilities, and so on. Specifically, we first design a multi-stage annotation pipeline to gather and annotate the CTI data, and then benchmark the CTISum with a collection of extractive and abstractive summarization methods. Experimental results show that current state-of-the-art models exhibit limitations when applied to CTISum, underscoring the fact that automatically producing concise summaries of CTI reports remains an open research challenge.




Abstract:Material responses to static and dynamic stimuli, represented as nonlinear curves, are design targets for engineering functionalities like structural support, impact protection, and acoustic and photonic bandgaps. Three-dimensional metamaterials offer significant tunability due to their internal structure, yet existing methods struggle to capture their complex behavior-to-structure relationships. We present GraphMetaMat, a graph-based framework capable of designing three-dimensional metamaterials with programmable responses and arbitrary manufacturing constraints. Integrating graph networks, physics biases, reinforcement learning, and tree search, GraphMetaMat can target stress-strain curves spanning four orders of magnitude and complex behaviors, as well as viscoelastic transmission responses with varying attenuation gaps. GraphMetaMat can create cushioning materials for protective equipment and vibration-damping panels for electric vehicles, outperforming commercial materials, and enabling the automatic design of materials with on-demand functionalities.




Abstract:In recent years, LiDAR-based localization and mapping methods have achieved significant progress thanks to their reliable and real-time localization capability. Considering single LiDAR odometry often faces hardware failures and degradation in practical scenarios, Multi-LiDAR Odometry (MLO), as an emerging technology, is studied to enhance the performance of LiDAR-based localization and mapping systems. However, MLO can suffer from high computational complexity introduced by dense point clouds that are fused from multiple LiDARs, and the continuous-time measurement characteristic is constantly neglected by existing LiDAR odometry. This motivates us to develop a Continuous-Time and Efficient MLO, namely CTE-MLO, which can achieve accurate and real-time state estimation using multi-LiDAR measurements through a continuous-time perspective. In this paper, the Gaussian process estimation is naturally combined with the Kalman filter, which enables each LiDAR point in a point stream to query the corresponding continuous-time trajectory within its time instants. A decentralized multi-LiDAR synchronization scheme also be devised to combine points from separate LiDARs into a single point cloud without the requirement for primary LiDAR assignment. Moreover, with the aim of improving the real-time performance of MLO without sacrificing robustness, a point cloud sampling strategy is designed with the consideration of localizability. The effectiveness of the proposed method is demonstrated through various scenarios, including public datasets and real-world autonomous driving experiments. The results demonstrate that the proposed CTE-MLO can achieve high-accuracy continuous-time state estimations in real-time and is demonstratively competitive compared to other state-of-the-art methods.




Abstract:Significant advancements has recently been achieved in the field of multi-modal large language models (MLLMs), demonstrating their remarkable capabilities in understanding and reasoning across diverse tasks. However, these models are often trained for specific tasks and rely on task-specific input-output formats, limiting their applicability to a broader range of tasks. This raises a fundamental question: Can we develop a unified approach to represent and handle different multi-modal tasks to maximize the generalizability of MLLMs? In this paper, we propose UnifiedMLLM, a comprehensive model designed to represent various tasks using a unified representation. Our model exhibits strong capabilities in comprehending the implicit intent of user instructions and preforming reasoning. In addition to generating textual responses, our model also outputs task tokens and grounding tokens, serving as indicators of task types and task granularity. These outputs are subsequently routed through the task router and directed to specific expert models for task completion. To train our model, we construct a task-specific dataset and an 100k multi-task dataset encompassing complex scenarios. Employing a three-stage training strategy, we equip our model with robust reasoning and task processing capabilities while preserving its generalization capacity and knowledge reservoir. Extensive experiments showcase the impressive performance of our unified representation approach across various tasks, surpassing existing methodologies. Furthermore, our approach exhibits exceptional scalability and generality. Our code, model, and dataset will be available at \url{https://github.com/lzw-lzw/UnifiedMLLM}.
Abstract:Recently, deep learning technology has been successfully introduced into Automatic Modulation Recognition (AMR) tasks. However, the success of deep learning is all attributed to the training on large-scale datasets. Such a large amount of data brings huge pressure on storage, transmission and model training. In order to solve the problem of large amount of data, some researchers put forward the method of data distillation, which aims to compress large training data into smaller synthetic datasets to maintain its performance. While numerous data distillation techniques have been developed within the realm of image processing, the unique characteristics of signals set them apart. Signals exhibit distinct features across various domains, necessitating specialized approaches for their analysis and processing. To this end, a novel dataset distillation method--Multi-domain Distribution Matching (MDM) is proposed. MDM employs the Discrete Fourier Transform (DFT) to translate timedomain signals into the frequency domain, and then uses a model to compute distribution matching losses between the synthetic and real datasets, considering both the time and frequency domains. Ultimately, these two losses are integrated to update the synthetic dataset. We conduct extensive experiments on three AMR datasets. Experimental results show that, compared with baseline methods, our method achieves better performance under the same compression ratio. Furthermore, we conduct crossarchitecture generalization experiments on several models, and the experimental results show that our synthetic datasets can generalize well on other unseen models.




Abstract:It is helpful in preventing colorectal cancer to detect and treat polyps in the gastrointestinal tract early. However, there have been few studies to date on designing polyp image classification networks that balance efficiency and accuracy. This challenge is mainly attributed to the fact that polyps are similar to other pathologies and have complex features influenced by texture, color, and morphology. In this paper, we propose a novel network DFE-IANet based on both spectral transformation and feature interaction. Firstly, to extract detailed features and multi-scale features, the features are transformed by the multi-scale frequency domain feature extraction (MSFD) block to extract texture details at the fine-grained level in the frequency domain. Secondly, the multi-scale interaction attention (MSIA) block is designed to enhance the network's capability of extracting critical features. This block introduces multi-scale features into self-attention, aiming to adaptively guide the network to concentrate on vital regions. Finally, with a compact parameter of only 4M, DFE-IANet outperforms the latest and classical networks in terms of efficiency. Furthermore, DFE-IANet achieves state-of-the-art (SOTA) results on the challenging Kvasir dataset, demonstrating a remarkable Top-1 accuracy of 93.94%. This outstanding accuracy surpasses ViT by 8.94%, ResNet50 by 1.69%, and VMamba by 1.88%. Our code is publicly available at https://github.com/PURSUETHESUN/DFE-IANet.



Abstract:Childhood myopia constitutes a significant global health concern. It exhibits an escalating prevalence and has the potential to evolve into severe, irreversible conditions that detrimentally impact familial well-being and create substantial economic costs. Contemporary research underscores the importance of precisely predicting myopia progression to enable timely and effective interventions, thereby averting severe visual impairment in children. Such predictions predominantly rely on subjective clinical assessments, which are inherently biased and resource-intensive, thus hindering their widespread application. In this study, we introduce a novel, high-accuracy method for quantitatively predicting the myopic trajectory and myopia risk in children using only fundus images and baseline refraction data. This approach was validated through a six-year longitudinal study of 3,408 children in Henan, utilizing 16,211 fundus images and corresponding refractive data. Our method based on deep learning demonstrated predictive accuracy with an error margin of 0.311D per year and AUC scores of 0.944 and 0.995 for forecasting the risks of developing myopia and high myopia, respectively. These findings confirm the utility of our model in supporting early intervention strategies and in significantly reducing healthcare costs, particularly by obviating the need for additional metadata and repeated consultations. Furthermore, our method was designed to rely only on fundus images and refractive error data, without the need for meta data or multiple inquiries from doctors, strongly reducing the associated medical costs and facilitating large-scale screening. Our model can even provide good predictions based on only a single time measurement. Consequently, the proposed method is an important means to reduce medical inequities caused by economic disparities.
Abstract:Neural networks are increasingly evolving towards training large models with big data, a method that has demonstrated superior performance across many tasks. However, this approach introduces an urgent problem: current deep learning models are predominantly serial, meaning that as the number of network layers increases, so do the training and inference times. This is unacceptable if deep learning is to continue advancing. Therefore, this paper proposes a deep learning parallelization strategy based on the Universal Approximation Theorem (UAT). From this foundation, we designed a parallel network called Para-Former to test our theory. Unlike traditional serial models, the inference time of Para-Former does not increase with the number of layers, significantly accelerating the inference speed of multi-layer networks. Experimental results validate the effectiveness of this network.