Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Xue Jiang

Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute

Mar 31, 2025

Yingwei Ma, Binhua Li, Yihong Dong, Xue Jiang, Rongyu Cao, Jue Chen, Fei Huang, Yongbin Li

Abstract:Recent advancements in software engineering agents have demonstrated promising capabilities in automating program improvements. However, their reliance on closed-source or resource-intensive models introduces significant deployment challenges in private environments, prompting a critical question: \textit{How can personally deployable open-source LLMs achieve comparable code reasoning performance?} To this end, we propose a unified Test-Time Compute scaling framework that leverages increased inference-time computation instead of larger models. Our framework incorporates two complementary strategies: internal TTC and external TTC. Internally, we introduce a \textit{development-contextualized trajectory synthesis} method leveraging real-world software repositories to bootstrap multi-stage reasoning processes, such as fault localization and patch generation. We further enhance trajectory quality through rejection sampling, rigorously evaluating trajectories along accuracy and complexity. Externally, we propose a novel \textit{development-process-based search} strategy guided by reward models and execution verification. This approach enables targeted computational allocation at critical development decision points, overcoming limitations of existing "end-point only" verification methods. Evaluations on SWE-bench Verified demonstrate our \textbf{32B model achieves a 46\% issue resolution rate}, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1. Additionally, we provide the empirical validation of the test-time scaling phenomenon within SWE agents, revealing that \textbf{models dynamically allocate more tokens to increasingly challenging problems}, effectively enhancing reasoning capabilities. We publicly release all training data, models, and code to facilitate future research. https://github.com/yingweima2022/SWE-Reasoner

Via

Access Paper or Ask Questions

Universal Speech Token Learning via Low-Bitrate Neural Codec and Pretrained Representations

Mar 15, 2025

Xue Jiang, Xiulian Peng, Yuan Zhang, Yan Lu

Figure 1 for Universal Speech Token Learning via Low-Bitrate Neural Codec and Pretrained Representations

Figure 2 for Universal Speech Token Learning via Low-Bitrate Neural Codec and Pretrained Representations

Figure 3 for Universal Speech Token Learning via Low-Bitrate Neural Codec and Pretrained Representations

Figure 4 for Universal Speech Token Learning via Low-Bitrate Neural Codec and Pretrained Representations

Abstract:Current large speech language models are mainly based on semantic tokens from discretization of self-supervised learned representations and acoustic tokens from a neural codec, following a semantic-modeling and acoustic-synthesis paradigm. However, semantic tokens discard paralinguistic attributes of speakers that is important for natural spoken communication, while prompt-based acoustic synthesis from semantic tokens has limits in recovering paralinguistic details and suffers from robustness issues, especially when there are domain gaps between the prompt and the target. This paper unifies two types of tokens and proposes the UniCodec, a universal speech token learning that encapsulates all semantics of speech, including linguistic and paralinguistic information, into a compact and semantically-disentangled unified token. Such a unified token can not only benefit speech language models in understanding with paralinguistic hints but also help speech generation with high-quality output. A low-bitrate neural codec is leveraged to learn such disentangled discrete representations at global and local scales, with knowledge distilled from self-supervised learned features. Extensive evaluations on multilingual datasets demonstrate its effectiveness in generating natural, expressive and long-term consistent output quality with paralinguistic attributes well preserved in several speech processing tasks.

* Accepted by IEEE Journal of Selected Topics in Signal Processing(JSTSP)

Via

Access Paper or Ask Questions

PrivacyScalpel: Enhancing LLM Privacy via Interpretable Feature Intervention with Sparse Autoencoders

Mar 14, 2025

Ahmed Frikha, Muhammad Reza Ar Razi, Krishna Kanth Nakka, Ricardo Mendes, Xue Jiang, Xuebing Zhou

Figure 1 for PrivacyScalpel: Enhancing LLM Privacy via Interpretable Feature Intervention with Sparse Autoencoders

Figure 2 for PrivacyScalpel: Enhancing LLM Privacy via Interpretable Feature Intervention with Sparse Autoencoders

Figure 3 for PrivacyScalpel: Enhancing LLM Privacy via Interpretable Feature Intervention with Sparse Autoencoders

Figure 4 for PrivacyScalpel: Enhancing LLM Privacy via Interpretable Feature Intervention with Sparse Autoencoders

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language processing but also pose significant privacy risks by memorizing and leaking Personally Identifiable Information (PII). Existing mitigation strategies, such as differential privacy and neuron-level interventions, often degrade model utility or fail to effectively prevent leakage. To address this challenge, we introduce PrivacyScalpel, a novel privacy-preserving framework that leverages LLM interpretability techniques to identify and mitigate PII leakage while maintaining performance. PrivacyScalpel comprises three key steps: (1) Feature Probing, which identifies layers in the model that encode PII-rich representations, (2) Sparse Autoencoding, where a k-Sparse Autoencoder (k-SAE) disentangles and isolates privacy-sensitive features, and (3) Feature-Level Interventions, which employ targeted ablation and vector steering to suppress PII leakage. Our empirical evaluation on Gemma2-2b and Llama2-7b, fine-tuned on the Enron dataset, shows that PrivacyScalpel significantly reduces email leakage from 5.15\% to as low as 0.0\%, while maintaining over 99.4\% of the original model's utility. Notably, our method outperforms neuron-level interventions in privacy-utility trade-offs, demonstrating that acting on sparse, monosemantic features is more effective than manipulating polysemantic neurons. Beyond improving LLM privacy, our approach offers insights into the mechanisms underlying PII memorization, contributing to the broader field of model interpretability and secure AI deployment.

Via

Access Paper or Ask Questions

FANformer: Improving Large Language Models Through Effective Periodicity Modeling

Feb 28, 2025

Yihong Dong, Ge Li, Xue Jiang, Yongding Tao, Kechi Zhang, Hao Zhu, Huanyu Liu, Jiazheng Ding, Jia Li, Jinliang Deng(+1 more)

Abstract:Periodicity, as one of the most important basic characteristics, lays the foundation for facilitating structured knowledge acquisition and systematic cognitive processes within human learning paradigms. However, the potential flaws of periodicity modeling in Transformer affect the learning efficiency and establishment of underlying principles from data for large language models (LLMs) built upon it. In this paper, we demonstrate that integrating effective periodicity modeling can improve the learning efficiency and performance of LLMs. We introduce FANformer, which integrates Fourier Analysis Network (FAN) into attention mechanism to achieve efficient periodicity modeling, by modifying the feature projection process of attention mechanism. Extensive experimental results on language modeling show that FANformer consistently outperforms Transformer when scaling up model size and training tokens, underscoring its superior learning efficiency. To further validate the effectiveness of FANformer, we pretrain a FANformer-1B on 1 trillion tokens. FANformer-1B exhibits marked improvements on downstream tasks compared to open-source LLMs with similar model parameters or training tokens. The results position FANformer as an effective and promising architecture for advancing LLMs.

Via

Access Paper or Ask Questions

Boundary-enhanced time series data imputation with long-term dependency diffusion models

Jan 11, 2025

Chunjing Xiao, Xue Jiang, Xianghe Du, Wei Yang, Wei Lu, Xiaomin Wang, Kevin Chetty

Figure 1 for Boundary-enhanced time series data imputation with long-term dependency diffusion models

Figure 2 for Boundary-enhanced time series data imputation with long-term dependency diffusion models

Figure 3 for Boundary-enhanced time series data imputation with long-term dependency diffusion models

Figure 4 for Boundary-enhanced time series data imputation with long-term dependency diffusion models

Abstract:Data imputation is crucial for addressing challenges posed by missing values in multivariate time series data across various fields, such as healthcare, traffic, and economics, and has garnered significant attention. Among various methods, diffusion model-based approaches show notable performance improvements. However, existing methods often cause disharmonious boundaries between missing and known regions and overlook long-range dependencies in missing data estimation, leading to suboptimal results. To address these issues, we propose a Diffusion-based time Series Data Imputation (DSDI) framework. We develop a weight-reducing injection strategy that incorporates the predicted values of missing points with reducing weights into the reverse diffusion process to mitigate boundary inconsistencies. Further, we introduce a multi-scale S4-based U-Net, which combines hierarchical information from different levels via multi-resolution integration to capture long-term dependencies. Experimental results demonstrate that our model outperforms existing imputation methods.

* Accepted by Knowledge-Based Systems

Via

Access Paper or Ask Questions

Image-based Multimodal Models as Intruders: Transferable Multimodal Attacks on Video-based MLLMs

Jan 02, 2025

Linhao Huang, Xue Jiang, Zhiqiang Wang, Wentao Mo, Xi Xiao, Bo Han, Yongjie Yin, Feng Zheng

Figure 1 for Image-based Multimodal Models as Intruders: Transferable Multimodal Attacks on Video-based MLLMs

Figure 2 for Image-based Multimodal Models as Intruders: Transferable Multimodal Attacks on Video-based MLLMs

Figure 3 for Image-based Multimodal Models as Intruders: Transferable Multimodal Attacks on Video-based MLLMs

Figure 4 for Image-based Multimodal Models as Intruders: Transferable Multimodal Attacks on Video-based MLLMs

Abstract:Video-based multimodal large language models (V-MLLMs) have shown vulnerability to adversarial examples in video-text multimodal tasks. However, the transferability of adversarial videos to unseen models--a common and practical real world scenario--remains unexplored. In this paper, we pioneer an investigation into the transferability of adversarial video samples across V-MLLMs. We find that existing adversarial attack methods face significant limitations when applied in black-box settings for V-MLLMs, which we attribute to the following shortcomings: (1) lacking generalization in perturbing video features, (2) focusing only on sparse key-frames, and (3) failing to integrate multimodal information. To address these limitations and deepen the understanding of V-MLLM vulnerabilities in black-box scenarios, we introduce the Image-to-Video MLLM (I2V-MLLM) attack. In I2V-MLLM, we utilize an image-based multimodal model (IMM) as a surrogate model to craft adversarial video samples. Multimodal interactions and temporal information are integrated to disrupt video representations within the latent space, improving adversarial transferability. In addition, a perturbation propagation technique is introduced to handle different unknown frame sampling strategies. Experimental results demonstrate that our method can generate adversarial examples that exhibit strong transferability across different V-MLLMs on multiple video-text multimodal tasks. Compared to white-box attacks on these models, our black-box attacks (using BLIP-2 as surrogate model) achieve competitive performance, with average attack success rates of 55.48% on MSVD-QA and 58.26% on MSRVTT-QA for VideoQA tasks, respectively. Our code will be released upon acceptance.

Via

Access Paper or Ask Questions

Adversarial Filtering Based Evasion and Backdoor Attacks to EEG-Based Brain-Computer Interfaces

Dec 10, 2024

Lubin Meng, Xue Jiang, Xiaoqing Chen, Wenzhong Liu, Hanbin Luo, Dongrui Wu

Figure 1 for Adversarial Filtering Based Evasion and Backdoor Attacks to EEG-Based Brain-Computer Interfaces

Figure 2 for Adversarial Filtering Based Evasion and Backdoor Attacks to EEG-Based Brain-Computer Interfaces

Figure 3 for Adversarial Filtering Based Evasion and Backdoor Attacks to EEG-Based Brain-Computer Interfaces

Figure 4 for Adversarial Filtering Based Evasion and Backdoor Attacks to EEG-Based Brain-Computer Interfaces

Abstract:A brain-computer interface (BCI) enables direct communication between the brain and an external device. Electroencephalogram (EEG) is a common input signal for BCIs, due to its convenience and low cost. Most research on EEG-based BCIs focuses on the accurate decoding of EEG signals, while ignoring their security. Recent studies have shown that machine learning models in BCIs are vulnerable to adversarial attacks. This paper proposes adversarial filtering based evasion and backdoor attacks to EEG-based BCIs, which are very easy to implement. Experiments on three datasets from different BCI paradigms demonstrated the effectiveness of our proposed attack approaches. To our knowledge, this is the first study on adversarial filtering for EEG-based BCIs, raising a new security concern and calling for more attention on the security of BCIs.

* L. Meng, X. Jiang, X. Chen, W. Liu, H. Luo and D. Wu, Adversarial Filtering Based Evasion and Backdoor Attacks to EEG-Based Brain-Computer Interfaces, Information Fusion, 107:102316, 2024

Via

Access Paper or Ask Questions

Cross-Task Inconsistency Based Active Learning (CTIAL) for Emotion Recognition

Dec 02, 2024

Yifan Xu, Xue Jiang, Dongrui Wu

Figure 1 for Cross-Task Inconsistency Based Active Learning (CTIAL) for Emotion Recognition

Figure 2 for Cross-Task Inconsistency Based Active Learning (CTIAL) for Emotion Recognition

Figure 3 for Cross-Task Inconsistency Based Active Learning (CTIAL) for Emotion Recognition

Figure 4 for Cross-Task Inconsistency Based Active Learning (CTIAL) for Emotion Recognition

Abstract:Emotion recognition is a critical component of affective computing. Training accurate machine learning models for emotion recognition typically requires a large amount of labeled data. Due to the subtleness and complexity of emotions, multiple evaluators are usually needed for each affective sample to obtain its ground-truth label, which is expensive. To save the labeling cost, this paper proposes an inconsistency-based active learning approach for cross-task transfer between emotion classification and estimation. Affective norms are utilized as prior knowledge to connect the label spaces of categorical and dimensional emotions. Then, the prediction inconsistency on the two tasks for the unlabeled samples is used to guide sample selection in active learning for the target task. Experiments on within-corpus and cross-corpus transfers demonstrated that cross-task inconsistency could be a very valuable metric in active learning. To our knowledge, this is the first work that utilizes prior knowledge on affective norms and data in a different task to facilitate active learning for a new task, even the two tasks are from different datasets.

* IEEE Trans. on Affective Computing, 2024

Via

Access Paper or Ask Questions

Protecting Multiple Types of Privacy Simultaneously in EEG-based Brain-Computer Interfaces

Nov 29, 2024

Lubin Meng, Xue Jiang, Tianwang Jia, Dongrui Wu

Figure 1 for Protecting Multiple Types of Privacy Simultaneously in EEG-based Brain-Computer Interfaces

Figure 2 for Protecting Multiple Types of Privacy Simultaneously in EEG-based Brain-Computer Interfaces

Figure 3 for Protecting Multiple Types of Privacy Simultaneously in EEG-based Brain-Computer Interfaces

Figure 4 for Protecting Multiple Types of Privacy Simultaneously in EEG-based Brain-Computer Interfaces

Abstract:A brain-computer interface (BCI) enables direct communication between the brain and an external device. Electroencephalogram (EEG) is the preferred input signal in non-invasive BCIs, due to its convenience and low cost. EEG-based BCIs have been successfully used in many applications, such as neurological rehabilitation, text input, games, and so on. However, EEG signals inherently carry rich personal information, necessitating privacy protection. This paper demonstrates that multiple types of private information (user identity, gender, and BCI-experience) can be easily inferred from EEG data, imposing a serious privacy threat to BCIs. To address this issue, we design perturbations to convert the original EEG data into privacy-protected EEG data, which conceal the private information while maintaining the primary BCI task performance. Experimental results demonstrated that the privacy-protected EEG data can significantly reduce the classification accuracy of user identity, gender and BCI-experience, but almost do not affect at all the classification accuracy of the primary BCI task, enabling user privacy protection in EEG-based BCIs.

* IEEE Int'l Conf. on Systems, Man and Cybernetics, Sarawak, Malaysia, October 2024

Via

Access Paper or Ask Questions

GeoGround: A Unified Large Vision-Language Model. for Remote Sensing Visual Grounding

Nov 16, 2024

Yue Zhou, Mengcheng Lan, Xiang Li, Yiping Ke, Xue Jiang, Litong Feng, Wayne Zhang

Figure 1 for GeoGround: A Unified Large Vision-Language Model. for Remote Sensing Visual Grounding

Figure 2 for GeoGround: A Unified Large Vision-Language Model. for Remote Sensing Visual Grounding

Figure 3 for GeoGround: A Unified Large Vision-Language Model. for Remote Sensing Visual Grounding

Figure 4 for GeoGround: A Unified Large Vision-Language Model. for Remote Sensing Visual Grounding

Abstract:Remote sensing (RS) visual grounding aims to use natural language expression to locate specific objects (in the form of the bounding box or segmentation mask) in RS images, enhancing human interaction with intelligent RS interpretation systems. Early research in this area was primarily based on horizontal bounding boxes (HBBs), but as more diverse RS datasets have become available, tasks involving oriented bounding boxes (OBBs) and segmentation masks have emerged. In practical applications, different targets require different grounding types: HBB can localize an object's position, OBB provides its orientation, and mask depicts its shape. However, existing specialized methods are typically tailored to a single type of RS visual grounding task and are hard to generalize across tasks. In contrast, large vision-language models (VLMs) exhibit powerful multi-task learning capabilities but struggle to handle dense prediction tasks like segmentation. This paper proposes GeoGround, a novel framework that unifies support for HBB, OBB, and mask RS visual grounding tasks, allowing flexible output selection. Rather than customizing the architecture of VLM, our work aims to elegantly support pixel-level visual grounding output through the Text-Mask technique. We define prompt-assisted and geometry-guided learning to enhance consistency across different signals. To support model training, we present refGeo, a large-scale RS visual instruction-following dataset containing 161k image-text pairs. Experimental results show that GeoGround demonstrates strong performance across four RS visual grounding tasks, matching or surpassing the performance of specialized methods on multiple benchmarks. Code available at https://github.com/zytx121/GeoGround

* 25 pages, 19 figures

Via

Access Paper or Ask Questions