Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jianling Sun

SWE-Universe: Scale Real-World Verifiable Environments to Millions

Feb 02, 2026

Mouxiang Chen, Lei Zhang, Yunlong Feng, Xuwu Wang, Wenting Zhao, Ruisheng Cao, Jiaxi Yang, Jiawei Chen, Mingze Li, Zeyao Ma(+8 more)

Abstract:We propose SWE-Universe, a scalable and efficient framework for automatically constructing real-world software engineering (SWE) verifiable environments from GitHub pull requests (PRs). To overcome the prevalent challenges of automatic building, such as low production yield, weak verifiers, and prohibitive cost, our framework utilizes a building agent powered by an efficient custom-trained model. This agent employs iterative self-verification and in-loop hacking detection to ensure the reliable generation of high-fidelity, verifiable tasks. Using this method, we scale the number of real-world multilingual SWE environments to a million scale (807,693). We demonstrate the profound value of our environments through large-scale agentic mid-training and reinforcement learning. Finally, we applied this technique to Qwen3-Max-Thinking and achieved a score of 75.3% on SWE-Bench Verified. Our work provides both a critical resource and a robust methodology to advance the next generation of coding agents.

* 13 pages

Via

Access Paper or Ask Questions

The Power of Architecture: Deep Dive into Transformer Architectures for Long-Term Time Series Forecasting

Jul 17, 2025

Lefei Shen, Mouxiang Chen, Han Fu, Xiaoxue Ren, Xiaoyun Joy Wang, Jianling Sun, Zhuo Li, Chenghao Liu

Figure 1 for The Power of Architecture: Deep Dive into Transformer Architectures for Long-Term Time Series Forecasting

Figure 2 for The Power of Architecture: Deep Dive into Transformer Architectures for Long-Term Time Series Forecasting

Figure 3 for The Power of Architecture: Deep Dive into Transformer Architectures for Long-Term Time Series Forecasting

Figure 4 for The Power of Architecture: Deep Dive into Transformer Architectures for Long-Term Time Series Forecasting

Abstract:Transformer-based models have recently become dominant in Long-term Time Series Forecasting (LTSF), yet the variations in their architecture, such as encoder-only, encoder-decoder, and decoder-only designs, raise a crucial question: What Transformer architecture works best for LTSF tasks? However, existing models are often tightly coupled with various time-series-specific designs, making it difficult to isolate the impact of the architecture itself. To address this, we propose a novel taxonomy that disentangles these designs, enabling clearer and more unified comparisons of Transformer architectures. Our taxonomy considers key aspects such as attention mechanisms, forecasting aggregations, forecasting paradigms, and normalization layers. Through extensive experiments, we uncover several key insights: bi-directional attention with joint-attention is most effective; more complete forecasting aggregation improves performance; and the direct-mapping paradigm outperforms autoregressive approaches. Furthermore, our combined model, utilizing optimal architectural choices, consistently outperforms several existing models, reinforcing the validity of our conclusions. We hope these findings offer valuable guidance for future research on Transformer architectural designs in LTSF. Our code is available at https://github.com/HALF111/TSF_architecture.

* 15 pages, 6 figures

Via

Access Paper or Ask Questions

Parallel Scaling Law for Language Models

May 15, 2025

Mouxiang Chen, Binyuan Hui, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Jianling Sun, Junyang Lin, Zhongxin Liu

Figure 1 for Parallel Scaling Law for Language Models

Figure 2 for Parallel Scaling Law for Language Models

Figure 3 for Parallel Scaling Law for Language Models

Figure 4 for Parallel Scaling Law for Language Models

Abstract:It is commonly believed that scaling language models should commit a significant space or time cost, by increasing the parameters (parameter scaling) or output tokens (inference-time scaling). We introduce the third and more inference-efficient scaling paradigm: increasing the model's parallel computation during both training and inference time. We apply $P$ diverse and learnable transformations to the input, execute forward passes of the model in parallel, and dynamically aggregate the $P$ outputs. This method, namely parallel scaling (ParScale), scales parallel computation by reusing existing parameters and can be applied to any model structure, optimization procedure, data, or task. We theoretically propose a new scaling law and validate it through large-scale pre-training, which shows that a model with $P$ parallel streams is similar to scaling the parameters by $O(\log P)$ while showing superior inference efficiency. For example, ParScale can use up to 22$\times$ less memory increase and 6$\times$ less latency increase compared to parameter scaling that achieves the same performance improvement. It can also recycle an off-the-shelf pre-trained model into a parallelly scaled one by post-training on a small amount of tokens, further reducing the training budget. The new scaling law we discovered potentially facilitates the deployment of more powerful models in low-resource scenarios, and provides an alternative perspective for the role of computation in machine learning.

Via

Access Paper or Ask Questions

B4: Towards Optimal Assessment of Plausible Code Solutions with Plausible Tests

Sep 13, 2024

Mouxiang Chen, Zhongxin Liu, He Tao, Yusu Hong, David Lo, Xin Xia, Jianling Sun

Figure 1 for B4: Towards Optimal Assessment of Plausible Code Solutions with Plausible Tests

Figure 2 for B4: Towards Optimal Assessment of Plausible Code Solutions with Plausible Tests

Figure 3 for B4: Towards Optimal Assessment of Plausible Code Solutions with Plausible Tests

Figure 4 for B4: Towards Optimal Assessment of Plausible Code Solutions with Plausible Tests

Abstract:Selecting the best code solution from multiple generated ones is an essential task in code generation, which can be achieved by using some reliable validators (e.g., developer-written test cases) for assistance. Since reliable test cases are not always available and can be expensive to build in practice, researchers propose to automatically generate test cases to assess code solutions. However, when both code solutions and test cases are plausible and not reliable, selecting the best solution becomes challenging. Although some heuristic strategies have been proposed to tackle this problem, they lack a strong theoretical guarantee and it is still an open question whether an optimal selection strategy exists. Our work contributes in two ways. First, we show that within a Bayesian framework, the optimal selection strategy can be defined based on the posterior probability of the observed passing states between solutions and tests. The problem of identifying the best solution is then framed as an integer programming problem. Second, we propose an efficient approach for approximating this optimal (yet uncomputable) strategy, where the approximation error is bounded by the correctness of prior knowledge. We then incorporate effective prior knowledge to tailor code generation tasks. Both theoretical and empirical studies confirm that existing heuristics are limited in selecting the best solutions with plausible test cases. Our proposed approximated optimal strategy B4 significantly surpasses existing heuristics in selecting code solutions generated by large language models (LLMs) with LLM-generated tests, achieving a relative performance improvement by up to 50% over the strongest heuristic and 246% over the random selection in the most challenging scenarios. Our code is publicly available at https://github.com/ZJU-CTAG/B4.

* accepted by ASE' 24 (full paper)

Via

Access Paper or Ask Questions

VisionTS: Visual Masked Autoencoders Are Free-Lunch Zero-Shot Time Series Forecasters

Aug 30, 2024

Mouxiang Chen, Lefei Shen, Zhuo Li, Xiaoyun Joy Wang, Jianling Sun, Chenghao Liu

Figure 1 for VisionTS: Visual Masked Autoencoders Are Free-Lunch Zero-Shot Time Series Forecasters

Figure 2 for VisionTS: Visual Masked Autoencoders Are Free-Lunch Zero-Shot Time Series Forecasters

Figure 3 for VisionTS: Visual Masked Autoencoders Are Free-Lunch Zero-Shot Time Series Forecasters

Figure 4 for VisionTS: Visual Masked Autoencoders Are Free-Lunch Zero-Shot Time Series Forecasters

Abstract:Foundation models have emerged as a promising approach in time series forecasting (TSF). Existing approaches either fine-tune large language models (LLMs) or build large-scale time-series datasets to develop TSF foundation models. However, these methods face challenges due to the severe cross-domain gap or in-domain heterogeneity. In this paper, we explore a new road to building a TSF foundation model from rich and high-quality natural images, based on the intrinsic similarities between images and time series. To bridge the gap between the two domains, we reformulate the TSF task as an image reconstruction task, which is further processed by a visual masked autoencoder (MAE) self-supervised pre-trained on the ImageNet dataset. Surprisingly, without further adaptation in the time-series domain, the proposed VisionTS could achieve superior zero-shot forecasting performance compared to existing TSF foundation models. With minimal fine-tuning, VisionTS could further improve the forecasting and achieve state-of-the-art performance in most cases. These findings suggest that visual models could be a free lunch for TSF and highlight the potential for future cross-domain research between computer vision and TSF. Our code is publicly available at https://github.com/Keytoyze/VisionTS.

* 26 pages, 11 figures

Via

Access Paper or Ask Questions

PEMT: Multi-Task Correlation Guided Mixture-of-Experts Enables Parameter-Efficient Transfer Learning

Feb 23, 2024

Zhisheng Lin, Han Fu, Chenghao Liu, Zhuo Li, Jianling Sun

Figure 1 for PEMT: Multi-Task Correlation Guided Mixture-of-Experts Enables Parameter-Efficient Transfer Learning

Figure 2 for PEMT: Multi-Task Correlation Guided Mixture-of-Experts Enables Parameter-Efficient Transfer Learning

Figure 3 for PEMT: Multi-Task Correlation Guided Mixture-of-Experts Enables Parameter-Efficient Transfer Learning

Figure 4 for PEMT: Multi-Task Correlation Guided Mixture-of-Experts Enables Parameter-Efficient Transfer Learning

Abstract:Parameter-efficient fine-tuning (PEFT) has emerged as an effective method for adapting pre-trained language models to various tasks efficiently. Recently, there has been a growing interest in transferring knowledge from one or multiple tasks to the downstream target task to achieve performance improvements. However, current approaches typically either train adapters on individual tasks or distill shared knowledge from source tasks, failing to fully exploit task-specific knowledge and the correlation between source and target tasks. To overcome these limitations, we propose PEMT, a novel parameter-efficient fine-tuning framework based on multi-task transfer learning. PEMT extends the mixture-of-experts (MoE) framework to capture the transferable knowledge as a weighted combination of adapters trained on source tasks. These weights are determined by a gated unit, measuring the correlation between the target and each source task using task description prompt vectors. To fully exploit the task-specific knowledge, we also propose the Task Sparsity Loss to improve the sparsity of the gated unit. We conduct experiments on a broad range of tasks over 17 datasets. The experimental results demonstrate our PEMT yields stable improvements over full fine-tuning, and state-of-the-art PEFT and knowledge transferring methods on various tasks. The results highlight the effectiveness of our method which is capable of sufficiently exploiting the knowledge and correlation features across multiple tasks.

Via

Access Paper or Ask Questions

Advancing Graph Representation Learning with Large Language Models: A Comprehensive Survey of Techniques

Feb 04, 2024

Qiheng Mao, Zemin Liu, Chenghao Liu, Zhuo Li, Jianling Sun

Figure 1 for Advancing Graph Representation Learning with Large Language Models: A Comprehensive Survey of Techniques

Figure 2 for Advancing Graph Representation Learning with Large Language Models: A Comprehensive Survey of Techniques

Figure 3 for Advancing Graph Representation Learning with Large Language Models: A Comprehensive Survey of Techniques

Figure 4 for Advancing Graph Representation Learning with Large Language Models: A Comprehensive Survey of Techniques

Abstract:The integration of Large Language Models (LLMs) with Graph Representation Learning (GRL) marks a significant evolution in analyzing complex data structures. This collaboration harnesses the sophisticated linguistic capabilities of LLMs to improve the contextual understanding and adaptability of graph models, thereby broadening the scope and potential of GRL. Despite a growing body of research dedicated to integrating LLMs into the graph domain, a comprehensive review that deeply analyzes the core components and operations within these models is notably lacking. Our survey fills this gap by proposing a novel taxonomy that breaks down these models into primary components and operation techniques from a novel technical perspective. We further dissect recent literature into two primary components including knowledge extractors and organizers, and two operation techniques including integration and training stratigies, shedding light on effective model design and training strategies. Additionally, we identify and explore potential future research avenues in this nascent yet underexplored field, proposing paths for continued progress.

Via

Access Paper or Ask Questions

JumpCoder: Go Beyond Autoregressive Coder via Online Modification

Jan 15, 2024

Mouxiang Chen, Hao Tian, Zhongxin Liu, Xiaoxue Ren, Jianling Sun

Figure 1 for JumpCoder: Go Beyond Autoregressive Coder via Online Modification

Figure 2 for JumpCoder: Go Beyond Autoregressive Coder via Online Modification

Figure 3 for JumpCoder: Go Beyond Autoregressive Coder via Online Modification

Figure 4 for JumpCoder: Go Beyond Autoregressive Coder via Online Modification

Abstract:While existing code large language models (code LLMs) exhibit impressive capabilities in code generation, their autoregressive sequential generation inherently lacks reversibility. This limitation hinders them from timely correcting previous missing statements during coding as humans do, often leading to error propagation and suboptimal performance. We introduce JumpCoder, a novel modelagnostic framework that enables online modification and non-sequential generation to augment the code LLMs. The key idea behind JumpCoder is to insert new code into the currently generated code when necessary during generation, which is achieved through an auxiliary infilling model that works in tandem with the code LLM. Since identifying the best infill position beforehand is intractable, we adopt an infill-first, judge-later strategy, which experiments with filling at the $k$ most critical positions following the generation of each line, and uses an Abstract Syntax Tree (AST) parser alongside the Generation Model Scoring to effectively judge the validity of each potential infill. Extensive experiments using six state-of-the-art code LLMs across multiple benchmarks consistently indicate significant improvements over all baselines. Notably, JumpCoder assists code LLMs in achieving up to a 3.6% increase in Pass@1 for Python, 6.3% for Java, and 3.7% for C++ in the multilingual HumanEval benchmarks. Our code is public at https://github.com/Keytoyze/JumpCoder.

Via

Access Paper or Ask Questions

Calibration of Time-Series Forecasting Transformers: Detecting and Adapting Context-Driven Distribution Shift

Oct 23, 2023

Mouxiang Chen, Lefei Shen, Han Fu, Zhuo Li, Jianling Sun, Chenghao Liu

Figure 1 for Calibration of Time-Series Forecasting Transformers: Detecting and Adapting Context-Driven Distribution Shift

Figure 2 for Calibration of Time-Series Forecasting Transformers: Detecting and Adapting Context-Driven Distribution Shift

Figure 3 for Calibration of Time-Series Forecasting Transformers: Detecting and Adapting Context-Driven Distribution Shift

Figure 4 for Calibration of Time-Series Forecasting Transformers: Detecting and Adapting Context-Driven Distribution Shift

Abstract:Recent years have witnessed the success of introducing Transformers to time series forecasting. From a data generation perspective, we illustrate that existing Transformers are susceptible to distribution shifts driven by temporal contexts, whether observed or unobserved. Such context-driven distribution shift (CDS) introduces biases in predictions within specific contexts and poses challenges for conventional training paradigm. In this paper, we introduce a universal calibration methodology for the detection and adaptation of CDS with a trained Transformer model. To this end, we propose a novel CDS detector, termed the "residual-based CDS detector" or "Reconditionor", which quantifies the model's vulnerability to CDS by evaluating the mutual information between prediction residuals and their corresponding contexts. A high Reconditionor score indicates a severe susceptibility, thereby necessitating model adaptation. In this circumstance, we put forth a straightforward yet potent adapter framework for model calibration, termed the "sample-level contextualized adapter" or "SOLID". This framework involves the curation of a contextually similar dataset to the provided test sample and the subsequent fine-tuning of the model's prediction layer with a limited number of steps. Our theoretical analysis demonstrates that this adaptation strategy is able to achieve an optimal equilibrium between bias and variance. Notably, our proposed Reconditionor and SOLID are model-agnostic and readily adaptable to a wide range of Transformers. Extensive experiments show that SOLID consistently enhances the performance of current SOTA Transformers on real-world datasets, especially on cases with substantial CDS detected by the proposed Reconditionor, thus validate the effectiveness of the calibration approach.

Via

Access Paper or Ask Questions

ULTRA-DP: Unifying Graph Pre-training with Multi-task Graph Dual Prompt

Oct 23, 2023

Mouxiang Chen, Zemin Liu, Chenghao Liu, Jundong Li, Qiheng Mao, Jianling Sun

Figure 1 for ULTRA-DP: Unifying Graph Pre-training with Multi-task Graph Dual Prompt

Figure 2 for ULTRA-DP: Unifying Graph Pre-training with Multi-task Graph Dual Prompt

Figure 3 for ULTRA-DP: Unifying Graph Pre-training with Multi-task Graph Dual Prompt

Figure 4 for ULTRA-DP: Unifying Graph Pre-training with Multi-task Graph Dual Prompt

Abstract:Recent research has demonstrated the efficacy of pre-training graph neural networks (GNNs) to capture the transferable graph semantics and enhance the performance of various downstream tasks. However, the semantic knowledge learned from pretext tasks might be unrelated to the downstream task, leading to a semantic gap that limits the application of graph pre-training. To reduce this gap, traditional approaches propose hybrid pre-training to combine various pretext tasks together in a multi-task learning fashion and learn multi-grained knowledge, which, however, cannot distinguish tasks and results in some transferable task-specific knowledge distortion by each other. Moreover, most GNNs cannot distinguish nodes located in different parts of the graph, making them fail to learn position-specific knowledge and lead to suboptimal performance. In this work, inspired by the prompt-based tuning in natural language processing, we propose a unified framework for graph hybrid pre-training which injects the task identification and position identification into GNNs through a prompt mechanism, namely multi-task graph dual prompt (ULTRA-DP). Based on this framework, we propose a prompt-based transferability test to find the most relevant pretext task in order to reduce the semantic gap. To implement the hybrid pre-training tasks, beyond the classical edge prediction task (node-node level), we further propose a novel pre-training paradigm based on a group of $k$-nearest neighbors (node-group level). The combination of them across different scales is able to comprehensively express more structural semantics and derive richer multi-grained knowledge. Extensive experiments show that our proposed ULTRA-DP can significantly enhance the performance of hybrid pre-training methods and show the generalizability to other pre-training tasks and backbone architectures.

Via

Access Paper or Ask Questions