South China University of Technology
Abstract:Recent advancements in large language models (LLMs) have significantly improved performance on the Text-to-SQL task. However, prior approaches typically rely on static, pre-processed database information provided at inference time, which limits the model's ability to fully understand the database contents. Without dynamic interaction, LLMs are constrained to fixed, human-provided context and cannot autonomously explore the underlying data. To address this limitation, we propose SDE-SQL, a framework that enables large language models to perform self-driven exploration of databases during inference. This is accomplished by generating and executing SQL probes, which allow the model to actively retrieve information from the database and iteratively update its understanding of the data. Unlike prior methods, SDE-SQL operates in a zero-shot setting, without relying on any question-SQL pairs as in-context demonstrations. When evaluated on the BIRD benchmark with Qwen2.5-72B-Instruct, SDE-SQL achieves an 8.02% relative improvement in execution accuracy over the vanilla Qwen2.5-72B-Instruct baseline, establishing a new state-of-the-art among methods based on open-source models without supervised fine-tuning (SFT) or model ensembling. Moreover, with SFT, the performance of SDE-SQL can be further enhanced, yielding an additional 0.52% improvement.
Abstract:In Text-to-SQL, execution feedback is essential for guiding large language models (LLMs) to reason accurately and generate reliable SQL queries. However, existing methods treat execution feedback solely as a post-hoc signal for correction or selection, failing to integrate it into the generation process. This limitation hinders their ability to address reasoning errors as they occur, ultimately reducing query accuracy and robustness. To address this issue, we propose ReEx-SQL (Reasoning with Execution-Aware Reinforcement Learning), a framework for Text-to-SQL that enables models to interact with the database during decoding and dynamically adjust their reasoning based on execution feedback. ReEx-SQL introduces an execution-aware reasoning paradigm that interleaves intermediate SQL execution into reasoning paths, facilitating context-sensitive revisions. It achieves this through structured prompts with markup tags and a stepwise rollout strategy that integrates execution feedback into each stage of generation. To supervise policy learning, we develop a composite reward function that includes an exploration reward, explicitly encouraging effective database interaction. Additionally, ReEx-SQL adopts a tree-based decoding strategy to support exploratory reasoning, enabling dynamic expansion of alternative reasoning paths. Notably, ReEx-SQL achieves 88.8% on Spider and 64.9% on BIRD at the 7B scale, surpassing the standard reasoning baseline by 2.7% and 2.6%, respectively. It also shows robustness, achieving 85.2% on Spider-Realistic with leading performance. In addition, its tree-structured decoding improves efficiency and performance over linear decoding, reducing inference time by 51.9% on the BIRD development set.
Abstract:The discovery of novel small molecule drugs remains a critical scientific challenge with far-reaching implications for treating diseases and advancing human health. Traditional drug development--especially for small molecule therapeutics--is a highly complex, resource-intensive, and time-consuming process that requires multidisciplinary collaboration. Recent breakthroughs in artificial intelligence (AI), particularly the rise of large language models (LLMs), present a transformative opportunity to streamline and accelerate this process. In this paper, we introduce PharmAgents, a virtual pharmaceutical ecosystem driven by LLM-based multi-agent collaboration. PharmAgents simulates the full drug discovery workflow--from target discovery to preclinical evaluation--by integrating explainable, LLM-driven agents equipped with specialized machine learning models and computational tools. Through structured knowledge exchange and automated optimization, PharmAgents identifies potential therapeutic targets, discovers promising lead compounds, enhances binding affinity and key molecular properties, and performs in silico analyses of toxicity and synthetic feasibility. Additionally, the system supports interpretability, agent interaction, and self-evolvement, enabling it to refine future drug designs based on prior experience. By showcasing the potential of LLM-powered multi-agent systems in drug discovery, this work establishes a new paradigm for autonomous, explainable, and scalable pharmaceutical research, with future extensions toward comprehensive drug lifecycle management.
Abstract:We introduce a practical real-time neural video codec (NVC) designed to deliver high compression ratio, low latency and broad versatility. In practice, the coding speed of NVCs depends on 1) computational costs, and 2) non-computational operational costs, such as memory I/O and the number of function calls. While most efficient NVCs prioritize reducing computational cost, we identify operational cost as the primary bottleneck to achieving higher coding speed. Leveraging this insight, we introduce a set of efficiency-driven design improvements focused on minimizing operational costs. Specifically, we employ implicit temporal modeling to eliminate complex explicit motion modules, and use single low-resolution latent representations rather than progressive downsampling. These innovations significantly accelerate NVC without sacrificing compression quality. Additionally, we implement model integerization for consistent cross-device coding and a module-bank-based rate control scheme to improve practical adaptability. Experiments show our proposed DCVC-RT achieves an impressive average encoding/decoding speed at 125.2/112.8 fps (frames per second) for 1080p video, while saving an average of 21% in bitrate compared to H.266/VTM. The code is available at https://github.com/microsoft/DCVC.
Abstract:In recent work on time-series prediction, Transformers and even large language models have garnered significant attention due to their strong capabilities in sequence modeling. However, in practical deployments, time-series prediction often requires operation in resource-constrained environments, such as edge devices, which are unable to handle the computational overhead of large models. To address such scenarios, some lightweight models have been proposed, but they exhibit poor performance on non-stationary sequences. In this paper, we propose $\textit{SWIFT}$, a lightweight model that is not only powerful, but also efficient in deployment and inference for Long-term Time Series Forecasting (LTSF). Our model is based on three key points: (i) Utilizing wavelet transform to perform lossless downsampling of time series. (ii) Achieving cross-band information fusion with a learnable filter. (iii) Using only one shared linear layer or one shallow MLP for sub-series' mapping. We conduct comprehensive experiments, and the results show that $\textit{SWIFT}$ achieves state-of-the-art (SOTA) performance on multiple datasets, offering a promising method for edge computing and deployment in this task. Moreover, it is noteworthy that the number of parameters in $\textit{SWIFT-Linear}$ is only 25\% of what it would be with a single-layer linear model for time-domain prediction. Our code is available at https://github.com/LancelotXWX/SWIFT.
Abstract:Permanent magnet tracking using the external sensor array is crucial for the accurate localization of wireless capsule endoscope robots. Traditional tracking algorithms, based on the magnetic dipole model and Levenberg-Marquardt (LM) algorithm, face challenges related to computational delays and the need for initial position estimation. More recently proposed neural network-based approaches often require extensive hardware calibration and real-world data collection, which are time-consuming and labor-intensive. To address these challenges, we propose MobilePosenet, a lightweight neural network architecture that leverages depthwise separable convolutions to minimize computational cost and a channel attention mechanism to enhance localization accuracy. Besides, the inputs to the network integrate the sensors' coordinate information and random noise, compensating for the discrepancies between the theoretical model and the actual magnetic fields and thus allowing MobilePosenet to be trained entirely on theoretical data. Experimental evaluations conducted in a \(90 \times 90 \times 80\) mm workspace demonstrate that MobilePosenet exhibits excellent 5-DOF localization accuracy ($1.54 \pm 1.03$ mm and $2.24 \pm 1.84^{\circ}$) and inference speed (0.9 ms) against state-of-the-art methods trained on real-world data. Since network training relies solely on theoretical data, MobilePosenet can eliminate the hardware calibration and real-world data collection process, improving the generalizability of this permanent magnet localization method and the potential for rapid adoption in different clinical settings.
Abstract:Ultrasound (US)-guided needle insertion is widely employed in percutaneous interventions. However, providing feedback on the needle tip position via US image presents challenges due to noise, artifacts, and the thin imaging plane of US, which degrades needle features and leads to intermittent tip visibility. In this paper, a Mamba-based US needle tracker MambaXCTrack utilizing structured state space models cross-correlation (SSMX-Corr) and implicit motion prompt is proposed, which is the first application of Mamba in US needle tracking. The SSMX-Corr enhances cross-correlation by long-range modeling and global searching of distant semantic features between template and search maps, benefiting the tracking under noise and artifacts by implicitly learning potential distant semantic cues. By combining with cross-map interleaved scan (CIS), local pixel-wise interaction with positional inductive bias can also be introduced to SSMX-Corr. The implicit low-level motion descriptor is proposed as a non-visual prompt to enhance tracking robustness, addressing the intermittent tip visibility problem. Extensive experiments on a dataset with motorized needle insertion in both phantom and tissue samples demonstrate that the proposed tracker outperforms other state-of-the-art trackers while ablation studies further highlight the effectiveness of each proposed tracking module.
Abstract:Recent In-Context Learning based methods have achieved remarkable success in Text-to-SQL task. However, there is still a large gap between the performance of these models and human performance on datasets with complex database schema and difficult questions, such as BIRD. Besides, existing work has neglected to supervise intermediate steps when solving questions iteratively with question decomposition methods, and the schema linking methods used in these works are very rudimentary. To address these issues, we propose MAG-SQL, a multi-agent generative approach with soft schema linking and iterative Sub-SQL refinement. In our framework, an entity-based method with tables' summary is used to select the columns in database, and a novel targets-conditions decomposition method is introduced to decompose those complex questions. Additionally, we build a iterative generating module which includes a Sub-SQL Generator and Sub-SQL Refiner, introducing external oversight for each step of generation. Through a series of ablation studies, the effectiveness of each agent in our framework has been demonstrated. When evaluated on the BIRD benchmark with GPT-4, MAG-SQL achieves an execution accuracy of 61.08\%, compared to the baseline accuracy of 46.35\% for vanilla GPT-4 and the baseline accuracy of 57.56\% for MAC-SQL. Besides, our approach makes similar progress on Spider.
Abstract:Significant progress has been made in scene text detection models since the rise of deep learning, but scene text layout analysis, which aims to group detected text instances as paragraphs, has not kept pace. Previous works either treated text detection and grouping using separate models, or train a model from scratch while using a unified one. All of them have not yet made full use of the already well-trained text detectors and easily obtainable detection datasets. In this paper, we present Text Grouping Adapter (TGA), a module that can enable the utilization of various pre-trained text detectors to learn layout analysis, allowing us to adopt a well-trained text detector right off the shelf or just fine-tune it efficiently. Designed to be compatible with various text detector architectures, TGA takes detected text regions and image features as universal inputs to assemble text instance features. To capture broader contextual information for layout analysis, we propose to predict text group masks from text instance features by one-to-many assignment. Our comprehensive experiments demonstrate that, even with frozen pre-trained models, incorporating our TGA into various pre-trained text detectors and text spotters can achieve superior layout analysis performance, simultaneously inheriting generalized text detection ability from pre-training. In the case of full parameter fine-tuning, we can further improve layout analysis performance.
Abstract:Video-Language Models (VLMs), powered by the advancements in Large Language Models (LLMs), are charting new frontiers in video understanding. A pivotal challenge is the development of an efficient method to encapsulate video content into a set of representative tokens to align with LLMs. In this work, we introduce Slot-VLM, a novel framework designed to generate semantically decomposed video tokens, in terms of object-wise and event-wise visual representations, to facilitate LLM inference. Particularly, we design a SlowFast Slots module, i.e., SF-Slots, that adaptively aggregates the dense video tokens from the CLIP vision encoder to a set of representative slots. In order to take into account both the spatial object details and the varied temporal dynamics, SF-Slots is built with a dual-branch structure. The Slow-Slots branch focuses on extracting object-centric slots from features at high spatial resolution but low (slow) frame sample rate, emphasizing detailed object information. Conversely, Fast-Slots branch is engineered to learn event-centric slots from high temporal sample rate but low spatial resolution features. These complementary slots are combined to form the vision context, serving as the input to the LLM for efficient question answering. Our experimental results demonstrate the effectiveness of our Slot-VLM, which achieves the state-of-the-art performance on video question-answering.