Kai Gao

DuraMark: Duration-Embedded Watermarking in LLM-based TTS

Jun 13, 2026

Zhenwei Mou, Weili Jiang, Liping Chen, Zhen-Hua Ling, Kong Aik Lee, Kai Gao, Boyu Zhao

Abstract:Large language model (LLM)-based text-to-speech (TTS) models have achieved remarkable voice cloning capabilities, raising concerns about potential deepfake misuse. Speech watermarking mitigates this by embedding traceable information into generated speech. Mainstream watermarking methods operate at the signal level (waveform or spectrogram), rendering the watermark vulnerable to generative attacks (e.g., neural codec and vocoder). To address this, we propose DuraMark, a robust information-level watermarking framework. It utilizes syllable duration editing to achieve watermark embedding. Specifically, DuraMark integrates a duration-controllable LLM-based TTS model to edit syllable durations during synthesis, coupled with a duration extractor to extract these durations for detection. Experiments demonstrate DuraMark's superior robustness against generative attacks, significantly outperforming signal-level baselines. Audio samples are available at https://muzw.github.io/duramark_demo/.

* Accepted to INTERSPEECH 2026. 5 pages, 1 figure. Audio samples: https://muzw.github.io/duramark_demo/

A governance horizon for ethical-use constraints in open-weight AI models

May 23, 2026

Weiwei Xu, Hengzhi Ye, Haoran Ye, Kai Gao, Vladimir Filkov, Minghui Zhou

Abstract:Ethical constraints on open-weight AI models are both a reflection of societal concerns and a foundation for AI governance policy. They are expected to propagate to downstream derivatives while implemented as voluntary metadata disclosures that must be restated at each generation of reuse. We audit 2,142,823 model repositories on Hugging Face Hub to test whether this disclosure-based governance infrastructure can sustain traceability across deep model lineages. Restriction evidence decays with a half-life of 1.31 derivation steps ($R^2$=0.98), and beyond seven downstream generations at least 80% of descendant models lack sufficient public evidence for a governance determination, a depth boundary we formalize as the governance horizon. Platform-level interventions to restore missing licence metadata reveal that policy design (not enforcement alone) is the binding factor: inheritance-only designs require near-complete enforcement to move the horizon, whereas a mandatory-declaration design that explicitly resolves orphan lineage components shifts the horizon already at moderate enforcement. The structural bottleneck is lineages with no inheritable upstream intent: such orphan components remain undecidable under any inheritance-only policy regardless of enforcement rate, and unresolved upstream nodes additionally create direct downstream undecidability bottlenecks that inheritance rules alone cannot recover. Comparison with PyPI, where governance signals are carried by explicit machine-readable declarations, corroborates that the collapse is topology-specific to open-weight derivation rather than inherent to open ecosystems. These results establish that disclosure-based governance has a shallow, structurally determined reach in open-weight AI, and that achieving deep supply-chain accountability requires provenance mechanisms propagating governance signals through derivation itself.
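
As a rough reading of the half-life figure quoted above, the following minimal sketch assumes pure exponential decay of restriction evidence with derivation depth. That functional form is an assumption made here for illustration; the abstract reports only the half-life (1.31 steps) and the fit quality. The function name `retained_fraction` is hypothetical. Note also that the fraction of evidence retained is not necessarily the same quantity as the share of models lacking sufficient evidence for a governance determination.

```python
def retained_fraction(steps: float, half_life: float = 1.31) -> float:
    """Fraction of restriction evidence remaining after `steps` derivation steps.

    Assumes pure exponential decay (an illustrative assumption, not
    necessarily the model fitted in the paper).
    """
    if steps < 0:
        raise ValueError("steps must be non-negative")
    if half_life <= 0:
        raise ValueError("half_life must be positive")
    # Each half-life halves the remaining evidence.
    return 0.5 ** (steps / half_life)


if __name__ == "__main__":
    # Print the assumed retention for derivation depths 0 through 7.
    for depth in range(8):
        print(depth, round(retained_fraction(depth), 4))
```

Under this assumed model, the retained fraction at depth seven is already well below 20%, which is at least consistent in direction with the horizon described in the abstract.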

DeepFAN, a transformer-based deep learning model for human-artificial intelligence collaborative assessment of incidental pulmonary nodules in CT scans: a multi-reader, multi-case trial

Mar 26, 2026

Zhenchen Zhu, Ge Hu, Weixiong Tan, Kai Gao, Chao Sun, Zhen Zhou, Kepei Xu, Wei Han, Meixia Shang, Xiaoming Qiu(+9 more)

Abstract:The widespread adoption of CT has notably increased the number of detected lung nodules. However, current deep learning methods for classifying benign and malignant nodules often fail to comprehensively integrate global and local features, and most of them have not been validated through clinical trials. To address this, we developed DeepFAN, a transformer-based model trained on over 10K pathology-confirmed nodules and further conducted a multi-reader, multi-case clinical trial to evaluate its efficacy in assisting junior radiologists. DeepFAN achieved diagnostic area under the curve (AUC) of 0.939 (95% CI 0.930-0.948) on an internal test set and 0.954 (95% CI 0.934-0.973) on the clinical trial dataset involving 400 cases across three independent medical institutions. Explainability analysis indicated higher contributions from global than local features. Twelve readers' average performance significantly improved by 10.9% (95% CI 8.3%-13.5%) in AUC, 10.0% (95% CI 8.9%-11.1%) in accuracy, 7.6% (95% CI 6.1%-9.2%) in sensitivity, and 12.6% (95% CI 10.9%-14.3%) in specificity (P<0.001 for all). Nodule-level inter-reader diagnostic consistency improved from fair to moderate (overall k: 0.313 vs. 0.421; P=0.019). In conclusion, DeepFAN effectively assisted junior radiologists and may help homogenize diagnostic quality and reduce unnecessary follow-up of indeterminate pulmonary nodules. Chinese Clinical Trial Registry: ChiCTR2400084624.

* 28 pages for main text and 37 pages for supplementary information, 7 figures in main text and 9 figures in supplementary information

Infinite-Instruct: Synthesizing Scaling Code instruction Data with Bidirectional Synthesis and Static Verification

May 29, 2025

Wenjing Xing, Wenke Lu, Yeheng Duan, Bing Zhao, Zhenghui Kang, Yaolong Wang, Kai Gao, Lei Qiao

Abstract:Traditional code instruction data synthesis methods suffer from limited diversity and poor logic. We introduce Infinite-Instruct, an automated framework for synthesizing high-quality question-answer pairs, designed to enhance the code generation capabilities of large language models (LLMs). The framework focuses on improving the internal logic of synthesized problems and the quality of synthesized code. First, "Reverse Construction" transforms code snippets into diverse programming problems. Then, through "Backfeeding Construction," keywords in programming problems are structured into a knowledge graph to reconstruct them into programming problems with stronger internal logic. Finally, a cross-lingual static code analysis pipeline filters invalid samples to ensure data quality. Experiments show that on mainstream code generation benchmarks, our fine-tuned models achieve an average performance improvement of 21.70% on 7B-parameter models and 36.95% on 32B-parameter models. Using less than one-tenth of the instruction fine-tuning data, we achieve performance comparable to Qwen-2.5-Coder-Instruct. Infinite-Instruct provides a scalable solution for LLM training in programming. We open-source the datasets used in the experiments, including both unfiltered versions and versions filtered via static analysis. The data are available at https://github.com/xingwenjing417/Infinite-Instruct-dataset

MuST: Multi-Head Skill Transformer for Long-Horizon Dexterous Manipulation with Skill Progress

Feb 04, 2025

Kai Gao, Fan Wang, Erica Aduh, Dylan Randle, Jane Shi

Abstract:Robot picking and packing tasks require dexterous manipulation skills, such as rearranging objects to establish a good grasping pose, or placing and pushing items to achieve tight packing. These tasks are challenging for robots due to the complexity and variability of the required actions. To tackle the difficulty of learning and executing long-horizon tasks, we propose a novel framework called the Multi-Head Skill Transformer (MuST). This model is designed to learn and sequentially chain together multiple motion primitives (skills), enabling robots to perform complex sequences of actions effectively. MuST introduces a "progress value" for each skill, guiding the robot on which skill to execute next and ensuring smooth transitions between skills. Additionally, our model is capable of expanding its skill set and managing various sequences of sub-tasks efficiently. Extensive experiments in both simulated and real-world environments demonstrate that MuST significantly enhances the robot's ability to perform long-horizon dexterous manipulation tasks.

* Accepted by ICRA 2025 (2025 IEEE International Conference on Robotics & Automation)

Imitation Game for Adversarial Disillusion with Multimodal Generative Chain-of-Thought Role-Play

Jan 31, 2025

Ching-Chun Chang, Fan-Yun Chen, Shih-Hong Gu, Kai Gao, Hanrui Wang, Isao Echizen

Abstract:As the cornerstone of artificial intelligence, machine perception confronts a fundamental threat posed by adversarial illusions. These adversarial attacks manifest in two primary forms: deductive illusion, where specific stimuli are crafted based on the victim model's general decision logic, and inductive illusion, where the victim model's general decision logic is shaped by specific stimuli. The former exploits the model's decision boundaries to create a stimulus that, when applied, interferes with its decision-making process. The latter reinforces a conditioned reflex in the model, embedding a backdoor during its learning phase that, when triggered by a stimulus, causes aberrant behaviours. The multifaceted nature of adversarial illusions calls for a unified defence framework, addressing vulnerabilities across various forms of attack. In this study, we propose a disillusion paradigm based on the concept of an imitation game. At the heart of the imitation game lies a multimodal generative agent, steered by chain-of-thought reasoning, which observes, internalises and reconstructs the semantic essence of a sample, liberated from the classic pursuit of reversing the sample to its original state. As a proof of concept, we conduct experimental simulations using a multimodal generative dialogue agent and evaluate the methodology under a variety of attack scenarios.

Tabletop Object Rearrangement: Structure, Complexity, and Efficient Combinatorial Search-Based Solutions

Dec 19, 2024

Kai Gao

Abstract:This thesis provides an in-depth structural analysis and efficient algorithmic solutions for tabletop object rearrangement with overhand grasps (TORO), a foundational task in advancing intelligent robotic manipulation. Rearranging multiple objects in a confined workspace presents two primary challenges: sequencing actions to minimize pick-and-place operations - an NP-hard problem in TORO - and determining temporary object placements ("buffer poses") within a cluttered environment, which is essential yet highly complex. For TORO with available external free space, this work investigates the minimum buffer space, or "running buffer size," required for temporary relocations, presenting both theoretical insights and exact algorithms. For TORO without external free space, the concept of lazy buffer verification is introduced, with its efficiency evaluated across various manipulator configurations, including single-arm, dual-arm, and mobile manipulators.

* PhD Thesis. arXiv admin note: substantial text overlap with arXiv:2304.01764
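
To make the "running buffer size" notion in the abstract above concrete, here is a minimal sketch under a simplified model that is an assumption of this example, not the thesis's algorithm: each object has a set of blockers (objects whose start poses overlap its goal), objects leave their start poses one at a time, an object goes to an external buffer if any blocker is still at its start pose, and buffered objects move to their goals as soon as they are unblocked. The names `running_buffer` and `min_running_buffer` are hypothetical, and the brute-force search is only practical for a handful of objects.

```python
from itertools import permutations
from typing import Dict, Hashable, Sequence, Set

Deps = Dict[Hashable, Set[Hashable]]  # object -> objects blocking its goal


def running_buffer(order: Sequence[Hashable], deps: Deps) -> int:
    """Peak number of objects held in buffers for a given pick order."""
    left_start: Set[Hashable] = set()
    in_buffer: Set[Hashable] = set()
    peak = 0
    for obj in order:
        left_start.add(obj)  # picking obj vacates its start pose
        if not deps.get(obj, set()) <= left_start:
            # Goal still blocked: place obj in a temporary buffer pose.
            in_buffer.add(obj)
            peak = max(peak, len(in_buffer))
        # Buffered objects whose blockers have all left go to their goals.
        in_buffer = {b for b in in_buffer if not deps.get(b, set()) <= left_start}
    return peak


def min_running_buffer(deps: Deps) -> int:
    """Exhaustively search all pick orders (exponential; toy sizes only)."""
    objects = set(deps) | {b for blockers in deps.values() for b in blockers}
    return min(running_buffer(order, deps) for order in permutations(sorted(objects)))


if __name__ == "__main__":
    cycle = {"A": {"B"}, "B": {"C"}, "C": {"A"}}
    print(min_running_buffer(cycle))
```

In the three-object cycle used in the usage example, one object must be buffered to break the cycle, after which the remaining objects can go directly to their goals.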

A First Look at License Compliance Capability of LLMs in Code Generation

Aug 05, 2024

Weiwei Xu, Kai Gao, Hao He, Minghui Zhou

Abstract:Recent advances in Large Language Models (LLMs) have revolutionized code generation, leading to widespread adoption of AI coding tools by developers. However, LLMs can generate license-protected code without providing the necessary license information, leading to potential intellectual property violations during software production. This paper addresses the critical, yet underexplored, issue of license compliance in LLM-generated code by establishing a benchmark to evaluate the ability of LLMs to provide accurate license information for their generated code. To establish this benchmark, we conduct an empirical study to identify a reasonable standard for "striking similarity" that excludes the possibility of independent creation, indicating a copy relationship between the LLM output and certain open-source code. Based on this standard, we propose an evaluation benchmark LiCoEval, to evaluate the license compliance capabilities of LLMs. Using LiCoEval, we evaluate 14 popular LLMs, finding that even top-performing LLMs produce a non-negligible proportion (0.88% to 2.01%) of code strikingly similar to existing open-source implementations. Notably, most LLMs fail to provide accurate license information, particularly for code under copyleft licenses. These findings underscore the urgent need to enhance LLM compliance capabilities in code generation tasks. Our study provides a foundation for future research and development to improve license compliance in AI-assisted software development, contributing to both the protection of open-source software copyrights and the mitigation of legal risks for LLM users.

Unsupervised Multimodal Clustering for Semantics Discovery in Multimodal Utterances

May 21, 2024

Hanlei Zhang, Hua Xu, Fei Long, Xin Wang, Kai Gao

Abstract:Discovering the semantics of multimodal utterances is essential for understanding human language and enhancing human-machine interactions. Existing methods manifest limitations in leveraging nonverbal information for discerning complex semantics in unsupervised scenarios. This paper introduces a novel unsupervised multimodal clustering method (UMC), making a pioneering contribution to this field. UMC introduces a unique approach to constructing augmentation views for multimodal data, which are then used to perform pre-training to establish well-initialized representations for subsequent clustering. An innovative strategy is proposed to dynamically select high-quality samples as guidance for representation learning, gauged by the density of each sample's nearest neighbors. In addition, it automatically determines the optimal value of the top-$K$ parameter in each cluster to refine sample selection. Finally, both high- and low-quality samples are used to learn representations conducive to effective clustering. We build baselines on benchmark multimodal intent and dialogue act datasets. UMC shows remarkable improvements of 2-6% in clustering metrics over state-of-the-art methods, marking the first successful endeavor in this domain. The complete code and data are available at https://github.com/thuiar/UMC.

* Accepted by ACL 2024, Main Conference, Long Paper
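
The density-based sample selection mentioned in the abstract above can be illustrated with a minimal sketch. Everything here is an assumption for illustration rather than the UMC implementation: density is taken as the reciprocal of the mean Euclidean distance to a sample's `k` nearest neighbors within one cluster, the selection size `top_k` is fixed by the caller (the abstract says UMC determines it automatically), and the function names are hypothetical. The actual code is at the repository linked in the abstract.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def knn_density(points: Sequence[Point], k: int) -> List[float]:
    """Density of each point: 1 / mean distance to its k nearest neighbors."""
    if len(points) < 2:
        raise ValueError("need at least two points")
    if k < 1:
        raise ValueError("k must be at least 1")
    densities = []
    for i, p in enumerate(points):
        # Distances from p to every other point, nearest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        nearest = dists[:k]
        mean_dist = sum(nearest) / len(nearest)
        densities.append(1.0 / (mean_dist + 1e-12))  # epsilon guards duplicates
    return densities


def select_high_quality(points: Sequence[Point], k: int, top_k: int) -> List[int]:
    """Indices of the top_k densest points in one cluster, in ascending order."""
    dens = knn_density(points, k)
    ranked = sorted(range(len(points)), key=lambda i: dens[i], reverse=True)
    return sorted(ranked[:top_k])


if __name__ == "__main__":
    cluster = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
    print(select_high_quality(cluster, k=2, top_k=4))
```

In the usage example, the isolated point far from the rest has the lowest density and is left out of the selected set.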

Toward Holistic Planning and Control Optimization for Dual-Arm Rearrangement

Apr 10, 2024

Kai Gao, Zihe Ye, Duo Zhang, Baichuan Huang, Jingjin Yu

Abstract:Long-horizon task and motion planning (TAMP) is notoriously difficult to solve, let alone optimally, due to the tight coupling between the interleaved (discrete) task and (continuous) motion planning phases, where each phase on its own is frequently an NP-hard or even PSPACE-hard computational challenge. In this study, we tackle the even more challenging goal of jointly optimizing task and motion plans for a real dual-arm system in which the two arms operate in close vicinity to solve highly constrained tabletop multi-object rearrangement problems. Toward that, we construct a tightly integrated planning and control optimization pipeline, Makespan-Optimized Dual-Arm Planner (MODAP), that combines novel sampling techniques for task planning with state-of-the-art trajectory optimization techniques. Compared to the previous state of the art, MODAP produces task and motion plans that better coordinate a dual-arm system, delivering significantly improved execution times while simultaneously ensuring that the resulting time-parameterized trajectory conforms to specified acceleration and jerk limits.

* First three authors made equal contributions to this study
