Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Rui Chen

AllTact Fin Ray: A Compliant Robot Gripper with Omni-Directional Tactile Sensing

Apr 25, 2025

Siwei Liang, Yixuan Guan, Jing Xu, Hongyu Qian, Xiangjun Zhang, Dan Wu, Wenbo Ding, Rui Chen

Figure 1 for AllTact Fin Ray: A Compliant Robot Gripper with Omni-Directional Tactile Sensing

Figure 2 for AllTact Fin Ray: A Compliant Robot Gripper with Omni-Directional Tactile Sensing

Figure 3 for AllTact Fin Ray: A Compliant Robot Gripper with Omni-Directional Tactile Sensing

Figure 4 for AllTact Fin Ray: A Compliant Robot Gripper with Omni-Directional Tactile Sensing

Abstract:Tactile sensing plays a crucial role in robot grasping and manipulation by providing essential contact information between the robot and the environment. In this paper, we present AllTact Fin Ray, a novel compliant gripper design with omni-directional and local tactile sensing capabilities. The finger body is unibody-casted using transparent elastic silicone, and a camera positioned at the base of the finger captures the deformation of the whole body and the contact face. Due to the global deformation of the adaptive structure, existing vision-based tactile sensing approaches that assume constant illumination are no longer applicable. To address this, we propose a novel sensing method where the global deformation is first reconstructed from the image using edge features and spatial constraints. Then, detailed contact geometry is computed from the brightness difference against a dynamically retrieved reference image. Extensive experiments validate the effectiveness of our proposed gripper design and sensing method in contact detection, force estimation, object grasping, and precise manipulation.

Via

Access Paper or Ask Questions

BadVideo: Stealthy Backdoor Attack against Text-to-Video Generation

Apr 23, 2025

Ruotong Wang, Mingli Zhu, Jiarong Ou, Rui Chen, Xin Tao, Pengfei Wan, Baoyuan Wu

Abstract:Text-to-video (T2V) generative models have rapidly advanced and found widespread applications across fields like entertainment, education, and marketing. However, the adversarial vulnerabilities of these models remain rarely explored. We observe that in T2V generation tasks, the generated videos often contain substantial redundant information not explicitly specified in the text prompts, such as environmental elements, secondary objects, and additional details, providing opportunities for malicious attackers to embed hidden harmful content. Exploiting this inherent redundancy, we introduce BadVideo, the first backdoor attack framework tailored for T2V generation. Our attack focuses on designing target adversarial outputs through two key strategies: (1) Spatio-Temporal Composition, which combines different spatiotemporal features to encode malicious information; (2) Dynamic Element Transformation, which introduces transformations in redundant elements over time to convey malicious information. Based on these strategies, the attacker's malicious target seamlessly integrates with the user's textual instructions, providing high stealthiness. Moreover, by exploiting the temporal dimension of videos, our attack successfully evades traditional content moderation systems that primarily analyze spatial information within individual frames. Extensive experiments demonstrate that BadVideo achieves high attack success rates while preserving original semantics and maintaining excellent performance on clean inputs. Overall, our work reveals the adversarial vulnerability of T2V models, calling attention to potential risks and misuse. Our project page is at https://wrt2000.github.io/BadVideo2025/.

Via

Access Paper or Ask Questions

The Tenth NTIRE 2025 Efficient Super-Resolution Challenge Report

Apr 14, 2025

Bin Ren, Hang Guo, Lei Sun, Zongwei Wu, Radu Timofte, Yawei Li, Yao Zhang, Xinning Chai, Zhengxue Cheng, Yingsheng Qin(+136 more)

Abstract:This paper presents a comprehensive review of the NTIRE 2025 Challenge on Single-Image Efficient Super-Resolution (ESR). The challenge aimed to advance the development of deep models that optimize key computational metrics, i.e., runtime, parameters, and FLOPs, while achieving a PSNR of at least 26.90 dB on the $\operatorname{DIV2K\_LSDIR\_valid}$ dataset and 26.99 dB on the $\operatorname{DIV2K\_LSDIR\_test}$ dataset. A robust participation saw \textbf{244} registered entrants, with \textbf{43} teams submitting valid entries. This report meticulously analyzes these methods and results, emphasizing groundbreaking advancements in state-of-the-art single-image ESR techniques. The analysis highlights innovative approaches and establishes benchmarks for future research in the field.

* Accepted by CVPR2025 NTIRE Workshop, Efficient Super-Resolution Challenge Report. 50 pages

Via

Access Paper or Ask Questions

FingER: Content Aware Fine-grained Evaluation with Reasoning for AI-Generated Videos

Apr 14, 2025

Rui Chen, Lei Sun, Jing Tang, Geng Li, Xiangxiang Chu

Figure 1 for FingER: Content Aware Fine-grained Evaluation with Reasoning for AI-Generated Videos

Figure 2 for FingER: Content Aware Fine-grained Evaluation with Reasoning for AI-Generated Videos

Figure 3 for FingER: Content Aware Fine-grained Evaluation with Reasoning for AI-Generated Videos

Figure 4 for FingER: Content Aware Fine-grained Evaluation with Reasoning for AI-Generated Videos

Abstract:Recent advances in video generation have posed great challenges in the assessment of AI-generated content, particularly with the emergence of increasingly sophisticated models. The various inconsistencies and defects observed in such videos are inherently complex, making overall scoring notoriously difficult. In this paper, we emphasize the critical importance of integrating fine-grained reasoning into video evaluation, and we propose $\textbf{F}$ing$\textbf{ER}$, a novel entity-level reasoning evaluation framework that first automatically generates $\textbf{F}$ine-grained $\textbf{E}$ntity-level questions, and then answers those questions by a $\textbf{R}$easoning model with scores, which can be subsequently weighted summed to an overall score for different applications. Specifically, we leverage LLMs to derive entity-level questions across five distinct perspectives, which (i) often focus on some specific entities of the content, thereby making answering or scoring much easier by MLLMs, and (ii) are more interpretable. Then we construct a FingER dataset, consisting of approximately 3.3k videos and corresponding 60k fine-grained QA annotations, each with detailed reasons. Based on that, we further investigate various training protocols to best incentivize the reasoning capability of MLLMs for correct answer prediction. Extensive experiments demonstrate that a reasoning model trained using Group Relative Policy Optimization (GRPO) with a cold-start strategy achieves the best performance. Notably, our model surpasses existing methods by a relative margin of $11.8\%$ on GenAI-Bench and $5.5\%$ on MonetBench with only 3.3k training videos, which is at most one-tenth of the training samples utilized by other methods. Our code and dataset will be released soon.

* 10 pages, 4 figures

Via

Access Paper or Ask Questions

Parametric Near-Field MMSE Channel Estimation for sub-THz XL-MIMO Systems

Apr 14, 2025

Wen-Xuan Long, Marco Moretti, Michele Morelli, Luca Sanguinetti, Rui Chen

Figure 1 for Parametric Near-Field MMSE Channel Estimation for sub-THz XL-MIMO Systems

Figure 2 for Parametric Near-Field MMSE Channel Estimation for sub-THz XL-MIMO Systems

Figure 3 for Parametric Near-Field MMSE Channel Estimation for sub-THz XL-MIMO Systems

Figure 4 for Parametric Near-Field MMSE Channel Estimation for sub-THz XL-MIMO Systems

Abstract:Accurate channel estimation is essential for reliable communication in sub-THz extremely large (XL) MIMO systems. Deploying XL-MIMO in high-frequency bands not only increases the number of antennas, but also fundamentally alters channel propagation characteristics, placing the user equipments (UE) in the radiative near-field of the base station. This paper proposes a parametric estimation method using the multiple signal classification (MUSIC) algorithm to extract UE location data from uplink pilot signals. These parameters are used to reconstruct the spatial correlation matrix, followed by an approximation of the minimum mean square error (MMSE) channel estimator. Numerical results show that the proposed method outperforms the least-squares (LS) estimator in terms of the normalized mean-square error (NMSE), even without prior UE location knowledge.

Via

Access Paper or Ask Questions

ViTaMIn: Learning Contact-Rich Tasks Through Robot-Free Visuo-Tactile Manipulation Interface

Apr 08, 2025

Fangchen Liu, Chuanyu Li, Yihua Qin, Ankit Shaw, Jing Xu, Pieter Abbeel, Rui Chen

Abstract:Tactile information plays a crucial role for humans and robots to interact effectively with their environment, particularly for tasks requiring the understanding of contact properties. Solving such dexterous manipulation tasks often relies on imitation learning from demonstration datasets, which are typically collected via teleoperation systems and often demand substantial time and effort. To address these challenges, we present ViTaMIn, an embodiment-free manipulation interface that seamlessly integrates visual and tactile sensing into a hand-held gripper, enabling data collection without the need for teleoperation. Our design employs a compliant Fin Ray gripper with tactile sensing, allowing operators to perceive force feedback during manipulation for more intuitive operation. Additionally, we propose a multimodal representation learning strategy to obtain pre-trained tactile representations, improving data efficiency and policy robustness. Experiments on seven contact-rich manipulation tasks demonstrate that ViTaMIn significantly outperforms baseline methods, demonstrating its effectiveness for complex manipulation tasks.

Via

Access Paper or Ask Questions

MOSAIC: Generating Consistent, Privacy-Preserving Scenes from Multiple Depth Views in Multi-Room Environments

Mar 18, 2025

Zhixuan Liu, Haokun Zhu, Rui Chen, Jonathan Francis, Soonmin Hwang, Ji Zhang, Jean Oh

Abstract:We introduce a novel diffusion-based approach for generating privacy-preserving digital twins of multi-room indoor environments from depth images only. Central to our approach is a novel Multi-view Overlapped Scene Alignment with Implicit Consistency (MOSAIC) model that explicitly considers cross-view dependencies within the same scene in the probabilistic sense. MOSAIC operates through a novel inference-time optimization that avoids error accumulation common in sequential or single-room constraint in panorama-based approaches. MOSAIC scales to complex scenes with zero extra training and provably reduces the variance during denoising processes when more overlapping views are added, leading to improved generation quality. Experiments show that MOSAIC outperforms state-of-the-art baselines on image fidelity metrics in reconstructing complex multi-room environments. Project page is available at: https://mosaic-cmubig.github.io

Via

Access Paper or Ask Questions

Mixed-granularity Implicit Representation for Continuous Hyperspectral Compressive Reconstruction

Mar 17, 2025

Jianan Li, Huan Chen, Wangcai Zhao, Rui Chen, Tingfa Xu

Abstract:Hyperspectral Images (HSIs) are crucial across numerous fields but are hindered by the long acquisition times associated with traditional spectrometers. The Coded Aperture Snapshot Spectral Imaging (CASSI) system mitigates this issue through a compression technique that accelerates the acquisition process. However, reconstructing HSIs from compressed data presents challenges due to fixed spatial and spectral resolution constraints. This study introduces a novel method using implicit neural representation for continuous hyperspectral image reconstruction. We propose the Mixed Granularity Implicit Representation (MGIR) framework, which includes a Hierarchical Spectral-Spatial Implicit Encoder for efficient multi-scale implicit feature extraction. This is complemented by a Mixed-Granularity Local Feature Aggregator that adaptively integrates local features across scales, combined with a decoder that merges coordinate information for precise reconstruction. By leveraging implicit neural representations, the MGIR framework enables reconstruction at any desired spatial-spectral resolution, significantly enhancing the flexibility and adaptability of the CASSI system. Extensive experimental evaluations confirm that our model produces reconstructed images at arbitrary resolutions and matches state-of-the-art methods across varying spectral-spatial compression ratios. The code will be released at https://github.com/chh11/MGIR.

* Accepted by TNNLS

Via

Access Paper or Ask Questions

MaskGWM: A Generalizable Driving World Model with Video Mask Reconstruction

Feb 17, 2025

Jingcheng Ni, Yuxin Guo, Yichen Liu, Rui Chen, Lewei Lu, Zehuan Wu

Abstract:World models that forecast environmental changes from actions are vital for autonomous driving models with strong generalization. The prevailing driving world model mainly build on video prediction model. Although these models can produce high-fidelity video sequences with advanced diffusion-based generator, they are constrained by their predictive duration and overall generalization capabilities. In this paper, we explore to solve this problem by combining generation loss with MAE-style feature-level context learning. In particular, we instantiate this target with three key design: (1) A more scalable Diffusion Transformer (DiT) structure trained with extra mask construction task. (2) we devise diffusion-related mask tokens to deal with the fuzzy relations between mask reconstruction and generative diffusion process. (3) we extend mask construction task to spatial-temporal domain by utilizing row-wise mask for shifted self-attention rather than masked self-attention in MAE. Then, we adopt a row-wise cross-view module to align with this mask design. Based on above improvement, we propose MaskGWM: a Generalizable driving World Model embodied with Video Mask reconstruction. Our model contains two variants: MaskGWM-long, focusing on long-horizon prediction, and MaskGWM-mview, dedicated to multi-view generation. Comprehensive experiments on standard benchmarks validate the effectiveness of the proposed method, which contain normal validation of Nuscene dataset, long-horizon rollout of OpenDV-2K dataset and zero-shot validation of Waymo dataset. Quantitative metrics on these datasets show our method notably improving state-of-the-art driving world model.

Via

Access Paper or Ask Questions

Boosting Self-Efficacy and Performance of Large Language Models via Verbal Efficacy Stimulations

Feb 10, 2025

Rui Chen, Tailai Peng, Xinran Xie, Dekun Lin, Zhe Cui, Zheng Chen

Abstract:Significant improvements have been observed in the zero-shot capabilities of the Large Language Models (LLMs). Due to their high sensitivity to input, research has increasingly focused on enhancing LLMs' performance via direct and simple prompt engineering rather than intricate domain adaptation. Studies suggest that LLMs exhibit emotional intelligence, and both positive and negative emotions can potentially enhance task performances. However, prior interaction prompts have predominantly concentrated on a single stimulus type, neglecting to compare different stimulus effects, examine the influence of varying task difficulties, or explore underlying mechanisms. This paper, inspired by the positive correlation between self-efficacy and task performance within the social cognitive theory, introduces Verbal Efficacy Stimulations (VES). Our VES comprises three types of verbal prompts: encouraging, provocative, and critical, addressing six aspects such as helpfulness and competence. And we further categorize task difficulty, aiming to extensively investigate how distinct VES influence the self-efficacy and task achievements of language models at varied levels of difficulty. The experimental results show that the three types of VES improve the performance of LLMs on most tasks, and the most effective VES varies for different models. In extensive experiments, we have obtained some findings consistent with psychological theories, providing novel insights for future research.

* to be published in ICONIP 2024

Via

Access Paper or Ask Questions