Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Zhuo Chen

refer to the report for detailed contributions

MKGL: Mastery of a Three-Word Language

Oct 10, 2024

Lingbing Guo, Zhongpu Bo, Zhuo Chen, Yichi Zhang, Jiaoyan Chen, Yarong Lan, Mengshu Sun, Zhiqiang Zhang, Yangyifei Luo, Qian Li(+3 more)

Figure 1 for MKGL: Mastery of a Three-Word Language

Figure 2 for MKGL: Mastery of a Three-Word Language

Figure 3 for MKGL: Mastery of a Three-Word Language

Figure 4 for MKGL: Mastery of a Three-Word Language

Abstract:Large language models (LLMs) have significantly advanced performance across a spectrum of natural language processing (NLP) tasks. Yet, their application to knowledge graphs (KGs), which describe facts in the form of triplets and allow minimal hallucinations, remains an underexplored frontier. In this paper, we investigate the integration of LLMs with KGs by introducing a specialized KG Language (KGL), where a sentence precisely consists of an entity noun, a relation verb, and ends with another entity noun. Despite KGL's unfamiliar vocabulary to the LLM, we facilitate its learning through a tailored dictionary and illustrative sentences, and enhance context understanding via real-time KG context retrieval and KGL token embedding augmentation. Our results reveal that LLMs can achieve fluency in KGL, drastically reducing errors compared to conventional KG embedding methods on KG completion. Furthermore, our enhanced LLM shows exceptional competence in generating accurate three-word sentences from an initial entity and interpreting new unseen terms out of KGs.

* NeurIPS 2024 (spotlight)

Via

Access Paper or Ask Questions

Toward Scalable Image Feature Compression: A Content-Adaptive and Diffusion-Based Approach

Oct 08, 2024

Sha Guo, Zhuo Chen, Yang Zhao, Ning Zhang, Xiaotong Li, Lingyu Duan

Figure 1 for Toward Scalable Image Feature Compression: A Content-Adaptive and Diffusion-Based Approach

Figure 2 for Toward Scalable Image Feature Compression: A Content-Adaptive and Diffusion-Based Approach

Figure 3 for Toward Scalable Image Feature Compression: A Content-Adaptive and Diffusion-Based Approach

Figure 4 for Toward Scalable Image Feature Compression: A Content-Adaptive and Diffusion-Based Approach

Abstract:Traditional image codecs emphasize signal fidelity and human perception, often at the expense of machine vision tasks. Deep learning methods have demonstrated promising coding performance by utilizing rich semantic embeddings optimized for both human and machine vision. However, these compact embeddings struggle to capture fine details such as contours and textures, resulting in imperfect reconstructions. Furthermore, existing learning-based codecs lack scalability. To address these limitations, this paper introduces a content-adaptive diffusion model for scalable image compression. The proposed method encodes fine textures through a diffusion process, enhancing perceptual quality while preserving essential features for machine vision tasks. The approach employs a Markov palette diffusion model combined with widely used feature extractors and image generators, enabling efficient data compression. By leveraging collaborative texture-semantic feature extraction and pseudo-label generation, the method accurately captures texture information. A content-adaptive Markov palette diffusion model is then applied to represent both low-level textures and high-level semantic content in a scalable manner. This framework offers flexible control over compression ratios by selecting intermediate diffusion states, eliminating the need for retraining deep learning models at different operating points. Extensive experiments demonstrate the effectiveness of the proposed framework in both image reconstruction and downstream machine vision tasks such as object detection, segmentation, and facial landmark detection, achieving superior perceptual quality compared to state-of-the-art methods.

* in Proceedings of the 31st ACM International Conference on Multimedia, pp. 1431-1442, 2023

Via

Access Paper or Ask Questions

Revealing Directions for Text-guided 3D Face Editing

Oct 07, 2024

Zhuo Chen, Yichao Yan, Sehngqi Liu, Yuhao Cheng, Weiming Zhao, Lincheng Li, Mengxiao Bi, Xiaokang Yang

Figure 1 for Revealing Directions for Text-guided 3D Face Editing

Figure 2 for Revealing Directions for Text-guided 3D Face Editing

Figure 3 for Revealing Directions for Text-guided 3D Face Editing

Figure 4 for Revealing Directions for Text-guided 3D Face Editing

Abstract:3D face editing is a significant task in multimedia, aimed at the manipulation of 3D face models across various control signals. The success of 3D-aware GAN provides expressive 3D models learned from 2D single-view images only, encouraging researchers to discover semantic editing directions in its latent space. However, previous methods face challenges in balancing quality, efficiency, and generalization. To solve the problem, we explore the possibility of introducing the strength of diffusion model into 3D-aware GANs. In this paper, we present Face Clan, a fast and text-general approach for generating and manipulating 3D faces based on arbitrary attribute descriptions. To achieve disentangled editing, we propose to diffuse on the latent space under a pair of opposite prompts to estimate the mask indicating the region of interest on latent codes. Based on the mask, we then apply denoising to the masked latent codes to reveal the editing direction. Our method offers a precisely controllable manipulation method, allowing users to intuitively customize regions of interest with the text description. Experiments demonstrate the effectiveness and generalization of our Face Clan for various pre-trained GANs. It offers an intuitive and wide application for text-guided face editing that contributes to the landscape of multimedia content creation.

Via

Access Paper or Ask Questions

AniSDF: Fused-Granularity Neural Surfaces with Anisotropic Encoding for High-Fidelity 3D Reconstruction

Oct 02, 2024

Jingnan Gao, Zhuo Chen, Yichao Yan, Xiaokang Yang

Abstract:Neural radiance fields have recently revolutionized novel-view synthesis and achieved high-fidelity renderings. However, these methods sacrifice the geometry for the rendering quality, limiting their further applications including relighting and deformation. How to synthesize photo-realistic rendering while reconstructing accurate geometry remains an unsolved problem. In this work, we present AniSDF, a novel approach that learns fused-granularity neural surfaces with physics-based encoding for high-fidelity 3D reconstruction. Different from previous neural surfaces, our fused-granularity geometry structure balances the overall structures and fine geometric details, producing accurate geometry reconstruction. To disambiguate geometry from reflective appearance, we introduce blended radiance fields to model diffuse and specularity following the anisotropic spherical Gaussian encoding, a physics-based rendering pipeline. With these designs, AniSDF can reconstruct objects with complex structures and produce high-quality renderings. Furthermore, our method is a unified model that does not require complex hyperparameter tuning for specific objects. Extensive experiments demonstrate that our method boosts the quality of SDF-based methods by a great scale in both geometry reconstruction and novel-view synthesis.

* Project Page: https://g-1nonly.github.io/AniSDF_Website/

Via

Access Paper or Ask Questions

TransForce: Transferable Force Prediction for Vision-based Tactile Sensors with Sequential Image Translation

Sep 15, 2024

Zhuo Chen, Ni Ou, Xuyang Zhang, Shan Luo

Figure 1 for TransForce: Transferable Force Prediction for Vision-based Tactile Sensors with Sequential Image Translation

Figure 2 for TransForce: Transferable Force Prediction for Vision-based Tactile Sensors with Sequential Image Translation

Figure 3 for TransForce: Transferable Force Prediction for Vision-based Tactile Sensors with Sequential Image Translation

Figure 4 for TransForce: Transferable Force Prediction for Vision-based Tactile Sensors with Sequential Image Translation

Abstract:Vision-based tactile sensors (VBTSs) provide high-resolution tactile images crucial for robot in-hand manipulation. However, force sensing in VBTSs is underutilized due to the costly and time-intensive process of acquiring paired tactile images and force labels. In this study, we introduce a transferable force prediction model, TransForce, designed to leverage collected image-force paired data for new sensors under varying illumination colors and marker patterns while improving the accuracy of predicted forces, especially in the shear direction. Our model effectively achieves translation of tactile images from the source domain to the target domain, ensuring that the generated tactile images reflect the illumination colors and marker patterns of the new sensors while accurately aligning the elastomer deformation observed in existing sensors, which is beneficial to force prediction of new sensors. As such, a recurrent force prediction model trained with generated sequential tactile images and existing force labels is employed to estimate higher-accuracy forces for new sensors with lowest average errors of 0.69N (5.8\% in full work range) in $x$-axis, 0.70N (5.8\%) in $y$-axis, and 1.11N (6.9\%) in $z$-axis compared with models trained with single images. The experimental results also reveal that pure marker modality is more helpful than the RGB modality in improving the accuracy of force in the shear direction, while the RGB modality show better performance in the normal direction.

Via

Access Paper or Ask Questions

Seed-Music: A Unified Framework for High Quality and Controlled Music Generation

Sep 13, 2024

Ye Bai, Haonan Chen, Jitong Chen, Zhuo Chen, Yi Deng, Xiaohong Dong, Lamtharn Hantrakul, Weituo Hao, Qingqing Huang, Zhongyi Huang(+28 more)

Figure 1 for Seed-Music: A Unified Framework for High Quality and Controlled Music Generation

Figure 2 for Seed-Music: A Unified Framework for High Quality and Controlled Music Generation

Figure 3 for Seed-Music: A Unified Framework for High Quality and Controlled Music Generation

Figure 4 for Seed-Music: A Unified Framework for High Quality and Controlled Music Generation

Abstract:We introduce Seed-Music, a suite of music generation systems capable of producing high-quality music with fine-grained style control. Our unified framework leverages both auto-regressive language modeling and diffusion approaches to support two key music creation workflows: \textit{controlled music generation} and \textit{post-production editing}. For controlled music generation, our system enables vocal music generation with performance controls from multi-modal inputs, including style descriptions, audio references, musical scores, and voice prompts. For post-production editing, it offers interactive tools for editing lyrics and vocal melodies directly in the generated audio. We encourage readers to listen to demo audio examples at https://team.doubao.com/seed-music .

* Seed-Music technical report, 20 pages, 5 figures

Via

Access Paper or Ask Questions

Cross-composition Feature Disentanglement for Compositional Zero-shot Learning

Aug 19, 2024

Yuxia Geng, Runkai Zhu, Jiaoyan Chen, Jintai Chen, Zhuo Chen, Xiang Chen, Can Xu, Yuxiang Wang, Xiaoliang Xu

Figure 1 for Cross-composition Feature Disentanglement for Compositional Zero-shot Learning

Figure 2 for Cross-composition Feature Disentanglement for Compositional Zero-shot Learning

Figure 3 for Cross-composition Feature Disentanglement for Compositional Zero-shot Learning

Figure 4 for Cross-composition Feature Disentanglement for Compositional Zero-shot Learning

Abstract:Disentanglement of visual features of primitives (i.e., attributes and objects) has shown exceptional results in Compositional Zero-shot Learning (CZSL). However, due to the feature divergence of an attribute (resp. object) when combined with different objects (resp. attributes), it is challenging to learn disentangled primitive features that are general across different compositions. To this end, we propose the solution of cross-composition feature disentanglement, which takes multiple primitive-sharing compositions as inputs and constrains the disentangled primitive features to be general across these compositions. More specifically, we leverage a compositional graph to define the overall primitive-sharing relationships between compositions, and build a task-specific architecture upon the recently successful large pre-trained vision-language model (VLM) CLIP, with dual cross-composition disentangling adapters (called L-Adapter and V-Adapter) inserted into CLIP's frozen text and image encoders, respectively. Evaluation on three popular CZSL benchmarks shows that our proposed solution significantly improves the performance of CZSL, and its components have been verified by solid ablation studies.

* work in progress

Via

Access Paper or Ask Questions

Marker or Markerless? Mode-Switchable Optical Tactile Sensing for Diverse Robot Tasks

Aug 15, 2024

Ni Ou, Zhuo Chen, Shan Luo

Abstract:Optical tactile sensors play a pivotal role in robot perception and manipulation tasks. The membrane of these sensors can be painted with markers or remain markerless, enabling them to function in either marker or markerless mode. However, this uni-modal selection means the sensor is only suitable for either manipulation or perception tasks. While markers are vital for manipulation, they can also obstruct the camera, thereby impeding perception. The dilemma of selecting between marker and markerless modes presents a significant obstacle. To address this issue, we propose a novel mode-switchable optical tactile sensing approach that facilitates transitions between the two modes. The marker-to-markerless transition is achieved through a generative model, whereas its inverse transition is realized using a sparsely supervised regressive model. Our approach allows a single-mode optical sensor to operate effectively in both marker and markerless modes without the need for additional hardware, making it well-suited for both perception and manipulation tasks. Extensive experiments validate the effectiveness of our method. For perception tasks, our approach decreases the number of categories that include misclassified samples by 2 and improves contact area segmentation IoU by 3.53%. For manipulation tasks, our method attains a high success rate of 92.59% in slip detection. Code, dataset and demo videos are available at the project website: https://gitouni.github.io/Marker-Markerless-Transition/

* 8 pages, 10 figures

Via

Access Paper or Ask Questions

Language Model Can Listen While Speaking

Aug 05, 2024

Ziyang Ma, Yakun Song, Chenpeng Du, Jian Cong, Zhuo Chen, Yuping Wang, Yuxuan Wang, Xie Chen

Figure 1 for Language Model Can Listen While Speaking

Figure 2 for Language Model Can Listen While Speaking

Figure 3 for Language Model Can Listen While Speaking

Figure 4 for Language Model Can Listen While Speaking

Abstract:Dialogue serves as the most natural manner of human-computer interaction (HCI). Recent advancements in speech language models (SLM) have significantly enhanced speech-based conversational AI. However, these models are limited to turn-based conversation, lacking the ability to interact with humans in real-time spoken scenarios, for example, being interrupted when the generated content is not satisfactory. To address these limitations, we explore full duplex modeling (FDM) in interactive speech language models (iSLM), focusing on enhancing real-time interaction and, more explicitly, exploring the quintessential ability of interruption. We introduce a novel model design, namely listening-while-speaking language model (LSLM), an end-to-end system equipped with both listening and speaking channels. Our LSLM employs a token-based decoder-only TTS for speech generation and a streaming self-supervised learning (SSL) encoder for real-time audio input. LSLM fuses both channels for autoregressive generation and detects turn-taking in real time. Three fusion strategies -- early fusion, middle fusion, and late fusion -- are explored, with middle fusion achieving an optimal balance between speech generation and real-time interaction. Two experimental settings, command-based FDM and voice-based FDM, demonstrate LSLM's robustness to noise and sensitivity to diverse instructions. Our results highlight LSLM's capability to achieve duplex communication with minimal impact on existing systems. This study aims to advance the development of interactive speech dialogue systems, enhancing their applicability in real-world contexts.

* Demo can be found at https://ddlbojack.github.io/LSLM

Via

Access Paper or Ask Questions

An Asynchronous Multi-core Accelerator for SNN inference

Jul 30, 2024

Zhuo Chen, De Ma, Xiaofei Jin, Qinghui Xing, Ouwen Jin, Xin Du, Shuibing He, Gang Pan

Figure 1 for An Asynchronous Multi-core Accelerator for SNN inference

Figure 2 for An Asynchronous Multi-core Accelerator for SNN inference

Figure 3 for An Asynchronous Multi-core Accelerator for SNN inference

Figure 4 for An Asynchronous Multi-core Accelerator for SNN inference

Abstract:Spiking Neural Networks (SNNs) are extensively utilized in brain-inspired computing and neuroscience research. To enhance the speed and energy efficiency of SNNs, several many-core accelerators have been developed. However, maintaining the accuracy of SNNs often necessitates frequent explicit synchronization among all cores, which presents a challenge to overall efficiency. In this paper, we propose an asynchronous architecture for Spiking Neural Networks (SNNs) that eliminates the need for inter-core synchronization, thus enhancing speed and energy efficiency. This approach leverages the pre-determined dependencies of neuromorphic cores established during compilation. Each core is equipped with a scheduler that monitors the status of its dependencies, allowing it to safely advance to the next timestep without waiting for other cores. This eliminates the necessity for global synchronization and minimizes core waiting time despite inherent workload imbalances. Comprehensive evaluations using five different SNN workloads show that our architecture achieves a 1.86x speedup and a 1.55x increase in energy efficiency compared to state-of-the-art synchronization architectures.

Via

Access Paper or Ask Questions