Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yong Xu

3D Shape Completion on Unseen Categories:A Weakly-supervised Approach

Jan 19, 2024

Lintai Wu, Junhui Hou, Linqi Song, Yong Xu

Figure 1 for 3D Shape Completion on Unseen Categories:A Weakly-supervised Approach

Figure 2 for 3D Shape Completion on Unseen Categories:A Weakly-supervised Approach

Figure 3 for 3D Shape Completion on Unseen Categories:A Weakly-supervised Approach

Figure 4 for 3D Shape Completion on Unseen Categories:A Weakly-supervised Approach

Abstract:3D shapes captured by scanning devices are often incomplete due to occlusion. 3D shape completion methods have been explored to tackle this limitation. However, most of these methods are only trained and tested on a subset of categories, resulting in poor generalization to unseen categories. In this paper, we introduce a novel weakly-supervised framework to reconstruct the complete shapes from unseen categories. We first propose an end-to-end prior-assisted shape learning network that leverages data from the seen categories to infer a coarse shape. Specifically, we construct a prior bank consisting of representative shapes from the seen categories. Then, we design a multi-scale pattern correlation module for learning the complete shape of the input by analyzing the correlation between local patterns within the input and the priors at various scales. In addition, we propose a self-supervised shape refinement model to further refine the coarse shape. Considering the shape variability of 3D objects across categories, we construct a category-specific prior bank to facilitate shape refinement. Then, we devise a voxel-based partial matching loss and leverage the partial scans to drive the refinement process. Extensive experimental results show that our approach is superior to state-of-the-art methods by a large margin.

* 13 pages,8 figures

Via

Access Paper or Ask Questions

TaskWeaver: A Code-First Agent Framework

Dec 01, 2023

Bo Qiao, Liqun Li, Xu Zhang, Shilin He, Yu Kang, Chaoyun Zhang, Fangkai Yang, Hang Dong, Jue Zhang, Lu Wang(+9 more)

Figure 1 for TaskWeaver: A Code-First Agent Framework

Figure 2 for TaskWeaver: A Code-First Agent Framework

Figure 3 for TaskWeaver: A Code-First Agent Framework

Figure 4 for TaskWeaver: A Code-First Agent Framework

Abstract:Large Language Models (LLMs) have shown impressive abilities in natural language understanding and generation, leading to their use in applications such as chatbots and virtual assistants. However, existing LLM frameworks face limitations in handling domain-specific data analytics tasks with rich data structures. Moreover, they struggle with flexibility to meet diverse user requirements. To address these issues, TaskWeaver is proposed as a code-first framework for building LLM-powered autonomous agents. It converts user requests into executable code and treats user-defined plugins as callable functions. TaskWeaver provides support for rich data structures, flexible plugin usage, and dynamic plugin selection, and leverages LLM coding capabilities for complex logic. It also incorporates domain-specific knowledge through examples and ensures the secure execution of generated code. TaskWeaver offers a powerful and flexible framework for creating intelligent conversational agents that can handle complex tasks and adapt to domain-specific scenarios. The code is open-sourced at https://github.com/microsoft/TaskWeaver/.

Via

Access Paper or Ask Questions

Everything of Thoughts: Defying the Law of Penrose Triangle for Thought Generation

Nov 12, 2023

Ruomeng Ding, Chaoyun Zhang, Lu Wang, Yong Xu, Minghua Ma, Wei Zhang, Si Qin, Saravan Rajmohan, Qingwei Lin, Dongmei Zhang

Figure 1 for Everything of Thoughts: Defying the Law of Penrose Triangle for Thought Generation

Figure 2 for Everything of Thoughts: Defying the Law of Penrose Triangle for Thought Generation

Figure 3 for Everything of Thoughts: Defying the Law of Penrose Triangle for Thought Generation

Figure 4 for Everything of Thoughts: Defying the Law of Penrose Triangle for Thought Generation

Abstract:Recent advancements in Large Language Models (LLMs) have revolutionized decision-making by breaking down complex problems into more manageable language sequences referred to as ``thoughts''. An effective thought design should consider three key perspectives: performance, efficiency, and flexibility. However, existing thought can at most exhibit two of these attributes. To address these limitations, we introduce a novel thought prompting approach called ``Everything of Thoughts'' (XoT) to defy the law of ``Penrose triangle of existing thought paradigms. XoT leverages pretrained reinforcement learning and Monte Carlo Tree Search (MCTS) to incorporate external domain knowledge into thoughts, thereby enhancing LLMs' capabilities and enabling them to generalize to unseen problems efficiently. Through the utilization of the MCTS-LLM collaborative thought revision framework, this approach autonomously produces high-quality comprehensive cognitive mappings with minimal LLM interactions. Additionally, XoT empowers LLMs to engage in unconstrained thinking, allowing for flexible cognitive mappings for problems with multiple solutions. We evaluate XoT on several challenging multi-solution problem-solving tasks, including Game of 24, 8-Puzzle, and Pocket Cube. Our results demonstrate that XoT significantly outperforms existing approaches. Notably, XoT can yield multiple solutions with just one LLM call, showcasing its remarkable proficiency in addressing complex problems across diverse domains.

* 17 pages, 5 figures

Via

Access Paper or Ask Questions

GPT-ST: Generative Pre-Training of Spatio-Temporal Graph Neural Networks

Nov 07, 2023

Zhonghang Li, Lianghao Xia, Yong Xu, Chao Huang

Figure 1 for GPT-ST: Generative Pre-Training of Spatio-Temporal Graph Neural Networks

Figure 2 for GPT-ST: Generative Pre-Training of Spatio-Temporal Graph Neural Networks

Figure 3 for GPT-ST: Generative Pre-Training of Spatio-Temporal Graph Neural Networks

Figure 4 for GPT-ST: Generative Pre-Training of Spatio-Temporal Graph Neural Networks

Abstract:In recent years, there has been a rapid development of spatio-temporal prediction techniques in response to the increasing demands of traffic management and travel planning. While advanced end-to-end models have achieved notable success in improving predictive performance, their integration and expansion pose significant challenges. This work aims to address these challenges by introducing a spatio-temporal pre-training framework that seamlessly integrates with downstream baselines and enhances their performance. The framework is built upon two key designs: (i) We propose a spatio-temporal mask autoencoder as a pre-training model for learning spatio-temporal dependencies. The model incorporates customized parameter learners and hierarchical spatial pattern encoding networks. These modules are specifically designed to capture spatio-temporal customized representations and intra- and inter-cluster region semantic relationships, which have often been neglected in existing approaches. (ii) We introduce an adaptive mask strategy as part of the pre-training mechanism. This strategy guides the mask autoencoder in learning robust spatio-temporal representations and facilitates the modeling of different relationships, ranging from intra-cluster to inter-cluster, in an easy-to-hard training manner. Extensive experiments conducted on representative benchmarks demonstrate the effectiveness of our proposed method. We have made our model implementation publicly available at https://github.com/HKUDS/GPT-ST.

* Thirty-seventh Conference on Neural Information Processing Systems(NeurIPS 2023)
* This paper has been accepted by NeurIPS 2023

Via

Access Paper or Ask Questions

Audio-Visual Speaker Tracking: Progress, Challenges, and Future Directions

Oct 23, 2023

Jinzheng Zhao, Yong Xu, Xinyuan Qian, Davide Berghi, Peipei Wu, Meng Cui, Jianyuan Sun, Philip J. B. Jackson, Wenwu Wang

Figure 1 for Audio-Visual Speaker Tracking: Progress, Challenges, and Future Directions

Figure 2 for Audio-Visual Speaker Tracking: Progress, Challenges, and Future Directions

Figure 3 for Audio-Visual Speaker Tracking: Progress, Challenges, and Future Directions

Figure 4 for Audio-Visual Speaker Tracking: Progress, Challenges, and Future Directions

Abstract:Audio-visual speaker tracking has drawn increasing attention over the past few years due to its academic values and wide application. Audio and visual modalities can provide complementary information for localization and tracking. With audio and visual information, the Bayesian-based filter can solve the problem of data association, audio-visual fusion and track management. In this paper, we conduct a comprehensive overview of audio-visual speaker tracking. To our knowledge, this is the first extensive survey over the past five years. We introduce the family of Bayesian filters and summarize the methods for obtaining audio-visual measurements. In addition, the existing trackers and their performance on AV16.3 dataset are summarized. In the past few years, deep learning techniques have thrived, which also boosts the development of audio visual speaker tracking. The influence of deep learning techniques in terms of measurement extraction and state estimation is also discussed. At last, we discuss the connections between audio-visual speaker tracking and other areas such as speech separation and distributed speaker tracking.

Via

Access Paper or Ask Questions

uSee: Unified Speech Enhancement and Editing with Conditional Diffusion Models

Oct 02, 2023

Muqiao Yang, Chunlei Zhang, Yong Xu, Zhongweiyang Xu, Heming Wang, Bhiksha Raj, Dong Yu

Abstract:Speech enhancement aims to improve the quality of speech signals in terms of quality and intelligibility, and speech editing refers to the process of editing the speech according to specific user needs. In this paper, we propose a Unified Speech Enhancement and Editing (uSee) model with conditional diffusion models to handle various tasks at the same time in a generative manner. Specifically, by providing multiple types of conditions including self-supervised learning embeddings and proper text prompts to the score-based diffusion model, we can enable controllable generation of the unified speech enhancement and editing model to perform corresponding actions on the source speech. Our experiments show that our proposed uSee model can achieve superior performance in both speech denoising and dereverberation compared to other related generative speech enhancement models, and can perform speech editing given desired environmental sound text description, signal-to-noise ratios (SNR), and room impulse responses (RIR). Demos of the generated speech are available at https://muqiaoy.github.io/usee.

Via

Access Paper or Ask Questions

Audio Visual Speaker Localization from EgoCentric Views

Sep 28, 2023

Jinzheng Zhao, Yong Xu, Xinyuan Qian, Wenwu Wang

Figure 1 for Audio Visual Speaker Localization from EgoCentric Views

Figure 2 for Audio Visual Speaker Localization from EgoCentric Views

Figure 3 for Audio Visual Speaker Localization from EgoCentric Views

Figure 4 for Audio Visual Speaker Localization from EgoCentric Views

Abstract:The use of audio and visual modality for speaker localization has been well studied in the literature by exploiting their complementary characteristics. However, most previous works employ the setting of static sensors mounted at fixed positions. Unlike them, in this work, we explore the ego-centric setting, where the heterogeneous sensors are embodied and could be moving with a human to facilitate speaker localization. Compared to the static scenario, the ego-centric setting is more realistic for smart-home applications e.g., a service robot. However, this also brings new challenges such as blurred images, frequent speaker disappearance from the field of view of the wearer, and occlusions. In this paper, we study egocentric audio-visual speaker DOA estimation and deal with the challenges mentioned above. Specifically, we propose a transformer-based audio-visual fusion method to estimate the relative DOA of the speaker to the wearer, and design a training strategy to mitigate the problem of the speaker disappearing from the camera's view. We also develop a new dataset for simulating the out-of-view scenarios, by creating a scene with a camera wearer walking around while a speaker is moving at the same time. The experimental results show that our proposed method offers promising performance in this new dataset in terms of tracking accuracy. Finally, we adapt the proposed method for the multi-speaker scenario. Experiments on EasyCom show the effectiveness of the proposed model for multiple speakers in real scenarios, which achieves state-of-the-art results in the sphere active speaker detection task and the wearer activity prediction task. The simulated dataset and related code are available at https://github.com/KawhiZhao/Egocentric-Audio-Visual-Speaker-Localization.

Via

Access Paper or Ask Questions

SpatialCodec: Neural Spatial Speech Coding

Sep 14, 2023

Zhongweiyang Xu, Yong Xu, Vinay Kothapally, Heming Wang, Muqiao Yang, Dong Yu

Figure 1 for SpatialCodec: Neural Spatial Speech Coding

Figure 2 for SpatialCodec: Neural Spatial Speech Coding

Figure 3 for SpatialCodec: Neural Spatial Speech Coding

Abstract:In this work, we address the challenge of encoding speech captured by a microphone array using deep learning techniques with the aim of preserving and accurately reconstructing crucial spatial cues embedded in multi-channel recordings. We propose a neural spatial audio coding framework that achieves a high compression ratio, leveraging single-channel neural sub-band codec and SpatialCodec. Our approach encompasses two phases: (i) a neural sub-band codec is designed to encode the reference channel with low bit rates, and (ii), a SpatialCodec captures relative spatial information for accurate multi-channel reconstruction at the decoder end. In addition, we also propose novel evaluation metrics to assess the spatial cue preservation: (i) spatial similarity, which calculates cosine similarity on a spatially intuitive beamspace, and (ii), beamformed audio quality. Our system shows superior spatial performance compared with high bitrate baselines and black-box neural architecture. Demos are available at https://xzwy.github.io/SpatialCodecDemo. Codes and models are available at https://github.com/XZWY/SpatialCodec.

* Paper in Submission

Via

Access Paper or Ask Questions

Cross-Modal Vertical Federated Learning for MRI Reconstruction

Jun 05, 2023

Yunlu Yan, Hong Wang, Yawen Huang, Nanjun He, Lei Zhu, Yuexiang Li, Yong Xu, Yefeng Zheng

Abstract:Federated learning enables multiple hospitals to cooperatively learn a shared model without privacy disclosure. Existing methods often take a common assumption that the data from different hospitals have the same modalities. However, such a setting is difficult to fully satisfy in practical applications, since the imaging guidelines may be different between hospitals, which makes the number of individuals with the same set of modalities limited. To this end, we formulate this practical-yet-challenging cross-modal vertical federated learning task, in which shape data from multiple hospitals have different modalities with a small amount of multi-modality data collected from the same individuals. To tackle such a situation, we develop a novel framework, namely Federated Consistent Regularization constrained Feature Disentanglement (Fed-CRFD), for boosting MRI reconstruction by effectively exploring the overlapping samples (individuals with multi-modalities) and solving the domain shift problem caused by different modalities. Particularly, our Fed-CRFD involves an intra-client feature disentangle scheme to decouple data into modality-invariant and modality-specific features, where the modality-invariant features are leveraged to mitigate the domain shift problem. In addition, a cross-client latent representation consistency constraint is proposed specifically for the overlapping samples to further align the modality-invariant features extracted from different modalities. Hence, our method can fully exploit the multi-source data from hospitals while alleviating the domain shift problem. Extensive experiments on two typical MRI datasets demonstrate that our network clearly outperforms state-of-the-art MRI reconstruction methods. The source code will be publicly released upon the publication of this work.

* 12 pages, 7 figures

Via

Access Paper or Ask Questions

Graph Transformer for Recommendation

Jun 04, 2023

Chaoliu Li, Lianghao Xia, Xubin Ren, Yaowen Ye, Yong Xu, Chao Huang

Figure 1 for Graph Transformer for Recommendation

Figure 2 for Graph Transformer for Recommendation

Figure 3 for Graph Transformer for Recommendation

Figure 4 for Graph Transformer for Recommendation

Abstract:This paper presents a novel approach to representation learning in recommender systems by integrating generative self-supervised learning with graph transformer architecture. We highlight the importance of high-quality data augmentation with relevant self-supervised pretext tasks for improving performance. Towards this end, we propose a new approach that automates the self-supervision augmentation process through a rationale-aware generative SSL that distills informative user-item interaction patterns. The proposed recommender with Graph TransFormer (GFormer) that offers parameterized collaborative rationale discovery for selective augmentation while preserving global-aware user-item relationships. In GFormer, we allow the rationale-aware SSL to inspire graph collaborative filtering with task-adaptive invariant rationalization in graph transformer. The experimental results reveal that our GFormer has the capability to consistently improve the performance over baselines on different datasets. Several in-depth experiments further investigate the invariant rationale-aware augmentation from various aspects. The source code for this work is publicly available at: https://github.com/HKUDS/GFormer.

* Accepted by SIGIR'2023

Via

Access Paper or Ask Questions