Abstract:Human-object interactions (HOIs) are crucial for human-centric scene understanding applications such as human-centric visual generation, AR/VR, and robotics. While existing methods mainly explore capturing HOIs, rendering HOIs remains less investigated. In this paper, we address this challenge of HOI animation from a compositional perspective, i.e., animating novel HOIs that combine a novel interaction, a novel human, and/or a novel object, driven by a novel pose sequence. Specifically, we adopt neural human-object deformation to model and render HOI dynamics based on implicit neural representations. To enable transferring interaction poses among different persons and objects, we then devise a new compositional conditional neural radiance field (CC-NeRF), which decomposes the interdependence between human and object using latent codes, enabling compositional animation control of novel HOIs. Experiments show that the proposed method generalizes well to various novel HOI animation settings. Our project page is https://zhihou7.github.io/CHONA/
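As a rough illustration of the compositional conditioning idea, the sketch below (all names hypothetical, not the paper's actual implementation) shows a NeRF-style field whose density and color are conditioned on separate human and object latent codes, so that swapping one code composes a novel human-object pairing:

```python
import torch
import torch.nn as nn

class CompositionalConditionalField(nn.Module):
    """Toy conditional field: density/color depend on a query point plus
    separate human and object latent codes, so codes can be swapped at
    inference time to compose novel human-object pairs."""
    def __init__(self, n_ids=10, latent_dim=64, hidden=128):
        super().__init__()
        self.human_codes = nn.Embedding(n_ids, latent_dim)   # one code per person
        self.object_codes = nn.Embedding(n_ids, latent_dim)  # one code per object
        self.mlp = nn.Sequential(
            nn.Linear(3 + 2 * latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # (r, g, b, sigma)
        )

    def forward(self, xyz, human_id, object_id):
        z_h = self.human_codes(human_id).expand(xyz.shape[0], -1)
        z_o = self.object_codes(object_id).expand(xyz.shape[0], -1)
        out = self.mlp(torch.cat([xyz, z_h, z_o], dim=-1))
        return torch.sigmoid(out[:, :3]), torch.relu(out[:, 3:])  # rgb, density

# Swapping object_id while keeping human_id composes a novel pairing.
field = CompositionalConditionalField()
rgb, sigma = field(torch.rand(1024, 3), torch.tensor([2]), torch.tensor([5]))
```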
Abstract:Recent neural architecture search (NAS) based approaches have made great progress in hyperspectral image (HSI) classification tasks. However, the architectures are usually optimized independently of the network weights, increasing search time and restricting model performance. To tackle these issues, in this paper, unlike previous methods that define extra structural parameters, we propose to directly generate structural parameters from specially designed hyper kernels, converting the original complex two-tier optimization problem into an easily implemented one-tier optimization and greatly reducing search costs. We then develop a hierarchical multi-module search space whose candidate operations contain only convolutions, and these operations can be integrated into unified kernels. Using this search strategy and search space, we obtain three kinds of networks that separately conduct pixel-level or image-level classification with 1-D or 3-D convolutions. In addition, by combining the proposed hyper kernel search scheme with a 3-D convolution decomposition mechanism, we obtain diverse architectures that simulate 3-D convolutions, greatly improving network flexibility. A series of quantitative and qualitative experiments on six public datasets demonstrates that the proposed methods achieve state-of-the-art results compared with other advanced NAS-based HSI classification approaches.
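To make the one-tier idea concrete, here is a minimal, hypothetical sketch in which the mixing weights over candidate convolution kernel sizes are generated from the shared hyper kernel itself rather than kept as free architecture parameters, so a single task-loss backward pass updates both; the generation head and sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HyperKernelConv(nn.Module):
    """Toy one-tier search cell: operation-mixing weights are derived from
    the hyper kernel, so architecture and weights train in one optimization."""
    def __init__(self, c_in, c_out, sizes=(1, 3, 5)):
        super().__init__()
        self.sizes = sizes
        k_max = max(sizes)
        self.hyper_kernel = nn.Parameter(torch.randn(c_out, c_in, k_max, k_max) * 0.1)
        self.arch_head = nn.Linear(c_out, len(sizes))  # kernel stats -> op logits

    def forward(self, x):
        # generate structural parameters from the hyper kernel itself
        stats = self.hyper_kernel.mean(dim=(1, 2, 3))        # (c_out,)
        alpha = torch.softmax(self.arch_head(stats), dim=0)  # weight per kernel size
        out, k_max = 0.0, max(self.sizes)
        for a, k in zip(alpha, self.sizes):
            crop = (k_max - k) // 2
            w = self.hyper_kernel[..., crop:k_max - crop, crop:k_max - crop]
            out = out + a * F.conv2d(x, w, padding=k // 2)   # shared unified kernel
        return out

y = HyperKernelConv(3, 16)(torch.rand(2, 3, 32, 32))  # -> (2, 16, 32, 32)
```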
Abstract:The recent work known as Segment Anything (SA) has made significant strides in pushing the boundaries of semantic segmentation into the era of foundation models. The impact of SA has sparked extremely active discussions and ushered in an encouraging new wave of developing foundation models for the diverse tasks in the Euclidean domain, such as object detection and image inpainting. Despite the promising advances led by SA, the concept has yet to be extended to the non-Euclidean graph domain. In this paper, we explore a novel Segment Non-Euclidean Anything (SNA) paradigm that strives to develop foundation models that can handle the diverse range of graph data within the non-Euclidean domain, seeking to expand the scope of SA and lay the groundwork for future research in this direction. To achieve this goal, we begin by discussing the recent achievements in foundation models associated with SA. We then shed light on the unique challenges that arise when applying the SA concept to graph analysis, which involves understanding the differences between the Euclidean and non-Euclidean domains from both the data and task perspectives. Motivated by these observations, we present several preliminary solutions to tackle the challenges of SNA and detail their corresponding limitations, along with several potential directions to pave the way for future SNA research. Experiments on five Open Graph Benchmark (OGB) datasets across various tasks, including graph property classification and regression, as well as multi-label prediction, demonstrate that the performance of the naive SNA solutions has considerable room for improvement, pointing towards a promising avenue for future exploration of Graph General Intelligence.
Abstract:Hyperspectral image (HSI) classification is challenging due to spatial variability caused by complex imaging conditions. Prior methods suffer from limited representation ability, as they train specially designed networks from scratch on limited annotated data. We propose a tri-spectral image generation pipeline that transforms HSIs into high-quality tri-spectral images, enabling the use of off-the-shelf ImageNet-pretrained backbone networks for feature extraction. Motivated by the observation that HSIs contain many homogeneous areas with distinct semantic and geometric properties, which can be used to extract useful contexts, we propose an end-to-end segmentation network named DCN-T. It adopts transformers to effectively encode regional-adaptation and global-aggregation spatial contexts within and between the homogeneous areas discovered by similarity-based clustering. To fully exploit the rich spectral information of the HSI, we adopt an ensemble approach in which the segmentation results of all tri-spectral images are integrated into the final prediction through a voting scheme. Extensive experiments on three public benchmarks show that our proposed method outperforms state-of-the-art methods for HSI classification.
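The final voting scheme can be pictured as a simple per-pixel majority vote over the segmentation maps of all tri-spectral images; the sketch below is an assumed plain majority vote, and the paper's exact scheme may differ:

```python
import numpy as np

def vote_ensemble(seg_maps, n_classes):
    """Per-pixel majority vote over an ensemble of segmentation maps.
    seg_maps: (n_images, H, W) integer class predictions."""
    votes = np.stack([(seg_maps == c).sum(axis=0) for c in range(n_classes)])
    return votes.argmax(axis=0)  # (H, W) fused prediction

preds = np.random.randint(0, 5, size=(7, 64, 64))  # 7 tri-spectral images, 5 classes
fused = vote_ensemble(preds, n_classes=5)
```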
Abstract:In recent decades, visual simultaneous localization and mapping (vSLAM) has gained significant interest in both academia and industry. It estimates camera motion and reconstructs the environment concurrently using visual sensors on a moving robot. However, conventional cameras suffer from hardware limitations, including motion blur and low dynamic range, which can degrade performance in challenging scenarios such as high-speed motion and high-dynamic-range illumination. Recent studies have demonstrated that event cameras, a new type of bio-inspired visual sensor, offer advantages such as high temporal resolution, high dynamic range, low power consumption, and low latency. This paper presents a timely and comprehensive review of event-based vSLAM algorithms that exploit the benefits of asynchronous and irregular event streams for localization and mapping tasks. The review covers the working principle of event cameras and the various event representations used for preprocessing event data. It also categorizes event-based vSLAM methods into four main categories: feature-based, direct, motion-compensation, and deep learning methods, with detailed discussions and practical guidance for each approach. Furthermore, the paper evaluates state-of-the-art methods on various benchmarks, highlighting current challenges and future opportunities in this emerging research area. A public repository tracking the rapid developments in this field will be maintained at \url{https://github.com/kun150kun/ESLAM-survey}.
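As an example of the event representations used for preprocessing, a common generic choice is to accumulate the asynchronous stream of (timestamp, x, y, polarity) tuples into a time-binned voxel grid; the sketch below illustrates this general idea and is not tied to any specific method in the survey:

```python
import numpy as np

def events_to_voxel_grid(events, n_bins, height, width):
    """Accumulate an asynchronous event stream into a time-binned voxel grid,
    making irregular events digestible by frame-based pipelines.
    events: (N, 4) array of (timestamp, x, y, polarity)."""
    grid = np.zeros((n_bins, height, width), dtype=np.float32)
    t = events[:, 0]
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9)   # normalize to [0, 1]
    bins = np.minimum((t_norm * n_bins).astype(int), n_bins - 1)
    x, y = events[:, 1].astype(int), events[:, 2].astype(int)
    pol = np.where(events[:, 3] > 0, 1.0, -1.0)             # ON/OFF polarity
    np.add.at(grid, (bins, y, x), pol)                      # signed accumulation
    return grid

ev = np.column_stack([np.sort(np.random.rand(1000)),        # timestamps
                      np.random.randint(0, 240, 1000),      # x
                      np.random.randint(0, 180, 1000),      # y
                      np.random.randint(0, 2, 1000)])       # polarity
vox = events_to_voxel_grid(ev, n_bins=5, height=180, width=240)
```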
Abstract:Multi-view camera-based 3D object detection has gained popularity due to its low cost. However, accurately inferring 3D geometry solely from camera data remains challenging and limits model performance. One promising way to address this issue is to distill precise 3D geometry knowledge from LiDAR data, but transferring knowledge between different sensor modalities is hindered by the significant modality gap. In this paper, we approach this challenge from the perspectives of both architecture design and knowledge distillation and present a new simulated multi-modal 3D object detection method named BEVSimDet. We first introduce a novel framework that includes a LiDAR-and-camera fusion-based teacher and a simulated multi-modal student, where the student simulates multi-modal features from image-only input. To facilitate effective distillation, we propose a simulated multi-modal distillation scheme that supports intra-modal, cross-modal, and multi-modal distillation simultaneously in bird's-eye-view (BEV) space. By combining them, BEVSimDet learns better feature representations for 3D object detection while enjoying cost-effective camera-only deployment. Experimental results on the challenging nuScenes benchmark demonstrate the effectiveness and superiority of BEVSimDet over recent representative methods. The source code will be released at \href{https://github.com/ViTAE-Transformer/BEVSimDet}{BEVSimDet}.
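A schematic of how the three distillation terms might be combined over BEV feature maps is given below; the feature names, MSE losses, and weights are illustrative assumptions, and the actual BEVSimDet objectives may differ:

```python
import torch
import torch.nn.functional as F

def simulated_multimodal_distill(stu_img, stu_sim_lidar, stu_fused,
                                 tea_img, tea_lidar, tea_fused,
                                 w_intra=1.0, w_cross=1.0, w_multi=1.0):
    """Toy combination of intra-modal, cross-modal, and multi-modal
    feature distillation in BEV space."""
    intra = F.mse_loss(stu_img, tea_img)          # image branch mimics image teacher
    cross = F.mse_loss(stu_sim_lidar, tea_lidar)  # simulated LiDAR mimics real LiDAR
    multi = F.mse_loss(stu_fused, tea_fused)      # fused BEV mimics fused teacher
    return w_intra * intra + w_cross * cross + w_multi * multi

# BEV feature maps: (batch, channels, H_bev, W_bev)
f = lambda: torch.rand(2, 64, 128, 128)
loss = simulated_multimodal_distill(f(), f(), f(), f(), f(), f())
```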
Abstract:Neural radiance fields (NeRF) have become a popular 3D representation for human avatar reconstruction due to their high-quality rendering capabilities, e.g., for novel views and poses. However, previous methods for editing the geometry and appearance of an avatar allow only global editing through body shape parameters and 2D texture maps. In this paper, we propose a new approach named \textbf{U}nified \textbf{V}olumetric \textbf{A}vatar (\textbf{UVA}) that enables local and independent editing of both geometry and texture while retaining the ability to render novel views and poses. UVA transforms each observation point to a canonical space using a skinning motion field and represents geometry and texture in separate neural fields. Each field is composed of a set of structured latent codes that are attached to anchor nodes on a deformable mesh in canonical space and diffused into the entire space via interpolation, allowing for local editing. To address spatial ambiguity in code interpolation, we use a local signed height indicator. We also replace the view-dependent radiance color with a pose-dependent shading factor to better represent surface illumination in different poses. Experiments on multiple human avatars demonstrate that UVA achieves competitive results in novel view synthesis and novel pose rendering while enabling local and independent editing of geometry and appearance. The source code will be released.
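The code-diffusion step can be pictured as inverse-distance interpolation of anchor latent codes over the k nearest anchor nodes, as in this hypothetical sketch (UVA's exact interpolation and the signed-height disambiguation are not reproduced here):

```python
import torch

def interpolate_anchor_codes(query, anchors, codes, k=4):
    """Diffuse structured latent codes from anchor nodes into space via
    inverse-distance interpolation over the k nearest anchors; editing one
    anchor's code then only affects its local neighborhood."""
    d = torch.cdist(query, anchors)              # (Q, A) pairwise distances
    dist, idx = d.topk(k, dim=1, largest=False)  # k nearest anchors per query
    w = 1.0 / (dist + 1e-8)
    w = w / w.sum(dim=1, keepdim=True)           # normalized inverse-distance weights
    return (w.unsqueeze(-1) * codes[idx]).sum(dim=1)

anchors = torch.rand(200, 3)   # anchor nodes on a canonical mesh
codes = torch.randn(200, 32)   # one structured latent code per anchor
z = interpolate_anchor_codes(torch.rand(1024, 3), anchors, codes)  # (1024, 32)
```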
Abstract:Image matting refers to extracting a precise alpha matte from natural images, and it plays a critical role in various downstream applications, such as image editing. Although it is an ill-posed problem, traditional methods have been trying to solve it for decades. The emergence of deep learning has revolutionized the field of image matting and given birth to multiple new techniques, including automatic, interactive, and referring image matting. This paper presents a comprehensive review of recent advancements in image matting in the era of deep learning. We focus on two fundamental sub-tasks: auxiliary input-based image matting, which involves user-defined input to predict the alpha matte, and automatic image matting, which generates results without any manual intervention. We systematically review the existing methods for these two tasks according to their task settings and network structures and summarize their advantages and disadvantages. Furthermore, we introduce the commonly used image matting datasets and evaluate the performance of representative matting methods both quantitatively and qualitatively. Finally, we discuss relevant applications of image matting and highlight existing challenges and potential opportunities for future research. We also maintain a public repository to track the rapid development of deep image matting at https://github.com/JizhiziLi/matting-survey.
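The ill-posedness follows directly from the standard compositing equation that matting inverts: at each pixel $p$, the observed color provides three constraints, while the foreground color, background color, and alpha together contribute seven unknowns:

```latex
I_p = \alpha_p F_p + (1 - \alpha_p) B_p, \qquad \alpha_p \in [0, 1]
```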
Abstract:The field of deep learning has witnessed significant progress, particularly in computer vision (CV), natural language processing (NLP), and speech. The use of large-scale models trained on vast amounts of data holds immense promise for practical applications, enhancing industrial productivity and facilitating social development. However, such training places ever-growing demands on computational capacity, and although numerous studies have explored efficient training, a comprehensive summary of techniques for accelerating the training of deep learning models is still much anticipated. In this survey, we present a detailed review of training acceleration. We consider the fundamental update formulation and split its basic components into five main perspectives: (1) data-centric, including dataset regularization, data sampling, and data-centric curriculum learning techniques, which can significantly reduce the computational complexity of the data samples; (2) model-centric, including acceleration of basic modules, compression training, model initialization, and model-centric curriculum learning techniques, which focus on accelerating training by reducing the computation over parameters; (3) optimization-centric, including the selection of learning rates, the use of large batch sizes, the design of efficient objectives, and model averaging techniques, which focus on the training policy and on improving the generalization of large-scale models; (4) budgeted training, including distinctive acceleration methods for resource-constrained situations; and (5) system-centric, including efficient open-source distributed libraries and systems that provide adequate hardware support for implementing the above acceleration algorithms. This comprehensive taxonomy offers a clear view of the general mechanisms within each component and their joint interactions.
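The "fundamental update formulation" around which the taxonomy is organized can be read (our paraphrase, not the survey's exact notation) as the generic stochastic update below: data-centric methods shape the minibatch $\mathcal{B}_t$, model-centric methods the parameterization $\theta$, optimization-centric methods the step rule ($\eta_t$, batch size, objective $\mathcal{L}$), while budgeted and system-centric methods constrain or accelerate the execution of the loop itself:

```latex
\theta_{t+1} = \theta_t - \eta_t \, \nabla_{\theta} \mathcal{L}\left(\theta_t;\, \mathcal{B}_t\right)
```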
Abstract:Despite remarkable successes in solving various complex decision-making tasks, training an imitation learning (IL) algorithm with deep neural networks (DNNs) suffers from a high computation burden. In this work, we propose quantum imitation learning (QIL) in the hope of utilizing quantum advantages to speed up IL. Concretely, we develop two QIL algorithms: quantum behavioural cloning (Q-BC) and quantum generative adversarial imitation learning (Q-GAIL). Q-BC is trained offline with a negative log-likelihood loss, which suits cases with extensive expert data, whereas Q-GAIL works in an online, on-policy inverse reinforcement learning scheme that is suitable for cases with limited expert data. For both QIL algorithms, we adopt variational quantum circuits (VQCs) in place of DNNs to represent policies, modifying them with data re-uploading and scaling parameters to enhance expressivity. We first encode classical data into quantum states as inputs, then execute the VQCs, and finally measure quantum outputs to obtain the control signals of agents. Experimental results demonstrate that both Q-BC and Q-GAIL achieve performance comparable to their classical counterparts, with the potential for quantum speed-up. To our knowledge, we are the first to propose the concept of QIL and conduct pilot studies, paving the way for the quantum era.
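The encode-compute-measure pipeline for a VQC policy with data re-uploading and input scaling can be sketched as follows, using PennyLane for illustration; the circuit layout, gate choices, and hyperparameters are assumptions, not the paper's exact design:

```python
import numpy as np
import pennylane as qml

n_qubits, n_layers = 4, 3
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev)
def vqc_policy(obs, weights, in_scales):
    """Data re-uploading VQC: the scaled observation is re-encoded before
    every trainable layer to enhance expressivity; measured expectation
    values serve as the policy's raw outputs."""
    for l in range(n_layers):
        for q in range(n_qubits):
            qml.RY(in_scales[l, q] * obs[q % len(obs)], wires=q)  # re-upload input
            qml.RZ(weights[l, q, 0], wires=q)                     # trainable rotation
            qml.RY(weights[l, q, 1], wires=q)
        for q in range(n_qubits - 1):
            qml.CNOT(wires=[q, q + 1])                            # entangling layer
    return [qml.expval(qml.PauliZ(q)) for q in range(n_qubits)]

obs = np.array([0.1, -0.4, 0.7, 0.2])
weights = np.random.uniform(-np.pi, np.pi, (n_layers, n_qubits, 2))
in_scales = np.ones((n_layers, n_qubits))
action_logits = vqc_policy(obs, weights, in_scales)  # post-process into control signals
```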