Xiaolong Li

IntentDial: An Intent Graph based Multi-Turn Dialogue System with Reasoning Path Visualization

Oct 18, 2023
Zengguang Hao, Jie Zhang, Binxia Xu, Yafang Wang, Gerard de Melo, Xiaolong Li

Intent detection and identification from multi-turn dialogue has become a widely explored technique in conversational agents, for example, voice assistants and intelligent customer service. Conventional approaches typically cast the intent mining process as a classification task. Although neural classifiers have proven adept at such classification tasks, issues inherent to neural network models often impede their practical deployment in real-world settings. We present a novel graph-based multi-turn dialogue system called IntentDial, which identifies a user's intent by locating intent elements and a standard query from a dynamically constructed and extensible intent graph using reinforcement learning. In addition, we provide visualization components to monitor the immediate reasoning path for each turn of a dialogue, which greatly facilitates further improvement of the system.

* 4 pages, 5 figures
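As a rough illustration of the graph lookup described in the abstract, the toy sketch below matches the intent elements collected so far against a small hand-written intent graph and returns the standard queries they cover. The graph contents and matching rule are assumptions for illustration only; the actual system constructs the graph dynamically and selects elements with reinforcement learning.

```python
# Toy intent graph: each standard query becomes a candidate once its required
# intent elements have been identified in the dialogue (illustrative only).
intent_graph = {
    "refund_order":   {"refund", "order"},
    "change_address": {"address", "order"},
    "reset_password": {"password"},
}

def candidate_queries(collected_elements):
    """Return standard queries whose required intent elements are all collected."""
    collected = set(collected_elements)
    return [query for query, required in intent_graph.items() if required <= collected]

print(candidate_queries(["order", "refund"]))   # ['refund_order']
```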

Group Pose: A Simple Baseline for End-to-End Multi-person Pose Estimation

Aug 14, 2023
Huan Liu, Qiang Chen, Zichang Tan, Jiang-Jiang Liu, Jian Wang, Xiangbo Su, Xiaolong Li, Kun Yao, Junyu Han, Errui Ding, Yao Zhao, Jingdong Wang

In this paper, we study the problem of end-to-end multi-person pose estimation. State-of-the-art solutions adopt a DETR-like framework and mainly develop complex decoders, e.g., regarding pose estimation as keypoint box detection and combining it with human detection in ED-Pose, or hierarchically predicting with a pose decoder and a joint (keypoint) decoder in PETR. We present a simple yet effective transformer approach named Group Pose. We simply regard $K$-keypoint pose estimation as predicting a set of $N\times K$ keypoint positions, each from a keypoint query, and represent each pose with an instance query for scoring the $N$ pose predictions. Motivated by the intuition that interaction among across-instance queries of different types is not directly helpful, we make a simple modification to decoder self-attention. We replace the single self-attention over all $N\times(K+1)$ queries with two subsequent group self-attentions: (i) $N$ within-instance self-attentions, each over $K$ keypoint queries and one instance query, and (ii) $(K+1)$ same-type across-instance self-attentions, each over $N$ queries of the same type. The resulting decoder removes the interaction among across-instance queries of different types, easing optimization and thus improving performance. Experimental results on MS COCO and CrowdPose show that our approach without human box supervision is superior to previous methods with complex decoders, and is even slightly better than ED-Pose, which uses human box supervision. $\href{https://github.com/Michel-liu/GroupPose-Paddle}{\rm Paddle}$ and $\href{https://github.com/Michel-liu/GroupPose}{\rm PyTorch}$ code are available.

* Accepted by ICCV 2023 
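The decoder modification lends itself to a short sketch. The snippet below reproduces the two group self-attentions with plain PyTorch nn.MultiheadAttention: one pass over the K+1 queries of each instance, then one pass over the N queries of each type. Tensor sizes, the shared attention weights, and the omission of cross-attention and feed-forward layers are simplifications, not the paper's implementation.

```python
import torch
import torch.nn as nn

N, K, D = 20, 17, 256                    # pose queries, keypoints per pose, embed dim
queries = torch.randn(N, K + 1, D)       # K keypoint queries + 1 instance query per pose

within = nn.MultiheadAttention(D, num_heads=8, batch_first=True)
across = nn.MultiheadAttention(D, num_heads=8, batch_first=True)

# (i) N within-instance self-attentions: each pose attends over its own K+1 queries.
x, _ = within(queries, queries, queries)     # batch dimension indexes the N instances

# (ii) (K+1) same-type across-instance self-attentions: queries of the same type
# (keypoint index or instance query) attend across the N poses.
x = x.transpose(0, 1)                        # (K+1, N, D): batch dimension indexes types
x, _ = across(x, x, x)
x = x.transpose(0, 1)                        # back to (N, K+1, D)
```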

Towards Visual Foundational Models of Physical Scenes

Jun 06, 2023
Chethan Parameshwara, Alessandro Achille, Matthew Trager, Xiaolong Li, Jiawei Mo, Ashwin Swaminathan, CJ Taylor, Dheera Venkatraman, Xiaohan Fei, Stefano Soatto

We describe a first step towards learning general-purpose visual representations of physical scenes using only image prediction as a training criterion. To do so, we first define "physical scene" and show that, even though different agents may maintain different representations of the same scene, the underlying physical scene that can be inferred is unique. We then show that NeRFs cannot represent the physical scene, as they lack extrapolation mechanisms. Those, however, could be provided by Diffusion Models, at least in theory. To test this hypothesis empirically, we combine NeRFs with Diffusion Models, a process we refer to as NeRF Diffusion, and use the result as an unsupervised representation of the physical scene. Our analysis is limited to visual data, without external grounding mechanisms that could be provided by independent sensory modalities.

* TLDR: Physical scenes are equivalence classes of sufficient statistics, and can be inferred uniquely by any agent measuring the same finite data; We formalize and implement an approach to representation learning that overturns "naive realism" in favor of an analytical approach of Russell and Koenderink. NeRFs cannot capture the physical scenes, but combined with Diffusion Models they can 

Number-Adaptive Prototype Learning for 3D Point Cloud Semantic Segmentation

Oct 18, 2022
Yangheng Zhao, Jun Wang, Xiaolong Li, Yue Hu, Ce Zhang, Yanfeng Wang, Siheng Chen

3D point cloud semantic segmentation is one of the fundamental tasks for 3D scene understanding and is widely used in metaverse applications. Many recent 3D semantic segmentation methods learn a single prototype (classifier weights) for each semantic class and classify 3D points according to their nearest prototype. However, learning only one prototype per class limits the model's ability to describe the high-variance patterns within a class. Instead of learning a single prototype for each class, we propose to use an adaptive number of prototypes to dynamically describe the different point patterns within a semantic class. Building on the capability of vision transformers, we design a Number-Adaptive Prototype Learning (NAPL) model for point cloud semantic segmentation. To train our NAPL model, we propose a simple yet effective prototype dropout training strategy, which enables the model to adaptively produce prototypes for each class. Experimental results on the SemanticKITTI dataset demonstrate that our method achieves a 2.3% mIoU improvement over the baseline model based on the point-wise classification paradigm.
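To make the prototype-based prediction rule concrete, here is a minimal sketch of classifying points by their nearest prototype when a class may own several prototypes. The feature extractor, similarity measure, and shapes are assumptions for illustration, not the NAPL architecture.

```python
import torch
import torch.nn.functional as F

def classify_by_prototypes(point_feats, prototypes, proto_class):
    """point_feats: (P, D) per-point features from a backbone.
    prototypes:  (M, D) prototypes produced for the scene (a class may own several).
    proto_class: (M,)  class id of each prototype.
    Returns per-point class predictions of shape (P,)."""
    sim = F.normalize(point_feats, dim=-1) @ F.normalize(prototypes, dim=-1).T  # (P, M)
    nearest = sim.argmax(dim=-1)                                                # (P,)
    return proto_class[nearest]

# Toy usage: 3 classes, where class 1 is described by two prototypes.
feats = torch.randn(1000, 64)
protos = torch.randn(4, 64)
proto_cls = torch.tensor([0, 1, 1, 2])
pred = classify_by_prototypes(feats, protos, proto_cls)
```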


Tracking and Reconstructing Hand Object Interactions from Point Cloud Sequences in the Wild

Sep 24, 2022
Jiayi Chen, Mi Yan, Jiazhao Zhang, Yinzhen Xu, Xiaolong Li, Yijia Weng, Li Yi, Shuran Song, He Wang

In this work, we tackle the challenging task of jointly tracking hand-object pose and reconstructing their shapes from depth point cloud sequences in the wild, given the initial poses at frame 0. We propose, for the first time, a point-cloud-based hand joint tracking network, HandTrackNet, to estimate the inter-frame hand joint motion. HandTrackNet features a novel hand pose canonicalization module to ease the tracking task, yielding accurate and robust hand joint tracking. Our pipeline then reconstructs the full hand by converting the predicted hand joints into the template-based parametric hand model MANO. For object tracking, we devise a simple yet effective module that estimates the object SDF from the first frame and performs optimization-based tracking. Finally, a joint optimization step performs joint hand and object reasoning, which alleviates occlusion-induced ambiguity and further refines the hand pose. During training, the whole pipeline sees only purely synthetic data, which are synthesized with sufficient variation and with depth simulation for ease of generalization. The pipeline is robust to the generalization gap and thus directly transferable to real in-the-wild data. We evaluate our method on two real hand-object interaction datasets, i.e., HO3D and DexYCB, without any finetuning. Our experiments demonstrate that the proposed method significantly outperforms previous state-of-the-art depth-based hand and object pose estimation and tracking methods, running at a frame rate of 9 FPS.
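The canonicalization idea can be illustrated with a small sketch: express the current frame's hand points in the coordinate frame estimated for the previous frame, so the tracker only needs to predict a small residual motion. This is an assumed reading of the module, not HandTrackNet's actual implementation.

```python
import numpy as np

def canonicalize(points, prev_R, prev_t):
    """points: (P, 3) current-frame hand points in camera coordinates.
    prev_R (3, 3), prev_t (3,): global hand rotation and translation estimated
    at the previous frame. Returns the points expressed in that hand frame."""
    # Row-vector form of prev_R.T @ (p - prev_t) applied to every point p.
    return (points - prev_t) @ prev_R

pts = np.random.rand(512, 3)
canon = canonicalize(pts, np.eye(3), np.array([0.1, 0.0, 0.4]))
```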


Leveraging SE(3) Equivariance for Self-Supervised Category-Level Object Pose Estimation

Oct 30, 2021
Xiaolong Li, Yijia Weng, Li Yi, Leonidas Guibas, A. Lynn Abbott, Shuran Song, He Wang

Category-level object pose estimation aims to find 6D object poses of previously unseen object instances from known categories without access to object CAD models. To reduce the huge amount of pose annotations needed for category-level learning, we propose for the first time a self-supervised learning framework to estimate category-level 6D object pose from single 3D point clouds. During training, our method assumes no ground-truth pose annotations, no CAD models, and no multi-view supervision. The key to our method is to disentangle shape and pose through an invariant shape reconstruction module and an equivariant pose estimation module, empowered by SE(3) equivariant point cloud networks. The invariant shape reconstruction module learns to perform aligned reconstructions, yielding a category-level reference frame without using any annotations. In addition, the equivariant pose estimation module achieves category-level pose estimation accuracy that is comparable to some fully supervised methods. Extensive experiments demonstrate the effectiveness of our approach on both complete and partial depth point clouds from the ModelNet40 benchmark, and on real depth point clouds from the NOCS-REAL 275 dataset. The project page with code and visualizations can be found at: https://dragonlong.github.io/equi-pose.

* 20 pages, 11 figures 
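The self-supervised signal can be summarized with a short sketch: the pose-invariant canonical reconstruction, transformed by the equivariant pose prediction, should overlap the observed point cloud. The chamfer term below stands in for the paper's full training objective and is illustrative only.

```python
import torch

def chamfer(a, b):
    """Symmetric chamfer distance between point sets a: (P, 3) and b: (Q, 3)."""
    d = torch.cdist(a, b)                      # (P, Q) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def self_supervised_loss(observed, canonical_recon, R, t):
    """observed: (P, 3) input points; canonical_recon: (Q, 3) reconstruction in the
    learned category-level frame; R: (3, 3), t: (3,) predicted object pose."""
    posed = canonical_recon @ R.T + t          # bring the reconstruction into camera frame
    return chamfer(posed, observed)
```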

R4: A Framework for Route Representation and Route Recommendation

Oct 25, 2021
Ran Cheng, Chao Chen, Longfei Xu, Shen Li, Lei Wang, Hengbin Cui, Kaikui Liu, Xiaolong Li

Route recommendation is significant in navigation services. Two major challenges for route recommendation are route representation and user representation. Unlike items that can be identified by unique IDs in traditional recommendation, routes are combinations of links (i.e., a road segment and its following action, such as turning left), and the number of combinations can be close to infinite. Besides, the representation of a route changes under different scenarios. These facts result in severe sparsity of routes, which increases the difficulty of route representation. Moreover, link attribute deficiencies and errors affect the preciseness of route representation. Because of the sparsity of routes, the interaction data between users and routes are also sparse. This makes it difficult to acquire user representations from historical user-item interactions, as traditional recommender systems do. To address these issues, we propose a novel learning framework, R4. In R4, we design a sparse & dense network to obtain representations of routes. The sparse unit learns link ID embeddings and aggregates them to represent a route, which captures implicit route characteristics and subsequently alleviates problems caused by link attribute deficiencies and errors. The dense unit extracts implicit local features of routes from link attributes. For user representation, we utilize a series of historical navigation records to extract user preferences. R4 achieves remarkable performance in both offline and online experiments.
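Below is a minimal sketch of a sparse & dense route encoder in the spirit described above: the sparse unit embeds link IDs and pools them into a route vector, while the dense unit encodes numeric link attributes. Module names, dimensions, and the mean-pooling aggregation are illustrative assumptions, not R4's actual architecture.

```python
import torch
import torch.nn as nn

class RouteEncoder(nn.Module):
    def __init__(self, num_links, id_dim=32, attr_dim=8, hidden=64):
        super().__init__()
        self.link_emb = nn.Embedding(num_links, id_dim)              # sparse unit
        self.attr_mlp = nn.Sequential(nn.Linear(attr_dim, hidden),   # dense unit
                                      nn.ReLU(), nn.Linear(hidden, id_dim))

    def forward(self, link_ids, link_attrs):
        """link_ids: (L,) int64 link IDs along the route.
        link_attrs: (L, attr_dim) per-link attributes (length, speed limit, ...)."""
        sparse_repr = self.link_emb(link_ids).mean(dim=0)            # aggregate link IDs
        dense_repr = self.attr_mlp(link_attrs).mean(dim=0)           # aggregate attributes
        return torch.cat([sparse_repr, dense_repr], dim=-1)          # route vector

enc = RouteEncoder(num_links=10_000)
route_vec = enc(torch.randint(0, 10_000, (15,)), torch.randn(15, 8))
```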


Denoising User-aware Memory Network for Recommendation

Jul 12, 2021
Zhi Bian, Shaojun Zhou, Hao Fu, Qihong Yang, Zhenqi Sun, Junjie Tang, Guiquan Liu, Kaikui Liu, Xiaolong Li

For better user satisfaction and business effectiveness, increasing attention has been paid to sequence-based recommendation systems, which infer the evolution of users' dynamic preferences, and recent studies have noticed that this evolution can be better understood from implicit and explicit feedback sequences. However, most existing recommendation techniques do not consider the noise contained in implicit feedback, which leads to biased representations of user interest and suboptimal recommendation performance. Meanwhile, existing methods utilize the item sequence to capture the evolution of user interest. Their performance is limited by the length of the sequence, and they cannot effectively model long-term interest over a long period of time. Based on this observation, we propose a novel CTR model named the denoising user-aware memory network (DUMN). Specifically, the framework (i) proposes a feature purification module based on orthogonal mapping, which uses the representation of explicit feedback to purify the representation of implicit feedback and effectively denoises the implicit feedback; (ii) designs a user memory network to model long-term interests in a fine-grained way by improving the memory network, which is ignored by existing methods; and (iii) develops a preference-aware interactive representation component to fuse the long-term and short-term interests of users based on gating, so as to understand the evolution of unbiased user preferences. Extensive experiments on two real e-commerce user behavior datasets show that DUMN achieves a significant improvement over state-of-the-art baselines. The code of the DUMN model has been uploaded as an additional material.
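The orthogonal-mapping purification can be pictured with a small sketch: decompose the implicit-feedback representation into a component aligned with the explicit-feedback representation and a residual orthogonal to it. How DUMN weighs and recombines the two parts is more involved; the projection below is illustrative only.

```python
import torch

def orthogonal_purify(implicit, explicit, eps=1e-8):
    """implicit, explicit: (B, D) batches of user-interest vectors.
    Returns the (parallel, orthogonal) components of `implicit` w.r.t. `explicit`."""
    scale = (implicit * explicit).sum(-1, keepdim=True) / \
            (explicit * explicit).sum(-1, keepdim=True).clamp_min(eps)
    parallel = scale * explicit          # part explainable by explicit feedback
    orthogonal = implicit - parallel     # residual component
    return parallel, orthogonal
```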


Interactive Question Clarification in Dialogue via Reinforcement Learning

Dec 17, 2020
Xiang Hu, Zujie Wen, Yafang Wang, Xiaolong Li, Gerard de Melo

Coping with ambiguous questions has been a perennial problem in real-world dialogue systems. Although clarification by asking questions is a common form of human interaction, it is hard to define appropriate questions to elicit more specific intents from a user. In this work, we propose a reinforcement model to clarify ambiguous questions by suggesting refinements of the original query. We first formulate a collection partitioning problem to select a set of labels that enables us to distinguish potential unambiguous intents. We present the chosen labels to the user as intent phrases for further confirmation. The selected label, along with the original user query, then serves as a refined query for which a suitable response can more easily be identified. The model is trained using reinforcement learning with a deep policy network. We evaluate our model based on real-world user clicks and demonstrate significant improvements across several different experiments.

* COLING industry track 
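As a rough sketch of the reinforcement-learning setup, the snippet below scores candidate refinement labels with a placeholder policy network, samples one to show the user, and applies a REINFORCE update with the observed click reward. The feature encoder, network, and reward signal are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

policy = nn.Linear(128, 1)                       # scores a (query, label) feature vector
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def step(candidate_feats, reward_fn):
    """candidate_feats: (C, 128) features of the candidate labels for one query.
    reward_fn: callable mapping the shown label index to a reward (e.g., 1.0 if clicked)."""
    logits = policy(candidate_feats).squeeze(-1)     # (C,)
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()                           # label shown to the user
    reward = reward_fn(action.item())
    loss = -dist.log_prob(action) * reward           # REINFORCE objective
    opt.zero_grad()
    loss.backward()
    opt.step()
    return action.item(), reward
```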

Natural grasp intention recognition based on gaze fixation in human-robot interaction

Dec 16, 2020
Bo Yang, Jian Huang, Xiaolong Li, Xinxing Chen, Caihua Xiong, Yasuhisa Hasegawa

Eye movement is closely related to limb actions, so it can be used to infer movement intentions. More importantly, in some cases, eye movement is the only way for paralyzed and impaired patients with severe movement disorders to communicate and interact with the environment. Despite this, eye-tracking technology still has very limited application scenarios as an intention recognition method. The goal of this paper is to achieve a natural fixation-based grasping intention recognition method, with which a user with hand movement disorders can intuitively express what tasks he or she wants to perform by directly looking at the object of interest. Toward this goal, we design experiments to study the relationships among fixations in different tasks. We derive quantitative features from these relationships and analyze them statistically. We then design a natural method for grasping intention recognition. The experimental results show that the accuracy of the proposed method for grasping intention recognition exceeds 89% on the training objects. When the method is applied to objects not included in the training set, the average accuracy exceeds 85%. A grasping experiment in a real environment verifies the effectiveness of the proposed method.
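As an illustration of turning fixations into inputs for intention recognition, the sketch below computes a few generic fixation statistics (dwell time, sample count, spatial dispersion) for gaze samples on an object. These are standard eye-tracking quantities, not necessarily the exact features proposed in the paper.

```python
import numpy as np

def fixation_features(gaze_xy, timestamps):
    """gaze_xy: (T, 2) gaze points on the object; timestamps: (T,) seconds."""
    duration = timestamps[-1] - timestamps[0]                       # total dwell time
    dispersion = np.linalg.norm(gaze_xy - gaze_xy.mean(axis=0), axis=1).mean()
    return np.array([duration, len(gaze_xy), dispersion])

feats = fixation_features(np.random.rand(120, 2), np.linspace(0.0, 2.0, 120))
```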
