Jian Sun

State Key Lab of Intelligent Control and Decision of Complex Systems and School of Automation, Beijing Institute of Technology, Beijing, China; Beijing Institute of Technology Chongqing Innovation Center, Chongqing, China

SPACE-3: Unified Dialog Model Pre-training for Task-Oriented Dialog Understanding and Generation

Sep 14, 2022

Wanwei He, Yinpei Dai, Min Yang, Jian Sun, Fei Huang, Luo Si, Yongbin Li

Abstract: Recently, pre-training methods have shown remarkable success in task-oriented dialog (TOD) systems. However, most existing pre-trained models for TOD focus on either dialog understanding or dialog generation, but not both. In this paper, we propose SPACE-3, a novel unified semi-supervised pre-trained conversation model learning from large-scale dialog corpora with limited annotations, which can be effectively fine-tuned on a wide range of downstream dialog tasks. Specifically, SPACE-3 consists of four successive components in a single transformer to maintain a task-flow in TOD systems: (i) a dialog encoding module to encode dialog history, (ii) a dialog understanding module to extract semantic vectors from either user queries or system responses, (iii) a dialog policy module to generate a policy vector that contains high-level semantics of the response, and (iv) a dialog generation module to produce appropriate responses. We design a dedicated pre-training objective for each component. Concretely, we pre-train the dialog encoding module with span mask language modeling to learn contextualized dialog information. To capture the structured dialog semantics, we pre-train the dialog understanding module via a novel tree-induced semi-supervised contrastive learning objective with the help of extra dialog annotations. In addition, we pre-train the dialog policy module by minimizing the L2 distance between its output policy vector and the semantic vector of the response for policy optimization. Finally, the dialog generation module is pre-trained by language modeling. Results show that SPACE-3 achieves state-of-the-art performance on eight downstream dialog benchmarks, including intent prediction, dialog state tracking, and end-to-end dialog modeling. We also show that SPACE-3 has a stronger few-shot ability than existing models under the low-resource setting.
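
The policy objective mentioned in the abstract, an L2 distance between a policy vector and the semantic vector of the response, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `l2_distance` and the 3-dimensional vectors are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import math


def l2_distance(policy_vec, semantic_vec):
    """Euclidean (L2) distance between two equal-length vectors."""
    if len(policy_vec) != len(semantic_vec):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((p - s) ** 2 for p, s in zip(policy_vec, semantic_vec)))


# Hypothetical vectors, for illustration only (real models use far more dimensions).
policy_vector = [1.0, 2.0, 2.0]
response_semantic_vector = [1.0, 0.0, 2.0]

# The two vectors differ only in the second coordinate (by 2), so the distance is 2.
loss = l2_distance(policy_vector, response_semantic_vector)
```

Minimizing this quantity during pre-training would pull the predicted policy vector toward the semantic vector of the actual response.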

* 14 pages, 5 figures. Accepted by SIGIR 2022

Implicit Full Waveform Inversion with Deep Neural Representation

Sep 08, 2022

Jian Sun, Kristopher Innanen

Abstract: Full waveform inversion (FWI) is commonly regarded as the state-of-the-art approach for imaging subsurface structures and physical parameters; however, its implementation usually faces great challenges, such as building a good initial model to escape from local minima and evaluating the uncertainty of inversion results. In this paper, we propose the implicit full waveform inversion (IFWI) algorithm using continuously and implicitly defined deep neural representations. Compared to FWI, which is sensitive to the initial model, IFWI benefits from the increased degrees of freedom with deep learning optimization, which allows it to start from a random initialization and greatly reduces the risk of non-uniqueness and of being trapped in local minima. Both theoretical and experimental analyses indicate that, given a random initial model, IFWI is able to converge to the global minimum and produce a high-resolution image of the subsurface with fine structures. In addition, uncertainty analysis of IFWI can be easily performed by approximating Bayesian inference with various deep learning approaches, which is analyzed in this paper by adding dropout neurons. Furthermore, IFWI has a certain degree of robustness and strong generalization ability, as exemplified in experiments on various 2D geological models. With proper setup, IFWI can also be well suited for multi-scale joint geophysical inversion.

A Survey on Text-to-SQL Parsing: Concepts, Methods, and Future Directions

Aug 29, 2022

Bowen Qin, Binyuan Hui, Lihan Wang, Min Yang, Jinyang Li, Binhua Li, Ruiying Geng, Rongyu Cao, Jian Sun, Luo Si(+2 more)

Abstract: Text-to-SQL parsing is an essential and challenging task. The goal of text-to-SQL parsing is to convert a natural language (NL) question to its corresponding structured query language (SQL) query based on the evidence provided by relational databases. Early text-to-SQL parsing systems from the database community achieved noticeable progress at the cost of heavy human engineering and user interactions with the systems. In recent years, deep neural networks have significantly advanced this task through neural generation models, which automatically learn a mapping function from an input NL question to an output SQL query. Subsequently, large pre-trained language models have taken the state of the art of the text-to-SQL parsing task to a new level. In this survey, we present a comprehensive review of deep learning approaches for text-to-SQL parsing. First, we introduce the text-to-SQL parsing corpora, which can be categorized as single-turn and multi-turn. Second, we provide a systematic overview of pre-trained language models and existing methods for text-to-SQL parsing. Third, we present readers with the challenges faced by text-to-SQL parsing and explore some potential future directions in this field.
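
To make the task definition concrete, the sketch below pairs one NL question with the SQL query a parser would be expected to produce, and runs that query with Python's standard `sqlite3` module. The `singer` table, its rows, and the question are all invented for illustration; no parsing model is involved, and the SQL string is written by hand.

```python
import sqlite3

# Hypothetical schema and rows, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, country TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO singer VALUES (?, ?, ?)",
    [("Ana", "France", 31), ("Bo", "France", 45), ("Cy", "Japan", 28)],
)

# Input to a text-to-SQL parser: a natural language question.
question = "How many singers are from France?"

# Target output: the SQL query that answers the question against this schema.
# Here it is written by hand; a parser would have to generate it.
predicted_sql = "SELECT COUNT(*) FROM singer WHERE country = 'France'"

# Executing the query on the database yields the answer to the question.
answer = conn.execute(predicted_sql).fetchone()[0]
conn.close()
```

The parser needs both the question and the database schema: the column name `country` and the table name `singer` appear nowhere in the question itself.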

Differentiable Architecture Search with Random Features

Aug 18, 2022

Xuanyang Zhang, Yonggang Li, Xiangyu Zhang, Yongtao Wang, Jian Sun

Abstract: Differentiable architecture search (DARTS) has significantly promoted the development of neural architecture search (NAS) techniques because of its high search efficiency and effectiveness, but it suffers from performance collapse. In this paper, we make efforts to alleviate the performance collapse problem for DARTS from two aspects. First, we investigate the expressive power of the supernet in DARTS and then derive a new setup of the DARTS paradigm that trains only BatchNorm. Second, we theoretically find that random features dilute the auxiliary connection role of skip-connection in supernet optimization and enable the search algorithm to focus on fairer operation selection, thereby solving the performance collapse problem. We instantiate DARTS and PC-DARTS with random features to build an improved version of each, named RF-DARTS and RF-PCDARTS respectively. Experimental results show that RF-DARTS obtains 94.36% test accuracy on CIFAR-10 (the result nearest to optimal in NAS-Bench-201), and achieves the newest state-of-the-art top-1 test error of 24.0% on ImageNet when transferring from CIFAR-10. Moreover, RF-DARTS performs robustly across three datasets (CIFAR-10, CIFAR-100, and SVHN) and four search spaces (S1-S4). In addition, RF-PCDARTS achieves even better results on ImageNet, that is, 23.9% top-1 and 7.1% top-5 test error, surpassing representative methods like single-path, training-free, and partial-channel paradigms directly searched on ImageNet.

* Tech Report

DBQ-SSD: Dynamic Ball Query for Efficient 3D Object Detection

Jul 22, 2022

Jinrong Yang, Lin Song, Songtao Liu, Zeming Li, Xiaoping Li, Hongbin Sun, Jian Sun, Nanning Zheng

Abstract: Many point-based 3D detectors adopt point-feature sampling strategies to drop some points for efficient inference. These strategies are typically based on fixed, handcrafted rules, making it difficult to handle complicated scenes. In contrast, we propose a Dynamic Ball Query (DBQ) network to adaptively select a subset of input points according to the input features, and assign the feature transform with a suitable receptive field for each selected point. It can be embedded into some state-of-the-art 3D detectors and trained in an end-to-end manner, which significantly reduces the computational cost. Extensive experiments demonstrate that our method can reduce latency by 30%-60% on the KITTI and Waymo datasets. Specifically, the inference speed of our detector can reach 162 FPS and 30 FPS with negligible performance degradation on the KITTI and Waymo datasets, respectively.
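
For context, the fixed-rule operation that point-based detectors commonly use is a plain ball query: gather up to a fixed number of points within a fixed radius of a center. The sketch below shows only that static baseline, not the dynamic, feature-dependent selection the paper proposes. The function name, the sample points, and the parameter values are hypothetical.

```python
import math


def ball_query(points, center, radius, max_samples):
    """Return indices of up to max_samples points lying within radius of center."""
    selected = []
    for idx, point in enumerate(points):
        if math.dist(point, center) <= radius:
            selected.append(idx)
            if len(selected) == max_samples:
                break  # fixed cap on neighbors, regardless of scene content
    return selected


# Hypothetical 3D points, for illustration only.
points = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 0.9, 0.0)]

# With a fixed radius of 1.0 and a cap of 2 neighbors, the first two in-range points win.
neighbors = ball_query(points, center=(0.0, 0.0, 0.0), radius=1.0, max_samples=2)
```

Both `radius` and `max_samples` are fixed in advance here, which is the kind of handcrafted rule the abstract contrasts with input-dependent selection.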

StreamYOLO: Real-time Object Detection for Streaming Perception

Jul 21, 2022

Jinrong Yang, Songtao Liu, Zeming Li, Xiaoping Li, Jian Sun

Abstract: The perception models of autonomous driving require fast inference with low latency for safety. While existing works ignore the inevitable environmental changes after processing, streaming perception combines latency and accuracy into a single metric for online video perception, guiding previous works to search for trade-offs between accuracy and speed. In this paper, we explore the performance of real-time models on this metric and endow the models with the capacity of predicting the future, significantly improving the results for streaming perception. Specifically, we build a simple framework with two effective modules. One is a Dual Flow Perception module (DFP). It consists of a dynamic flow and a static flow in parallel to capture the moving tendency and basic detection features, respectively. Trend Aware Loss (TAL) is the other module, which adaptively generates a loss weight for each object according to its moving speed. Realistically, we consider driving scenes with multiple velocities and further propose Velocity-awared streaming AP (VsAP) to jointly evaluate the accuracy. In this realistic setting, we design an efficient mix-velocity training strategy to guide the detector to perceive any velocity. Our simple method achieves state-of-the-art performance on the Argoverse-HD dataset and improves the sAP and VsAP by 4.7% and 8.2% respectively compared to the strong baseline, validating its effectiveness.

* Extended version of arXiv:2203.12338

Dense Teacher: Dense Pseudo-Labels for Semi-supervised Object Detection

Jul 19, 2022

Hongyu Zhou, Zheng Ge, Songtao Liu, Weixin Mao, Zeming Li, Haiyan Yu, Jian Sun

Abstract: To date, the most powerful semi-supervised object detectors (SS-OD) are based on pseudo-boxes, which need a sequence of post-processing steps with fine-tuned hyper-parameters. In this work, we propose replacing the sparse pseudo-boxes with dense predictions as a unified and straightforward form of pseudo-label. Compared to pseudo-boxes, our Dense Pseudo-Label (DPL) does not involve any post-processing method, thus retaining richer information. We also introduce a region selection technique to highlight the key information while suppressing the noise carried by dense labels. We name our proposed SS-OD algorithm that leverages the DPL Dense Teacher. On COCO and VOC, Dense Teacher shows superior performance under various settings compared with pseudo-box-based methods.

* ECCV 2022

Layout-Aware Information Extraction for Document-Grounded Dialogue: Dataset, Method and Demonstration

Jul 14, 2022

Zhenyu Zhang, Bowen Yu, Haiyang Yu, Tingwen Liu, Cheng Fu, Jingyang Li, Chengguang Tang, Jian Sun, Yongbin Li

Abstract: Building document-grounded dialogue systems has received growing interest, as documents convey a wealth of human knowledge and commonly exist in enterprises. In this setting, comprehending and retrieving information from documents is a challenging research problem. Previous work ignores the visual properties of documents and treats them as plain text, resulting in incomplete modality. In this paper, we propose a Layout-aware document-level Information Extraction dataset, LIE, to facilitate the study of extracting both structural and semantic knowledge from visually rich documents (VRDs), so as to generate accurate responses in dialogue systems. LIE contains 62k annotations of three extraction tasks from 4,061 pages in product and official documents, making it the largest VRD-based information extraction dataset to the best of our knowledge. We also develop benchmark methods that extend the token-based language model to consider layout features as humans do. Empirical results show that layout is critical for VRD-based extraction, and a system demonstration also verifies that the extracted knowledge can help locate the answers that users care about.

* Accepted to ACM Multimedia (MM) Industry Track 2022

Scaling up Kernels in 3D CNNs

Jun 21, 2022

Yukang Chen, Jianhui Liu, Xiaojuan Qi, Xiangyu Zhang, Jian Sun, Jiaya Jia

Abstract: Recent advances in 2D CNNs and vision transformers (ViTs) reveal that large kernels are essential for sufficiently large receptive fields and high performance. Inspired by this literature, we examine the feasibility and challenges of 3D large-kernel designs. We demonstrate that applying large convolutional kernels in 3D CNNs poses greater difficulties in both performance and efficiency. Existing techniques that work well in 2D CNNs are ineffective in 3D networks, including the popular depth-wise convolutions. To overcome these obstacles, we present the spatial-wise group convolution and its large-kernel module (SW-LK block). It avoids the optimization and efficiency issues of naive 3D large kernels. Our large-kernel 3D CNN network, i.e., LargeKernel3D, yields non-trivial improvements on various 3D tasks, including semantic segmentation and object detection. Notably, it achieves 73.9% mIoU on the ScanNetv2 semantic segmentation benchmark and 72.8% NDS on the nuScenes object detection benchmark, ranking 1st on the nuScenes LIDAR leaderboard. It is further boosted to 74.2% NDS with a simple multi-modal fusion. LargeKernel3D attains results comparable or superior to those of its CNN and transformer counterparts. For the first time, we show that large kernels are feasible and essential for 3D networks.
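
One reason naive large kernels are costlier in 3D than in 2D is simple arithmetic: a dense convolution with kernel size k has k^2 weights per channel pair in 2D but k^3 in 3D. The sketch below works through that count. The helper name and the channel width of 64 are hypothetical, bias terms are ignored, and this is only a back-of-the-envelope illustration rather than the paper's own analysis.

```python
def conv_weight_count(kernel_size, dims, in_channels, out_channels):
    """Weights in a dense convolution layer (no bias): k^dims * C_in * C_out."""
    return kernel_size ** dims * in_channels * out_channels


# Hypothetical layer width, for illustration only.
channels = 64

small_2d = conv_weight_count(3, 2, channels, channels)  # 9 * 64 * 64
large_2d = conv_weight_count(7, 2, channels, channels)  # 49 * 64 * 64
small_3d = conv_weight_count(3, 3, channels, channels)  # 27 * 64 * 64
large_3d = conv_weight_count(7, 3, channels, channels)  # 343 * 64 * 64

# Growth factor when moving from a 3-wide to a 7-wide kernel.
growth_2d = large_2d / small_2d  # 49 / 9
growth_3d = large_3d / small_3d  # 343 / 27
```

Going from kernel size 3 to 7 multiplies the weight count by roughly 5.4 in 2D but by roughly 12.7 in 3D, so the same enlargement is more than twice as expensive in the 3D case.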

* Code and models will be available at https://github.com/dvlab-research/LargeKernel3D

Duplex Conversation: Towards Human-like Interaction in Spoken Dialogue Systems

Jun 09, 2022

Ting-En Lin, Yuchuan Wu, Fei Huang, Luo Si, Jian Sun, Yongbin Li

Abstract: In this paper, we present Duplex Conversation, a multi-turn, multimodal spoken dialogue system that enables telephone-based agents to interact with customers like a human. We use the concept of full-duplex in telecommunication to demonstrate what a human-like interactive experience should be and how to achieve smooth turn-taking through three subtasks: user state detection, backchannel selection, and barge-in detection. In addition, we propose semi-supervised learning with multimodal data augmentation to leverage unlabeled data and increase model generalization. Experimental results on the three subtasks show that the proposed method achieves consistent improvements compared with baselines. We deploy Duplex Conversation in Alibaba intelligent customer service and share lessons learned in production. Online A/B experiments show that the proposed system can significantly reduce response latency by 50%.

* Accepted by KDD 2022, ADS track
