Jingwei Xu

Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey

Nov 21, 2023
Yunpeng Huang, Jingwei Xu, Zixu Jiang, Junyu Lai, Zenan Li, Yuan Yao, Taolue Chen, Lijuan Yang, Zhou Xin, Xiaoxing Ma

Sparked by the success of ChatGPT, Transformer-based Large Language Models (LLMs) have paved a revolutionary path toward Artificial General Intelligence (AGI) and have been applied in diverse areas such as knowledge bases, human interfaces, and dynamic agents. However, a prevailing limitation remains: many current LLMs, constrained by resources, are primarily pre-trained on shorter texts, rendering them less effective on the longer-context prompts commonly encountered in real-world settings. In this paper, we present a comprehensive survey of advances in Transformer-based model architecture aimed at optimizing long-context capabilities across all stages, from pre-training to inference. We first delineate and analyze the problems current Transformer-based models face when handling long-context input and output. We then offer a holistic taxonomy to navigate the landscape of architectural Transformer upgrades that address these problems. Afterward, we survey widely used evaluation resources tailored to long-context LLMs, including datasets, metrics, and baseline models, as well as optimization toolkits such as libraries, systems, and compilers that improve LLM efficiency and efficacy across different stages. Finally, we discuss the predominant challenges and potential avenues for future research in this domain. We also maintain a repository curating the relevant literature with real-time updates at https://github.com/Strivin0311/long-llms-learning.

* 35 pages, 3 figures, 4 tables 
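
To make the core bottleneck concrete, here is a minimal NumPy sketch (not taken from the survey) contrasting standard full self-attention, whose score matrix grows quadratically with context length L, with a sliding-window variant representative of the local-attention upgrades such surveys cover. The window size and all function names are illustrative assumptions.

```python
import numpy as np

def full_attention(q, k, v):
    # Standard scaled dot-product attention: the (L, L) score matrix is what
    # makes memory and compute grow quadratically with context length L.
    scores = q @ k.T / np.sqrt(q.shape[-1])           # (L, L)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                                # (L, d)

def sliding_window_attention(q, k, v, window=64):
    # A common long-context workaround: each query attends only to a local
    # window of keys, so cost grows linearly in L (at the price of locality).
    L, d = q.shape
    out = np.empty_like(v)
    for i in range(L):
        lo = max(0, i - window)
        s = q[i] @ k[lo:i + 1].T / np.sqrt(d)
        w = np.exp(s - s.max())
        w /= w.sum()
        out[i] = w @ v[lo:i + 1]
    return out

if __name__ == "__main__":
    L, d = 512, 32
    rng = np.random.default_rng(0)
    q, k, v = (rng.standard_normal((L, d)) for _ in range(3))
    print(full_attention(q, k, v).shape, sliding_window_attention(q, k, v).shape)
```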

DebSDF: Delving into the Details and Bias of Neural Indoor Scene Reconstruction

Sep 03, 2023
Yuting Xiao, Jingwei Xu, Zehao Yu, Shenghua Gao

In recent years, the neural implicit surface has emerged as a powerful representation for multi-view surface reconstruction due to its simplicity and state-of-the-art performance. However, reconstructing smooth and detailed surfaces in indoor scenes from multi-view images presents unique challenges. Indoor scenes typically contain large texture-less regions, making the photometric loss unreliable for optimizing the implicit surface. Previous work utilizes monocular geometry priors to improve reconstruction in indoor scenes. However, monocular priors often contain substantial errors in thin-structure regions due to domain gaps and the inherent inconsistencies that arise when they are derived independently from different views. This paper presents DebSDF to address these challenges, focusing on the utilization of uncertainty in monocular priors and the bias in SDF-based volume rendering. We propose an uncertainty modeling technique that associates larger uncertainties with larger errors in the monocular priors. High-uncertainty priors are then excluded from optimization to prevent bias. This uncertainty measure also informs an importance-guided ray sampling scheme and an adaptive smoothness regularization, enhancing the learning of fine structures. We further introduce a bias-aware signed-distance-to-density transformation that takes into account the curvature and the angle between the view direction and the SDF normal to better reconstruct fine details. Our approach has been validated through extensive experiments on several challenging datasets, demonstrating improved qualitative and quantitative results in reconstructing thin structures in indoor scenes, thereby outperforming previous work.
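
As a rough illustration of the uncertainty-gating idea described above, the sketch below masks out monocular depth priors whose estimated uncertainty exceeds a threshold before computing a prior-consistency loss. The function name, the fixed threshold, and the L1 form are assumptions for illustration; DebSDF's actual uncertainty modeling and loss design differ in detail.

```python
import torch

def uncertainty_gated_prior_loss(pred_depth, prior_depth, prior_uncertainty, thresh=0.5):
    # Hypothetical illustration: exclude monocular depth priors whose estimated
    # uncertainty is high, so unreliable priors do not bias the reconstruction.
    mask = (prior_uncertainty < thresh).float()       # keep only confident priors
    per_pixel = torch.abs(pred_depth - prior_depth)   # L1 consistency with the prior
    return (mask * per_pixel).sum() / mask.sum().clamp(min=1.0)

# Example usage with random tensors standing in for rendered and prior depth maps.
pred = torch.rand(1, 1, 64, 64)
prior = torch.rand(1, 1, 64, 64)
unc = torch.rand(1, 1, 64, 64)
print(uncertainty_gated_prior_loss(pred, prior, unc))
```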


Out-of-Domain Human Mesh Reconstruction via Dynamic Bilevel Online Adaptation

Nov 07, 2021
Shanyan Guan, Jingwei Xu, Michelle Z. He, Yunbo Wang, Bingbing Ni, Xiaokang Yang

We consider a new problem of adapting a human mesh reconstruction model to out-of-domain streaming videos, where the performance of existing SMPL-based models is significantly degraded by the distribution shift arising from different camera parameters, bone lengths, backgrounds, and occlusions. We tackle this problem through online adaptation, gradually correcting the model bias during testing. There are two main challenges. First, the lack of 3D annotations increases the training difficulty and results in 3D ambiguities. Second, the non-stationary data distribution makes it difficult to strike a balance between fitting regular frames and hard samples with severe occlusions or dramatic changes. To this end, we propose the Dynamic Bilevel Online Adaptation algorithm (DynaBOA). It first introduces temporal constraints to compensate for the missing 3D annotations and leverages a bilevel optimization procedure to resolve the conflicts between the multiple objectives. DynaBOA provides additional 3D guidance by co-training with similar source examples, which are retrieved efficiently despite the distribution shift. Furthermore, it adaptively adjusts the number of optimization steps on individual frames to fully fit hard samples while avoiding overfitting to regular frames. DynaBOA achieves state-of-the-art results on three out-of-domain human mesh reconstruction benchmarks.

* 14 pages, 13 figures; code repository: https://github.com/syguan96/DynaBOA 
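
A hedged sketch of the adaptive-step idea in DynaBOA: each streaming frame receives a variable number of gradient updates, stopping early once the loss plateaus so regular frames are not over-fitted. The function names, the plateau criterion, and the toy model and loss are assumptions, not the paper's implementation.

```python
import torch
from torch import nn

def online_adapt(model, frames, frame_loss, max_steps=5, tol=1e-3, lr=1e-4):
    # Hypothetical sketch of dynamic online adaptation: take a variable number of
    # gradient steps per streaming frame, stopping early once the loss plateaus,
    # so hard frames get more updates and regular frames are not over-fitted.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for frame in frames:
        prev = float("inf")
        for _ in range(max_steps):
            loss = frame_loss(model, frame)
            opt.zero_grad()
            loss.backward()
            opt.step()
            if prev - loss.item() < tol:      # loss barely improved: stop early
                break
            prev = loss.item()
        with torch.no_grad():
            yield model(frame)                # prediction after adaptation

if __name__ == "__main__":
    # Toy stand-ins: a linear "mesh regressor" and a 2D reprojection-style loss.
    toy_model = nn.Linear(10, 3)
    toy_frames = [torch.randn(1, 10) for _ in range(4)]
    toy_loss = lambda m, x: m(x).pow(2).mean()
    for pred in online_adapt(toy_model, toy_frames, toy_loss):
        print(pred.shape)
```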

PyTouch: A Machine Learning Library for Touch Processing

May 26, 2021
Mike Lambeta, Huazhe Xu, Jingwei Xu, Po-Wei Chou, Shaoxiong Wang, Trevor Darrell, Roberto Calandra

With the increased availability of rich tactile sensors comes a corresponding need for open-source, integrated software capable of efficiently and effectively processing raw touch measurements into high-level signals that can be used for control and decision-making. In this paper, we present PyTouch, the first machine learning library dedicated to the processing of touch sensing signals. PyTouch is designed to be modular and easy to use, and provides state-of-the-art touch processing capabilities as a service, with the goal of unifying the tactile sensing community by offering a library of scalable, proven, and performance-validated modules on which applications and research can be built. We evaluate PyTouch on real-world data from several tactile sensors on touch processing tasks such as touch detection, slip detection, and object pose estimation. PyTouch is open-sourced at https://github.com/facebookresearch/pytouch.

* 7 pages. Accepted at ICRA 2021 
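
The snippet below is deliberately not PyTouch's API; it only illustrates the kind of high-level signal the library provides as a service, by framing touch detection as binary classification over raw tactile sensor images. See the linked repository for the actual interface.

```python
import torch
from torch import nn

# Illustrative only: this is NOT PyTouch's API. It sketches how a touch-detection
# task can be framed as binary classification over raw tactile sensor images.
class TouchDetector(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, 1),                 # logit: touch vs. no touch
        )

    def forward(self, x):
        return self.net(x)

detector = TouchDetector()
fake_tactile_image = torch.rand(1, 3, 64, 64)   # stand-in for a sensor frame
prob_touch = torch.sigmoid(detector(fake_tactile_image))
print(float(prob_touch))
```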

Bilevel Online Adaptation for Out-of-Domain Human Mesh Reconstruction

Mar 30, 2021
Shanyan Guan, Jingwei Xu, Yunbo Wang, Bingbing Ni, Xiaokang Yang

This paper considers a new problem of adapting a pre-trained model of human mesh reconstruction to out-of-domain streaming videos. Most previous methods based on the parametric SMPL model (Loper et al., 2015) underperform in new domains with unexpected, domain-specific attributes such as camera parameters, bone lengths, backgrounds, and occlusions. Our general idea is to dynamically fine-tune the source model on test video streams with additional temporal constraints, so that it can mitigate the domain gaps without over-fitting to the 2D information of individual test frames. A subsequent challenge is how to avoid conflicts between the 2D and temporal constraints. We propose to tackle this problem with a new training algorithm named Bilevel Online Adaptation (BOA), which divides the overall multi-objective optimization into two steps, weight probe and weight update, within each training iteration. We demonstrate that BOA leads to state-of-the-art results on two human mesh reconstruction benchmarks.

* CVPR 2021; project page: https://sites.google.com/view/humanmeshboa 
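
To clarify the weight-probe/weight-update structure, here is a hedged, first-order sketch of one bilevel iteration: a cloned model takes a temporary step on the 2D loss, and the temporal loss evaluated at those probed weights drives the update of the real weights. Function names, the SGD optimizer, and the first-order approximation are assumptions; BOA's actual formulation may differ.

```python
import copy
import torch
from torch import nn

def boa_step(model, frame, loss_2d, loss_temporal, lr=1e-4):
    # Hypothetical first-order sketch of one bilevel iteration:
    # (1) weight probe: take a temporary gradient step on the 2D loss using a
    #     cloned copy of the model;
    # (2) weight update: evaluate the temporal loss at the probed weights and
    #     apply its gradients back to the original weights.
    probe = copy.deepcopy(model)
    probe_opt = torch.optim.SGD(probe.parameters(), lr=lr)
    probe_opt.zero_grad()
    loss_2d(probe, frame).backward()
    probe_opt.step()                          # lower-level step (probe only)

    probe.zero_grad()
    loss_temporal(probe, frame).backward()
    with torch.no_grad():
        for p, q in zip(model.parameters(), probe.parameters()):
            if q.grad is not None:
                p -= lr * q.grad              # upper-level step on the real weights
    return model

if __name__ == "__main__":
    toy = nn.Linear(8, 3)
    frame = torch.randn(2, 8)
    l2d = lambda m, x: m(x).pow(2).mean()
    ltemp = lambda m, x: m(x).abs().mean()
    boa_step(toy, frame, l2d, ltemp)
```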

Synthesizing Long-Term 3D Human Motion and Interaction in 3D Scenes

Dec 10, 2020
Jiashun Wang, Huazhe Xu, Jingwei Xu, Sifei Liu, Xiaolong Wang

Synthesizing 3D human motion plays an important role in many graphics applications as well as in understanding human activity. While many efforts have been made on generating realistic and natural human motion, most approaches neglect the importance of modeling human-scene interactions and affordance. On the other hand, affordance reasoning (e.g., standing on the floor or sitting on a chair) has mainly been studied with static human poses and gestures, and it has rarely been addressed with human motion. In this paper, we propose to bridge human motion synthesis and scene affordance reasoning. We present a hierarchical generative framework to synthesize long-term 3D human motion conditioned on the 3D scene structure. Building on this framework, we further enforce multiple geometric constraints between the human mesh and the scene point cloud via optimization to improve the realism of the synthesis. Our experiments show significant improvements over previous approaches on generating natural and physically plausible human motion in a scene.
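
A minimal sketch of the hierarchical decomposition described above: the long-range task is split into subgoals, a short clip is generated between consecutive subgoals, and the clips are stitched together. Linear interpolation stands in for the learned short-range generator and memory-bank retrieval, and all names are illustrative.

```python
import numpy as np

def synthesize_long_range(start_pose, goal_pose, n_subgoals=4, clip_len=8):
    # Hypothetical sketch: split the long-range task into subgoals, generate a
    # short clip between consecutive subgoals, then stitch the clips together.
    subgoals = np.linspace(start_pose, goal_pose, n_subgoals + 2)   # includes endpoints
    clips = []
    for a, b in zip(subgoals[:-1], subgoals[1:]):
        clip = np.linspace(a, b, clip_len, endpoint=False)          # short-range "clip"
        clips.append(clip)
    return np.concatenate(clips + [goal_pose[None]], axis=0)        # (T, joint_dim)

motion = synthesize_long_range(np.zeros(51), np.ones(51))           # 17 joints x 3D
print(motion.shape)
```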


Towards Good Practices of U-Net for Traffic Forecasting

Dec 04, 2020
Jingwei Xu, Jianjin Zhang, Zhiyu Yao, Yunbo Wang

This technical report presents a solution for the 2020 Traffic4Cast Challenge. We treat the traffic forecasting problem as a future frame prediction task with relatively weak temporal dependencies (possibly due to stochastic urban traffic dynamics) and strong prior knowledge, i.e., the road maps of the cities. For these reasons, we use a U-Net as the backbone model and propose a roadmap generation method to make the predicted traffic flow more plausible. Meanwhile, we use a fine-tuning strategy based on the validation set to prevent overfitting, which effectively improves the prediction results. At the end of this report, we discuss several approaches that we have considered or that could be explored in future work: (1) harnessing inherent data patterns, such as seasonality; (2) distilling and transferring common knowledge between different cities. We also analyze the validity of the evaluation metric.

* Code is available at https://github.com/ZJianjin/Traffic4cast2020_LDS 
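
As a rough illustration of the roadmap idea, the sketch below derives a binary road mask from historical frames (pixels that never carried traffic are treated as off-road) and uses it to suppress off-road predictions. The thresholding rule and array shapes are assumptions; the report's roadmap generation method may differ.

```python
import numpy as np

def roadmap_mask(history, threshold=0):
    # Pixels that never carried traffic in the historical frames are treated
    # as off-road and excluded from the prediction.
    return (history.sum(axis=0) > threshold).astype(np.float32)     # (H, W) mask

def apply_roadmap(prediction, mask):
    # Suppress predicted traffic off the road network to keep outputs
    # consistent with the city's road map.
    return prediction * mask[None]                                  # broadcast over time

history = np.random.rand(288, 64, 64) * (np.random.rand(64, 64) > 0.7)  # toy city
prediction = np.random.rand(6, 64, 64)                                   # toy U-Net output
masked = apply_roadmap(prediction, roadmap_mask(history))
print(masked.shape)
```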

Hierarchical Style-based Networks for Motion Synthesis

Aug 24, 2020
Jingwei Xu, Huazhe Xu, Bingbing Ni, Xiaokang Yang, Xiaolong Wang, Trevor Darrell

Generating diverse and natural human motion is one of the long-standing goals of creating intelligent characters in the animated world. In this paper, we propose a self-supervised method for generating long-range, diverse, and plausible behaviors that reach a specific goal location. Our method learns to model human motion by decomposing the long-range generation task in a hierarchical manner. Given the starting and ending states, a memory bank is used to retrieve motion references as source material for short-range clip generation. We first propose to explicitly disentangle the provided motion material into style and content components via bilinear transformation modeling, where diverse synthesis is achieved by freely combining these two components. The short-range clips are then connected to form a long-range motion sequence. Without ground-truth annotation, we propose a parameterized bi-directional interpolation scheme to guarantee the physical validity and visual naturalness of the generated results. On a large-scale skeleton dataset, we show that the proposed method is able to synthesize long-range, diverse, and plausible motion, and that it generalizes at test time to unseen motion data. Moreover, we demonstrate that the generated sequences are useful as subgoals for actual physical execution in the animated world.

* ECCV 2020, project page: https://sites.google.com/view/hsnms 
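
To illustrate the bilinear style/content combination, here is a minimal sketch in which a motion code is produced by a bilinear form over a style vector and a content vector, so styles and contents can be recombined freely. The use of nn.Bilinear and all dimensions are illustrative assumptions; the paper's networks are considerably richer.

```python
import torch
from torch import nn

class BilinearStyleContent(nn.Module):
    # Hypothetical sketch of bilinear style/content combination: the motion code
    # is a bilinear form over a style vector and a content vector, so new motions
    # can be synthesized by freely recombining styles and contents.
    def __init__(self, style_dim=16, content_dim=16, out_dim=32):
        super().__init__()
        self.bilinear = nn.Bilinear(style_dim, content_dim, out_dim)

    def forward(self, style, content):
        return self.bilinear(style, content)

model = BilinearStyleContent()
style_a, content_b = torch.randn(1, 16), torch.randn(1, 16)
print(model(style_a, content_b).shape)        # recombined motion code
```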

Video Prediction via Example Guidance

Jul 03, 2020
Jingwei Xu, Huazhe Xu, Bingbing Ni, Xiaokang Yang, Trevor Darrell

In video prediction tasks, one major challenge is to capture the multi-modal nature of future contents and dynamics. In this work, we propose a simple yet effective framework that can efficiently predict plausible future states. The key insight is that the distribution of possible futures for a sequence can be approximated by analogous sequences in the training pool, namely, expert examples. By further incorporating a novel optimization scheme into the training procedure, plausible predictions can be sampled efficiently from the distribution constructed from the retrieved examples. Meanwhile, our method can be seamlessly integrated with existing stochastic predictive models, and comprehensive experiments show significant improvements in both quantitative and qualitative terms. We also demonstrate the ability to generalize to predicting the motion of unseen classes, i.e., without access to the corresponding data during training.

* ICML 2020  
* Project Page: https://sites.google.com/view/vpeg-supp/home 
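
A hedged sketch of the retrieval step behind example guidance: embed the observed sequence, find its nearest neighbors in a pool of training-sequence features, and use those expert examples to shape the distribution future states are sampled from. Euclidean nearest-neighbor search and the feature dimensions are assumptions for illustration.

```python
import numpy as np

def retrieve_examples(query_feat, pool_feats, k=5):
    # Hypothetical sketch of example guidance: retrieve the k training sequences
    # most similar to the observed sequence in feature space; these analogous
    # expert examples shape the distribution from which futures are sampled.
    dists = np.linalg.norm(pool_feats - query_feat, axis=1)
    return np.argsort(dists)[:k]              # indices of analogous expert examples

pool = np.random.randn(1000, 128)             # features of training-pool sequences
query = np.random.randn(128)                  # feature of the observed sequence
print(retrieve_examples(query, pool))
```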