Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jian Xu

Recoverable Compression: A Multimodal Vision Token Recovery Mechanism Guided by Text Information

Sep 02, 2024

Yi Chen, Jian Xu, Xu-Yao Zhang, Wen-Zhuo Liu, Yang-Yang Liu, Cheng-Lin Liu

Figure 1 for Recoverable Compression: A Multimodal Vision Token Recovery Mechanism Guided by Text Information

Figure 2 for Recoverable Compression: A Multimodal Vision Token Recovery Mechanism Guided by Text Information

Figure 3 for Recoverable Compression: A Multimodal Vision Token Recovery Mechanism Guided by Text Information

Figure 4 for Recoverable Compression: A Multimodal Vision Token Recovery Mechanism Guided by Text Information

Abstract:With the advancement of large-scale language modeling techniques, large multimodal models combining visual encoders with large language models have demonstrated exceptional performance in various visual tasks. Most of the current large-scale multimodal models achieve this by mapping visual features obtained from the visual encoder into a large language model and using them as inputs alongside text for downstream tasks. Therefore, the number of visual tokens directly affects the training and inference speed of the model. There has been significant work on token pruning for visual transformers, but for large multimodal models, only relying on visual information for token pruning or compression may lead to significant loss of important information. On the other hand, the textual input in the form of a question may contain valuable information that can aid in answering the question, providing additional knowledge to the model. To address the potential oversimplification and excessive pruning that can occur with most purely visual token pruning methods, we propose a text information-guided dynamic visual token recovery mechanism that does not require training. This mechanism leverages the similarity between the question text and visual tokens to recover visually meaningful tokens with important text information while merging other less important tokens. Experimental results demonstrate that our proposed method achieves comparable performance to the original approach while compressing the visual tokens to an average of 10% of the original quantity. Our source code will be made publicly available following acceptance.

Via

Access Paper or Ask Questions

Multi-view Hand Reconstruction with a Point-Embedded Transformer

Aug 20, 2024

Lixin Yang, Licheng Zhong, Pengxiang Zhu, Xinyu Zhan, Junxiao Kong, Jian Xu, Cewu Lu

Figure 1 for Multi-view Hand Reconstruction with a Point-Embedded Transformer

Figure 2 for Multi-view Hand Reconstruction with a Point-Embedded Transformer

Figure 3 for Multi-view Hand Reconstruction with a Point-Embedded Transformer

Figure 4 for Multi-view Hand Reconstruction with a Point-Embedded Transformer

Abstract:This work introduces a novel and generalizable multi-view Hand Mesh Reconstruction (HMR) model, named POEM, designed for practical use in real-world hand motion capture scenarios. The advances of the POEM model consist of two main aspects. First, concerning the modeling of the problem, we propose embedding a static basis point within the multi-view stereo space. A point represents a natural form of 3D information and serves as an ideal medium for fusing features across different views, given its varied projections across these views. Consequently, our method harnesses a simple yet effective idea: a complex 3D hand mesh can be represented by a set of 3D basis points that 1) are embedded in the multi-view stereo, 2) carry features from the multi-view images, and 3) encompass the hand in it. The second advance lies in the training strategy. We utilize a combination of five large-scale multi-view datasets and employ randomization in the number, order, and poses of the cameras. By processing such a vast amount of data and a diverse array of camera configurations, our model demonstrates notable generalizability in the real-world applications. As a result, POEM presents a highly practical, plug-and-play solution that enables user-friendly, cost-effective multi-view motion capture for both left and right hands. The model and source codes are available at https://github.com/JubSteven/POEM-v2.

* Generalizable multi-view Hand Mesh Reconstruction (HMR) model. Extension of the original work at CVPR2023

Via

Access Paper or Ask Questions

StylePrompter: Enhancing Domain Generalization with Test-Time Style Priors

Aug 17, 2024

Jiao Zhang, Jian Xu, Xu-Yao Zhang, Cheng-Lin Liu

Figure 1 for StylePrompter: Enhancing Domain Generalization with Test-Time Style Priors

Figure 2 for StylePrompter: Enhancing Domain Generalization with Test-Time Style Priors

Figure 3 for StylePrompter: Enhancing Domain Generalization with Test-Time Style Priors

Figure 4 for StylePrompter: Enhancing Domain Generalization with Test-Time Style Priors

Abstract:In real-world applications, the sample distribution at the inference stage often differs from the one at the training stage, causing performance degradation of trained deep models. The research on domain generalization (DG) aims to develop robust algorithms that can improve the generalized performance in unseen domains by training on a few domains. However, the domain-agnostic vision model, trained on a limited number of domains using traditional domain generalization methods, cannot guarantee its effectiveness in dealing with unseen domains. The introduction of language can break the closed cognition space of the vision model, providing additional semantic information that cannot be inferred from vision-only datasets. In this paper, we propose to overcome the challenge in previous DG methods by introducing the style prompt in the language modality to adapt the trained model dynamically. In particular, we train a style prompter to extract style information of the current image into an embedding in the token embedding space and place it in front of the candidate category words as prior knowledge to prompt the model. Our open space partition of the style token embedding space and the hand-crafted style regularization enable the trained style prompter to handle data from unknown domains effectively. Extensive experiments verify the effectiveness of our method and demonstrate state-of-the-art performances on multiple public datasets. Codes will be available after the acceptance of this paper.

Via

Access Paper or Ask Questions

Variational Learning of Gaussian Process Latent Variable Models through Stochastic Gradient Annealed Importance Sampling

Aug 13, 2024

Jian Xu, Shian Du, Junmei Yang, Qianli Ma, Delu Zeng

Figure 1 for Variational Learning of Gaussian Process Latent Variable Models through Stochastic Gradient Annealed Importance Sampling

Figure 2 for Variational Learning of Gaussian Process Latent Variable Models through Stochastic Gradient Annealed Importance Sampling

Figure 3 for Variational Learning of Gaussian Process Latent Variable Models through Stochastic Gradient Annealed Importance Sampling

Figure 4 for Variational Learning of Gaussian Process Latent Variable Models through Stochastic Gradient Annealed Importance Sampling

Abstract:Gaussian Process Latent Variable Models (GPLVMs) have become increasingly popular for unsupervised tasks such as dimensionality reduction and missing data recovery due to their flexibility and non-linear nature. An importance-weighted version of the Bayesian GPLVMs has been proposed to obtain a tighter variational bound. However, this version of the approach is primarily limited to analyzing simple data structures, as the generation of an effective proposal distribution can become quite challenging in high-dimensional spaces or with complex data sets. In this work, we propose an Annealed Importance Sampling (AIS) approach to address these issues. By transforming the posterior into a sequence of intermediate distributions using annealing, we combine the strengths of Sequential Monte Carlo samplers and VI to explore a wider range of posterior distributions and gradually approach the target distribution. We further propose an efficient algorithm by reparameterizing all variables in the evidence lower bound (ELBO). Experimental results on both toy and image datasets demonstrate that our method outperforms state-of-the-art methods in terms of tighter variational bounds, higher log-likelihoods, and more robust convergence.

Via

Access Paper or Ask Questions

Information Geometry and Beta Link for Optimizing Sparse Variational Student-t Processes

Aug 13, 2024

Jian Xu, Delu Zeng, John Paisley

Figure 1 for Information Geometry and Beta Link for Optimizing Sparse Variational Student-t Processes

Figure 2 for Information Geometry and Beta Link for Optimizing Sparse Variational Student-t Processes

Abstract:Recently, a sparse version of Student-t Processes, termed sparse variational Student-t Processes, has been proposed to enhance computational efficiency and flexibility for real-world datasets using stochastic gradient descent. However, traditional gradient descent methods like Adam may not fully exploit the parameter space geometry, potentially leading to slower convergence and suboptimal performance. To mitigate these issues, we adopt natural gradient methods from information geometry for variational parameter optimization of Student-t Processes. This approach leverages the curvature and structure of the parameter space, utilizing tools such as the Fisher information matrix which is linked to the Beta function in our model. This method provides robust mathematical support for the natural gradient algorithm when using Student's t-distribution as the variational distribution. Additionally, we present a mini-batch algorithm for efficiently computing natural gradients. Experimental results across four benchmark datasets demonstrate that our method consistently accelerates convergence speed.

Via

Access Paper or Ask Questions

Fully Bayesian Differential Gaussian Processes through Stochastic Differential Equations

Aug 12, 2024

Jian Xu, Zhiqi Lin, Min Chen, Junmei Yang, Delu Zeng, John Paisley

Figure 1 for Fully Bayesian Differential Gaussian Processes through Stochastic Differential Equations

Figure 2 for Fully Bayesian Differential Gaussian Processes through Stochastic Differential Equations

Figure 3 for Fully Bayesian Differential Gaussian Processes through Stochastic Differential Equations

Figure 4 for Fully Bayesian Differential Gaussian Processes through Stochastic Differential Equations

Abstract:Traditional deep Gaussian processes model the data evolution using a discrete hierarchy, whereas differential Gaussian processes (DIFFGPs) represent the evolution as an infinitely deep Gaussian process. However, prior DIFFGP methods often overlook the uncertainty of kernel hyperparameters and assume them to be fixed and time-invariant, failing to leverage the unique synergy between continuous-time models and approximate inference. In this work, we propose a fully Bayesian approach that treats the kernel hyperparameters as random variables and constructs coupled stochastic differential equations (SDEs) to learn their posterior distribution and that of inducing points. By incorporating estimation uncertainty on hyperparameters, our method enhances the model's flexibility and adaptability to complex dynamics. Additionally, our approach provides a time-varying, comprehensive, and realistic posterior approximation through coupling variables using SDE methods. Experimental results demonstrate the advantages of our method over traditional approaches, showcasing its superior performance in terms of flexibility, accuracy, and other metrics. Our work opens up exciting research avenues for advancing Bayesian inference and offers a powerful modeling tool for continuous-time Gaussian processes.

Via

Access Paper or Ask Questions

Flexible Bayesian Last Layer Models Using Implicit Priors and Diffusion Posterior Sampling

Aug 07, 2024

Jian Xu, Zhiqi Lin, Shigui Li, Min Chen, Junmei Yang, Delu Zeng, John Paisley

Figure 1 for Flexible Bayesian Last Layer Models Using Implicit Priors and Diffusion Posterior Sampling

Figure 2 for Flexible Bayesian Last Layer Models Using Implicit Priors and Diffusion Posterior Sampling

Figure 3 for Flexible Bayesian Last Layer Models Using Implicit Priors and Diffusion Posterior Sampling

Figure 4 for Flexible Bayesian Last Layer Models Using Implicit Priors and Diffusion Posterior Sampling

Abstract:Bayesian Last Layer (BLL) models focus solely on uncertainty in the output layer of neural networks, demonstrating comparable performance to more complex Bayesian models. However, the use of Gaussian priors for last layer weights in Bayesian Last Layer (BLL) models limits their expressive capacity when faced with non-Gaussian, outlier-rich, or high-dimensional datasets. To address this shortfall, we introduce a novel approach that combines diffusion techniques and implicit priors for variational learning of Bayesian last layer weights. This method leverages implicit distributions for modeling weight priors in BLL, coupled with diffusion samplers for approximating true posterior predictions, thereby establishing a comprehensive Bayesian prior and posterior estimation strategy. By delivering an explicit and computationally efficient variational lower bound, our method aims to augment the expressive abilities of BLL models, enhancing model accuracy, calibration, and out-of-distribution detection proficiency. Through detailed exploration and experimental validation, We showcase the method's potential for improving predictive accuracy and uncertainty quantification while ensuring computational efficiency.

Via

Access Paper or Ask Questions

Enhancing Taobao Display Advertising with Multimodal Representations: Challenges, Approaches and Insights

Jul 28, 2024

Xiang-Rong Sheng, Feifan Yang, Litong Gong, Biao Wang, Zhangming Chan, Yujing Zhang, Yueyao Cheng, Yong-Nan Zhu, Tiezheng Ge, Han Zhu(+3 more)

Figure 1 for Enhancing Taobao Display Advertising with Multimodal Representations: Challenges, Approaches and Insights

Figure 2 for Enhancing Taobao Display Advertising with Multimodal Representations: Challenges, Approaches and Insights

Figure 3 for Enhancing Taobao Display Advertising with Multimodal Representations: Challenges, Approaches and Insights

Figure 4 for Enhancing Taobao Display Advertising with Multimodal Representations: Challenges, Approaches and Insights

Abstract:Despite the recognized potential of multimodal data to improve model accuracy, many large-scale industrial recommendation systems, including Taobao display advertising system, predominantly depend on sparse ID features in their models. In this work, we explore approaches to leverage multimodal data to enhance the recommendation accuracy. We start from identifying the key challenges in adopting multimodal data in a manner that is both effective and cost-efficient for industrial systems. To address these challenges, we introduce a two-phase framework, including: 1) the pre-training of multimodal representations to capture semantic similarity, and 2) the integration of these representations with existing ID-based models. Furthermore, we detail the architecture of our production system, which is designed to facilitate the deployment of multimodal representations. Since the integration of multimodal representations in mid-2023, we have observed significant performance improvements in Taobao display advertising system. We believe that the insights we have gathered will serve as a valuable resource for practitioners seeking to leverage multimodal data in their systems.

* Accepted at CIKM 2024

Via

Access Paper or Ask Questions

Sparse Inducing Points in Deep Gaussian Processes: Enhancing Modeling with Denoising Diffusion Variational Inference

Jul 24, 2024

Jian Xu, Delu Zeng, John Paisley

Figure 1 for Sparse Inducing Points in Deep Gaussian Processes: Enhancing Modeling with Denoising Diffusion Variational Inference

Figure 2 for Sparse Inducing Points in Deep Gaussian Processes: Enhancing Modeling with Denoising Diffusion Variational Inference

Figure 3 for Sparse Inducing Points in Deep Gaussian Processes: Enhancing Modeling with Denoising Diffusion Variational Inference

Figure 4 for Sparse Inducing Points in Deep Gaussian Processes: Enhancing Modeling with Denoising Diffusion Variational Inference

Abstract:Deep Gaussian processes (DGPs) provide a robust paradigm for Bayesian deep learning. In DGPs, a set of sparse integration locations called inducing points are selected to approximate the posterior distribution of the model. This is done to reduce computational complexity and improve model efficiency. However, inferring the posterior distribution of inducing points is not straightforward. Traditional variational inference approaches to posterior approximation often lead to significant bias. To address this issue, we propose an alternative method called Denoising Diffusion Variational Inference (DDVI) that uses a denoising diffusion stochastic differential equation (SDE) to generate posterior samples of inducing variables. We rely on score matching methods for denoising diffusion model to approximate score functions with a neural network. Furthermore, by combining classical mathematical theory of SDEs with the minimization of KL divergence between the approximate and true processes, we propose a novel explicit variational lower bound for the marginal likelihood function of DGP. Through experiments on various datasets and comparisons with baseline methods, we empirically demonstrate the effectiveness of DDVI for posterior inference of inducing points for DGP models.

Via

Access Paper or Ask Questions

COALA: A Practical and Vision-Centric Federated Learning Platform

Jul 23, 2024

Weiming Zhuang, Jian Xu, Chen Chen, Jingtao Li, Lingjuan Lyu

Figure 1 for COALA: A Practical and Vision-Centric Federated Learning Platform

Figure 2 for COALA: A Practical and Vision-Centric Federated Learning Platform

Figure 3 for COALA: A Practical and Vision-Centric Federated Learning Platform

Figure 4 for COALA: A Practical and Vision-Centric Federated Learning Platform

Abstract:We present COALA, a vision-centric Federated Learning (FL) platform, and a suite of benchmarks for practical FL scenarios, which we categorize into three levels: task, data, and model. At the task level, COALA extends support from simple classification to 15 computer vision tasks, including object detection, segmentation, pose estimation, and more. It also facilitates federated multiple-task learning, allowing clients to tackle multiple tasks simultaneously. At the data level, COALA goes beyond supervised FL to benchmark both semi-supervised FL and unsupervised FL. It also benchmarks feature distribution shifts other than commonly considered label distribution shifts. In addition to dealing with static data, it supports federated continual learning for continuously changing data in real-world scenarios. At the model level, COALA benchmarks FL with split models and different models in different clients. COALA platform offers three degrees of customization for these practical FL scenarios, including configuration customization, components customization, and workflow customization. We conduct systematic benchmarking experiments for the practical FL scenarios and highlight potential opportunities for further advancements in FL. Codes are open sourced at https://github.com/SonyResearch/COALA.

* ICML'24

Via

Access Paper or Ask Questions