Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Bing Li

Canon Medical Systems

SAM-Guided Masked Token Prediction for 3D Scene Understanding

Oct 17, 2024

Zhimin Chen, Liang Yang, Yingwei Li, Longlong Jing, Bing Li

Figure 1 for SAM-Guided Masked Token Prediction for 3D Scene Understanding

Figure 2 for SAM-Guided Masked Token Prediction for 3D Scene Understanding

Figure 3 for SAM-Guided Masked Token Prediction for 3D Scene Understanding

Figure 4 for SAM-Guided Masked Token Prediction for 3D Scene Understanding

Abstract:Foundation models have significantly enhanced 2D task performance, and recent works like Bridge3D have successfully applied these models to improve 3D scene understanding through knowledge distillation, marking considerable advancements. Nonetheless, challenges such as the misalignment between 2D and 3D representations and the persistent long-tail distribution in 3D datasets still restrict the effectiveness of knowledge distillation from 2D to 3D using foundation models. To tackle these issues, we introduce a novel SAM-guided tokenization method that seamlessly aligns 3D transformer structures with region-level knowledge distillation, replacing the traditional KNN-based tokenization techniques. Additionally, we implement a group-balanced re-weighting strategy to effectively address the long-tail problem in knowledge distillation. Furthermore, inspired by the recent success of masked feature prediction, our framework incorporates a two-stage masked token prediction process in which the student model predicts both the global embeddings and the token-wise local embeddings derived from the teacher models trained in the first stage. Our methodology has been validated across multiple datasets, including SUN RGB-D, ScanNet, and S3DIS, for tasks like 3D object detection and semantic segmentation. The results demonstrate significant improvements over current State-of-the-art self-supervised methods, establishing new benchmarks in this field.

* Accepted by NeurIPS 2024

Via

Access Paper or Ask Questions

SynFER: Towards Boosting Facial Expression Recognition with Synthetic Data

Oct 13, 2024

Xilin He, Cheng Luo, Xiaole Xian, Bing Li, Siyang Song, Muhammad Haris Khan, Weicheng Xie, Linlin Shen, Zongyuan Ge

Figure 1 for SynFER: Towards Boosting Facial Expression Recognition with Synthetic Data

Figure 2 for SynFER: Towards Boosting Facial Expression Recognition with Synthetic Data

Figure 3 for SynFER: Towards Boosting Facial Expression Recognition with Synthetic Data

Figure 4 for SynFER: Towards Boosting Facial Expression Recognition with Synthetic Data

Abstract:Facial expression datasets remain limited in scale due to privacy concerns, the subjectivity of annotations, and the labor-intensive nature of data collection. This limitation poses a significant challenge for developing modern deep learning-based facial expression analysis models, particularly foundation models, that rely on large-scale data for optimal performance. To tackle the overarching and complex challenge, we introduce SynFER (Synthesis of Facial Expressions with Refined Control), a novel framework for synthesizing facial expression image data based on high-level textual descriptions as well as more fine-grained and precise control through facial action units. To ensure the quality and reliability of the synthetic data, we propose a semantic guidance technique to steer the generation process and a pseudo-label generator to help rectify the facial expression labels for the synthetic images. To demonstrate the generation fidelity and the effectiveness of the synthetic data from SynFER, we conduct extensive experiments on representation learning using both synthetic data and real-world data. Experiment results validate the efficacy of the proposed approach and the synthetic data. Notably, our approach achieves a 67.23% classification accuracy on AffectNet when training solely with synthetic data equivalent to the AffectNet training set size, which increases to 69.84% when scaling up to five times the original size. Our code will be made publicly available.

Via

Access Paper or Ask Questions

Basis Sharing: Cross-Layer Parameter Sharing for Large Language Model Compression

Oct 02, 2024

Jingcun Wang, Yu-Guang Chen, Ing-Chao Lin, Bing Li, Grace Li Zhang

Figure 1 for Basis Sharing: Cross-Layer Parameter Sharing for Large Language Model Compression

Figure 2 for Basis Sharing: Cross-Layer Parameter Sharing for Large Language Model Compression

Figure 3 for Basis Sharing: Cross-Layer Parameter Sharing for Large Language Model Compression

Figure 4 for Basis Sharing: Cross-Layer Parameter Sharing for Large Language Model Compression

Abstract:Large Language Models (LLMs) have achieved remarkable breakthroughs. However, the huge number of parameters in LLMs require significant amount of memory storage in inference, which prevents their practical deployment in many applications. To reduce memory storage of LLMs, singular value decomposition (SVD) provides a promising solution to approximate weight matrices for compressing LLMs. In this paper, we take a step further to explore parameter sharing across different layers with SVD to achieve more effective compression for LLMs. Specifically, weight matrices in different layers are decomposed and represented as a linear combination of a set of shared basis vectors and unique coefficients. The types of weight matrices and the layer selection for basis sharing are examined when compressing LLMs to maintain the performance. Comprehensive experiments demonstrate that Basis Sharing outperforms state-of-the-art SVD-based compression approaches and parameter sharing techniques, especially under large compression ratios. Code is available at: https://github.com/TUDa-HWAI/Basis_Sharing

Via

Access Paper or Ask Questions

Token Caching for Diffusion Transformer Acceleration

Sep 27, 2024

Jinming Lou, Wenyang Luo, Yufan Liu, Bing Li, Xinmiao Ding, Weiming Hu, Jiajiong Cao, Yuming Li, Chenguang Ma

Figure 1 for Token Caching for Diffusion Transformer Acceleration

Figure 2 for Token Caching for Diffusion Transformer Acceleration

Figure 3 for Token Caching for Diffusion Transformer Acceleration

Figure 4 for Token Caching for Diffusion Transformer Acceleration

Abstract:Diffusion transformers have gained substantial interest in diffusion generative modeling due to their outstanding performance. However, their high computational cost, arising from the quadratic computational complexity of attention mechanisms and multi-step inference, presents a significant bottleneck. To address this challenge, we propose TokenCache, a novel post-training acceleration method that leverages the token-based multi-block architecture of transformers to reduce redundant computations among tokens across inference steps. TokenCache specifically addresses three critical questions in the context of diffusion transformers: (1) which tokens should be pruned to eliminate redundancy, (2) which blocks should be targeted for efficient pruning, and (3) at which time steps caching should be applied to balance speed and quality. In response to these challenges, TokenCache introduces a Cache Predictor that assigns importance scores to tokens, enabling selective pruning without compromising model performance. Furthermore, we propose an adaptive block selection strategy to focus on blocks with minimal impact on the network's output, along with a Two-Phase Round-Robin (TPRR) scheduling policy to optimize caching intervals throughout the denoising process. Experimental results across various models demonstrate that TokenCache achieves an effective trade-off between generation quality and inference speed for diffusion transformers. Our code will be publicly available.

Via

Access Paper or Ask Questions

MobileIQA: Exploiting Mobile-level Diverse Opinion Network For No-Reference Image Quality Assessment Using Knowledge Distillation

Sep 02, 2024

Zewen Chen, Sunhan Xu, Yun Zeng, Haochen Guo, Jian Guo, Shuai Liu, Juan Wang, Bing Li, Weiming Hu, Dehua Liu(+1 more)

Abstract:With the rising demand for high-resolution (HR) images, No-Reference Image Quality Assessment (NR-IQA) gains more attention, as it can ecaluate image quality in real-time on mobile devices and enhance user experience. However, existing NR-IQA methods often resize or crop the HR images into small resolution, which leads to a loss of important details. And most of them are of high computational complexity, which hinders their application on mobile devices due to limited computational resources. To address these challenges, we propose MobileIQA, a novel approach that utilizes lightweight backbones to efficiently assess image quality while preserving image details through high-resolution input. MobileIQA employs the proposed multi-view attention learning (MAL) module to capture diverse opinions, simulating subjective opinions provided by different annotators during the dataset annotation process. The model uses a teacher model to guide the learning of a student model through knowledge distillation. This method significantly reduces computational complexity while maintaining high performance. Experiments demonstrate that MobileIQA outperforms novel IQA methods on evaluation metrics and computational efficiency. The code is available at https://github.com/chencn2020/MobileIQA.

* Accepted by ECCV Workshop 2024

Via

Access Paper or Ask Questions

TrackNeRF: Bundle Adjusting NeRF from Sparse and Noisy Views via Feature Tracks

Aug 20, 2024

Jinjie Mai, Wenxuan Zhu, Sara Rojas, Jesus Zarzar, Abdullah Hamdi, Guocheng Qian, Bing Li, Silvio Giancola, Bernard Ghanem

Figure 1 for TrackNeRF: Bundle Adjusting NeRF from Sparse and Noisy Views via Feature Tracks

Figure 2 for TrackNeRF: Bundle Adjusting NeRF from Sparse and Noisy Views via Feature Tracks

Figure 3 for TrackNeRF: Bundle Adjusting NeRF from Sparse and Noisy Views via Feature Tracks

Figure 4 for TrackNeRF: Bundle Adjusting NeRF from Sparse and Noisy Views via Feature Tracks

Abstract:Neural radiance fields (NeRFs) generally require many images with accurate poses for accurate novel view synthesis, which does not reflect realistic setups where views can be sparse and poses can be noisy. Previous solutions for learning NeRFs with sparse views and noisy poses only consider local geometry consistency with pairs of views. Closely following \textit{bundle adjustment} in Structure-from-Motion (SfM), we introduce TrackNeRF for more globally consistent geometry reconstruction and more accurate pose optimization. TrackNeRF introduces \textit{feature tracks}, \ie connected pixel trajectories across \textit{all} visible views that correspond to the \textit{same} 3D points. By enforcing reprojection consistency among feature tracks, TrackNeRF encourages holistic 3D consistency explicitly. Through extensive experiments, TrackNeRF sets a new benchmark in noisy and sparse view reconstruction. In particular, TrackNeRF shows significant improvements over the state-of-the-art BARF and SPARF by $\sim8$ and $\sim1$ in terms of PSNR on DTU under various sparse and noisy view setups. The code is available at \href{https://tracknerf.github.io/}.

* ECCV 2024 (supplemental pages included)

Via

Access Paper or Ask Questions

SuperEncoder: Towards Universal Neural Approximate Quantum State Preparation

Aug 10, 2024

Yilun Zhao, Bingmeng Wang, Wenle Jiang, Xiwei Pan, Bing Li, Yinhe Han, Ying Wang

Abstract:Numerous quantum algorithms operate under the assumption that classical data has already been converted into quantum states, a process termed Quantum State Preparation (QSP). However, achieving precise QSP requires a circuit depth that scales exponentially with the number of qubits, making it a substantial obstacle in harnessing quantum advantage. Recent research suggests using a Parameterized Quantum Circuit (PQC) to approximate a target state, offering a more scalable solution with reduced circuit depth compared to precise QSP. Despite this, the need for iterative updates of circuit parameters results in a lengthy runtime, limiting its practical application. In this work, we demonstrate that it is possible to leverage a pre-trained neural network to directly generate the QSP circuit for arbitrary quantum state, thereby eliminating the significant overhead of online iterations. Our study makes a steady step towards a universal neural designer for approximate QSP.

Via

Access Paper or Ask Questions

MIBench: Evaluating Multimodal Large Language Models over Multiple Images

Jul 21, 2024

Haowei Liu, Xi Zhang, Haiyang Xu, Yaya Shi, Chaoya Jiang, Ming Yan, Ji Zhang, Fei Huang, Chunfeng Yuan, Bing Li(+1 more)

Figure 1 for MIBench: Evaluating Multimodal Large Language Models over Multiple Images

Figure 2 for MIBench: Evaluating Multimodal Large Language Models over Multiple Images

Figure 3 for MIBench: Evaluating Multimodal Large Language Models over Multiple Images

Figure 4 for MIBench: Evaluating Multimodal Large Language Models over Multiple Images

Abstract:Built on the power of LLMs, numerous multimodal large language models (MLLMs) have recently achieved remarkable performance on various vision-language tasks across multiple benchmarks. However, most existing MLLMs and benchmarks primarily focus on single-image input scenarios, leaving the performance of MLLMs when handling realistic multiple images remain underexplored. Although a few benchmarks consider multiple images, their evaluation dimensions and samples are very limited. Therefore, in this paper, we propose a new benchmark MIBench, to comprehensively evaluate fine-grained abilities of MLLMs in multi-image scenarios. Specifically, MIBench categorizes the multi-image abilities into three scenarios: multi-image instruction (MII), multimodal knowledge-seeking (MKS) and multimodal in-context learning (MIC), and constructs 13 tasks with a total of 13K annotated samples. During data construction, for MII and MKS, we extract correct options from manual annotations and create challenging distractors to obtain multiple-choice questions. For MIC, to enable an in-depth evaluation, we set four sub-tasks and transform the original datasets into in-context learning formats. We evaluate several open-source MLLMs and close-source MLLMs on the proposed MIBench. The results reveal that although current models excel in single-image tasks, they exhibit significant shortcomings when faced with multi-image inputs, such as confused fine-grained perception, limited multi-image reasoning, and unstable in-context learning. The annotated data in MIBench is available at https://huggingface.co/datasets/StarBottle/MIBench.

* 10 pages, 4 figures

Via

Access Paper or Ask Questions

How to Make Cross Encoder a Good Teacher for Efficient Image-Text Retrieval?

Jul 10, 2024

Yuxin Chen, Zongyang Ma, Ziqi Zhang, Zhongang Qi, Chunfeng Yuan, Bing Li, Junfu Pu, Ying Shan, Xiaojuan Qi, Weiming Hu

Figure 1 for How to Make Cross Encoder a Good Teacher for Efficient Image-Text Retrieval?

Figure 2 for How to Make Cross Encoder a Good Teacher for Efficient Image-Text Retrieval?

Figure 3 for How to Make Cross Encoder a Good Teacher for Efficient Image-Text Retrieval?

Figure 4 for How to Make Cross Encoder a Good Teacher for Efficient Image-Text Retrieval?

Abstract:Dominant dual-encoder models enable efficient image-text retrieval but suffer from limited accuracy while the cross-encoder models offer higher accuracy at the expense of efficiency. Distilling cross-modality matching knowledge from cross-encoder to dual-encoder provides a natural approach to harness their strengths. Thus we investigate the following valuable question: how to make cross-encoder a good teacher for dual-encoder? Our findings are threefold:(1) Cross-modal similarity score distribution of cross-encoder is more concentrated while the result of dual-encoder is nearly normal making vanilla logit distillation less effective. However ranking distillation remains practical as it is not affected by the score distribution.(2) Only the relative order between hard negatives conveys valid knowledge while the order information between easy negatives has little significance.(3) Maintaining the coordination between distillation loss and dual-encoder training loss is beneficial for knowledge transfer. Based on these findings we propose a novel Contrastive Partial Ranking Distillation (CPRD) method which implements the objective of mimicking relative order between hard negative samples with contrastive learning. This approach coordinates with the training of the dual-encoder effectively transferring valid knowledge from the cross-encoder to the dual-encoder. Extensive experiments on image-text retrieval and ranking tasks show that our method surpasses other distillation methods and significantly improves the accuracy of dual-encoder.

* Accepted by CVPR 2024

Via

Access Paper or Ask Questions

EA-VTR: Event-Aware Video-Text Retrieval

Jul 10, 2024

Zongyang Ma, Ziqi Zhang, Yuxin Chen, Zhongang Qi, Chunfeng Yuan, Bing Li, Yingmin Luo, Xu Li, Xiaojuan Qi, Ying Shan(+1 more)

Figure 1 for EA-VTR: Event-Aware Video-Text Retrieval

Figure 2 for EA-VTR: Event-Aware Video-Text Retrieval

Figure 3 for EA-VTR: Event-Aware Video-Text Retrieval

Figure 4 for EA-VTR: Event-Aware Video-Text Retrieval

Abstract:Understanding the content of events occurring in the video and their inherent temporal logic is crucial for video-text retrieval. However, web-crawled pre-training datasets often lack sufficient event information, and the widely adopted video-level cross-modal contrastive learning also struggles to capture detailed and complex video-text event alignment. To address these challenges, we make improvements from both data and model perspectives. In terms of pre-training data, we focus on supplementing the missing specific event content and event temporal transitions with the proposed event augmentation strategies. Based on the event-augmented data, we construct a novel Event-Aware Video-Text Retrieval model, ie, EA-VTR, which achieves powerful video-text retrieval ability through superior video event awareness. EA-VTR can efficiently encode frame-level and video-level visual representations simultaneously, enabling detailed event content and complex event temporal cross-modal alignment, ultimately enhancing the comprehensive understanding of video events. Our method not only significantly outperforms existing approaches on multiple datasets for Text-to-Video Retrieval and Video Action Recognition tasks, but also demonstrates superior event content perceive ability on Multi-event Video-Text Retrieval and Video Moment Retrieval tasks, as well as outstanding event temporal logic understanding ability on Test of Time task.

* Accepted by ECCV 2024

Via

Access Paper or Ask Questions