Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yun Fu

Accessing Vision Foundation Models at ImageNet-level Costs

Jul 15, 2024

Yitian Zhang, Xu Ma, Yue Bai, Huan Wang, Yun Fu

Figure 1 for Accessing Vision Foundation Models at ImageNet-level Costs

Figure 2 for Accessing Vision Foundation Models at ImageNet-level Costs

Figure 3 for Accessing Vision Foundation Models at ImageNet-level Costs

Figure 4 for Accessing Vision Foundation Models at ImageNet-level Costs

Abstract:Vision foundation models are renowned for their generalization ability due to massive training data. Nevertheless, they demand tremendous training resources, and the training data is often inaccessible, e.g., CLIP, DINOv2, posing great challenges to developing derivatives that could advance research in this field. In this work, we offer a very simple and general solution, named Proteus, to distill foundation models into smaller equivalents on ImageNet-1K without access to the original training data. Specifically, we remove the designs from conventional knowledge distillation settings that result in dataset bias and present three levels of training objectives, i.e., token, patch, and feature, to maximize the efficacy of knowledge transfer. In this manner, Proteus is trained at ImageNet-level costs with surprising ability, facilitating the accessibility of training foundation models for the broader research community. Leveraging DINOv2-g/14 as the teacher, Proteus-L/14 matches the performance of the Oracle method DINOv2-L/14 (142M training data) across 15 benchmarks and outperforms other vision foundation models including CLIP-L/14 (400M), OpenCLIP-L/14 (400M/2B) and SynCLR-L/14 (600M).

Via

Access Paper or Ask Questions

SoupLM: Model Integration in Large Language and Multi-Modal Models

Jul 11, 2024

Yue Bai, Zichen Zhang, Jiasen Lu, Yun Fu

Figure 1 for SoupLM: Model Integration in Large Language and Multi-Modal Models

Figure 2 for SoupLM: Model Integration in Large Language and Multi-Modal Models

Figure 3 for SoupLM: Model Integration in Large Language and Multi-Modal Models

Figure 4 for SoupLM: Model Integration in Large Language and Multi-Modal Models

Abstract:Training large language models (LLMs) and multimodal LLMs necessitates significant computing resources, and existing publicly available LLMs are typically pre-trained on diverse, privately curated datasets spanning various tasks. For instance, LLaMA, Vicuna, and LLaVA are three LLM variants trained with LLaMA base models using very different training recipes, tasks, and data modalities. The training cost and complexity for such LLM variants grow rapidly. In this study, we propose to use a soup strategy to assemble these LLM variants into a single well-generalized multimodal LLM (SoupLM) in a cost-efficient manner. Assembling these LLM variants efficiently brings knowledge and specialities trained from different domains and data modalities into an integrated one (e.g., chatbot speciality from user-shared conversations for Vicuna, and visual capacity from vision-language data for LLaVA), therefore, to avoid computing costs of repetitive training on several different domains. We propose series of soup strategies to systematically benchmark performance gains across various configurations, and probe the soup behavior across base models in the interpolation space.

Via

Access Paper or Ask Questions

Through the Theory of Mind's Eye: Reading Minds with Multimodal Video Large Language Models

Jun 19, 2024

Zhawnen Chen, Tianchun Wang, Yizhou Wang, Michal Kosinski, Xiang Zhang, Yun Fu, Sheng Li

Figure 1 for Through the Theory of Mind's Eye: Reading Minds with Multimodal Video Large Language Models

Figure 2 for Through the Theory of Mind's Eye: Reading Minds with Multimodal Video Large Language Models

Figure 3 for Through the Theory of Mind's Eye: Reading Minds with Multimodal Video Large Language Models

Figure 4 for Through the Theory of Mind's Eye: Reading Minds with Multimodal Video Large Language Models

Abstract:Can large multimodal models have a human-like ability for emotional and social reasoning, and if so, how does it work? Recent research has discovered emergent theory-of-mind (ToM) reasoning capabilities in large language models (LLMs). LLMs can reason about people's mental states by solving various text-based ToM tasks that ask questions about the actors' ToM (e.g., human belief, desire, intention). However, human reasoning in the wild is often grounded in dynamic scenes across time. Thus, we consider videos a new medium for examining spatio-temporal ToM reasoning ability. Specifically, we ask explicit probing questions about videos with abundant social and emotional reasoning content. We develop a pipeline for multimodal LLM for ToM reasoning using video and text. We also enable explicit ToM reasoning by retrieving key frames for answering a ToM question, which reveals how multimodal LLMs reason about ToM.

Via

Access Paper or Ask Questions

Deciphering Movement: Unified Trajectory Generation Model for Multi-Agent

May 27, 2024

Yi Xu, Yun Fu

Figure 1 for Deciphering Movement: Unified Trajectory Generation Model for Multi-Agent

Figure 2 for Deciphering Movement: Unified Trajectory Generation Model for Multi-Agent

Figure 3 for Deciphering Movement: Unified Trajectory Generation Model for Multi-Agent

Figure 4 for Deciphering Movement: Unified Trajectory Generation Model for Multi-Agent

Abstract:Understanding multi-agent behavior is critical across various fields. The conventional approach involves analyzing agent movements through three primary tasks: trajectory prediction, imputation, and spatial-temporal recovery. Considering the unique input formulation and constraint of these tasks, most existing methods are tailored to address only one specific task. However, in real-world applications, these scenarios frequently occur simultaneously. Consequently, methods designed for one task often fail to adapt to others, resulting in performance drops. To overcome this limitation, we propose a Unified Trajectory Generation model, UniTraj, that processes arbitrary trajectories as masked inputs, adaptable to diverse scenarios. Specifically, we introduce a Ghost Spatial Masking (GSM) module embedded within a Transformer encoder for spatial feature extraction. We further extend recent successful State Space Models (SSMs), particularly the Mamba model, into a Bidirectional Temporal Mamba to effectively capture temporal dependencies. Additionally, we incorporate a Bidirectional Temporal Scaled (BTS) module to comprehensively scan trajectories while maintaining the temporal missing relationships within the sequence. We curate and benchmark three practical sports game datasets, Basketball-U, Football-U, and Soccer-U, for evaluation. Extensive experiments demonstrate the superior performance of our model. To the best of our knowledge, this is the first work that addresses this unified problem through a versatile generative framework, thereby enhancing our understanding of multi-agent movement. Our datasets, code, and model weights are available at https://github.com/colorfulfuture/UniTraj-pytorch.

* Datasets, code, and model weights at available at: https://github.com/colorfulfuture/UniTraj-pytorch

Via

Access Paper or Ask Questions

Consistency and Uncertainty: Identifying Unreliable Responses From Black-Box Vision-Language Models for Selective Visual Question Answering

Apr 16, 2024

Zaid Khan, Yun Fu

Figure 1 for Consistency and Uncertainty: Identifying Unreliable Responses From Black-Box Vision-Language Models for Selective Visual Question Answering

Figure 2 for Consistency and Uncertainty: Identifying Unreliable Responses From Black-Box Vision-Language Models for Selective Visual Question Answering

Figure 3 for Consistency and Uncertainty: Identifying Unreliable Responses From Black-Box Vision-Language Models for Selective Visual Question Answering

Figure 4 for Consistency and Uncertainty: Identifying Unreliable Responses From Black-Box Vision-Language Models for Selective Visual Question Answering

Abstract:The goal of selective prediction is to allow an a model to abstain when it may not be able to deliver a reliable prediction, which is important in safety-critical contexts. Existing approaches to selective prediction typically require access to the internals of a model, require retraining a model or study only unimodal models. However, the most powerful models (e.g. GPT-4) are typically only available as black boxes with inaccessible internals, are not retrainable by end-users, and are frequently used for multimodal tasks. We study the possibility of selective prediction for vision-language models in a realistic, black-box setting. We propose using the principle of \textit{neighborhood consistency} to identify unreliable responses from a black-box vision-language model in question answering tasks. We hypothesize that given only a visual question and model response, the consistency of the model's responses over the neighborhood of a visual question will indicate reliability. It is impossible to directly sample neighbors in feature space in a black-box setting. Instead, we show that it is possible to use a smaller proxy model to approximately sample from the neighborhood. We find that neighborhood consistency can be used to identify model responses to visual questions that are likely unreliable, even in adversarial settings or settings that are out-of-distribution to the proxy model.

* CVPR 2024

Via

Access Paper or Ask Questions

Self-Training Large Language Models for Improved Visual Program Synthesis With Visual Reinforcement

Apr 06, 2024

Zaid Khan, Vijay Kumar BG, Samuel Schulter, Yun Fu, Manmohan Chandraker

Figure 1 for Self-Training Large Language Models for Improved Visual Program Synthesis With Visual Reinforcement

Figure 2 for Self-Training Large Language Models for Improved Visual Program Synthesis With Visual Reinforcement

Figure 3 for Self-Training Large Language Models for Improved Visual Program Synthesis With Visual Reinforcement

Figure 4 for Self-Training Large Language Models for Improved Visual Program Synthesis With Visual Reinforcement

Abstract:Visual program synthesis is a promising approach to exploit the reasoning abilities of large language models for compositional computer vision tasks. Previous work has used few-shot prompting with frozen LLMs to synthesize visual programs. Training an LLM to write better visual programs is an attractive prospect, but it is unclear how to accomplish this. No dataset of visual programs for training exists, and acquisition of a visual program dataset cannot be easily crowdsourced due to the need for expert annotators. To get around the lack of direct supervision, we explore improving the program synthesis abilities of an LLM using feedback from interactive experience. We propose a method where we exploit existing annotations for a vision-language task to improvise a coarse reward signal for that task, treat the LLM as a policy, and apply reinforced self-training to improve the visual program synthesis ability of the LLM for that task. We describe a series of experiments on object detection, compositional visual question answering, and image-text retrieval, and show that in each case, the self-trained LLM outperforms or performs on par with few-shot frozen LLMs that are an order of magnitude larger. Website: https://zaidkhan.me/ViReP

* CVPR 2024

Via

Access Paper or Ask Questions

OOSTraj: Out-of-Sight Trajectory Prediction With Vision-Positioning Denoising

Apr 02, 2024

Haichao Zhang, Yi Xu, Hongsheng Lu, Takayuki Shimizu, Yun Fu

Abstract:Trajectory prediction is fundamental in computer vision and autonomous driving, particularly for understanding pedestrian behavior and enabling proactive decision-making. Existing approaches in this field often assume precise and complete observational data, neglecting the challenges associated with out-of-view objects and the noise inherent in sensor data due to limited camera range, physical obstructions, and the absence of ground truth for denoised sensor data. Such oversights are critical safety concerns, as they can result in missing essential, non-visible objects. To bridge this gap, we present a novel method for out-of-sight trajectory prediction that leverages a vision-positioning technique. Our approach denoises noisy sensor observations in an unsupervised manner and precisely maps sensor-based trajectories of out-of-sight objects into visual trajectories. This method has demonstrated state-of-the-art performance in out-of-sight noisy sensor trajectory denoising and prediction on the Vi-Fi and JRDB datasets. By enhancing trajectory prediction accuracy and addressing the challenges of out-of-sight objects, our work significantly contributes to improving the safety and reliability of autonomous driving in complex environments. Our work represents the first initiative towards Out-Of-Sight Trajectory prediction (OOSTraj), setting a new benchmark for future research. The code is available at \url{https://github.com/Hai-chao-Zhang/OOSTraj}.

* In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024 (CVPR)

Via

Access Paper or Ask Questions

Adapting to Length Shift: FlexiLength Network for Trajectory Prediction

Mar 31, 2024

Yi Xu, Yun Fu

Abstract:Trajectory prediction plays an important role in various applications, including autonomous driving, robotics, and scene understanding. Existing approaches mainly focus on developing compact neural networks to increase prediction precision on public datasets, typically employing a standardized input duration. However, a notable issue arises when these models are evaluated with varying observation lengths, leading to a significant performance drop, a phenomenon we term the Observation Length Shift. To address this issue, we introduce a general and effective framework, the FlexiLength Network (FLN), to enhance the robustness of existing trajectory prediction techniques against varying observation periods. Specifically, FLN integrates trajectory data with diverse observation lengths, incorporates FlexiLength Calibration (FLC) to acquire temporal invariant representations, and employs FlexiLength Adaptation (FLA) to further refine these representations for more accurate future trajectory predictions. Comprehensive experiments on multiple datasets, ie, ETH/UCY, nuScenes, and Argoverse 1, demonstrate the effectiveness and flexibility of our proposed FLN framework.

* Accepted by CVPR 2024

Via

Access Paper or Ask Questions

Rewrite the Stars

Mar 29, 2024

Xu Ma, Xiyang Dai, Yue Bai, Yizhou Wang, Yun Fu

Abstract:Recent studies have drawn attention to the untapped potential of the "star operation" (element-wise multiplication) in network design. While intuitive explanations abound, the foundational rationale behind its application remains largely unexplored. Our study attempts to reveal the star operation's ability to map inputs into high-dimensional, non-linear feature spaces -- akin to kernel tricks -- without widening the network. We further introduce StarNet, a simple yet powerful prototype, demonstrating impressive performance and low latency under compact network structure and efficient budget. Like stars in the sky, the star operation appears unremarkable but holds a vast universe of potential. Our work encourages further exploration across tasks, with codes available at https://github.com/ma-xu/Rewrite-the-Stars.

* Accepted by CVPR 2024. Codes are made publically available at https://github.com/ma-xu/Rewrite-the-Stars

Via

Access Paper or Ask Questions

Efficient Modulation for Vision Networks

Mar 29, 2024

Xu Ma, Xiyang Dai, Jianwei Yang, Bin Xiao, Yinpeng Chen, Yun Fu, Lu Yuan

Abstract:In this work, we present efficient modulation, a novel design for efficient vision networks. We revisit the modulation mechanism, which operates input through convolutional context modeling and feature projection layers, and fuses features via element-wise multiplication and an MLP block. We demonstrate that the modulation mechanism is particularly well suited for efficient networks and further tailor the modulation design by proposing the efficient modulation (EfficientMod) block, which is considered the essential building block for our networks. Benefiting from the prominent representational ability of modulation mechanism and the proposed efficient design, our network can accomplish better trade-offs between accuracy and efficiency and set new state-of-the-art performance in the zoo of efficient networks. When integrating EfficientMod with the vanilla self-attention block, we obtain the hybrid architecture which further improves the performance without loss of efficiency. We carry out comprehensive experiments to verify EfficientMod's performance. With fewer parameters, our EfficientMod-s performs 0.6 top-1 accuracy better than EfficientFormerV2-s2 and is 25% faster on GPU, and 2.9 better than MobileViTv2-1.0 at the same GPU latency. Additionally, our method presents a notable improvement in downstream tasks, outperforming EfficientFormerV2-s by 3.6 mIoU on the ADE20K benchmark. Code and checkpoints are available at https://github.com/ma-xu/EfficientMod.

* Accepted by ICLR 2024. Codes are made publically available at https://github.com/ma-xu/EfficientMod

Via

Access Paper or Ask Questions