Chen Sun

Learning Visual Grounding from Generative Vision and Language Model

Jul 18, 2024

Shijie Wang, Dahun Kim, Ali Taalimi, Chen Sun, Weicheng Kuo

Abstract: Visual grounding tasks aim to localize image regions based on natural language references. In this work, we explore whether generative VLMs predominantly trained on image-text data could be leveraged to scale up the text annotation of visual grounding data. We find that grounding knowledge already exists in generative VLMs and can be elicited by proper prompting. We thus prompt a VLM to generate object-level descriptions by feeding it object regions from existing object detection datasets. We further propose attribute modeling to explicitly capture the important object attributes, and spatial relation modeling to capture inter-object relationships, both of which are common linguistic patterns in referring expressions. Our constructed dataset (500K images, 1M objects, 16M referring expressions) is one of the largest grounding datasets to date, and the first grounding dataset with purely model-generated queries and human-annotated objects. To verify the quality of this data, we conduct zero-shot transfer experiments to the popular RefCOCO benchmarks for both referring expression comprehension (REC) and segmentation (RES) tasks. On both tasks, our model significantly outperforms the state-of-the-art approaches without using human-annotated visual grounding data. Our results demonstrate the promise of generative VLMs to scale up visual grounding in the real world. Code and models will be released.

Potential Based Diffusion Motion Planning

Jul 08, 2024

Yunhao Luo, Chen Sun, Joshua B. Tenenbaum, Yilun Du

Abstract: Effective motion planning in high-dimensional spaces is a long-standing open problem in robotics. One class of traditional motion planning algorithms corresponds to potential-based motion planning. An advantage of potential-based motion planning is composability: different motion constraints can be easily combined by adding the corresponding potentials. However, constructing motion paths from potentials requires solving a global optimization across the configuration-space potential landscape, which is often prone to local minima. We propose a new approach to learning potential-based motion planning, where we train a neural network to capture and learn easily optimizable potentials over motion planning trajectories. We illustrate the effectiveness of this approach, which significantly outperforms both classical and recent learned motion planning approaches and avoids issues with local minima. We further illustrate its inherent composability, which enables us to generalize to a multitude of different motion constraints.

* ICML 2024. Project page and code at https://energy-based-model.github.io/potential-motion-plan/
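
The composability property mentioned in the abstract can be illustrated with a classical, hand-written potential field. This is only a minimal sketch of the traditional idea (summing potentials and descending their gradient), not the learned method from the paper; the function names, gains, and step sizes below are hypothetical choices made for illustration.

```python
import numpy as np

def goal_grad(goal, k_att=1.0):
    """Gradient of the attractive potential 0.5 * k_att * ||q - goal||^2."""
    goal = np.asarray(goal, dtype=float)
    return lambda q: k_att * (q - goal)

def obstacle_grad(center, radius, k_rep=0.1, influence=0.5):
    """Gradient of a repulsive potential 0.5 * k_rep * (1/d - 1/influence)^2,
    active only when the clearance d to a circular obstacle is below `influence`."""
    center = np.asarray(center, dtype=float)

    def grad(q):
        diff = q - center
        dist = np.linalg.norm(diff)
        d = max(dist - radius, 1e-6)          # clearance to the obstacle surface
        if d >= influence:
            return np.zeros_like(q)
        # Chain rule: dU/dd * dd/dq, with dd/dq = diff / dist.
        return k_rep * (1.0 / d - 1.0 / influence) * (-1.0 / d**2) * diff / dist

    return grad

def plan(start, grads, lr=0.02, max_step=0.05, n_steps=3000):
    """Gradient descent on the SUM of potentials; constraints compose by addition."""
    q = np.asarray(start, dtype=float)
    path = [q.copy()]
    for _ in range(n_steps):
        step = -lr * sum(g(q) for g in grads)
        norm = np.linalg.norm(step)
        if norm > max_step:                   # clip to keep the descent stable
            step *= max_step / norm
        q = q + step
        path.append(q.copy())
    return np.array(path)

# Compose a goal constraint with one obstacle constraint by listing both gradients.
path = plan([0.0, 0.0], [goal_grad([4.0, 0.0]), obstacle_grad([2.0, 0.6], 0.5)])
```

Adding another constraint only means appending another gradient function to the list. Such hand-designed potentials can still get stuck in local minima, which is the limitation the abstract says the learned potentials are meant to address.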

Text-Aware Diffusion for Policy Learning

Jul 02, 2024

Calvin Luo, Mandy He, Zilai Zeng, Chen Sun

Abstract: Training an agent to achieve particular goals or perform desired behaviors is often accomplished through reinforcement learning, especially in the absence of expert demonstrations. However, supporting novel goals or behaviors through reinforcement learning requires the ad-hoc design of appropriate reward functions, which quickly becomes intractable. To address this challenge, we propose Text-Aware Diffusion for Policy Learning (TADPoLe), which uses a pretrained, frozen text-conditioned diffusion model to compute dense zero-shot reward signals for text-aligned policy learning. We hypothesize that large-scale pretrained generative models encode rich priors that can supervise a policy to behave not only in a text-aligned manner, but also in alignment with a notion of naturalness summarized from internet-scale training data. In our experiments, we demonstrate that TADPoLe is able to learn policies for novel goal-achievement and continuous locomotion behaviors specified by natural language, in both Humanoid and Dog environments. The behaviors are learned zero-shot without ground-truth rewards or expert demonstrations, and are qualitatively more natural according to human evaluation. We further show that TADPoLe performs competitively when applied to robotic manipulation tasks in the Meta-World environment.

Multi-Beam Integrated Sensing and Communication: State-of-the-Art, Challenges and Opportunities

May 31, 2024

Yinxiao Zhuo, Tianqi Mao, Haojin Li, Chen Sun, Zhaocheng Wang, Zhu Han, Sheng Chen

Abstract: Integrated sensing and communication (ISAC) has been envisioned as a critical enabling technology for next-generation wireless communication, which can realize location/motion detection of surroundings with communication devices. This additional sensing capability leads to a substantial network quality gain and an expansion of the service scenarios. As the system evolves to millimeter wave (mmWave) and above, ISAC can realize simultaneous communication and sensing with ultra-high throughput and radar resolution in a compact design, which relies on directional beamforming to counter the path loss. With multi-beam technology, the dual functions of ISAC can be seamlessly incorporated at the beamspace level by unleashing the potential of joint beamforming. To this end, this article investigates the key technologies for multi-beam ISAC systems. We begin with an overview of the current state-of-the-art solutions in multi-beam ISAC. Subsequently, a detailed analysis of the advantages associated with multi-beam ISAC is provided. Additionally, the key technologies for the transmitter, channel and receiver of multi-beam ISAC are introduced. Finally, we explore the challenges and opportunities presented by multi-beam ISAC, offering valuable insights into this emerging field.

Pre-trained Vision-Language Models Learn Discoverable Visual Concepts

Apr 19, 2024

Yuan Zang, Tian Yun, Hao Tan, Trung Bui, Chen Sun

Abstract: Do vision-language models (VLMs) pre-trained to caption an image of a "durian" learn visual concepts such as "brown" (color) and "spiky" (texture) at the same time? We aim to answer this question as visual concepts learned "for free" would enable wide applications such as neuro-symbolic reasoning or human-interpretable object classification. We assume that the visual concepts, if captured by pre-trained VLMs, can be extracted by their vision-language interface with text-based concept prompts. We observe that recent works prompting VLMs with concepts often differ in their strategies to define and evaluate the visual concepts, leading to conflicting conclusions. We propose a new concept definition strategy based on two observations: first, certain concept prompts include shortcuts that recognize correct concepts for the wrong reasons; second, multimodal information (e.g. visual discriminativeness and textual knowledge) should be leveraged when selecting the concepts. Our proposed concept discovery and learning (CDL) framework is thus designed to identify a diverse list of generic visual concepts (e.g. "spiky" as opposed to "spiky durian"), which are ranked and selected based on visual and language mutual information. We carefully design quantitative and human evaluations of the discovered concepts on six diverse visual recognition datasets, which confirm that pre-trained VLMs do learn visual concepts that provide accurate and thorough descriptions for the recognized objects. All code and models are publicly released.

Precoder Design for User-Centric Network Massive MIMO with Matrix Manifold Optimization

Apr 11, 2024

Rui Sun, Li You, An-An Lu, Chen Sun, Xiqi Gao, Xiang-Gen Xia

Abstract: In this paper, we investigate the precoder design for user-centric network (UCN) massive multiple-input multiple-output (mMIMO) downlink with matrix manifold optimization. In UCN mMIMO systems, each user terminal (UT) is served by a subset of base stations (BSs) instead of all the BSs, facilitating the implementation of the system and lowering the dimension of the precoders to be designed. By proving that the precoder set satisfying the per-BS power constraints forms a Riemannian submanifold of a linear product manifold, we transform the constrained precoder design problem in Euclidean space into an unconstrained one on the Riemannian submanifold. Riemannian ingredients of the problem on the Riemannian submanifold, including orthogonal projection, Riemannian gradient, retraction and vector transport, are further derived, with which the Riemannian conjugate gradient (RCG) design method is proposed for solving the unconstrained problem. The proposed method avoids inverting large-dimensional matrices, which is beneficial in practice. The complexity analyses show the high computational efficiency of the RCG precoder design. Simulation results demonstrate the numerical superiority of the proposed precoder design and the high efficiency of the UCN mMIMO system.

* 13 pages, 9 figures, journal
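
To make the Riemannian ingredients named in the abstract more concrete, here is a minimal sketch for a much simpler setting than the paper's: a single precoder matrix W under one equality power constraint ||W||_F^2 = P, which is a sphere in complex matrix space. This setting is an assumption made for illustration, as are all function names; the paper's manifold covers per-BS constraints and may differ. The sketch also uses plain Riemannian gradient descent instead of the conjugate gradient method, so no vector transport is needed.

```python
import numpy as np

def inner(X, Y):
    """Real inner product Re tr(X^H Y) on complex matrices."""
    return np.vdot(X, Y).real

def project_tangent(W, X, power):
    """Orthogonal projection of X onto the tangent space at W of the sphere
    {W : ||W||_F^2 = power} (assumes W already satisfies the constraint)."""
    return X - (inner(W, X) / power) * W

def retract(W, xi, power):
    """Retraction: step along xi, then rescale back onto the power sphere."""
    Y = W + xi
    return np.sqrt(power) * Y / np.linalg.norm(Y)

def riemannian_gradient_descent(egrad, W0, power, step=0.05, n_iters=200):
    """Minimize a cost over the power sphere, given its Euclidean gradient."""
    W = retract(W0, np.zeros_like(W0), power)     # start from a feasible point
    for _ in range(n_iters):
        rgrad = project_tangent(W, egrad(W), power)   # Riemannian gradient
        W = retract(W, -step * rgrad, power)
    return W

# Toy cost ||W - A||_F^2 (hypothetical, not a sum-rate objective).
A = np.arange(8, dtype=float).reshape(4, 2) + 1j
W = riemannian_gradient_descent(lambda W: 2 * (W - A),
                                np.ones((4, 2), dtype=complex), power=1.0)
```

For this toy cost the minimizer on the sphere is A rescaled to the power budget, so the result can be checked against `np.sqrt(power) * A / np.linalg.norm(A)`. Note that every iterate stays feasible and no matrix inverse is computed, which mirrors the practical benefit claimed in the abstract.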

Self-Correcting Self-Consuming Loops for Generative Model Training

Feb 11, 2024

Nate Gillman, Michael Freeman, Daksh Aggarwal, Chia-Hong Hsu, Calvin Luo, Yonglong Tian, Chen Sun

Abstract: As synthetic data becomes higher quality and proliferates on the internet, machine learning models are increasingly trained on a mix of human- and machine-generated data. Despite the success stories of using synthetic data for representation learning, using synthetic data for generative model training creates "self-consuming loops" which may lead to training instability or even collapse, unless certain conditions are met. Our paper aims to stabilize self-consuming generative model training. Our theoretical results demonstrate that by introducing an idealized correction function, which maps a data point to be more likely under the true data distribution, self-consuming loops can be made exponentially more stable. We then propose self-correction functions, which rely on expert knowledge (e.g. the laws of physics programmed in a simulator), and aim to approximate the idealized corrector automatically and at scale. We empirically validate the effectiveness of self-correcting self-consuming loops on the challenging human motion synthesis task, and observe that it successfully avoids model collapse, even when the ratio of synthetic data to real data is as high as 100%.

* Under submission. Code will be released at https://nategillman.com/sc-sc.html

Pixel Aligned Language Models

Dec 14, 2023

Jiarui Xu, Xingyi Zhou, Shen Yan, Xiuye Gu, Anurag Arnab, Chen Sun, Xiaolong Wang, Cordelia Schmid

Abstract: Large language models have achieved great success in recent years, as have their variants in vision. Existing vision-language models can describe images in natural language, answer visual-related questions, or perform complex reasoning about the image. However, it is still unclear how localization tasks, such as word grounding or referring localization, can be performed using large language models. In this work, we aim to develop a vision-language model that can take locations, for example, a set of points or boxes, as either inputs or outputs. When taking locations as inputs, the model performs location-conditioned captioning, which generates captions for the indicated object or region. When generating locations as outputs, our model regresses pixel coordinates for each output word generated by the language model, and thus performs dense word grounding. Our model is pre-trained on the Localized Narrative dataset, which contains pixel-word-aligned captioning from human attention. We show our model can be applied to various location-aware vision-language tasks, including referring localization, location-conditioned captioning, and dense object captioning, achieving state-of-the-art performance on RefCOCO and Visual Genome. Project page: https://jerryxu.net/PixelLLM .

Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding in Novel Domains

Nov 30, 2023

Rohan Myer Krishnan, Zitian Tang, Zhiqiu Yu, Chen Sun

Abstract: Learning from videos is an emerging research area that enables robots to acquire skills from human demonstrations, such as procedural videos. To do this, video-language models must be able to obtain structured understandings, such as the temporal segmentation of a demonstration into sequences of actions and skills, and to generalize the understandings to novel domains. In pursuit of this goal, we introduce Spacewalk-18, a benchmark containing two tasks: (1) step recognition and (2) intra-video retrieval over a dataset of temporally segmented and labeled tasks in International Space Station spacewalk recordings. In tandem, the two tasks quantify a model's ability to make use of: (1) out-of-domain visual information; (2) a long temporal context window; and (3) multimodal (text + video) domains. This departs from existing benchmarks for procedural video understanding, which typically deal with short context lengths and can be solved with a single modality. Spacewalk-18, with its inherent multimodal and long-form complexity, exposes the high difficulty of task recognition and segmentation. We find that state-of-the-art methods perform poorly on our benchmark, demonstrating that the goal of generalizable procedural video understanding models is far off and underscoring the need to develop new approaches to these tasks. Data, model, and code will be publicly released.

* Under submission. Code and models will be released at https://brown-palm.github.io/Spacewalk-18/

Vamos: Versatile Action Models for Video Understanding

Nov 22, 2023

Shijie Wang, Qi Zhao, Minh Quan Do, Nakul Agarwal, Kwonjoon Lee, Chen Sun

Abstract: What makes good video representations for video understanding, such as anticipating future activities, or answering video-conditioned questions? While earlier approaches focus on end-to-end learning directly from video pixels, we propose to revisit text-based representations, such as discrete action labels, or free-form video captions, which are interpretable and can be directly consumed by large language models (LLMs). Intuitively, different video understanding tasks may require representations that are complementary and at different granularities. To this end, we propose versatile action models (Vamos), a learning framework that is powered by a large language model as the "reasoner" and can flexibly leverage visual embeddings, action labels, and free-form descriptions extracted from videos as its input. We evaluate Vamos on four complementary video understanding benchmarks, Ego4D, Next-QA, IntentQA, and EgoSchema, on its capability to model temporal dynamics, encode visual history, and perform reasoning. Surprisingly, we observe that text-based representations consistently achieve competitive performance on all benchmarks, and that visual embeddings provide marginal or no performance improvement, demonstrating the effectiveness of text-based video representation in the LLM era. We perform an extensive ablation study and qualitative analysis to support our observations, and achieve state-of-the-art performance on three benchmarks.

* Under submission. Code and models will be released at https://brown-palm.github.io/Vamos/
