Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yanwei Fu

Open-DDVM: A Reproduction and Extension of Diffusion Model for Optical Flow Estimation

Dec 04, 2023

Qiaole Dong, Bo Zhao, Yanwei Fu

Abstract:Recently, Google proposes DDVM which for the first time demonstrates that a general diffusion model for image-to-image translation task works impressively well on optical flow estimation task without any specific designs like RAFT. However, DDVM is still a closed-source model with the expensive and private Palette-style pretraining. In this technical report, we present the first open-source DDVM by reproducing it. We study several design choices and find those important ones. By training on 40k public data with 4 GPUs, our reproduction achieves comparable performance to the closed-source DDVM. The code and model have been released in https://github.com/DQiaole/FlowDiffusion_pytorch.

* Technical Report

Via

Access Paper or Ask Questions

fMRI-PTE: A Large-scale fMRI Pretrained Transformer Encoder for Multi-Subject Brain Activity Decoding

Nov 01, 2023

Xuelin Qian, Yun Wang, Jingyang Huo, Jianfeng Feng, Yanwei Fu

Figure 1 for fMRI-PTE: A Large-scale fMRI Pretrained Transformer Encoder for Multi-Subject Brain Activity Decoding

Figure 2 for fMRI-PTE: A Large-scale fMRI Pretrained Transformer Encoder for Multi-Subject Brain Activity Decoding

Figure 3 for fMRI-PTE: A Large-scale fMRI Pretrained Transformer Encoder for Multi-Subject Brain Activity Decoding

Figure 4 for fMRI-PTE: A Large-scale fMRI Pretrained Transformer Encoder for Multi-Subject Brain Activity Decoding

Abstract:The exploration of brain activity and its decoding from fMRI data has been a longstanding pursuit, driven by its potential applications in brain-computer interfaces, medical diagnostics, and virtual reality. Previous approaches have primarily focused on individual subject analysis, highlighting the need for a more universal and adaptable framework, which is the core motivation behind our work. In this work, we propose fMRI-PTE, an innovative auto-encoder approach for fMRI pre-training, with a focus on addressing the challenges of varying fMRI data dimensions due to individual brain differences. Our approach involves transforming fMRI signals into unified 2D representations, ensuring consistency in dimensions and preserving distinct brain activity patterns. We introduce a novel learning strategy tailored for pre-training 2D fMRI images, enhancing the quality of reconstruction. fMRI-PTE's adaptability with image generators enables the generation of well-represented fMRI features, facilitating various downstream tasks, including within-subject and cross-subject brain activity decoding. Our contributions encompass introducing fMRI-PTE, innovative data transformation, efficient training, a novel learning strategy, and the universal applicability of our approach. Extensive experiments validate and support our claims, offering a promising foundation for further research in this domain.

Via

Access Paper or Ask Questions

The Blessings of Multiple Treatments and Outcomes in Treatment Effect Estimation

Oct 14, 2023

Yong Wu, Mingzhou Liu, Jing Yan, Yanwei Fu, Shouyan Wang, Yizhou Wang, Xinwei Sun

Figure 1 for The Blessings of Multiple Treatments and Outcomes in Treatment Effect Estimation

Figure 2 for The Blessings of Multiple Treatments and Outcomes in Treatment Effect Estimation

Figure 3 for The Blessings of Multiple Treatments and Outcomes in Treatment Effect Estimation

Figure 4 for The Blessings of Multiple Treatments and Outcomes in Treatment Effect Estimation

Abstract:Assessing causal effects in the presence of unobserved confounding is a challenging problem. Existing studies leveraged proxy variables or multiple treatments to adjust for the confounding bias. In particular, the latter approach attributes the impact on a single outcome to multiple treatments, allowing estimating latent variables for confounding control. Nevertheless, these methods primarily focus on a single outcome, whereas in many real-world scenarios, there is greater interest in studying the effects on multiple outcomes. Besides, these outcomes are often coupled with multiple treatments. Examples include the intensive care unit (ICU), where health providers evaluate the effectiveness of therapies on multiple health indicators. To accommodate these scenarios, we consider a new setting dubbed as multiple treatments and multiple outcomes. We then show that parallel studies of multiple outcomes involved in this setting can assist each other in causal identification, in the sense that we can exploit other treatments and outcomes as proxies for each treatment effect under study. We proceed with a causal discovery method that can effectively identify such proxies for causal estimation. The utility of our method is demonstrated in synthetic data and sepsis disease.

* Preprint, under review

Via

Access Paper or Ask Questions

Rethinking Amodal Video Segmentation from Learning Supervised Signals with Object-centric Representation

Sep 23, 2023

Ke Fan, Jingshi Lei, Xuelin Qian, Miaopeng Yu, Tianjun Xiao, Tong He, Zheng Zhang, Yanwei Fu

Figure 1 for Rethinking Amodal Video Segmentation from Learning Supervised Signals with Object-centric Representation

Figure 2 for Rethinking Amodal Video Segmentation from Learning Supervised Signals with Object-centric Representation

Figure 3 for Rethinking Amodal Video Segmentation from Learning Supervised Signals with Object-centric Representation

Figure 4 for Rethinking Amodal Video Segmentation from Learning Supervised Signals with Object-centric Representation

Abstract:Video amodal segmentation is a particularly challenging task in computer vision, which requires to deduce the full shape of an object from the visible parts of it. Recently, some studies have achieved promising performance by using motion flow to integrate information across frames under a self-supervised setting. However, motion flow has a clear limitation by the two factors of moving cameras and object deformation. This paper presents a rethinking to previous works. We particularly leverage the supervised signals with object-centric representation in \textit{real-world scenarios}. The underlying idea is the supervision signal of the specific object and the features from different views can mutually benefit the deduction of the full mask in any specific frame. We thus propose an Efficient object-centric Representation amodal Segmentation (EoRaS). Specially, beyond solely relying on supervision signals, we design a translation module to project image features into the Bird's-Eye View (BEV), which introduces 3D information to improve current feature quality. Furthermore, we propose a multi-view fusion layer based temporal module which is equipped with a set of object slots and interacts with features from different views by attention mechanism to fulfill sufficient object representation completion. As a result, the full mask of the object can be decoded from image features updated by object slots. Extensive experiments on both real-world and synthetic benchmarks demonstrate the superiority of our proposed method, achieving state-of-the-art performance. Our code will be released at \url{https://github.com/kfan21/EoRaS}.

* Accepted by ICCV 2023

Via

Access Paper or Ask Questions

Doubly Robust Proximal Causal Learning for Continuous Treatments

Sep 22, 2023

Yong Wu, Yanwei Fu, Shouyan Wang, Xinwei Sun

Figure 1 for Doubly Robust Proximal Causal Learning for Continuous Treatments

Figure 2 for Doubly Robust Proximal Causal Learning for Continuous Treatments

Figure 3 for Doubly Robust Proximal Causal Learning for Continuous Treatments

Figure 4 for Doubly Robust Proximal Causal Learning for Continuous Treatments

Abstract:Proximal causal learning is a promising framework for identifying the causal effect under the existence of unmeasured confounders. Within this framework, the doubly robust (DR) estimator was derived and has shown its effectiveness in estimation, especially when the model assumption is violated. However, the current form of the DR estimator is restricted to binary treatments, while the treatment can be continuous in many real-world applications. The primary obstacle to continuous treatments resides in the delta function present in the original DR estimator, making it infeasible in causal effect estimation and introducing a heavy computational burden in nuisance function estimation. To address these challenges, we propose a kernel-based DR estimator that can well handle continuous treatments. Equipped with its smoothness, we show that its oracle form is a consistent approximation of the influence function. Further, we propose a new approach to efficiently solve the nuisance functions. We then provide a comprehensive convergence analysis in terms of the mean square error. We demonstrate the utility of our estimator on synthetic datasets and real-world applications.

* Preprint, under review

Via

Access Paper or Ask Questions

Unsupervised Open-Vocabulary Object Localization in Videos

Sep 18, 2023

Ke Fan, Zechen Bai, Tianjun Xiao, Dominik Zietlow, Max Horn, Zixu Zhao, Carl-Johann Simon-Gabriel, Mike Zheng Shou, Francesco Locatello, Bernt Schiele(+4 more)

Figure 1 for Unsupervised Open-Vocabulary Object Localization in Videos

Figure 2 for Unsupervised Open-Vocabulary Object Localization in Videos

Figure 3 for Unsupervised Open-Vocabulary Object Localization in Videos

Figure 4 for Unsupervised Open-Vocabulary Object Localization in Videos

Abstract:In this paper, we show that recent advances in video representation learning and pre-trained vision-language models allow for substantial improvements in self-supervised video object localization. We propose a method that first localizes objects in videos via a slot attention approach and then assigns text to the obtained slots. The latter is achieved by an unsupervised way to read localized semantic information from the pre-trained CLIP model. The resulting video object localization is entirely unsupervised apart from the implicit annotation contained in CLIP, and it is effectively the first unsupervised approach that yields good results on regular video benchmarks.

* Accepted by ICCV 2023

Via

Access Paper or Ask Questions

Object-Centric Multiple Object Tracking

Sep 05, 2023

Zixu Zhao, Jiaze Wang, Max Horn, Yizhuo Ding, Tong He, Zechen Bai, Dominik Zietlow, Carl-Johann Simon-Gabriel, Bing Shuai, Zhuowen Tu(+6 more)

Figure 1 for Object-Centric Multiple Object Tracking

Figure 2 for Object-Centric Multiple Object Tracking

Figure 3 for Object-Centric Multiple Object Tracking

Figure 4 for Object-Centric Multiple Object Tracking

Abstract:Unsupervised object-centric learning methods allow the partitioning of scenes into entities without additional localization information and are excellent candidates for reducing the annotation burden of multiple-object tracking (MOT) pipelines. Unfortunately, they lack two key properties: objects are often split into parts and are not consistently tracked over time. In fact, state-of-the-art models achieve pixel-level accuracy and temporal consistency by relying on supervised object detection with additional ID labels for the association through time. This paper proposes a video object-centric model for MOT. It consists of an index-merge module that adapts the object-centric slots into detection outputs and an object memory module that builds complete object prototypes to handle occlusions. Benefited from object-centric learning, we only require sparse detection labels (0%-6.25%) for object localization and feature binding. Relying on our self-supervised Expectation-Maximization-inspired loss for object association, our approach requires no ID labels. Our experiments significantly narrow the gap between the existing object-centric model and the fully supervised state-of-the-art and outperform several unsupervised trackers.

* ICCV 2023 camera-ready version

Via

Access Paper or Ask Questions

WALL-E: Embodied Robotic WAiter Load Lifting with Large Language Model

Aug 31, 2023

Tianyu Wang, Yifan Li, Haitao Lin, Xiangyang Xue, Yanwei Fu

Abstract:Enabling robots to understand language instructions and react accordingly to visual perception has been a long-standing goal in the robotics research community. Achieving this goal requires cutting-edge advances in natural language processing, computer vision, and robotics engineering. Thus, this paper mainly investigates the potential of integrating the most recent Large Language Models (LLMs) and existing visual grounding and robotic grasping system to enhance the effectiveness of the human-robot interaction. We introduce the WALL-E (Embodied Robotic WAiter load lifting with Large Language model) as an example of this integration. The system utilizes the LLM of ChatGPT to summarize the preference object of the users as a target instruction via the multi-round interactive dialogue. The target instruction is then forwarded to a visual grounding system for object pose and size estimation, following which the robot grasps the object accordingly. We deploy this LLM-empowered system on the physical robot to provide a more user-friendly interface for the instruction-guided grasping task. The further experimental results on various real-world scenarios demonstrated the feasibility and efficacy of our proposed framework. See the project website at: https://star-uu-wang.github.io/WALL-E/

* 14 pages, 8 figures. See https://star-uu-wang.github.io/WALL-E/

Via

Access Paper or Ask Questions

Coarse-to-Fine Amodal Segmentation with Shape Prior

Aug 31, 2023

Jianxiong Gao, Xuelin Qian, Yikai Wang, Tianjun Xiao, Tong He, Zheng Zhang, Yanwei Fu

Figure 1 for Coarse-to-Fine Amodal Segmentation with Shape Prior

Figure 2 for Coarse-to-Fine Amodal Segmentation with Shape Prior

Figure 3 for Coarse-to-Fine Amodal Segmentation with Shape Prior

Figure 4 for Coarse-to-Fine Amodal Segmentation with Shape Prior

Abstract:Amodal object segmentation is a challenging task that involves segmenting both visible and occluded parts of an object. In this paper, we propose a novel approach, called Coarse-to-Fine Segmentation (C2F-Seg), that addresses this problem by progressively modeling the amodal segmentation. C2F-Seg initially reduces the learning space from the pixel-level image space to the vector-quantized latent space. This enables us to better handle long-range dependencies and learn a coarse-grained amodal segment from visual features and visible segments. However, this latent space lacks detailed information about the object, which makes it difficult to provide a precise segmentation directly. To address this issue, we propose a convolution refine module to inject fine-grained information and provide a more precise amodal object segmentation based on visual features and coarse-predicted segmentation. To help the studies of amodal object segmentation, we create a synthetic amodal dataset, named as MOViD-Amodal (MOViD-A), which can be used for both image and video amodal object segmentation. We extensively evaluate our model on two benchmark datasets: KINS and COCO-A. Our empirical results demonstrate the superiority of C2F-Seg. Moreover, we exhibit the potential of our approach for video amodal object segmentation tasks on FISHBOWL and our proposed MOViD-A. Project page at: http://jianxgao.github.io/C2F-Seg.

* Accepted to ICCV 2023

Via

Access Paper or Ask Questions

Rethinking Person Re-identification from a Projection-on-Prototypes Perspective

Aug 21, 2023

Qizao Wang, Xuelin Qian, Bin Li, Yanwei Fu, Xiangyang Xue

Abstract:Person Re-IDentification (Re-ID) as a retrieval task, has achieved tremendous development over the past decade. Existing state-of-the-art methods follow an analogous framework to first extract features from the input images and then categorize them with a classifier. However, since there is no identity overlap between training and testing sets, the classifier is often discarded during inference. Only the extracted features are used for person retrieval via distance metrics. In this paper, we rethink the role of the classifier in person Re-ID, and advocate a new perspective to conceive the classifier as a projection from image features to class prototypes. These prototypes are exactly the learned parameters of the classifier. In this light, we describe the identity of input images as similarities to all prototypes, which are then utilized as more discriminative features to perform person Re-ID. We thereby propose a new baseline ProNet, which innovatively reserves the function of the classifier at the inference stage. To facilitate the learning of class prototypes, both triplet loss and identity classification loss are applied to features that undergo the projection by the classifier. An improved version of ProNet++ is presented by further incorporating multi-granularity designs. Experiments on four benchmarks demonstrate that our proposed ProNet is simple yet effective, and significantly beats previous baselines. ProNet++ also achieves competitive or even better results than transformer-based competitors.

Via

Access Paper or Ask Questions