Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Xiangyang Ji

Tsinghua University, Beijing, China

Solution for Point Tracking Task of ICCV 1st Perception Test Challenge 2023

Mar 26, 2024

Hongpeng Pan, Yang Yang, Zhongtian Fu, Yuxuan Zhang, Shian Du, Yi Xu, Xiangyang Ji

Abstract:This report proposes an improved method for the Tracking Any Point (TAP) task, which tracks any physical surface through a video. Several existing approaches have explored the TAP by considering the temporal relationships to obtain smooth point motion trajectories, however, they still suffer from the cumulative error caused by temporal prediction. To address this issue, we propose a simple yet effective approach called TAP with confident static points (TAPIR+), which focuses on rectifying the tracking of the static point in the videos shot by a static camera. To clarify, our approach contains two key components: (1) Multi-granularity Camera Motion Detection, which could identify the video sequence by the static camera shot. (2) CMR-based point trajectory prediction with one moving object segmentation approach to isolate the static point from the moving object. Our approach ranked first in the final test with a score of 0.46.

Via

Access Paper or Ask Questions

SynFog: A Photo-realistic Synthetic Fog Dataset based on End-to-end Imaging Simulation for Advancing Real-World Defogging in Autonomous Driving

Mar 25, 2024

Yiming Xie, Henglu Wei, Zhenyi Liu, Xiaoyu Wang, Xiangyang Ji

Figure 1 for SynFog: A Photo-realistic Synthetic Fog Dataset based on End-to-end Imaging Simulation for Advancing Real-World Defogging in Autonomous Driving

Figure 2 for SynFog: A Photo-realistic Synthetic Fog Dataset based on End-to-end Imaging Simulation for Advancing Real-World Defogging in Autonomous Driving

Figure 3 for SynFog: A Photo-realistic Synthetic Fog Dataset based on End-to-end Imaging Simulation for Advancing Real-World Defogging in Autonomous Driving

Figure 4 for SynFog: A Photo-realistic Synthetic Fog Dataset based on End-to-end Imaging Simulation for Advancing Real-World Defogging in Autonomous Driving

Abstract:To advance research in learning-based defogging algorithms, various synthetic fog datasets have been developed. However, existing datasets created using the Atmospheric Scattering Model (ASM) or real-time rendering engines often struggle to produce photo-realistic foggy images that accurately mimic the actual imaging process. This limitation hinders the effective generalization of models from synthetic to real data. In this paper, we introduce an end-to-end simulation pipeline designed to generate photo-realistic foggy images. This pipeline comprehensively considers the entire physically-based foggy scene imaging process, closely aligning with real-world image capture methods. Based on this pipeline, we present a new synthetic fog dataset named SynFog, which features both sky light and active lighting conditions, as well as three levels of fog density. Experimental results demonstrate that models trained on SynFog exhibit superior performance in visual perception and detection accuracy compared to others when applied to real-world foggy images.

Via

Access Paper or Ask Questions

KP-RED: Exploiting Semantic Keypoints for Joint 3D Shape Retrieval and Deformation

Mar 20, 2024

Ruida Zhang, Chenyangguang Zhang, Yan Di, Fabian Manhardt, Xingyu Liu, Federico Tombari, Xiangyang Ji

Figure 1 for KP-RED: Exploiting Semantic Keypoints for Joint 3D Shape Retrieval and Deformation

Figure 2 for KP-RED: Exploiting Semantic Keypoints for Joint 3D Shape Retrieval and Deformation

Figure 3 for KP-RED: Exploiting Semantic Keypoints for Joint 3D Shape Retrieval and Deformation

Figure 4 for KP-RED: Exploiting Semantic Keypoints for Joint 3D Shape Retrieval and Deformation

Abstract:In this paper, we present KP-RED, a unified KeyPoint-driven REtrieval and Deformation framework that takes object scans as input and jointly retrieves and deforms the most geometrically similar CAD models from a pre-processed database to tightly match the target. Unlike existing dense matching based methods that typically struggle with noisy partial scans, we propose to leverage category-consistent sparse keypoints to naturally handle both full and partial object scans. Specifically, we first employ a lightweight retrieval module to establish a keypoint-based embedding space, measuring the similarity among objects by dynamically aggregating deformation-aware local-global features around extracted keypoints. Objects that are close in the embedding space are considered similar in geometry. Then we introduce the neural cage-based deformation module that estimates the influence vector of each keypoint upon cage vertices inside its local support region to control the deformation of the retrieved shape. Extensive experiments on the synthetic dataset PartNet and the real-world dataset Scan2CAD demonstrate that KP-RED surpasses existing state-of-the-art approaches by a large margin. Codes and trained models will be released in https://github.com/lolrudy/KP-RED.

* Accepted by CVPR 2024

Via

Access Paper or Ask Questions

Stimulate the Potential of Robots via Competition

Mar 15, 2024

Kangyao Huang, Di Guo, Xinyu Zhang, Xiangyang Ji, Huaping Liu

Figure 1 for Stimulate the Potential of Robots via Competition

Figure 2 for Stimulate the Potential of Robots via Competition

Figure 3 for Stimulate the Potential of Robots via Competition

Figure 4 for Stimulate the Potential of Robots via Competition

Abstract:It is common for us to feel pressure in a competition environment, which arises from the desire to obtain success comparing with other individuals or opponents. Although we might get anxious under the pressure, it could also be a drive for us to stimulate our potentials to the best in order to keep up with others. Inspired by this, we propose a competitive learning framework which is able to help individual robot to acquire knowledge from the competition, fully stimulating its dynamics potential in the race. Specifically, the competition information among competitors is introduced as the additional auxiliary signal to learn advantaged actions. We further build a Multiagent-Race environment, and extensive experiments are conducted, demonstrating that robots trained in competitive environments outperform ones that are trained with SoTA algorithms in single robot environment.

Via

Access Paper or Ask Questions

FaceChain-SuDe: Building Derived Class to Inherit Category Attributes for One-shot Subject-Driven Generation

Mar 11, 2024

Pengchong Qiao, Lei Shang, Chang Liu, Baigui Sun, Xiangyang Ji, Jie Chen

Abstract:Subject-driven generation has garnered significant interest recently due to its ability to personalize text-to-image generation. Typical works focus on learning the new subject's private attributes. However, an important fact has not been taken seriously that a subject is not an isolated new concept but should be a specialization of a certain category in the pre-trained model. This results in the subject failing to comprehensively inherit the attributes in its category, causing poor attribute-related generations. In this paper, motivated by object-oriented programming, we model the subject as a derived class whose base class is its semantic category. This modeling enables the subject to inherit public attributes from its category while learning its private attributes from the user-provided example. Specifically, we propose a plug-and-play method, Subject-Derived regularization (SuDe). It constructs the base-derived class modeling by constraining the subject-driven generated images to semantically belong to the subject's category. Extensive experiments under three baselines and two backbones on various subjects show that our SuDe enables imaginative attribute-related generations while maintaining subject fidelity. Codes will be open sourced soon at FaceChain (https://github.com/modelscope/facechain).

* accepted by CVPR2024

Via

Access Paper or Ask Questions

MamMIL: Multiple Instance Learning for Whole Slide Images with State Space Models

Mar 08, 2024

Zijie Fang, Yifeng Wang, Zhi Wang, Jian Zhang, Xiangyang Ji, Yongbing Zhang

Figure 1 for MamMIL: Multiple Instance Learning for Whole Slide Images with State Space Models

Figure 2 for MamMIL: Multiple Instance Learning for Whole Slide Images with State Space Models

Figure 3 for MamMIL: Multiple Instance Learning for Whole Slide Images with State Space Models

Figure 4 for MamMIL: Multiple Instance Learning for Whole Slide Images with State Space Models

Abstract:Recently, pathological diagnosis, the gold standard for cancer diagnosis, has achieved superior performance by combining the Transformer with the multiple instance learning (MIL) framework using whole slide images (WSIs). However, the giga-pixel nature of WSIs poses a great challenge for the quadratic-complexity self-attention mechanism in Transformer to be applied in MIL. Existing studies usually use linear attention to improve computing efficiency but inevitably bring performance bottlenecks. To tackle this challenge, we propose a MamMIL framework for WSI classification by cooperating the selective structured state space model (i.e., Mamba) with MIL for the first time, enabling the modeling of instance dependencies while maintaining linear complexity. Specifically, to solve the problem that Mamba can only conduct unidirectional one-dimensional (1D) sequence modeling, we innovatively introduce a bidirectional state space model and a 2D context-aware block to enable MamMIL to learn the bidirectional instance dependencies with 2D spatial relationships. Experiments on two datasets show that MamMIL can achieve advanced classification performance with smaller memory footprints than the state-of-the-art MIL frameworks based on the Transformer. The code will be open-sourced if accepted.

* 11 pages, 2 figures

Via

Access Paper or Ask Questions

VideoElevator: Elevating Video Generation Quality with Versatile Text-to-Image Diffusion Models

Mar 08, 2024

Yabo Zhang, Yuxiang Wei, Xianhui Lin, Zheng Hui, Peiran Ren, Xuansong Xie, Xiangyang Ji, Wangmeng Zuo

Abstract:Text-to-image diffusion models (T2I) have demonstrated unprecedented capabilities in creating realistic and aesthetic images. On the contrary, text-to-video diffusion models (T2V) still lag far behind in frame quality and text alignment, owing to insufficient quality and quantity of training videos. In this paper, we introduce VideoElevator, a training-free and plug-and-play method, which elevates the performance of T2V using superior capabilities of T2I. Different from conventional T2V sampling (i.e., temporal and spatial modeling), VideoElevator explicitly decomposes each sampling step into temporal motion refining and spatial quality elevating. Specifically, temporal motion refining uses encapsulated T2V to enhance temporal consistency, followed by inverting to the noise distribution required by T2I. Then, spatial quality elevating harnesses inflated T2I to directly predict less noisy latent, adding more photo-realistic details. We have conducted experiments in extensive prompts under the combination of various T2V and T2I. The results show that VideoElevator not only improves the performance of T2V baselines with foundational T2I, but also facilitates stylistic video synthesis with personalized T2I. Our code is available at https://github.com/YBYBZhang/VideoElevator.

* Project page: https://videoelevator.github.io Code: https://github.com/YBYBZhang/VideoElevator

Via

Access Paper or Ask Questions

Unveiling the Depths: A Multi-Modal Fusion Framework for Challenging Scenarios

Feb 19, 2024

Jialei Xu, Xianming Liu, Junjun Jiang, Kui Jiang, Rui Li, Kai Cheng, Xiangyang Ji

Figure 1 for Unveiling the Depths: A Multi-Modal Fusion Framework for Challenging Scenarios

Figure 2 for Unveiling the Depths: A Multi-Modal Fusion Framework for Challenging Scenarios

Figure 3 for Unveiling the Depths: A Multi-Modal Fusion Framework for Challenging Scenarios

Figure 4 for Unveiling the Depths: A Multi-Modal Fusion Framework for Challenging Scenarios

Abstract:Monocular depth estimation from RGB images plays a pivotal role in 3D vision. However, its accuracy can deteriorate in challenging environments such as nighttime or adverse weather conditions. While long-wave infrared cameras offer stable imaging in such challenging conditions, they are inherently low-resolution, lacking rich texture and semantics as delivered by the RGB image. Current methods focus solely on a single modality due to the difficulties to identify and integrate faithful depth cues from both sources. To address these issues, this paper presents a novel approach that identifies and integrates dominant cross-modality depth features with a learning-based framework. Concretely, we independently compute the coarse depth maps with separate networks by fully utilizing the individual depth cues from each modality. As the advantageous depth spreads across both modalities, we propose a novel confidence loss steering a confidence predictor network to yield a confidence map specifying latent potential depth areas. With the resulting confidence map, we propose a multi-modal fusion network that fuses the final depth in an end-to-end manner. Harnessing the proposed pipeline, our method demonstrates the ability of robust depth estimation in a variety of difficult scenarios. Experimental results on the challenging MS$^2$ and ViViD++ datasets demonstrate the effectiveness and robustness of our method.

Via

Access Paper or Ask Questions

PAS-SLAM: A Visual SLAM System for Planar Ambiguous Scenes

Feb 09, 2024

Xinggang Hu, Yanmin Wu, Mingyuan Zhao, Linghao Yang, Xiangkui Zhang, Xiangyang Ji

Abstract:Visual SLAM (Simultaneous Localization and Mapping) based on planar features has found widespread applications in fields such as environmental structure perception and augmented reality. However, current research faces challenges in accurately localizing and mapping in planar ambiguous scenes, primarily due to the poor accuracy of the employed planar features and data association methods. In this paper, we propose a visual SLAM system based on planar features designed for planar ambiguous scenes, encompassing planar processing, data association, and multi-constraint factor graph optimization. We introduce a planar processing strategy that integrates semantic information with planar features, extracting the edges and vertices of planes to be utilized in tasks such as plane selection, data association, and pose optimization. Next, we present an integrated data association strategy that combines plane parameters, semantic information, projection IoU (Intersection over Union), and non-parametric tests, achieving accurate and robust plane data association in planar ambiguous scenes. Finally, we design a set of multi-constraint factor graphs for camera pose optimization. Qualitative and quantitative experiments conducted on publicly available datasets demonstrate that our proposed system competes effectively in both accuracy and robustness in terms of map construction and camera localization compared to state-of-the-art methods.

Via

Access Paper or Ask Questions

Enhancing Consistency and Mitigating Bias: A Data Replay Approach for Incremental Learning

Jan 12, 2024

Chenyang Wang, Junjun Jiang, Xingyu Hu, Xianming Liu, Xiangyang Ji

Figure 1 for Enhancing Consistency and Mitigating Bias: A Data Replay Approach for Incremental Learning

Figure 2 for Enhancing Consistency and Mitigating Bias: A Data Replay Approach for Incremental Learning

Figure 3 for Enhancing Consistency and Mitigating Bias: A Data Replay Approach for Incremental Learning

Figure 4 for Enhancing Consistency and Mitigating Bias: A Data Replay Approach for Incremental Learning

Abstract:Deep learning systems are prone to catastrophic forgetting when learning from a sequence of tasks, where old data from experienced tasks is unavailable when learning from a new task. To mitigate the problem, a line of methods propose to replay the data of experienced tasks when learning new tasks. These methods usually adopt an extra memory to store the data for replay. However, it is not expected in practice considering the memory constraint or data privacy issue. As a replacement, data-free data replay methods are proposed by inverting samples from the classification model. Though achieving good results, these methods still suffer from the inconsistency of the inverted and real training data, which is neglected in the inversion stage in recent works. To that effect, we propose to measure the data consistency quantitatively by some simplification and assumptions. Using the measurement, we analyze existing techniques for inverting samples and get some insightful information that inspires a novel loss function to reduce the inconsistency. Specifically, the loss minimizes the KL divergence of the distributions of inverted and real data under the tied multivariate Gaussian assumption, which is easy to implement in continual learning. In addition, we observe that the norms of old class weights turn to decrease continually as learning progresses. We thus analyze the underlying reasons and propose a simple regularization term to balance the class weights so that the samples of old classes are more distinguishable. To conclude, we propose the Consistency enhanced data replay with debiased classifier for Class Incremental Learning (CCIL). Extensive experiments on CIFAR-100, Tiny-ImageNet, and ImageNet100 show consistently improved performance of CCIL compared to previous approaches.

Via

Access Paper or Ask Questions