Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Ming-Hsuan Yang

PTT: Point-Trajectory Transformer for Efficient Temporal 3D Object Detection

Dec 13, 2023

Kuan-Chih Huang, Weijie Lyu, Ming-Hsuan Yang, Yi-Hsuan Tsai

Figure 1 for PTT: Point-Trajectory Transformer for Efficient Temporal 3D Object Detection

Figure 2 for PTT: Point-Trajectory Transformer for Efficient Temporal 3D Object Detection

Figure 3 for PTT: Point-Trajectory Transformer for Efficient Temporal 3D Object Detection

Figure 4 for PTT: Point-Trajectory Transformer for Efficient Temporal 3D Object Detection

Abstract:Recent temporal LiDAR-based 3D object detectors achieve promising performance based on the two-stage proposal-based approach. They generate 3D box candidates from the first-stage dense detector, followed by different temporal aggregation methods. However, these approaches require per-frame objects or whole point clouds, posing challenges related to memory bank utilization. Moreover, point clouds and trajectory features are combined solely based on concatenation, which may neglect effective interactions between them. In this paper, we propose a point-trajectory transformer with long short-term memory for efficient temporal 3D object detection. To this end, we only utilize point clouds of current-frame objects and their historical trajectories as input to minimize the memory bank storage requirement. Furthermore, we introduce modules to encode trajectory features, focusing on long short-term and future-aware perspectives, and then effectively aggregate them with point cloud features. We conduct extensive experiments on the large-scale Waymo dataset to demonstrate that our approach performs well against state-of-the-art methods. Code and models will be made publicly available at https://github.com/kuanchihhuang/PTT.

* Project page: https://github.com/kuanchihhuang/PTT

Via

Access Paper or Ask Questions

Weakly Supervised 3D Object Detection via Multi-Level Visual Guidance

Dec 12, 2023

Kuan-Chih Huang, Yi-Hsuan Tsai, Ming-Hsuan Yang

Abstract:Weakly supervised 3D object detection aims to learn a 3D detector with lower annotation cost, e.g., 2D labels. Unlike prior work which still relies on few accurate 3D annotations, we propose a framework to study how to leverage constraints between 2D and 3D domains without requiring any 3D labels. Specifically, we employ visual data from three perspectives to establish connections between 2D and 3D domains. First, we design a feature-level constraint to align LiDAR and image features based on object-aware regions. Second, the output-level constraint is developed to enforce the overlap between 2D and projected 3D box estimations. Finally, the training-level constraint is utilized by producing accurate and consistent 3D pseudo-labels that align with the visual data. We conduct extensive experiments on the KITTI dataset to validate the effectiveness of the proposed three constraints. Without using any 3D labels, our method achieves favorable performance against state-of-the-art approaches and is competitive with the method that uses 500-frame 3D annotations. Code and models will be made publicly available at https://github.com/kuanchihhuang/VG-W3D.

* Project page: https://github.com/kuanchihhuang/VG-W3D

Via

Access Paper or Ask Questions

Exploring Plain ViT Reconstruction for Multi-class Unsupervised Anomaly Detection

Dec 12, 2023

Jiangning Zhang, Xuhai Chen, Yabiao Wang, Chengjie Wang, Yong Liu, Xiangtai Li, Ming-Hsuan Yang, Dacheng Tao

Figure 1 for Exploring Plain ViT Reconstruction for Multi-class Unsupervised Anomaly Detection

Figure 2 for Exploring Plain ViT Reconstruction for Multi-class Unsupervised Anomaly Detection

Figure 3 for Exploring Plain ViT Reconstruction for Multi-class Unsupervised Anomaly Detection

Figure 4 for Exploring Plain ViT Reconstruction for Multi-class Unsupervised Anomaly Detection

Abstract:This work studies the recently proposed challenging and practical Multi-class Unsupervised Anomaly Detection (MUAD) task, which only requires normal images for training while simultaneously testing both normal/anomaly images for multiple classes. Existing reconstruction-based methods typically adopt pyramid networks as encoders/decoders to obtain multi-resolution features, accompanied by elaborate sub-modules with heavier handcraft engineering designs for more precise localization. In contrast, a plain Vision Transformer (ViT) with simple architecture has been shown effective in multiple domains, which is simpler, more effective, and elegant. Following this spirit, this paper explores plain ViT architecture for MUAD. Specifically, we abstract a Meta-AD concept by inducing current reconstruction-based methods. Then, we instantiate a novel and elegant plain ViT-based symmetric ViTAD structure, effectively designed step by step from three macro and four micro perspectives. In addition, this paper reveals several interesting findings for further exploration. Finally, we propose a comprehensive and fair evaluation benchmark on eight metrics for the MUAD task. Based on a naive training recipe, ViTAD achieves state-of-the-art (SoTA) results and efficiency on the MVTec AD and VisA datasets without bells and whistles, obtaining 85.4 mAD that surpasses SoTA UniAD by +3.0, and only requiring 1.1 hours and 2.3G GPU memory to complete model training by a single V100 GPU. Source code, models, and more results are available at https://zhangzjn.github.io/projects/ViTAD.

Via

Access Paper or Ask Questions

Weakly Supervised Video Individual CountingWeakly Supervised Video Individual Counting

Dec 10, 2023

Xinyan Liu, Guorong Li, Yuankai Qi, Ziheng Yan, Zhenjun Han, Anton van den Hengel, Ming-Hsuan Yang, Qingming Huang

Figure 1 for Weakly Supervised Video Individual CountingWeakly Supervised Video Individual Counting

Figure 2 for Weakly Supervised Video Individual CountingWeakly Supervised Video Individual Counting

Figure 3 for Weakly Supervised Video Individual CountingWeakly Supervised Video Individual Counting

Figure 4 for Weakly Supervised Video Individual CountingWeakly Supervised Video Individual Counting

Abstract:Video Individual Counting (VIC) aims to predict the number of unique individuals in a single video. % Existing methods learn representations based on trajectory labels for individuals, which are annotation-expensive. % To provide a more realistic reflection of the underlying practical challenge, we introduce a weakly supervised VIC task, wherein trajectory labels are not provided. Instead, two types of labels are provided to indicate traffic entering the field of view (inflow) and leaving the field view (outflow). % We also propose the first solution as a baseline that formulates the task as a weakly supervised contrastive learning problem under group-level matching. In doing so, we devise an end-to-end trainable soft contrastive loss to drive the network to distinguish inflow, outflow, and the remaining. % To facilitate future study in this direction, we generate annotations from the existing VIC datasets SenseCrowd and CroHD and also build a new dataset, UAVVIC. % Extensive results show that our baseline weakly supervised method outperforms supervised methods, and thus, little information is lost in the transition to the more practically relevant weakly supervised task. The code and trained model will be public at \href{https://github.com/streamer-AP/CGNet}{CGNet}

Via

Access Paper or Ask Questions

CSL: Class-Agnostic Structure-Constrained Learning for Segmentation Including the Unseen

Dec 09, 2023

Hao Zhang, Fang Li, Lu Qi, Ming-Hsuan Yang, Narendra Ahuja

Figure 1 for CSL: Class-Agnostic Structure-Constrained Learning for Segmentation Including the Unseen

Figure 2 for CSL: Class-Agnostic Structure-Constrained Learning for Segmentation Including the Unseen

Figure 3 for CSL: Class-Agnostic Structure-Constrained Learning for Segmentation Including the Unseen

Figure 4 for CSL: Class-Agnostic Structure-Constrained Learning for Segmentation Including the Unseen

Abstract:Addressing Out-Of-Distribution (OOD) Segmentation and Zero-Shot Semantic Segmentation (ZS3) is challenging, necessitating segmenting unseen classes. Existing strategies adapt the class-agnostic Mask2Former (CA-M2F) tailored to specific tasks. However, these methods cater to singular tasks, demand training from scratch, and we demonstrate certain deficiencies in CA-M2F, which affect performance. We propose the Class-Agnostic Structure-Constrained Learning (CSL), a plug-in framework that can integrate with existing methods, thereby embedding structural constraints and achieving performance gain, including the unseen, specifically OOD, ZS3, and domain adaptation (DA) tasks. There are two schemes for CSL to integrate with existing methods (1) by distilling knowledge from a base teacher network, enforcing constraints across training and inference phrases, or (2) by leveraging established models to obtain per-pixel distributions without retraining, appending constraints during the inference phase. We propose soft assignment and mask split methodologies that enhance OOD object segmentation. Empirical evaluations demonstrate CSL's prowess in boosting the performance of existing algorithms spanning OOD segmentation, ZS3, and DA segmentation, consistently transcending the state-of-art across all three tasks.

* Accepted by AAAI 2024

Via

Access Paper or Ask Questions

DreaMo: Articulated 3D Reconstruction From A Single Casual Video

Dec 07, 2023

Tao Tu, Ming-Feng Li, Chieh Hubert Lin, Yen-Chi Cheng, Min Sun, Ming-Hsuan Yang

Figure 1 for DreaMo: Articulated 3D Reconstruction From A Single Casual Video

Figure 2 for DreaMo: Articulated 3D Reconstruction From A Single Casual Video

Figure 3 for DreaMo: Articulated 3D Reconstruction From A Single Casual Video

Figure 4 for DreaMo: Articulated 3D Reconstruction From A Single Casual Video

Abstract:Articulated 3D reconstruction has valuable applications in various domains, yet it remains costly and demands intensive work from domain experts. Recent advancements in template-free learning methods show promising results with monocular videos. Nevertheless, these approaches necessitate a comprehensive coverage of all viewpoints of the subject in the input video, thus limiting their applicability to casually captured videos from online sources. In this work, we study articulated 3D shape reconstruction from a single and casually captured internet video, where the subject's view coverage is incomplete. We propose DreaMo that jointly performs shape reconstruction while solving the challenging low-coverage regions with view-conditioned diffusion prior and several tailored regularizations. In addition, we introduce a skeleton generation strategy to create human-interpretable skeletons from the learned neural bones and skinning weights. We conduct our study on a self-collected internet video collection characterized by incomplete view coverage. DreaMo shows promising quality in novel-view rendering, detailed articulated shape reconstruction, and skeleton generation. Extensive qualitative and quantitative studies validate the efficacy of each proposed component, and show existing methods are unable to solve correct geometry due to the incomplete view coverage.

* Project page: https://ttaoretw.github.io/DreaMo/

Via

Access Paper or Ask Questions

Towards 4D Human Video Stylization

Dec 07, 2023

Tiantian Wang, Xinxin Zuo, Fangzhou Mu, Jian Wang, Ming-Hsuan Yang

Figure 1 for Towards 4D Human Video Stylization

Figure 2 for Towards 4D Human Video Stylization

Figure 3 for Towards 4D Human Video Stylization

Figure 4 for Towards 4D Human Video Stylization

Abstract:We present a first step towards 4D (3D and time) human video stylization, which addresses style transfer, novel view synthesis and human animation within a unified framework. While numerous video stylization methods have been developed, they are often restricted to rendering images in specific viewpoints of the input video, lacking the capability to generalize to novel views and novel poses in dynamic scenes. To overcome these limitations, we leverage Neural Radiance Fields (NeRFs) to represent videos, conducting stylization in the rendered feature space. Our innovative approach involves the simultaneous representation of both the human subject and the surrounding scene using two NeRFs. This dual representation facilitates the animation of human subjects across various poses and novel viewpoints. Specifically, we introduce a novel geometry-guided tri-plane representation, significantly enhancing feature representation robustness compared to direct tri-plane optimization. Following the video reconstruction, stylization is performed within the NeRFs' rendered feature space. Extensive experiments demonstrate that the proposed method strikes a superior balance between stylized textures and temporal coherence, surpassing existing approaches. Furthermore, our framework uniquely extends its capabilities to accommodate novel poses and viewpoints, making it a versatile tool for creative human video stylization.

* Under Review

Via

Access Paper or Ask Questions

Diffusion-SS3D: Diffusion Model for Semi-supervised 3D Object Detection

Dec 05, 2023

Cheng-Ju Ho, Chen-Hsuan Tai, Yen-Yu Lin, Ming-Hsuan Yang, Yi-Hsuan Tsai

Figure 1 for Diffusion-SS3D: Diffusion Model for Semi-supervised 3D Object Detection

Figure 2 for Diffusion-SS3D: Diffusion Model for Semi-supervised 3D Object Detection

Figure 3 for Diffusion-SS3D: Diffusion Model for Semi-supervised 3D Object Detection

Figure 4 for Diffusion-SS3D: Diffusion Model for Semi-supervised 3D Object Detection

Abstract:Semi-supervised object detection is crucial for 3D scene understanding, efficiently addressing the limitation of acquiring large-scale 3D bounding box annotations. Existing methods typically employ a teacher-student framework with pseudo-labeling to leverage unlabeled point clouds. However, producing reliable pseudo-labels in a diverse 3D space still remains challenging. In this work, we propose Diffusion-SS3D, a new perspective of enhancing the quality of pseudo-labels via the diffusion model for semi-supervised 3D object detection. Specifically, we include noises to produce corrupted 3D object size and class label distributions, and then utilize the diffusion model as a denoising process to obtain bounding box outputs. Moreover, we integrate the diffusion model into the teacher-student framework, so that the denoised bounding boxes can be used to improve pseudo-label generation, as well as the entire semi-supervised learning process. We conduct experiments on the ScanNet and SUN RGB-D benchmark datasets to demonstrate that our approach achieves state-of-the-art performance against existing methods. We also present extensive analysis to understand how our diffusion model design affects performance in semi-supervised learning.

* Accepted in NeurIPS 2023. Code is available at https://github.com/luluho1208/Diffusion-SS3D

Via

Access Paper or Ask Questions

Fine-grained Controllable Video Generation via Object Appearance and Context

Dec 05, 2023

Hsin-Ping Huang, Yu-Chuan Su, Deqing Sun, Lu Jiang, Xuhui Jia, Yukun Zhu, Ming-Hsuan Yang

Figure 1 for Fine-grained Controllable Video Generation via Object Appearance and Context

Figure 2 for Fine-grained Controllable Video Generation via Object Appearance and Context

Figure 3 for Fine-grained Controllable Video Generation via Object Appearance and Context

Figure 4 for Fine-grained Controllable Video Generation via Object Appearance and Context

Abstract:Text-to-video generation has shown promising results. However, by taking only natural languages as input, users often face difficulties in providing detailed information to precisely control the model's output. In this work, we propose fine-grained controllable video generation (FACTOR) to achieve detailed control. Specifically, FACTOR aims to control objects' appearances and context, including their location and category, in conjunction with the text prompt. To achieve detailed control, we propose a unified framework to jointly inject control signals into the existing text-to-video model. Our model consists of a joint encoder and adaptive cross-attention layers. By optimizing the encoder and the inserted layer, we adapt the model to generate videos that are aligned with both text prompts and fine-grained control. Compared to existing methods relying on dense control signals such as edge maps, we provide a more intuitive and user-friendly interface to allow object-level fine-grained control. Our method achieves controllability of object appearances without finetuning, which reduces the per-subject optimization efforts for the users. Extensive experiments on standard benchmark datasets and user-provided inputs validate that our model obtains a 70% improvement in controllability metrics over competitive baselines.

* Project page: https://hhsinping.github.io/factor

Via

Access Paper or Ask Questions

Multi-task Image Restoration Guided By Robust DINO Features

Dec 05, 2023

Xin Lin, Chao Ren, Kelvin C. K. Chan, Lu Qi, Jinshan Pan, Ming-Hsuan Yang

Abstract:Multi-task image restoration has gained significant interest due to its inherent versatility and efficiency compared to its single-task counterpart. Despite its potential, performance degradation is observed with an increase in the number of tasks, primarily attributed to the distinct nature of each restoration task. Addressing this challenge, we introduce \mbox{\textbf{DINO-IR}}, a novel multi-task image restoration approach leveraging robust features extracted from DINOv2. Our empirical analysis shows that while shallow features of DINOv2 capture rich low-level image characteristics, the deep features ensure a robust semantic representation insensitive to degradations while preserving high-frequency contour details. Building on these features, we devise specialized components, including multi-layer semantic fusion module, DINO-Restore adaption and fusion module, and DINO perception contrastive loss, to integrate DINOv2 features into the restoration paradigm. Equipped with the aforementioned components, our DINO-IR performs favorably against existing multi-task image restoration approaches in various tasks by a large margin, indicating the superiority and necessity of reinforcing the robust features for multi-task image restoration.

* Some important information need to add

Via

Access Paper or Ask Questions