Fang Liu

GaussianOcc: Fully Self-supervised and Efficient 3D Occupancy Estimation with Gaussian Splatting

Aug 21, 2024

Wanshui Gan, Fang Liu, Hongbin Xu, Ningkai Mo, Naoto Yokoya

Abstract: We introduce GaussianOcc, a systematic method that investigates two uses of Gaussian splatting for fully self-supervised and efficient 3D occupancy estimation in surround views. First, traditional methods for self-supervised 3D occupancy estimation still require ground truth 6D poses from sensors during training. To address this limitation, we propose a Gaussian Splatting for Projection (GSP) module to provide accurate scale information for fully self-supervised training from adjacent view projection. Additionally, existing methods rely on volume rendering for final 3D voxel representation learning using 2D signals (depth maps, semantic maps), which is both time-consuming and less effective. We propose Gaussian Splatting from Voxel space (GSV) to leverage the fast rendering properties of Gaussian splatting. As a result, the proposed GaussianOcc method enables fully self-supervised (no ground truth pose) 3D occupancy estimation with competitive performance and low computational cost (2.7 times faster in training and 5 times faster in rendering).

* Project page: https://ganwanshui.github.io/GaussianOcc/
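
The abstract above contrasts volume rendering with Gaussian splatting as ways to turn a 3D representation into 2D signals such as depth maps. Both approaches blend depth-sorted contributions front to back. The sketch below shows only that compositing step for a single pixel; it is not the GSP or GSV module, and the function name `composite_depth` and its inputs are hypothetical.

```python
def composite_depth(depths, alphas):
    """Front-to-back alpha compositing of per-sample depths for one pixel.

    depths: sample depths sorted from near to far (hypothetical inputs).
    alphas: per-sample opacities in [0, 1].
    Returns (blended_depth, accumulated_opacity).
    """
    transmittance = 1.0  # fraction of the ray not yet blocked
    depth = 0.0
    for d, a in zip(depths, alphas):
        weight = transmittance * a  # contribution of this sample
        depth += weight * d
        transmittance *= 1.0 - a
    return depth, 1.0 - transmittance


# A half-transparent sample at depth 1 in front of an opaque one at depth 2.
blended, opacity = composite_depth([1.0, 2.0], [0.5, 1.0])
```

In this toy case each sample receives a weight of 0.5, so the blended depth lies midway between the two samples.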

Fast and Efficient: Mask Neural Fields for 3D Scene Segmentation

Jul 01, 2024

Zihan Gao, Lingling Li, Licheng Jiao, Fang Liu, Xu Liu, Wenping Ma, Yuwei Guo, Shuyuan Yang

Abstract: Understanding 3D scenes is a crucial challenge in computer vision research with applications spanning multiple domains. Recent advancements in distilling 2D vision-language foundation models into neural fields, like NeRF and 3DGS, enable open-vocabulary segmentation of 3D scenes from 2D multi-view images without the need for precise 3D annotations. While effective, the per-pixel distillation of high-dimensional CLIP features introduces ambiguity and necessitates complex regularization strategies, adding inefficiencies during training. This paper presents MaskField, which enables fast and efficient 3D open-vocabulary segmentation with neural fields under weak supervision. Unlike previous methods, MaskField distills masks rather than dense high-dimensional CLIP features. MaskField employs neural fields as binary mask generators and supervises them with masks generated by SAM and classified by coarse CLIP features. MaskField overcomes ambiguous object boundaries by naturally introducing SAM-segmented object shapes without extra regularization during training. By circumventing the direct handling of high-dimensional CLIP features during training, MaskField is particularly compatible with explicit scene representations like 3DGS. Our extensive experiments show that MaskField not only surpasses prior state-of-the-art methods but also achieves remarkably fast convergence, outperforming previous methods with just 5 minutes of training. We hope that MaskField will inspire further exploration into how neural fields can be trained to comprehend 3D scenes from 2D models.

* 16 pages, 7 figures

SKD-TSTSAN: Three-Stream Temporal-Shift Attention Network Based on Self-Knowledge Distillation for Micro-Expression Recognition

Jun 25, 2024

Guanghao Zhu, Lin Liu, Yuhao Hu, Haixin Sun, Fang Liu, Xiaohui Du, Ruqian Hao, Juanxiu Liu, Yong Liu, Hao Deng(+1 more)

Abstract: Micro-expressions (MEs) are subtle facial movements that occur spontaneously when people try to conceal their real emotions. Micro-expression recognition (MER) is crucial in many fields, including criminal analysis and psychotherapy. However, MER is challenging since MEs have low intensity and ME datasets are small in size. To this end, a three-stream temporal-shift attention network based on self-knowledge distillation (SKD-TSTSAN) is proposed in this paper. Firstly, to address the low intensity of ME muscle movements, we utilize learning-based motion magnification modules to enhance the intensity of ME muscle movements. Secondly, we employ efficient channel attention (ECA) modules in the local-spatial stream to make the network focus on facial regions that are highly relevant to MEs. In addition, temporal shift modules (TSMs) are used in the dynamic-temporal stream, which enables temporal modeling with no additional parameters by mixing ME motion information from two different temporal domains. Furthermore, we introduce self-knowledge distillation (SKD) into the MER task by introducing auxiliary classifiers and using the deepest section of the network for supervision, encouraging all blocks to fully explore the features of the training set. Finally, extensive experiments are conducted on four ME datasets: CASME II, SAMM, MMEW, and CAS(ME)^3. The experimental results demonstrate that our SKD-TSTSAN outperforms other existing methods and achieves new state-of-the-art performance. Our code will be available at https://github.com/GuanghaoZhu663/SKD-TSTSAN.
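
The abstract above notes that temporal shift modules add temporal modeling without adding parameters. A minimal sketch of the general temporal-shift idea follows, assuming a plain list-of-lists layout (T frames by C channels) and a hypothetical `fold_div` parameter; it illustrates the shifting operation only and is not the authors' implementation.

```python
def temporal_shift(frames, fold_div=4):
    """Shift a fraction of channels along the time axis, zero-padding the ends.

    frames: list of T frames, each a list of C channel values.
    The first C // fold_div channels take their value from the previous frame,
    the next C // fold_div channels from the next frame, and the remaining
    channels are left unchanged. No learnable parameters are involved.
    """
    num_frames = len(frames)
    num_channels = len(frames[0])
    fold = num_channels // fold_div
    out = [[0.0] * num_channels for _ in range(num_frames)]
    for t in range(num_frames):
        for c in range(num_channels):
            if c < fold:
                src = t - 1  # read from the past
            elif c < 2 * fold:
                src = t + 1  # read from the future
            else:
                src = t      # unshifted channel
            if 0 <= src < num_frames:
                out[t][c] = frames[src][c]
    return out


clip = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
shifted = temporal_shift(clip, fold_div=4)
```

After the shift, each frame holds a mix of its own channels and channels borrowed from its neighbors, so a later per-frame layer sees information from adjacent time steps.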

Technique Report of CVPR 2024 PBDL Challenges

Jun 15, 2024

Ying Fu, Yu Li, Shaodi You, Boxin Shi, Jose Alvarez, Coert van Gemeren, Linwei Chen, Yunhao Zou, Zichun Wang, Yichen Li(+91 more)

Abstract: The intersection of physics-based vision and deep learning presents an exciting frontier for advancing computer vision technologies. By leveraging the principles of physics to inform and enhance deep learning models, we can develop more robust and accurate vision systems. Physics-based vision aims to invert the processes to recover scene properties such as shape, reflectance, light distribution, and medium properties from images. In recent years, deep learning has shown promising improvements for various vision tasks, and when combined with physics-based vision, these approaches can enhance the robustness and accuracy of vision systems. This technical report summarizes the outcomes of the Physics-Based Vision Meets Deep Learning (PBDL) 2024 challenge, held as a CVPR 2024 workshop. The challenge consisted of eight tracks, focusing on Low-Light Enhancement and Detection as well as High Dynamic Range (HDR) Imaging. This report details the objectives, methodologies, and results of each track, highlighting the top-performing solutions and their innovative approaches.

* CVPR 2024 Workshop - PBDL Challenge Report

Multiplane Prior Guided Few-Shot Aerial Scene Rendering

Jun 07, 2024

Zihan Gao, Licheng Jiao, Lingling Li, Xu Liu, Fang Liu, Puhua Chen, Yuwei Guo

Abstract: Neural Radiance Fields (NeRF) have been successfully applied in various aerial scenes, yet they face challenges with sparse views due to limited supervision. The acquisition of dense aerial views is often prohibitive, as unmanned aerial vehicles (UAVs) may encounter constraints in perspective range and energy. In this work, we introduce Multiplane Prior guided NeRF (MPNeRF), a novel approach tailored for few-shot aerial scene rendering, marking a pioneering effort in this domain. Our key insight is that the intrinsic geometric regularities specific to aerial imagery could be leveraged to enhance NeRF in sparse aerial scenes. By investigating the behavior of NeRF and Multiplane Images (MPI), we propose to guide the training process of NeRF with a Multiplane Prior. The proposed Multiplane Prior draws upon MPI's benefits and incorporates advanced image comprehension through a SwinV2 Transformer, pre-trained via SimMIM. Our extensive experiments demonstrate that MPNeRF outperforms existing state-of-the-art methods applied in non-aerial contexts, by tripling the performance in SSIM and LPIPS even with three views available. We hope our work offers insights into the development of NeRF-based applications in aerial scenes with limited data.

* CVPR 2024
* 17 pages, 8 figures, accepted at CVPR 2024

Automatic Graph Topology-Aware Transformer

May 30, 2024

Chao Wang, Jiaxuan Zhao, Lingling Li, Licheng Jiao, Fang Liu, Shuyuan Yang

Abstract: Existing efforts are dedicated to designing many topologies and graph-aware strategies for the graph Transformer, which greatly improve the model's representation capabilities. However, manually determining the suitable Transformer architecture for a specific graph dataset or task requires extensive expert knowledge and laborious trials. This paper proposes an evolutionary graph Transformer architecture search framework (EGTAS) to automate the construction of strong graph Transformers. We build a comprehensive graph Transformer search space with the micro-level and macro-level designs. EGTAS evolves graph Transformer topologies at the macro level and graph-aware strategies at the micro level. Furthermore, a surrogate model based on generic architectural coding is proposed to directly predict the performance of graph Transformers, substantially reducing the evaluation cost of evolutionary search. We demonstrate the efficacy of EGTAS across a range of graph-level and node-level tasks, encompassing both small-scale and large-scale graph datasets. Experimental results and ablation studies show that EGTAS can construct high-performance architectures that rival state-of-the-art manual and automated baselines.

* This work has been submitted to the IEEE (Under Second Review). Copyright may be transferred without notice, after which this version may no longer be accessible

LingML: Linguistic-Informed Machine Learning for Enhanced Fake News Detection

May 07, 2024

Jasraj Singh, Fang Liu, Hong Xu, Bee Chin Ng, Wei Zhang

Abstract: Nowadays, information spreads at an unprecedented pace on social media, and discerning truth from misinformation and fake news has become an acute societal challenge. Machine learning (ML) models have been employed to identify fake news but are far from perfect, with challenging problems like limited accuracy, interpretability, and generalizability. In this paper, we enhance ML-based solutions with linguistics input and we propose LingML, linguistic-informed ML, for fake news detection. We conducted an experimental study with a popular dataset on fake news during the pandemic. The experimental results show that our proposed solution is highly effective. There are fewer than two errors out of every ten attempts with only linguistic input used in ML, and the knowledge is highly explainable. When linguistics input is integrated with advanced large-scale ML models for natural language processing, our solution outperforms existing ones with a 1.8% average error rate. LingML creates a new path with linguistics to push the frontier of effective and efficient fake news detection. It also sheds light on real-world multi-disciplinary applications requiring both ML and domain expertise to achieve optimal performance.

* 7 pages

Edit-Your-Motion: Space-Time Diffusion Decoupling Learning for Video Motion Editing

May 07, 2024

Yi Zuo, Lingling Li, Licheng Jiao, Fang Liu, Xu Liu, Wenping Ma, Shuyuan Yang, Yuwei Guo

Abstract: Existing diffusion-based video editing methods have achieved impressive results in motion editing. Most of the existing methods focus on the motion alignment between the edited video and the reference video. However, these methods do not constrain the background and object content of the video to remain unchanged, which makes it possible for users to generate unexpected videos. In this paper, we propose a one-shot video motion editing method called Edit-Your-Motion that requires only a single text-video pair for training. Specifically, we design the Detailed Prompt-Guided Learning Strategy (DPL) to decouple spatio-temporal features in space-time diffusion models. DPL separates learning object content and motion into two training stages. In the first training stage, we focus on learning the spatial features (the features of object content) and breaking down the temporal relationships in the video frames by shuffling them. We further propose Recurrent-Causal Attention (RC-Attn) to learn the consistent content features of the object from unordered video frames. In the second training stage, we restore the temporal relationship in video frames to learn the temporal feature (the features of the background and object's motion). We also adopt the Noise Constraint Loss to smooth out inter-frame differences. Finally, in the inference stage, we inject the content features of the source object into the editing branch through a two-branch structure (editing branch and reconstruction branch). With Edit-Your-Motion, users can edit the motion of objects in the source video to generate more exciting and diverse videos. Comprehensive qualitative experiments, quantitative experiments and user preference studies demonstrate that Edit-Your-Motion performs better than other methods.

Exploring Beyond Logits: Hierarchical Dynamic Labeling Based on Embeddings for Semi-Supervised Classification

Apr 26, 2024

Yanbiao Ma, Licheng Jiao, Fang Liu, Lingling Li, Shuyuan Yang, Xu Liu

Abstract: In semi-supervised learning, methods that rely on confidence learning to generate pseudo-labels have been widely proposed. However, a growing body of research finds that when faced with noisy and biased data, the model's representation network is more reliable than the classification network. Additionally, label generation methods based on model predictions often show poor adaptability across different datasets, necessitating customization of the classification network. Therefore, we propose a Hierarchical Dynamic Labeling (HDL) algorithm that does not depend on model predictions and utilizes image embeddings to generate sample labels. We also introduce an adaptive method for selecting hyperparameters in HDL, enhancing its versatility. Moreover, HDL can be combined with general image encoders (e.g., CLIP) to serve as a fundamental data processing module. We extract embeddings from datasets with class-balanced and long-tailed distributions using pre-trained semi-supervised models. Subsequently, samples are re-labeled using HDL, and the re-labeled samples are used to further train the semi-supervised models. Experiments demonstrate improved model performance, validating the motivation that representation networks are more reliable than classifiers or predictors. Our approach has the potential to change the paradigm of pseudo-label generation in semi-supervised learning.
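
The abstract above describes generating labels from embeddings rather than from classifier predictions. The HDL algorithm itself is not specified in this listing, so the sketch below swaps in a much simpler stand-in, nearest-class-centroid labeling under cosine similarity, purely to illustrate the idea of labeling in embedding space. All function names and the toy data are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def class_centroids(embeddings, labels):
    """Mean embedding per class, computed from the labeled samples."""
    sums, counts = {}, {}
    for emb, label in zip(embeddings, labels):
        if label not in sums:
            sums[label] = [0.0] * len(emb)
            counts[label] = 0
        sums[label] = [s + x for s, x in zip(sums[label], emb)]
        counts[label] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def relabel(unlabeled, centroids):
    """Assign each embedding the class of its most similar centroid."""
    return [max(centroids, key=lambda y: cosine(emb, centroids[y]))
            for emb in unlabeled]


labeled = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
centroids = class_centroids(labeled, ["a", "a", "b", "b"])
pseudo_labels = relabel([[0.8, 0.2], [0.2, 0.7]], centroids)
```

The labels here depend only on geometry in embedding space, so no classification head is consulted; that is the one property this stand-in shares with the approach described in the abstract.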

FedStyle: Style-Based Federated Learning Crowdsourcing Framework for Art Commissions

Apr 25, 2024

Changjuan Ran, Yeting Guo, Fang Liu, Shenglan Cui, Yunfan Ye

Abstract: A unique artistic style is crucial to artists' occupational competitiveness, yet prevailing Art Commission Platforms rarely support style-based retrieval. Meanwhile, the fast-growing generative AI techniques aggravate artists' concerns about releasing personal artworks to public platforms. To achieve artistic style-based retrieval without exposing personal artworks, we propose FedStyle, a style-based federated learning crowdsourcing framework. It allows artists to train local style models and share model parameters rather than artworks for collaboration. However, most artists possess a unique artistic style, resulting in severe model drift among them. FedStyle addresses such extreme data heterogeneity by having artists learn their abstract style representations and align with the server, rather than merely aggregating model parameters lacking semantics. Besides, we introduce contrastive learning to meticulously construct the style representation space, pulling artworks with similar styles closer and keeping different ones apart in the embedding space. Extensive experiments on the proposed datasets demonstrate the superiority of FedStyle.

* Accepted to ICME 2024
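
The abstract above mentions contrastive learning that pulls similar styles together and keeps different ones apart in the embedding space. The specific loss used by FedStyle is not stated in this listing; the sketch below assumes an InfoNCE-style loss with cosine similarity, a common choice, for a single anchor. The function name, the temperature value, and the toy vectors are hypothetical.

```python
import math


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss for one anchor embedding.

    The loss is small when the anchor is closer (in cosine similarity)
    to the positive than to every negative. Vectors must be nonzero.
    """
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    # The positive pair goes first, followed by all negative pairs.
    logits = [cos(anchor, positive) / temperature]
    logits += [cos(anchor, neg) / temperature for neg in negatives]

    # Numerically stable log-sum-exp over all logits.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[0]


same_style = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
wrong_style = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

Here `same_style` is much smaller than `wrong_style`, which is the behavior a contrastive objective of this kind is meant to reward.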
