Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Junsong Yuan

Towards Generic Image Manipulation Detection with Weakly-Supervised Self-Consistency Learning

Sep 03, 2023

Yuanhao Zhai, Tianyu Luan, David Doermann, Junsong Yuan

Abstract:As advanced image manipulation techniques emerge, detecting the manipulation becomes increasingly important. Despite the success of recent learning-based approaches for image manipulation detection, they typically require expensive pixel-level annotations to train, while exhibiting degraded performance when testing on images that are differently manipulated compared with training images. To address these limitations, we propose weakly-supervised image manipulation detection, such that only binary image-level labels (authentic or tampered with) are required for training purpose. Such a weakly-supervised setting can leverage more training images and has the potential to adapt quickly to new manipulation techniques. To improve the generalization ability, we propose weakly-supervised self-consistency learning (WSCL) to leverage the weakly annotated images. Specifically, two consistency properties are learned: multi-source consistency (MSC) and inter-patch consistency (IPC). MSC exploits different content-agnostic information and enables cross-source learning via an online pseudo label generation and refinement process. IPC performs global pair-wise patch-patch relationship reasoning to discover a complete region of manipulation. Extensive experiments validate that our WSCL, even though is weakly supervised, exhibits competitive performance compared with fully-supervised counterpart under both in-distribution and out-of-distribution evaluations, as well as reasonable manipulation localization ability.

* Accepted to ICCV 2023, code: https://github.com/yhZhai/WSCL

Via

Access Paper or Ask Questions

Language-guided Human Motion Synthesis with Atomic Actions

Aug 18, 2023

Yuanhao Zhai, Mingzhen Huang, Tianyu Luan, Lu Dong, Ifeoma Nwogu, Siwei Lyu, David Doermann, Junsong Yuan

Abstract:Language-guided human motion synthesis has been a challenging task due to the inherent complexity and diversity of human behaviors. Previous methods face limitations in generalization to novel actions, often resulting in unrealistic or incoherent motion sequences. In this paper, we propose ATOM (ATomic mOtion Modeling) to mitigate this problem, by decomposing actions into atomic actions, and employing a curriculum learning strategy to learn atomic action composition. First, we disentangle complex human motions into a set of atomic actions during learning, and then assemble novel actions using the learned atomic actions, which offers better adaptability to new actions. Moreover, we introduce a curriculum learning training strategy that leverages masked motion modeling with a gradual increase in the mask ratio, and thus facilitates atomic action assembly. This approach mitigates the overfitting problem commonly encountered in previous methods while enforcing the model to learn better motion representations. We demonstrate the effectiveness of ATOM through extensive experiments, including text-to-motion and action-to-motion synthesis tasks. We further illustrate its superiority in synthesizing plausible and coherent text-guided human motion sequences.

* Accepted to ACM MM 2023, code: https://github.com/yhZhai/ATOM

Via

Access Paper or Ask Questions

Source-Free Domain Adaptation for Medical Image Segmentation via Prototype-Anchored Feature Alignment and Contrastive Learning

Jul 19, 2023

Qinji Yu, Nan Xi, Junsong Yuan, Ziyu Zhou, Kang Dang, Xiaowei Ding

Abstract:Unsupervised domain adaptation (UDA) has increasingly gained interests for its capacity to transfer the knowledge learned from a labeled source domain to an unlabeled target domain. However, typical UDA methods require concurrent access to both the source and target domain data, which largely limits its application in medical scenarios where source data is often unavailable due to privacy concern. To tackle the source data-absent problem, we present a novel two-stage source-free domain adaptation (SFDA) framework for medical image segmentation, where only a well-trained source segmentation model and unlabeled target data are available during domain adaptation. Specifically, in the prototype-anchored feature alignment stage, we first utilize the weights of the pre-trained pixel-wise classifier as source prototypes, which preserve the information of source features. Then, we introduce the bi-directional transport to align the target features with class prototypes by minimizing its expected cost. On top of that, a contrastive learning stage is further devised to utilize those pixels with unreliable predictions for a more compact target feature distribution. Extensive experiments on a cross-modality medical segmentation task demonstrate the superiority of our method in large domain discrepancy settings compared with the state-of-the-art SFDA approaches and even some UDA methods. Code is available at https://github.com/CSCYQJ/MICCAI23-ProtoContra-SFDA.

* Accepted by MICCAI23

Via

Access Paper or Ask Questions

Uncertainty-aware State Space Transformer for Egocentric 3D Hand Trajectory Forecasting

Jul 17, 2023

Wentao Bao, Lele Chen, Libing Zeng, Zhong Li, Yi Xu, Junsong Yuan, Yu Kong

Figure 1 for Uncertainty-aware State Space Transformer for Egocentric 3D Hand Trajectory Forecasting

Figure 2 for Uncertainty-aware State Space Transformer for Egocentric 3D Hand Trajectory Forecasting

Figure 3 for Uncertainty-aware State Space Transformer for Egocentric 3D Hand Trajectory Forecasting

Figure 4 for Uncertainty-aware State Space Transformer for Egocentric 3D Hand Trajectory Forecasting

Abstract:Hand trajectory forecasting from egocentric views is crucial for enabling a prompt understanding of human intentions when interacting with AR/VR systems. However, existing methods handle this problem in a 2D image space which is inadequate for 3D real-world applications. In this paper, we set up an egocentric 3D hand trajectory forecasting task that aims to predict hand trajectories in a 3D space from early observed RGB videos in a first-person view. To fulfill this goal, we propose an uncertainty-aware state space Transformer (USST) that takes the merits of the attention mechanism and aleatoric uncertainty within the framework of the classical state-space model. The model can be further enhanced by the velocity constraint and visual prompt tuning (VPT) on large vision transformers. Moreover, we develop an annotation workflow to collect 3D hand trajectories with high quality. Experimental results on H2O and EgoPAT3D datasets demonstrate the superiority of USST for both 2D and 3D trajectory forecasting. The code and datasets are publicly released: https://github.com/Cogito2012/USST.

* ICCV 2023 Accepted

Via

Access Paper or Ask Questions

High Fidelity 3D Hand Shape Reconstruction via Scalable Graph Frequency Decomposition

Jul 08, 2023

Tianyu Luan, Yuanhao Zhai, Jingjing Meng, Zhong Li, Zhang Chen, Yi Xu, Junsong Yuan

Figure 1 for High Fidelity 3D Hand Shape Reconstruction via Scalable Graph Frequency Decomposition

Figure 2 for High Fidelity 3D Hand Shape Reconstruction via Scalable Graph Frequency Decomposition

Figure 3 for High Fidelity 3D Hand Shape Reconstruction via Scalable Graph Frequency Decomposition

Figure 4 for High Fidelity 3D Hand Shape Reconstruction via Scalable Graph Frequency Decomposition

Abstract:Despite the impressive performance obtained by recent single-image hand modeling techniques, they lack the capability to capture sufficient details of the 3D hand mesh. This deficiency greatly limits their applications when high-fidelity hand modeling is required, e.g., personalized hand modeling. To address this problem, we design a frequency split network to generate 3D hand mesh using different frequency bands in a coarse-to-fine manner. To capture high-frequency personalized details, we transform the 3D mesh into the frequency domain, and propose a novel frequency decomposition loss to supervise each frequency component. By leveraging such a coarse-to-fine scheme, hand details that correspond to the higher frequency domain can be preserved. In addition, the proposed network is scalable, and can stop the inference at any resolution level to accommodate different hardware with varying computational powers. To quantitatively evaluate the performance of our method in terms of recovering personalized shape details, we introduce a new evaluation metric named Mean Signal-to-Noise Ratio (MSNR) to measure the signal-to-noise ratio of each mesh frequency component. Extensive experiments demonstrate that our approach generates fine-grained details for high-fidelity 3D hand reconstruction, and our evaluation metric is more effective for measuring mesh details compared with traditional metrics.

* In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16795-16804. 2023
* CVPR 2023

Via

Access Paper or Ask Questions

RoomDreamer: Text-Driven 3D Indoor Scene Synthesis with Coherent Geometry and Texture

May 18, 2023

Liangchen Song, Liangliang Cao, Hongyu Xu, Kai Kang, Feng Tang, Junsong Yuan, Yang Zhao

Figure 1 for RoomDreamer: Text-Driven 3D Indoor Scene Synthesis with Coherent Geometry and Texture

Figure 2 for RoomDreamer: Text-Driven 3D Indoor Scene Synthesis with Coherent Geometry and Texture

Figure 3 for RoomDreamer: Text-Driven 3D Indoor Scene Synthesis with Coherent Geometry and Texture

Figure 4 for RoomDreamer: Text-Driven 3D Indoor Scene Synthesis with Coherent Geometry and Texture

Abstract:The techniques for 3D indoor scene capturing are widely used, but the meshes produced leave much to be desired. In this paper, we propose "RoomDreamer", which leverages powerful natural language to synthesize a new room with a different style. Unlike existing image synthesis methods, our work addresses the challenge of synthesizing both geometry and texture aligned to the input scene structure and prompt simultaneously. The key insight is that a scene should be treated as a whole, taking into account both scene texture and geometry. The proposed framework consists of two significant components: Geometry Guided Diffusion and Mesh Optimization. Geometry Guided Diffusion for 3D Scene guarantees the consistency of the scene style by applying the 2D prior to the entire scene simultaneously. Mesh Optimization improves the geometry and texture jointly and eliminates the artifacts in the scanned scene. To validate the proposed method, real indoor scenes scanned with smartphones are used for extensive experiments, through which the effectiveness of our method is demonstrated.

* Video results: https://youtu.be/p4xgwj4QJcQ

Via

Access Paper or Ask Questions

Neural Voting Field for Camera-Space 3D Hand Pose Estimation

May 07, 2023

Lin Huang, Chung-Ching Lin, Kevin Lin, Lin Liang, Lijuan Wang, Junsong Yuan, Zicheng Liu

Figure 1 for Neural Voting Field for Camera-Space 3D Hand Pose Estimation

Figure 2 for Neural Voting Field for Camera-Space 3D Hand Pose Estimation

Figure 3 for Neural Voting Field for Camera-Space 3D Hand Pose Estimation

Figure 4 for Neural Voting Field for Camera-Space 3D Hand Pose Estimation

Abstract:We present a unified framework for camera-space 3D hand pose estimation from a single RGB image based on 3D implicit representation. As opposed to recent works, most of which first adopt holistic or pixel-level dense regression to obtain relative 3D hand pose and then follow with complex second-stage operations for 3D global root or scale recovery, we propose a novel unified 3D dense regression scheme to estimate camera-space 3D hand pose via dense 3D point-wise voting in camera frustum. Through direct dense modeling in 3D domain inspired by Pixel-aligned Implicit Functions for 3D detailed reconstruction, our proposed Neural Voting Field (NVF) fully models 3D dense local evidence and hand global geometry, helping to alleviate common 2D-to-3D ambiguities. Specifically, for a 3D query point in camera frustum and its pixel-aligned image feature, NVF, represented by a Multi-Layer Perceptron, regresses: (i) its signed distance to the hand surface; (ii) a set of 4D offset vectors (1D voting weight and 3D directional vector to each hand joint). Following a vote-casting scheme, 4D offset vectors from near-surface points are selected to calculate the 3D hand joint coordinates by a weighted average. Experiments demonstrate that NVF outperforms existing state-of-the-art algorithms on FreiHAND dataset for camera-space 3D hand pose estimation. We also adapt NVF to the classic task of root-relative 3D hand pose estimation, for which NVF also obtains state-of-the-art results on HO3D dataset.

* Accepted to CVPR 2023

Via

Access Paper or Ask Questions

Dynamic Voxel Grid Optimization for High-Fidelity RGB-D Supervised Surface Reconstruction

Apr 12, 2023

Xiangyu Xu, Lichang Chen, Changjiang Cai, Huangying Zhan, Qingan Yan, Pan Ji, Junsong Yuan, Heng Huang, Yi Xu

Abstract:Direct optimization of interpolated features on multi-resolution voxel grids has emerged as a more efficient alternative to MLP-like modules. However, this approach is constrained by higher memory expenses and limited representation capabilities. In this paper, we introduce a novel dynamic grid optimization method for high-fidelity 3D surface reconstruction that incorporates both RGB and depth observations. Rather than treating each voxel equally, we optimize the process by dynamically modifying the grid and assigning more finer-scale voxels to regions with higher complexity, allowing us to capture more intricate details. Furthermore, we develop a scheme to quantify the dynamic subdivision of voxel grid during optimization without requiring any priors. The proposed approach is able to generate high-quality 3D reconstructions with fine details on both synthetic and real-world data, while maintaining computational efficiency, which is substantially faster than the baseline method NeuralRGBD.

* For the project, see https://yanqingan.github.io/

Via

Access Paper or Ask Questions

Harnessing Low-Frequency Neural Fields for Few-Shot View Synthesis

Mar 15, 2023

Liangchen Song, Zhong Li, Xuan Gong, Lele Chen, Zhang Chen, Yi Xu, Junsong Yuan

Figure 1 for Harnessing Low-Frequency Neural Fields for Few-Shot View Synthesis

Figure 2 for Harnessing Low-Frequency Neural Fields for Few-Shot View Synthesis

Figure 3 for Harnessing Low-Frequency Neural Fields for Few-Shot View Synthesis

Figure 4 for Harnessing Low-Frequency Neural Fields for Few-Shot View Synthesis

Abstract:Neural Radiance Fields (NeRF) have led to breakthroughs in the novel view synthesis problem. Positional Encoding (P.E.) is a critical factor that brings the impressive performance of NeRF, where low-dimensional coordinates are mapped to high-dimensional space to better recover scene details. However, blindly increasing the frequency of P.E. leads to overfitting when the reconstruction problem is highly underconstrained, \eg, few-shot images for training. We harness low-frequency neural fields to regularize high-frequency neural fields from overfitting to better address the problem of few-shot view synthesis. We propose reconstructing with a low-frequency only field and then finishing details with a high-frequency equipped field. Unlike most existing solutions that regularize the output space (\ie, rendered images), our regularization is conducted in the input space (\ie, signal frequency). We further propose a simple-yet-effective strategy for tuning the frequency to avoid overfitting few-shot inputs: enforcing consistency among the frequency domain of rendered 2D images. Thanks to the input space regularizing scheme, our method readily applies to inputs beyond spatial locations, such as the time dimension in dynamic scenes. Comparisons with state-of-the-art on both synthetic and natural datasets validate the effectiveness of our proposed solution for few-shot view synthesis. Code is available at \href{https://github.com/lsongx/halo}{https://github.com/lsongx/halo}.

Via

Access Paper or Ask Questions

Progressive Multi-view Human Mesh Recovery with Self-Supervision

Dec 10, 2022

Xuan Gong, Liangchen Song, Meng Zheng, Benjamin Planche, Terrence Chen, Junsong Yuan, David Doermann, Ziyan Wu

Figure 1 for Progressive Multi-view Human Mesh Recovery with Self-Supervision

Figure 2 for Progressive Multi-view Human Mesh Recovery with Self-Supervision

Figure 3 for Progressive Multi-view Human Mesh Recovery with Self-Supervision

Figure 4 for Progressive Multi-view Human Mesh Recovery with Self-Supervision

Abstract:To date, little attention has been given to multi-view 3D human mesh estimation, despite real-life applicability (e.g., motion capture, sport analysis) and robustness to single-view ambiguities. Existing solutions typically suffer from poor generalization performance to new settings, largely due to the limited diversity of image-mesh pairs in multi-view training data. To address this shortcoming, people have explored the use of synthetic images. But besides the usual impact of visual gap between rendered and target data, synthetic-data-driven multi-view estimators also suffer from overfitting to the camera viewpoint distribution sampled during training which usually differs from real-world distributions. Tackling both challenges, we propose a novel simulation-based training pipeline for multi-view human mesh recovery, which (a) relies on intermediate 2D representations which are more robust to synthetic-to-real domain gap; (b) leverages learnable calibration and triangulation to adapt to more diversified camera setups; and (c) progressively aggregates multi-view information in a canonical 3D space to remove ambiguities in 2D representations. Through extensive benchmarking, we demonstrate the superiority of the proposed solution especially for unseen in-the-wild scenarios.

* Accepted by AAAI2023

Via

Access Paper or Ask Questions