Angela Yao

National University of Singapore

HiFiHR: Enhancing 3D Hand Reconstruction from a Single Image via High-Fidelity Texture

Aug 25, 2023

Jiayin Zhu, Zhuoran Zhao, Linlin Yang, Angela Yao

Abstract: We present HiFiHR, a high-fidelity hand reconstruction approach that uses render-and-compare in a learning-based framework to generate, from a single image, visually plausible and accurate 3D hand meshes while recovering realistic textures. Our method achieves superior texture reconstruction by employing a parametric hand model with predefined texture assets, and by establishing a texture reconstruction consistency between the rendered and input images during training. Moreover, based on pretraining the network on an annotated dataset, we apply varying degrees of supervision using our pipeline, i.e., self-supervision, weak supervision, and full supervision, and discuss how much the learned high-fidelity textures contribute to hand pose and shape estimation at each level. Experimental results on public benchmarks including FreiHAND and HO-3D demonstrate that our method outperforms the state-of-the-art hand reconstruction methods in texture reconstruction quality while maintaining comparable accuracy in pose and shape estimation. Our code is available at https://github.com/viridityzhu/HiFiHR.

* Accepted to DAGM German Conference on Pattern Recognition 2023

Opening the Vocabulary of Egocentric Actions

Aug 22, 2023

Dibyadip Chatterjee, Fadime Sener, Shugao Ma, Angela Yao

Abstract: Human actions in egocentric videos are often hand-object interactions composed from a verb (performed by the hand) applied to an object. Despite their extensive scaling up, egocentric datasets still face two limitations: sparsity of action compositions and a closed set of interacting objects. This paper proposes a novel open vocabulary action recognition task. Given a set of verbs and objects observed during training, the goal is to generalize the verbs to an open vocabulary of actions with seen and novel objects. To this end, we decouple the verb and object predictions via an object-agnostic verb encoder and a prompt-based object encoder. The prompting leverages CLIP representations to predict an open vocabulary of interacting objects. We create open vocabulary benchmarks on the EPIC-KITCHENS-100 and Assembly101 datasets; whereas closed-action methods fail to generalize, our proposed method is effective. In addition, our object encoder significantly outperforms existing open-vocabulary visual recognition methods in recognizing novel interacting objects.

* 20 pages, 7 figures; https://dibschat.github.io/openvocab-egoAR/

Learning to Generate Training Datasets for Robust Semantic Segmentation

Aug 18, 2023

Marwane Hariat, Olivier Laurent, Rémi Kazmierczak, Shihao Zhang, Andrei Bursuc, Angela Yao, Gianni Franchi

Abstract: Semantic segmentation techniques have shown significant progress in recent years, but their robustness to real-world perturbations and data samples not seen during training remains a challenge, particularly in safety-critical applications. In this paper, we propose a novel approach to improve the robustness of semantic segmentation techniques by leveraging the synergy between label-to-image generators and image-to-label segmentation models. Specifically, we design and train Robusta, a novel robust conditional generative adversarial network to generate realistic and plausible perturbed or outlier images that can be used to train reliable segmentation models. We conduct in-depth studies of the proposed generative model, assess the performance and robustness of the downstream segmentation network, and demonstrate that our approach can significantly enhance the robustness of semantic segmentation techniques in the face of real-world perturbations, distribution shifts, and out-of-distribution samples. Our results suggest that this approach could be valuable in safety-critical applications, where the reliability of semantic segmentation techniques is of utmost importance and comes with a limited computational budget in inference. We will release our code shortly.

Every Mistake Counts in Assembly

Jul 31, 2023

Guodong Ding, Fadime Sener, Shugao Ma, Angela Yao

Abstract: One promising use case of AI assistants is to help with complex procedures like cooking, home repair, and assembly tasks. Can we teach the assistant to interject after the user makes a mistake? This paper targets the problem of identifying ordering mistakes in assembly procedures. We propose a system that can detect ordering mistakes by utilizing a learned knowledge base. Our framework constructs a knowledge base with spatial and temporal beliefs based on observed mistakes. Spatial beliefs depict the topological relationship of the assembling components, while temporal beliefs aggregate prerequisite actions as ordering constraints. With an episodic memory design, our algorithm can dynamically update and construct the belief sets as more actions are observed, all in an online fashion. We demonstrate experimentally that our inferred spatial and temporal beliefs are capable of identifying incorrect orderings in real-world action sequences. To construct the spatial beliefs, we collect a new set of coarse-level action annotations for Assembly101 based on the positioning of the toy parts. Finally, we demonstrate the superior performance of our belief inference algorithm in detecting ordering mistakes on the Assembly101 dataset.

* 10 pages, 5 figures
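
To make the idea of "prerequisite actions as ordering constraints" concrete, here is a minimal sketch of an online check against a fixed set of temporal constraints. It is an illustration only, not the paper's belief inference algorithm: the paper learns and updates its beliefs from observations, whereas the `prerequisites` table, the action names, and the function `find_ordering_mistakes` below are all hypothetical and hand-written.

```python
def find_ordering_mistakes(actions, prerequisites):
    """Scan an action sequence once, in order, and flag every action that is
    performed before all of its prerequisite actions have been observed.

    `prerequisites` maps an action to the set of actions that must precede it
    (a hand-written stand-in for learned temporal beliefs).
    """
    done = set()
    mistakes = []
    for step, action in enumerate(actions):
        missing = prerequisites.get(action, set()) - done
        if missing:
            mistakes.append((step, action, sorted(missing)))
        done.add(action)
    return mistakes


# Hypothetical constraints for a toy vehicle: the chassis must come first.
prerequisites = {
    "attach wheel": {"attach chassis"},
    "attach cabin": {"attach chassis"},
}
sequence = ["attach wheel", "attach chassis", "attach cabin"]

mistakes = find_ordering_mistakes(sequence, prerequisites)
print(mistakes)
```

Because the check only looks at actions seen so far, it can run as each new action arrives, which is the sense in which such a detector is "online".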

Overcoming the Trade-off Between Accuracy and Plausibility in 3D Hand Shape Reconstruction

May 01, 2023

Ziwei Yu, Chen Li, Linlin Yang, Xiaoxu Zheng, Michael Bi Mi, Gim Hee Lee, Angela Yao

Abstract: Direct mesh fitting for 3D hand shape reconstruction is highly accurate. However, the reconstructed meshes are prone to artifacts and do not appear as plausible hand shapes. Conversely, parametric models like MANO ensure plausible hand shapes but are not as accurate as the non-parametric methods. In this work, we introduce a novel weakly-supervised hand shape estimation framework that integrates non-parametric mesh fitting with the MANO model in an end-to-end fashion. Our joint model overcomes the trade-off between accuracy and plausibility to yield well-aligned and high-quality 3D meshes, especially in challenging two-hand and hand-object interaction scenarios.

* CVPR 2023

An Implicit Alignment for Video Super-Resolution

Apr 29, 2023

Kai Xu, Ziwei Yu, Xin Wang, Michael Bi Mi, Angela Yao

Abstract: Video super-resolution commonly uses a frame-wise alignment to support the propagation of information over time. The role of alignment is well-studied for low-level enhancement in video, but existing works have overlooked one critical step: re-sampling. Most works, regardless of how they compensate for motion between frames, be it flow-based warping or deformable convolution/attention, use the default choice of bilinear interpolation for re-sampling. However, bilinear interpolation acts effectively as a low-pass filter and thus hinders the aim of recovering high-frequency content for super-resolution. This paper studies the impact of re-sampling on alignment for video super-resolution. Extensive experiments reveal that for alignment to be effective, the re-sampling should preserve the original sharpness of the features and prevent distortions. From these observations, we propose an implicit alignment method that re-samples through a window-based cross-attention with sampling positions encoded by sinusoidal positional encoding. The re-sampling is implicitly computed by learned network weights. Experiments show that the proposed implicit alignment enhances the performance of state-of-the-art frameworks with minimal impact on both synthetic and real-world datasets.
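
The claim that bilinear interpolation behaves like a low-pass filter can be checked with a small 1D experiment. The sketch below is not taken from the paper; it uses linear interpolation (the 1D analogue of bilinear) on a hand-made alternating signal, and the helper names `linear_sample` and `peak` are illustrative assumptions.

```python
import math


def linear_sample(signal, pos):
    """Linearly interpolate `signal` at fractional index `pos`
    (0 <= pos <= len(signal) - 1)."""
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(signal) - 1)
    w = pos - lo
    return (1.0 - w) * signal[lo] + w * signal[hi]


def peak(values):
    """Largest absolute value, used here as a crude amplitude measure."""
    return max(abs(v) for v in values)


# Highest frequency the grid can represent: alternating +1 / -1.
high_freq = [1.0 if i % 2 == 0 else -1.0 for i in range(8)]

# Re-sample at integer positions, then at quarter- and half-pixel offsets.
aligned = [linear_sample(high_freq, i) for i in range(7)]
quarter = [linear_sample(high_freq, i + 0.25) for i in range(7)]
half_px = [linear_sample(high_freq, i + 0.5) for i in range(7)]

print(peak(aligned), peak(quarter), peak(half_px))
```

At integer positions the signal is reproduced exactly, but sub-pixel offsets average neighbouring samples: the amplitude halves at a quarter-pixel shift and vanishes at a half-pixel shift. This loss of high-frequency content under sub-pixel warping is the effect the abstract refers to.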

Contrastive Video Question Answering via Video Graph Transformer

Feb 27, 2023

Junbin Xiao, Pan Zhou, Angela Yao, Yicong Li, Richang Hong, Shuicheng Yan, Tat-Seng Chua

Abstract: We propose to perform video question answering (VideoQA) in a Contrastive manner via a Video Graph Transformer model (CoVGT). CoVGT's uniqueness and superiority are three-fold: 1) It proposes a dynamic graph transformer module which encodes video by explicitly capturing the visual objects, their relations and dynamics, for complex spatio-temporal reasoning. 2) It designs separate video and text transformers for contrastive learning between the video and text to perform QA, instead of a multi-modal transformer for answer classification. Fine-grained video-text communication is done by additional cross-modal interaction modules. 3) It is optimized by joint fully- and self-supervised contrastive objectives between the correct and incorrect answers, as well as the relevant and irrelevant questions, respectively. With superior video encoding and QA solution, we show that CoVGT can achieve much better performance than prior methods on video reasoning tasks. Its performance even surpasses that of models pretrained on millions of external data samples. We further show that CoVGT can also benefit from cross-modal pretraining, yet with orders of magnitude less data. The results demonstrate the effectiveness and superiority of CoVGT, and additionally reveal its potential for more data-efficient pretraining. We hope our success can advance VideoQA beyond coarse recognition/description towards fine-grained relation reasoning of video contents. Our code will be available at https://github.com/doc-doc/CoVGT.

* Manuscript was submitted for review at IEEE T-PAMI on 11 Oct. 2022. This version includes minor modifications

Bias-Compensated Integral Regression for Human Pose Estimation

Jan 25, 2023

Kerui Gu, Linlin Yang, Michael Bi Mi, Angela Yao

Abstract: In human and hand pose estimation, heatmaps are a crucial intermediate representation for a body or hand keypoint. Two popular methods to decode the heatmap into a final joint coordinate are via an argmax, as done in heatmap detection, or via softmax and expectation, as done in integral regression. Integral regression is learnable end-to-end, but has lower accuracy than detection. This paper uncovers an induced bias from integral regression that results from combining the softmax and the expectation operation. This bias often forces the network to learn degenerately localized heatmaps, obscuring the keypoint's true underlying distribution and leading to lower accuracy. Training-wise, by investigating the gradients of integral regression, we show that the implicit guidance of integral regression to update the heatmap makes it slower to converge than detection. To counter the above two limitations, we propose Bias Compensated Integral Regression (BCIR), an integral regression-based framework that compensates for the bias. BCIR also incorporates a Gaussian prior loss to speed up training and improve prediction accuracy. Experimental results on both the human body and hand benchmarks show that BCIR is faster to train and more accurate than the original integral regression, making it competitive with state-of-the-art detection methods.
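
The two decoding routes named in the abstract, argmax and softmax followed by expectation, can be compared on a toy 1D heatmap. The sketch below is a generic illustration and not the paper's derivation or the BCIR compensation: the heatmap, the grid size, and the `beta` sharpness parameter are assumptions chosen for the example. It shows one well-known way softmax plus expectation can be biased: softmax gives every location nonzero weight, so the expected coordinate is pulled away from an off-centre peak.

```python
import math


def argmax_decode(heatmap):
    """Detection-style decoding: index of the largest heatmap value."""
    return max(range(len(heatmap)), key=lambda i: heatmap[i])


def soft_argmax_decode(heatmap, beta=1.0):
    """Integral-regression-style decoding: softmax over the heatmap, then the
    expected index under that distribution. `beta` scales the logits."""
    m = max(heatmap)
    exps = [math.exp(beta * (h - m)) for h in heatmap]  # shifted for stability
    z = sum(exps)
    return sum(i * e / z for i, e in enumerate(exps))


# Gaussian-shaped bump centred at index 2 on a 16-cell grid (values in [0, 1]).
heatmap = [math.exp(-0.5 * (i - 2) ** 2) for i in range(16)]

hard = argmax_decode(heatmap)                   # picks the peak location
soft = soft_argmax_decode(heatmap)              # pulled toward the grid centre
sharp = soft_argmax_decode(heatmap, beta=50.0)  # very peaked softmax

print(hard, round(soft, 2), round(sharp, 4))
```

With `beta=1.0` the near-zero background cells still receive substantial softmax weight, so the decoded coordinate lands several cells away from the peak. Only a very sharply peaked distribution (`beta=50.0` here) recovers the peak location, which is consistent with the abstract's remark that the bias pushes networks toward degenerately localized heatmaps.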

Improving Deep Regression with Ordinal Entropy

Jan 21, 2023

Shihao Zhang, Linlin Yang, Michael Bi Mi, Xiaoxu Zheng, Angela Yao

Abstract: In computer vision, it is often observed that formulating regression problems as a classification task yields better performance. We investigate this curious phenomenon and provide a derivation to show that classification, with the cross-entropy loss, outperforms regression with a mean squared error loss in its ability to learn high-entropy feature representations. Based on the analysis, we propose an ordinal entropy loss to encourage higher-entropy feature spaces while maintaining ordinal relationships to improve the performance of regression tasks. Experiments on synthetic and real-world regression tasks demonstrate the importance and benefits of increasing entropy for regression.

* Accepted to ICLR 2023. Project page: https://github.com/needylove/OrdinalEntropy

C2F-TCN: A Framework for Semi and Fully Supervised Temporal Action Segmentation

Dec 20, 2022

Dipika Singhania, Rahul Rahaman, Angela Yao

Abstract: Temporal action segmentation tags action labels for every frame in an input untrimmed video containing multiple actions in a sequence. For the task of temporal action segmentation, we propose an encoder-decoder-style architecture named C2F-TCN featuring a "coarse-to-fine" ensemble of decoder outputs. The C2F-TCN framework is enhanced with a novel model-agnostic temporal feature augmentation strategy formed by the computationally inexpensive strategy of the stochastic max-pooling of segments. It produces more accurate and well-calibrated supervised results on three benchmark action segmentation datasets. We show that the architecture is flexible for both supervised and representation learning. In line with this, we present a novel unsupervised way to learn frame-wise representation from C2F-TCN. Our unsupervised learning approach hinges on the clustering capabilities of the input features and the formation of multi-resolution features from the decoder's implicit structure. Further, we provide the first semi-supervised temporal action segmentation results by merging representation learning with conventional supervised learning. Our semi-supervised learning scheme, called "Iterative-Contrastive-Classify (ICC)", progressively improves in performance with more labeled data. The ICC semi-supervised learning in C2F-TCN, with 40% labeled videos, performs similarly to fully supervised counterparts.

* arXiv admin note: text overlap with arXiv:2112.01402
