ByungIn Yoo

BackTrack: Robust template update via Backward Tracking of candidate template

Aug 21, 2023
Dongwook Lee, Wonjun Choi, Seohyung Lee, ByungIn Yoo, Eunho Yang, Seongju Hwang

Variations in target appearance, such as deformation, illumination change, and occlusion, are major challenges in visual object tracking that degrade a tracker's performance. An effective way to tackle these challenges is template update, which refreshes the template to reflect the change in the target's appearance during tracking. However, new templates of inadequate quality or updates made at inappropriate times may induce model drift, which severely degrades tracking performance. Here, we propose BackTrack, a robust and reliable method that quantifies the confidence of a candidate template by tracking it backward over past frames. Based on this confidence score, we update the template with a reliable candidate at the right time while rejecting unreliable candidates. BackTrack is a generic template update scheme applicable to any template-based tracker. Extensive experiments verify the effectiveness of BackTrack over existing template update algorithms, achieving state-of-the-art performance on various tracking benchmarks.

* 14 pages, 7 figures 
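
A minimal sketch of the backward-tracking confidence check described in the abstract, written as generic Python. The `track` callback, the (x, y, w, h) box format, the fixed-length frame buffer, and the 0.7 acceptance threshold are illustrative assumptions, not details from the paper.

```python
# Sketch of a BackTrack-style template-update gate (assumptions noted above).
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def backtrack_confidence(
    track: Callable[[object, object], Box],  # tracker: (frame, template) -> predicted box
    candidate_template: object,
    past_frames: Sequence[object],           # buffer of recent frames, oldest first
    forward_boxes: Sequence[Box],            # boxes the tracker produced on those frames
) -> float:
    """Track backward over the buffered frames with the candidate template and
    score how well the backward trajectory agrees with the forward one."""
    scores: List[float] = []
    for frame, fwd_box in zip(reversed(past_frames), reversed(forward_boxes)):
        bwd_box = track(frame, candidate_template)
        scores.append(iou(bwd_box, fwd_box))
    return sum(scores) / len(scores) if scores else 0.0


def maybe_update_template(current, candidate, confidence: float, thr: float = 0.7):
    """Accept the candidate only when backward tracking deems it reliable."""
    return candidate if confidence >= thr else current
```

In a real tracker the backward pass would likely be stateful, with the search region following the backward trajectory; the stateless per-frame call above only keeps the sketch short.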

Object-Centric Multi-Task Learning for Human Instances

Mar 13, 2023
Hyeongseok Son, Sangil Jung, Solae Lee, Seongeun Kim, Seung-In Park, ByungIn Yoo

Humans are one of the most essential classes in visual recognition tasks such as detection, segmentation, and pose estimation. Although much effort has been devoted to the individual tasks, multi-task learning across these three has rarely been studied. In this paper, we explore a compact multi-task network architecture that maximally shares parameters across the tasks via object-centric learning. To this end, we propose a novel query design, called the human-centric query (HCQ), that encodes human instance information effectively. HCQ enables the query to learn explicit, structured information about humans, such as keypoints. In addition, we use HCQ directly in the prediction heads of the target tasks and interweave it with deformable attention in the Transformer decoders to exploit a well-learned object-centric representation. Experimental results show that the proposed multi-task network achieves accuracy comparable to state-of-the-art task-specific models in human detection, segmentation, and pose estimation while consuming less computation.
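
A hedged PyTorch sketch of how a human-centric query could feed shared detection, segmentation, and pose heads. The split into one instance vector plus per-keypoint sub-queries, the single decoder layer, and all dimensions are our own assumptions for illustration; the paper's actual HCQ design and deformable-attention decoder are not reproduced here.

```python
import torch
import torch.nn as nn


class HCQMultiTaskHead(nn.Module):
    """Illustrative shared-query head for detection, segmentation, and pose."""

    def __init__(self, dim: int = 256, num_queries: int = 100, num_keypoints: int = 17):
        super().__init__()
        # One instance-level query plus one sub-query per keypoint, per human candidate.
        self.inst_query = nn.Embedding(num_queries, dim)
        self.kpt_query = nn.Embedding(num_queries * num_keypoints, dim)
        self.decoder = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        # Task-specific heads all read from the same decoded queries.
        self.box_head = nn.Linear(dim, 4)      # (cx, cy, w, h), normalized
        self.cls_head = nn.Linear(dim, 2)      # human / background
        self.mask_embed = nn.Linear(dim, dim)  # dot-producted with pixel features
        self.kpt_head = nn.Linear(dim, 2)      # (x, y) per keypoint sub-query
        self.num_queries, self.num_keypoints = num_queries, num_keypoints

    def forward(self, feats: torch.Tensor, pixel_feats: torch.Tensor):
        # feats: (B, HW, dim) flattened encoder features; pixel_feats: (B, dim, H, W).
        b = feats.size(0)
        q = torch.cat([self.inst_query.weight, self.kpt_query.weight], dim=0)
        q = q.unsqueeze(0).expand(b, -1, -1)
        dec = self.decoder(q, feats)
        inst, kpt = dec[:, : self.num_queries], dec[:, self.num_queries:]
        boxes = self.box_head(inst).sigmoid()
        logits = self.cls_head(inst)
        masks = torch.einsum("bqc,bchw->bqhw", self.mask_embed(inst), pixel_feats)
        kpts = self.kpt_head(kpt).sigmoid().view(b, self.num_queries, self.num_keypoints, 2)
        return {"boxes": boxes, "logits": logits, "masks": masks, "keypoints": kpts}
```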

Slot-VPS: Object-centric Representation Learning for Video Panoptic Segmentation

Dec 16, 2021
Yi Zhou, Hui Zhang, Hana Lee, Shuyang Sun, Pingjun Li, Yangguang Zhu, ByungIn Yoo, Xiaojuan Qi, Jae-Joon Han

Video Panoptic Segmentation (VPS) aims to assign a class label to each pixel, uniquely segmenting and identifying all object instances consistently across all frames. Classic solutions usually decompose the VPS task into several sub-tasks and utilize multiple surrogates (e.g., boxes and masks, centres and offsets) to represent objects. However, this divide-and-conquer strategy requires complex post-processing in both the spatial and temporal domains and is vulnerable to failures of the surrogate tasks. In this paper, inspired by object-centric learning, which learns compact and robust object representations, we present Slot-VPS, the first end-to-end framework for this task. We encode all panoptic entities in a video, including both foreground instances and background semantics, with a unified representation called panoptic slots. Coherent spatio-temporal object information is retrieved and encoded into the panoptic slots by the proposed Video Panoptic Retriever, enabling the model to localize, segment, differentiate, and associate objects in a unified manner. Finally, the output panoptic slots can be directly converted into the class, mask, and object ID of each panoptic object in the video. We conduct extensive ablation studies and demonstrate the effectiveness of our approach on two benchmark datasets, Cityscapes-VPS (val and test sets) and VIPER (val set), achieving new state-of-the-art performance of 63.7, 63.3, and 56.2 VPQ, respectively.
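
The sketch below illustrates the panoptic-slot idea in PyTorch: a fixed set of slots attends to per-frame features and is then decoded into class logits and masks, with the slot index serving as the object ID. The single cross-attention "retriever", the frame-wise processing, and all sizes are assumptions made for brevity, not the actual Video Panoptic Retriever.

```python
import torch
import torch.nn as nn


class PanopticSlots(nn.Module):
    """Illustrative panoptic-slot decoder for one video clip."""

    def __init__(self, dim: int = 256, num_slots: int = 100, num_classes: int = 19):
        super().__init__()
        self.slots = nn.Embedding(num_slots, dim)        # one slot per panoptic entity
        self.retriever = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.cls_head = nn.Linear(dim, num_classes + 1)  # +1 for "no object"
        self.mask_embed = nn.Linear(dim, dim)

    def forward(self, frame_feats: torch.Tensor):
        # frame_feats: (T, dim, H, W) features of the T frames of one clip.
        t, c, h, w = frame_feats.shape
        slots = self.slots.weight.unsqueeze(0).expand(t, -1, -1)  # shared across frames
        tokens = frame_feats.flatten(2).transpose(1, 2)           # (T, HW, dim)
        # "Retrieve" spatio-temporal object information into the slots, per frame.
        slots, _ = self.retriever(slots, tokens, tokens)
        logits = self.cls_head(slots)                             # (T, S, num_classes + 1)
        masks = torch.einsum("tsc,tchw->tshw", self.mask_embed(slots), frame_feats)
        # The slot index doubles as the object ID, so no explicit association step is needed.
        ids = torch.arange(slots.size(1)).expand(t, -1)
        return logits, masks, ids
```

Because the same slot embedding is reused for every frame, an entity keeps the same slot index (and hence ID) across the clip, which is roughly the unified association behaviour the abstract describes.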

Joint Learning of Generative Translator and Classifier for Visually Similar Classes

Dec 15, 2019
ByungIn Yoo, Tristan Sylvain, Yoshua Bengio, Junmo Kim

In this paper, we propose a Generative Translation Classification Network (GTCN) for improving visual classification accuracy in settings where classes are visually similar and data is scarce. For this purpose, we propose joint learning that trains a classifier and a generative stochastic translation network end-to-end. The translation network performs on-line data augmentation across classes, whereas previous works have mostly used translation for domain adaptation. To help the model further benefit from this data augmentation, we introduce an adaptive fade-in loss and a quadruplet loss. We perform experiments on multiple datasets to demonstrate the proposed method's performance in varied settings. Of particular interest, training on 40% of the dataset is enough for our model to surpass the performance of baselines trained on the full dataset. When our architecture is trained on the full dataset, we achieve performance comparable to state-of-the-art methods despite using a lightweight architecture.

* 13 pages, 16 figures, 14 tables 
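
A minimal sketch, assuming PyTorch, of the joint-training idea: samples translated toward another class augment the classifier's training signal and are faded in over time. The linear fade-in schedule and the plain cross-entropy losses are stand-ins; the paper's adaptive fade-in loss and quadruplet loss are not reproduced here, and `translator(x, target_y)` is a hypothetical signature.

```python
import torch
import torch.nn.functional as F


def joint_step(translator, classifier, x, y, target_y, step, fade_steps=10_000):
    """One optimization step: classify real images plus translated ones,
    fading the translated branch in as training progresses."""
    # Translate each image toward another (visually similar) class.
    x_fake = translator(x, target_y)

    loss_real = F.cross_entropy(classifier(x), y)
    loss_fake = F.cross_entropy(classifier(x_fake), target_y)

    # Fade-in: translated samples contribute little early on, when the
    # translator is still poor, and more once it has matured.
    alpha = min(1.0, step / fade_steps)
    return loss_real + alpha * loss_fake
```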

Deep generative-contrastive networks for facial expression recognition

Oct 25, 2018
Youngsung Kim, ByungIn Yoo, Youngjun Kwak, Changkyu Choi, Junmo Kim

Because the expressive depth of an emotional face differs across individuals and expressions, recognizing an expression from a single facial image at one moment is difficult. A relative expression of a query face compared to a reference face might alleviate this difficulty. In this paper, we propose to utilize a contrastive representation that embeds a distinctive expressive factor for a discriminative purpose. The contrastive representation is calculated at the embedding layer of deep networks by comparing a given (query) image with a reference image. We use a generative reference image that is estimated from the given image. Consequently, we deploy deep neural networks that combine a generative model, a contrastive model, and a discriminative model, trained in an end-to-end manner. In the proposed networks, we disentangle the facial expressive factor in two steps: learning a generator network and then a contrastive encoder network. We conducted extensive experiments on publicly available facial expression databases (CK+, MMI, Oulu-CASIA, and in-the-wild databases) that have been widely adopted in the recent literature. The proposed method outperforms known state-of-the-art methods in terms of recognition accuracy.
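
A hedged PyTorch sketch of the generative-contrastive pipeline described above: a generator estimates a reference face from the query, a shared encoder embeds both, and the classifier reads the difference of the two embeddings. Treating the contrastive representation as a plain embedding difference, and the module sizes, are our assumptions for illustration.

```python
import torch
import torch.nn as nn


class GenerativeContrastiveNet(nn.Module):
    """Illustrative query-vs-generated-reference expression classifier."""

    def __init__(self, generator: nn.Module, encoder: nn.Module,
                 embed_dim: int = 256, num_classes: int = 7):
        super().__init__()
        self.generator = generator     # query image -> generated reference image
        self.encoder = encoder         # image -> embedding (weights shared for both inputs)
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, query: torch.Tensor) -> torch.Tensor:
        reference = self.generator(query)   # e.g. an estimated expression-neutral face
        z_query = self.encoder(query)
        z_ref = self.encoder(reference)
        contrastive = z_query - z_ref       # expressive factor relative to the reference
        return self.classifier(contrastive)
```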
