Seonwook Park

EFE: End-to-end Frame-to-Gaze Estimation

May 09, 2023
Haldun Balim, Seonwook Park, Xi Wang, Xucong Zhang, Otmar Hilliges

Despite the recent development of learning-based gaze estimation methods, most methods require one or more eye or face region crops as inputs and produce a gaze direction vector as output. Cropping yields higher resolution in the eye regions, and having fewer confounding factors (such as clothing and hair) is believed to benefit the final model performance. However, this eye/face patch cropping process is expensive, error-prone, and implementation-specific across methods. In this paper, we propose a frame-to-gaze network that directly predicts both 3D gaze origin and 3D gaze direction from the raw camera frame, without any face or eye cropping. Our method demonstrates that direct gaze regression from the raw frame, downscaled from FHD/HD to VGA/HVGA resolution, is possible despite the challenge of having very few pixels in the eye region. The proposed method achieves results comparable to state-of-the-art methods in Point-of-Gaze (PoG) estimation on three public gaze datasets, GazeCapture, MPIIFaceGaze, and EVE, and generalizes well to extreme camera view changes.
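
To make the frame-to-gaze formulation concrete, below is a minimal sketch of the idea in PyTorch: a backbone regresses a 3D gaze origin and a unit 3D gaze direction from the full downscaled frame, and the Point-of-Gaze is recovered by intersecting the gaze ray with the screen plane. The backbone choice, the coordinate convention (screen plane at z = 0), and all names are assumptions for illustration, not the EFE architecture itself.

```python
# Minimal sketch of direct frame-to-gaze regression (not the authors' EFE
# architecture): a backbone maps the full downscaled frame to a 3D gaze
# origin and a unit 3D gaze direction, and PoG is the ray-plane intersection
# with an assumed screen plane at z = 0. All names are illustrative.
import torch
import torch.nn as nn
import torchvision.models as models

class FrameToGaze(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, 6)  # origin (3) + direction (3)
        self.backbone = backbone

    def forward(self, frame):                     # frame: (B, 3, H, W), e.g. VGA resolution
        out = self.backbone(frame)
        origin = out[:, :3]                       # 3D gaze origin in screen coordinates
        direction = nn.functional.normalize(out[:, 3:], dim=-1)  # unit gaze direction
        return origin, direction

def point_of_gaze(origin, direction):
    """Intersect the gaze ray with the screen plane z = 0 (assumes the ray faces the screen)."""
    t = -origin[:, 2:3] / direction[:, 2:3]       # ray parameter at z = 0
    return origin[:, :2] + t * direction[:, :2]   # (B, 2) on-screen point

model = FrameToGaze()
frame = torch.randn(2, 3, 480, 640)               # raw downscaled camera frame
o, d = model(frame)
pog = point_of_gaze(o, d)                         # (2, 2)
```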

OCELOT: Overlapped Cell on Tissue Dataset for Histopathology

Mar 24, 2023
Jeongun Ryu, Aaron Valero Puche, JaeWoong Shin, Seonwook Park, Biagio Brattoli, Jinhee Lee, Wonkyung Jung, Soo Ick Cho, Kyunghyun Paeng, Chan-Young Ock, Donggeun Yoo, Sérgio Pereira

Cell detection is a fundamental task in computational pathology that can be used for extracting high-level medical information from whole-slide images. For accurate cell detection, pathologists often zoom out to understand tissue-level structures and zoom in to classify cells based on their morphology and the surrounding context. However, little effort has been made to reflect such pathologist behavior in cell detection models, mainly due to the lack of datasets containing both cell and tissue annotations with overlapping regions. To overcome this limitation, we propose and publicly release OCELOT, a dataset purposely dedicated to the study of cell-tissue relationships for cell detection in histopathology. OCELOT provides overlapping cell and tissue annotations on images acquired from multiple organs. Within this setting, we also propose multi-task learning approaches that benefit from learning the cell and tissue tasks simultaneously. Compared against a model trained only for the cell detection task, our proposed approaches improve cell detection performance on three datasets: the proposed OCELOT, the public TIGER, and the internal CARP datasets. On the OCELOT test set in particular, we show up to a 6.79 improvement in F1-score. We believe the contributions of this paper, including the release of the OCELOT dataset at https://lunit-io.github.io/research/publications/ocelot, are a crucial starting point toward the important research direction of incorporating cell-tissue relationships in computational pathology.

* Accepted for publication at CVPR'23 
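
A minimal multi-task sketch in the spirit of the paper, assuming a shared encoder with a cell-detection head and a tissue-segmentation head supervised jointly on the same overlapping region; the head designs and loss weighting below are illustrative assumptions, not the paper's proposed models.

```python
# Illustrative joint cell/tissue baseline: shared features, two task heads,
# and a simple weighted sum of the two losses. Not the OCELOT architectures.
import torch
import torch.nn as nn

class CellTissueNet(nn.Module):
    def __init__(self, num_cell_classes=2, num_tissue_classes=2):
        super().__init__()
        self.encoder = nn.Sequential(                     # shared feature extractor
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        self.cell_head = nn.Conv2d(64, num_cell_classes, 1)      # per-pixel cell map
        self.tissue_head = nn.Conv2d(64, num_tissue_classes, 1)  # per-pixel tissue map

    def forward(self, x):
        feats = self.encoder(x)
        return self.cell_head(feats), self.tissue_head(feats)

def joint_loss(cell_logits, tissue_logits, cell_target, tissue_target, w_tissue=1.0):
    # Both tasks are supervised on the same (overlapping) region.
    cell_loss = nn.functional.cross_entropy(cell_logits, cell_target)
    tissue_loss = nn.functional.cross_entropy(tissue_logits, tissue_target)
    return cell_loss + w_tissue * tissue_loss

model = CellTissueNet()
x = torch.randn(1, 3, 128, 128)                   # an image patch covering both annotation types
cell_logits, tissue_logits = model(x)
```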

Benchmarking Self-Supervised Learning on Diverse Pathology Datasets

Dec 09, 2022
Mingu Kang, Heon Song, Seonwook Park, Donggeun Yoo, Sérgio Pereira

Computational pathology can help save human lives, but models are annotation-hungry and pathology images are notoriously expensive to annotate. Self-supervised learning (SSL) has been shown to be an effective method for utilizing unlabeled data, and its application to pathology could greatly benefit downstream tasks. Yet, there are no principled studies that compare SSL methods and discuss how to adapt them for pathology. To address this need, we execute the largest-scale study of SSL pre-training on pathology image data to date. Our study is conducted using 4 representative SSL methods on diverse downstream tasks. We establish that large-scale domain-aligned pre-training in pathology consistently outperforms ImageNet pre-training in standard SSL settings such as linear and fine-tuning evaluations, as well as in low-label regimes. Moreover, we propose a set of domain-specific techniques that we experimentally show lead to a performance boost. Lastly, for the first time, we apply SSL to the challenging task of nuclei instance segmentation and show large and consistent performance improvements under diverse settings.
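
For reference, this is what the standard linear-evaluation protocol mentioned above looks like: the SSL-pretrained backbone is frozen and only a linear classifier is trained on labeled pathology patches. The backbone, optimizer, and label set below are placeholder assumptions, not the paper's exact setup.

```python
# Minimal linear-evaluation sketch: frozen pre-trained backbone, trainable
# linear head. The ResNet-50 / SGD choices are assumptions for illustration.
import torch
import torch.nn as nn
import torchvision.models as models

backbone = models.resnet50(weights=None)          # load SSL-pretrained weights here
feat_dim = backbone.fc.in_features
backbone.fc = nn.Identity()
for p in backbone.parameters():                   # linear eval: backbone stays frozen
    p.requires_grad = False

linear_head = nn.Linear(feat_dim, 2)              # e.g. tumor vs. non-tumor patches (assumed labels)
optimizer = torch.optim.SGD(linear_head.parameters(), lr=0.1, momentum=0.9)

def train_step(images, labels):
    backbone.eval()
    with torch.no_grad():
        feats = backbone(images)                  # frozen features
    logits = linear_head(feats)
    loss = nn.functional.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

loss = train_step(torch.randn(8, 3, 224, 224), torch.randint(0, 2, (8,)))
```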

Interactive Multi-Class Tiny-Object Detection

Mar 29, 2022
Chunggi Lee, Seonwook Park, Heon Song, Jeongun Ryu, Sanghoon Kim, Haejoon Kim, Sérgio Pereira, Donggeun Yoo

Annotating tens or hundreds of tiny objects in a given image is laborious yet crucial for a multitude of computer vision tasks. Such imagery typically contains objects from various categories, yet the multi-class interactive annotation setting for the detection task has thus far been unexplored. To address these needs, we propose a novel interactive annotation method for multiple instances of tiny objects from multiple classes, based on a few point-based user inputs. Our approach, C3Det, relates the full image context with annotator inputs in a local and global manner via late-fusion and feature-correlation, respectively. We perform experiments on the Tiny-DOTA and LCell datasets using both two-stage and one-stage object detection architectures to verify the efficacy of our approach. Our approach outperforms existing approaches in interactive annotation, achieving higher mAP with fewer clicks. Furthermore, we validate the annotation efficiency of our approach in a user study, where it is shown to be 2.85x faster and to yield only 0.36x the task load (NASA-TLX, lower is better) compared to manual annotation. The code is available at https://github.com/ChungYi347/Interactive-Multi-Class-Tiny-Object-Detection.
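
As a rough illustration of how point-based, multi-class annotator inputs can be combined with image features, the sketch below encodes clicks as per-class heatmaps and late-fuses them with backbone features; the specific encoding and fusion are assumptions, not the C3Det design.

```python
# Illustrative encoding of multi-class point inputs and a simple late-fusion
# with image features. Not the paper's exact modules.
import torch
import torch.nn as nn

def clicks_to_heatmaps(clicks, num_classes, height, width):
    """clicks: list of (y, x, class_id) annotator points for one image."""
    maps = torch.zeros(num_classes, height, width)
    for y, x, c in clicks:
        maps[c, y, x] = 1.0                       # could also splat a small Gaussian here
    return maps

class LateFusion(nn.Module):
    def __init__(self, feat_channels, num_classes):
        super().__init__()
        self.click_encoder = nn.Conv2d(num_classes, feat_channels, 3, padding=1)
        self.fuse = nn.Conv2d(2 * feat_channels, feat_channels, 1)

    def forward(self, image_feats, click_maps):
        # click_maps are assumed to be at the feature resolution already.
        click_feats = self.click_encoder(click_maps)
        return self.fuse(torch.cat([image_feats, click_feats], dim=1))

fusion = LateFusion(feat_channels=64, num_classes=5)
clicks = [(10, 20, 1), (30, 40, 3)]               # hypothetical annotator clicks
maps = clicks_to_heatmaps(clicks, num_classes=5, height=64, width=64).unsqueeze(0)
fused = fusion(torch.randn(1, 64, 64, 64), maps)  # (1, 64, 64, 64)
```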

Weakly-Supervised Physically Unconstrained Gaze Estimation

May 20, 2021
Rakshit Kothari, Shalini De Mello, Umar Iqbal, Wonmin Byeon, Seonwook Park, Jan Kautz

A major challenge for physically unconstrained gaze estimation is acquiring training data with 3D gaze annotations for in-the-wild and outdoor scenarios. In contrast, videos of human interactions in unconstrained environments are abundantly available and can be much more easily annotated with frame-level activity labels. In this work, we tackle the previously unexplored problem of weakly-supervised gaze estimation from videos of human interactions. We leverage the insight that strong gaze-related geometric constraints exist when people perform the activity of "looking at each other" (LAEO). To acquire viable 3D gaze supervision from LAEO labels, we propose a training algorithm along with several novel loss functions designed specifically for the task. With weak supervision from the two large-scale CMU Panoptic and AVA-LAEO activity datasets, we show significant improvements in (a) the accuracy of semi-supervised gaze estimation and (b) cross-domain generalization on the state-of-the-art physically unconstrained in-the-wild Gaze360 gaze estimation benchmark. We open-source our code at https://github.com/NVlabs/weakly-supervised-gaze.

* CVPR 2021 (Oral) 
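
The core geometric constraint is simple to state: under LAEO, each person's gaze direction should point toward the other person's eyes. The loss below is a minimal cosine-based rendering of that constraint; it stands in for, rather than reproduces, the paper's loss functions.

```python
# Minimal LAEO-style constraint: penalize the angle between each predicted
# gaze direction and the vector toward the other person's eyes.
import torch
import torch.nn.functional as F

def laeo_loss(gaze_a, gaze_b, eyes_a, eyes_b):
    """gaze_*: predicted unit 3D gaze directions (B, 3);
       eyes_*: estimated 3D eye positions (B, 3) in the same camera frame."""
    target_a = F.normalize(eyes_b - eyes_a, dim=-1)   # A should look toward B
    target_b = F.normalize(eyes_a - eyes_b, dim=-1)   # B should look toward A
    loss_a = 1.0 - F.cosine_similarity(gaze_a, target_a, dim=-1)
    loss_b = 1.0 - F.cosine_similarity(gaze_b, target_b, dim=-1)
    return (loss_a + loss_b).mean()

gaze_a = F.normalize(torch.randn(4, 3), dim=-1)   # predicted gaze for person A
gaze_b = F.normalize(torch.randn(4, 3), dim=-1)   # predicted gaze for person B
eyes_a, eyes_b = torch.randn(4, 3), torch.randn(4, 3)
loss = laeo_loss(gaze_a, gaze_b, eyes_a, eyes_b)
```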

Self-Learning Transformations for Improving Gaze and Head Redirection

Oct 23, 2020
Yufeng Zheng, Seonwook Park, Xucong Zhang, Shalini De Mello, Otmar Hilliges

Many computer vision tasks rely on labeled data. Rapid progress in generative modeling has led to the ability to synthesize photorealistic images. However, controlling specific aspects of the generation process such that the data can be used for supervision of downstream tasks remains challenging. In this paper we propose a novel generative model for images of faces that is capable of producing high-quality images under fine-grained control over eye gaze and head orientation angles. This requires disentangling many appearance-related factors, including not only gaze and head orientation but also lighting, hue, etc. We propose a novel architecture which learns to discover, disentangle and encode these extraneous variations in a self-learned manner. We further show that explicitly disentangling task-irrelevant factors results in more accurate modelling of gaze and head orientation. A novel evaluation scheme shows that our method improves upon the state-of-the-art in redirection accuracy and disentanglement between gaze direction and head orientation changes. Furthermore, we show that in the presence of limited amounts of real-world training data, our method allows for improvements in the downstream task of semi-supervised cross-dataset gaze estimation. Please check our project page at: https://ait.ethz.ch/projects/2020/STED-gaze/

* Accepted at NeurIPS 2020. Check our supplementary video at: https://ait.ethz.ch/projects/2020/STED-gaze/ 
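
The sketch below illustrates the redirection interface described above: an encoder factorizes a face image into explicit gaze/head angles plus remaining appearance factors, and a decoder re-synthesizes the face under new target angles. The module internals are crude placeholders and assumptions, not the proposed architecture.

```python
# High-level redirection interface: encode appearance, swap in target gaze and
# head angles, decode. The encoder/decoder here are toy stand-ins.
import torch
import torch.nn as nn

class Redirector(nn.Module):
    def __init__(self, appearance_dim=512):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 128 * 128, appearance_dim + 4))
        self.decoder = nn.Sequential(nn.Linear(appearance_dim + 4, 3 * 128 * 128), nn.Tanh())

    def forward(self, image, new_gaze, new_head):
        code = self.encoder(image)
        appearance = code[:, :-4]                 # task-irrelevant factors (identity, lighting, ...)
        # Discard the source gaze/head angles and condition on the targets instead.
        cond = torch.cat([appearance, new_gaze, new_head], dim=1)
        return self.decoder(cond).view(-1, 3, 128, 128)

model = Redirector()
img = torch.randn(1, 3, 128, 128)
gaze = torch.tensor([[0.1, -0.2]])                # target pitch/yaw in radians
head = torch.tensor([[0.0, 0.3]])
redirected = model(img, gaze, head)               # (1, 3, 128, 128)
```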

ETH-XGaze: A Large Scale Dataset for Gaze Estimation under Extreme Head Pose and Gaze Variation

Jul 31, 2020
Xucong Zhang, Seonwook Park, Thabo Beeler, Derek Bradley, Siyu Tang, Otmar Hilliges

Gaze estimation is a fundamental task in many applications of computer vision, human-computer interaction and robotics. Many state-of-the-art methods are trained and tested on custom datasets, making comparison across methods challenging. Furthermore, existing gaze estimation datasets have limited head pose and gaze variations, and the evaluations are conducted using different protocols and metrics. In this paper, we propose a new gaze estimation dataset called ETH-XGaze, consisting of over one million high-resolution images of varying gaze under extreme head poses. We collect this dataset from 110 participants with a custom hardware setup including 18 digital SLR cameras and adjustable illumination conditions, and a calibrated system to record ground-truth gaze targets. We show that our dataset can significantly improve the robustness of gaze estimation methods across different head poses and gaze angles. Additionally, we define a standardized experimental protocol and evaluation metric on ETH-XGaze to better unify gaze estimation research going forward. The dataset and benchmark website are available at https://ait.ethz.ch/projects/2020/ETH-XGaze

* Accepted at ECCV 2020 (Spotlight) 
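
A common metric in such standardized gaze evaluation is the angular error between predicted and ground-truth gaze directions. The helper below is a generic implementation under the usual pitch/yaw-to-vector convention and is an assumption, not the benchmark's reference code.

```python
# Generic angular-error metric for gaze given as [pitch, yaw] angles.
import numpy as np

def pitchyaw_to_vector(pitchyaw):
    """pitchyaw: (N, 2) array of [pitch, yaw] in radians -> (N, 3) unit vectors."""
    pitch, yaw = pitchyaw[:, 0], pitchyaw[:, 1]
    return np.stack([
        -np.cos(pitch) * np.sin(yaw),
        -np.sin(pitch),
        -np.cos(pitch) * np.cos(yaw),
    ], axis=1)

def angular_error_deg(pred_pitchyaw, true_pitchyaw):
    a = pitchyaw_to_vector(pred_pitchyaw)
    b = pitchyaw_to_vector(true_pitchyaw)
    cos = np.clip(np.sum(a * b, axis=1), -1.0, 1.0)
    return np.degrees(np.arccos(cos))             # per-sample error in degrees

pred = np.array([[0.05, 0.10]])
gt = np.array([[0.00, 0.12]])
print(angular_error_deg(pred, gt))                # small error, in degrees
```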

Towards End-to-end Video-based Eye-Tracking

Jul 26, 2020
Seonwook Park, Emre Aksan, Xucong Zhang, Otmar Hilliges

Estimating eye gaze from images alone is a challenging task, in large part due to unobservable person-specific factors. Achieving high accuracy typically requires labeled data from test users, which may not be attainable in real applications. We observe that there exists a strong relationship between what users are looking at and the appearance of the user's eyes. In response to this insight, we propose a novel dataset and accompanying method which aims to explicitly learn these semantic and temporal relationships. Our video dataset consists of time-synchronized screen recordings, user-facing camera views, and eye gaze data, which allows for new benchmarks in temporal gaze tracking as well as label-free refinement of gaze. Importantly, we demonstrate that fusing information from the visual stimuli and the eye images can lead to performance similar to literature-reported figures acquired through supervised personalization. Our final method yields significant performance improvements on our proposed EVE dataset, with up to a 28 percent improvement in Point-of-Gaze estimates (resulting in 2.49 degrees of angular error), paving the path towards high-accuracy screen-based eye tracking purely from webcam sensors. The dataset and reference source code are available at https://ait.ethz.ch/projects/2020/EVE

* Accepted at ECCV 2020 
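
The fusion idea can be sketched as two streams, one over the user-facing camera view and one over the screen content, whose features are combined to refine the Point-of-Gaze. The tiny architecture below is illustrative only and is not the EVE reference model.

```python
# Two-stream fusion sketch: eye-image features + screen-content features ->
# refined on-screen Point-of-Gaze. All module sizes are assumptions.
import torch
import torch.nn as nn

class GazeScreenFusion(nn.Module):
    def __init__(self):
        super().__init__()
        self.eye_stream = nn.Sequential(          # encodes the user-facing camera view
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 64),
        )
        self.screen_stream = nn.Sequential(       # encodes the screen-recording frame
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 64),
        )
        self.head = nn.Linear(128, 2)             # refined Point-of-Gaze (x, y)

    def forward(self, eye_frame, screen_frame):
        e = self.eye_stream(eye_frame)
        s = self.screen_stream(screen_frame)
        return self.head(torch.cat([e, s], dim=1))

model = GazeScreenFusion()
pog = model(torch.randn(1, 3, 270, 480), torch.randn(1, 3, 270, 480))  # (1, 2)
```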

Content-Consistent Generation of Realistic Eyes with Style

Nov 08, 2019
Marcel Bühler, Seonwook Park, Shalini De Mello, Xucong Zhang, Otmar Hilliges

Accurately labeled real-world training data can be scarce, and hence recent works adapt, modify or generate images to boost target datasets. However, retaining relevant details from the input data in the generated images is challenging, and failure can be critical to performance on the final task. In this work, we synthesize person-specific eye images that satisfy a given semantic segmentation mask (content) while following the style of a specified person from only a few reference images. We introduce two approaches: (a) one used to win the OpenEDS Synthetic Eye Generation Challenge at ICCV 2019, and (b) a principled approach that solves the problem by injecting style and content information simultaneously at multiple scales. Our implementation is available at https://github.com/mcbuehler/Seg2Eye.

* 4 pages, 4 figures, ICCV Workshop 2019 
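
The multi-scale style-and-content injection can be illustrated with an AdaIN-style modulation block: content features from the segmentation mask are normalized and then scaled and shifted by affine parameters predicted from a style code pooled over the reference images. This is a generic sketch of the mechanism, not the Seg2Eye implementation.

```python
# AdaIN-style injection block: a style code modulates content features via
# predicted per-channel scale/shift; one such block per decoder scale.
import torch
import torch.nn as nn

class StyleInjection(nn.Module):
    def __init__(self, style_dim, channels):
        super().__init__()
        self.affine = nn.Linear(style_dim, 2 * channels)
        self.norm = nn.InstanceNorm2d(channels, affine=False)

    def forward(self, content_feats, style_code):
        gamma, beta = self.affine(style_code).chunk(2, dim=1)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        return gamma * self.norm(content_feats) + beta

inject = StyleInjection(style_dim=128, channels=64)
feats = torch.randn(1, 64, 32, 32)                # content features from the mask encoder
style = torch.randn(1, 128)                       # style code pooled from reference images
modulated = inject(feats, style)                  # (1, 64, 32, 32)
```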

Few-shot Adaptive Gaze Estimation

May 06, 2019
Seonwook Park, Shalini De Mello, Pavlo Molchanov, Umar Iqbal, Otmar Hilliges, Jan Kautz

Inter-personal anatomical differences limit the accuracy of person-independent gaze estimation networks. Yet there is a need to lower gaze errors further to enable applications requiring higher-quality estimates. Further gains can be achieved by personalizing gaze networks, ideally with few calibration samples. However, over-parameterized neural networks are not amenable to learning from few examples as they can quickly over-fit. We embrace these challenges and propose a novel framework for Few-shot Adaptive GaZE Estimation (FAZE) for learning person-specific gaze networks with very few (fewer than 9) calibration samples. FAZE learns a rotation-aware latent representation of gaze via a disentangling encoder-decoder architecture, along with a highly adaptable gaze estimator trained using meta-learning. It is capable of adapting to any new person to yield significant performance gains with as few as 3 samples, achieving state-of-the-art performance of 3.18 degrees on GazeCapture, a 19% improvement over prior art.
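
The few-shot calibration step can be sketched as a short inner-loop adaptation: a lightweight gaze head, meta-learned to adapt quickly, is fine-tuned on a handful of labeled samples for the new person. The feature extractor and head below are placeholders and assumptions, not the FAZE architecture.

```python
# MAML-style inner-loop personalization sketch: adapt a copy of the gaze head
# on k <= 9 calibration samples with a few gradient steps.
import copy
import torch
import torch.nn as nn

def adapt_to_person(gaze_head, features, gaze_labels, steps=5, lr=1e-2):
    """features: (k, D) latent codes for k calibration frames;
       gaze_labels: (k, 2) pitch/yaw ground truth for the same frames."""
    person_head = copy.deepcopy(gaze_head)        # keep the meta-learned weights intact
    opt = torch.optim.SGD(person_head.parameters(), lr=lr)
    for _ in range(steps):
        pred = person_head(features)
        loss = nn.functional.l1_loss(pred, gaze_labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return person_head                            # person-specific gaze estimator

meta_head = nn.Linear(64, 2)                      # maps a latent gaze code to pitch/yaw
calib_feats = torch.randn(3, 64)                  # e.g. 3 calibration samples
calib_gaze = torch.randn(3, 2)
personal_head = adapt_to_person(meta_head, calib_feats, calib_gaze)
```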
