Yinan Zhao


Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP

Oct 09, 2022
Feng Liang, Bichen Wu, Xiaoliang Dai, Kunpeng Li, Yinan Zhao, Hang Zhang, Peizhao Zhang, Peter Vajda, Diana Marculescu

Open-vocabulary semantic segmentation aims to segment an image into semantic regions according to text descriptions, which may not have been seen during training. Recent two-stage methods first generate class-agnostic mask proposals and then leverage pre-trained vision-language models, e.g., CLIP, to classify masked regions. We identify the performance bottleneck of this paradigm to be the pre-trained CLIP model, since it does not perform well on masked images. To address this, we propose to finetune CLIP on a collection of masked image regions and their corresponding text descriptions. We collect training data by mining an existing image-caption dataset (e.g., COCO Captions), using CLIP to match masked image regions to nouns in the image captions. Compared with the more precise and manually annotated segmentation labels with fixed classes (e.g., COCO-Stuff), we find our noisy but diverse dataset can better retain CLIP's generalization ability. Along with finetuning the entire model, we utilize the "blank" areas in masked images using a method we dub mask prompt tuning. Experiments demonstrate that mask prompt tuning brings significant improvement without modifying any weights of CLIP, and that it can further improve a fully finetuned model. In particular, when trained on COCO and evaluated on ADE20K-150, our best model achieves 29.6% mIoU, which is +8.5% higher than the previous state-of-the-art. For the first time, open-vocabulary generalist models match the performance of 2017 supervised specialist models without dataset-specific adaptations.

* Project page: https://jeff-liangf.github.io/projects/ovseg 
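
As a rough illustration of the mask prompt tuning idea described above, the sketch below replaces the tokens of fully masked ("blank") patches with learnable prompt tokens before they enter a frozen CLIP-style transformer. This is a minimal PyTorch sketch under assumed tensor shapes and an assumed one-prompt-per-patch design, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MaskPromptTuning(nn.Module):
    """Replace the tokens of fully masked patches with learnable prompt tokens.

    A minimal sketch: `patch_tokens` are the patch embeddings a ViT-style CLIP
    image encoder would produce, and `patch_is_blank` marks patches that fall
    entirely inside the masked-out ("blank") region.
    """

    def __init__(self, num_patches: int, embed_dim: int = 768):
        super().__init__()
        # One learnable prompt token per patch position (assumed design choice).
        self.prompts = nn.Parameter(torch.zeros(num_patches, embed_dim))
        nn.init.normal_(self.prompts, std=0.02)

    def forward(self, patch_tokens: torch.Tensor, patch_is_blank: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (B, N, D); patch_is_blank: (B, N) boolean
        blank = patch_is_blank.unsqueeze(-1)             # (B, N, 1)
        prompts = self.prompts.unsqueeze(0)              # (1, N, D), broadcast over batch
        # Keep real-content tokens, substitute prompts where the patch is blank.
        return torch.where(blank, prompts.expand_as(patch_tokens), patch_tokens)

# Usage sketch: the modified tokens would then be fed to the frozen CLIP transformer.
mpt = MaskPromptTuning(num_patches=196, embed_dim=768)
tokens = torch.randn(2, 196, 768)
blank = torch.rand(2, 196) > 0.5
print(mpt(tokens, blank).shape)  # torch.Size([2, 196, 768])
```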

Human Behavior Recognition Method Based on CEEMD-ES Radar Selection

Jun 06, 2022
Zhaolin Zhang, Mingqi Song, Wugang Meng, Yuhan Liu, Fengcong Li, Xiang Feng, Yinan Zhao

In recent years, millimeter-wave radar has been widely used to recognize human behavior in medical, security, and other fields. When multiple radars perform detection tasks, the validity of the features captured by each radar is difficult to guarantee, and processing data from multiple radars also incurs considerable time and computational cost. The Complementary Ensemble Empirical Mode Decomposition-Energy Slice (CEEMD-ES) multistatic radar selection method is proposed to address these problems. First, the method decomposes and reconstructs each radar signal according to the difference in reflected echo frequency between the limbs and the trunk of the human body. Then, a radar is selected according to how far its limb-to-trunk echo energy ratio deviates from the theoretical value. Time-domain, frequency-domain, and various entropy features of the selected radar are extracted. Finally, an Extreme Learning Machine (ELM) recognition model with a ReLU kernel is established. Experiments show that the method selects radars effectively and achieves a recognition rate of 98.53% on three kinds of human actions.

* 4 pages, 5 figures 
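
The radar-selection step can be pictured with a short sketch: given each radar's (already decomposed and reconstructed) Doppler spectrum, compute the limb-to-trunk echo energy ratio and pick the radar whose ratio is closest to the theoretical value. The frequency bands and the theoretical ratio below are placeholder values, not numbers from the paper, and the CEEMD decomposition itself is assumed to be done upstream.

```python
import numpy as np

def band_energy(spectrum: np.ndarray, freqs: np.ndarray, f_lo: float, f_hi: float) -> float:
    """Energy of a Doppler spectrum inside a frequency band."""
    band = (np.abs(freqs) >= f_lo) & (np.abs(freqs) < f_hi)
    return float(np.sum(np.abs(spectrum[band]) ** 2))

def select_radar(doppler_spectra, freqs, trunk_band=(0.0, 60.0),
                 limb_band=(60.0, 300.0), theoretical_ratio=0.35):
    """Pick the radar whose limb/trunk echo-energy ratio is closest to a
    theoretical value. Band edges and the theoretical ratio are placeholders."""
    scores = []
    for spec in doppler_spectra:                      # one spectrum per radar
        e_trunk = band_energy(spec, freqs, *trunk_band)
        e_limb = band_energy(spec, freqs, *limb_band)
        ratio = e_limb / (e_trunk + 1e-12)
        scores.append(abs(ratio - theoretical_ratio))
    return int(np.argmin(scores))                     # index of the selected radar

# Usage sketch with synthetic spectra for three radars.
freqs = np.fft.fftfreq(1024, d=1e-3)                  # Hz
spectra = [np.fft.fft(np.random.randn(1024)) for _ in range(3)]
print("selected radar:", select_radar(spectra, freqs))
```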

Generalized Few-Shot Semantic Segmentation: All You Need is Fine-Tuning

Dec 21, 2021
Josh Myers-Dean, Yinan Zhao, Brian Price, Scott Cohen, Danna Gurari

Generalized few-shot semantic segmentation was introduced to move beyond only evaluating few-shot segmentation models on novel classes to include testing their ability to remember base classes. While all current approaches are based on meta-learning, they perform poorly and saturate in learning after observing only a few shots. We propose the first fine-tuning solution and demonstrate that it addresses the saturation problem while achieving state-of-the-art results on two datasets, PASCAL-$5^i$ and COCO-$20^i$. We also show that it outperforms existing methods whether fine-tuning multiple final layers or only the final layer. Finally, we present a triplet loss regularization that shows how to redistribute the balance of performance between novel and base categories so that the gap between them is smaller.

* Includes supplementary materials 
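
The triplet loss regularization can be sketched as follows: sampled pixel features are pulled toward their own class prototype and pushed away from the nearest other prototype, which rebalances performance between base and novel classes. This is a hedged sketch; the pixel sampling strategy, prototype construction, and margin are assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def prototype_triplet_loss(features: torch.Tensor, labels: torch.Tensor,
                           prototypes: torch.Tensor, margin: float = 0.5) -> torch.Tensor:
    """Triplet-style regularizer over sampled pixel features.

    features:   (P, D) per-pixel embeddings sampled from the decoder
    labels:     (P,)   class index of each sampled pixel (base and novel mixed)
    prototypes: (C, D) one prototype (e.g. mean feature) per class
    """
    feats = F.normalize(features, dim=1)
    protos = F.normalize(prototypes, dim=1)
    dists = torch.cdist(feats, protos)                              # (P, C)
    pos = dists[torch.arange(feats.size(0)), labels]                # distance to own prototype
    own = F.one_hot(labels, num_classes=protos.size(0)).bool()
    neg = dists.masked_fill(own, float("inf")).min(dim=1).values    # hardest other prototype
    return F.relu(pos - neg + margin).mean()                        # standard triplet hinge

# Usage sketch: the regularizer would be added to the usual cross-entropy objective.
feats = torch.randn(512, 256)
labels = torch.randint(0, 21, (512,))
protos = torch.randn(21, 256)
print(prototype_triplet_loss(feats, labels, protos).item())
```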

A novel partially-decoupled translational parallel manipulator with symbolic kinematics, singularity identification and workspace determination

Jun 08, 2021
Huiping Shen, Yinan Zhao, Ju Li, Guanglei Wu, Damien Chablat

This paper presents a novel three-degree-of-freedom (3-DOF) translational parallel manipulator (TPM) designed with a topological method for parallel mechanisms (PMs) based on position and orientation characteristic (POC) equations. The proposed PM is composed only of lower-mobility joints and actuated prismatic joints, and three kinematic issues of importance are investigated. The first pertains to geometric modeling of the TPM in connection with its topological characteristics, such as the POC, degree of freedom, and coupling degree, from which its symbolic direct kinematic solutions are readily obtained. Moreover, the decoupled properties of the input-output motions are evaluated directly, without Jacobian analysis. Subsequently, based upon the inverse kinematics, the singular configurations of the TPM are identified, and the singular surfaces are visualized by means of a Gröbner-based elimination operation. Finally, the workspace of the TPM is evaluated with a geometric approach. This 3-DOF TPM features fewer joints and links than the well-known Delta robot, which reduces its structural complexity. Its symbolic direct kinematics and partially decoupled motion will ease path planning and dynamic analysis. The TPM can be used for manufacturing large workpieces.

* Mechanism and Machine Theory, Elsevier, 2021, 164, pp.104388  
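
To illustrate the Gröbner-based elimination used for singularity identification, the toy example below (not the manipulator from the paper) rationalizes the trigonometric terms of a simple loop-closure constraint, appends a toy singularity condition, and uses a lexicographic Gröbner basis in sympy to eliminate the passive variables, leaving the singularity surface expressed in the output coordinates only.

```python
import sympy as sp

# Toy loop-closure of a chain reaching the output point (x, y) through a
# passive angle q, written with c = cos(q), s = sin(q); link lengths are fixed
# numbers here so the Groebner computation stays over the rationals.
x, y, c, s = sp.symbols('x y c s', real=True)
polys = [2*c - x,            # 2*cos(q) = x
         2*s + 1 - y,        # 2*sin(q) + 1 = y
         c**2 + s**2 - 1]    # trigonometric identity

# Toy singularity condition: assume the passive-joint Jacobian loses rank when s = 0.
polys_sing = polys + [s]

# A lexicographic Groebner basis with c, s ordered first eliminates them; the
# generators free of c and s describe the singularity surface in (x, y).
G = sp.groebner(polys_sing, c, s, x, y, order='lex')
surface = [g for g in G.exprs if not g.has(c) and not g.has(s)]
print(surface)   # expect polynomials equivalent to x**2 - 4 = 0 and y - 1 = 0
```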

Objectness-Aware One-Shot Semantic Segmentation

Apr 06, 2020
Yinan Zhao, Brian Price, Scott Cohen, Danna Gurari

While deep convolutional neural networks have led to great progress in image semantic segmentation, they typically require collecting a large number of densely-annotated images for training. Moreover, once trained, the model can only make predictions within a pre-defined set of categories. Therefore, few-shot image semantic segmentation has been explored to learn to segment from only a few annotated examples. In this paper, we tackle the challenging one-shot semantic segmentation problem by taking advantage of objectness. In order to capture prior knowledge of objects and background, we first train an objectness segmentation module which generalizes well to unseen categories. Then we use the objectness module to predict the objects present in the query image, and train an objectness-aware few-shot segmentation model that takes advantage of both the object information and the limited annotations of the unseen category to perform segmentation in the query image. Our method achieves mIoU scores of 57.9% on PASCAL-5i and 22.6% on COCO-20i given only one annotated example of an unseen category, outperforming related baselines overall.
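
The core combination can be sketched in a few lines of PyTorch: a class-agnostic objectness map gates a cosine-similarity map between query features and a support prototype obtained by masked average pooling. The multiplicative fusion and tensor shapes below are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn.functional as F

def objectness_aware_prediction(query_feats, support_feats, support_mask, objectness):
    """Combine class-agnostic objectness with one-shot similarity.

    query_feats:   (B, D, H, W)  query feature map
    support_feats: (B, D, H, W)  support feature map (one annotated example)
    support_mask:  (B, 1, H, W)  binary mask of the unseen class in the support image
    objectness:    (B, 1, H, W)  foreground probability from the objectness module
    """
    # Masked average pooling gives a prototype of the unseen class.
    proto = (support_feats * support_mask).sum(dim=(2, 3)) / (support_mask.sum(dim=(2, 3)) + 1e-6)  # (B, D)
    # Cosine similarity between every query location and the prototype.
    sim = F.cosine_similarity(query_feats, proto[:, :, None, None], dim=1, eps=1e-6)  # (B, H, W)
    sim = sim.clamp(min=0).unsqueeze(1)                                               # (B, 1, H, W)
    # Restrict the similarity map to regions the objectness module deems object-like.
    return sim * objectness                                                           # (B, 1, H, W)

# Usage sketch with random tensors.
q = torch.randn(1, 256, 60, 60)
s = torch.randn(1, 256, 60, 60)
m = (torch.rand(1, 1, 60, 60) > 0.7).float()
obj = torch.rand(1, 1, 60, 60)
print(objectness_aware_prediction(q, s, m, obj).shape)
```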

Assessing Image Quality Issues for Real-World Problems

Mar 30, 2020
Tai-Yin Chiu, Yinan Zhao, Danna Gurari

We introduce a new large-scale dataset that links the assessment of image quality issues to two practical vision tasks: image captioning and visual question answering. First, we identify for 39,181 images taken by people who are blind whether each is of sufficient quality to recognize the content, as well as which quality flaws are observed from six options. These labels serve as a critical foundation for us to make the following contributions: (1) a new problem and algorithms for deciding whether an image is of insufficient quality to recognize the content and so cannot be captioned, (2) a new problem and algorithms for deciding which of six quality flaws an image contains, (3) a new problem and algorithms for deciding whether a visual question is unanswerable due to unrecognizable content versus the content of interest being missing from the field of view, and (4) a novel application for more efficiently creating a large-scale image captioning dataset by automatically deciding whether an image is of insufficient quality and so should not be captioned. We publicly share our datasets and code to facilitate future extensions of this work: https://vizwiz.org.
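
A minimal sketch of the prediction side of contributions (1) and (2): two lightweight heads on shared image features, one for the binary "content unrecognizable / not captionable" decision and one multi-label head over the six quality flaws. The backbone, feature size, and head design are assumptions for illustration.

```python
import torch
import torch.nn as nn

class QualityHeads(nn.Module):
    """Two heads on top of shared image features: a binary 'unrecognizable /
    not captionable' decision and a multi-label classifier over six flaw options."""

    def __init__(self, feat_dim: int = 2048, num_flaws: int = 6):
        super().__init__()
        self.recognizable = nn.Linear(feat_dim, 1)     # is the image content recognizable?
        self.flaws = nn.Linear(feat_dim, num_flaws)    # one logit per quality flaw (multi-label)

    def forward(self, feats: torch.Tensor):
        return {
            "unrecognizable_prob": torch.sigmoid(self.recognizable(feats)).squeeze(1),
            "flaw_probs": torch.sigmoid(self.flaws(feats)),
        }

# Usage sketch: features could come from any pretrained backbone.
heads = QualityHeads()
feats = torch.randn(4, 2048)
out = heads(feats)
print(out["unrecognizable_prob"].shape, out["flaw_probs"].shape)  # (4,), (4, 6)
```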

Captioning Images Taken by People Who Are Blind

Feb 20, 2020
Danna Gurari, Yinan Zhao, Meng Zhang, Nilavra Bhattacharya

While an important problem in the vision community is to design algorithms that can automatically caption images, few publicly available datasets for algorithm development directly address the interests of real users. Observing that people who are blind have relied on (human-based) image captioning services to learn about images they take for nearly a decade, we introduce the first image captioning dataset to represent this real use case. This new dataset, which we call VizWiz-Captions, consists of over 39,000 images taken by people who are blind, each paired with five captions. We analyze this dataset to (1) characterize the typical captions, (2) characterize the diversity of content found in the images, and (3) compare its content to that found in eight popular vision datasets. We also analyze modern image captioning algorithms to identify what makes this new dataset challenging for the vision community. We publicly share the dataset, along with captioning challenge instructions, at https://vizwiz.org
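
For readers who want to pair each image with its five captions, the helper below assumes a COCO-style annotation layout (an "images" list and an "annotations" list with "image_id" and "caption" fields); that schema is an assumption for illustration, not a documented specification of the released files.

```python
import json
from pathlib import Path

def load_image_caption_pairs(annotation_file: str, image_dir: str):
    """Group the captions belonging to each image, VizWiz-Captions style.

    The JSON field names used here are assumed, COCO-style conventions.
    """
    data = json.loads(Path(annotation_file).read_text())
    captions = {}
    for ann in data["annotations"]:
        captions.setdefault(ann["image_id"], []).append(ann["caption"])
    return [
        {"image_path": str(Path(image_dir) / img["file_name"]),
         "captions": captions.get(img["id"], [])}
        for img in data["images"]
    ]

# Usage sketch (paths are placeholders):
# pairs = load_image_caption_pairs("annotations/train.json", "train_images")
# print(pairs[0]["image_path"], len(pairs[0]["captions"]))  # expect 5 captions per image
```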

Unconstrained Foreground Object Search

Aug 10, 2019
Yinan Zhao, Brian Price, Scott Cohen, Danna Gurari

Many people search for foreground objects to use when editing images. While existing methods can retrieve candidates to aid in this, they are constrained to returning objects that belong to a pre-specified semantic class. We instead propose a novel problem of unconstrained foreground object (UFO) search and introduce a solution that supports efficient search by encoding the background image in the same latent space as the candidate foreground objects. A key contribution of our work is a cost-free, scalable approach for creating a large-scale training dataset with a variety of foreground objects of differing semantic categories per image location. Quantitative and human-perception experiments with two diverse datasets demonstrate the advantage of our UFO search solution over related baselines.

* To appear in ICCV 2019 
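
The shared-latent-space retrieval idea can be sketched as follows: one encoder embeds the background image (with the target location marked) and another embeds candidate foreground objects into the same space, and candidates are ranked by cosine similarity. The tiny placeholder encoders below stand in for the paper's networks; only the retrieval mechanics are illustrated.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UFOSearch(nn.Module):
    """Sketch of unconstrained foreground object search via a shared latent space."""

    def __init__(self, dim: int = 128):
        super().__init__()
        self.bg_encoder = nn.Sequential(nn.Conv2d(4, 32, 3, stride=2, padding=1), nn.ReLU(),
                                        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, dim))
        self.fg_encoder = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                                        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, dim))

    def rank(self, background_rgba: torch.Tensor, candidates_rgb: torch.Tensor) -> torch.Tensor:
        """background_rgba: (1, 4, H, W) image plus hole mask; candidates_rgb: (K, 3, h, w)."""
        bg = F.normalize(self.bg_encoder(background_rgba), dim=1)   # (1, D)
        fg = F.normalize(self.fg_encoder(candidates_rgb), dim=1)    # (K, D)
        scores = fg @ bg.t()                                        # (K, 1) cosine similarities
        return torch.argsort(scores.squeeze(1), descending=True)    # candidate indices, best first

# Usage sketch with random tensors.
model = UFOSearch()
order = model.rank(torch.randn(1, 4, 128, 128), torch.randn(10, 3, 64, 64))
print(order[:3])
```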

Predicting How to Distribute Work Between Algorithms and Humans to Segment an Image Batch

Apr 30, 2019
Danna Gurari, Yinan Zhao, Suyog Dutt Jain, Margrit Betke, Kristen Grauman

Foreground object segmentation is a critical step for many image analysis tasks. While automated methods can produce high-quality results, their failures disappoint users in need of practical solutions. We propose a resource allocation framework for predicting how best to allocate a fixed budget of human annotation effort in order to collect higher quality segmentations for a given batch of images and automated methods. The framework is based on a prediction module that estimates the quality of given algorithm-drawn segmentations. We demonstrate the value of the framework for two novel tasks related to predicting how to distribute annotation efforts between algorithms and humans. Specifically, we develop two systems that automatically decide, for a batch of images, when to recruit humans versus computers to create 1) coarse segmentations required to initialize segmentation tools and 2) final, fine-grained segmentations. Experiments demonstrate the advantage of relying on a mix of human and computer efforts over relying on either resource alone for segmenting objects in images coming from three diverse modalities (visible, phase contrast microscopy, and fluorescence microscopy).
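
The allocation step itself is simple once a quality-prediction module exists: rank images by predicted segmentation quality and spend the human budget on the worst-scoring ones. The sketch below shows only that step; the quality predictor is assumed to be provided.

```python
import numpy as np

def allocate_human_effort(predicted_quality, budget):
    """Given a predicted quality score per algorithm-drawn segmentation (higher
    is better) and a budget of human annotations, send the lowest-scoring
    images to humans and keep the algorithm's result for the rest."""
    order = np.argsort(predicted_quality)          # worst predicted quality first
    to_human = set(order[:budget].tolist())
    return ["human" if i in to_human else "algorithm"
            for i in range(len(predicted_quality))]

# Usage sketch: a batch of 8 images with a budget of 3 human annotations.
scores = np.array([0.9, 0.2, 0.75, 0.4, 0.95, 0.1, 0.6, 0.55])
print(allocate_human_effort(scores, budget=3))
# ['algorithm', 'human', 'algorithm', 'human', 'algorithm', 'human', 'algorithm', 'algorithm']
```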

Guided Image Inpainting: Replacing an Image Region by Pulling Content from Another Image

Mar 22, 2018
Yinan Zhao, Brian Price, Scott Cohen, Danna Gurari

Deep generative models have shown success in automatically synthesizing missing image regions using surrounding context. However, users cannot directly decide what content to synthesize with such approaches. We propose an end-to-end network for image inpainting that uses a different image to guide the synthesis of new content to fill the hole. A key challenge addressed by our approach is synthesizing new content in regions where the guidance image and the context of the original image are inconsistent. We conduct four studies that demonstrate our results yield more realistic image inpainting results over seven baselines.
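
A rough sketch of the guided setup: the hole-masked context image and the guidance image are encoded separately, their codes are fused, and a decoder synthesizes the fill, which is composited back into the hole. Layer sizes and the simple concatenation fusion are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class GuidedInpainter(nn.Module):
    """Sketch of guided inpainting with a context encoder, a guidance encoder,
    a shared decoder, and a final composite into the hole region."""

    def __init__(self):
        super().__init__()
        def enc(in_ch):
            return nn.Sequential(nn.Conv2d(in_ch, 64, 4, stride=2, padding=1), nn.ReLU(),
                                 nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU())
        self.context_enc = enc(4)    # masked RGB context + binary hole mask
        self.guidance_enc = enc(3)   # guidance RGB image
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(256, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Sigmoid())

    def forward(self, context_rgb, hole_mask, guidance_rgb):
        masked = context_rgb * (1 - hole_mask)                        # blank out the hole
        z = torch.cat([self.context_enc(torch.cat([masked, hole_mask], dim=1)),
                       self.guidance_enc(guidance_rgb)], dim=1)       # fuse the two codes
        fill = self.decoder(z)
        # Composite: keep the known context, use the synthesized content in the hole.
        return context_rgb * (1 - hole_mask) + fill * hole_mask

# Usage sketch with random tensors.
net = GuidedInpainter()
out = net(torch.rand(1, 3, 128, 128), (torch.rand(1, 1, 128, 128) > 0.8).float(),
          torch.rand(1, 3, 128, 128))
print(out.shape)  # torch.Size([1, 3, 128, 128])
```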
