Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yi Zhu

Challenges of Zero-Shot Recognition with Vision-Language Models: Granularity and Correctness

Jun 28, 2023

Zhenlin Xu, Yi Zhu, Tiffany Deng, Abhay Mittal, Yanbei Chen, Manchen Wang, Paolo Favaro, Joseph Tighe, Davide Modolo

Figure 1 for Challenges of Zero-Shot Recognition with Vision-Language Models: Granularity and Correctness

Figure 2 for Challenges of Zero-Shot Recognition with Vision-Language Models: Granularity and Correctness

Figure 3 for Challenges of Zero-Shot Recognition with Vision-Language Models: Granularity and Correctness

Figure 4 for Challenges of Zero-Shot Recognition with Vision-Language Models: Granularity and Correctness

Abstract:This paper investigates the challenges of applying vision-language models (VLMs) to zero-shot visual recognition tasks in an open-world setting, with a focus on contrastive vision-language models such as CLIP. We first examine the performance of VLMs on concepts of different granularity levels. We propose a way to fairly evaluate the performance discrepancy under two experimental setups and find that VLMs are better at recognizing fine-grained concepts. Furthermore, we find that the similarity scores from VLMs do not strictly reflect the correctness of the textual inputs given visual input. We propose an evaluation protocol to test our hypothesis that the scores can be biased towards more informative descriptions, and the nature of the similarity score between embedding makes it challenging for VLMs to recognize the correctness between similar but wrong descriptions. Our study highlights the challenges of using VLMs in open-world settings and suggests directions for future research to improve their zero-shot capabilities.

Via

Access Paper or Ask Questions

Clickbait Detection via Large Language Models

Jun 16, 2023

Yi Zhu, Han Wang, Ye Wang, Yun Li, Yunhao Yuan, Jipeng Qiang

Figure 1 for Clickbait Detection via Large Language Models

Figure 2 for Clickbait Detection via Large Language Models

Figure 3 for Clickbait Detection via Large Language Models

Figure 4 for Clickbait Detection via Large Language Models

Abstract:Clickbait, which aims to induce users with some surprising and even thrilling headlines for increasing click-through rates, permeates almost all online content publishers, such as news portals and social media. Recently, Large Language Models (LLMs) have emerged as a powerful instrument and achieved tremendous success in a serious of NLP downstream tasks. However, it is not yet known whether LLMs can be served as a high-quality clickbait detection system. In this paper, we analyze the performance of LLMs in the few-shot scenarios on a number of English and Chinese benchmark datasets. Experimental results show that LLMs cannot achieve the best results compared to the state-of-the-art deep and fine-tuning PLMs methods. Different from the human intuition, the experiments demonstrated that LLMs cannot make satisfied clickbait detection just by the headlines.

Via

Access Paper or Ask Questions

Multi-Modal Wireless Flexible Gel-Free Sensors with Edge Deep Learning for Detecting and Alerting Freezing of Gait in Parkinson's Patients

May 28, 2023

Yuhan Hou, Jack Ji, Yi Zhu, Thomas Dell, Xilin Liu

Abstract:Freezing of gait (FoG) is a debilitating symptom of Parkinson's disease (PD). This work develops flexible wearable sensors that can detect FoG and alert patients and companions to help prevent falls. FoG is detected on the sensors using a deep learning (DL) model with multi-modal sensory inputs collected from distributed wireless sensors. Two types of wireless sensors are developed, including: (1) a C-shape central node placed around the patient's ears, which collects electroencephalogram (EEG), detects FoG using an on-device DL model, and generates auditory alerts when FoG is detected; (2) a stretchable patch-type sensor attached to the patient's legs, which collects electromyography (EMG) and movement information from accelerometers. The patch-type sensors wirelessly send collected data to the central node through low-power ultra-wideband (UWB) transceivers. All sensors are fabricated on flexible printed circuit boards. Adhesive gel-free acetylene carbon black and polydimethylsiloxane electrodes are fabricated on the flexible substrate to allow conformal wear over the long term. Custom integrated circuits (IC) are developed in 180 nm CMOS technology and used in both types of sensors for signal acquisition, digitization, and wireless communication. A novel lightweight DL model is trained using multi-modal sensory data. The inference of the DL model is performed on a low-power microcontroller in the central node. The DL model achieves a high detection sensitivity of 0.81 and a specificity of 0.88. The developed wearable sensors are ready for clinical experiments and hold great promise in improving the quality of life of patients with PD. The proposed design methodologies can be used in wearable medical devices for the monitoring and treatment of a wide range of neurodegenerative diseases.

Via

Access Paper or Ask Questions

Tailoring Instructions to Student's Learning Levels Boosts Knowledge Distillation

May 16, 2023

Yuxin Ren, Zihan Zhong, Xingjian Shi, Yi Zhu, Chun Yuan, Mu Li

Figure 1 for Tailoring Instructions to Student's Learning Levels Boosts Knowledge Distillation

Figure 2 for Tailoring Instructions to Student's Learning Levels Boosts Knowledge Distillation

Figure 3 for Tailoring Instructions to Student's Learning Levels Boosts Knowledge Distillation

Figure 4 for Tailoring Instructions to Student's Learning Levels Boosts Knowledge Distillation

Abstract:It has been commonly observed that a teacher model with superior performance does not necessarily result in a stronger student, highlighting a discrepancy between current teacher training practices and effective knowledge transfer. In order to enhance the guidance of the teacher training process, we introduce the concept of distillation influence to determine the impact of distillation from each training sample on the student's generalization ability. In this paper, we propose Learning Good Teacher Matters (LGTM), an efficient training technique for incorporating distillation influence into the teacher's learning process. By prioritizing samples that are likely to enhance the student's generalization ability, our LGTM outperforms 10 common knowledge distillation baselines on 6 text classification tasks in the GLUE benchmark.

* Accepted at ACL 2023, main conference. Code available at https://github.com/twinkle0331/LGTM

Via

Access Paper or Ask Questions

ParaLS: Lexical Substitution via Pretrained Paraphraser

May 14, 2023

Jipeng Qiang, Kang Liu, Yun Li, Yunhao Yuan, Yi Zhu

Figure 1 for ParaLS: Lexical Substitution via Pretrained Paraphraser

Figure 2 for ParaLS: Lexical Substitution via Pretrained Paraphraser

Figure 3 for ParaLS: Lexical Substitution via Pretrained Paraphraser

Figure 4 for ParaLS: Lexical Substitution via Pretrained Paraphraser

Abstract:Lexical substitution (LS) aims at finding appropriate substitutes for a target word in a sentence. Recently, LS methods based on pretrained language models have made remarkable progress, generating potential substitutes for a target word through analysis of its contextual surroundings. However, these methods tend to overlook the preservation of the sentence's meaning when generating the substitutes. This study explores how to generate the substitute candidates from a paraphraser, as the generated paraphrases from a paraphraser contain variations in word choice and preserve the sentence's meaning. Since we cannot directly generate the substitutes via commonly used decoding strategies, we propose two simple decoding strategies that focus on the variations of the target word during decoding. Experimental results show that our methods outperform state-of-the-art LS methods based on pre-trained language models on three benchmarks.

* ACL 2023

Via

Access Paper or Ask Questions

Towards Medical Artificial General Intelligence via Knowledge-Enhanced Multimodal Pretraining

Apr 26, 2023

Bingqian Lin, Zicong Chen, Mingjie Li, Haokun Lin, Hang Xu, Yi Zhu, Jianzhuang Liu, Wenjia Cai, Lei Yang, Shen Zhao(+6 more)

Figure 1 for Towards Medical Artificial General Intelligence via Knowledge-Enhanced Multimodal Pretraining

Figure 2 for Towards Medical Artificial General Intelligence via Knowledge-Enhanced Multimodal Pretraining

Figure 3 for Towards Medical Artificial General Intelligence via Knowledge-Enhanced Multimodal Pretraining

Figure 4 for Towards Medical Artificial General Intelligence via Knowledge-Enhanced Multimodal Pretraining

Abstract:Medical artificial general intelligence (MAGI) enables one foundation model to solve different medical tasks, which is very practical in the medical domain. It can significantly reduce the requirement of large amounts of task-specific data by sufficiently sharing medical knowledge among different tasks. However, due to the challenges of designing strongly generalizable models with limited and complex medical data, most existing approaches tend to develop task-specific models. To take a step towards MAGI, we propose a new paradigm called Medical-knOwledge-enhanced mulTimOdal pretRaining (MOTOR). In MOTOR, we combine two kinds of basic medical knowledge, i.e., general and specific knowledge, in a complementary manner to boost the general pretraining process. As a result, the foundation model with comprehensive basic knowledge can learn compact representations from pretraining radiographic data for better cross-modal alignment. MOTOR unifies the understanding and generation, which are two kinds of core intelligence of an AI system, into a single medical foundation model, to flexibly handle more diverse medical tasks. To enable a comprehensive evaluation and facilitate further research, we construct a medical multimodal benchmark including a wide range of downstream tasks, such as chest x-ray report generation and medical visual question answering. Extensive experiments on our benchmark show that MOTOR obtains promising results through simple task-oriented adaptation. The visualization shows that the injected knowledge successfully highlights key information in the medical data, demonstrating the excellent interpretability of MOTOR. Our MOTOR successfully mimics the human practice of fulfilling a "medical student" to accelerate the process of becoming a "specialist". We believe that our work makes a significant stride in realizing MAGI.

* Project page: https://github.com/chenzcv7/MOTOR

Via

Access Paper or Ask Questions

Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition

Apr 10, 2023

Shuhuai Ren, Aston Zhang, Yi Zhu, Shuai Zhang, Shuai Zheng, Mu Li, Alex Smola, Xu Sun

Figure 1 for Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition

Figure 2 for Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition

Figure 3 for Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition

Figure 4 for Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition

Abstract:This work proposes POMP, a prompt pre-training method for vision-language models. Being memory and computation efficient, POMP enables the learned prompt to condense semantic information for a rich set of visual concepts with over twenty-thousand classes. Once pre-trained, the prompt with a strong transferable ability can be directly plugged into a variety of visual recognition tasks including image classification, semantic segmentation, and object detection, to boost recognition performances in a zero-shot manner. Empirical evaluation shows that POMP achieves state-of-the-art performances on 21 downstream datasets, e.g., 67.0% average accuracy on 10 classification dataset (+3.1% compared to CoOp) and 84.4 hIoU on open-vocabulary Pascal VOC segmentation (+6.9 compared to ZSSeg).

* Code is available at https://github.com/amazon-science/prompt-pretraining

Via

Access Paper or Ask Questions

On the Impact of Voice Anonymization on Speech-Based COVID-19 Detection

Apr 05, 2023

Yi Zhu, Mohamed Imoussaïne-Aïkous, Carolyn Côté-Lussier, Tiago H. Falk

Figure 1 for On the Impact of Voice Anonymization on Speech-Based COVID-19 Detection

Figure 2 for On the Impact of Voice Anonymization on Speech-Based COVID-19 Detection

Figure 3 for On the Impact of Voice Anonymization on Speech-Based COVID-19 Detection

Figure 4 for On the Impact of Voice Anonymization on Speech-Based COVID-19 Detection

Abstract:With advances seen in deep learning, voice-based applications are burgeoning, ranging from personal assistants, affective computing, to remote disease diagnostics. As the voice contains both linguistic and paralinguistic information (e.g., vocal pitch, intonation, speech rate, loudness), there is growing interest in voice anonymization to preserve speaker privacy and identity. Voice privacy challenges have emerged over the last few years and focus has been placed on removing speaker identity while keeping linguistic content intact. For affective computing and disease monitoring applications, however, the paralinguistic content may be more critical. Unfortunately, the effects that anonymization may have on these systems are still largely unknown. In this paper, we fill this gap and focus on one particular health monitoring application: speech-based COVID-19 diagnosis. We test two popular anonymization methods and their impact on five different state-of-the-art COVID-19 diagnostic systems using three public datasets. We validate the effectiveness of the anonymization methods, compare their computational complexity, and quantify the impact across different testing scenarios for both within- and across-dataset conditions. Lastly, we show the benefits of anonymization as a data augmentation tool to help recover some of the COVID-19 diagnostic accuracy loss seen with anonymized data.

* 11 pages, 10 figures

Via

Access Paper or Ask Questions

Sentence Simplification via Large Language Models

Feb 23, 2023

Yutao Feng, Jipeng Qiang, Yun Li, Yunhao Yuan, Yi Zhu

Figure 1 for Sentence Simplification via Large Language Models

Figure 2 for Sentence Simplification via Large Language Models

Figure 3 for Sentence Simplification via Large Language Models

Figure 4 for Sentence Simplification via Large Language Models

Abstract:Sentence Simplification aims to rephrase complex sentences into simpler sentences while retaining original meaning. Large Language models (LLMs) have demonstrated the ability to perform a variety of natural language processing tasks. However, it is not yet known whether LLMs can be served as a high-quality sentence simplification system. In this work, we empirically analyze the zero-/few-shot learning ability of LLMs by evaluating them on a number of benchmark test sets. Experimental results show LLMs outperform state-of-the-art sentence simplification methods, and are judged to be on a par with human annotators.

Via

Access Paper or Ask Questions

Actional Atomic-Concept Learning for Demystifying Vision-Language Navigation

Feb 13, 2023

Bingqian Lin, Yi Zhu, Xiaodan Liang, Liang Lin, Jianzhuang Liu

Abstract:Vision-Language Navigation (VLN) is a challenging task which requires an agent to align complex visual observations to language instructions to reach the goal position. Most existing VLN agents directly learn to align the raw directional features and visual features trained using one-hot labels to linguistic instruction features. However, the big semantic gap among these multi-modal inputs makes the alignment difficult and therefore limits the navigation performance. In this paper, we propose Actional Atomic-Concept Learning (AACL), which maps visual observations to actional atomic concepts for facilitating the alignment. Specifically, an actional atomic concept is a natural language phrase containing an atomic action and an object, e.g., ``go up stairs''. These actional atomic concepts, which serve as the bridge between observations and instructions, can effectively mitigate the semantic gap and simplify the alignment. AACL contains three core components: 1) a concept mapping module to map the observations to the actional atomic concept representations through the VLN environment and the recently proposed Contrastive Language-Image Pretraining (CLIP) model, 2) a concept refining adapter to encourage more instruction-oriented object concept extraction by re-ranking the predicted object concepts by CLIP, and 3) an observation co-embedding module which utilizes concept representations to regularize the observation representations. Our AACL establishes new state-of-the-art results on both fine-grained (R2R) and high-level (REVERIE and R2R-Last) VLN benchmarks. Moreover, the visualization shows that AACL significantly improves the interpretability in action decision.

* Accepted by AAAI 2023

Via

Access Paper or Ask Questions