Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yi Yang

The Hong Kong University of Science and Technology, Hong Kong SAR, China

An Open-World, Diverse, Cross-Spatial-Temporal Benchmark for Dynamic Wild Person Re-Identification

Mar 22, 2024

Lei Zhang, Xiaowei Fu, Fuxiang Huang, Yi Yang, Xinbo Gao

Abstract:Person re-identification (ReID) has made great strides thanks to the data-driven deep learning techniques. However, the existing benchmark datasets lack diversity, and models trained on these data cannot generalize well to dynamic wild scenarios. To meet the goal of improving the explicit generalization of ReID models, we develop a new Open-World, Diverse, Cross-Spatial-Temporal dataset named OWD with several distinct features. 1) Diverse collection scenes: multiple independent open-world and highly dynamic collecting scenes, including streets, intersections, shopping malls, etc. 2) Diverse lighting variations: long time spans from daytime to nighttime with abundant illumination changes. 3) Diverse person status: multiple camera networks in all seasons with normal/adverse weather conditions and diverse pedestrian appearances (e.g., clothes, personal belongings, poses, etc.). 4) Protected privacy: invisible faces for privacy critical applications. To improve the implicit generalization of ReID, we further propose a Latent Domain Expansion (LDE) method to develop the potential of source data, which decouples discriminative identity-relevant and trustworthy domain-relevant features and implicitly enforces domain-randomized identity feature space expansion with richer domain diversity to facilitate domain invariant representations. Our comprehensive evaluations with most benchmark datasets in the community are crucial for progress, although this work is far from the grand goal toward open-world and dynamic wild applications.

* Accepted by IJCV in 2024

Via

Access Paper or Ask Questions

Volumetric Environment Representation for Vision-Language Navigation

Mar 21, 2024

Rui Liu, Wenguan Wang, Yi Yang

Figure 1 for Volumetric Environment Representation for Vision-Language Navigation

Figure 2 for Volumetric Environment Representation for Vision-Language Navigation

Figure 3 for Volumetric Environment Representation for Vision-Language Navigation

Figure 4 for Volumetric Environment Representation for Vision-Language Navigation

Abstract:Vision-language navigation (VLN) requires an agent to navigate through an 3D environment based on visual observations and natural language instructions. It is clear that the pivotal factor for successful navigation lies in the comprehensive scene understanding. Previous VLN agents employ monocular frameworks to extract 2D features of perspective views directly. Though straightforward, they struggle for capturing 3D geometry and semantics, leading to a partial and incomplete environment representation. To achieve a comprehensive 3D representation with fine-grained details, we introduce a Volumetric Environment Representation (VER), which voxelizes the physical world into structured 3D cells. For each cell, VER aggregates multi-view 2D features into such a unified 3D space via 2D-3D sampling. Through coarse-to-fine feature extraction and multi-task learning for VER, our agent predicts 3D occupancy, 3D room layout, and 3D bounding boxes jointly. Based on online collected VERs, our agent performs volume state estimation and builds episodic memory for predicting the next step. Experimental results show our environment representations from multi-task learning lead to evident performance gains on VLN. Our model achieves state-of-the-art performance across VLN benchmarks (R2R, REVERIE, and R4R).

* Accepted at CVPR 2024

Via

Access Paper or Ask Questions

Beyond Surface Similarity: Detecting Subtle Semantic Shifts in Financial Narratives

Mar 21, 2024

Jiaxin Liu, Yi Yang, Kar Yan Tam

Figure 1 for Beyond Surface Similarity: Detecting Subtle Semantic Shifts in Financial Narratives

Figure 2 for Beyond Surface Similarity: Detecting Subtle Semantic Shifts in Financial Narratives

Figure 3 for Beyond Surface Similarity: Detecting Subtle Semantic Shifts in Financial Narratives

Figure 4 for Beyond Surface Similarity: Detecting Subtle Semantic Shifts in Financial Narratives

Abstract:In this paper, we introduce the Financial-STS task, a financial domain-specific NLP task designed to measure the nuanced semantic similarity between pairs of financial narratives. These narratives originate from the financial statements of the same company but correspond to different periods, such as year-over-year comparisons. Measuring the subtle semantic differences between these paired narratives enables market stakeholders to gauge changes over time in the company's financial and operational situations, which is critical for financial decision-making. We find that existing pretrained embedding models and LLM embeddings fall short in discerning these subtle financial narrative shifts. To address this gap, we propose an LLM-augmented pipeline specifically designed for the Financial-STS task. Evaluation on a human-annotated dataset demonstrates that our proposed method outperforms existing methods trained on classic STS tasks and generic LLM embeddings.

Via

Access Paper or Ask Questions

VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models

Mar 10, 2024

Wenhao Wang, Yi Yang

Figure 1 for VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models

Figure 2 for VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models

Figure 3 for VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models

Figure 4 for VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models

Abstract:The arrival of Sora marks a new era for text-to-video diffusion models, bringing significant advancements in video generation and potential applications. However, Sora, as well as other text-to-video diffusion models, highly relies on the prompts, and there is no publicly available dataset featuring a study of text-to-video prompts. In this paper, we introduce VidProM, the first large-scale dataset comprising 1.67 million unique text-to-video prompts from real users. Additionally, the dataset includes 6.69 million videos generated by four state-of-the-art diffusion models and some related data. We initially demonstrate the curation of this large-scale dataset, which is a time-consuming and costly process. Subsequently, we show how the proposed VidProM differs from DiffusionDB, a large-scale prompt-gallery dataset for image generation. Based on the analysis of these prompts, we identify the necessity for a new prompt dataset specifically designed for text-to-video generation and gain insights into the preferences of real users when creating videos. Our large-scale and diverse dataset also inspires many exciting new research areas. For instance, to develop better, more efficient, and safer text-to-video diffusion models, we suggest exploring text-to-video prompt engineering, efficient video generation, and video copy detection for diffusion models. We make the collected dataset VidProM publicly available at GitHub and Hugging Face under the CC-BY- NC 4.0 License.

* Please download the collected dataset from https://github.com/WangWenhao0716/VidProM and https://huggingface.co/datasets/WenhaoWang/VidProM

Via

Access Paper or Ask Questions

DLP-GAN: learning to draw modern Chinese landscape photos with generative adversarial network

Mar 07, 2024

Xiangquan Gui, Binxuan Zhang, Li Li, Yi Yang

Abstract:Chinese landscape painting has a unique and artistic style, and its drawing technique is highly abstract in both the use of color and the realistic representation of objects. Previous methods focus on transferring from modern photos to ancient ink paintings. However, little attention has been paid to translating landscape paintings into modern photos. To solve such problems, in this paper, we (1) propose DLP-GAN (Draw Modern Chinese Landscape Photos with Generative Adversarial Network), an unsupervised cross-domain image translation framework with a novel asymmetric cycle mapping, and (2) introduce a generator based on a dense-fusion module to match different translation directions. Moreover, a dual-consistency loss is proposed to balance the realism and abstraction of model painting. In this way, our model can draw landscape photos and sketches in the modern sense. Finally, based on our collection of modern landscape and sketch datasets, we compare the images generated by our model with other benchmarks. Extensive experiments including user studies show that our model outperforms state-of-the-art methods.

* Neural Computing and Applications, 2023: 1-18
* Corrected typos

Via

Access Paper or Ask Questions

RulePrompt: Weakly Supervised Text Classification with Prompting PLMs and Self-Iterative Logical Rules

Mar 05, 2024

Miaomiao Li, Jiaqi Zhu, Yang Wang, Yi Yang, Yilin Li, Hongan Wang

Figure 1 for RulePrompt: Weakly Supervised Text Classification with Prompting PLMs and Self-Iterative Logical Rules

Figure 2 for RulePrompt: Weakly Supervised Text Classification with Prompting PLMs and Self-Iterative Logical Rules

Figure 3 for RulePrompt: Weakly Supervised Text Classification with Prompting PLMs and Self-Iterative Logical Rules

Figure 4 for RulePrompt: Weakly Supervised Text Classification with Prompting PLMs and Self-Iterative Logical Rules

Abstract:Weakly supervised text classification (WSTC), also called zero-shot or dataless text classification, has attracted increasing attention due to its applicability in classifying a mass of texts within the dynamic and open Web environment, since it requires only a limited set of seed words (label names) for each category instead of labeled data. With the help of recently popular prompting Pre-trained Language Models (PLMs), many studies leveraged manually crafted and/or automatically identified verbalizers to estimate the likelihood of categories, but they failed to differentiate the effects of these category-indicative words, let alone capture their correlations and realize adaptive adjustments according to the unlabeled corpus. In this paper, in order to let the PLM effectively understand each category, we at first propose a novel form of rule-based knowledge using logical expressions to characterize the meanings of categories. Then, we develop a prompting PLM-based approach named RulePrompt for the WSTC task, consisting of a rule mining module and a rule-enhanced pseudo label generation module, plus a self-supervised fine-tuning module to make the PLM align with this task. Within this framework, the inaccurate pseudo labels assigned to texts and the imprecise logical rules associated with categories mutually enhance each other in an alternative manner. That establishes a self-iterative closed loop of knowledge (rule) acquisition and utilization, with seed words serving as the starting point. Extensive experiments validate the effectiveness and robustness of our approach, which markedly outperforms state-of-the-art weakly supervised methods. What is more, our approach yields interpretable category rules, proving its advantage in disambiguating easily-confused categories.

* Accepted by WWW 2024

Via

Access Paper or Ask Questions

ProtChatGPT: Towards Understanding Proteins with Large Language Models

Feb 15, 2024

Chao Wang, Hehe Fan, Ruijie Quan, Yi Yang

Figure 1 for ProtChatGPT: Towards Understanding Proteins with Large Language Models

Figure 2 for ProtChatGPT: Towards Understanding Proteins with Large Language Models

Figure 3 for ProtChatGPT: Towards Understanding Proteins with Large Language Models

Figure 4 for ProtChatGPT: Towards Understanding Proteins with Large Language Models

Abstract:Protein research is crucial in various fundamental disciplines, but understanding their intricate structure-function relationships remains challenging. Recent Large Language Models (LLMs) have made significant strides in comprehending task-specific knowledge, suggesting the potential for ChatGPT-like systems specialized in protein to facilitate basic research. In this work, we introduce ProtChatGPT, which aims at learning and understanding protein structures via natural languages. ProtChatGPT enables users to upload proteins, ask questions, and engage in interactive conversations to produce comprehensive answers. The system comprises protein encoders, a Protein-Language Pertaining Transformer (PLP-former), a projection adapter, and an LLM. The protein first undergoes protein encoders and PLP-former to produce protein embeddings, which are then projected by the adapter to conform with the LLM. The LLM finally combines user questions with projected embeddings to generate informative answers. Experiments show that ProtChatGPT can produce promising responses to proteins and their corresponding questions. We hope that ProtChatGPT could form the basis for further exploration and application in protein research. Code and our pre-trained model will be publicly available.

Via

Access Paper or Ask Questions

Do LLMs Know about Hallucination? An Empirical Investigation of LLM's Hidden States

Feb 15, 2024

Hanyu Duan, Yi Yang, Kar Yan Tam

Abstract:Large Language Models (LLMs) can make up answers that are not real, and this is known as hallucination. This research aims to see if, how, and to what extent LLMs are aware of hallucination. More specifically, we check whether and how an LLM reacts differently in its hidden states when it answers a question right versus when it hallucinates. To do this, we introduce an experimental framework which allows examining LLM's hidden states in different hallucination situations. Building upon this framework, we conduct a series of experiments with language models in the LLaMA family (Touvron et al., 2023). Our empirical findings suggest that LLMs react differently when processing a genuine response versus a fabricated one. We then apply various model interpretation techniques to help understand and explain the findings better. Moreover, informed by the empirical observations, we show great potential of using the guidance derived from LLM's hidden representation space to mitigate hallucination. We believe this work provides insights into how LLMs produce hallucinated answers and how to make them occur less often.

* 9 pages, 8 figures, 2 tables (13 pages, 12 figures, 13 tables including references and appendices)

Via

Access Paper or Ask Questions

HeadStudio: Text to Animatable Head Avatars with 3D Gaussian Splatting

Feb 09, 2024

Zhenglin Zhou, Fan Ma, Hehe Fan, Yi Yang

Abstract:Creating digital avatars from textual prompts has long been a desirable yet challenging task. Despite the promising outcomes obtained through 2D diffusion priors in recent works, current methods face challenges in achieving high-quality and animated avatars effectively. In this paper, we present $\textbf{HeadStudio}$, a novel framework that utilizes 3D Gaussian splatting to generate realistic and animated avatars from text prompts. Our method drives 3D Gaussians semantically to create a flexible and achievable appearance through the intermediate FLAME representation. Specifically, we incorporate the FLAME into both 3D representation and score distillation: 1) FLAME-based 3D Gaussian splatting, driving 3D Gaussian points by rigging each point to a FLAME mesh. 2) FLAME-based score distillation sampling, utilizing FLAME-based fine-grained control signal to guide score distillation from the text prompt. Extensive experiments demonstrate the efficacy of HeadStudio in generating animatable avatars from textual prompts, exhibiting visually appealing appearances. The avatars are capable of rendering high-quality real-time ($\geq 40$ fps) novel views at a resolution of 1024. They can be smoothly controlled by real-world speech and video. We hope that HeadStudio can advance digital avatar creation and that the present method can widely be applied across various domains.

* 9 pages, 8 figures

Via

Access Paper or Ask Questions

MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis

Feb 08, 2024

Dewei Zhou, You Li, Fan Ma, Zongxin Yang, Yi Yang

Figure 1 for MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis

Figure 2 for MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis

Figure 3 for MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis

Figure 4 for MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis

Abstract:We present a Multi-Instance Generation (MIG) task, simultaneously generating multiple instances with diverse controls in one image. Given a set of predefined coordinates and their corresponding descriptions, the task is to ensure that generated instances are accurately at the designated locations and that all instances' attributes adhere to their corresponding description. This broadens the scope of current research on Single-instance generation, elevating it to a more versatile and practical dimension. Inspired by the idea of divide and conquer, we introduce an innovative approach named Multi-Instance Generation Controller (MIGC) to address the challenges of the MIG task. Initially, we break down the MIG task into several subtasks, each involving the shading of a single instance. To ensure precise shading for each instance, we introduce an instance enhancement attention mechanism. Lastly, we aggregate all the shaded instances to provide the necessary information for accurately generating multiple instances in stable diffusion (SD). To evaluate how well generation models perform on the MIG task, we provide a COCO-MIG benchmark along with an evaluation pipeline. Extensive experiments were conducted on the proposed COCO-MIG benchmark, as well as on various commonly used benchmarks. The evaluation results illustrate the exceptional control capabilities of our model in terms of quantity, position, attribute, and interaction.

Via

Access Paper or Ask Questions