Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Chao Zhang

refer to the report for detailed contributions

Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding

May 14, 2024

Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao(+35 more)

Figure 1 for Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding

Figure 2 for Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding

Figure 3 for Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding

Figure 4 for Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding

Abstract:We present Hunyuan-DiT, a text-to-image diffusion transformer with fine-grained understanding of both English and Chinese. To construct Hunyuan-DiT, we carefully design the transformer structure, text encoder, and positional encoding. We also build from scratch a whole data pipeline to update and evaluate data for iterative model optimization. For fine-grained language understanding, we train a Multimodal Large Language Model to refine the captions of the images. Finally, Hunyuan-DiT can perform multi-turn multimodal dialogue with users, generating and refining images according to the context. Through our holistic human evaluation protocol with more than 50 professional human evaluators, Hunyuan-DiT sets a new state-of-the-art in Chinese-to-image generation compared with other open-source models. Code and pretrained models are publicly available at github.com/Tencent/HunyuanDiT

* Project Page: https://dit.hunyuan.tencent.com/

Via

Access Paper or Ask Questions

BMRetriever: Tuning Large Language Models as Better Biomedical Text Retrievers

Apr 29, 2024

Ran Xu, Wenqi Shi, Yue Yu, Yuchen Zhuang, Yanqiao Zhu, May D. Wang, Joyce C. Ho, Chao Zhang, Carl Yang

Figure 1 for BMRetriever: Tuning Large Language Models as Better Biomedical Text Retrievers

Figure 2 for BMRetriever: Tuning Large Language Models as Better Biomedical Text Retrievers

Figure 3 for BMRetriever: Tuning Large Language Models as Better Biomedical Text Retrievers

Figure 4 for BMRetriever: Tuning Large Language Models as Better Biomedical Text Retrievers

Abstract:Developing effective biomedical retrieval models is important for excelling at knowledge-intensive biomedical tasks but still challenging due to the deficiency of sufficient publicly annotated biomedical data and computational resources. We present BMRetriever, a series of dense retrievers for enhancing biomedical retrieval via unsupervised pre-training on large biomedical corpora, followed by instruction fine-tuning on a combination of labeled datasets and synthetic pairs. Experiments on 5 biomedical tasks across 11 datasets verify BMRetriever's efficacy on various biomedical applications. BMRetriever also exhibits strong parameter efficiency, with the 410M variant outperforming baselines up to 11.7 times larger, and the 2B variant matching the performance of models with over 5B parameters. The training data and model checkpoints are released at \url{https://huggingface.co/BMRetriever} to ensure transparency, reproducibility, and application to new domains.

* Work in progress. The model and data will be uploaded to \url{https://github.com/ritaranx/BMRetriever}

Via

Access Paper or Ask Questions

Frequency-Guided Multi-Level Human Action Anomaly Detection with Normalizing Flows

Apr 26, 2024

Shun Maeda, Chunzhi Gu, Jun Yu, Shogo Tokai, Shangce Gao, Chao Zhang

Figure 1 for Frequency-Guided Multi-Level Human Action Anomaly Detection with Normalizing Flows

Figure 2 for Frequency-Guided Multi-Level Human Action Anomaly Detection with Normalizing Flows

Figure 3 for Frequency-Guided Multi-Level Human Action Anomaly Detection with Normalizing Flows

Figure 4 for Frequency-Guided Multi-Level Human Action Anomaly Detection with Normalizing Flows

Abstract:We introduce the task of human action anomaly detection (HAAD), which aims to identify anomalous motions in an unsupervised manner given only the pre-determined normal category of training action samples. Compared to prior human-related anomaly detection tasks which primarily focus on unusual events from videos, HAAD involves the learning of specific action labels to recognize semantically anomalous human behaviors. To address this task, we propose a normalizing flow (NF)-based detection framework where the sample likelihood is effectively leveraged to indicate anomalies. As action anomalies often occur in some specific body parts, in addition to the full-body action feature learning, we incorporate extra encoding streams into our framework for a finer modeling of body subsets. Our framework is thus multi-level to jointly discover global and local motion anomalies. Furthermore, to show awareness of the potentially jittery data during recording, we resort to discrete cosine transformation by converting the action samples from the temporal to the frequency domain to mitigate the issue of data instability. Extensive experimental results on two human action datasets demonstrate that our method outperforms the baselines formed by adapting state-of-the-art human activity AD approaches to our task of HAAD.

Via

Access Paper or Ask Questions

Bayesian Example Selection Improves In-Context Learning for Speech, Text, and Visual Modalities

Apr 23, 2024

Siyin Wang, Chao-Han Huck Yang, Ji Wu, Chao Zhang

Abstract:Large language models (LLMs) can adapt to new tasks through in-context learning (ICL) based on a few examples presented in dialogue history without any model parameter update. Despite such convenience, the performance of ICL heavily depends on the quality of the in-context examples presented, which makes the in-context example selection approach a critical choice. This paper proposes a novel Bayesian in-Context example Selection method (ByCS) for ICL. Extending the inference probability conditioned on in-context examples based on Bayes' theorem, ByCS focuses on the inverse inference conditioned on test input. Following the assumption that accurate inverse inference probability (likelihood) will result in accurate inference probability (posterior), in-context examples are selected based on their inverse inference results. Diverse and extensive cross-tasking and cross-modality experiments are performed with speech, text, and image examples. Experimental results show the efficacy and robustness of our ByCS method on various models, tasks and modalities.

* 16 pages, 6 figures

Via

Access Paper or Ask Questions

Emerging Advancements in 6G NTN Radio Access Technologies: An Overview

Apr 22, 2024

Husnain Shahid, Carla Amatetti, Riccardo Campana, Sorya Tong, Dorin Panaitopol, Alessandro Vanelli Coralli, Abdelhamed Mohamed, Chao Zhang, Ebraam Khalifa, Eduardo Medeiros(+14 more)

Figure 1 for Emerging Advancements in 6G NTN Radio Access Technologies: An Overview

Figure 2 for Emerging Advancements in 6G NTN Radio Access Technologies: An Overview

Figure 3 for Emerging Advancements in 6G NTN Radio Access Technologies: An Overview

Figure 4 for Emerging Advancements in 6G NTN Radio Access Technologies: An Overview

Abstract:The efforts on the development, standardization and improvements to communication systems towards 5G Advanced and 6G are on track to provide benefits such as an unprecedented level of connectivity and performance, enabling a diverse range of vertical services. The full integration of non-terrestrial components into 6G plays a pivotal role in realizing this paradigm shift towards ubiquitous communication and global coverage. However, this integration into 6G brings forth a set of its own challenges, particularly in Radio Access Technologies (RATs). To this end, this paper comprehensively discusses those challenges at different levels of RATs and proposes the corresponding potential emerging advancements in the realm of 6G NTN. In particular, the focus is on advancing the prospective aspects of Radio Resource Management (RRM), spectral coexistence in terrestrial and non-terrestrial components and flexible waveform design solutions to combat the impediments. This discussion with a specific focus on emerging advancements in 6G NTN RATs is critical for shaping the next generation networks and potentially relevant in contributing the part in standardization in forthcoming releases

* accepted in 2024 EuCNC and 6G Summit, Antwerp, Belgium, 3_6 June 2024

Via

Access Paper or Ask Questions

Cepstral Analysis Based Artifact Detection, Recognition and Removal for Prefrontal EEG

Apr 12, 2024

Siqi Han, Chao Zhang, Jiaxin Lei, Qingquan Han, Yuhui Du, Anhe Wang, Shuo Bai, Milin Zhang

Abstract:This paper proposes to use cepstrum for artifact detection, recognition and removal in prefrontal EEG. This work focuses on the artifact caused by eye movement. A database containing artifact-free EEG and eye movement contaminated EEG from different subjects is established. A cepstral analysis-based feature extraction with support vector machine (SVM) based classifier is designed to identify the artifacts from the target EEG signals. The proposed method achieves an accuracy of 99.62% on the artifact detection task and a 82.79% accuracy on the 6-category eye movement classification task. A statistical value-based artifact removal method is proposed and evaluated on a public EEG database, where an accuracy improvement of 3.46% is obtained on the 3-category emotion classification task. In order to make a confident decision of each 5s EEG segment, the algorithm requires only 0.66M multiplication operations. Compared to the state-of-the-art approaches in artifact detection and removal, the proposed method features higher detection accuracy and lower computational cost, which makes it a more suitable solution to be integrated into a real-time and artifact robust Brain-Machine Interface (BMI).

* IEEE Transactions on Circuits and Systems II: Express Briefs, 2023
* 5 pages, 4 figures, published by TCAS-II

Via

Access Paper or Ask Questions

Empowering Image Recovery_ A Multi-Attention Approach

Apr 09, 2024

Juan Wen, Yawei Li, Chao Zhang, Weiyan Hou, Radu Timofte, Luc Van Gool

Abstract:We propose Diverse Restormer (DART), a novel image restoration method that effectively integrates information from various sources (long sequences, local and global regions, feature dimensions, and positional dimensions) to address restoration challenges. While Transformer models have demonstrated excellent performance in image restoration due to their self-attention mechanism, they face limitations in complex scenarios. Leveraging recent advancements in Transformers and various attention mechanisms, our method utilizes customized attention mechanisms to enhance overall performance. DART, our novel network architecture, employs windowed attention to mimic the selective focusing mechanism of human eyes. By dynamically adjusting receptive fields, it optimally captures the fundamental features crucial for image resolution reconstruction. Efficiency and performance balance are achieved through the LongIR attention mechanism for long sequence image restoration. Integration of attention mechanisms across feature and positional dimensions further enhances the recovery of fine details. Evaluation across five restoration tasks consistently positions DART at the forefront. Upon acceptance, we commit to providing publicly accessible code and models to ensure reproducibility and facilitate further research.

* 12 pages, 10 figures, 12 tables

Via

Access Paper or Ask Questions

Semantic Map-based Generation of Navigation Instructions

Mar 28, 2024

Chengzu Li, Chao Zhang, Simone Teufel, Rama Sanand Doddipatla, Svetlana Stoyanchev

Figure 1 for Semantic Map-based Generation of Navigation Instructions

Figure 2 for Semantic Map-based Generation of Navigation Instructions

Figure 3 for Semantic Map-based Generation of Navigation Instructions

Figure 4 for Semantic Map-based Generation of Navigation Instructions

Abstract:We are interested in the generation of navigation instructions, either in their own right or as training material for robotic navigation task. In this paper, we propose a new approach to navigation instruction generation by framing the problem as an image captioning task using semantic maps as visual input. Conventional approaches employ a sequence of panorama images to generate navigation instructions. Semantic maps abstract away from visual details and fuse the information in multiple panorama images into a single top-down representation, thereby reducing computational complexity to process the input. We present a benchmark dataset for instruction generation using semantic maps, propose an initial model and ask human subjects to manually assess the quality of generated instructions. Our initial investigations show promise in using semantic maps for instruction generation instead of a sequence of panorama images, but there is vast scope for improvement. We release the code for data preparation and model training at https://github.com/chengzu-li/VLGen.

* 5 pages, 2 figures, 3 tables (13 pages, 3 figures, 5 tables including references and appendices), accepted at LREC-COLING 2024

Via

Access Paper or Ask Questions

Leveraging Large Language Model to Generate a Novel Metaheuristic Algorithm with CRISPE Framework

Mar 25, 2024

Rui Zhong, Yuefeng Xu, Chao Zhang, Jun Yu

Figure 1 for Leveraging Large Language Model to Generate a Novel Metaheuristic Algorithm with CRISPE Framework

Figure 2 for Leveraging Large Language Model to Generate a Novel Metaheuristic Algorithm with CRISPE Framework

Figure 3 for Leveraging Large Language Model to Generate a Novel Metaheuristic Algorithm with CRISPE Framework

Figure 4 for Leveraging Large Language Model to Generate a Novel Metaheuristic Algorithm with CRISPE Framework

Abstract:In this paper, we borrow the large language model (LLM) ChatGPT-3.5 to automatically and quickly design a new metaheuristic algorithm (MA) with only a small amount of input. The novel animal-inspired MA named zoological search optimization (ZSO) draws inspiration from the collective behaviors of animals for solving continuous optimization problems. Specifically, the basic ZSO algorithm involves two search operators: the prey-predator interaction operator and the social flocking operator to balance exploration and exploitation well. Besides, the standard prompt engineering framework CRISPE (i.e., Capacity and Role, Insight, Statement, Personality, and Experiment) is responsible for the specific prompt design. Furthermore, we designed four variants of the ZSO algorithm with slight human-interacted adjustment. In numerical experiments, we comprehensively investigate the performance of ZSO-derived algorithms on CEC2014 benchmark functions, CEC2022 benchmark functions, and six engineering optimization problems. 20 popular and state-of-the-art MAs are employed as competitors. The experimental results and statistical analysis confirm the efficiency and effectiveness of ZSO-derived algorithms. At the end of this paper, we explore the prospects for the development of the metaheuristics community under the LLM era.

* 24 pages

Via

Access Paper or Ask Questions

M$^3$AV: A Multimodal, Multigenre, and Multipurpose Audio-Visual Academic Lecture Dataset

Mar 21, 2024

Zhe Chen, Heyang Liu, Wenyi Yu, Guangzhi Sun, Hongcheng Liu, Ji Wu, Chao Zhang, Yu Wang, Yanfeng Wang

Abstract:Publishing open-source academic video recordings is an emergent and prevalent approach to sharing knowledge online. Such videos carry rich multimodal information including speech, the facial and body movements of the speakers, as well as the texts and pictures in the slides and possibly even the papers. Although multiple academic video datasets have been constructed and released, few of them support both multimodal content recognition and understanding tasks, which is partially due to the lack of high-quality human annotations. In this paper, we propose a novel multimodal, multigenre, and multipurpose audio-visual academic lecture dataset (M$^3$AV), which has almost 367 hours of videos from five sources covering computer science, mathematics, and medical and biology topics. With high-quality human annotations of the spoken and written words, in particular high-valued name entities, the dataset can be used for multiple audio-visual recognition and understanding tasks. Evaluations performed on contextual speech recognition, speech synthesis, and slide and script generation tasks demonstrate that the diversity of M$^3$AV makes it a challenging dataset.

Via

Access Paper or Ask Questions