Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Hang Zhang

2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining

Jan 03, 2025

Wenqi Zhang, Hang Zhang, Xin Li, Jiashuo Sun, Yongliang Shen, Weiming Lu, Deli Zhao, Yueting Zhuang, Lidong Bing

Figure 1 for 2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining

Figure 2 for 2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining

Figure 3 for 2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining

Figure 4 for 2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining

Abstract:Compared to image-text pair data, interleaved corpora enable Vision-Language Models (VLMs) to understand the world more naturally like humans. However, such existing datasets are crawled from webpage, facing challenges like low knowledge density, loose image-text relations, and poor logical coherence between images. On the other hand, the internet hosts vast instructional videos (e.g., online geometry courses) that are widely used by humans to learn foundational subjects, yet these valuable resources remain underexplored in VLM training. In this paper, we introduce a high-quality \textbf{multimodal textbook} corpus with richer foundational knowledge for VLM pretraining. It collects over 2.5 years of instructional videos, totaling 22,000 class hours. We first use an LLM-proposed taxonomy to systematically gather instructional videos. Then we progressively extract and refine visual (keyframes), audio (ASR), and textual knowledge (OCR) from the videos, and organize as an image-text interleaved corpus based on temporal order. Compared to its counterparts, our video-centric textbook offers more coherent context, richer knowledge, and better image-text alignment. Experiments demonstrate its superb pretraining performance, particularly in knowledge- and reasoning-intensive tasks like ScienceQA and MathVista. Moreover, VLMs pre-trained on our textbook exhibit outstanding interleaved context awareness, leveraging visual and textual cues in their few-shot context for task solving~\footnote{Our code are available at \url{https://github.com/DAMO-NLP-SG/multimodal_textbook}}.

* Under review

Via

Access Paper or Ask Questions

AltGen: AI-Driven Alt Text Generation for Enhancing EPUB Accessibility

Dec 30, 2024

Yixian Shen, Hang Zhang, Yanxin Shen, Lun Wang, Chuanqi Shi, Shaoshuai Du, Yiyi Tao

Figure 1 for AltGen: AI-Driven Alt Text Generation for Enhancing EPUB Accessibility

Figure 2 for AltGen: AI-Driven Alt Text Generation for Enhancing EPUB Accessibility

Figure 3 for AltGen: AI-Driven Alt Text Generation for Enhancing EPUB Accessibility

Figure 4 for AltGen: AI-Driven Alt Text Generation for Enhancing EPUB Accessibility

Abstract:Digital accessibility is a cornerstone of inclusive content delivery, yet many EPUB files fail to meet fundamental accessibility standards, particularly in providing descriptive alt text for images. Alt text plays a critical role in enabling visually impaired users to understand visual content through assistive technologies. However, generating high-quality alt text at scale is a resource-intensive process, creating significant challenges for organizations aiming to ensure accessibility compliance. This paper introduces AltGen, a novel AI-driven pipeline designed to automate the generation of alt text for images in EPUB files. By integrating state-of-the-art generative models, including advanced transformer-based architectures, AltGen achieves contextually relevant and linguistically coherent alt text descriptions. The pipeline encompasses multiple stages, starting with data preprocessing to extract and prepare relevant content, followed by visual analysis using computer vision models such as CLIP and ViT. The extracted visual features are enriched with contextual information from surrounding text, enabling the fine-tuned language models to generate descriptive and accurate alt text. Validation of the generated output employs both quantitative metrics, such as cosine similarity and BLEU scores, and qualitative feedback from visually impaired users. Experimental results demonstrate the efficacy of AltGen across diverse datasets, achieving a 97.5% reduction in accessibility errors and high scores in similarity and linguistic fidelity metrics. User studies highlight the practical impact of AltGen, with participants reporting significant improvements in document usability and comprehension. Furthermore, comparative analyses reveal that AltGen outperforms existing approaches in terms of accuracy, relevance, and scalability.

Via

Access Paper or Ask Questions

Comparative Analysis of Listwise Reranking with Large Language Models in Limited-Resource Language Contexts

Dec 28, 2024

Yanxin Shen, Lun Wang, Chuanqi Shi, Shaoshuai Du, Yiyi Tao, Yixian Shen, Hang Zhang

Abstract:Large Language Models (LLMs) have demonstrated significant effectiveness across various NLP tasks, including text ranking. This study assesses the performance of large language models (LLMs) in listwise reranking for limited-resource African languages. We compare proprietary models RankGPT3.5, Rank4o-mini, RankGPTo1-mini and RankClaude-sonnet in cross-lingual contexts. Results indicate that these LLMs significantly outperform traditional baseline methods such as BM25-DT in most evaluation metrics, particularly in nDCG@10 and MRR@100. These findings highlight the potential of LLMs in enhancing reranking tasks for low-resource languages and offer insights into cost-effective solutions.

Via

Access Paper or Ask Questions

Robustness of Large Language Models Against Adversarial Attacks

Dec 22, 2024

Yiyi Tao, Yixian Shen, Hang Zhang, Yanxin Shen, Lun Wang, Chuanqi Shi, Shaoshuai Du

Figure 1 for Robustness of Large Language Models Against Adversarial Attacks

Figure 2 for Robustness of Large Language Models Against Adversarial Attacks

Figure 3 for Robustness of Large Language Models Against Adversarial Attacks

Figure 4 for Robustness of Large Language Models Against Adversarial Attacks

Abstract:The increasing deployment of Large Language Models (LLMs) in various applications necessitates a rigorous evaluation of their robustness against adversarial attacks. In this paper, we present a comprehensive study on the robustness of GPT LLM family. We employ two distinct evaluation methods to assess their resilience. The first method introduce character-level text attack in input prompts, testing the models on three sentiment classification datasets: StanfordNLP/IMDB, Yelp Reviews, and SST-2. The second method involves using jailbreak prompts to challenge the safety mechanisms of the LLMs. Our experiments reveal significant variations in the robustness of these models, demonstrating their varying degrees of vulnerability to both character-level and semantic-level adversarial attacks. These findings underscore the necessity for improved adversarial training and enhanced safety mechanisms to bolster the robustness of LLMs.

Via

Access Paper or Ask Questions

Feedback Regulated Opto-Mechanical Soft Robotic Actuators

Dec 20, 2024

Jianfeng Yang, Haotian Pi, Zixuan Deng, Hongshuang Guo, Wan Shou, Hang Zhang, Hao Zeng

Figure 1 for Feedback Regulated Opto-Mechanical Soft Robotic Actuators

Figure 2 for Feedback Regulated Opto-Mechanical Soft Robotic Actuators

Figure 3 for Feedback Regulated Opto-Mechanical Soft Robotic Actuators

Figure 4 for Feedback Regulated Opto-Mechanical Soft Robotic Actuators

Abstract:Natural organisms can convert environmental stimuli into sensory feedback to regulate their body and realize active adaptivity. However, realizing such a feedback-regulation mechanism in synthetic material systems remains a grand challenge. It is believed that achieving complex feedback mechanisms in responsive materials will pave the way toward autonomous, intelligent structure and actuation without complex electronics. Inspired by living systems, we report a general principle to design and construct such feedback loops in light-responsive materials. Specifically, we design a baffle-actuator mechanism to incorporate programmed feedback into the opto-mechanical responsiveness. By simply addressing the baffle position with respect to the incident light beam, positive and negative feedback are programmed. We demonstrate the transformation of a light-bending strip into a switcher, where the intensity of light determines the energy barrier under positive feedback, realizing multi-stable shape-morphing. By leveraging the negative feedback and associated homeostasis, we demonstrate two soft robots, i.e., a locomotor and a swimmer. Furthermore, we unveil the ubiquity of feedback in light-responsive materials, which provides new insight into self-regulated robotic matters.

Via

Access Paper or Ask Questions

SceneLLM: Implicit Language Reasoning in LLM for Dynamic Scene Graph Generation

Dec 15, 2024

Hang Zhang, Zhuoling Li, Jun Liu

Abstract:Dynamic scenes contain intricate spatio-temporal information, crucial for mobile robots, UAVs, and autonomous driving systems to make informed decisions. Parsing these scenes into semantic triplets <Subject-Predicate-Object> for accurate Scene Graph Generation (SGG) is highly challenging due to the fluctuating spatio-temporal complexity. Inspired by the reasoning capabilities of Large Language Models (LLMs), we propose SceneLLM, a novel framework that leverages LLMs as powerful scene analyzers for dynamic SGG. Our framework introduces a Video-to-Language (V2L) mapping module that transforms video frames into linguistic signals (scene tokens), making the input more comprehensible for LLMs. To better encode spatial information, we devise a Spatial Information Aggregation (SIA) scheme, inspired by the structure of Chinese characters, which encodes spatial data into tokens. Using Optimal Transport (OT), we generate an implicit language signal from the frame-level token sequence that captures the video's spatio-temporal information. To further improve the LLM's ability to process this implicit linguistic input, we apply Low-Rank Adaptation (LoRA) to fine-tune the model. Finally, we use a transformer-based SGG predictor to decode the LLM's reasoning and predict semantic triplets. Our method achieves state-of-the-art results on the Action Genome (AG) benchmark, and extensive experiments show the effectiveness of SceneLLM in understanding and generating accurate dynamic scene graphs.

* 27 pages, 4 figures

Via

Access Paper or Ask Questions

Splats in Splats: Embedding Invisible 3D Watermark within Gaussian Splatting

Dec 04, 2024

Yijia Guo, Wenkai Huang, Yang Li, Gaolei Li, Hang Zhang, Liwen Hu, Jianhua Li, Tiejun Huang, Lei Ma

Figure 1 for Splats in Splats: Embedding Invisible 3D Watermark within Gaussian Splatting

Figure 2 for Splats in Splats: Embedding Invisible 3D Watermark within Gaussian Splatting

Figure 3 for Splats in Splats: Embedding Invisible 3D Watermark within Gaussian Splatting

Figure 4 for Splats in Splats: Embedding Invisible 3D Watermark within Gaussian Splatting

Abstract:3D Gaussian splatting (3DGS) has demonstrated impressive 3D reconstruction performance with explicit scene representations. Given the widespread application of 3DGS in 3D reconstruction and generation tasks, there is an urgent need to protect the copyright of 3DGS assets. However, existing copyright protection techniques for 3DGS overlook the usability of 3D assets, posing challenges for practical deployment. Here we describe WaterGS, the first 3DGS watermarking framework that embeds 3D content in 3DGS itself without modifying any attributes of the vanilla 3DGS. To achieve this, we take a deep insight into spherical harmonics (SH) and devise an importance-graded SH coefficient encryption strategy to embed the hidden SH coefficients. Furthermore, we employ a convolutional autoencoder to establish a mapping between the original Gaussian primitives' opacity and the hidden Gaussian primitives' opacity. Extensive experiments indicate that WaterGS significantly outperforms existing 3D steganography techniques, with 5.31% higher scene fidelity and 3X faster rendering speed, while ensuring security, robustness, and user experience. Codes and data will be released at https://water-gs.github.io.

Via

Access Paper or Ask Questions

CLAP: Unsupervised 3D Representation Learning for Fusion 3D Perception via Curvature Sampling and Prototype Learning

Dec 04, 2024

Runjian Chen, Hang Zhang, Avinash Ravichandran, Wenqi Shao, Alex Wong, Ping Luo

Figure 1 for CLAP: Unsupervised 3D Representation Learning for Fusion 3D Perception via Curvature Sampling and Prototype Learning

Figure 2 for CLAP: Unsupervised 3D Representation Learning for Fusion 3D Perception via Curvature Sampling and Prototype Learning

Figure 3 for CLAP: Unsupervised 3D Representation Learning for Fusion 3D Perception via Curvature Sampling and Prototype Learning

Figure 4 for CLAP: Unsupervised 3D Representation Learning for Fusion 3D Perception via Curvature Sampling and Prototype Learning

Abstract:Unsupervised 3D representation learning via masked-and-reconstruction with differentiable rendering is promising to reduce the labeling burden for fusion 3D perception. However, previous literature conduct pre-training for different modalities separately because of the hight GPU memory consumption. Consequently, the interaction between the two modalities (images and point clouds) is neglected during pre-training. In this paper, we explore joint unsupervised pre-training for fusion 3D perception via differentiable rendering and propose CLAP, short for Curvature sampLing and swApping Prototype assignment prediction. The contributions are three-fold. 1) To overcome the GPU memory consumption problem, we propose Curvature Sampling to sample the more informative points/pixels for pre-training. 2) We propose to use learnable prototypes to represent parts of the scenes in a common feature space and bring the idea of swapping prototype assignment prediction to learn the interaction between the two modalities. 3) To further optimize learnable prototypes, we propose an Expectation-Maximization training scheme to maximize the similarity between embeddings and prototypes, followed by a Gram Matrix Regularization Loss to avoid collapse. Experiment results on NuScenes show that CLAP achieves 300% more performance gain as compared to previous SOTA 3D pre-training method via differentiable rendering. Codes and models will be released.

Via

Access Paper or Ask Questions

Fidelity-Imposed Displacement Editing for the Learn2Reg 2024 SHG-BF Challenge

Oct 28, 2024

Jiacheng Wang, Xiang Chen, Renjiu Hu, Rongguang Wang, Min Liu, Yaonan Wang, Jiazheng Wang, Hao Li, Hang Zhang

Figure 1 for Fidelity-Imposed Displacement Editing for the Learn2Reg 2024 SHG-BF Challenge

Figure 2 for Fidelity-Imposed Displacement Editing for the Learn2Reg 2024 SHG-BF Challenge

Figure 3 for Fidelity-Imposed Displacement Editing for the Learn2Reg 2024 SHG-BF Challenge

Figure 4 for Fidelity-Imposed Displacement Editing for the Learn2Reg 2024 SHG-BF Challenge

Abstract:Co-examination of second-harmonic generation (SHG) and bright-field (BF) microscopy enables the differentiation of tissue components and collagen fibers, aiding the analysis of human breast and pancreatic cancer tissues. However, large discrepancies between SHG and BF images pose challenges for current learning-based registration models in aligning SHG to BF. In this paper, we propose a novel multi-modal registration framework that employs fidelity-imposed displacement editing to address these challenges. The framework integrates batch-wise contrastive learning, feature-based pre-alignment, and instance-level optimization. Experimental results from the Learn2Reg COMULISglobe SHG-BF Challenge validate the effectiveness of our method, securing the 1st place on the online leaderboard.

Via

Access Paper or Ask Questions

Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss

Oct 22, 2024

Zesen Cheng, Hang Zhang, Kehan Li, Sicong Leng, Zhiqiang Hu, Fei Wu, Deli Zhao, Xin Li, Lidong Bing

Figure 1 for Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss

Figure 2 for Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss

Figure 3 for Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss

Figure 4 for Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss

Abstract:Contrastive loss is a powerful approach for representation learning, where larger batch sizes enhance performance by providing more negative samples to better distinguish between similar and dissimilar data. However, scaling batch sizes is constrained by the quadratic growth in GPU memory consumption, primarily due to the full instantiation of the similarity matrix. To address this, we propose a tile-based computation strategy that partitions the contrastive loss calculation into arbitrary small blocks, avoiding full materialization of the similarity matrix. Furthermore, we introduce a multi-level tiling strategy to leverage the hierarchical structure of distributed systems, employing ring-based communication at the GPU level to optimize synchronization and fused kernels at the CUDA core level to reduce I/O overhead. Experimental results show that the proposed method scales batch sizes to unprecedented levels. For instance, it enables contrastive training of a CLIP-ViT-L/14 model with a batch size of 4M or 12M using 8 or 32 A800 80GB without sacrificing any accuracy. Compared to SOTA memory-efficient solutions, it achieves a two-order-of-magnitude reduction in memory while maintaining comparable speed. The code will be made publicly available.

Via

Access Paper or Ask Questions