Jian Chen

TextLap: Customizing Language Models for Text-to-Layout Planning

Oct 09, 2024

Jian Chen, Ruiyi Zhang, Yufan Zhou, Jennifer Healey, Jiuxiang Gu, Zhiqiang Xu, Changyou Chen

Abstract: Automatic generation of graphical layouts is crucial for many real-world applications, including designing posters, flyers, advertisements, and graphical user interfaces. Given the strong ability of large language models (LLMs) in both natural language understanding and generation, we believe that an LLM can be customized to help people create compelling graphical layouts starting from only the user's text instructions. We call our method TextLap (text-based layout planning). It uses a curated instruction-based layout planning dataset (InsLap) to customize LLMs to act as graphic designers. We demonstrate the effectiveness of TextLap and show that it outperforms strong baselines, including GPT-4-based methods, on image generation and graphical design benchmarks.

* Accepted to EMNLP Findings

Gap Preserving Distillation by Building Bidirectional Mappings with A Dynamic Teacher

Oct 05, 2024

Yong Guo, Shulian Zhang, Haolin Pan, Jing Liu, Yulun Zhang, Jian Chen

Abstract: Knowledge distillation aims to transfer knowledge from a large teacher model to a compact student counterpart, and there is often a significant performance gap between the two. We find that an overly large performance gap can hamper the training process, which has also been verified in recent studies. To address this, we propose a Gap Preserving Distillation (GPD) method that trains an additional dynamic teacher model from scratch alongside the student to bridge this gap. In this way, it becomes possible to maintain a reasonable performance gap between teacher and student throughout the distillation process. To further strengthen distillation from the dynamic teacher to the student, we develop a hard strategy that forces them to share parameters and encourages parameter inheritance. Besides the hard strategy, we also build soft bidirectional mappings between them, based on an Inverse Reparameterization (IR) method and a Channel-Branch Reparameterization (CBR) strategy. We highlight that IR is able to initialize a larger dynamic teacher with an arbitrary expansion ratio while preserving exactly the same accuracy as the given student model. This guarantees that the dynamic teacher and student start from the same point and avoids an overly large gap in the early stage of training. As for CBR, with parameter sharing, it directly extracts an effective student model from the well-learned dynamic teacher without any post-training, making our method highly flexible for model deployment. In the experiments, GPD significantly outperforms existing distillation methods on both CNN and transformer architectures, achieving up to 1.58% accuracy improvement. Interestingly, GPD also generalizes well to scenarios without a pre-trained teacher, including training from scratch and fine-tuning, yielding large improvements of 1.80% and 0.89% on ResNet18, respectively.

* 10 pages for the main paper
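
For readers unfamiliar with the baseline that GPD builds on, the following is a minimal sketch of the conventional soft-label distillation loss: the KL divergence between temperature-softened teacher and student distributions, scaled by the squared temperature. It is not the GPD method described above; the function names, the default temperature of 4.0, and the use of plain Python lists are assumptions made for illustration.

```python
import math


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of logits divided by a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_z for s in scaled]


def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    This is the generic soft-label loss for one example, not GPD itself.
    """
    log_p_s = log_softmax(student_logits, temperature)
    log_p_t = log_softmax(teacher_logits, temperature)
    kl = sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_p_t, log_p_s))
    return kl * temperature ** 2


if __name__ == "__main__":
    # Identical logits give (numerically) zero loss; mismatched logits give a positive loss.
    print(distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
    print(distillation_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0]))
```

The loss is zero only when the student matches the teacher's softened distribution. The abstract's observation is that the training signal from this kind of objective can degrade when the teacher is far stronger than the student.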

MMR: Evaluating Reading Ability of Large Multimodal Models

Aug 26, 2024

Jian Chen, Ruiyi Zhang, Yufan Zhou, Ryan Rossi, Jiuxiang Gu, Changyou Chen

Abstract: Large multimodal models (LMMs) have demonstrated impressive capabilities in understanding various types of images, including text-rich images. Most existing text-rich image benchmarks consist of simple extraction-based question answering, and many LMMs now easily achieve high scores on them. As a result, current benchmarks fail to accurately reflect the performance of different models, and a natural idea is to build a new benchmark to evaluate their complex reasoning and spatial understanding abilities. In this work, we propose the Multi-Modal Reading (MMR) benchmark, comprising 11 diverse tasks, to evaluate LMMs for text-rich image understanding. MMR is the first text-rich image benchmark built on human annotations with the help of language models. By evaluating several state-of-the-art LMMs, including GPT-4o, it reveals the limited capabilities of existing LMMs, underscoring the value of our benchmark.

MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding

Aug 21, 2024

Jian Chen, Vashisth Tiwari, Ranajoy Sadhukhan, Zhuoming Chen, Jinyuan Shi, Ian En-Hsu Yen, Beidi Chen

Abstract: Large Language Models (LLMs) have become more prevalent in long-context applications such as interactive chatbots, document analysis, and agent workflows, but it is challenging to serve long-context requests with low latency and high throughput. Speculative decoding (SD) is a widely used technique to reduce latency without sacrificing performance, but the conventional wisdom suggests that its efficacy is limited to small batch sizes. In MagicDec, we show that, surprisingly, SD can achieve speedup even in a high-throughput inference regime for moderate to long sequences. More interestingly, based on our rigorous analysis, an intelligent drafting strategy can achieve better speedup with increasing batch size. MagicDec first identifies how the bottleneck shifts with increasing batch size and sequence length, and uses these insights to deploy speculative decoding more effectively for high-throughput inference. Then, it leverages draft models with a sparse KV cache to address the KV bottleneck, which scales with both sequence length and batch size. This finding underscores the broad applicability of speculative decoding in long-context serving, as it can enhance throughput and reduce latency without compromising accuracy. For moderate to long sequences, we demonstrate up to 2x speedup for LLaMA-2-7B-32K and 1.84x speedup for LLaMA-3.1-8B when serving batch sizes ranging from 32 to 256 on 8 NVIDIA A100 GPUs. The code is available at https://github.com/Infini-AI-Lab/MagicDec/.
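
As background for the speedup claims, here is a small sketch of the textbook cost model for speculative decoding, not MagicDec's own analysis. It assumes each drafted token is accepted independently with probability `alpha`, that the draft model proposes `gamma` tokens per round, and that verifying them costs the same as one ordinary target-model step; `draft_cost_ratio` (draft step cost divided by target step cost) and both function names are hypothetical.

```python
def expected_tokens_per_step(alpha, gamma):
    """Expected tokens produced per verification round.

    Assumes i.i.d. acceptance with probability alpha for each of the gamma
    drafted tokens; the round always yields at least one token.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def idealized_speedup(alpha, gamma, draft_cost_ratio):
    """Speedup over plain autoregressive decoding under the assumptions above.

    One round costs gamma draft steps plus one target step (in units of a
    target step), and yields expected_tokens_per_step(alpha, gamma) tokens.
    """
    round_cost = gamma * draft_cost_ratio + 1.0
    return expected_tokens_per_step(alpha, gamma) / round_cost


if __name__ == "__main__":
    print(expected_tokens_per_step(0.8, 4))
    print(idealized_speedup(0.8, 4, 0.1))
```

The assumption that verification is as cheap as a single decoding step is the part that depends on batch size and sequence length, which is the regime the abstract says MagicDec analyzes.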

LLaVA-Read: Enhancing Reading Ability of Multimodal Language Models

Jul 27, 2024

Ruiyi Zhang, Yufan Zhou, Jian Chen, Jiuxiang Gu, Changyou Chen, Tong Sun

Abstract: Large multimodal language models have demonstrated impressive capabilities in understanding and manipulating images. However, many of these models struggle to comprehend dense textual content embedded within images, primarily due to limited text recognition and layout understanding abilities. To understand the sources of these limitations, we perform an exploratory analysis showing the drawbacks of classical visual encoders on visual text understanding. Hence, we present LLaVA-Read, a multimodal large language model that utilizes dual visual encoders along with a visual text encoder. Our model surpasses existing state-of-the-art models in various text-rich image understanding tasks, showcasing enhanced comprehension of textual content within images. Together, our findings suggest that visual text understanding remains an open challenge and that an efficient visual text encoder is crucial for future successful multimodal systems.

* NeurIPS 2024 Under Review

BECAUSE: Bilinear Causal Representation for Generalizable Offline Model-based Reinforcement Learning

Jul 15, 2024

Haohong Lin, Wenhao Ding, Jian Chen, Laixi Shi, Jiacheng Zhu, Bo Li, Ding Zhao

Abstract: Offline model-based reinforcement learning (MBRL) enhances data efficiency by utilizing pre-collected datasets to learn models and policies, especially in scenarios where exploration is costly or infeasible. Nevertheless, it often suffers from an objective mismatch between model and policy learning, resulting in inferior performance despite accurate model predictions. This paper first identifies that the primary source of this mismatch is the underlying confounders present in offline data for MBRL. Subsequently, we introduce BilinEar CAUSal rEpresentation (BECAUSE), an algorithm that captures causal representations for both states and actions to reduce the influence of distribution shift, thus mitigating the objective mismatch problem. Comprehensive evaluations on 18 tasks that vary in data quality and environment context demonstrate the superior performance of BECAUSE over existing offline RL algorithms. We show the generalizability and robustness of BECAUSE under fewer samples or larger numbers of confounders. Additionally, we offer a theoretical analysis of BECAUSE to prove its error bound and sample efficiency when integrating causal representation into offline MBRL.

Enhanced Long-Tailed Recognition with Contrastive CutMix Augmentation

Jul 06, 2024

Haolin Pan, Yong Guo, Mianjie Yu, Jian Chen

Abstract: Real-world data often follows a long-tailed distribution, where a few head classes account for most of the data and a large number of tail classes contain only very limited samples. In practice, deep models often generalize poorly on tail classes due to the imbalanced distribution. To tackle this, data augmentation has become an effective remedy, synthesizing new samples for tail classes. One popular approach is CutMix, which explicitly mixes images of tail classes with others while constructing the labels according to the ratio of the areas cropped from the two images. However, the area-based labels entirely ignore the inherent semantic information of the augmented samples, often leading to misleading training signals. To address this issue, we propose Contrastive CutMix (ConCutMix), which constructs augmented samples with semantically consistent labels to boost the performance of long-tailed recognition. Specifically, we compute the similarities between samples in the semantic space learned by contrastive learning, and use them to rectify the area-based labels. Experiments show that ConCutMix significantly improves the accuracy on tail classes as well as the overall performance. For example, based on ResNeXt-50, we improve the overall accuracy on ImageNet-LT by 3.0%, thanks to a significant improvement of 3.3% on tail classes. We highlight that the improvement also generalizes well to other benchmarks and models. Our code and pretrained models are available at https://github.com/PanHaulin/ConCutMix.

* 16 pages and 13 figures
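
To make the area-based labels mentioned in the abstract concrete, the sketch below shows plain CutMix on tiny images represented as nested lists. It is not ConCutMix: the similarity-based label rectification is not reproduced here. The function name, the `(top, left, height, width)` box format, and the fixed (rather than randomly sampled) box are assumptions made for illustration.

```python
def cutmix(img_a, img_b, label_a, label_b, box):
    """Paste a rectangular patch of img_b into img_a and mix labels by area.

    img_a, img_b: 2D lists of equal size; label_a, label_b: one-hot lists.
    box: (top, left, height, width), with top and left inside the image;
    the patch is clipped at the image border.
    """
    top, left, height, width = box
    rows, cols = len(img_a), len(img_a[0])
    bottom = min(top + height, rows)
    right = min(left + width, cols)

    mixed = [row[:] for row in img_a]  # copy so img_a is left untouched
    for r in range(top, bottom):
        for c in range(left, right):
            mixed[r][c] = img_b[r][c]

    # lam is the fraction of the mixed image that still comes from img_a.
    pasted_area = (bottom - top) * (right - left)
    lam = 1.0 - pasted_area / (rows * cols)
    label = [lam * a + (1.0 - lam) * b for a, b in zip(label_a, label_b)]
    return mixed, label


if __name__ == "__main__":
    zeros = [[0] * 4 for _ in range(4)]
    ones = [[1] * 4 for _ in range(4)]
    mixed, label = cutmix(zeros, ones, [1.0, 0.0], [0.0, 1.0], (0, 0, 2, 2))
    print(label)  # a 2x2 patch in a 4x4 image gives [0.75, 0.25]
```

The label depends only on how many pixels were pasted, not on what those pixels show, which is the weakness the abstract points to.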

Graph Neural Networks for Job Shop Scheduling Problems: A Survey

Jun 20, 2024

Igor G. Smit, Jianan Zhou, Robbert Reijnen, Yaoxin Wu, Jian Chen, Cong Zhang, Zaharah Bukhsh, Wim Nuijten, Yingqian Zhang

Abstract: Job shop scheduling problems (JSSPs) represent a critical and challenging class of combinatorial optimization problems. Recent years have witnessed a rapid increase in the application of graph neural networks (GNNs) to solve JSSPs, yet a systematic survey of the relevant literature is lacking. This paper aims to thoroughly review prevailing GNN methods for different types of JSSPs and the closely related flow-shop scheduling problems (FSPs), especially those leveraging deep reinforcement learning (DRL). We begin by presenting the graph representations of various JSSPs, followed by an introduction to the most commonly used GNN architectures. We then review current GNN-based methods for each problem type, highlighting key technical elements such as graph representations, GNN architectures, GNN tasks, and training algorithms. Finally, we summarize and analyze the advantages and limitations of GNNs in solving JSSPs and outline potential future research opportunities. We hope this survey can motivate and inspire more powerful GNN-based approaches to tackling JSSPs and other scheduling problems.
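
One common graph representation of a JSSP instance is the disjunctive graph, sketched below as an illustration; whether the survey uses exactly this encoding is an assumption. Each operation is a node, directed conjunctive arcs link consecutive operations of the same job, and undirected disjunctive edges link operations of different jobs that compete for the same machine. The function name and the `(machine, duration)` input format are hypothetical, and the usual dummy source and sink nodes are omitted for brevity.

```python
from itertools import combinations


def disjunctive_graph(jobs):
    """Build a disjunctive graph from a job shop instance.

    jobs: list of jobs, each an ordered list of (machine, duration) operations.
    Returns (nodes, conjunctive, disjunctive), where nodes maps
    (job index, operation index) to the operation's duration.
    """
    nodes = {}
    conjunctive = []  # directed precedence arcs within a job
    by_machine = {}   # machine id -> operations that need it

    for j, operations in enumerate(jobs):
        for k, (machine, duration) in enumerate(operations):
            nodes[(j, k)] = duration
            by_machine.setdefault(machine, []).append((j, k))
            if k > 0:
                conjunctive.append(((j, k - 1), (j, k)))

    # Undirected edges between operations of different jobs on the same machine.
    disjunctive = [
        (a, b)
        for ops in by_machine.values()
        for a, b in combinations(ops, 2)
        if a[0] != b[0]
    ]
    return nodes, conjunctive, disjunctive


if __name__ == "__main__":
    # Two jobs, two machines: job 0 uses machine 0 then 1, job 1 uses machine 1 then 0.
    instance = [[(0, 3), (1, 2)], [(1, 4), (0, 1)]]
    nodes, conjunctive, disjunctive = disjunctive_graph(instance)
    print(conjunctive)
    print(disjunctive)
```

A schedule corresponds to choosing a direction for every disjunctive edge such that the resulting directed graph is acyclic, and a GNN can operate on the node and edge structure produced here.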

Vision-Language Models Meet Meteorology: Developing Models for Extreme Weather Events Detection with Heatmaps

Jun 14, 2024

Jian Chen, Peilin Zhou, Yining Hua, Dading Chong, Meng Cao, Yaowei Li, Zixuan Yuan, Bing Zhu, Junwei Liang

Abstract: Real-time detection and prediction of extreme weather protect human lives and infrastructure. Traditional methods rely on numerical threshold setting and manual interpretation of weather heatmaps with Geographic Information Systems (GIS), which can be slow and error-prone. Our research redefines Extreme Weather Events Detection (EWED) by framing it as a Visual Question Answering (VQA) problem, thereby introducing a more precise and automated solution. Leveraging Vision-Language Models (VLMs) to simultaneously process visual and textual data, we offer an effective aid for analyzing weather heatmaps. Our initial assessment of general-purpose VLMs (e.g., GPT-4-Vision) on EWED revealed poor performance, characterized by low accuracy and frequent hallucinations due to inadequate color differentiation and insufficient meteorological knowledge. To address these challenges, we introduce ClimateIQA, the first meteorological VQA dataset, which includes 8,760 wind gust heatmaps and 254,040 question-answer pairs covering four question types, both generated from the latest climate reanalysis data. We also propose Sparse Position and Outline Tracking (SPOT), an innovative technique that leverages OpenCV and K-Means clustering to capture and depict color contours in heatmaps, providing ClimateIQA with more accurate color spatial location information. Finally, we present Climate-Zoo, the first meteorological VLM collection, which adapts VLMs to meteorological applications using the ClimateIQA dataset. Experimental results demonstrate that models from Climate-Zoo substantially outperform state-of-the-art general VLMs, achieving an accuracy increase from 0% to over 90% in EWED verification. The datasets and models in this study are publicly available for future climate science research: https://github.com/AlexJJJChen/Climate-Zoo.

TRINS: Towards Multimodal Language Models that Can Read

Jun 10, 2024

Ruiyi Zhang, Yanzhe Zhang, Jian Chen, Yufan Zhou, Jiuxiang Gu, Changyou Chen, Tong Sun

Abstract: Large multimodal language models have shown remarkable proficiency in understanding and editing images. However, a majority of these visually-tuned models struggle to comprehend the textual content embedded in images, primarily due to limitations in the training data. In this work, we introduce TRINS: a Text-Rich image INStruction dataset, with the objective of enhancing the reading ability of multimodal large language models. TRINS is built upon LAION using hybrid data annotation strategies that include machine-assisted and human-assisted annotation processes. It contains 39,153 text-rich images, captions, and 102,437 questions. Specifically, we show that the number of words per annotation in TRINS is significantly larger than that of related datasets, providing new challenges. Furthermore, we introduce a simple and effective architecture, called Language-vision Reading Assistant (LaRA), which is good at understanding textual content within images. LaRA outperforms existing state-of-the-art multimodal large language models on the TRINS dataset, as well as on other classical benchmarks. Lastly, we conduct a comprehensive evaluation with TRINS on various text-rich image understanding and generation tasks, demonstrating its effectiveness.

* CVPR 2024
