Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yawen Zhang

LIGHTHOUSE: Fast and precise distance to shoreline calculations from anywhere on earth

Jun 23, 2025

Patrick Beukema, Henry Herzog, Yawen Zhang, Hunter Pitelka, Favyen Bastani

Abstract:We introduce a new dataset and algorithm for fast and efficient coastal distance calculations from Anywhere on Earth (AoE). Existing global coastal datasets are only available at coarse resolution (e.g. 1-4 km) which limits their utility. Publicly available satellite imagery combined with computer vision enable much higher precision. We provide a global coastline dataset at 10 meter resolution, a 100+ fold improvement in precision over existing data. To handle the computational challenge of querying at such an increased scale, we introduce a new library: Layered Iterative Geospatial Hierarchical Terrain-Oriented Unified Search Engine (Lighthouse). Lighthouse is both exceptionally fast and resource-efficient, requiring only 1 CPU and 2 GB of RAM to achieve millisecond online inference, making it well suited for real-time applications in resource-constrained environments.

* 8 pages, 7 figures, 1 table, ICML 2025 ML4RS

Via

Access Paper or Ask Questions

Atlantes: A system of GPS transformers for global-scale real-time maritime intelligence

Apr 26, 2025

Henry Herzog, Joshua Hansen, Yawen Zhang, Patrick Beukema

Abstract:Unsustainable exploitation of the oceans exacerbated by global warming is threatening coastal communities worldwide. Accurate and timely monitoring of maritime activity is an essential step to effective governance and to inform future policy. In support of this complex global-scale effort, we built Atlantes, a deep learning based system that provides the first-ever real-time view of vessel behavior at global scale. Atlantes leverages a series of bespoke transformers to distill a high volume, continuous stream of GPS messages emitted by hundreds of thousands of vessels into easily quantifiable behaviors. The combination of low latency and high performance enables operationally relevant decision-making and successful interventions on the high seas where illegal and exploitative activity is too common. Atlantes is already in use by hundreds of organizations worldwide. Here we provide an overview of the model and infrastructure that enables this system to function efficiently and cost-effectively at global-scale and in real-time.

* 8 pages, 10 figures, ICLR CCAI 2025, spotlight talk

Via

Access Paper or Ask Questions

IM-RAG: Multi-Round Retrieval-Augmented Generation Through Learning Inner Monologues

May 15, 2024

Diji Yang, Jinmeng Rao, Kezhen Chen, Xiaoyuan Guo, Yawen Zhang, Jie Yang, Yi Zhang

Figure 1 for IM-RAG: Multi-Round Retrieval-Augmented Generation Through Learning Inner Monologues

Figure 2 for IM-RAG: Multi-Round Retrieval-Augmented Generation Through Learning Inner Monologues

Figure 3 for IM-RAG: Multi-Round Retrieval-Augmented Generation Through Learning Inner Monologues

Figure 4 for IM-RAG: Multi-Round Retrieval-Augmented Generation Through Learning Inner Monologues

Abstract:Although the Retrieval-Augmented Generation (RAG) paradigms can use external knowledge to enhance and ground the outputs of Large Language Models (LLMs) to mitigate generative hallucinations and static knowledge base problems, they still suffer from limited flexibility in adopting Information Retrieval (IR) systems with varying capabilities, constrained interpretability during the multi-round retrieval process, and a lack of end-to-end optimization. To address these challenges, we propose a novel LLM-centric approach, IM-RAG, that integrates IR systems with LLMs to support multi-round RAG through learning Inner Monologues (IM, i.e., the human inner voice that narrates one's thoughts). During the IM process, the LLM serves as the core reasoning model (i.e., Reasoner) to either propose queries to collect more information via the Retriever or to provide a final answer based on the conversational context. We also introduce a Refiner that improves the outputs from the Retriever, effectively bridging the gap between the Reasoner and IR modules with varying capabilities and fostering multi-round communications. The entire IM process is optimized via Reinforcement Learning (RL) where a Progress Tracker is incorporated to provide mid-step rewards, and the answer prediction is further separately optimized via Supervised Fine-Tuning (SFT). We conduct extensive experiments with the HotPotQA dataset, a popular benchmark for retrieval-based, multi-step question-answering. The results show that our approach achieves state-of-the-art (SOTA) performance while providing high flexibility in integrating IR modules as well as strong interpretability exhibited in the learned inner monologues.

* Proceedings of the 47th International ACM SIGIR 2024

Via

Access Paper or Ask Questions

Higher Layers Need More LoRA Experts

Feb 13, 2024

Chongyang Gao, Kezhen Chen, Jinmeng Rao, Baochen Sun, Ruibo Liu, Daiyi Peng, Yawen Zhang, Xiaoyuan Guo, Jie Yang, VS Subrahmanian

Figure 1 for Higher Layers Need More LoRA Experts

Figure 2 for Higher Layers Need More LoRA Experts

Figure 3 for Higher Layers Need More LoRA Experts

Figure 4 for Higher Layers Need More LoRA Experts

Abstract:Parameter-efficient tuning (PEFT) techniques like low-rank adaptation (LoRA) offer training efficiency on Large Language Models, but their impact on model performance remains limited. Recent efforts integrate LoRA and Mixture-of-Experts (MoE) to improve the performance of PEFT methods. Despite promising results, research on improving the efficiency of LoRA with MoE is still in its early stages. Recent studies have shown that experts in the MoE architecture have different strengths and also exhibit some redundancy. Does this statement also apply to parameter-efficient MoE? In this paper, we introduce a novel parameter-efficient MoE method, \textit{\textbf{M}oE-L\textbf{o}RA with \textbf{L}ayer-wise Expert \textbf{A}llocation (MoLA)} for Transformer-based models, where each model layer has the flexibility to employ a varying number of LoRA experts. We investigate several architectures with varying layer-wise expert configurations. Experiments on six well-known NLP and commonsense QA benchmarks demonstrate that MoLA achieves equal or superior performance compared to all baselines. We find that allocating more LoRA experts to higher layers further enhances the effectiveness of models with a certain number of experts in total. With much fewer parameters, this allocation strategy outperforms the setting with the same number of experts in every layer. This work can be widely used as a plug-and-play parameter-efficient tuning approach for various applications. The code is available at https://github.com/GCYZSL/MoLA.

* The code is available at https://github.com/GCYZSL/MoLA

Via

Access Paper or Ask Questions

An Exploratory Study of Multimodal Physiological Data in Jazz Improvisation Using Basic Machine Learning Techniques

Jan 22, 2024

Yawen Zhang

Abstract:Our study delves into the "Embodied Musicking Dataset," exploring the intertwined relationships and correlations between physiological and psychological dimensions during improvisational music performances. The primary objective is to ascertain the presence of a definitive causal or correlational relationship between these states and comprehend their manifestation in musical compositions. This rich dataset provides a perspective on how musicians coordinate their physicality with sonic events in real-time improvisational scenarios, emphasizing the concept of "Embodied Musicking."

* Master's thesis

Via

Access Paper or Ask Questions

Evaluation and Mitigation of Agnosia in Multimodal Large Language Models

Sep 07, 2023

Jiaying Lu, Jinmeng Rao, Kezhen Chen, Xiaoyuan Guo, Yawen Zhang, Baochen Sun, Carl Yang, Jie Yang

Figure 1 for Evaluation and Mitigation of Agnosia in Multimodal Large Language Models

Figure 2 for Evaluation and Mitigation of Agnosia in Multimodal Large Language Models

Figure 3 for Evaluation and Mitigation of Agnosia in Multimodal Large Language Models

Figure 4 for Evaluation and Mitigation of Agnosia in Multimodal Large Language Models

Abstract:While Multimodal Large Language Models (MLLMs) are widely used for a variety of vision-language tasks, one observation is that they sometimes misinterpret visual inputs or fail to follow textual instructions even in straightforward cases, leading to irrelevant responses, mistakes, and ungrounded claims. This observation is analogous to a phenomenon in neuropsychology known as Agnosia, an inability to correctly process sensory modalities and recognize things (e.g., objects, colors, relations). In our study, we adapt this similar concept to define "agnosia in MLLMs", and our goal is to comprehensively evaluate and mitigate such agnosia in MLLMs. Inspired by the diagnosis and treatment process in neuropsychology, we propose a novel framework EMMA (Evaluation and Mitigation of Multimodal Agnosia). In EMMA, we develop an evaluation module that automatically creates fine-grained and diverse visual question answering examples to assess the extent of agnosia in MLLMs comprehensively. We also develop a mitigation module to reduce agnosia in MLLMs through multimodal instruction tuning on fine-grained conversations. To verify the effectiveness of our framework, we evaluate and analyze agnosia in seven state-of-the-art MLLMs using 9K test samples. The results reveal that most of them exhibit agnosia across various aspects and degrees. We further develop a fine-grained instruction set and tune MLLMs to mitigate agnosia, which led to notable improvement in accuracy.

Via

Access Paper or Ask Questions

Tackling Vision Language Tasks Through Learning Inner Monologues

Aug 19, 2023

Diji Yang, Kezhen Chen, Jinmeng Rao, Xiaoyuan Guo, Yawen Zhang, Jie Yang, Yi Zhang

Figure 1 for Tackling Vision Language Tasks Through Learning Inner Monologues

Figure 2 for Tackling Vision Language Tasks Through Learning Inner Monologues

Figure 3 for Tackling Vision Language Tasks Through Learning Inner Monologues

Figure 4 for Tackling Vision Language Tasks Through Learning Inner Monologues

Abstract:Visual language tasks require AI models to comprehend and reason with both visual and textual content. Driven by the power of Large Language Models (LLMs), two prominent methods have emerged: (1) the hybrid integration between LLMs and Vision-Language Models (VLMs), where visual inputs are firstly converted into language descriptions by VLMs, serving as inputs for LLMs to generate final answer(s); (2) visual feature alignment in language space, where visual inputs are encoded as embeddings and projected to LLMs' language space via further supervised fine-tuning. The first approach provides light training costs and interpretability but is hard to be optimized in an end-to-end fashion. The second approach presents decent performance, but feature alignment usually requires large amounts of training data and lacks interpretability. To tackle this dilemma, we propose a novel approach, Inner Monologue Multi-Modal Optimization (IMMO), to solve complex vision language problems by simulating inner monologue processes, a cognitive process in which an individual engages in silent verbal communication with themselves. We enable LLMs and VLMs to interact through natural language conversation and propose to use a two-stage training process to learn how to do the inner monologue (self-asking questions and answering questions). IMMO is evaluated on two popular tasks and the results suggest by emulating the cognitive phenomenon of internal dialogue, our approach can enhance reasoning and explanation abilities, contributing to the more effective fusion of vision and language models. More importantly, instead of using predefined human-crafted monologues, IMMO learns this process within the deep learning models, promising wider applicability to many different AI problems beyond vision language tasks.

Via

Access Paper or Ask Questions

LOWA: Localize Objects in the Wild with Attributes

May 31, 2023

Xiaoyuan Guo, Kezhen Chen, Jinmeng Rao, Yawen Zhang, Baochen Sun, Jie Yang

Figure 1 for LOWA: Localize Objects in the Wild with Attributes

Figure 2 for LOWA: Localize Objects in the Wild with Attributes

Figure 3 for LOWA: Localize Objects in the Wild with Attributes

Figure 4 for LOWA: Localize Objects in the Wild with Attributes

Abstract:We present LOWA, a novel method for localizing objects with attributes effectively in the wild. It aims to address the insufficiency of current open-vocabulary object detectors, which are limited by the lack of instance-level attribute classification and rare class names. To train LOWA, we propose a hybrid vision-language training strategy to learn object detection and recognition with class names as well as attribute information. With LOWA, users can not only detect objects with class names, but also able to localize objects by attributes. LOWA is built on top of a two-tower vision-language architecture and consists of a standard vision transformer as the image encoder and a similar transformer as the text encoder. To learn the alignment between visual and text inputs at the instance level, we train LOWA with three training steps: object-level training, attribute-aware learning, and free-text joint training of objects and attributes. This hybrid training strategy first ensures correct object detection, then incorporates instance-level attribute information, and finally balances the object class and attribute sensitivity. We evaluate our model performance of attribute classification and attribute localization on the Open-Vocabulary Attribute Detection (OVAD) benchmark and the Visual Attributes in the Wild (VAW) dataset, and experiments indicate strong zero-shot performance. Ablation studies additionally demonstrate the effectiveness of each training step of our approach.

Via

Access Paper or Ask Questions

Air Pollution Hotspot Detection and Source Feature Analysis using Cross-domain Urban Data

Nov 15, 2022

Yawen Zhang, Michael Hannigan, Qin Lv

Figure 1 for Air Pollution Hotspot Detection and Source Feature Analysis using Cross-domain Urban Data

Figure 2 for Air Pollution Hotspot Detection and Source Feature Analysis using Cross-domain Urban Data

Figure 3 for Air Pollution Hotspot Detection and Source Feature Analysis using Cross-domain Urban Data

Figure 4 for Air Pollution Hotspot Detection and Source Feature Analysis using Cross-domain Urban Data

Abstract:Air pollution is a major global environmental health threat, in particular for people who live or work near pollution sources. Areas adjacent to pollution sources often have high ambient pollution concentrations, and those areas are commonly referred to as air pollution hotspots. Detecting and characterizing pollution hotspots are of great importance for air quality management, but are challenging due to the high spatial and temporal variability of air pollutants. In this work, we explore the use of mobile sensing data (i.e., air quality sensors installed on vehicles) to detect pollution hotspots. One major challenge with mobile sensing data is uneven sampling, i.e., data collection can vary by both space and time. To address this challenge, we propose a two-step approach to detect hotspots from mobile sensing data, which includes local spike detection and sample-weighted clustering. Essentially, this approach tackles the uneven sampling issue by weighting samples based on their spatial frequency and temporal hit rate, so as to identify robust and persistent hotspots. To contextualize the hotspots and discover potential pollution source characteristics, we explore a variety of cross-domain urban data and extract features from them. As a soft-validation of the extracted features, we build hotspot inference models for cities with and without mobile sensing data. Evaluation results using real-world mobile sensing air quality data as well as cross-domain urban data demonstrate the effectiveness of our approach in detecting and inferring pollution hotspots. Furthermore, the empirical analysis of hotspots and source features yields useful insights regarding neighborhood pollution sources.

* ACM SIGSPATIAL 2021
* 10 pages

Via

Access Paper or Ask Questions

DaSGD: Squeezing SGD Parallelization Performance in Distributed Training Using Delayed Averaging

May 31, 2020

Qinggang Zhou, Yawen Zhang, Pengcheng Li, Xiaoyong Liu, Jun Yang, Runsheng Wang, Ru Huang

Figure 1 for DaSGD: Squeezing SGD Parallelization Performance in Distributed Training Using Delayed Averaging

Figure 2 for DaSGD: Squeezing SGD Parallelization Performance in Distributed Training Using Delayed Averaging

Figure 3 for DaSGD: Squeezing SGD Parallelization Performance in Distributed Training Using Delayed Averaging

Figure 4 for DaSGD: Squeezing SGD Parallelization Performance in Distributed Training Using Delayed Averaging

Abstract:The state-of-the-art deep learning algorithms rely on distributed training systems to tackle the increasing sizes of models and training data sets. Minibatch stochastic gradient descent (SGD) algorithm requires workers to halt forward/back propagations, to wait for gradients aggregated from all workers, and to receive weight updates before the next batch of tasks. This synchronous execution model exposes the overheads of gradient/weight communication among a large number of workers in a distributed training system. We propose a new SGD algorithm, DaSGD (Local SGD with Delayed Averaging), which parallelizes SGD and forward/back propagations to hide 100% of the communication overhead. By adjusting the gradient update scheme, this algorithm uses hardware resources more efficiently and reduces the reliance on the low-latency and high-throughput inter-connects. The theoretical analysis and the experimental results show its convergence rate O(1/sqrt(K)), the same as SGD. The performance evaluation demonstrates it enables a linear performance scale-up with the cluster size.

Via

Access Paper or Ask Questions