Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Ning Zhang

Sid

Task Offloading in Vehicular Edge Computing using Deep Reinforcement Learning: A Survey

Feb 10, 2025

Ashab Uddin, Ahmed Hamdi Sakr, Ning Zhang

Abstract:The increasing demand for Intelligent Transportation Systems (ITS) has introduced significant challenges in managing the complex, computation-intensive tasks generated by modern vehicles while offloading tasks to external computing infrastructures such as edge computing (EC), nearby vehicular , and UAVs has become influential solution to these challenges. However, traditional computational offloading strategies often struggle to adapt to the dynamic and heterogeneous nature of vehicular environments. In this study, we explored the potential of Reinforcement Learning (RL) and Deep Reinforcement Learning (DRL) frameworks to optimize computational offloading through adaptive, real-time decision-making, and we have thoroughly investigated the Markov Decision Process (MDP) approaches on the existing literature. The paper focuses on key aspects such as standardized learning models, optimized reward structures, and collaborative multi-agent systems, aiming to advance the understanding and application of DRL in vehicular networks. Our findings offer insights into enhancing the efficiency, scalability, and robustness of ITS, setting the stage for future innovations in this rapidly evolving field.

* 27 Pages, 3 Figures, 3 Tables

Via

Access Paper or Ask Questions

Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs

Jan 08, 2025

Zeyi Huang, Yuyang Ji, Xiaofang Wang, Nikhil Mehta, Tong Xiao, Donghyun Lee, Sigmund Vanvalkenburgh, Shengxin Zha, Bolin Lai, Licheng Yu(+3 more)

Figure 1 for Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs

Figure 2 for Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs

Figure 3 for Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs

Figure 4 for Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs

Abstract:Long-form video understanding with Large Vision Language Models is challenged by the need to analyze temporally dispersed yet spatially concentrated key moments within limited context windows. In this work, we introduce VideoMindPalace, a new framework inspired by the "Mind Palace", which organizes critical video moments into a topologically structured semantic graph. VideoMindPalace organizes key information through (i) hand-object tracking and interaction, (ii) clustered activity zones representing specific areas of recurring activities, and (iii) environment layout mapping, allowing natural language parsing by LLMs to provide grounded insights on spatio-temporal and 3D context. In addition, we propose the Video MindPalace Benchmark (VMB), to assess human-like reasoning, including spatial localization, temporal reasoning, and layout-aware sequential understanding. Evaluated on VMB and established video QA datasets, including EgoSchema, NExT-QA, IntentQA, and the Active Memories Benchmark, VideoMindPalace demonstrates notable gains in spatio-temporal coherence and human-aligned reasoning, advancing long-form video analysis capabilities in VLMs.

Via

Access Paper or Ask Questions

Neuromorphic Optical Tracking and Imaging of Randomly Moving Targets through Strongly Scattering Media

Jan 07, 2025

Ning Zhang, Timothy Shea, Arto Nurmikko

Figure 1 for Neuromorphic Optical Tracking and Imaging of Randomly Moving Targets through Strongly Scattering Media

Figure 2 for Neuromorphic Optical Tracking and Imaging of Randomly Moving Targets through Strongly Scattering Media

Figure 3 for Neuromorphic Optical Tracking and Imaging of Randomly Moving Targets through Strongly Scattering Media

Figure 4 for Neuromorphic Optical Tracking and Imaging of Randomly Moving Targets through Strongly Scattering Media

Abstract:Tracking and acquiring simultaneous optical images of randomly moving targets obscured by scattering media remains a challenging problem of importance to many applications that require precise object localization and identification. In this work we develop an end-to-end neuromorphic optical engineering and computational approach to demonstrate how to track and image normally invisible objects by combining an event detecting camera with a multistage neuromorphic deep learning strategy. Photons emerging from dense scattering media are detected by the event camera and converted to pixel-wise asynchronized spike trains - a first step in isolating object-specific information from the dominant uninformative background. Spiking data is fed into a deep spiking neural network (SNN) engine where object tracking and image reconstruction are performed by two separate yet interconnected modules running in parallel in discrete time steps over the event duration. Through benchtop experiments we demonstrate tracking and imaging randomly moving objects in dense turbid media as well as image reconstruction of spatially stationary but optically dynamic objects. Standardized character sets serve as representative proxies for geometrically complex objects, underscoring the method's generality. The results highlight the advantages of a fully neuromorphic approach in meeting a major imaging technology with high computational efficiency and low power consumption.

* 22 pages, 6 figures

Via

Access Paper or Ask Questions

Apollo: An Exploration of Video Understanding in Large Multimodal Models

Dec 13, 2024

Orr Zohar, Xiaohan Wang, Yann Dubois, Nikhil Mehta, Tong Xiao, Philippe Hansen-Estruch, Licheng Yu, Xiaofang Wang, Felix Juefei-Xu, Ning Zhang(+2 more)

Figure 1 for Apollo: An Exploration of Video Understanding in Large Multimodal Models

Figure 2 for Apollo: An Exploration of Video Understanding in Large Multimodal Models

Figure 3 for Apollo: An Exploration of Video Understanding in Large Multimodal Models

Figure 4 for Apollo: An Exploration of Video Understanding in Large Multimodal Models

Abstract:Despite the rapid integration of video perception capabilities into Large Multimodal Models (LMMs), the underlying mechanisms driving their video understanding remain poorly understood. Consequently, many design decisions in this domain are made without proper justification or analysis. The high computational cost of training and evaluating such models, coupled with limited open research, hinders the development of video-LMMs. To address this, we present a comprehensive study that helps uncover what effectively drives video understanding in LMMs. We begin by critically examining the primary contributors to the high computational requirements associated with video-LMM research and discover Scaling Consistency, wherein design and training decisions made on smaller models and datasets (up to a critical size) effectively transfer to larger models. Leveraging these insights, we explored many video-specific aspects of video-LMMs, including video sampling, architectures, data composition, training schedules, and more. For example, we demonstrated that fps sampling during training is vastly preferable to uniform frame sampling and which vision encoders are the best for video representation. Guided by these findings, we introduce Apollo, a state-of-the-art family of LMMs that achieve superior performance across different model sizes. Our models can perceive hour-long videos efficiently, with Apollo-3B outperforming most existing $7$B models with an impressive 55.1 on LongVideoBench. Apollo-7B is state-of-the-art compared to 7B LMMs with a 70.9 on MLVU, and 63.3 on Video-MME.

* https://apollo-lmms.github.io

Via

Access Paper or Ask Questions

Unleashing In-context Learning of Autoregressive Models for Few-shot Image Manipulation

Dec 03, 2024

Bolin Lai, Felix Juefei-Xu, Miao Liu, Xiaoliang Dai, Nikhil Mehta, Chenguang Zhu, Zeyi Huang, James M. Rehg, Sangmin Lee, Ning Zhang(+1 more)

Figure 1 for Unleashing In-context Learning of Autoregressive Models for Few-shot Image Manipulation

Figure 2 for Unleashing In-context Learning of Autoregressive Models for Few-shot Image Manipulation

Figure 3 for Unleashing In-context Learning of Autoregressive Models for Few-shot Image Manipulation

Figure 4 for Unleashing In-context Learning of Autoregressive Models for Few-shot Image Manipulation

Abstract:Text-guided image manipulation has experienced notable advancement in recent years. In order to mitigate linguistic ambiguity, few-shot learning with visual examples has been applied for instructions that are underrepresented in the training set, or difficult to describe purely in language. However, learning from visual prompts requires strong reasoning capability, which diffusion models are struggling with. To address this issue, we introduce a novel multi-modal autoregressive model, dubbed $\textbf{InstaManip}$, that can $\textbf{insta}$ntly learn a new image $\textbf{manip}$ulation operation from textual and visual guidance via in-context learning, and apply it to new query images. Specifically, we propose an innovative group self-attention mechanism to break down the in-context learning process into two separate stages -- learning and applying, which simplifies the complex problem into two easier tasks. We also introduce a relation regularization method to further disentangle image transformation features from irrelevant contents in exemplar images. Extensive experiments suggest that our method surpasses previous few-shot image manipulation models by a notable margin ($\geq$19% in human evaluation). We also find our model can be further boosted by increasing the number or diversity of exemplar images.

* 18 pages, 16 figures, 5 tables

Via

Access Paper or Ask Questions

Accelerating Multimodel Large Language Models by Searching Optimal Vision Token Reduction

Nov 30, 2024

Shiyu Zhao, Zhenting Wang, Felix Juefei-Xu, Xide Xia, Miao Liu, Xiaofang Wang, Mingfu Liang, Ning Zhang, Dimitris N. Metaxas, Licheng Yu

Figure 1 for Accelerating Multimodel Large Language Models by Searching Optimal Vision Token Reduction

Figure 2 for Accelerating Multimodel Large Language Models by Searching Optimal Vision Token Reduction

Figure 3 for Accelerating Multimodel Large Language Models by Searching Optimal Vision Token Reduction

Figure 4 for Accelerating Multimodel Large Language Models by Searching Optimal Vision Token Reduction

Abstract:Prevailing Multimodal Large Language Models (MLLMs) encode the input image(s) as vision tokens and feed them into the language backbone, similar to how Large Language Models (LLMs) process the text tokens. However, the number of vision tokens increases quadratically as the image resolutions, leading to huge computational costs. In this paper, we consider improving MLLM's efficiency from two scenarios, (I) Reducing computational cost without degrading the performance. (II) Improving the performance with given budgets. We start with our main finding that the ranking of each vision token sorted by attention scores is similar in each layer except the first layer. Based on it, we assume that the number of essential top vision tokens does not increase along layers. Accordingly, for Scenario I, we propose a greedy search algorithm (G-Search) to find the least number of vision tokens to keep at each layer from the shallow to the deep. Interestingly, G-Search is able to reach the optimal reduction strategy based on our assumption. For Scenario II, based on the reduction strategy from G-Search, we design a parametric sigmoid function (P-Sigmoid) to guide the reduction at each layer of the MLLM, whose parameters are optimized by Bayesian Optimization. Extensive experiments demonstrate that our approach can significantly accelerate those popular MLLMs, e.g. LLaVA, and InternVL2 models, by more than $2 \times$ without performance drops. Our approach also far outperforms other token reduction methods when budgets are limited, achieving a better trade-off between efficiency and effectiveness.

* Technical report, 18 pages

Via

Access Paper or Ask Questions

Sequential LLM Framework for Fashion Recommendation

Oct 15, 2024

Han Liu, Xianfeng Tang, Tianlang Chen, Jiapeng Liu, Indu Indu, Henry Peng Zou, Peng Dai, Roberto Fernandez Galan, Michael D Porter, Dongmei Jia(+2 more)

Figure 1 for Sequential LLM Framework for Fashion Recommendation

Figure 2 for Sequential LLM Framework for Fashion Recommendation

Figure 3 for Sequential LLM Framework for Fashion Recommendation

Figure 4 for Sequential LLM Framework for Fashion Recommendation

Abstract:The fashion industry is one of the leading domains in the global e-commerce sector, prompting major online retailers to employ recommendation systems for product suggestions and customer convenience. While recommendation systems have been widely studied, most are designed for general e-commerce problems and struggle with the unique challenges of the fashion domain. To address these issues, we propose a sequential fashion recommendation framework that leverages a pre-trained large language model (LLM) enhanced with recommendation-specific prompts. Our framework employs parameter-efficient fine-tuning with extensive fashion data and introduces a novel mix-up-based retrieval technique for translating text into relevant product suggestions. Extensive experiments show our proposed framework significantly enhances fashion recommendation performance.

Via

Access Paper or Ask Questions

Toward Scalable Image Feature Compression: A Content-Adaptive and Diffusion-Based Approach

Oct 08, 2024

Sha Guo, Zhuo Chen, Yang Zhao, Ning Zhang, Xiaotong Li, Lingyu Duan

Figure 1 for Toward Scalable Image Feature Compression: A Content-Adaptive and Diffusion-Based Approach

Figure 2 for Toward Scalable Image Feature Compression: A Content-Adaptive and Diffusion-Based Approach

Figure 3 for Toward Scalable Image Feature Compression: A Content-Adaptive and Diffusion-Based Approach

Figure 4 for Toward Scalable Image Feature Compression: A Content-Adaptive and Diffusion-Based Approach

Abstract:Traditional image codecs emphasize signal fidelity and human perception, often at the expense of machine vision tasks. Deep learning methods have demonstrated promising coding performance by utilizing rich semantic embeddings optimized for both human and machine vision. However, these compact embeddings struggle to capture fine details such as contours and textures, resulting in imperfect reconstructions. Furthermore, existing learning-based codecs lack scalability. To address these limitations, this paper introduces a content-adaptive diffusion model for scalable image compression. The proposed method encodes fine textures through a diffusion process, enhancing perceptual quality while preserving essential features for machine vision tasks. The approach employs a Markov palette diffusion model combined with widely used feature extractors and image generators, enabling efficient data compression. By leveraging collaborative texture-semantic feature extraction and pseudo-label generation, the method accurately captures texture information. A content-adaptive Markov palette diffusion model is then applied to represent both low-level textures and high-level semantic content in a scalable manner. This framework offers flexible control over compression ratios by selecting intermediate diffusion states, eliminating the need for retraining deep learning models at different operating points. Extensive experiments demonstrate the effectiveness of the proposed framework in both image reconstruction and downstream machine vision tasks such as object detection, segmentation, and facial landmark detection, achieving superior perceptual quality compared to state-of-the-art methods.

* in Proceedings of the 31st ACM International Conference on Multimedia, pp. 1431-1442, 2023

Via

Access Paper or Ask Questions

ZALM3: Zero-Shot Enhancement of Vision-Language Alignment via In-Context Information in Multi-Turn Multimodal Medical Dialogue

Sep 26, 2024

Zhangpu Li, Changhong Zou, Suxue Ma, Zhicheng Yang, Chen Du, Youbao Tang, Zhenjie Cao, Ning Zhang, Jui-Hsin Lai, Ruei-Sung Lin(+5 more)

Figure 1 for ZALM3: Zero-Shot Enhancement of Vision-Language Alignment via In-Context Information in Multi-Turn Multimodal Medical Dialogue

Figure 2 for ZALM3: Zero-Shot Enhancement of Vision-Language Alignment via In-Context Information in Multi-Turn Multimodal Medical Dialogue

Figure 3 for ZALM3: Zero-Shot Enhancement of Vision-Language Alignment via In-Context Information in Multi-Turn Multimodal Medical Dialogue

Figure 4 for ZALM3: Zero-Shot Enhancement of Vision-Language Alignment via In-Context Information in Multi-Turn Multimodal Medical Dialogue

Abstract:The rocketing prosperity of large language models (LLMs) in recent years has boosted the prevalence of vision-language models (VLMs) in the medical sector. In our online medical consultation scenario, a doctor responds to the texts and images provided by a patient in multiple rounds to diagnose her/his health condition, forming a multi-turn multimodal medical dialogue format. Unlike high-quality images captured by professional equipment in traditional medical visual question answering (Med-VQA), the images in our case are taken by patients' mobile phones. These images have poor quality control, with issues such as excessive background elements and the lesion area being significantly off-center, leading to degradation of vision-language alignment in the model training phase. In this paper, we propose ZALM3, a Zero-shot strategy to improve vision-language ALignment in Multi-turn Multimodal Medical dialogue. Since we observe that the preceding text conversations before an image can infer the regions of interest (RoIs) in the image, ZALM3 employs an LLM to summarize the keywords from the preceding context and a visual grounding model to extract the RoIs. The updated images eliminate unnecessary background noise and provide more effective vision-language alignment. To better evaluate our proposed method, we design a new subjective assessment metric for multi-turn unimodal/multimodal medical dialogue to provide a fine-grained performance comparison. Our experiments across three different clinical departments remarkably demonstrate the efficacy of ZALM3 with statistical significance.

Via

Access Paper or Ask Questions

Model-in-the-Loop (MILO): Accelerating Multimodal AI Data Annotation with LLMs

Sep 16, 2024

Yifan Wang, David Stevens, Pranay Shah, Wenwen Jiang, Miao Liu, Xu Chen, Robert Kuo, Na Li, Boying Gong, Daniel Lee(+3 more)

Figure 1 for Model-in-the-Loop (MILO): Accelerating Multimodal AI Data Annotation with LLMs

Figure 2 for Model-in-the-Loop (MILO): Accelerating Multimodal AI Data Annotation with LLMs

Figure 3 for Model-in-the-Loop (MILO): Accelerating Multimodal AI Data Annotation with LLMs

Figure 4 for Model-in-the-Loop (MILO): Accelerating Multimodal AI Data Annotation with LLMs

Abstract:The growing demand for AI training data has transformed data annotation into a global industry, but traditional approaches relying on human annotators are often time-consuming, labor-intensive, and prone to inconsistent quality. We propose the Model-in-the-Loop (MILO) framework, which integrates AI/ML models into the annotation process. Our research introduces a collaborative paradigm that leverages the strengths of both professional human annotators and large language models (LLMs). By employing LLMs as pre-annotation and real-time assistants, and judges on annotator responses, MILO enables effective interaction patterns between human annotators and LLMs. Three empirical studies on multimodal data annotation demonstrate MILO's efficacy in reducing handling time, improving data quality, and enhancing annotator experiences. We also introduce quality rubrics for flexible evaluation and fine-grained feedback on open-ended annotations. The MILO framework has implications for accelerating AI/ML development, reducing reliance on human annotation alone, and promoting better alignment between human and machine values.

Via

Access Paper or Ask Questions