Yong Zhao

Fred

10K is Enough: An Ultra-Lightweight Binarized Network for Infrared Small-Target Detection

Mar 04, 2025

Biqiao Xin, Qianchen Mao, Bingshu Wang, Jiangbin Zheng, Yong Zhao, C. L. Philip Chen

Abstract: The widespread deployment of InfRared Small-Target Detection (IRSTD) algorithms on edge devices necessitates the exploration of model compression techniques. Binary neural networks (BNNs) are distinguished by their exceptional efficiency in model compression. However, the small size of infrared targets introduces stringent precision requirements for the IRSTD task, while the inherent precision loss during binarization presents a significant challenge. To address this, we propose the Binarized Infrared Small-Target Detection Network (BiisNet), which preserves the core operations of binarized convolutions while integrating full-precision features into the network's information flow. Specifically, we propose the Dot-Binary Convolution, which retains fine-grained semantic information in feature maps while still leveraging the binarized convolution operations. In addition, we introduce a smooth and adaptive Dynamic Softsign function, which provides more comprehensive and progressively finer gradients during back-propagation, enhancing model stability and promoting an optimal weight distribution. Experimental results demonstrate that BiisNet not only significantly outperforms other binary architectures but also demonstrates strong competitiveness among state-of-the-art full-precision models.
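
As a rough illustration of the general idea behind binarization with a smooth surrogate gradient (not the paper's Dot-Binary Convolution or Dynamic Softsign, whose definitions are not given in this abstract), the sketch below binarizes values with a sign function in the forward pass and uses the derivative of a plain softsign as a stand-in gradient. The sharpness parameter `k` and all function names are hypothetical.

```python
def binarize(values):
    """Forward pass: map each real value to +1.0 or -1.0 (zero maps to +1.0)."""
    return [1.0 if v >= 0 else -1.0 for v in values]


def softsign(values, k=1.0):
    """Smooth approximation of sign: k*v / (1 + |k*v|). `k` is an assumed knob."""
    return [k * v / (1.0 + abs(k * v)) for v in values]


def softsign_grad(values, k=1.0):
    """Derivative of softsign, k / (1 + |k*v|)^2, usable as a surrogate
    gradient because the true derivative of sign is zero almost everywhere."""
    return [k / (1.0 + abs(k * v)) ** 2 for v in values]


def binary_dot(activations, weights):
    """Dot product after binarizing both operands, the core of a binarized
    convolution at a single output position."""
    return sum(a * w for a, w in zip(binarize(activations), binarize(weights)))


if __name__ == "__main__":
    x = [-0.5, 0.0, 2.0]
    print(binarize(x))
    print(softsign_grad(x, k=2.0))
    print(binary_dot([0.3, -0.2], [1.0, -4.0]))
```

A larger `k` makes the softsign closer to a hard sign and concentrates the surrogate gradient near zero; how the paper adapts this shape during training is not specified here.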

CityEQA: A Hierarchical LLM Agent on Embodied Question Answering Benchmark in City Space

Feb 20, 2025

Yong Zhao, Kai Xu, Zhengqiu Zhu, Yue Hu, Zhiheng Zheng, Yingfeng Chen, Yatai Ji, Chen Gao, Yong Li, Jincai Huang

Abstract: Embodied Question Answering (EQA) has primarily focused on indoor environments, leaving the complexities of urban settings - spanning environment, action, and perception - largely unexplored. To bridge this gap, we introduce CityEQA, a new task where an embodied agent answers open-vocabulary questions through active exploration in dynamic city spaces. To support this task, we present CityEQA-EC, the first benchmark dataset featuring 1,412 human-annotated tasks across six categories, grounded in a realistic 3D urban simulator. Moreover, we propose Planner-Manager-Actor (PMA), a novel agent tailored for CityEQA. PMA enables long-horizon planning and hierarchical task execution: the Planner breaks down the question answering into sub-tasks, the Manager maintains an object-centric cognitive map for spatial reasoning during the process control, and the specialized Actors handle navigation, exploration, and collection sub-tasks. Experiments demonstrate that PMA achieves 60.7% of human-level answering accuracy, significantly outperforming frontier-based baselines. While promising, the performance gap compared to humans highlights the need for enhanced visual reasoning in CityEQA. This work paves the way for future advancements in urban spatial intelligence. Dataset and code are available at https://github.com/BiluYong/CityEQA.git.

Streaming Speaker Change Detection and Gender Classification for Transducer-Based Multi-Talker Speech Translation

Feb 04, 2025

Peidong Wang, Naoyuki Kanda, Jian Xue, Jinyu Li, Xiaofei Wang, Aswin Shanmugam Subramanian, Junkun Chen, Sunit Sivasankaran, Xiong Xiao, Yong Zhao

Abstract: Streaming multi-talker speech translation is a task that involves not only generating accurate and fluent translations with low latency but also recognizing when a speaker change occurs and what the speaker's gender is. Speaker change information can be used to create audio prompts for a zero-shot text-to-speech system, and gender can help to select speaker profiles in a conventional text-to-speech model. We propose to tackle streaming speaker change detection and gender classification by incorporating speaker embeddings into a transducer-based streaming end-to-end speech translation model. Our experiments demonstrate that the proposed methods can achieve high accuracy for both speaker change detection and gender classification.

RE-POSE: Synergizing Reinforcement Learning-Based Partitioning and Offloading for Edge Object Detection

Jan 16, 2025

Jianrui Shi, Yong Zhao, Zeyang Cui, Xiaoming Shen, Minhang Zeng, Xiaojie Liu

Abstract: Object detection plays a crucial role in smart video analysis, with applications ranging from autonomous driving and security to smart cities. However, achieving real-time object detection on edge devices presents significant challenges due to their limited computational resources and the high demands of deep neural network (DNN)-based detection models, particularly when processing high-resolution video. Conventional strategies, such as input down-sampling and network up-scaling, often compromise detection accuracy for faster performance or lead to higher inference latency. To address these issues, this paper introduces RE-POSE, a Reinforcement Learning (RL)-Driven Partitioning and Edge Offloading framework designed to optimize the accuracy-latency trade-off in resource-constrained edge environments. Our approach features an RL-Based Dynamic Clustering Algorithm (RL-DCA) that partitions video frames into non-uniform blocks based on object distribution and the computational characteristics of DNNs. Furthermore, a parallel edge offloading scheme is implemented to distribute these blocks across multiple edge servers for concurrent processing. Experimental evaluations show that RE-POSE significantly enhances detection accuracy and reduces inference latency, surpassing existing methods.

Aligning Large Language Models for Faithful Integrity Against Opposing Argument

Jan 02, 2025

Yong Zhao, Yang Deng, See-Kiong Ng, Tat-Seng Chua

Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in complex reasoning tasks. However, they can be easily misled by unfaithful arguments during conversations, even when their original statements are correct. To this end, we investigate the problem of maintaining faithful integrity in LLMs. This involves ensuring that LLMs adhere to their faithful statements in the face of opposing arguments and are able to correct their incorrect statements when presented with faithful arguments. In this work, we propose a novel framework, named Alignment for Faithful Integrity with Confidence Estimation (AFICE), which aims to align the LLM responses with faithful integrity. Specifically, AFICE first designs a Bilateral Confidence Estimation (BCE) approach for estimating the uncertainty of each response generated by the LLM given a specific context, which simultaneously estimates the model's confidence in the question based on the internal states during decoding as well as in the answer based on cumulative probability ratios. With the BCE, we construct a conversational preference dataset composed of context, original statement, and argument, which is adopted for aligning the LLM for faithful integrity using Direct Preference Optimization (DPO). Extensive experimental results on a wide range of benchmarks demonstrate significant improvements in the LLM's ability to maintain faithful responses when encountering opposing arguments, ensuring both the practical utility and trustworthiness of LLMs in complex interactive settings. Code and data will be released via https://github.com/zhaoy777/AFICE.git.

* 17 pages, 5 figures
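
For readers unfamiliar with the Direct Preference Optimization step mentioned in the abstract, the following is a minimal sketch of the standard per-pair DPO loss, assuming the summed log-probabilities of the preferred and dispreferred responses under the trained policy and a frozen reference model are already available. It is not taken from the AFICE code; the function name, argument names, and the default `beta` are illustrative assumptions.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    loss = -log(sigmoid(beta * ((logp_chosen - ref_logp_chosen)
                                - (logp_rejected - ref_logp_rejected))))
    All inputs are sequence log-probabilities; `beta` is an assumed default.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Policy identical to the reference: the loss is log(2).
    print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
    # Policy favors the chosen response more than the reference does: lower loss.
    print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

The loss falls as the policy raises the chosen response's likelihood relative to the reference model faster than it raises the rejected response's likelihood.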

Hadamard Attention Recurrent Transformer: A Strong Baseline for Stereo Matching Transformer

Jan 02, 2025

Ziyang Chen, Yongjun Zhang, Wenting Li, Bingshu Wang, Yabo Wu, Yong Zhao, C. L. Philip Chen

Abstract: In light of the advancements in transformer technology, extant research posits the construction of stereo transformers as a potential solution to the binocular stereo matching challenge. However, constrained by the low-rank bottleneck and quadratic complexity of attention mechanisms, stereo transformers still fail to demonstrate sufficient nonlinear expressiveness within a reasonable inference time. The lack of focus on key homonymous points renders the representations of such methods vulnerable to challenging conditions, including reflections and weak textures. Furthermore, a slow computing speed is not conducive to the application. To overcome these difficulties, we present the Hadamard Attention Recurrent Stereo Transformer (HART) that incorporates the following components: 1) For faster inference, we present a Hadamard product paradigm for the attention mechanism, achieving linear computational complexity. 2) We design a Dense Attention Kernel (DAK) to amplify the differences between relevant and irrelevant feature responses. This allows HART to focus on important details. DAK also converts zero elements to non-zero elements to mitigate the reduced expressiveness caused by the low-rank bottleneck. 3) To compensate for the spatial and channel interaction missing in the Hadamard product, we propose MKOI to capture both global and local information through the interleaving of large and small kernel convolutions. Experimental results demonstrate the effectiveness of our HART. In reflective areas, HART ranked 1st on the KITTI 2012 benchmark among all published methods at the time of submission. Code is available at https://github.com/ZYangChen/HART.

Knowledge Boundary of Large Language Models: A Survey

Dec 17, 2024

Moxin Li, Yong Zhao, Yang Deng, Wenxuan Zhang, Shuaiyi Li, Wenya Xie, See-Kiong Ng, Tat-Seng Chua

Abstract: Although large language models (LLMs) store vast amounts of knowledge in their parameters, they still have limitations in the memorization and utilization of certain knowledge, leading to undesired behaviors such as generating untruthful and inaccurate responses. This highlights the critical need to understand the knowledge boundary of LLMs, a concept that remains inadequately defined in existing research. In this survey, we propose a comprehensive definition of the LLM knowledge boundary and introduce a formalized taxonomy categorizing knowledge into four distinct types. Using this foundation, we systematically review the field through three key lenses: the motivation for studying LLM knowledge boundaries, methods for identifying these boundaries, and strategies for mitigating the challenges they present. Finally, we discuss open challenges and potential research directions in this area. We aim for this survey to offer the community a comprehensive overview, facilitate access to key issues, and inspire further advancements in LLM knowledge research.

Motif Channel Opened in a White-Box: Stereo Matching via Motif Correlation Graph

Nov 19, 2024

Ziyang Chen, Yongjun Zhang, Wenting Li, Bingshu Wang, Yong Zhao, C. L. Philip Chen

Abstract: Real-world applications of stereo matching, such as autonomous driving, place stringent demands on both safety and accuracy. However, learning-based stereo matching methods inherently suffer from the loss of geometric structures in certain feature channels, creating a bottleneck in achieving precise detail matching. Additionally, these methods lack interpretability due to the black-box nature of deep learning. In this paper, we propose MoCha-V2, a novel learning-based paradigm for stereo matching. MoCha-V2 introduces the Motif Correlation Graph (MCG) to capture recurring textures, which are referred to as "motifs" within feature channels. These motifs reconstruct geometric structures and are learned in a more interpretable way. Subsequently, we integrate features from multiple frequency domains through wavelet inverse transformation. The resulting motif features are utilized to restore geometric structures in the stereo matching process. Experimental results demonstrate the effectiveness of MoCha-V2. MoCha-V2 achieved 1st place on the Middlebury benchmark at the time of its release. Code is available at https://github.com/ZYangChen/MoCha-Stereo.

A Data-driven Crowd Simulation Framework Integrating Physics-informed Machine Learning with Navigation Potential Fields

Oct 21, 2024

Runkang Guo, Bin Chen, Qi Zhang, Yong Zhao, Xiao Wang, Zhengqiu Zhu

Abstract: Traditional rule-based physical models are limited by their reliance on singular physical formulas and parameters, making it difficult to effectively tackle the intricate tasks associated with crowd simulation. Recent research has introduced deep learning methods to tackle these issues, but most current approaches focus primarily on generating pedestrian trajectories, often lacking interpretability and failing to provide real-time dynamic simulations. To address the aforementioned issues, we propose a novel data-driven crowd simulation framework that integrates Physics-informed Machine Learning (PIML) with navigation potential fields. Our approach leverages the strengths of both physical models and PIML. Specifically, we design an innovative Physics-informed Spatio-temporal Graph Convolutional Network (PI-STGCN) as a data-driven module to predict pedestrian movement trends based on crowd spatio-temporal data. Additionally, we construct a physical model of navigation potential fields based on flow field theory to guide pedestrian movements, thereby reinforcing physical constraints during the simulation. In our framework, navigation potential fields are dynamically computed and updated based on the movement trends predicted by the PI-STGCN, while the updated crowd dynamics, guided by these fields, subsequently feed back into the PI-STGCN. Comparative experiments on two publicly available large-scale real-world datasets across five scenes demonstrate that our proposed framework outperforms existing rule-based methods in accuracy and fidelity. The similarity between simulated and actual pedestrian trajectories increases by 10.8%, while the average error is reduced by 4%. Moreover, our framework exhibits greater adaptability and better interpretability compared to methods that rely solely on deep learning for trajectory generation.

Heterogeneous Graph Reinforcement Learning for Dependency-aware Multi-task Allocation in Spatial Crowdsourcing

Oct 20, 2024

Yong Zhao, Zhengqiu Zhu, Chen Gao, En Wang, Jincai Huang, Fei-Yue Wang

Abstract: Spatial Crowdsourcing (SC) is gaining traction in both academia and industry, with tasks on SC platforms becoming increasingly complex and requiring collaboration among workers with diverse skills. Recent research works address complex tasks by dividing them into subtasks with dependencies and assigning them to suitable workers. However, the dependencies among subtasks and their heterogeneous skill requirements, as well as the need for efficient utilization of workers' limited work time in the multi-task allocation mode, pose challenges in achieving an optimal task allocation scheme. Therefore, this paper formally investigates the problem of Dependency-aware Multi-task Allocation (DMA) and presents a well-designed framework to solve it, known as Heterogeneous Graph Reinforcement Learning-based Task Allocation (HGRL-TA). To address the challenges associated with representing and embedding diverse problem instances to ensure robust generalization, we propose a multi-relation graph model and a Compound-path-based Heterogeneous Graph Attention Network (CHANet) for effectively representing and capturing intricate relations among tasks and workers, as well as providing an embedding of the problem state. The task allocation decision is determined sequentially by a policy network, which undergoes simultaneous training with CHANet using the proximal policy optimization algorithm. Extensive experimental results demonstrate the effectiveness and generality of the proposed HGRL-TA in solving the DMA problem, leading to average profits that are 21.78% higher than those achieved using the metaheuristic methods.
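
The proximal policy optimization algorithm named in the abstract is commonly trained with a clipped surrogate objective. The sketch below shows that standard per-sample objective only; it is not the HGRL-TA implementation, and the function name and the default clipping range `eps` are assumptions for illustration.

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Standard PPO clipped surrogate for a single sample.

    `ratio` is new_policy_prob / old_policy_prob for the taken action and
    `advantage` is its estimated advantage. The objective (to be maximized)
    is min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A).
    """
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    # A large ratio with a positive advantage is capped at (1 + eps) * A.
    print(ppo_clip_objective(1.5, 2.0))
    # A small ratio with a negative advantage is capped at (1 - eps) * A.
    print(ppo_clip_objective(0.5, -1.0))
```

Clipping removes the incentive to move the policy far from the one that collected the data in a single update, which is the usual reason for choosing this objective over an unclipped policy gradient.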
