Nanjing University of Science and Technology, Nanjing, China
Abstract:Convolutional Neural Networks (CNNs) and Transformers have recently attracted much attention for video post-processing (VPP). However, the interaction between CNN and Transformer in existing VPP methods is not fully explored, leading to inefficient communication between locally and globally extracted features. In this paper, we explore the interaction between CNN and Transformer in the task of VPP and propose a novel Spatial and Channel Hybrid-Attention Video Post-Processing Network (SC-HVPPNet), which cooperatively exploits image priors in both the spatial and channel domains. Specifically, in the spatial domain, a novel spatial attention fusion module is designed, in which two attention weights are generated to fuse the local and global representations collaboratively. In the channel domain, a novel channel attention fusion module is developed, which dynamically blends the deep representations along the channel dimension. Extensive experiments show that SC-HVPPNet notably boosts video restoration quality, with average bitrate savings of 5.29%, 12.42%, and 13.09% for the Y, U, and V components in the VTM-11.0-NNVC RA configuration.
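Below is a minimal sketch of the spatial attention fusion idea named in this abstract: two per-pixel attention weights are predicted from the concatenated local (CNN) and global (Transformer) features and used to blend the two branches. The module structure, channel counts, and layer choices are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SpatialAttentionFusion(nn.Module):
    """Blend local (CNN) and global (Transformer) feature maps with two
    learned spatial attention weights (illustrative sketch)."""

    def __init__(self, channels: int):
        super().__init__()
        # Predict two spatial weight maps from the concatenated branches.
        self.weight_net = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 2, kernel_size=1),
        )

    def forward(self, local_feat: torch.Tensor, global_feat: torch.Tensor) -> torch.Tensor:
        stacked = torch.cat([local_feat, global_feat], dim=1)
        # Softmax over the two branches at every spatial location.
        weights = torch.softmax(self.weight_net(stacked), dim=1)
        return weights[:, :1] * local_feat + weights[:, 1:] * global_feat

fusion = SpatialAttentionFusion(channels=64)
fused = fusion(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
print(fused.shape)  # torch.Size([1, 64, 32, 32])
```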
Abstract:Semi-supervised learning provides a solution to reduce the dependency of machine learning on labeled data. As one of the efficient semi-supervised techniques, self-training (ST) has received increasing attention, and several advancements have emerged to address the challenges associated with noisy pseudo-labels. However, previous works on self-training acknowledge the importance of unlabeled data but do not delve into its efficient utilization, nor do they address the high time consumption caused by iterative learning. This paper proposes Incremental Self-training (IST) for semi-supervised learning to fill these gaps. Unlike ST, which processes all data indiscriminately, IST processes the data in batches and preferentially assigns pseudo-labels to unlabeled samples with high certainty. It then processes the data around the decision boundary after the model has stabilized, enhancing classifier performance. IST is simple yet effective and fits existing self-training-based semi-supervised learning methods. We verify the proposed IST on five datasets and two types of backbones, effectively improving recognition accuracy and learning speed. Notably, it outperforms state-of-the-art competitors on three challenging image classification tasks.
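The batched, certainty-first pseudo-labeling loop can be sketched as follows; the confidence thresholds, the deferral of near-boundary samples, and the scikit-learn classifier are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, random_state=0)
X_lab, y_lab, X_unl = X[:100], y[:100], X[100:]

model = LogisticRegression().fit(X_lab, y_lab)
deferred = X_unl
for threshold in (0.95, 0.9, 0.8):  # relax certainty as the model stabilizes
    if len(deferred) == 0:
        break
    proba = model.predict_proba(deferred)
    confident = proba.max(axis=1) >= threshold
    if confident.any():
        # Pseudo-label only the high-certainty samples and retrain.
        X_lab = np.vstack([X_lab, deferred[confident]])
        y_lab = np.concatenate([y_lab, proba[confident].argmax(axis=1)])
        model = LogisticRegression().fit(X_lab, y_lab)
    deferred = deferred[~confident]  # near-boundary samples wait for later rounds
print(f"{len(deferred)} near-boundary samples still deferred")
```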
Abstract:In the realm of point cloud scene understanding, particularly in indoor scenes, objects are arranged following human habits, so objects of certain semantics are closely positioned and display notable inter-object correlations. This can create a tendency for neural networks to exploit these strong dependencies, bypassing the individual object patterns. To address this challenge, we introduce a novel self-supervised learning (SSL) strategy that leverages both object patterns and contextual cues to produce robust features. It begins with the formulation of an object-exchanging strategy, in which pairs of objects with comparable sizes are exchanged across different scenes, effectively disentangling the strong contextual dependencies. Subsequently, we introduce a context-aware feature learning strategy, which encodes object patterns without relying on their specific context by aggregating object features across various scenes. Our extensive experiments demonstrate the superiority of our method over existing SSL techniques and further show its stronger robustness to environmental changes. Moreover, we showcase the applicability of our approach by transferring pre-trained models to diverse point cloud datasets.
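A minimal sketch of the object-exchanging idea follows, assuming objects are given as per-object point sets: two objects with comparable bounding-box sizes are swapped across scenes, each re-centered at the other's location. The size tolerance and the centering scheme are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def bbox_diagonal(points: np.ndarray) -> float:
    """Length of the axis-aligned bounding-box diagonal of an (N, 3) object."""
    return float(np.linalg.norm(points.max(axis=0) - points.min(axis=0)))

def exchange_objects(obj_a: np.ndarray, obj_b: np.ndarray, tol: float = 0.3):
    """Swap two point-set objects across scenes if their sizes are comparable."""
    da, db = bbox_diagonal(obj_a), bbox_diagonal(obj_b)
    if abs(da - db) / max(da, db) > tol:
        return obj_a, obj_b  # sizes differ too much; leave both scenes unchanged
    ca, cb = obj_a.mean(axis=0), obj_b.mean(axis=0)
    # Place each object at the other's location, breaking scene-level context.
    return obj_b - cb + ca, obj_a - ca + cb

chair = np.random.rand(500, 3)        # object from scene A
table = np.random.rand(800, 3) + 2.0  # comparably sized object from scene B
new_a, new_b = exchange_objects(chair, table)
```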
Abstract:The sim-to-real gap, which represents the disparity between training and testing environments, poses a significant challenge in reinforcement learning (RL). A promising approach to addressing this challenge is distributionally robust RL, often framed as a robust Markov decision process (RMDP). In this framework, the objective is to find a robust policy that achieves good performance under the worst-case scenario among all environments within a pre-specified uncertainty set centered around the training environment. Unlike previous work, which relies on a generative model or a pre-collected offline dataset with good coverage of the deployment environment, we tackle robust RL via interactive data collection, where the learner interacts with the training environment only and refines the policy through trial and error. In this robust RL paradigm, two main challenges emerge: managing distributional robustness, and striking a balance between exploration and exploitation during data collection. We first establish that sample-efficient learning without additional assumptions is unattainable owing to the curse of support shift, i.e., the potential disjointedness of the distributional supports between the training and testing environments. To circumvent this hardness result, we introduce the vanishing minimal value assumption for RMDPs with a total-variation (TV) distance robust set, postulating that the minimal value of the optimal robust value function is zero. We prove that this assumption effectively eliminates the support shift issue for RMDPs with a TV distance robust set, and we present an algorithm with a provable sample complexity guarantee. Our work takes an initial step toward uncovering the inherent difficulty of robust RL via interactive data collection and identifying sufficient conditions for designing a sample-efficient algorithm with sharp sample complexity analysis.
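To make the setting concrete, the robust objective and the key assumption can be written as follows; the notation is assumed for illustration (P is the training transition kernel, σ the TV radius, γ the discount factor, and V^{⋆,σ} the optimal robust value function) and is not copied verbatim from the paper.

```latex
\[
V^{\pi,\sigma}(s) = \inf_{P' \in \mathcal{U}^{\sigma}(P)}
\mathbb{E}_{P'}\Big[\textstyle\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \,\Big|\, s_0 = s, \pi\Big],
\quad
\mathcal{U}^{\sigma}(P) = \big\{P' : \mathrm{TV}\big(P'(\cdot \mid s,a),\, P(\cdot \mid s,a)\big) \le \sigma \ \ \forall (s,a)\big\}.
\]
The vanishing minimal value assumption then reads
\[
\min_{s} V^{\star,\sigma}(s) = 0,
\]
i.e., the optimal robust value function attains value zero at its minimizing state.
```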
Abstract:Open-domain Question Answering (OpenQA) aims at answering factual questions with an external large-scale knowledge corpus. However, real-world knowledge is not static; it updates and evolves continually. Such a dynamic characteristic of knowledge poses a vital challenge for these models, as the trained models need to constantly adapt to the latest information to keep their answers accurate. In addition, it is still unclear how well an OpenQA model can transfer to completely new knowledge domains. In this paper, we investigate the generalization performance of a retrieval-augmented QA model in two specific scenarios: 1) adapting to updated versions of the same knowledge corpus; and 2) switching to completely different knowledge domains. We observe that the generalization challenges of OpenQA models stem from the reader's over-reliance on memorizing the knowledge from the external corpus, which hinders the model from generalizing to a new knowledge corpus. We introduce Corpus-Invariant Tuning (CIT), a simple but effective training strategy, to mitigate this knowledge over-memorization by controlling the likelihood of retrieved contexts during training. Extensive experimental results on multiple OpenQA benchmarks show that CIT achieves significantly better generalizability without compromising the model's performance on its original corpus and domain.
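One way to realize "controlling the likelihood of retrieved contexts" is to regularize the reader's likelihood on context tokens alongside the usual answer loss. The loss form, the margin, and the coefficient `lam` below are illustrative assumptions, not CIT's exact objective.

```python
import torch
import torch.nn.functional as F

def cit_style_loss(answer_logits, answer_targets,
                   context_logits, context_targets,
                   lam: float = 0.1, margin: float = 2.0):
    # Standard supervised loss on the answer tokens.
    answer_loss = F.cross_entropy(answer_logits.flatten(0, 1), answer_targets.flatten())
    # Negative log-likelihood of the retrieved context under the model.
    context_nll = F.cross_entropy(context_logits.flatten(0, 1), context_targets.flatten())
    # Penalize only when the model becomes too confident on the context,
    # i.e., when its NLL on context tokens drops below the margin.
    return answer_loss + lam * torch.relu(margin - context_nll)

vocab = 100
loss = cit_style_loss(torch.randn(2, 8, vocab), torch.randint(vocab, (2, 8)),
                      torch.randn(2, 32, vocab), torch.randint(vocab, (2, 32)))
print(loss.item())
```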
Abstract:The machine learning community has witnessed impressive advancements since the first appearance of large language models (LLMs), yet their huge memory consumption has become a major roadblock to large-scale training. Parameter-efficient fine-tuning techniques such as Low-Rank Adaptation (LoRA) have been proposed to alleviate this problem, but their performance still fails to match full parameter training in most large-scale fine-tuning settings. To address this deficiency, we investigate the layerwise properties of LoRA on fine-tuning tasks and observe an unusual skewness of weight norms across different layers. Exploiting this key observation, we discover a surprisingly simple training strategy that outperforms both LoRA and full parameter training in a wide range of settings, with memory costs as low as LoRA's. We name it Layerwise Importance Sampled AdamW (LISA), a promising alternative to LoRA, which applies the idea of importance sampling to the layers of an LLM and randomly freezes most middle layers during optimization. Experimental results show that with similar or lower GPU memory consumption, LISA surpasses LoRA and even full parameter tuning in downstream fine-tuning tasks, consistently outperforming LoRA by $11\%$-$37\%$ in terms of MT-Bench scores. On large models, specifically LLaMA-2-70B, LISA achieves on-par or better performance than LoRA on MT-Bench, GSM8K, and PubMedQA, demonstrating its effectiveness across different domains.
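The freezing schedule at the heart of LISA can be sketched in a few lines: the embedding and head stay trainable throughout, and only a few randomly sampled middle layers are unfrozen in each sampling period. The toy model, the number of active layers, and the resampling period are illustrative assumptions, not the released implementation.

```python
import random
import torch.nn as nn

# Toy stand-in for an LLM: embedding, a stack of middle layers, and a head.
model = nn.ModuleDict({
    "embed": nn.Embedding(1000, 64),
    "layers": nn.ModuleList([nn.Linear(64, 64) for _ in range(12)]),
    "head": nn.Linear(64, 1000),
})

def resample_trainable_layers(model: nn.ModuleDict, n_active: int = 2) -> None:
    """Freeze all middle layers, then unfreeze a random subset for this period."""
    for layer in model["layers"]:
        for p in layer.parameters():
            p.requires_grad = False
    for layer in random.sample(list(model["layers"]), n_active):
        for p in layer.parameters():
            p.requires_grad = True

# Call once per sampling period (e.g., every K AdamW steps); the embedding
# and head parameters remain trainable the whole time.
resample_trainable_layers(model)
```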
Abstract:In recent years, machine learning models have achieved success based on the independently and identically distributed (i.i.d.) assumption. However, this assumption can easily be violated in real-world applications, leading to the Out-of-Distribution (OOD) problem. Understanding how modern over-parameterized DNNs behave under non-trivial natural distributional shifts is essential, as the current theoretical understanding is insufficient: existing theoretical works often provide vacuous results for over-parameterized models in OOD scenarios or even contradict empirical findings. To this end, we investigate the OOD generalization of over-parameterized models under general benign overfitting conditions. Our analysis focuses on a random feature model and examines non-trivial natural distributional shifts, where benign overfitting estimators demonstrate a constant excess OOD loss despite achieving zero excess in-distribution (ID) loss. We demonstrate that in this scenario, further increasing the model's parameterization can significantly reduce the OOD loss. Intuitively, the variance term of the ID loss remains low due to the orthogonality of long-tail features, meaning that overfitting noise during training generally does not raise the testing loss. In OOD cases, however, the distributional shift increases the variance term. Fortunately, the inherent shift is unrelated to the individual input $x$, preserving the orthogonality of long-tail features. Expanding the hidden dimension can further improve this orthogonality by mapping the features into higher-dimensional spaces, thereby reducing the variance term. We further show that model ensembles also improve the OOD loss, akin to increasing model capacity. These insights explain the empirical phenomenon of enhanced OOD generalization through model ensembles and are supported by simulations consistent with our theoretical results.
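For concreteness, the analysis setting can be instantiated with a standard random feature model; the notation below is assumed for illustration and is not copied from the paper.

```latex
\[
f(x) = \frac{1}{\sqrt{N}} \sum_{i=1}^{N} a_i\, \phi\big(\langle w_i, x \rangle\big),
\qquad w_i \sim \mathcal{N}(0, I_d / d),
\]
where the random weights $w_i$ are fixed, only the readout weights $a_i$ are
trained, $N$ is the hidden dimension, and $\phi$ is a nonlinearity. In this
notation, the abstract's claim is that the interpolating estimator can attain
zero excess ID loss while its excess OOD loss under a shifted input
distribution stays constant, and that growing $N$ (or ensembling) shrinks the
OOD variance term.
```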
Abstract:Determining the relative pose of an object between two images is pivotal to the success of generalizable object pose estimation. Existing approaches typically approximate the continuous pose representation with a large number of discrete pose hypotheses, which incurs a computationally expensive process of scoring each hypothesis at test time. By contrast, we present a Deep Voxel Matching Network (DVMNet) that eliminates the need for pose hypotheses and computes the relative object pose in a single pass. To this end, we map the two input RGB images, reference and query, to their respective voxelized 3D representations. We then pass the resulting voxels through a pose estimation module, where the voxels are aligned and the pose is computed in an end-to-end fashion by solving a least-squares problem. To enhance robustness, we introduce a weighted closest voxel algorithm capable of mitigating the impact of noisy voxels. We conduct extensive experiments on the CO3D, LINEMOD, and Objaverse datasets, demonstrating that our method delivers more accurate relative pose estimates for novel objects at a lower computational cost compared to state-of-the-art methods. Our code is released at: https://github.com/sailor-z/DVMNet/.
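The least-squares alignment step can be illustrated with a weighted Kabsch/Procrustes solve over voxel correspondences, where per-correspondence weights down-weight noisy voxels. The weights below are random placeholders for what the network would predict, and the sketch is not DVMNet's actual module.

```python
import numpy as np

def weighted_kabsch(src: np.ndarray, dst: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Rotation R minimizing sum_i w_i * ||R @ src_i - dst_i||^2."""
    w = w / w.sum()
    src_c = src - (w[:, None] * src).sum(axis=0)   # weighted centering
    dst_c = dst - (w[:, None] * dst).sum(axis=0)
    H = (w[:, None] * src_c).T @ dst_c             # weighted cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))         # guard against reflections
    return Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

rng = np.random.default_rng(0)
src = rng.normal(size=(100, 3))
R_true, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(R_true) < 0:
    R_true[:, 0] *= -1  # ensure a proper rotation (det = +1)
dst = src @ R_true.T + 0.01 * rng.normal(size=(100, 3))
R_est = weighted_kabsch(src, dst, rng.uniform(0.5, 1.0, size=100))  # ~= R_true
```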
Abstract:Large vision-language models, such as CLIPs, have revolutionized modern machine learning. CLIPs have demonstrated great generalizability under distribution shifts, supported by an increasing body of literature. However, the evaluation datasets for CLIPs are primarily variations designed for ImageNet benchmarks, which may not fully reflect the extent to which CLIPs, e.g., those pre-trained on LAION, are robust to spurious correlations. To bridge this gap, we collect a real-world dataset called CounterAnimal that contains realistic spurious features found in animal photos. CounterAnimal consists of a) the common group, comprising animals on common backgrounds, and b) the counter group, comprising animals on unusual backgrounds. The performance drop from the common group to the counter group quantifies a model's reliance on spurious features (i.e., backgrounds) to predict the animals. We find that CLIPs trained on either LAION or the OpenAI data exhibit notable performance drops on the counter group. Surprisingly, we observe that single-modal models trained on ImageNet are more robust than CLIPs. We provide both theoretical and empirical explanations for why CLIPs still learn spurious features. Our findings suggest that distribution shifts remain an open problem for CLIPs, and one needs to be cautious about test setups when evaluating foundation models pre-trained on a significantly different scale and distribution.
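The core metric implied here is simple: evaluate the same model on both groups and report the accuracy drop. The arrays below are placeholder predictions, not the actual CounterAnimal evaluation pipeline.

```python
import numpy as np

def accuracy(preds: np.ndarray, labels: np.ndarray) -> float:
    return float((preds == labels).mean())

# Placeholder per-group predictions from some pre-trained classifier.
common_preds, common_labels = np.array([0, 1, 1, 0]), np.array([0, 1, 1, 0])
counter_preds, counter_labels = np.array([0, 0, 1, 1]), np.array([0, 1, 1, 0])

# The common-to-counter accuracy drop quantifies reliance on background cues.
drop = accuracy(common_preds, common_labels) - accuracy(counter_preds, counter_labels)
print(f"accuracy drop: {drop:.0%}")
```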
Abstract:The surge in black-box AI models has prompted the need to explain their internal mechanisms and justify their reliability, especially in high-stakes applications such as healthcare and autonomous driving. Due to the lack of a rigorous definition of explainable AI (XAI), a plethora of research on explainability, interpretability, and transparency has been developed to explain and analyze models from various perspectives. Consequently, with an exhaustive list of papers, it becomes challenging to gain a comprehensive overview of XAI research from all aspects. Considering the popularity of neural networks in AI research, we narrow our focus to a specific area of XAI: gradient-based explanations, which can be directly adopted for neural network models. In this review, we systematically survey gradient-based explanation methods to date and introduce a novel taxonomy that categorizes them into four distinct classes. We then present the essence of the technical details in chronological order, underscoring the evolution of the algorithms. Next, we introduce both human and quantitative evaluations to measure algorithm performance. More importantly, we discuss the general challenges in XAI and the specific challenges of gradient-based explanations. We hope that this survey helps researchers understand state-of-the-art progress and its corresponding limitations, which could spark their interest in addressing these issues in future work.