Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Naga VS Raviteja Chappa

Micro-DualNet: Dual-Path Spatio-Temporal Network for Micro-Action Recognition

Apr 22, 2026

Naga VS Raviteja Chappa, Evangelos Sariyanidi, Lisa Yankowitz, Gokul Nair, Casey J. Zampella, Robert T. Schultz, Birkan Tunç

Abstract:Micro-actions are subtle, localized movements lasting 1-3 seconds such as scratching one's head or tapping fingers. Such subtle actions are essential for social communication, ubiquitously used in natural interactions, and thus critical for fine-grained video understanding, yet remain poorly understood by current computer vision systems. We identify a fundamental challenge: micro-actions exhibit diverse spatio-temporal characteristics where some are defined by spatial configurations while others manifest through temporal dynamics. Existing methods that commit to a single spatio-temporal decomposition cannot accommodate this diversity. We propose a dual-path network that processes anatomically-grounded spatial entities through parallel Spatial-Temporal (ST) and Temporal-Spatial (TS) pathways. The ST path captures spatial configurations before modeling temporal dynamics, while the TS path inverts this order to prioritize temporal dynamics. Rather than fixed fusion, we introduce entity-level adaptive routing where each body part learns its optimal processing preference, complemented by Mutual Action Consistency (MAC) loss that enforces cross-path coherence. Extensive experiments demonstrate competitive performance on MA-52 dataset and state-of-the-art results on iMiGUE dataset. Our work reveals that architectural adaptation to the inherent complexity of micro-actions is essential for advancing fine-grained video understanding.

* Accepted to International Conference on Automatic Face and Gesture Recognition (FG)

Via

Access Paper or Ask Questions

Public Health Advocacy Dataset: A Dataset of Tobacco Usage Videos from Social Media

Nov 12, 2024

Naga VS Raviteja Chappa, Charlotte McCormick, Susana Rodriguez Gongora, Page Daniel Dobbs, Khoa Luu

Abstract:The Public Health Advocacy Dataset (PHAD) is a comprehensive collection of 5,730 videos related to tobacco products sourced from social media platforms like TikTok and YouTube. This dataset encompasses 4.3 million frames and includes detailed metadata such as user engagement metrics, video descriptions, and search keywords. This is the first dataset with these features providing a valuable resource for analyzing tobacco-related content and its impact. Our research employs a two-stage classification approach, incorporating a Vision-Language (VL) Encoder, demonstrating superior performance in accurately categorizing various types of tobacco products and usage scenarios. The analysis reveals significant user engagement trends, particularly with vaping and e-cigarette content, highlighting areas for targeted public health interventions. The PHAD addresses the need for multi-modal data in public health research, offering insights that can inform regulatory policies and public health strategies. This dataset is a crucial step towards understanding and mitigating the impact of tobacco usage, ensuring that public health efforts are more inclusive and effective.

* Under review at International Journal of Computer Vision (IJCV); 29 figures, 5 figures;

Via

Access Paper or Ask Questions

FLAASH: Flow-Attention Adaptive Semantic Hierarchical Fusion for Multi-Modal Tobacco Content Analysis

Oct 25, 2024

Naga VS Raviteja Chappa, Page Daniel Dobbs, Bhiksha Raj, Khoa Luu

Figure 1 for FLAASH: Flow-Attention Adaptive Semantic Hierarchical Fusion for Multi-Modal Tobacco Content Analysis

Figure 2 for FLAASH: Flow-Attention Adaptive Semantic Hierarchical Fusion for Multi-Modal Tobacco Content Analysis

Figure 3 for FLAASH: Flow-Attention Adaptive Semantic Hierarchical Fusion for Multi-Modal Tobacco Content Analysis

Figure 4 for FLAASH: Flow-Attention Adaptive Semantic Hierarchical Fusion for Multi-Modal Tobacco Content Analysis

Abstract:The proliferation of tobacco-related content on social media platforms poses significant challenges for public health monitoring and intervention. This paper introduces a novel multi-modal deep learning framework named Flow-Attention Adaptive Semantic Hierarchical Fusion (FLAASH) designed to analyze tobacco-related video content comprehensively. FLAASH addresses the complexities of integrating visual and textual information in short-form videos by leveraging a hierarchical fusion mechanism inspired by flow network theory. Our approach incorporates three key innovations, including a flow-attention mechanism that captures nuanced interactions between visual and textual modalities, an adaptive weighting scheme that balances the contribution of different hierarchical levels, and a gating mechanism that selectively emphasizes relevant features. This multi-faceted approach enables FLAASH to effectively process and analyze diverse tobacco-related content, from product showcases to usage scenarios. We evaluate FLAASH on the Multimodal Tobacco Content Analysis Dataset (MTCAD), a large-scale collection of tobacco-related videos from popular social media platforms. Our results demonstrate significant improvements over existing methods, outperforming state-of-the-art approaches in classification accuracy, F1 score, and temporal consistency. The proposed method also shows strong generalization capabilities when tested on standard video question-answering datasets, surpassing current models. This work contributes to the intersection of public health and artificial intelligence, offering an effective tool for analyzing tobacco promotion in digital media.

* Under review at International Journal of Computer Vision; 20 pages, 4 figures, 5 tables

Via

Access Paper or Ask Questions

REACT: Recognize Every Action Everywhere All At Once

Nov 27, 2023

Naga VS Raviteja Chappa, Pha Nguyen, Page Daniel Dobbs, Khoa Luu

Abstract:Group Activity Recognition (GAR) is a fundamental problem in computer vision, with diverse applications in sports video analysis, video surveillance, and social scene understanding. Unlike conventional action recognition, GAR aims to classify the actions of a group of individuals as a whole, requiring a deep understanding of their interactions and spatiotemporal relationships. To address the challenges in GAR, we present REACT (\textbf{R}ecognize \textbf{E}very \textbf{Act}ion Everywhere All At Once), a novel architecture inspired by the transformer encoder-decoder model explicitly designed to model complex contextual relationships within videos, including multi-modality and spatio-temporal features. Our architecture features a cutting-edge Vision-Language Encoder block for integrated temporal, spatial, and multi-modal interaction modeling. This component efficiently encodes spatiotemporal interactions, even with sparsely sampled frames, and recovers essential local information. Our Action Decoder Block refines the joint understanding of text and video data, allowing us to precisely retrieve bounding boxes, enhancing the link between semantics and visual reality. At the core, our Actor Fusion Block orchestrates a fusion of actor-specific data and textual features, striking a balance between specificity and context. Our method outperforms state-of-the-art GAR approaches in extensive experiments, demonstrating superior accuracy in recognizing and understanding group activities. Our architecture's potential extends to diverse real-world applications, offering empirical evidence of its performance gains. This work significantly advances the field of group activity recognition, providing a robust framework for nuanced scene comprehension.

* 10 pages, 4 figures, 5 tables

Via

Access Paper or Ask Questions

SoGAR: Self-supervised Spatiotemporal Attention-based Social Group Activity Recognition

Apr 27, 2023

Naga VS Raviteja Chappa, Pha Nguyen, Alexander H Nelson, Han-Seok Seo, Xin Li, Page Daniel Dobbs, Khoa Luu

Figure 1 for SoGAR: Self-supervised Spatiotemporal Attention-based Social Group Activity Recognition

Figure 2 for SoGAR: Self-supervised Spatiotemporal Attention-based Social Group Activity Recognition

Figure 3 for SoGAR: Self-supervised Spatiotemporal Attention-based Social Group Activity Recognition

Figure 4 for SoGAR: Self-supervised Spatiotemporal Attention-based Social Group Activity Recognition

Abstract:This paper introduces a novel approach to Social Group Activity Recognition (SoGAR) using Self-supervised Transformers network that can effectively utilize unlabeled video data. To extract spatio-temporal information, we create local and global views with varying frame rates. Our self-supervised objective ensures that features extracted from contrasting views of the same video are consistent across spatio-temporal domains. Our proposed approach is efficient in using transformer-based encoders for alleviating the weakly supervised setting of group activity recognition. By leveraging the benefits of transformer models, our approach can model long-term relationships along spatio-temporal dimensions. Our proposed SoGAR method achieves state-of-the-art results on three group activity recognition benchmarks, namely JRDB-PAR, NBA, and Volleyball datasets, surpassing the current state-of-the-art in terms of F1-score, MCA, and MPCA metrics.

* 32 pages, 7 figures. arXiv admin note: text overlap with arXiv:2303.12149

Via

Access Paper or Ask Questions