Akash Kumar

A Gap Between Decision Trees and Neural Networks

Jan 08, 2026

Akash Kumar

Abstract: We study when geometric simplicity of decision boundaries, used here as a notion of interpretability, can conflict with accurate approximation of axis-aligned decision trees by shallow neural networks. Decision trees induce rule-based, axis-aligned decision regions (finite unions of boxes), whereas shallow ReLU networks are typically trained as score models whose predictions are obtained by thresholding. We analyze the infinite-width, bounded-norm, single-hidden-layer ReLU class through the Radon total variation ($\mathrm{RTV}$) seminorm, which controls the geometric complexity of level sets. We first show that the hard tree indicator $1_A$ has infinite $\mathrm{RTV}$. Moreover, two natural split-wise continuous surrogates, piecewise-linear ramp smoothing and sigmoidal (logistic) smoothing, also have infinite $\mathrm{RTV}$ in dimensions $d>1$, while Gaussian convolution yields finite $\mathrm{RTV}$ but with an explicit exponential dependence on $d$. We then separate two goals that are often conflated: classification after thresholding (recovering the decision set) versus score learning (learning a calibrated score close to $1_A$). For classification, we construct a smooth barrier score $S_A$ with finite $\mathrm{RTV}$ whose fixed threshold $\tau=1$ exactly recovers the box. Under a mild tube-mass condition near $\partial A$, we prove an $L_1(P)$ calibration bound that decays polynomially in a sharpness parameter, along with an explicit $\mathrm{RTV}$ upper bound in terms of face measures. Experiments on synthetic unions of rectangles illustrate the resulting accuracy-complexity tradeoff and how threshold selection shifts where training lands along it.

* 45 pages, plots were improved

CoSPlan: Corrective Sequential Planning via Scene Graph Incremental Updates

Dec 11, 2025

Shresth Grover, Priyank Pathak, Akash Kumar, Vibhav Vineet, Yogesh S Rawat

Abstract: Large-scale Vision-Language Models (VLMs) exhibit impressive complex reasoning capabilities but remain largely unexplored in visual sequential planning, i.e., executing multi-step actions towards a goal. Additionally, practical sequential planning often involves non-optimal (erroneous) steps, challenging VLMs to detect and correct such steps. We propose the Corrective Sequential Planning Benchmark (CoSPlan) to evaluate VLMs in error-prone, vision-based sequential planning tasks across 4 domains: maze navigation, block rearrangement, image reconstruction, and object reorganization. CoSPlan assesses two key abilities: Error Detection (identifying non-optimal actions) and Step Completion (correcting and completing action sequences to reach the goal). Despite using state-of-the-art reasoning techniques such as Chain-of-Thought and Scene Graphs, VLMs (e.g., Intern-VLM and Qwen2) struggle on CoSPlan, failing to leverage contextual cues to reach goals. Addressing this, we propose a novel training-free method, Scene Graph Incremental updates (SGI), which introduces intermediate reasoning steps between the initial and goal states. SGI helps VLMs reason about sequences, yielding an average performance gain of 5.2%. In addition to enhancing reliability in corrective sequential planning, SGI generalizes to traditional planning tasks such as Plan-Bench and VQA.

RobustGait: Robustness Analysis for Appearance Based Gait Recognition

Nov 17, 2025

Reeshoon Sayera, Akash Kumar, Sirshapan Mitra, Prudvi Kamtam, Yogesh S Rawat

Abstract: Appearance-based gait recognition has achieved strong performance on controlled datasets, yet systematic evaluation of its robustness to real-world corruptions and silhouette variability remains lacking. We present RobustGait, a framework for fine-grained robustness evaluation of appearance-based gait recognition systems. RobustGait evaluation spans four dimensions: the type of perturbation (digital, environmental, temporal, occlusion), the silhouette extraction method (segmentation and parsing networks), the architectural capacities of gait recognition models, and various deployment scenarios. The benchmark introduces 15 corruption types at 5 severity levels across CASIA-B, CCPG, and SUSTech1K, with in-the-wild validation on MEVID, and evaluates six state-of-the-art gait systems. Our evaluation yields several insights. First, applying noise at the RGB level better reflects real-world degradation and reveals how distortions propagate through silhouette extraction to the downstream gait recognition systems. Second, gait accuracy is highly sensitive to silhouette extractor biases, revealing an overlooked source of benchmark bias. Third, robustness depends on both the type of perturbation and the architectural design. Finally, we explore robustness-enhancing strategies, showing that noise-aware training and knowledge distillation improve performance and move toward deployment-ready systems.

* IEEE WACV'26 Main Conference

Bayesian Learning Aided Simultaneous Sparse Estimation of Dual-Wideband THz Channels in Multi-User Hybrid MIMO Systems

Nov 15, 2025

Abhisha Garg, Akash Kumar, Suraj Srivastava, Nimish Yadav, Aditya K. Jagannatham, Lajos Hanzo

Abstract: This work conceives Bayesian Group-Sparse Regression (BGSR) for the estimation of a spatial and frequency wideband, i.e., dual-wideband, channel in Multi-User (MU) THz hybrid MIMO scenarios. We develop a practical dual-wideband THz channel model that incorporates absorption losses, reflection losses, diffused ray modeling and angles of arrival/departure (AoAs/AoDs) using a Gaussian Mixture Model (GMM). Furthermore, a low-resolution analog-to-digital converter (ADC) is employed at each RF chain, which is crucial for wideband THz massive MIMO systems to reduce power consumption and hardware complexity, given the high sampling rates and large number of antennas involved. The quantized MU THz MIMO model is linearized using the popular Bussgang decomposition, followed by a BGSR-based channel learning framework that results in sparsity across different subcarriers, where each subcarrier has its unique dictionary matrix. Next, the Bayesian Cramér-Rao Bound (BCRB) is derived for bounding the normalized mean square error (NMSE) performance. Extensive simulations were performed to assess the performance improvements achieved by the proposed BGSR method compared to other sparse estimation techniques. The metrics considered for quantifying the performance improvements include the NMSE and bit error rate (BER).
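
The Bussgang linearization mentioned in the abstract can be illustrated in its simplest scalar form. The sketch below is not the paper's multi-user model: it assumes a real zero-mean Gaussian input and a 1-bit quantizer (the abstract only says "low-resolution"), and the function names are hypothetical. It estimates the linear gain B in Q(x) = B x + d, where the distortion d is uncorrelated with x, and compares it with the known closed form sqrt(2/pi)/sigma for this case.

```python
import math
import random


def one_bit_quantize(x):
    """1-bit quantizer (assumed for illustration): keep only the sign."""
    return 1.0 if x >= 0.0 else -1.0


def estimate_bussgang_gain(sigma, n_samples=100_000, seed=0):
    """Monte Carlo estimate of B = E[x Q(x)] / E[x^2] for x ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    cross = 0.0   # running sum of x * Q(x)
    power = 0.0   # running sum of x^2
    for _ in range(n_samples):
        x = rng.gauss(0.0, sigma)
        cross += x * one_bit_quantize(x)
        power += x * x
    return cross / power


def theoretical_gain(sigma):
    """Closed form for the 1-bit, real Gaussian case: E|x| / sigma^2."""
    return math.sqrt(2.0 / math.pi) / sigma


if __name__ == "__main__":
    sigma = 2.0
    print("estimated gain:  ", estimate_bussgang_gain(sigma))
    print("theoretical gain:", theoretical_gain(sigma))
```

With the gain in hand, the quantized observation can be treated as a linear model plus an uncorrelated distortion term, which is what makes standard sparse estimators applicable after quantization.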

A Large-Scale Analysis on Contextual Self-Supervised Video Representation Learning

Apr 08, 2025

Akash Kumar, Ashlesha Kumar, Vibhav Vineet, Yogesh S Rawat

Abstract: Self-supervised learning has emerged as a powerful paradigm for label-free model pretraining, particularly in the video domain, where manual annotation is costly and time-intensive. However, existing self-supervised approaches employ diverse experimental setups, making direct comparisons challenging due to the absence of a standardized benchmark. In this work, we establish a unified benchmark that enables fair comparisons across different methods. Additionally, we systematically investigate five critical aspects of self-supervised learning in videos: (1) dataset size, (2) model complexity, (3) data distribution, (4) data noise, and (5) feature representations. To facilitate this study, we evaluate six self-supervised learning methods across six network architectures, conducting extensive experiments on five benchmark datasets and assessing performance on two distinct downstream tasks. Our analysis reveals key insights into the interplay between pretraining strategies, dataset characteristics, pretext tasks, and model architectures. Furthermore, we extend these findings to Video Foundation Models (ViFMs), demonstrating their relevance in large-scale video representation learning. Finally, leveraging these insights, we propose a novel approach that significantly reduces training data requirements while surpassing state-of-the-art methods that rely on 10% more pretraining data. We believe this work will guide future research toward a deeper understanding of self-supervised video representation learning and its broader implications.

* CVPR'25 Workshop: 6th Data-Efficient Workshop

Retrospective: A CORDIC Based Configurable Activation Function for NN Applications

Mar 18, 2025

Omkar Kokane, Gopal Raut, Salim Ullah, Mukul Lokhande, Adam Teman, Akash Kumar, Santosh Kumar Vishvakarma

Abstract: A CORDIC-based configuration for the design of Activation Functions (AF) was previously suggested to accelerate ASIC hardware design for resource-constrained systems by providing functional reconfigurability. Since its introduction, this new approach for neural network acceleration has gained widespread popularity, influencing numerous designs for activation functions in both academic and commercial AI processors. In this retrospective analysis, we explore the foundational aspects of this initiative, summarize key developments over recent years, and introduce the DA-VINCI AF tailored for the evolving needs of AI applications. This new generation of dynamically configurable and precision-adjustable activation function cores promises greater adaptability for a range of activation functions in AI workloads, including Swish, SoftMax, SeLU, and GeLU, utilizing the Shift-and-Add CORDIC technique. The previously presented design has been optimized for MAC, Sigmoid, and Tanh functionalities and incorporated into ReLU AFs, culminating in an accumulative NEURIC compute unit. These enhancements position NEURIC as a fundamental component in the resource-efficient vector engine for the realization of AI accelerators that focus on DNNs, RNNs/LSTMs, and Transformers, achieving a quality of results (QoR) of 98.5%.
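
For readers unfamiliar with the shift-and-add CORDIC technique named in the abstract, here is a minimal sketch of a generic textbook hyperbolic CORDIC in rotation mode, used to evaluate Tanh and Sigmoid. It is not the DA-VINCI or NEURIC design. Floating-point multiplication by 2^-i stands in for the hardware right shift, the function names are hypothetical, and the method as written only converges for inputs of magnitude up to roughly 1.1 (range reduction is omitted).

```python
import math


def hyperbolic_schedule(n_iter):
    """Iteration indices 1, 2, 3, 4, 4, 5, ...: indices 4, 13, 40, ... are
    repeated, which hyperbolic CORDIC needs in order to converge."""
    indices, i, next_repeat = [], 1, 4
    while len(indices) < n_iter:
        indices.append(i)
        if i == next_repeat and len(indices) < n_iter:
            indices.append(i)
            next_repeat = 3 * next_repeat + 1
        i += 1
    return indices


def cordic_tanh(theta, n_iter=16):
    """Approximate tanh(theta) for |theta| up to about 1.1."""
    x, y, z = 1.0, 0.0, theta
    for i in hyperbolic_schedule(n_iter):
        shift = 2.0 ** -i                  # a right shift by i bits in hardware
        d = 1.0 if z >= 0.0 else -1.0      # rotate so the residual angle z goes to 0
        x, y = x + d * y * shift, y + d * x * shift
        z -= d * math.atanh(shift)         # a small lookup table in hardware
    # x and y carry the same CORDIC gain, so it cancels in the ratio.
    return y / x


def cordic_sigmoid(t, n_iter=16):
    """Sigmoid via the identity sigmoid(t) = (1 + tanh(t / 2)) / 2."""
    return 0.5 * (1.0 + cordic_tanh(0.5 * t, n_iter))


if __name__ == "__main__":
    print(cordic_tanh(0.5), math.tanh(0.5))
    print(cordic_sigmoid(1.0), 1.0 / (1.0 + math.exp(-1.0)))
```

Because the same iteration yields both hyperbolic sine and cosine, one datapath can serve several activation functions, which is the kind of reconfigurability the abstract refers to.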

STPro: Spatial and Temporal Progressive Learning for Weakly Supervised Spatio-Temporal Grounding

Feb 28, 2025

Aaryan Garg, Akash Kumar, Yogesh S Rawat

Abstract: In this work, we study Weakly Supervised Spatio-Temporal Video Grounding (WSTVG), a challenging task of localizing subjects spatio-temporally in videos using only textual queries and no bounding box supervision. Inspired by recent advances in vision-language foundation models, we investigate their utility for WSTVG, leveraging their zero-shot grounding capabilities. However, we find that a simple adaptation lacks essential spatio-temporal grounding abilities. To bridge this gap, we introduce Tubelet Referral Grounding (TRG), which connects textual queries to tubelets to enable spatio-temporal predictions. Despite its promise, TRG struggles with compositional action understanding and dense scene scenarios. To address these limitations, we propose STPro, a novel progressive learning framework with two key modules: (1) Sub-Action Temporal Curriculum Learning (SA-TCL), which incrementally builds compositional action understanding, and (2) Congestion-Guided Spatial Curriculum Learning (CG-SCL), which adapts the model to complex scenes by spatially increasing task difficulty. STPro achieves state-of-the-art results on three benchmark datasets, with improvements of 1.0% on VidSTG-Declarative and 3.0% on HCSTVG-v1.

* CVPR'25 Conference

A Gap Between the Gaussian RKHS and Neural Networks: An Infinite-Center Asymptotic Analysis

Feb 22, 2025

Akash Kumar, Rahul Parhi, Mikhail Belkin

Abstract: Recent works have characterized the function-space inductive bias of infinite-width bounded-norm single-hidden-layer neural networks as a kind of bounded-variation-type space. This novel neural network Banach space encompasses many classical multivariate function spaces including certain Sobolev spaces and the spectral Barron spaces. Notably, this Banach space also includes functions that exhibit less classical regularity such as those that only vary in a few directions. On bounded domains, it is well-established that the Gaussian reproducing kernel Hilbert space (RKHS) strictly embeds into this Banach space, demonstrating a clear gap between the Gaussian RKHS and the neural network Banach space. It turns out that when investigating these spaces on unbounded domains, e.g., all of $\mathbb{R}^d$, the story is fundamentally different. We establish the following fundamental result: Certain functions that lie in the Gaussian RKHS have infinite norm in the neural network Banach space. This provides a nontrivial gap between kernel methods and neural networks by the exhibition of functions in which kernel methods can do strictly better than neural networks.

* 22 pages, 1 figure

The Complexity of Learning Sparse Superposed Features with Feedback

Feb 08, 2025

Akash Kumar

Abstract: The success of deep networks is crucially attributed to their ability to capture latent features within a representation space. In this work, we investigate whether the underlying learned features of a model can be efficiently retrieved through feedback from an agent, such as a large language model (LLM), in the form of relative triplet comparisons. These features may represent various constructs, including dictionaries in LLMs or components of a covariance matrix of Mahalanobis distances. We analyze the feedback complexity associated with learning a feature matrix in sparse settings. Our results establish tight bounds when the agent is permitted to construct activations and demonstrate strong upper bounds in sparse scenarios when the agent's feedback is limited to distributional information. We validate our theoretical findings through experiments on two distinct applications: feature recovery from Recursive Feature Machine-trained models and dictionary extraction from sparse autoencoders trained on Large Language Models.
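
To make the notion of a relative triplet comparison concrete, the following sketch shows one plausible form of the feedback, assuming a squared Mahalanobis-type distance d(u, v) = (u - v)^T Phi (u - v) with a positive semidefinite feature matrix Phi. The exact feedback model is defined in the paper; the function names and the example matrices here are hypothetical.

```python
def mahalanobis_sq(phi, u, v):
    """Squared distance (u - v)^T phi (u - v) for a square matrix phi."""
    diff = [a - b for a, b in zip(u, v)]
    n = len(diff)
    return sum(diff[i] * phi[i][j] * diff[j] for i in range(n) for j in range(n))


def triplet_feedback(phi, x, y, z):
    """Return -1 if y is closer to x than z is, +1 if z is closer, 0 on a tie."""
    gap = mahalanobis_sq(phi, x, y) - mahalanobis_sq(phi, x, z)
    return (gap > 0) - (gap < 0)


if __name__ == "__main__":
    x, y, z = (0.0, 0.0), (1.0, 0.0), (0.0, 5.0)
    identity = [[1.0, 0.0], [0.0, 1.0]]
    sparse = [[1.0, 0.0], [0.0, 0.0]]   # only the first coordinate matters
    print(triplet_feedback(identity, x, y, z))  # y is closer under the identity
    print(triplet_feedback(sparse, x, y, z))    # z is closer under the sparse matrix
```

Since the answer depends only on the sign of a difference of distances, multiplying Phi by any positive constant leaves every triplet answer unchanged, so feedback of this form can identify Phi at most up to positive scaling.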

* 40 pages, 20 figures

Contextual Self-paced Learning for Weakly Supervised Spatio-Temporal Video Grounding

Jan 28, 2025

Akash Kumar, Zsolt Kira, Yogesh Singh Rawat

Abstract: In this work, we focus on Weakly Supervised Spatio-Temporal Video Grounding (WSTVG). It is a multimodal task aimed at localizing specific subjects spatio-temporally based on textual queries without bounding box supervision. Motivated by recent advancements in multi-modal foundation models for grounding tasks, we first explore the potential of state-of-the-art object detection models for WSTVG. Despite their robust zero-shot capabilities, our adaptation reveals significant limitations, including inconsistent temporal predictions, inadequate understanding of complex queries, and challenges in adapting to difficult scenarios. We propose CoSPaL (Contextual Self-Paced Learning), a novel approach which is designed to overcome these limitations. CoSPaL integrates three core components: (1) Tubelet Phrase Grounding (TPG), which introduces spatio-temporal prediction by linking textual queries to tubelets; (2) Contextual Referral Grounding (CRG), which improves comprehension of complex queries by extracting contextual information to refine object identification over time; and (3) Self-Paced Scene Understanding (SPS), a training paradigm that progressively increases task difficulty, enabling the model to adapt to complex scenarios by transitioning from coarse to fine-grained understanding.

* ICLR'25 Main Conference. Project Page: https://akash2907.github.io/cospal_webpage
