Abstract:Training very deep neural networks requires controlling the propagation of magnitudes across depth. Without such control, activations and gradients may vanish, explode, or enter unstable regimes that make optimization fail. Modern architectures often mitigate this problem through Batch Normalization, residual connections, or other normalization layers, which repeatedly re-scale or bypass intermediate representations. However, these mechanisms are not always appropriate. In Physics-Informed Neural Networks (PINNs), the network represents a continuous physical field and its input derivatives define the training objective, making batch-dependent normalization problematic because it can introduce non-local dependencies into the predicted field and its derivatives. We propose StableGrad, an optimizer-level scale-control mechanism that corrects layer-wise weight-gradient imbalances without modifying the forward model. Because the normalization is applied only after backpropagation and before the optimizer update, the network output, its derivatives, and the physical residual remain unchanged. We analyze the effective training dynamics induced by this rescaling and evaluate StableGrad on deep PINNs as the target application, with BatchNorm-free convolutional networks serving as a diagnostic stress test. On PINN benchmarks, StableGrad improves matched-depth solution accuracy and makes deeper models more reliable under standard optimization. On ResNet and EfficientNet architectures, where removing Batch Normalization normally leads to training collapse, StableGrad stabilizes optimization without introducing any other architectural change. These results show that optimizer-level control of weight-gradient scale can provide a practical alternative when forward normalization is unavailable or undesirable.
Abstract:Federated learning usually aggregates client updates using dataset size or gradient-level criteria, while overlooking internal signals about how each client model is organizing its input space during training. We introduce FedOUI, a simple aggregation rule based on the Overfitting-Underfitting Indicator (OUI), an activation-based and label-free metric. Each participating client sends its local update together with a OUI value computed on a fixed probe batch, and the server estimates the round-wise OUI distribution to assign lower weights to structurally atypical clients through a smooth reweighting rule. We evaluate FedOUI on CIFAR-10 under strong non-IID partitioning and noisy-client conditions, comparing it with FedAvg, FedProx, and a gradient-alignment baseline. The clearest gains appear under strong heterogeneity, where OUI-based weighting improves aggregation quality while remaining lightweight and interpretable. These results show that internal activation structure can provide useful information for federated aggregation beyond client size and gradient geometry.
Abstract:Activation functions are what make deep networks expressive: without them, the model collapses to a linear map. Yet we still evaluate training mostly from the outside, through loss, accuracy, return, or final calibration, while the internal structural evolution of the network remains largely unobserved. In this paper, we argue that the Overfitting--Underfitting Indicator (OUI) should be understood as a first practical observable of that internal structure. Across our recent results, OUI consistently appears as an early, label-free, activation-based signal that reveals whether a network is entering a poor or promising training regime before convergence. In supervised learning, it anticipates weight decay regimes; in reinforcement learning, it discriminates learning-rate regimes early in PPO actor--critic; and in online control, it can drive layer-wise weight decay adaptation. Read together with recent evidence that activation patterns tend to stabilize earlier than parameters, these results suggest a broader research direction: an activation-centric theory of training dynamics. OUI is becoming an empirical foothold toward this theory.
Abstract:Weight decay remains one of the most widely used regularization mechanisms for training convolutional neural networks, yet it is still commonly applied as a fixed coefficient shared by all layers throughout training. This uniform treatment ignores that different layers may follow different structural dynamics and therefore may require different regularization strengths. In this work, we propose OUIDecay, an adaptive layer-wise and time-dependent weight decay scheduler for CNNs driven by the Overfitting-Underfitting Indicator (OUI), an activation-based metric previously shown to provide early information about regularization quality. OUIDecay uses a lightweight batch-based formulation of OUI to monitor the structural behavior of each layer online and periodically rescales its weight decay relative to the other layers in the network. Unlike gradient-based adaptive decay methods, our approach relies on functional information extracted from activation patterns and does not require validation data. Experiments on EfficientNet-B0 with Stanford Cars, ResNet50 with Food101, DenseNet121 with CIFAR100, and MobileNetV2 with CIFAR10 show that OUIDecay achieves the best mean best-validation-loss in 7 out of 8 evaluated settings. These results indicate that activation-driven weight decay adaptation is a practical and effective alternative to fixed decay and gradient-based adaptive decay, while keeping the method lightweight and suitable for online use.
Abstract:Recent evidence suggests that Adam performs robustly when its momentum parameters are tied, $β_1=β_2$, reducing the optimizer to a single remaining parameter. However, the value of this parameter is still poorly understood. We argue that, in balanced Adam, $β$ should not be treated as a dimensionless constant: it defines a statistical memory horizon $H_β=(1-β)^{-1}$. In terms of the effective learning horizon $T_{\mathrm{ES}}$, estimated from the validation trajectory, we study the refresh count $R_β=(1-β)T_{\mathrm{ES}}$, which measures how many times Adam renews its internal statistics during the useful phase of training. Across 11 vision and language experiments, we find that choosing $β$ so that $R_β\approx1000$ selects different beta values depending on the training scale, yet improves robustness over the best fixed-beta baseline. Compared with the strongest fixed choice $β=0.94377$, the refresh rule improves worst-case robustness, reducing the global maximum validation gap by $33.4\%$, while bringing all 11 runs within $1\%$ of their validation oracle. These results suggest that the remaining hyperparameter of balanced Adam is better understood as a memory-scale variable than as a fixed constant. This provides a simple budget-aware perspective on optimizer scaling and opens a path toward treating Adam's momentum as part of the learning dynamics rather than as a static default.
Abstract:Federated learning (FL) enables collaborative training across organizations without sharing raw data, but it is hindered by statistical heterogeneity (non-i.i.d.\ client data) and by instability of naive weight averaging under client drift. In many cross-silo deployments, FL is warm-started from a strong pretrained backbone (e.g., ImageNet-1K) and then adapted to local domains. Motivated by recent evidence that ReLU-like gating regimes (structural knowledge) stabilize earlier than the remaining parameter values (quantitative knowledge), we propose FedSQ (Federated Structural-Quantitative learning), a transfer-initialized neural federated procedure based on a DualCopy, piecewise-linear view of deep networks. FedSQ freezes a structural copy of the pretrained model to induce fixed binary gating masks during federated fine-tuning, while only a quantitative copy is optimized locally and aggregated across rounds. Fixing the gating reduces learning to within-regime affine refinements, which stabilizes aggregation under heterogeneous partitions. Experiments on two convolutional neural network backbones under i.i.d.\ and Dirichlet splits show that FedSQ improves robustness and can reduce rounds-to-best validation performance relative to standard baselines while preserving accuracy in the transfer setting.
Abstract:Gaussian Error Linear Unit (GELU) is a widely used smooth alternative to Rectifier Linear Unit (ReLU), yet many deployment, compression, and analysis toolchains are most naturally expressed for piecewise-linear (ReLU-type) networks. We study a hardness-parameterized formulation of GELU, f(x;λ)=xΦ(λ x), where Φ is the Gaussian CDF and λ \in [1, infty) controls gate sharpness, with the goal of turning smooth gated training into a controlled path toward ReLU-compatible models. Learning λ is non-trivial: naive updates yield unstable dynamics and effective gradient attenuation, so we introduce a constrained reparameterization and an optimizer-aware update scheme. Empirically, across a diverse set of model--dataset pairs spanning MLPs, CNNs, and Transformers, we observe structured layerwise hardness profiles and assess their robustness under different initializations. We further study a deterministic ReLU-ization strategy in which the learned gates are progressively hardened toward a principled target, enabling a post-training substitution of λ-GELU by ReLU with reduced disruption. Overall, λ-GELU provides a minimal and interpretable knob to profile and control gating hardness, bridging smooth training with ReLU-centric downstream pipelines.
Abstract:Deep Reinforcement Learning systems are highly sensitive to the learning rate (LR), and selecting stable and performant training runs often requires extensive hyperparameter search. In Proximal Policy Optimization (PPO) actor--critic methods, small LR values lead to slow convergence, whereas large LR values may induce instability or collapse. We analyse this phenomenon from the behavior of the hidden neurons in the network using the Overfitting-Underfitting Indicator (OUI), a metric that quantifies the balance of binary activation patterns over a fixed probe batch. We introduce an efficient batch-based formulation of OUI and derive a theoretical connection between LR and activation sign changes, clarifying how a correct evolution of the neuron's inner structure depends on the step size. Empirically, across three discrete-control environments and multiple seeds, we show that OUI measured at only 10\% of training already discriminates between LR regimes. We observe a consistent asymmetry: critic networks achieving highest return operate in an intermediate OUI band (avoiding saturation), whereas actor networks achieving highest return exhibit comparatively high OUI values. We then compare OUI-based screening rules against early return, clip-based, divergence-based, and flip-based criteria under matched recall over successful runs. In this setting, OUI provides the strongest early screening signal: OUI alone achieves the best precision at broader recall, while combining early return with OUI yields the highest precision in best-performing screening regimes, enabling aggressive pruning of unpromising runs without requiring full training.
Abstract:Despite the empirical success of DNN, their internal training dynamics remain difficult to characterize. In ReLU-based models, the activation pattern induced by a given input determines the piecewise-linear region in which the network behaves affinely. Motivated by this geometry, we investigate whether training exhibits a two-timescale behavior: an early stage with substantial changes in activation patterns and a later stage where weight updates predominantly refine the model within largely stable activation regimes. We first prove a local stability property: outside measure-zero sets of parameters and inputs, sufficiently small parameter perturbations preserve the activation pattern of a fixed input, implying locally affine behavior within activation regions. We then empirically track per-iteration changes in weights and activation patterns across fully-connected and convolutional architectures, as well as Transformer-based models, where activation patterns are recorded in the ReLU feed-forward (MLP/FFN) submodules, using fixed validation subsets. Across the evaluated settings, activation-pattern changes decay 3 times earlier than weight-update magnitudes, showing that late-stage training often proceeds within relatively stable activation regimes. These findings provide a concrete, architecture-agnostic instrument for monitoring training dynamics and motivate further study of decoupled optimization strategies for piecewise-linear networks. For reproducibility, code and experiment configurations will be released upon acceptance.
Abstract:Adam has been at the core of large-scale training for almost a decade, yet a simple empirical fact remains unaccounted for: both validation scores and the qualitative behaviour of the training runs improve when the momentum parameters satisfy $β_{1}=β_{2}$. Some recent studies have reported this pattern, but there is still no explanation for why this choice helps. We show that this choice is closely tied to a structural property that we refer to as \textit{gradient scale invariance}. We formalize this notion and prove that Adam becomes gradient scale invariant of first order if and only if $β_{1}=β_{2}$. This perspective places the balanced regime of Adam in direct alignment with the design principles underlying several recent optimizers that explicitly enforce scale-robust updates. The theory is supported by experiments across vision and language tasks, and across different architectural families, in which rescaling the gradient has a markedly smoother effect on the update when $β_{1}=β_{2}$. Overall, our results offer a coherent explanation for an open question in the behavior of Adam and provide a simple principle that helps guide the design of future optimizers.