Abstract: Time-varying dependence is often modeled through dynamic correlations or Gaussian graphical models, yet many multivariate systems change through tail behavior, asymmetry, or conditional structure while correlations change little. We introduce Dynamic Vine Copulas (DVC), a temporal vine-copula framework for estimating and diagnosing sequence-wide non-Gaussian dependence. DVC keeps a chosen vine factorization fixed for comparability, can use C-, D-, or R-vines, and couples pair-copula states across time through smooth parameter trajectories or temporally regularized family-switching paths. Its central diagnostic contrasts held-out scores from a full vine and its matched 1-truncated counterpart, separating flexible first-tree pairwise evidence from higher-tree conditional evidence. At the population level, under a correctly specified fixed vine and the simplifying assumption, this contrast equals the higher-tree term of a vine total-correlation decomposition; in finite samples, it serves as a predictive diagnostic. Across controlled benchmarks, DVC detects Student-t tail-degree changes, Clayton-to-Gumbel switches, and recurrent conditional-interaction episodes that Gaussian dynamic baselines miss or conflate. The higher-tree score stays near zero in pairwise-only regimes but rises selectively during conditional-interaction regimes. On Allen Visual Behavior Neuropixels data, DVC identifies a reproducible, time-indexed higher-tree signal that is positive across held-out splits and disappears under a decorrelated null, indicating simultaneous cross-area dependence. Together, these results show that DVC is both a flexible temporal copula model and an interpretable diagnostic for whether time-varying dependence changes are pairwise or conditional.
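The central diagnostic described above can be sketched as a simple held-out score contrast. The sketch below assumes we already have per-observation held-out log-densities from the full vine and from its 1-truncated counterpart; the function name and the toy data are ours, not the paper's.

```python
import numpy as np

def higher_tree_score(ll_full, ll_trunc1):
    """Mean held-out score contrast between a full vine and its matched
    1-truncated counterpart (hypothetical helper). Near zero suggests
    pairwise-only dependence; positive values suggest higher-tree
    (conditional) structure."""
    ll_full = np.asarray(ll_full, dtype=float)
    ll_trunc1 = np.asarray(ll_trunc1, dtype=float)
    return float(np.mean(ll_full - ll_trunc1))

# Toy held-out log-densities: the full vine adds value only during a
# "conditional interaction" episode in the second half of the sequence.
rng = np.random.default_rng(0)
ll_t = rng.normal(-1.0, 0.1, size=200)             # 1-truncated vine
ll_f = ll_t + np.concatenate([np.zeros(100),       # pairwise-only regime
                              np.full(100, 0.3)])  # conditional regime
print(higher_tree_score(ll_f, ll_t))  # 0.3 over half the sequence -> 0.15
```

Computed per time window instead of over the whole sequence, the same contrast yields the time-indexed higher-tree signal the abstract refers to.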
Abstract: We study the organization of channel-level importance in transformer feed-forward networks (FFNs). Using a Fisher-style loss proxy (LP) based on activation-gradient second moments, we show that loss sensitivity is concentrated in a small set of channels within each layer. In Llama-3.1-8B, the top 1% of channels per layer accounts for a median of 58.7% of LP mass, with a range of 33.0% to 86.1%. We call these loss-critical channels supernodes. Although FFN layers also contain strong activation outliers, LP-defined supernodes overlap only weakly with activation-defined outliers and are not explained by activation power or weight norms alone. Around this core, we find a weaker but consistent halo structure: some non-supernode channels share the supernodes' write support and show stronger redundancy with the protected core. We use one-shot structured FFN pruning as a diagnostic test of this organization. At 50% FFN sparsity, baselines that prune many supernodes degrade sharply, whereas our SCAR variants explicitly protect the supernode core; the strongest variant, SCAR-Prot, reaches perplexity 54.8, compared with 989.2 for Wanda-channel. The LP-concentration pattern appears across Mistral-7B, Llama-2-7B, and Qwen2-7B, remains visible in targeted Llama-3.1-70B experiments, and increases during OLMo-2-7B pretraining. These results suggest that LLM FFNs develop a small learned core of loss-critical channels, and that preserving this core is important for reliable structured pruning.
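The concentration statistic reported above (share of LP mass in the top 1% of channels) can be sketched as follows. This is a minimal illustration with synthetic heavy-tailed values standing in for per-channel LP scores; the function name and the Pareto toy profile are our assumptions, not the paper's measurement pipeline.

```python
import numpy as np

def lp_concentration(lp, top_frac=0.01):
    """Fraction of loss-proxy (LP) mass captured by the top `top_frac`
    of channels in one layer. `lp` holds per-channel Fisher-style scores
    (activation-gradient second moments); hypothetical helper."""
    lp = np.sort(np.asarray(lp, dtype=float))[::-1]  # descending
    k = max(1, int(round(top_frac * lp.size)))
    return float(lp[:k].sum() / lp.sum())

# Toy layer with a heavy-tailed LP profile: a few "supernode" channels
# dominate the mass, mimicking the reported concentration.
rng = np.random.default_rng(1)
lp = rng.pareto(1.5, size=14336)  # 14336 = FFN width of Llama-3.1-8B
print(lp_concentration(lp))       # heavy tail -> top 1% holds a large share
```

A uniform LP profile would give exactly `top_frac` here, so values well above 0.01 indicate the kind of concentration the abstract reports.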
Abstract: Modeling high-dimensional dependencies while keeping likelihoods tractable remains challenging. Classical vine-copula pipelines are interpretable but can be expensive, while many neural estimators are flexible but less structured. In this work, we propose Vine Denoising Copula (VDC), an amortized vine-copula pipeline that trains a single bivariate denoising model and reuses it across all vine edges. For each edge, given pseudo-observations, the model predicts a density grid. We then apply an IPFP/Sinkhorn projection that enforces non-negativity, unit mass, and uniform marginals. This keeps the exact vine likelihood and preserves the usual copula interpretation while replacing repeated per-edge optimization with GPU inference. Across synthetic and real-data benchmarks, VDC delivers strong bivariate density accuracy, competitive mutual-information and total-correlation (MI/TC) estimation, and substantial speedups for high-dimensional vine fitting. In practice, these gains make explicit information estimation and dependence decomposition feasible at scales where repeated vine fitting would otherwise be costly, although conditional downstream inference remains mixed.
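The IPFP/Sinkhorn projection mentioned above can be sketched on a discrete grid: alternately rescale rows and columns of the predicted grid until both marginals are uniform, which also fixes unit total mass; clipping enforces non-negativity. This is a minimal NumPy sketch under our own discretization conventions, not the paper's exact projection.

```python
import numpy as np

def sinkhorn_uniform_marginals(grid, n_iter=200, eps=1e-12):
    """Project a predicted density grid onto discrete copula mass
    functions: non-negative entries, unit total mass, and uniform row
    and column marginals (IPFP sketch; details may differ from VDC)."""
    g = np.clip(np.asarray(grid, dtype=float), eps, None)  # non-negativity
    n, m = g.shape
    for _ in range(n_iter):
        g *= (1.0 / n) / g.sum(axis=1, keepdims=True)  # rows sum to 1/n
        g *= (1.0 / m) / g.sum(axis=0, keepdims=True)  # cols sum to 1/m
    return g

# Toy predicted grid: after projection, both marginals are uniform and
# the total mass is 1.
raw = np.random.default_rng(2).random((32, 32))
c = sinkhorn_uniform_marginals(raw)
print(np.allclose(c.sum(axis=0), 1 / 32),
      np.allclose(c.sum(axis=1), 1 / 32),
      np.isclose(c.sum(), 1.0))
```

Because each iteration only rescales rows and columns, the projection preserves the relative structure of the predicted grid while restoring the copula constraints, which is what keeps the downstream vine likelihood exact.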