Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Hao Wang

Xidian University, China

Entropy Law: The Story Behind Data Compression and LLM Performance

Jul 11, 2024

Mingjia Yin, Chuhan Wu, Yufei Wang, Hao Wang, Wei Guo, Yasheng Wang, Yong Liu, Ruiming Tang, Defu Lian, Enhong Chen

Figure 1 for Entropy Law: The Story Behind Data Compression and LLM Performance

Figure 2 for Entropy Law: The Story Behind Data Compression and LLM Performance

Figure 3 for Entropy Law: The Story Behind Data Compression and LLM Performance

Figure 4 for Entropy Law: The Story Behind Data Compression and LLM Performance

Abstract:Data is the cornerstone of large language models (LLMs), but not all data is useful for model learning. Carefully selected data can better elicit the capabilities of LLMs with much less computational overhead. Most methods concentrate on evaluating the quality of individual samples in data selection, while the combinatorial effects among samples are neglected. Even if each sample is of perfect quality, their combinations may be suboptimal in teaching LLMs due to their intrinsic homogeneity or contradiction. In this paper, we aim to uncover the underlying relationships between LLM performance and data selection. Inspired by the information compression nature of LLMs, we uncover an ``entropy law'' that connects LLM performance with data compression ratio and first-epoch training loss, which reflect the information redundancy of a dataset and the mastery of inherent knowledge encoded in this dataset, respectively. Through both theoretical deduction and empirical evaluation, we find that model performance is negatively correlated to the compression ratio of training data, which usually yields a lower training loss. Based on the findings of the entropy law, we propose a quite efficient and universal data selection method named \textbf{ZIP} for training LLMs, which aim to prioritize data subsets exhibiting a low compression ratio. Based on a multi-stage algorithm that selects diverse data in a greedy manner, we can obtain a good data subset with satisfactory diversity. Extensive experiments have been conducted to validate the entropy law and the superiority of ZIP across different LLM backbones and alignment stages. We also present an interesting application of entropy law that can detect potential performance risks at the beginning of model training.

Via

Access Paper or Ask Questions

OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion

Jul 10, 2024

Hao Wang, Pengzhen Ren, Zequn Jie, Xiao Dong, Chengjian Feng, Yinlong Qian, Lin Ma, Dongmei Jiang, Yaowei Wang, Xiangyuan Lan(+1 more)

Figure 1 for OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion

Figure 2 for OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion

Figure 3 for OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion

Figure 4 for OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion

Abstract:Open-vocabulary detection is a challenging task due to the requirement of detecting objects based on class names, including those not encountered during training. Existing methods have shown strong zero-shot detection capabilities through pre-training on diverse large-scale datasets. However, these approaches still face two primary challenges: (i) how to universally integrate diverse data sources for end-to-end training, and (ii) how to effectively leverage the language-aware capability for region-level cross-modality understanding. To address these challenges, we propose a novel unified open-vocabulary detection method called OV-DINO, which pre-trains on diverse large-scale datasets with language-aware selective fusion in a unified framework. Specifically, we introduce a Unified Data Integration (UniDI) pipeline to enable end-to-end training and eliminate noise from pseudo-label generation by unifying different data sources into detection-centric data. In addition, we propose a Language-Aware Selective Fusion (LASF) module to enable the language-aware ability of the model through a language-aware query selection and fusion process. We evaluate the performance of the proposed OV-DINO on popular open-vocabulary detection benchmark datasets, achieving state-of-the-art results with an AP of 50.6\% on the COCO dataset and 40.0\% on the LVIS dataset in a zero-shot manner, demonstrating its strong generalization ability. Furthermore, the fine-tuned OV-DINO on COCO achieves 58.4\% AP, outperforming many existing methods with the same backbone. The code for OV-DINO will be available at \href{https://github.com/wanghao9610/OV-DINO}{https://github.com/wanghao9610/OV-DINO}.

* Technical Report

Via

Access Paper or Ask Questions

Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition

Jul 05, 2024

Ye Bai, Jingping Chen, Jitong Chen, Wei Chen, Zhuo Chen, Chen Ding, Linhao Dong, Qianqian Dong, Yujiao Du, Kepan Gao(+45 more)

Figure 1 for Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition

Figure 2 for Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition

Figure 3 for Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition

Figure 4 for Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition

Abstract:Modern automatic speech recognition (ASR) model is required to accurately transcribe diverse speech signals (from different domains, languages, accents, etc) given the specific contextual information in various application scenarios. Classic end-to-end models fused with extra language models perform well, but mainly in data matching scenarios and are gradually approaching a bottleneck. In this work, we introduce Seed-ASR, a large language model (LLM) based speech recognition model. Seed-ASR is developed based on the framework of audio conditioned LLM (AcLLM), leveraging the capabilities of LLMs by inputting continuous speech representations together with contextual information into the LLM. Through stage-wise large-scale training and the elicitation of context-aware capabilities in LLM, Seed-ASR demonstrates significant improvement over end-to-end models on comprehensive evaluation sets, including multiple domains, accents/dialects and languages. Additionally, Seed-ASR can be further deployed to support specific needs in various scenarios without requiring extra language models. Compared to recently released large ASR models, Seed-ASR achieves 10%-40% reduction in word (or character, for Chinese) error rates on Chinese and English public test sets, further demonstrating its powerful performance.

Via

Access Paper or Ask Questions

Foundations and Frontiers of Graph Learning Theory

Jul 03, 2024

Yu Huang, Min Zhou, Menglin Yang, Zhen Wang, Muhan Zhang, Jie Wang, Hong Xie, Hao Wang, Defu Lian, Enhong Chen

Figure 1 for Foundations and Frontiers of Graph Learning Theory

Figure 2 for Foundations and Frontiers of Graph Learning Theory

Figure 3 for Foundations and Frontiers of Graph Learning Theory

Abstract:Recent advancements in graph learning have revolutionized the way to understand and analyze data with complex structures. Notably, Graph Neural Networks (GNNs), i.e. neural network architectures designed for learning graph representations, have become a popular paradigm. With these models being usually characterized by intuition-driven design or highly intricate components, placing them within the theoretical analysis framework to distill the core concepts, helps understand the key principles that drive the functionality better and guide further development. Given this surge in interest, this article provides a comprehensive summary of the theoretical foundations and breakthroughs concerning the approximation and learning behaviors intrinsic to prevalent graph learning models. Encompassing discussions on fundamental aspects such as expressiveness power, generalization, optimization, and unique phenomena such as over-smoothing and over-squashing, this piece delves into the theoretical foundations and frontier driving the evolution of graph learning. In addition, this article also presents several challenges and further initiates discussions on possible solutions.

* 36pages,273references

Via

Access Paper or Ask Questions

Proximity Matters: Local Proximity Preserved Balancing for Treatment Effect Estimation

Jul 01, 2024

Hao Wang, Zhichao Chen, Yuan Shen, Jiajun Fan, Zhaoran Liu, Degui Yang, Xinggao Liu, Haoxuan Li

Figure 1 for Proximity Matters: Local Proximity Preserved Balancing for Treatment Effect Estimation

Figure 2 for Proximity Matters: Local Proximity Preserved Balancing for Treatment Effect Estimation

Figure 3 for Proximity Matters: Local Proximity Preserved Balancing for Treatment Effect Estimation

Figure 4 for Proximity Matters: Local Proximity Preserved Balancing for Treatment Effect Estimation

Abstract:Heterogeneous treatment effect (HTE) estimation from observational data poses significant challenges due to treatment selection bias. Existing methods address this bias by minimizing distribution discrepancies between treatment groups in latent space, focusing on global alignment. However, the fruitful aspect of local proximity, where similar units exhibit similar outcomes, is often overlooked. In this study, we propose Proximity-aware Counterfactual Regression (PCR) to exploit proximity for representation balancing within the HTE estimation context. Specifically, we introduce a local proximity preservation regularizer based on optimal transport to depict the local proximity in discrepancy calculation. Furthermore, to overcome the curse of dimensionality that renders the estimation of discrepancy ineffective, exacerbated by limited data availability for HTE estimation, we develop an informative subspace projector, which trades off minimal distance precision for improved sample complexity. Extensive experiments demonstrate that PCR accurately matches units across different treatment groups, effectively mitigates treatment selection bias, and significantly outperforms competitors. Code is available at https://anonymous.4open.science/status/ncr-B697.

* Code is available at https://anonymous.4open.science/status/ncr-B697

Via

Access Paper or Ask Questions

Personalized Federated Continual Learning via Multi-granularity Prompt

Jun 27, 2024

Hao Yu, Xin Yang, Xin Gao, Yan Kang, Hao Wang, Junbo Zhang, Tianrui Li

Figure 1 for Personalized Federated Continual Learning via Multi-granularity Prompt

Figure 2 for Personalized Federated Continual Learning via Multi-granularity Prompt

Figure 3 for Personalized Federated Continual Learning via Multi-granularity Prompt

Figure 4 for Personalized Federated Continual Learning via Multi-granularity Prompt

Abstract:Personalized Federated Continual Learning (PFCL) is a new practical scenario that poses greater challenges in sharing and personalizing knowledge. PFCL not only relies on knowledge fusion for server aggregation at the global spatial-temporal perspective but also needs model improvement for each client according to the local requirements. Existing methods, whether in Personalized Federated Learning (PFL) or Federated Continual Learning (FCL), have overlooked the multi-granularity representation of knowledge, which can be utilized to overcome Spatial-Temporal Catastrophic Forgetting (STCF) and adopt generalized knowledge to itself by coarse-to-fine human cognitive mechanisms. Moreover, it allows more effectively to personalized shared knowledge, thus serving its own purpose. To this end, we propose a novel concept called multi-granularity prompt, i.e., coarse-grained global prompt acquired through the common model learning process, and fine-grained local prompt used to personalize the generalized representation. The former focuses on efficiently transferring shared global knowledge without spatial forgetting, and the latter emphasizes specific learning of personalized local knowledge to overcome temporal forgetting. In addition, we design a selective prompt fusion mechanism for aggregating knowledge of global prompts distilled from different clients. By the exclusive fusion of coarse-grained knowledge, we achieve the transmission and refinement of common knowledge among clients, further enhancing the performance of personalization. Extensive experiments demonstrate the effectiveness of the proposed method in addressing STCF as well as improving personalized performance. Our code now is available at https://github.com/SkyOfBeginning/FedMGP.

* Accepted by KDD 2024 Research Track

Via

Access Paper or Ask Questions

Efficient Low-rank Identification via Accelerated Iteratively Reweighted Nuclear Norm Minimization

Jun 26, 2024

Hao Wang, Ye Wang, Xiangyu Yang

Figure 1 for Efficient Low-rank Identification via Accelerated Iteratively Reweighted Nuclear Norm Minimization

Figure 2 for Efficient Low-rank Identification via Accelerated Iteratively Reweighted Nuclear Norm Minimization

Figure 3 for Efficient Low-rank Identification via Accelerated Iteratively Reweighted Nuclear Norm Minimization

Figure 4 for Efficient Low-rank Identification via Accelerated Iteratively Reweighted Nuclear Norm Minimization

Abstract:This paper considers the problem of minimizing the sum of a smooth function and the Schatten-$p$ norm of the matrix. Our contribution involves proposing accelerated iteratively reweighted nuclear norm methods designed for solving the nonconvex low-rank minimization problem. Two major novelties characterize our approach. Firstly, the proposed method possesses a rank identification property, enabling the provable identification of the "correct" rank of the stationary point within a finite number of iterations. Secondly, we introduce an adaptive updating strategy for smoothing parameters. This strategy automatically fixes parameters associated with zero singular values as constants upon detecting the "correct" rank while quickly driving the rest of the parameters to zero. This adaptive behavior transforms the algorithm into one that effectively solves smooth problems after a few iterations, setting our work apart from existing iteratively reweighted methods for low-rank optimization. We prove the global convergence of the proposed algorithm, guaranteeing that every limit point of the iterates is a critical point. Furthermore, a local convergence rate analysis is provided under the Kurdyka-{\L}ojasiewicz property. We conduct numerical experiments using both synthetic and real data to showcase our algorithm's efficiency and superiority over existing methods.

* Copyright may be transferred without notice, after which this version may no longer be accessible

Via

Access Paper or Ask Questions

Debiased Recommendation with Noisy Feedback

Jun 24, 2024

Haoxuan Li, Chunyuan Zheng, Wenjie Wang, Hao Wang, Fuli Feng, Xiao-Hua Zhou

Abstract:Ratings of a user to most items in recommender systems are usually missing not at random (MNAR), largely because users are free to choose which items to rate. To achieve unbiased learning of the prediction model under MNAR data, three typical solutions have been proposed, including error-imputation-based (EIB), inverse-propensity-scoring (IPS), and doubly robust (DR) methods. However, these methods ignore an alternative form of bias caused by the inconsistency between the observed ratings and the users' true preferences, also known as noisy feedback or outcome measurement errors (OME), e.g., due to public opinion or low-quality data collection process. In this work, we study intersectional threats to the unbiased learning of the prediction model from data MNAR and OME in the collected data. First, we design OME-EIB, OME-IPS, and OME-DR estimators, which largely extend the existing estimators to combat OME in real-world recommendation scenarios. Next, we theoretically prove the unbiasedness and generalization bound of the proposed estimators. We further propose an alternate denoising training approach to achieve unbiased learning of the prediction model under MNAR data with OME. Extensive experiments are conducted on three real-world datasets and one semi-synthetic dataset to show the effectiveness of our proposed approaches. The code is available at https://github.com/haoxuanli-pku/KDD24-OME-DR.

* KDD 24 Research Track Paper

Via

Access Paper or Ask Questions

GVT2RPM: An Empirical Study for General Video Transformer Adaptation to Remote Physiological Measurement

Jun 19, 2024

Hao Wang, Euijoon Ahn, Jinman Kim

Figure 1 for GVT2RPM: An Empirical Study for General Video Transformer Adaptation to Remote Physiological Measurement

Figure 2 for GVT2RPM: An Empirical Study for General Video Transformer Adaptation to Remote Physiological Measurement

Figure 3 for GVT2RPM: An Empirical Study for General Video Transformer Adaptation to Remote Physiological Measurement

Figure 4 for GVT2RPM: An Empirical Study for General Video Transformer Adaptation to Remote Physiological Measurement

Abstract:Remote physiological measurement (RPM) is an essential tool for healthcare monitoring as it enables the measurement of physiological signs, e.g., heart rate, in a remote setting via physical wearables. Recently, with facial videos, we have seen rapid advancements in video-based RPMs. However, adopting facial videos for RPM in the clinical setting largely depends on the accuracy and robustness (work across patient populations). Fortunately, the capability of the state-of-the-art transformer architecture in general (natural) video understanding has resulted in marked improvements and has been translated to facial understanding, including RPM. However, existing RPM methods usually need RPM-specific modules, e.g., temporal difference convolution and handcrafted feature maps. Although these customized modules can increase accuracy, they are not demonstrated for their robustness across datasets. Further, due to their customization of the transformer architecture, they cannot use the advancements made in general video transformers (GVT). In this study, we interrogate the GVT architecture and empirically analyze how the training designs, i.e., data pre-processing and network configurations, affect the model performance applied to RPM. Based on the structure of video transformers, we propose to configure its spatiotemporal hierarchy to align with the dense temporal information needed in RPM for signal feature extraction. We define several practical guidelines and gradually adapt GVTs for RPM without introducing RPM-specific modules. Our experiments demonstrate favorable results to existing RPM-specific module counterparts. We conducted extensive experiments with five datasets using intra-dataset and cross-dataset settings. We highlight that the proposed guidelines GVT2RPM can be generalized to any video transformers and is robust to various datasets.

* Code is available at https://github.com/Dylan-H-Wang/facial-ai

Via

Access Paper or Ask Questions

BLoB: Bayesian Low-Rank Adaptation by Backpropagation for Large Language Models

Jun 18, 2024

Yibin Wang, Haizhou Shi, Ligong Han, Dimitris Metaxas, Hao Wang

Figure 1 for BLoB: Bayesian Low-Rank Adaptation by Backpropagation for Large Language Models

Figure 2 for BLoB: Bayesian Low-Rank Adaptation by Backpropagation for Large Language Models

Figure 3 for BLoB: Bayesian Low-Rank Adaptation by Backpropagation for Large Language Models

Figure 4 for BLoB: Bayesian Low-Rank Adaptation by Backpropagation for Large Language Models

Abstract:Large Language Models (LLMs) often suffer from overconfidence during inference, particularly when adapted to downstream domain-specific tasks with limited data. Previous work addresses this issue by employing approximate Bayesian estimation after the LLMs are trained, enabling them to quantify uncertainty. However, such post-training approaches' performance is severely limited by the parameters learned during training. In this paper, we go beyond post-training Bayesianization and propose Bayesian Low-Rank Adaptation by Backpropagation (BLoB), an algorithm that continuously and jointly adjusts both the mean and covariance of LLM parameters throughout the whole fine-tuning process. Our empirical results verify the effectiveness of BLoB in terms of generalization and uncertainty estimation, when evaluated on both in-distribution and out-of-distribution data.

* 27 pages, 3 figures, 9 tables; preprint, work in progress

Via

Access Paper or Ask Questions