University of Washington
Abstract:Training with mixed data distributions is a common and important part of creating multi-task and instruction-following models. The diversity of the data distributions and cost of joint training makes the optimization procedure extremely challenging. Data mixing methods partially address this problem, albeit having a sub-optimal performance across data sources and require multiple expensive training runs. In this paper, we propose a simple and efficient alternative for better optimization of the data sources by combining models individually trained on each data source with the base model using basic element-wise vector operations. The resulting model, namely Distribution Edited Model (DEM), is 11x cheaper than standard data mixing and outperforms strong baselines on a variety of benchmarks, yielding up to 6.2% improvement on MMLU, 11.5% on BBH, 16.1% on DROP, and 9.3% on HELM with models of size 3B to 13B. Notably, DEM does not require full re-training when modifying a single data-source, thus making it very flexible and scalable for training with diverse data sources.
Abstract:Timely status updating is the premise of emerging interaction-based applications in the Internet of Things (IoT). Using redundant devices to update the status of interest is a promising method to improve the timeliness of information. However, parallel status updating leads to out-of-order arrivals at the monitor, significantly challenging timeliness analysis. This work studies the Age of Information (AoI) of a multi-queue status update system where multiple devices monitor the same physical process. Specifically, two systems are considered: the Basic System, which only has type-1 devices that are ad hoc devices located close to the source, and the Hybrid System, which contains additional type-2 devices that are infrastructure-based devices located in fixed points compared to the Basic System. Using the Stochastic Hybrid Systems (SHS) framework, a mathematical model that combines discrete and continuous dynamics, we derive the expressions of the average AoI of the considered two systems in closed form. Numerical results verify the accuracy of the analysis. It is shown that when the number and parameters of the type-1 devices/type-2 devices are fixed, the logarithm of average AoI will linearly decrease with the logarithm of the total arrival rate of type-2 devices or that of the number of type-1 devices under specific condition. It has also been demonstrated that the proposed systems can significantly outperform the FCFS M/M/N status update system.
Abstract:Reward machines inform reinforcement learning agents about the reward structure of the environment and often drastically speed up the learning process. However, reward machines only accept Boolean features such as robot-reached-gold. Consequently, many inherently numeric tasks cannot profit from the guidance offered by reward machines. To address this gap, we aim to extend reward machines with numeric features such as distance-to-gold. For this, we present two types of reward machines: numeric-Boolean and numeric. In a numeric-Boolean reward machine, distance-to-gold is emulated by two Boolean features distance-to-gold-decreased and robot-reached-gold. In a numeric reward machine, distance-to-gold is used directly alongside the Boolean feature robot-reached-gold. We compare our new approaches to a baseline reward machine in the Craft domain, where the numeric feature is the agent-to-target distance. We use cross-product Q-learning, Q-learning with counter-factual experiences, and the options framework for learning. Our experimental results show that our new approaches significantly outperform the baseline approach. Extending reward machines with numeric features opens up new possibilities of using reward machines in inherently numeric tasks.
Abstract:This paper studies semantic-aware communication for remote estimation of multiple Markov sources over a lossy and rate-constrained channel. Unlike most existing studies that treat all source states equally, we exploit the semantics of information and consider that the remote actuator has different tolerances for the estimation errors of different states. We aim to find an optimal scheduling policy that minimizes the long-term state-dependent costs of estimation errors under a transmission frequency constraint. We theoretically show the structure of the optimal policy by leveraging the average-cost Constrained Markov Decision Process (CMDP) theory and the Lagrangian dynamic programming. By exploiting the optimal structural results, we develop a novel policy search algorithm, termed intersection search plus relative value iteration (Insec-RVI), that can find the optimal policy using only a few iterations. To avoid the ``curse of dimensionality'' of MDPs, we propose an online low-complexity drift-plus-penalty (DPP) scheduling algorithm based on the Lyapunov optimization theorem. We also design an efficient average-cost Q-learning algorithm to estimate the optimal policy without knowing a priori the channel and source statistics. Numerical results show that continuous transmission is inefficient, and remarkably, our semantic-aware policies can attain the optimum by strategically utilizing fewer transmissions by exploiting the timing of the important information.
Abstract:We propose a federated version of adaptive gradient methods, particularly AdaGrad and Adam, within the framework of over-the-air model training. This approach capitalizes on the inherent superposition property of wireless channels, facilitating fast and scalable parameter aggregation. Meanwhile, it enhances the robustness of the model training process by dynamically adjusting the stepsize in accordance with the global gradient update. We derive the convergence rate of the training algorithms, encompassing the effects of channel fading and interference, for a broad spectrum of nonconvex loss functions. Our analysis shows that the AdaGrad-based algorithm converges to a stationary point at the rate of $\mathcal{O}( \ln{(T)} /{ T^{ 1 - \frac{1}{\alpha} } } )$, where $\alpha$ represents the tail index of the electromagnetic interference. This result indicates that the level of heavy-tailedness in interference distribution plays a crucial role in the training efficiency: the heavier the tail, the slower the algorithm converges. In contrast, an Adam-like algorithm converges at the $\mathcal{O}( 1/T )$ rate, demonstrating its advantage in expediting the model training process. We conduct extensive experiments that corroborate our theoretical findings and affirm the practical efficacy of our proposed federated adaptive gradient methods.
Abstract:Development of large language models (LLM) have shown progress on reasoning, though studies have been limited to English or simple reasoning tasks. We thus introduce a multilingual structured reasoning and explanation dataset, termed xSTREET, that covers four tasks across six languages. xSTREET exposes a gap in base LLM performance between English and non-English reasoning tasks. We then propose two methods to remedy this gap, building on the insight that LLMs trained on code are better reasoners. First, at training time, we augment a code dataset with multi-lingual comments using machine translation while keeping program code as-is. Second, at inference time, we bridge the gap between training and inference by employing a prompt structure that incorporates step-by-step code primitives to derive new facts and find a solution. Our methods show improved multilingual performance on xSTREET, most notably on the scientific commonsense reasoning subtask. Furthermore, the models show no regression on non-reasoning tasks, thus showing our techniques maintain general-purpose abilities.
Abstract:Development of multimodal interactive systems is hindered by the lack of rich, multimodal (text, images) conversational data, which is needed in large quantities for LLMs. Previous approaches augment textual dialogues with retrieved images, posing privacy, diversity, and quality constraints. In this work, we introduce \textbf{M}ultimodal \textbf{A}ugmented \textbf{G}enerative \textbf{I}mages \textbf{D}ialogues (MAGID), a framework to augment text-only dialogues with diverse and high-quality images. Subsequently, a diffusion model is applied to craft corresponding images, ensuring alignment with the identified text. Finally, MAGID incorporates an innovative feedback loop between an image description generation module (textual LLM) and image quality modules (addressing aesthetics, image-text matching, and safety), that work in tandem to generate high-quality and multi-modal dialogues. We compare MAGID to other SOTA baselines on three dialogue datasets, using automated and human evaluation. Our results show that MAGID is comparable to or better than baselines, with significant improvements in human evaluation, especially against retrieval baselines where the image database is small.
Abstract:We consider the communication of natural language text from a source to a destination over noiseless and character-erasure channels. We exploit language's inherent correlations and predictability to constrain transmission costs by allowing the destination to predict or complete words with potential dissimilarity with the source text. Concretely, our objective is to obtain achievable $(\bar{c}, \bar{s})$ pairs, where $\bar{c}$ is the average transmission cost at the source and $\bar{s}$ is the average semantic similarity measured via cosine similarity between vector embedding of words at the source and those predicted/completed at the destination. We obtain $(\bar{c}, \bar{s})$ pairs for neural language and first-order Markov chain-based small language models (SLM) for prediction, using both a threshold policy that transmits a word if its cosine similarity with that predicted/completed at the destination is below a threshold, and a periodic policy, which transmits words after a specific interval and predicts/completes the words in between, at the destination. We adopt an SLM for word completion. We demonstrate that, when communication occurs over a noiseless channel, the threshold policy achieves a higher $\bar{s}$ for a given $\bar{c}$ than the periodic policy and that the $\bar{s}$ achieved with the neural SLM is greater than or equal to that of the Markov chain-based algorithm for the same $\bar{c}$. The improved performance comes with a higher complexity in terms of time and computing requirements. However, when communication occurs over a character-erasure channel, all prediction algorithms and scheduling policies perform poorly. Furthermore, if character-level Huffman coding is used, the required $\bar{c}$ to achieve a given $\bar{s}$ is reduced, but the above observations still apply.
Abstract:Federated Learning (FL) has emerged as a privacy-preserving machine learning paradigm facilitating collaborative training across multiple clients without sharing local data. Despite advancements in edge device capabilities, communication bottlenecks present challenges in aggregating a large number of clients; only a portion of the clients can update their parameters upon each global aggregation. This phenomenon introduces the critical challenge of stragglers in FL and the profound impact of client scheduling policies on global model convergence and stability. Existing scheduling strategies address staleness but predominantly focus on either timeliness or content. Motivated by this, we introduce the novel concept of Version Age of Information (VAoI) to FL. Unlike traditional Age of Information metrics, VAoI considers both timeliness and content staleness. Each client's version age is updated discretely, indicating the freshness of information. VAoI is incorporated into the client scheduling policy to minimize the average VAoI, mitigating the impact of outdated local updates and enhancing the stability of FL systems.
Abstract:Large Language Models (LLMs) are nowadays expected to generate content aligned with human preferences. Current work focuses on alignment at model training time, through techniques such as Reinforcement Learning with Human Feedback (RLHF). However, it is unclear if such methods are an effective choice to teach alignment objectives to the model. First, the inability to incorporate multiple, custom rewards and reliance on a model developer's view of universal and static principles are key limitations. Second, the residual gaps in model training and the reliability of such approaches are also questionable (e.g. susceptibility to jail-breaking even after safety training). To address these, we propose DeAL, a framework that allows the user to customize reward functions and enables Decoding-time Alignment of LLMs (DeAL). At its core, we view decoding as a heuristic-guided search process and facilitate the use of a wide variety of alignment objectives. Our experiments with programmatic constraints such as keyword and length constraints (studied widely in the pre-LLM era) and abstract objectives such as harmlessness and helpfulness (proposed in the post-LLM era) show that we can DeAL with fine-grained trade-offs, improve adherence to alignment objectives, and address residual gaps in LLMs. Lastly, while DeAL can be effectively paired with RLHF and prompting techniques, its generality makes decoding slower, an optimization we leave for future work.