We introduce RAGE, an image compression framework that achieves four generally conflicting objectives: 1) good compression for a wide variety of color images, 2) computationally efficient, fast decompression, 3) fast random access with pixel-level granularity, without the need to decompress the entire image, and 4) support for both lossless and lossy compression. To achieve these, we rely on the recent concept of generalized deduplication (GD), which is known to provide efficient lossless (de)compression and fast random access in time-series data, and develop key extensions suitable for image compression, both lossless and lossy. Using nine different datasets, including graphics, logos, and natural images, we show that RAGE achieves compression ratios similar to or better than state-of-the-art lossless image compressors, while delivering pixel-level random access capabilities. Tests on an ARM Cortex-M33 platform show seek times between 9.9 and 40.6~ns and average decoding times per pixel between 274 and 1226~ns. Our measurements also show that RAGE's lossy variant, RAGE-Q, outperforms JPEG severalfold in terms of distortion on embedded graphics and achieves reasonable compression and distortion for natural images.
Self-attention is at the heart of the popular Transformer architecture, yet suffers from quadratic time and memory complexity. The breakthrough FlashAttention algorithm revealed I/O complexity as the true bottleneck in scaling Transformers. Given two levels of memory hierarchy, a fast cache (e.g., GPU on-chip SRAM) and a slow memory (e.g., GPU high-bandwidth memory), the I/O complexity measures the number of accesses to slow memory. FlashAttention computes attention using $O\!\left(\frac{N^2d^2}{M}\right)$ I/O operations, where $N$ is the dimension of the attention matrix (i.e., the sequence length), $d$ the head dimension, and $M$ the cache size. However, is this I/O complexity optimal? The known lower bound only rules out an I/O complexity of $o(Nd)$ when $M=\Theta(Nd)$, since the output that needs to be written to slow memory is $\Omega(Nd)$. This leads to the main question of our work: Is FlashAttention I/O optimal for all values of $M$? We resolve the above question in its full generality by showing an I/O complexity lower bound that matches the upper bound provided by FlashAttention for any value of $M \geq d^2$, up to constant factors. Further, we give a better algorithm with lower I/O complexity for $M < d^2$, and show that it is optimal as well. Moreover, our lower bounds do not rely on using combinatorial matrix multiplication for computing the attention matrix: we show that even if one uses fast matrix multiplication, the above I/O complexity bounds cannot be improved. We do so by introducing a new communication complexity protocol for matrix compression, and by connecting communication complexity to I/O complexity. To the best of our knowledge, this is the first work to establish a connection between communication complexity and I/O complexity; we believe this connection is of independent interest and will find many more applications in proving I/O complexity lower bounds in the future.
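The $N^2d^2/M$ cost above has a simple counting interpretation: a tiled attention kernel keeps a block of roughly $B = M/d$ query rows resident in cache and streams all of $K$ and $V$ ($Nd$ values each) once per block, giving $(N/B)\cdot Nd = N^2d^2/M$ slow-memory reads, plus $Nd$ writes for the output. A back-of-the-envelope sketch of this count, up to constant factors (the block-size choice and the omission of softmax running statistics are simplifying assumptions, not the actual kernel schedule):

```python
def flashattention_io(N, d, M):
    """Estimated slow-memory traffic (up to constant factors) of a tiled
    attention computation with sequence length N, head dimension d, and
    cache size M: each of the ceil(N / B) query blocks streams K and V
    (N*d values each) from slow memory, and the N*d output is written once."""
    assert M >= d, "cache must hold at least one row of length d"
    B = max(M // d, 1)          # query rows per block that fit in cache
    num_blocks = -(-N // B)     # ceil(N / B) passes over K and V
    return num_blocks * N * d + N * d
```

For $N = 4096$, $d = 64$, $M = 65536 = Nd/4$, the streaming term is $4 \cdot Nd = N^2d^2/M$, dominating the $Nd$ output term, which matches the regime where the FlashAttention bound exceeds the trivial $\Omega(Nd)$ lower bound.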
We present a scalable algorithm for the individually fair ($p$, $k$)-clustering problem introduced by Jung et al. and Mahabadi et al. Given $n$ points $P$ in a metric space, let $\delta(x)$ for $x\in P$ be the radius of the smallest ball around $x$ containing at least $n/k$ points. A clustering is then called individually fair if it has a center within distance $\delta(x)$ of $x$ for each $x\in P$. While good approximation algorithms are known for this problem, no efficient practical algorithms with good theoretical guarantees have been presented. We design the first fast local-search algorithm, which runs in $\widetilde{O}(nk^2)$ time and obtains a bicriteria $(O(1), 6)$ approximation. We then show empirically that our algorithm is not only much faster than prior work, but also produces lower-cost solutions.
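The fairness criterion above can be stated directly in code. A minimal sketch (a brute-force $O(n^2 \log n)$ check for illustration only, nothing like the paper's fast algorithm; whether the $n/k$ count includes $x$ itself is a convention, and here the ball includes $x$):

```python
import math

def fairness_radii(points, k):
    """delta(x): radius of the smallest ball around x containing at least
    ceil(n/k) of the n points (x itself counted)."""
    n = len(points)
    need = math.ceil(n / k)
    radii = []
    for x in points:
        dists = sorted(math.dist(x, y) for y in points)
        radii.append(dists[need - 1])  # distance to the need-th closest point
    return radii

def is_individually_fair(points, centers, radii):
    """True iff every point x has some center within distance delta(x)."""
    return all(
        min(math.dist(x, c) for c in centers) <= r
        for x, r in zip(points, radii)
    )
```

On two well-separated pairs on a line with $k=2$, placing one center per pair satisfies the constraint, while a single center does not.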
Common self-improvement approaches for large language models (LLMs), such as STaR (Zelikman et al., 2022), iteratively fine-tune LLMs on self-generated solutions to improve their problem-solving ability. However, these approaches discard the large amounts of incorrect solutions generated during this process, potentially neglecting valuable information in such solutions. To address this shortcoming, we propose V-STaR, which utilizes both the correct and incorrect solutions generated during the self-improvement process to train, via DPO, a verifier that judges the correctness of model-generated solutions. This verifier is used at inference time to select one solution among many candidate solutions. Running V-STaR for multiple iterations results in progressively better reasoners and verifiers, delivering a 4% to 17% test accuracy improvement over existing self-improvement and verification approaches on common code generation and math reasoning benchmarks with LLaMA2 models.
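The inference-time use of the verifier described above is a best-of-n selection. A minimal sketch, where `generator` and `verifier` are hypothetical stand-ins for the fine-tuned LLM and the DPO-trained verifier (the abstract does not specify the scoring interface):

```python
def best_of_n(problem, generator, verifier, n=8):
    """Sample n candidate solutions from the generator, score each with the
    verifier, and return the highest-scoring candidate."""
    candidates = [generator(problem) for _ in range(n)]
    scores = [verifier(problem, c) for c in candidates]
    return candidates[scores.index(max(scores))]
```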
In this study, we propose an automated framework for camel farm monitoring, introducing two key contributions: the Unified Auto-Annotation framework and the Fine-Tune Distillation framework. The Unified Auto-Annotation approach combines two models, GroundingDINO (GD) and the Segment Anything Model (SAM), to automatically annotate raw datasets extracted from surveillance videos. Building upon this foundation, the Fine-Tune Distillation framework fine-tunes student models on the auto-annotated dataset. This process transfers knowledge from a large teacher model to a small student model, resembling a variant of Knowledge Distillation, and is adaptable to specific use cases, making it suitable for domain-specific applications. By leveraging our raw dataset collected from Al-Marmoom Camel Farm in Dubai, UAE, and a pre-trained teacher model, GroundingDINO, the Fine-Tune Distillation framework produces a lightweight, deployable model, YOLOv8. This framework demonstrates high performance and computational efficiency, facilitating efficient real-time object detection. Our code is available at \href{https://github.com/Razaimam45/Fine-Tune-Distillation}{https://github.com/Razaimam45/Fine-Tune-Distillation}.
Currently, human drivers outperform self-driving vehicles in many conditions, such as collision avoidance. Understanding human driver behaviour in these conditions can therefore provide insight for future autonomous vehicles. Risk assessment is one approach applied so far to understanding driver behaviour, using both subjective and objective measurements. Subjective methods such as questionnaires may provide insight into driver risk assessment, but there is often significant variability between drivers. Physiological measurements such as heart rate (HR), electroencephalogram (EEG), and electromyogram (EMG) provide more objective measures of driver risk assessment. HR is often used to measure driver risk assessment, based on observed correlations between HR and risk perception. Previous work has used HR to measure driver risk assessment in self-driving systems, but pedestrian dynamics were not considered. In this study, we observed driver behaviour on a driving simulator in scenarios involving pedestrians. The scenarios contain safe and unsafe situations (e.g., in one scenario a pedestrian crosses the road and the vehicle may hit the pedestrian), and HR analysis in the time and frequency domains is applied for risk assessment. As a result, HR analysis in the frequency domain shows reasonable validity for assessing driver risk when a pedestrian is present in the traffic scene.
It is well known that selecting samples with large losses/gradients can significantly reduce the number of training steps. However, the selection overhead is often too high to yield any meaningful gains in terms of overall training time. In this work, we focus on the greedy approach of selecting samples with large \textit{approximate losses} instead of exact losses in order to reduce the selection overhead. For smooth convex losses, we show that such a greedy strategy can converge to a constant factor of the minimum value of the average loss in fewer iterations than the standard approach of random selection. We also theoretically quantify the effect of the approximation level. We then develop SIFT, which uses early exiting to obtain approximate losses from an intermediate layer's representations for sample selection. We evaluate SIFT on the task of training a 110M-parameter 12-layer BERT base model and show significant gains (in terms of training hours and number of backpropagation steps), without any optimized implementation, over vanilla training. For example, to reach 64% validation accuracy, SIFT with exit at the first layer takes ~43 hours, compared to ~57 hours for vanilla training.
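The greedy selection rule above can be sketched in a few lines. A minimal illustration, assuming `approx_loss` is a cheap per-example score (e.g., obtained from an intermediate layer via early exiting, as in SIFT); the exact selection rule and batching in the paper may differ:

```python
def select_batch(examples, approx_loss, batch_size):
    """Greedy selection: score each candidate with a cheap approximate loss
    and keep the batch_size examples with the largest scores; only those are
    used for the expensive full forward/backward pass."""
    scored = sorted(examples, key=approx_loss, reverse=True)
    return scored[:batch_size]
```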
Rearrangement planning for object retrieval tasks from confined spaces is a challenging problem, primarily due to the lack of open space for robot motion and limited perception. Several traditional methods exist to solve object retrieval tasks, but they require overhead cameras for perception, rely on a time-consuming exhaustive search to find a solution, and often make unrealistic assumptions, such as having identical, simple-geometry objects in the environment. This paper presents a neural object retrieval framework that efficiently performs rearrangement planning of unknown, arbitrary objects in confined spaces to retrieve the desired object using a given robot grasp. Our method actively senses the environment with the robot's in-hand camera. It then selects and relocates the non-target objects such that they do not block the robot path homotopy to the target object, thus also aiding an underlying path planner in quickly finding robot motion sequences. Furthermore, we demonstrate our framework in challenging scenarios, including real-world cabinet-like environments with arbitrary household objects. The results show that our framework achieves the best performance among all presented methods and is, on average, two orders of magnitude computationally faster than the best-performing baselines.
Cognitive radio (CR) and integrated sensing and communication (ISAC) are both critical technologies for the sixth generation (6G) wireless networks. However, their interplay has yet to be explored. To obtain the mutual benefits between CR and ISAC, we focus on a reconfigurable intelligent surface (RIS)-enhanced cognitive ISAC system and explore using the additional degrees of freedom brought by the RIS to improve the performance of the cognitive ISAC system. Specifically, we formulate an optimization problem of maximizing the signal-to-interference-plus-noise ratios (SINRs) of the mobile sensors (MSs) while ensuring the requirements of the spectrum sensing (SS) and the secondary transmissions, by jointly designing the SS time, the secondary base station (SBS) beamforming, and the RIS beamforming. The formulated non-convex problem can be solved by the proposed block coordinate descent (BCD) algorithm based on Dinkelbach's transform and the successive convex approximation (SCA) method. Simulation results demonstrate that the proposed scheme exhibits good convergence performance and can effectively reduce the position error bounds (PEBs) of the MSs, thereby improving the radio environment map (REM) accuracy of CR networks. Additionally, we reveal the impact of RIS deployment locations on the performance of cognitive ISAC systems.
Novel sparse regression LDPC (SR-LDPC) codes exhibit excellent performance over additive white Gaussian noise (AWGN) channels in part due to their natural provision of shaping gains. Though SR-LDPC-like codes have been considered within the context of single-user error correction and massive random access, they are yet to be examined as candidates for coordinated multi-user communication scenarios. This article explores this gap in the literature and demonstrates that SR-LDPC codes, when combined with coded demixing techniques, offer a new framework for efficient non-orthogonal multiple access (NOMA) in the context of coordinated multi-user communication channels. The ensuing communication scheme is referred to as MU-SR-LDPC coding. Empirical evidence suggests that, for a fixed SNR, MU-SR-LDPC coding can achieve a target bit error rate (BER) at a higher sum rate than orthogonal multiple access (OMA) techniques such as time division multiple access (TDMA) and frequency division multiple access (FDMA). Importantly, MU-SR-LDPC codes enable a pragmatic solution path for user-centric cell-free communication systems with (local) joint decoding. Results are supported by numerical simulations.