Recent advances in monocular depth estimation have been made by incorporating natural language as additional guidance. Although yielding impressive results, the impact of the language prior, particularly in terms of generalization and robustness, remains unexplored. In this paper, we address this gap by quantifying the impact of this prior and introduce methods to benchmark its effectiveness across various settings. We generate "low-level" sentences that convey object-centric, three-dimensional spatial relationships, incorporate them as additional language priors and evaluate their downstream impact on depth estimation. Our key finding is that current language-guided depth estimators perform optimally only with scene-level descriptions and counter-intuitively fare worse with low level descriptions. Despite leveraging additional data, these methods are not robust to directed adversarial attacks and decline in performance with an increase in distribution shift. Finally, to provide a foundation for future research, we identify points of failures and offer insights to better understand these shortcomings. With an increasing number of methods using language for depth estimation, our findings highlight the opportunities and pitfalls that require careful consideration for effective deployment in real-world settings
One of the key shortcomings in current text-to-image (T2I) models is their inability to consistently generate images which faithfully follow the spatial relationships specified in the text prompt. In this paper, we offer a comprehensive investigation of this limitation, while also developing datasets and methods that achieve state-of-the-art performance. First, we find that current vision-language datasets do not represent spatial relationships well enough; to alleviate this bottleneck, we create SPRIGHT, the first spatially-focused, large scale dataset, by re-captioning 6 million images from 4 widely used vision datasets. Through a 3-fold evaluation and analysis pipeline, we find that SPRIGHT largely improves upon existing datasets in capturing spatial relationships. To demonstrate its efficacy, we leverage only ~0.25% of SPRIGHT and achieve a 22% improvement in generating spatially accurate images while also improving the FID and CMMD scores. Secondly, we find that training on images containing a large number of objects results in substantial improvements in spatial consistency. Notably, we attain state-of-the-art on T2I-CompBench with a spatial score of 0.2133, by fine-tuning on <500 images. Finally, through a set of controlled experiments and ablations, we document multiple findings that we believe will enhance the understanding of factors that affect spatial consistency in text-to-image models. We publicly release our dataset and model to foster further research in this area.
Large Language Models (LLMs) have achieved remarkable performance across a wide variety of natural language tasks; however, their large size makes their inference slow and computationally expensive. Focusing on this problem, we propose to instruction tune LLMs with additional explicit losses from the intermediate layers (LITE) and show that it enables these layers to acquire 'good' generation ability without affecting the generation ability of the final layer. We perform 'dynamic confidence-based early exiting' at token level from the intermediate layers which improves the efficiency of text generation without compromising the quality of the generation. We conduct comprehensive experiments by instruction tuning LLaMA-2 models on the Alpaca dataset and holistically evaluate on four different human-instruction test sets. We show that dynamic early exiting achieves consistent and considerable inference computation cost improvements (37.86% for 7B and 46.35% for 13B model) while maintaining the generation quality of the responses. We further conduct a thorough analysis of the results over several important aspects, such as comparing the semantic similarity of the outputs and dissecting the efficiency improvements by comparing the number of tokens generated in the output. In summary, our work contributes to improving the efficiency of LLM inference while maintaining the generation quality, a crucial step en route to enabling their widespread adoption.
Large Language Models (LLMs) have achieved remarkable performance across a wide variety of natural language tasks; however, their large size makes their inference slow and computationally expensive which poses a practical challenge for resource constrained real-world applications. Focusing on this problem, we propose to instruction tune LLMs in a way that enables intermediate layer decoding for efficiently generating text, but importantly without compromising the quality of the generation. Specifically, we instruction tune LLMs with additional explicit Losses from the InTermediate layErs (LITE) and show that it enables these layers to acquire 'good' generation ability without affecting the generation ability of the final layer. We perform 'dynamic confidence-based early exiting' at token level from the intermediate layers which improves the efficiency of inference while maintaining the generation quality. We conduct comprehensive experiments by instruction tuning LLaMA-2 models on the widely used Alpaca dataset and holistically evaluate on four different human-instruction test sets: Vicuna, WizardLM, Koala, and Self-Instruct. We show that 'dynamic early exiting' achieves consistent and considerable cost improvements (37.86% on average) while maintaining the generation quality of the responses. We further conduct a thorough analysis of the results over several important aspects, such as comparing the semantic similarity of the outputs and dissecting the efficiency improvements by comparing the number of tokens generated in the output. In summary, our work contributes to improving the efficiency of LLM inference while maintaining the generation quality, a crucial step en route to enabling their widespread adoption.