Omer Levy

Self-Alignment with Instruction Backtranslation

Aug 14, 2023
Xian Li, Ping Yu, Chunting Zhou, Timo Schick, Luke Zettlemoyer, Omer Levy, Jason Weston, Mike Lewis

We present a scalable method to build a high-quality instruction-following language model by automatically labelling human-written text with corresponding instructions. Our approach, named instruction backtranslation, starts with a language model finetuned on a small amount of seed data, and a given web corpus. The seed model is used to construct training examples by generating instruction prompts for web documents (self-augmentation), and then selecting high-quality examples from among these candidates (self-curation). This data is then used to finetune a stronger model. Finetuning LLaMa on two iterations of our approach yields a model that outperforms all other LLaMa-based models on the Alpaca leaderboard that do not rely on distillation data, demonstrating highly effective self-alignment.
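
To make the self-augmentation and self-curation steps concrete, here is a minimal Python sketch of one iteration; the seed_model.generate wrapper, the prompt wording, and the 1-5 rating scheme are illustrative assumptions, not the authors' released implementation.

# Hedged sketch of one instruction-backtranslation iteration.
# `seed_model.generate(prompt)` is a hypothetical text-generation wrapper.

def self_augment(seed_model, web_documents):
    """Generate a candidate instruction for each unlabelled web document."""
    candidates = []
    for doc in web_documents:
        prompt = ("Write an instruction for which the following text "
                  "would be a good response:\n\n" + doc + "\n\nInstruction:")
        candidates.append({"instruction": seed_model.generate(prompt),
                           "output": doc})
    return candidates

def self_curate(seed_model, candidates, threshold=4):
    """Keep only the candidate pairs the model itself rates as high quality."""
    curated = []
    for ex in candidates:
        rating_prompt = ("On a scale of 1-5, rate how well the output follows "
                         "the instruction.\nInstruction: " + ex["instruction"] +
                         "\nOutput: " + ex["output"] + "\nRating:")
        reply = seed_model.generate(rating_prompt).strip()
        if reply[:1].isdigit() and int(reply[0]) >= threshold:
            curated.append(ex)
    return curated

# The curated examples are used to finetune a stronger model, which then
# replaces `seed_model` for the next iteration (two iterations in the paper).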

ZeroSCROLLS: A Zero-Shot Benchmark for Long Text Understanding

May 23, 2023
Uri Shaham, Maor Ivgi, Avia Efrat, Jonathan Berant, Omer Levy

We introduce ZeroSCROLLS, a zero-shot benchmark for natural language understanding over long texts, which contains only test sets, without training or development data. We adapt six tasks from the SCROLLS benchmark and add four new datasets, including two novel information-fusing tasks, such as aggregating the percentage of positive reviews. Using ZeroSCROLLS, we conduct a comprehensive evaluation of both open-source and closed large language models, finding that Claude outperforms ChatGPT, and that GPT-4 achieves the highest average score. However, there is still room for improvement on multiple open challenges in ZeroSCROLLS, such as aggregation tasks, where models struggle to pass the naive baseline. As the state of the art is a moving target, we invite researchers to evaluate their ideas on the live ZeroSCROLLS leaderboard.
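
Since the benchmark ships test sets only, evaluation reduces to zero-shot prompting; a rough sketch of that loop is below, where load_task, format_prompt, generate, and compute_metric are hypothetical callables supplied by the caller, not the actual ZeroSCROLLS tooling.

# Hedged sketch of zero-shot evaluation over a suite of long-text tasks.
# All four callables are placeholders supplied by the caller.

def evaluate_zero_shot(generate, load_task, format_prompt, compute_metric,
                       task_names, max_new_tokens=256):
    scores = {}
    for task in task_names:
        examples = load_task(task)            # test set only: no train/dev data
        predictions = [generate(format_prompt(task, ex), max_new_tokens)
                       for ex in examples]    # instruction + long input document
        scores[task] = compute_metric(task, predictions, examples)
    scores["average"] = sum(scores.values()) / len(task_names)
    return scores                             # leaderboard reports the average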

LIMA: Less Is More for Alignment

May 18, 2023
Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, Susan Zhang, Gargi Ghosh, Mike Lewis, Luke Zettlemoyer, Omer Levy

Large language models are trained in two stages: (1) unsupervised pretraining from raw text, to learn general-purpose representations, and (2) large scale instruction tuning and reinforcement learning, to better align to end tasks and user preferences. We measure the relative importance of these two stages by training LIMA, a 65B parameter LLaMa language model fine-tuned with the standard supervised loss on only 1,000 carefully curated prompts and responses, without any reinforcement learning or human preference modeling. LIMA demonstrates remarkably strong performance, learning to follow specific response formats from only a handful of examples in the training data, including complex queries that range from planning trip itineraries to speculating about alternate history. Moreover, the model tends to generalize well to unseen tasks that did not appear in the training data. In a controlled human study, responses from LIMA are either equivalent or strictly preferred to GPT-4 in 43% of cases; this statistic is as high as 58% when compared to Bard and 65% versus DaVinci003, which was trained with human feedback. Taken together, these results strongly suggest that almost all knowledge in large language models is learned during pretraining, and only limited instruction tuning data is necessary to teach models to produce high quality output.
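
The "standard supervised loss" here is just next-token cross-entropy on the curated responses; a minimal PyTorch-style sketch follows, where the tokenizer and model are generic placeholders and the prompt-masking detail is an assumption about a typical recipe rather than a quote of the paper's.

import torch
import torch.nn.functional as F

def supervised_finetune_loss(model, tokenize, prompt, response):
    """Cross-entropy on response tokens only; prompt tokens are masked out.
    `tokenize(text)` is assumed to return a list of token ids, and
    `model(input_ids)` a (batch, seq_len, vocab) tensor of logits."""
    prompt_ids, response_ids = tokenize(prompt), tokenize(response)
    input_ids = torch.tensor([prompt_ids + response_ids])
    labels = input_ids.clone()
    labels[0, :len(prompt_ids)] = -100            # ignore prompt positions
    logits = model(input_ids)
    # shift so each position predicts the next token
    return F.cross_entropy(logits[0, :-1], labels[0, 1:], ignore_index=-100)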

Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation

May 02, 2023
Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, Omer Levy

The ability to collect a large dataset of human preferences from text-to-image users is usually limited to companies, making such datasets inaccessible to the public. To address this issue, we create a web app that enables text-to-image users to generate images and specify their preferences. Using this web app, we build Pick-a-Pic, a large, open dataset of text-to-image prompts and real users' preferences over generated images. We leverage this dataset to train a CLIP-based scoring function, PickScore, which exhibits superhuman performance on the task of predicting human preferences. Then, we test PickScore's ability to perform model evaluation and observe that it correlates better with human rankings than other automatic evaluation metrics. Therefore, we recommend using PickScore for evaluating future text-to-image generation models, and using Pick-a-Pic prompts as a more relevant dataset than MS-COCO. Finally, we demonstrate how PickScore can enhance existing text-to-image models via ranking.
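
As an illustration of what a CLIP-based scoring function can look like, here is a hedged sketch: score each image by scaled cosine similarity with the prompt embedding, and turn a pair of scores into a preference probability with a softmax. The function names and temperature value are assumptions, not the released PickScore API.

import torch

def clip_preference_score(text_emb, image_embs, temperature=100.0):
    """Scaled cosine similarity between one prompt embedding (d,) and a
    batch of image embeddings (n, d)."""
    text_emb = text_emb / text_emb.norm()
    image_embs = image_embs / image_embs.norm(dim=-1, keepdim=True)
    return temperature * (image_embs @ text_emb)      # shape (n,)

def preference_probability(text_emb, image_a_emb, image_b_emb):
    """Probability that a user prefers image A over image B for this prompt."""
    scores = clip_preference_score(text_emb,
                                   torch.stack([image_a_emb, image_b_emb]))
    return torch.softmax(scores, dim=0)[0]            # P(A preferred)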

Vision Transformers with Mixed-Resolution Tokenization

Apr 27, 2023
Tomer Ronen, Omer Levy, Avram Golbert

Vision Transformer models process input images by dividing them into a spatially regular grid of equal-size patches. In contrast, Transformers were originally introduced over natural language sequences, where each token represents a subword - a chunk of raw data of arbitrary size. In this work, we apply this approach to Vision Transformers by introducing a novel image tokenization scheme, replacing the standard uniform grid with a mixed-resolution sequence of tokens, where each token represents a patch of arbitrary size. Using the Quadtree algorithm and a novel saliency scorer, we construct a patch mosaic in which low-saliency areas of the image are processed in low resolution, routing more of the model's capacity to important image regions. Using the same architecture as vanilla ViTs, our Quadformer models achieve substantial accuracy gains on image classification when controlling for the computational budget. Code and models are publicly available at https://github.com/TomerRonen34/mixed-resolution-vit.
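
A simplified sketch of the quadtree idea: keep splitting the most salient patches until a token budget is reached, so important regions end up as many small patches and flat regions as a few large ones. The greedy splitting rule and the saliency callable are stand-ins for the paper's learned saliency scorer, not its exact algorithm.

import heapq

def quadtree_patches(image_size, saliency, num_tokens, min_patch=16):
    """Return a mixed-resolution list of (x, y, size) patches.
    `saliency(x, y, size) -> float` is a placeholder scoring function.
    Greedy and approximate: the final count can exceed num_tokens by up to three."""
    heap = [(-saliency(0, 0, image_size), (0, 0, image_size))]  # max-heap via negation
    leaves = []
    while heap and len(heap) + len(leaves) < num_tokens:
        _, (x, y, size) = heapq.heappop(heap)
        if size <= min_patch:                 # too small to split further
            leaves.append((x, y, size))
            continue
        half = size // 2
        for dx in (0, half):                  # split into four quadrants
            for dy in (0, half):
                child = (x + dx, y + dy, half)
                heapq.heappush(heap, (-saliency(*child), child))
    return leaves + [patch for _, patch in heap]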

X&Fuse: Fusing Visual Information in Text-to-Image Generation

Mar 02, 2023
Yuval Kirstain, Omer Levy, Adam Polyak

We introduce X&Fuse, a general approach for conditioning on visual information when generating images from text. We demonstrate the potential of X&Fuse in three different text-to-image generation scenarios. (i) When a bank of images is available, we retrieve and condition on a related image (Retrieve&Fuse), resulting in significant improvements on the MS-COCO benchmark, with a state-of-the-art FID score of 6.65 in zero-shot settings. (ii) When cropped-object images are at hand, we utilize them and perform subject-driven generation (Crop&Fuse), outperforming the textual inversion method while being more than 100x faster. (iii) Having oracle access to the image scene (Scene&Fuse) allows us to achieve an FID score of 5.03 on MS-COCO in zero-shot settings. Our experiments indicate that X&Fuse is an effective, easy-to-adapt, simple, and general approach for scenarios in which the model may benefit from additional visual information.
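
For the Retrieve&Fuse setting, a hedged sketch of the retrieval step alone is given below: pick the bank image whose embedding is closest to the prompt's and pass it to the generator as extra conditioning. This is generic nearest-neighbour retrieval under assumed CLIP-style embeddings, not the paper's exact pipeline, and the fusion mechanism itself is omitted.

import numpy as np

def retrieve_reference(prompt_emb, bank_embs, bank_images):
    """Return the bank image most similar to the prompt embedding.
    prompt_emb: (d,), bank_embs: (n, d), bank_images: list of length n."""
    prompt_emb = prompt_emb / np.linalg.norm(prompt_emb)
    bank_embs = bank_embs / np.linalg.norm(bank_embs, axis=1, keepdims=True)
    best = int(np.argmax(bank_embs @ prompt_emb))
    return bank_images[best]   # fed to the generator alongside the text prompt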

Scaling Laws for Generative Mixed-Modal Language Models

Jan 10, 2023
Armen Aghajanyan, Lili Yu, Alexis Conneau, Wei-Ning Hsu, Karen Hambardzumyan, Susan Zhang, Stephen Roller, Naman Goyal, Omer Levy, Luke Zettlemoyer

Generative language models define distributions over sequences of tokens that can represent essentially any combination of data modalities (e.g., any permutation of image tokens from VQ-VAEs, speech tokens from HuBERT, BPE tokens for language or code, and so on). To better understand the scaling properties of such mixed-modal models, we conducted over 250 experiments using seven different modalities and model sizes ranging from 8 million to 30 billion parameters, trained on 5-100 billion tokens. We report new mixed-modal scaling laws that unify the contributions of individual modalities and the interactions between them. Specifically, we explicitly model the optimal synergy and competition due to data and model size as an additive term to previous uni-modal scaling laws. We also report four empirical phenomena observed during training, such as emergent coordinate-ascent-style training that naturally alternates between modalities, guidelines for selecting critical hyper-parameters, and connections between mixed-modal competition and training stability. Finally, we test our scaling law by training a 30B speech-text model, which significantly outperforms the corresponding unimodal models. Overall, our research provides valuable insights into the design and training of mixed-modal generative models, an important new class of unified models that have unique distributional properties.
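
For intuition only, one way to write such a law is sketched below in LaTeX: a Chinchilla-style uni-modal term per modality plus an additive interaction term whose sign captures synergy versus competition between modalities i and j. The functional form and symbols are illustrative assumptions, not the paper's fitted equations.

% Illustrative sketch, not the paper's exact functional form.
\mathcal{L}_i(N, D) \;=\; E_i + \frac{A_i}{N^{\alpha_i}} + \frac{B_i}{D^{\beta_i}}
\qquad \text{(uni-modal)}

\mathcal{L}_{i,j}(N, D) \;\approx\;
\tfrac{1}{2}\big[\mathcal{L}_i(N, D) + \mathcal{L}_j(N, D)\big]
\;+\; \frac{C_{i,j}}{N^{\gamma_{i,j}} D^{\delta_{i,j}}}
\qquad \text{(interaction: synergy if } C_{i,j} < 0\text{)}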

Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor

Dec 19, 2022
Or Honovich, Thomas Scialom, Omer Levy, Timo Schick

Instruction tuning enables pretrained language models to perform new tasks from inference-time natural language descriptions. These approaches rely on vast amounts of human supervision in the form of crowdsourced datasets or user interactions. In this work, we introduce Unnatural Instructions: a large dataset of creative and diverse instructions, collected with virtually no human labor. We collect 64,000 examples by prompting a language model with three seed examples of instructions and eliciting a fourth. This set is then expanded by prompting the model to rephrase each instruction, creating a total of approximately 240,000 examples of instructions, inputs, and outputs. Experiments show that despite containing a fair amount of noise, training on Unnatural Instructions rivals the effectiveness of training on open-source manually-curated datasets, surpassing the performance of models such as T0++ and Tk-Instruct across various benchmarks. These results demonstrate the potential of model-generated data as a cost-effective alternative to crowdsourcing for dataset expansion and diversification.

* 18 pages, 7 figures 
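
A hedged sketch of the elicitation step: show the model three seed examples and ask it to continue with a fourth. The prompt wording is an assumption, and generate stands in for whatever language-model API is used.

import random

def elicitation_prompt(seed_examples):
    """Format three seed examples and ask for a fourth in the same pattern."""
    parts = []
    for i, ex in enumerate(seed_examples, start=1):
        parts.append(f"Example {i}\nInstruction: {ex['instruction']}\n"
                     f"Input: {ex['input']}\n")
    parts.append("Example 4\nInstruction:")
    return "\n".join(parts)

def elicit_new_examples(generate, seed_pool, num_examples):
    """Repeatedly sample three seeds and elicit a fourth; `generate` is a
    placeholder callable around a language model. Deduplication, rephrasing,
    and output generation are separate steps omitted here."""
    collected = []
    while len(collected) < num_examples:
        seeds = random.sample(seed_pool, 3)
        collected.append(generate(elicitation_prompt(seeds)))
    return collected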