Title: Learning to Look Inside: Augmenting Token-Based Encoders with Character-Level Information

Aug 01, 2021

Yuval Pinter, Amanda Stent, Mark Dredze, Jacob Eisenstein

Abstract: Commonly used transformer language models depend on a tokenization schema that sets an unchangeable subword vocabulary prior to pre-training, destined to be applied to all downstream tasks regardless of domain shift, novel word formations, or other sources of vocabulary mismatch. Recent work has shown that "token-free" models can be trained directly on characters or bytes, but training these models from scratch requires substantial computational resources, and this implies discarding the many domain-specific models that were trained on tokens. In this paper, we present XRayEmb, a method for retrofitting existing token-based models with character-level information. XRayEmb is composed of a character-level "encoder" that computes vector representations of character sequences, and a generative component that decodes from the internal representation to a character sequence. We show that incorporating XRayEmb's learned vectors into sequences of pre-trained token embeddings helps performance on both autoregressive and masked pre-trained transformer architectures and on both sequence-level and sequence tagging tasks, particularly on non-standard English text.
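
The idea of mixing character-derived vectors into a sequence of token embeddings can be illustrated with a small toy sketch. This is not the XRayEmb implementation: the abstract does not specify the encoder architecture or the policy for choosing which words receive character-level vectors, so everything below is an assumption made for illustration. The vocabulary, the embedding width `DIM`, the random tables, the averaging "encoder", and the helper names `make_table`, `char_encode`, and `embed_sequence` are all hypothetical, and the generative (decoding) component is omitted entirely.

```python
import random

DIM = 4  # toy embedding width (assumption, chosen only to keep the example small)

def make_table(keys, seed):
    """Deterministic toy embedding table: key -> list of DIM floats."""
    rng = random.Random(seed)
    return {k: [rng.uniform(-1.0, 1.0) for _ in range(DIM)] for k in keys}

# Stand-in for a pre-trained token embedding table (hypothetical vocabulary).
TOKEN_EMB = make_table(["the", "movie", "was", "good"], seed=0)
# Stand-in for learned character embeddings.
CHAR_EMB = make_table("abcdefghijklmnopqrstuvwxyz", seed=1)

def char_encode(word):
    """Toy character-level encoder: the average of the word's character vectors.

    A trained encoder would be a neural network; averaging is a placeholder.
    """
    vecs = [CHAR_EMB[c] for c in word.lower() if c in CHAR_EMB]
    if not vecs:
        return [0.0] * DIM
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def embed_sequence(words):
    """Use the token embedding for in-vocabulary words and a
    character-derived vector for everything else."""
    out = []
    for w in words:
        if w in TOKEN_EMB:
            out.append(("token", TOKEN_EMB[w]))
        else:
            out.append(("char", char_encode(w)))
    return out

# "goooood" is a non-standard spelling missing from the toy vocabulary.
seq = embed_sequence(["the", "movie", "was", "goooood"])
sources = [src for src, _ in seq]
```

In this sketch the out-of-vocabulary word gets a single vector of the same width as the token embeddings, so the downstream model sees one position for it rather than several subword pieces. Whether the actual method replaces or supplements subword embeddings, and for which words, should be checked against the paper.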
