Yichen Huang

Robustness Tests for Automatic Machine Translation Metrics with Adversarial Attacks

Nov 01, 2023
Yichen Huang, Timothy Baldwin

We investigate the performance of MT evaluation metrics on adversarially synthesized texts to shed light on metric robustness. We experiment with word- and character-level attacks on three popular machine translation metrics: BERTScore, BLEURT, and COMET. Our human experiments validate that automatic metrics tend to overpenalize adversarially degraded translations. We also identify inconsistencies in BERTScore ratings: it judges the original sentence and its adversarially degraded counterpart as similar, yet rates the degraded translation as notably worse than the original with respect to the reference. These patterns of brittleness motivate the development of more robust metrics.
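As a minimal sketch of the setup described above, the snippet below scores an original and a character-perturbed hypothesis against the same reference with BERTScore, via the public bert_score package. The char_swap perturbation and the example sentences are illustrative stand-ins, not the paper's actual attacks.

    import random
    from bert_score import score  # pip install bert-score

    def char_swap(text: str, rate: float = 0.1, seed: int = 0) -> str:
        """Crude character-level attack: randomly swap adjacent characters."""
        rng = random.Random(seed)
        chars = list(text)
        for i in range(len(chars) - 1):
            if rng.random() < rate:
                chars[i], chars[i + 1] = chars[i + 1], chars[i]
        return "".join(chars)

    reference = ["The committee approved the proposal yesterday."]
    original = ["The committee approved the proposal yesterday."]
    degraded = [char_swap(original[0])]

    # Compare BERTScore F1 for the clean vs. the degraded hypothesis.
    _, _, f1_orig = score(original, reference, lang="en")
    _, _, f1_degr = score(degraded, reference, lang="en")
    print(f"original: {f1_orig.item():.3f}  degraded: {f1_degr.item():.3f}")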

* Accepted in Findings of EMNLP 2023 

Learning Interpretable Low-dimensional Representation via Physical Symmetry

Feb 24, 2023
Xuanjie Liu, Daniel Chin, Yichen Huang, Gus Xia

Interpretable representation learning plays a key role in creative intelligent systems. In the music domain, current learning algorithms can successfully learn various features such as pitch, timbre, chord, and texture. However, most methods rely heavily on music domain knowledge. It remains an open question what general computational principles give rise to interpretable representations, especially low-dimensional factors that agree with human perception. In this study, we take inspiration from modern physics and use physical symmetry as a self-consistency constraint on the latent space: the prior model that characterises the dynamics of the latent states must be equivariant with respect to certain group transformations. We show that physical symmetry leads the model to learn a linear pitch factor from unlabelled monophonic music audio in a self-supervised fashion. The same methodology also applies to computer vision, learning a 3D Cartesian space from videos of a simple moving object without labels. Furthermore, physical symmetry naturally leads to representation augmentation, a new technique which improves sample efficiency.
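The symmetry constraint can be written as a simple training loss: the prior (the latent dynamics model) should commute with the group action. The sketch below, with an MLP prior and a translation action standing in for the actual architecture, is an assumption-laden illustration rather than the paper's implementation.

    import torch
    import torch.nn as nn

    latent_dim = 2
    # Prior model: predicts the next latent state z_{t+1} from z_t.
    prior = nn.Sequential(nn.Linear(latent_dim, 32), nn.Tanh(),
                          nn.Linear(32, latent_dim))

    def group_action(z: torch.Tensor, delta: torch.Tensor) -> torch.Tensor:
        """Illustrative group action: translating the latent (e.g., a pitch shift)."""
        return z + delta

    z_t = torch.randn(16, latent_dim)   # batch of latent states
    delta = torch.randn(1, latent_dim)  # a random group element

    # Equivariance: prior(g(z)) should match g(prior(z)); minimize the gap.
    lhs = prior(group_action(z_t, delta))
    rhs = group_action(prior(z_t), delta)
    equivariance_loss = nn.functional.mse_loss(lhs, rhs)
    print(equivariance_loss.item())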


UNITER-Based Situated Coreference Resolution with Rich Multimodal Input

Dec 07, 2021
Yichen Huang, Yuchen Wang, Yik-Cheung Tam

We present our work on the multimodal coreference resolution task of the Situated and Interactive Multimodal Conversation 2.0 (SIMMC 2.0) dataset, part of the tenth Dialog System Technology Challenge (DSTC10). We propose a UNITER-based model that exploits rich multimodal context, such as the textual dialog history, the object knowledge base, and the visual dialog scenes, to determine whether each object in the current scene is mentioned in the current dialog turn. Results show that the proposed approach substantially outperforms the official DSTC10 baseline, boosting the object F1 score from 36.6% to 77.3% on the development set and demonstrating the effectiveness of object representations built from rich multimodal input. With model ensembling, our model ranks second in the official evaluation on the object coreference resolution task with an F1 score of 73.3%.
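The per-object formulation above reduces to a binary decision for each candidate object. A rough sketch of that shape follows, with a generic transformer encoder standing in for UNITER; the feature dimensions and projections are hypothetical.

    import torch
    import torch.nn as nn

    d = 256
    text_proj = nn.Linear(768, d)       # projects dialog-history token embeddings
    obj_proj = nn.Linear(2048 + 4, d)   # visual feature + bounding box per object
    layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
    encoder = nn.TransformerEncoder(layer, num_layers=2)
    mention_head = nn.Linear(d, 1)      # binary: is this object referenced?

    tokens = torch.randn(1, 40, 768)    # dialog history encoding (e.g., from BERT)
    objects = torch.randn(1, 12, 2052)  # 12 candidate objects in the scene

    # Encode text and object tokens jointly, then score each object position.
    seq = torch.cat([text_proj(tokens), obj_proj(objects)], dim=1)
    hidden = encoder(seq)
    obj_hidden = hidden[:, 40:, :]                 # the object positions
    logits = mention_head(obj_hidden).squeeze(-1)  # one score per object
    print(torch.sigmoid(logits))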


Narrative Incoherence Detection

Dec 21, 2020
Deng Cai, Yizhe Zhang, Yichen Huang, Wai Lam, Bill Dolan

Motivated by the increasing popularity of intelligent editing assistants, we introduce and investigate the task of narrative incoherence detection: given a (corrupted) long-form narrative, decide whether there is a semantic discrepancy in the narrative flow. Specifically, we focus on detecting missing sentences and incoherent sentences. Despite its simple setup, this task is challenging, as the model needs to understand and analyze a multi-sentence narrative and make decisions at the sentence level. As an initial step towards this task, we implement several baselines that either directly analyze the raw text (token-level) or analyze learned sentence representations (sentence-level). We observe that while token-level modeling enjoys greater expressive power and hence better performance, sentence-level modeling has an advantage in efficiency and flexibility. With pre-training on large-scale data and cycle-consistent sentence embeddings, our extended sentence-level model achieves detection accuracy comparable to the token-level model. As a by-product, this strategy enables simultaneous incoherence detection and infilling/modification suggestions.
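The sentence-level baseline has a simple shape: encode each sentence to a vector, then classify each position in the narrative. The sketch below uses random vectors in place of a real sentence encoder; the modules and dimensions are illustrative assumptions.

    import torch
    import torch.nn as nn

    emb_dim, hidden = 384, 128
    # In the real setting these come from a pretrained sentence encoder.
    narrative = torch.randn(1, 8, emb_dim)  # 8 sentence embeddings

    rnn = nn.GRU(emb_dim, hidden, batch_first=True, bidirectional=True)
    head = nn.Linear(2 * hidden, 2)  # per-sentence: coherent vs. incoherent

    states, _ = rnn(narrative)
    logits = head(states)       # shape (1, 8, 2)
    pred = logits.argmax(-1)    # flag suspect sentence positions
    print(pred)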


Neuro-Symbolic Visual Reasoning: Disentangling "Visual" from "Reasoning"

Jun 30, 2020
Saeed Amizadeh, Hamid Palangi, Oleksandr Polozov, Yichen Huang, Kazuhito Koishida

Visual reasoning tasks such as visual question answering (VQA) require an interplay of visual perception with reasoning about the question semantics grounded in that perception. Benchmarks for reasoning across language and vision, such as VQA, VCR, and more recently GQA for compositional question answering, facilitate scientific progress from perception models toward visual reasoning. However, recent advances are still primarily driven by improvements in perception (e.g., scene graph generation) rather than in reasoning. Neuro-symbolic models such as Neural Module Networks bring the benefits of compositional reasoning to VQA, but they remain entangled with visual representation learning, so their neural reasoning is hard to improve and assess on its own. To address this, we propose (1) a framework to isolate and evaluate the reasoning aspect of VQA separately from its perception, and (2) a novel top-down calibration technique that allows the model to answer reasoning questions even with imperfect perception. To this end, we introduce a differentiable first-order logic formalism for VQA that explicitly decouples question answering from visual perception. On the challenging GQA dataset, we use this framework to perform in-depth, disentangled comparisons between well-known VQA models, yielding informative insights about both the participating models and the task.
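To give a flavor of a differentiable first-order logic (the paper's exact formalism differs), predicates can be represented as vectors of per-object probabilities, with quantifiers becoming smooth aggregations, as in this toy example:

    import torch

    p_red = torch.tensor([0.9, 0.1, 0.8])    # P(object_i is red)
    p_cube = torch.tensor([0.7, 0.95, 0.2])  # P(object_i is a cube)

    soft_and = p_red * p_cube                          # per-object conjunction
    exists = 1.0 - torch.prod(1.0 - soft_and)          # soft "some red cube exists"
    # Soft implication 1 - p + p*q, aggregated as a universal quantifier.
    forall = torch.prod(1.0 - p_red + p_red * p_cube)  # "every red thing is a cube"
    print(float(exists), float(forall))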

* To be published in Proceedings of the 37th International Conference on Machine Learning (ICML), Vienna, Austria, PMLR 119, 2020 

INSET: Sentence Infilling with Inter-sentential Generative Pre-training

Nov 10, 2019
Yichen Huang, Yizhe Zhang, Oussama Elachqar, Yu Cheng

Missing sentence generation (or sentence infilling) enables a wide range of applications in natural language generation, such as document auto-completion and meeting note expansion. The task asks the model to generate an intermediate missing sentence that semantically and syntactically bridges the surrounding context. Solving it requires NLP techniques spanning natural language understanding, discourse-level planning, and natural language generation. In this paper, we present a framework that decouples the challenge into these three aspects and addresses each in turn, leveraging the power of existing large-scale pre-trained models such as BERT and GPT-2. Our empirical results demonstrate the effectiveness of the proposed model both in learning a sentence representation for generation and in generating a missing sentence that bridges the context.
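The decoupling can be pictured as three stages: a sentence encoder maps each context sentence to a vector, a small planner predicts an embedding for the gap, and a decoder (omitted here) maps that embedding back to text. The sketch below illustrates the middle stage only; all modules and dimensions are hypothetical.

    import torch
    import torch.nn as nn

    d = 768
    # Embeddings of the sentences around the gap (from a BERT-like encoder).
    context = torch.randn(1, 6, d)                  # 3 sentences before + 3 after
    gap_query = nn.Parameter(torch.randn(1, 1, d))  # learned query marking the gap

    layer = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
    planner = nn.TransformerEncoder(layer, num_layers=2)

    # Insert the gap query between the two context halves and encode.
    seq = torch.cat([context[:, :3], gap_query, context[:, 3:]], dim=1)
    predicted = planner(seq)[:, 3]  # embedding predicted at the gap position
    print(predicted.shape)          # (1, 768): would be fed to the decoder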

* Y.H. and Y.Z. contributed equally to this work 

Provably efficient neural network representation for image classification

Nov 13, 2017
Yichen Huang

The state-of-the-art approaches to image classification are based on neural networks. Mathematically, classifying images is equivalent to finding the function that maps an image to its associated label. To rigorously establish the success of neural network methods, one should first prove that this function has an efficient neural network representation, and then design provably efficient training algorithms to find such a representation. Here, we achieve the first goal based on a set of assumptions about the patterns in images. These assumptions are intuitively valid in many image classification problems, including, but not limited to, recognizing handwritten digits.
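To see why efficiency is the interesting part of the claim, compare a naive lookup-table representation of the classification function with a layered network. The numbers below are a back-of-the-envelope illustration unrelated to the paper's actual construction.

    # A lookup table over all n-pixel binary images is exponential in n,
    # while a layered network has polynomially many parameters.
    n = 28 * 28                  # pixels in an MNIST-sized binary image
    table_size = 2 ** n          # entries in a naive lookup-table "function"
    layers, width = 4, 512
    net_params = n * width + (layers - 1) * width * width + width * 10
    print(f"lookup table entries: ~10^{len(str(table_size)) - 1}")
    print(f"network parameters:   {net_params:,}")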
