Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Soroush Vosoughi

Scaled Supervision is an Implicit Lipschitz Regularizer

Mar 19, 2025

Zhongyu Ouyang, Chunhui Zhang, Yaning Jia, Soroush Vosoughi

Abstract:In modern social media, recommender systems (RecSys) rely on the click-through rate (CTR) as the standard metric to evaluate user engagement. CTR prediction is traditionally framed as a binary classification task to predict whether a user will interact with a given item. However, this approach overlooks the complexity of real-world social modeling, where the user, item, and their interactive features change dynamically in fast-paced online environments. This dynamic nature often leads to model instability, reflected in overfitting short-term fluctuations rather than higher-level interactive patterns. While overfitting calls for more scaled and refined supervisions, current solutions often rely on binary labels that overly simplify fine-grained user preferences through the thresholding process, which significantly reduces the richness of the supervision. Therefore, we aim to alleviate the overfitting problem by increasing the supervision bandwidth in CTR training. Specifically, (i) theoretically, we formulate the impact of fine-grained preferences on model stability as a Lipschitz constrain; (ii) empirically, we discover that scaling the supervision bandwidth can act as an implicit Lipschitz regularizer, stably optimizing existing CTR models to achieve better generalizability. Extensive experiments show that this scaled supervision significantly and consistently improves the optimization process and the performance of existing CTR models, even without the need for additional hyperparameter tuning.

* Accepted to the International AAAI Conference on Web and Social Media (ICWSM 2025)

Via

Access Paper or Ask Questions

Superficial Self-Improved Reasoners Benefit from Model Merging

Mar 03, 2025

Xiangchi Yuan, Chunhui Zhang, Zheyuan Liu, Dachuan Shi, Soroush Vosoughi, Wenke Lee

Abstract:As scaled language models (LMs) approach human-level reasoning capabilities, self-improvement emerges as a solution to synthesizing high-quality data corpus. While previous research has identified model collapse as a risk in self-improvement, where model outputs become increasingly deterministic, we discover a more fundamental challenge: the superficial self-improved reasoners phenomenon. In particular, our analysis reveals that even when LMs show improved in-domain (ID) reasoning accuracy, they actually compromise their generalized reasoning capabilities on out-of-domain (OOD) tasks due to memorization rather than genuine. Through a systematic investigation of LM architecture, we discover that during self-improvement, LM weight updates are concentrated in less reasoning-critical layers, leading to superficial learning. To address this, we propose Iterative Model Merging (IMM), a method that strategically combines weights from original and self-improved models to preserve generalization while incorporating genuine reasoning improvements. Our approach effectively mitigates both LM collapse and superficial learning, moving towards more stable self-improving systems.

Via

Access Paper or Ask Questions

Pretrained Image-Text Models are Secretly Video Captioners

Feb 19, 2025

Chunhui Zhang, Yiren Jian, Zhongyu Ouyang, Soroush Vosoughi

Abstract:Developing video captioning models is computationally expensive. The dynamic nature of video also complicates the design of multimodal models that can effectively caption these sequences. However, we find that by using minimal computational resources and without complex modifications to address video dynamics, an image-based model can be repurposed to outperform several specialised video captioning systems. Our adapted model demonstrates top tier performance on major benchmarks, ranking 2nd on MSRVTT and MSVD, and 3rd on VATEX. We transform it into a competitive video captioner by post training a typical image captioning model BLIP2 with only 6,000 video text pairs and simply concatenating frames (significantly fewer data than other methods), which use 2.5 to 144 million pairs. From a resource optimization perspective, this video captioning study focuses on three fundamental factors: optimizing model scale, maximizing data efficiency, and incorporating reinforcement learning. This extensive study demonstrates that a lightweight, image based adaptation strategy can rival state-of-the-art video captioning systems, offering a practical solution for low-resource scenarios.

* Accepted to the 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL 2025). The first two authors contributed equally and were listed in random order

Via

Access Paper or Ask Questions

Communication is All You Need: Persuasion Dataset Construction via Multi-LLM Communication

Feb 13, 2025

Weicheng Ma, Hefan Zhang, Ivory Yang, Shiyu Ji, Joice Chen, Farnoosh Hashemi, Shubham Mohole, Ethan Gearey, Michael Macy, Saeed Hassanpour(+1 more)

Figure 1 for Communication is All You Need: Persuasion Dataset Construction via Multi-LLM Communication

Figure 2 for Communication is All You Need: Persuasion Dataset Construction via Multi-LLM Communication

Figure 3 for Communication is All You Need: Persuasion Dataset Construction via Multi-LLM Communication

Figure 4 for Communication is All You Need: Persuasion Dataset Construction via Multi-LLM Communication

Abstract:Large Language Models (LLMs) have shown proficiency in generating persuasive dialogue, yet concerns about the fluency and sophistication of their outputs persist. This paper presents a multi-LLM communication framework designed to enhance the generation of persuasive data automatically. This framework facilitates the efficient production of high-quality, diverse linguistic content with minimal human oversight. Through extensive evaluations, we demonstrate that the generated data excels in naturalness, linguistic diversity, and the strategic use of persuasion, even in complex scenarios involving social taboos. The framework also proves adept at generalizing across novel contexts. Our results highlight the framework's potential to significantly advance research in both computational and social science domains concerning persuasive communication.

* Accepted to NAACL 2025 Main Conference

Via

Access Paper or Ask Questions

Temporal Working Memory: Query-Guided Segment Refinement for Enhanced Multimodal Understanding

Feb 09, 2025

Xingjian Diao, Chunhui Zhang, Weiyi Wu, Zhongyu Ouyang, Peijun Qing, Ming Cheng, Soroush Vosoughi, Jiang Gui

Figure 1 for Temporal Working Memory: Query-Guided Segment Refinement for Enhanced Multimodal Understanding

Figure 2 for Temporal Working Memory: Query-Guided Segment Refinement for Enhanced Multimodal Understanding

Figure 3 for Temporal Working Memory: Query-Guided Segment Refinement for Enhanced Multimodal Understanding

Figure 4 for Temporal Working Memory: Query-Guided Segment Refinement for Enhanced Multimodal Understanding

Abstract:Multimodal foundation models (MFMs) have demonstrated significant success in tasks such as visual captioning, question answering, and image-text retrieval. However, these models face inherent limitations due to their finite internal capacity, which restricts their ability to process extended temporal sequences, a crucial requirement for comprehensive video and audio analysis. To overcome these challenges, we introduce a specialized cognitive module, temporal working memory (TWM), which aims to enhance the temporal modeling capabilities of MFMs. It selectively retains task-relevant information across temporal dimensions, ensuring that critical details are preserved throughout the processing of video and audio content. The TWM uses a query-guided attention approach to focus on the most informative multimodal segments within temporal sequences. By retaining only the most relevant content, TWM optimizes the use of the model's limited capacity, enhancing its temporal modeling ability. This plug-and-play module can be easily integrated into existing MFMs. With our TWM, nine state-of-the-art models exhibit significant performance improvements across tasks such as video captioning, question answering, and video-text retrieval. By enhancing temporal modeling, TWM extends the capability of MFMs to handle complex, time-sensitive data effectively. Our code is available at https://github.com/xid32/NAACL_2025_TWM.

* Accepted at NAACL 2025

Via

Access Paper or Ask Questions

Is It Navajo? Accurate Language Detection in Endangered Athabaskan Languages

Jan 27, 2025

Ivory Yang, Weicheng Ma, Chunhui Zhang, Soroush Vosoughi

Figure 1 for Is It Navajo? Accurate Language Detection in Endangered Athabaskan Languages

Figure 2 for Is It Navajo? Accurate Language Detection in Endangered Athabaskan Languages

Figure 3 for Is It Navajo? Accurate Language Detection in Endangered Athabaskan Languages

Figure 4 for Is It Navajo? Accurate Language Detection in Endangered Athabaskan Languages

Abstract:Endangered languages, such as Navajo - the most widely spoken Native American language - are significantly underrepresented in contemporary language technologies, exacerbating the challenges of their preservation and revitalization. This study evaluates Google's large language model (LLM)-based language identification system, which consistently misidentifies Navajo, exposing inherent limitations when applied to low-resource Native American languages. To address this, we introduce a random forest classifier trained on Navajo and eight frequently confused languages. Despite its simplicity, the classifier achieves near-perfect accuracy (97-100%), significantly outperforming Google's LLM-based system. Additionally, the model demonstrates robustness across other Athabaskan languages - a family of Native American languages spoken primarily in Alaska, the Pacific Northwest, and parts of the Southwestern United States - suggesting its potential for broader application. Our findings underscore the pressing need for NLP systems that prioritize linguistic diversity and adaptability over centralized, one-size-fits-all solutions, especially in supporting underrepresented languages in a multicultural world. This work directly contributes to ongoing efforts to address cultural biases in language models and advocates for the development of culturally localized NLP tools that serve diverse linguistic communities.

* Accepted to NAACL 2025 Main

Via

Access Paper or Ask Questions

NüshuRescue: Revitalization of the endangered Nüshu Language with AI

Dec 03, 2024

Ivory Yang, Weicheng Ma, Soroush Vosoughi

Figure 1 for NüshuRescue: Revitalization of the endangered Nüshu Language with AI

Figure 2 for NüshuRescue: Revitalization of the endangered Nüshu Language with AI

Figure 3 for NüshuRescue: Revitalization of the endangered Nüshu Language with AI

Figure 4 for NüshuRescue: Revitalization of the endangered Nüshu Language with AI

Abstract:The preservation and revitalization of endangered and extinct languages is a meaningful endeavor, conserving cultural heritage while enriching fields like linguistics and anthropology. However, these languages are typically low-resource, making their reconstruction labor-intensive and costly. This challenge is exemplified by N\"ushu, a rare script historically used by Yao women in China for self-expression within a patriarchal society. To address this challenge, we introduce N\"ushuRescue, an AI-driven framework designed to train large language models (LLMs) on endangered languages with minimal data. N\"ushuRescue automates evaluation and expands target corpora to accelerate linguistic revitalization. As a foundational component, we developed NCGold, a 500-sentence N\"ushu-Chinese parallel corpus, the first publicly available dataset of its kind. Leveraging GPT-4-Turbo, with no prior exposure to N\"ushu and only 35 short examples from NCGold, N\"ushuRescue achieved 48.69\% translation accuracy on 50 withheld sentences and generated NCSilver, a set of 98 newly translated modern Chinese sentences of varying lengths. A sample of both NCGold and NCSilver is included in the Supplementary Materials. Additionally, we developed FastText-based and Seq2Seq models to further support research on N\"ushu. N\"ushuRescue provides a versatile and scalable tool for the revitalization of endangered languages, minimizing the need for extensive human input.

* Accepted to COLING 2025

Via

Access Paper or Ask Questions

ImpScore: A Learnable Metric For Quantifying The Implicitness Level of Language

Nov 07, 2024

Yuxin Wang, Xiaomeng Zhu, Weimin Lyu, Saeed Hassanpour, Soroush Vosoughi

Figure 1 for ImpScore: A Learnable Metric For Quantifying The Implicitness Level of Language

Figure 2 for ImpScore: A Learnable Metric For Quantifying The Implicitness Level of Language

Figure 3 for ImpScore: A Learnable Metric For Quantifying The Implicitness Level of Language

Figure 4 for ImpScore: A Learnable Metric For Quantifying The Implicitness Level of Language

Abstract:Handling implicit language is essential for natural language processing systems to achieve precise text understanding and facilitate natural interactions with users. Despite its importance, the absence of a robust metric for accurately measuring the implicitness of language significantly constrains the depth of analysis possible in evaluating models' comprehension capabilities. This paper addresses this gap by developing a scalar metric that quantifies the implicitness level of language without relying on external references. Drawing on principles from traditional linguistics, we define ''implicitness'' as the divergence between semantic meaning and pragmatic interpretation. To operationalize this definition, we introduce ImpScore, a novel, reference-free metric formulated through an interpretable regression model. This model is trained using pairwise contrastive learning on a specially curated dataset comprising $112,580$ (implicit sentence, explicit sentence) pairs. We validate ImpScore through a user study that compares its assessments with human evaluations on out-of-distribution data, demonstrating its accuracy and strong correlation with human judgments. Additionally, we apply ImpScore to hate speech detection datasets, illustrating its utility and highlighting significant limitations in current large language models' ability to understand highly implicit content. The metric model and its training data are available at https://github.com/audreycs/ImpScore.

Via

Access Paper or Ask Questions

Achieving Domain-Independent Certified Robustness via Knowledge Continuity

Nov 03, 2024

Alan Sun, Chiyu Ma, Kenneth Ge, Soroush Vosoughi

Figure 1 for Achieving Domain-Independent Certified Robustness via Knowledge Continuity

Figure 2 for Achieving Domain-Independent Certified Robustness via Knowledge Continuity

Figure 3 for Achieving Domain-Independent Certified Robustness via Knowledge Continuity

Figure 4 for Achieving Domain-Independent Certified Robustness via Knowledge Continuity

Abstract:We present knowledge continuity, a novel definition inspired by Lipschitz continuity which aims to certify the robustness of neural networks across input domains (such as continuous and discrete domains in vision and language, respectively). Most existing approaches that seek to certify robustness, especially Lipschitz continuity, lie within the continuous domain with norm and distribution-dependent guarantees. In contrast, our proposed definition yields certification guarantees that depend only on the loss function and the intermediate learned metric spaces of the neural network. These bounds are independent of domain modality, norms, and distribution. We further demonstrate that the expressiveness of a model class is not at odds with its knowledge continuity. This implies that achieving robustness by maximizing knowledge continuity should not theoretically hinder inferential performance. Finally, to complement our theoretical results, we present several applications of knowledge continuity such as regularization, a certification algorithm, and show that knowledge continuity can be used to localize vulnerable components of a neural network.

* 32 pages, 6 figures

Via

Access Paper or Ask Questions

On the Exploration of LM-Based Soft Modular Robot Design

Nov 01, 2024

Weicheng Ma, Luyang Zhao, Chun-Yi She, Yitao Jiang, Alan Sun, Bo Zhu, Devin Balkcom, Soroush Vosoughi

Figure 1 for On the Exploration of LM-Based Soft Modular Robot Design

Figure 2 for On the Exploration of LM-Based Soft Modular Robot Design

Figure 3 for On the Exploration of LM-Based Soft Modular Robot Design

Figure 4 for On the Exploration of LM-Based Soft Modular Robot Design

Abstract:Recent large language models (LLMs) have demonstrated promising capabilities in modeling real-world knowledge and enhancing knowledge-based generation tasks. In this paper, we further explore the potential of using LLMs to aid in the design of soft modular robots, taking into account both user instructions and physical laws, to reduce the reliance on extensive trial-and-error experiments typically needed to achieve robot designs that meet specific structural or task requirements. Specifically, we formulate the robot design process as a sequence generation task and find that LLMs are able to capture key requirements expressed in natural language and reflect them in the construction sequences of robots. To simplify, rather than conducting real-world experiments to assess design quality, we utilize a simulation tool to provide feedback to the generative model, allowing for iterative improvements without requiring extensive human annotations. Furthermore, we introduce five evaluation metrics to assess the quality of robot designs from multiple angles including task completion and adherence to instructions, supporting an automatic evaluation process. Our model performs well in evaluations for designing soft modular robots with uni- and bi-directional locomotion and stair-descending capabilities, highlighting the potential of using natural language and LLMs for robot design. However, we also observe certain limitations that suggest areas for further improvement.

* 8 pages, 7 figures

Via

Access Paper or Ask Questions