Dinh Bach Vu

Lucy: edgerunning agentic web search on mobile with machine generated task vectors

Aug 01, 2025

Alan Dao, Dinh Bach Vu, Alex Nguyen, Norapat Buppodom

Abstract: Small language models (SLMs) are inherently limited in knowledge-intensive tasks due to their constrained capacity. While test-time computation offers a path to enhanced performance, most approaches treat reasoning as a fixed or heuristic process. In this work, we propose a new paradigm: viewing the model's internal reasoning, delimited by <think> and </think> tags, as a dynamic task vector machine. Rather than treating the content inside these tags as a mere trace of thought, we interpret the generation process itself as a mechanism through which the model constructs and refines its own task vectors on the fly. We developed a method to optimize this dynamic task vector machine through RLVR and successfully trained an agentic web-search model. We present Lucy, a 1.7B-parameter SLM that leverages this dynamic reasoning mechanism with MCP integration to achieve 78.3% accuracy on the SimpleQA benchmark, performing on par with much larger models such as DeepSeek-V3. This demonstrates that small models can rival large ones when equipped with structured, self-constructed task reasoning.

Speechless: Speech Instruction Training Without Speech for Low Resource Languages

May 23, 2025

Alan Dao, Dinh Bach Vu, Huy Hoang Ha, Tuan Le Duc Anh, Shreyas Gopal, Yue Heng Yeo, Warren Keng Hoong Low, Eng Siong Chng, Jia Qi Yip

Abstract: The rapid growth of voice assistants powered by large language models (LLMs) has highlighted a need for speech instruction data to train these systems. Despite the abundance of speech recognition data, there is a notable scarcity of speech instruction data, which is essential for fine-tuning models to understand and execute spoken commands. Generating high-quality synthetic speech requires a good text-to-speech (TTS) model, which may not be available for low-resource languages. Our novel approach addresses this challenge by halting synthesis at the semantic representation level, bypassing the need for TTS. We achieve this by aligning synthetic semantic representations with the pre-trained Whisper encoder, enabling an LLM to be fine-tuned on text instructions while maintaining the ability to understand spoken instructions during inference. This simplified training process is a promising approach to building voice assistants for low-resource languages.

* This paper was accepted by INTERSPEECH 2025

AlphaSpace: Enabling Robotic Actions through Semantic Tokenization and Symbolic Reasoning

Mar 27, 2025

Alan Dao, Dinh Bach Vu, Bui Quang Huy

Abstract: This paper presents AlphaSpace, a novel methodology designed to enhance the spatial reasoning capabilities of language models for robotic manipulation in 3D Cartesian space. AlphaSpace employs a hierarchical semantics-based tokenization strategy that encodes spatial information at both coarse and fine-grained levels. Our approach represents objects with their attributes, positions, and height information through structured tokens, enabling precise spatial reasoning without relying on traditional vision-based embeddings. This approach enables LLMs to accurately manipulate objects by positioning them at specific (x, y, z) coordinates. Experimental results suggest that AlphaSpace demonstrates promising potential for improving manipulation tasks, achieving a total accuracy of 66.67%, compared to 37.5% for GPT-4o and 29.17% for Claude 3.5 Sonnet. These results demonstrate the potential of structured spatial encoding for manipulation tasks and warrant further exploration.
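
To make the coarse/fine idea in the AlphaSpace abstract concrete, here is a minimal sketch of one way a position could be split into structured tokens. The abstract does not give the actual token vocabulary, so the grid size, the cell size, and the token strings (`<coarse_…>`, `<fine_…>`, `<height_…>`) are all assumptions made up for illustration.

```python
import re

# Hypothetical layout: a 100x100 workspace split into 10x10 coarse cells.
GRID = 100
CELL = 10


def encode_position(x, y, z):
    """Split an integer (x, y, z) into coarse, fine, and height tokens."""
    if not (0 <= x < GRID and 0 <= y < GRID and z >= 0):
        raise ValueError("position outside the assumed workspace")
    cx, fx = divmod(x, CELL)  # coarse cell index and offset inside the cell
    cy, fy = divmod(y, CELL)
    return [f"<coarse_{cx}_{cy}>", f"<fine_{fx}_{fy}>", f"<height_{z}>"]


def decode_position(tokens):
    """Invert encode_position for a three-token list."""
    cx, cy = map(int, re.fullmatch(r"<coarse_(\d+)_(\d+)>", tokens[0]).groups())
    fx, fy = map(int, re.fullmatch(r"<fine_(\d+)_(\d+)>", tokens[1]).groups())
    (z,) = map(int, re.fullmatch(r"<height_(\d+)>", tokens[2]).groups())
    return cx * CELL + fx, cy * CELL + fy, z


if __name__ == "__main__":
    tokens = encode_position(37, 82, 5)
    print(tokens, decode_position(tokens))
```

The point of the two-level split is that a model can first commit to a region and then refine within it, using a small fixed vocabulary rather than one token per coordinate pair.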

PoseLess: Depth-Free Vision-to-Joint Control via Direct Image Mapping with VLM

Mar 11, 2025

Alan Dao, Dinh Bach Vu, Tuan Le Duc Anh, Bui Quang Huy

Abstract: This paper introduces PoseLess, a novel framework for robot hand control that eliminates the need for explicit pose estimation by directly mapping 2D images to joint angles using projected representations. Our approach leverages synthetic training data generated through randomized joint configurations, enabling zero-shot generalization to real-world scenarios and cross-morphology transfer from robotic to human hands. By projecting visual inputs and employing a transformer-based decoder, PoseLess achieves robust, low-latency control while addressing challenges such as depth ambiguity and data scarcity. Experimental results demonstrate competitive performance in joint angle prediction accuracy without relying on any human-labelled dataset.

AlphaMaze: Enhancing Large Language Models' Spatial Intelligence via GRPO

Feb 21, 2025

Alan Dao, Dinh Bach Vu

Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in language processing, yet they often struggle with tasks requiring genuine visual spatial reasoning. In this paper, we introduce a novel two-stage training framework designed to equip standard LLMs with visual reasoning abilities for maze navigation. First, we leverage Supervised Fine-Tuning (SFT) on a curated dataset of tokenized maze representations to teach the model to predict step-by-step movement commands. Next, we apply Group Relative Policy Optimization (GRPO), a technique used in DeepSeekR1, with a carefully crafted reward function to refine the model's sequential decision-making and encourage emergent chain-of-thought behaviors. Experimental results on synthetically generated mazes show that while a baseline model fails to navigate the maze, the SFT-trained model achieves 86% accuracy, and further GRPO fine-tuning boosts accuracy to 93%. Qualitative analyses reveal that GRPO fosters more robust and self-corrective reasoning, highlighting the potential of our approach to bridge the gap between language models and visual spatial tasks. These findings offer promising implications for applications in robotics, autonomous navigation, and other domains that require integrated visual and sequential reasoning.
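
For readers unfamiliar with GRPO, the core step can be sketched briefly: several completions are sampled for the same prompt, each is scored by a reward function, and each reward is normalized against the group's mean and standard deviation to give an advantage. The sketch below shows only that normalization. The `maze_reward` function and its weights are hypothetical stand-ins, since the abstract does not specify the paper's reward design.

```python
from statistics import mean, pstdev


def maze_reward(moves_valid, reached_goal):
    """Hypothetical reward: small credit for legal moves, more for reaching the goal."""
    return 0.2 * float(moves_valid) + 0.8 * float(reached_goal)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its sampling group (the GRPO advantage)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    # eps avoids division by zero when every sample gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled completions for one maze prompt (illustrative outcomes).
    outcomes = [(True, True), (True, False), (False, False), (True, True)]
    rewards = [maze_reward(v, g) for v, g in outcomes]
    for r, a in zip(rewards, group_relative_advantages(rewards)):
        print(f"reward={r:.2f} advantage={a:+.3f}")
```

Because advantages are relative to the group, no separate value network is needed; completions that beat their siblings get positive weight and the rest get negative weight.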

Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant

Oct 20, 2024

Alan Dao, Dinh Bach Vu, Huy Hoang Ha

Abstract: Large Language Models (LLMs) have revolutionized natural language processing, but their application to speech-based tasks remains challenging due to the complexities of integrating audio and text modalities. This paper introduces Ichigo, a mixed-modal model that seamlessly processes interleaved sequences of speech and text. Utilizing a tokenized early-fusion approach, Ichigo quantizes speech into discrete tokens and employs a uniform transformer-based architecture for both speech and text modalities. This method enables joint reasoning and generation across modalities without the need for separate adapters. We present a comprehensive training methodology, including pre-training on multilingual speech recognition datasets and fine-tuning on a curated instruction dataset. Ichigo demonstrates state-of-the-art performance on speech question-answering benchmarks, outperforming existing open-source speech language models and achieving comparable results to cascaded systems. Notably, Ichigo exhibits a latency of just 111 ms to first token generation, significantly lower than current models. Our approach not only advances the field of multimodal AI but also provides a framework for smaller research teams to contribute effectively to open-source speech-language models.
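
The tokenized early-fusion idea can be illustrated with a toy sketch: each speech frame is mapped to the index of its nearest codebook vector, and the resulting discrete IDs are placed in the same token sequence as text. This is a generic vector-quantization illustration, not Ichigo's actual quantizer. The codebook values and the token strings (`<|sound_start|>`, `<|sound_0001|>`, and so on) are assumptions chosen for the example.

```python
def quantize(frames, codebook):
    """Map each feature frame to the index of its nearest codebook vector."""
    ids = []
    for frame in frames:
        # squared Euclidean distance from this frame to every code
        dists = [sum((a - b) ** 2 for a, b in zip(frame, code)) for code in codebook]
        ids.append(dists.index(min(dists)))
    return ids


def build_sequence(sound_ids, text):
    """Place speech tokens and text tokens in one sequence (hypothetical token names)."""
    speech = [f"<|sound_{i:04d}|>" for i in sound_ids]
    return ["<|sound_start|>", *speech, "<|sound_end|>", *text.split()]


if __name__ == "__main__":
    codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]  # toy 2-D codebook
    frames = [[0.1, -0.1], [0.9, 1.2], [4.0, 6.0]]   # toy speech features
    print(build_sequence(quantize(frames, codebook), "please transcribe this"))
```

Once speech is expressed as ordinary tokens, a single transformer can attend over both modalities without a separate adapter, which is the property the abstract highlights.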
