Abstract:Large language models (LLMs) have demonstrated remarkable reasoning capabilities when prompted with strategies such as Chain-of-Thought (CoT). However, these approaches focus on token-level output without considering internal weight dynamics. We introduce Weight-of-Thought (WoT) reasoning, a novel approach that examines neural network weights before inference to identify reasoning pathways. Unlike existing methods, WoT explores the weight space through graph-based message passing, multi-step reasoning processes, and attention mechanisms. Our implementation creates an interconnected graph of reasoning nodes. Experiments on diverse reasoning tasks (syllogistic, mathematical, algebraic, combinatorial, and geometric) demonstrate that WoT achieves superior performance compared to traditional methods, particularly for complex problems. This approach leads to both improved performance and greater interpretability of the reasoning process, offering a promising direction for enhancing LLM reasoning capabilities.
Abstract:The scarcity of high-quality, multimodal training data severely hinders the creation of lifelike avatar animations for conversational AI in virtual environments. Existing datasets often lack the intricate synchronization between speech, facial expressions, and body movements that characterize natural human communication. To address this critical gap, we introduce Allo-AVA, a large-scale dataset specifically designed for text and audio-driven avatar gesture animation in an allocentric (third person point-of-view) context. Allo-AVA consists of $\sim$1,250 hours of diverse video content, complete with audio, transcripts, and extracted keypoints. Allo-AVA uniquely maps these keypoints to precise timestamps, enabling accurate replication of human movements (body and facial gestures) in synchronization with speech. This comprehensive resource enables the development and evaluation of more natural, context-aware avatar animation models, potentially transforming applications ranging from virtual reality to digital assistants.
Abstract:As virtual agents become increasingly prevalent in human-computer interaction, generating realistic and contextually appropriate gestures in real-time remains a significant challenge. While neural rendering techniques have made substantial progress with static scripts, their applicability to human-computer interactions remains limited. To address this, we introduce Large Body Language Models (LBLMs) and present LBLM-AVA, a novel LBLM architecture that combines a Transformer-XL large language model with a parallelized diffusion model to generate human-like gestures from multimodal inputs (text, audio, and video). LBLM-AVA incorporates several key components enhancing its gesture generation capabilities, such as multimodal-to-pose embeddings, enhanced sequence-to-sequence mapping with redefined attention mechanisms, a temporal smoothing module for gesture sequence coherence, and an attention-based refinement module for enhanced realism. The model is trained on our large-scale proprietary open-source dataset Allo-AVA. LBLM-AVA achieves state-of-the-art performance in generating lifelike and contextually appropriate gestures with a 30% reduction in Fr\'echet Gesture Distance (FGD), and a 25% improvement in Fr\'echet Inception Distance compared to existing approaches.