Abstract:Audio-driven human animation methods, such as talking head and talking body generation, have made remarkable progress in generating synchronized facial movements and appealing visual quality videos. However, existing methods primarily focus on single human animation and struggle with multi-stream audio inputs, facing incorrect binding problems between audio and persons. Additionally, they exhibit limitations in instruction-following capabilities. To solve this problem, in this paper, we propose a novel task: Multi-Person Conversational Video Generation, and introduce a new framework, MultiTalk, to address the challenges during multi-person generation. Specifically, for audio injection, we investigate several schemes and propose the Label Rotary Position Embedding (L-RoPE) method to resolve the audio and person binding problem. Furthermore, during training, we observe that partial parameter training and multi-task training are crucial for preserving the instruction-following ability of the base model. MultiTalk achieves superior performance compared to other methods on several datasets, including talking head, talking body, and multi-person datasets, demonstrating the powerful generation capabilities of our approach.
Abstract:Foundation models hold significant potential for enabling robots to perform long-horizon general manipulation tasks. However, the simplicity of tasks and the uniformity of environments in existing benchmarks restrict their effective deployment in complex scenarios. To address this limitation, this paper introduces the \textit{RoboCAS} benchmark, the first benchmark specifically designed for complex object arrangement scenarios in robotic manipulation. This benchmark employs flexible and concise scripted policies to efficiently collect a diverse array of demonstrations, showcasing scattered, orderly, and stacked object arrangements within a highly realistic physical simulation environment. It includes complex processes such as target retrieval, obstacle clearance, and robot manipulation, testing agents' abilities to perform long-horizon planning for spatial reasoning and predicting chain reactions under ambiguous instructions. Extensive experiments on multiple baseline models reveal their limitations in managing complex object arrangement scenarios, underscoring the urgent need for intelligent agents capable of performing long-horizon operations in practical deployments and providing valuable insights for future research directions. Project website: \url{https://github.com/notFoundThisPerson/RoboCAS-v0}.