The body structures of tendon-driven musculoskeletal humanoids are complex, and accurate modeling is difficult, because they are made by imitating the body structures of human beings. For this reason, we have not been able to move them accurately like ordinary humanoids driven by actuators in each axis, and large internal muscle tension and slack of tendon wires have emerged by the model error between its geometric model and the actual robot. Therefore, we construct a joint-muscle mapping (JMM) using a neural network (NN), which expresses a nonlinear relationship between joint angles and muscle lengths, and aim to move tendon-driven musculoskeletal humanoids accurately by updating the JMM online from data of the actual robot. In this study, the JMM is updated online by using the vision of the robot so that it moves to the correct position (Vision Updater). Also, we execute another update to modify muscle antagonisms correctly (Antagonism Updater). By using these two updaters, the error between the target and actual joint angles decrease to about 40% in 5 minutes, and we show through a manipulation experiment that the tendon-driven musculoskeletal humanoid Kengoro becomes able to move as intended. This novel system can adapt to the state change and growth of robots, because it updates the JMM online successively.
The tendon-driven musculoskeletal humanoid has many benefits that human beings have, but the modeling of its complex muscle and bone structures is difficult and conventional model-based controls cannot realize intended movements. Therefore, a learning control mechanism that acquires nonlinear relationships between joint angles, muscle tensions, and muscle lengths from the actual robot is necessary. In this study, we propose a system which runs the learning control mechanism for a long time to keep the self-body image of the musculoskeletal humanoid correct at all times. Also, we show that the musculoskeletal humanoid can conduct position control, torque control, and variable stiffness control using this self-body image. We conduct a long-time self-body image acquisition experiment lasting 3 hours, evaluate variable stiffness control using the self-body image, etc., and discuss the superiority and practicality of the self-body image acquisition of musculoskeletal structures, comprehensively.
Tendon-driven musculoskeletal humanoids have many benefits in terms of the flexible spine, multiple degrees of freedom, and variable stiffness. At the same time, because of its body complexity, there are problems in controllability. First, due to the large difference between the actual robot and its geometric model, it cannot move as intended and large internal muscle tension may emerge. Second, movements which do not appear as changes in muscle lengths may emerge, because of the muscle route changes caused by softness of body tissue. To solve these problems, we construct two models: ideal joint-muscle model and muscle-route change model, using a neural network. We initialize these models by a man-made geometric model and update them online using the sensor information of the actual robot. We validate that the tendon-driven musculoskeletal humanoid Kengoro is able to obtain a correct self-body image through several experiments.
Musculoskeletal humanoids have been developed by imitating humans and expected to perform natural and dynamic motions as well as humans. To achieve desired motions stably in current musculoskeletal humanoids is not easy because they cannot maintain the sufficient moment arm of muscles in various postures. In this research, we discuss planar structures that spread across joint structures such as ligament and planar muscles and the application of planar interskeletal structures to humanoid robots. Next, we develop MusashiOLegs, a musculoskeletal legs which has planar interskeletal structures and conducts several experiments to verify the importance of planar interskeletal structures.
Human can not only support their body during standing or walking, but also support them by hand, so that they can dangle a bar and others. But most humanoid robots support their body only in the foot and they use their hand just to manipulate objects because their hands are too weak to support their body. Strong hands are supposed to enable humanoid robots to act in much broader scene. Therefore, we developed new life-size five-fingered hand that can support the body of life-size humanoid robot. It is tendon-driven and underactuated hand and actuators in forearms produce large gripping force. This hand has flexible joints using machined springs, which can be designed integrally with the attachment. Thus, it has both structural strength and impact resistance in spite of small size. As other characteristics, this hand has force sensors to measure external force and the fingers can be flexed along objects though the number of actuators to flex fingers is less than that of fingers. We installed the developed hand on musculoskeletal humanoid "Kengoro" and achieved two self-weight supporting motions: push-up motion and dangling motion.
Human hands can not only grasp objects of various shape and size and manipulate them in hands but also exert such a large gripping force that they can support the body in the situations such as dangling a bar and climbing a ladder. On the other hand, it is difficult for most robot hands to manage both. Therefore in this paper we developed the hand which can grasp various objects and exert large gripping force. To develop such hand, we focused on the thumb CM joint with wide range of motion and the MP joints of four fingers with the DOF of abduction and adduction. Based on the hand with large gripping force and flexibility using machined spring, we applied above mentioned joint mechanism to the hand. The thumb CM joint has wide range of motion because of the combination of three machined springs and MP joints of four fingers have variable rigidity mechanism instead of driving each joint independently in order to move joint in limited space and by limited actuators. Using the developed hand, we achieved the grasping of various objects, supporting a large load and several motions with an arm.
Large language models (LLMs) have become ubiquitous in practice and are widely used for generation tasks such as translation, summarization and instruction following. However, their enormous size and reliance on autoregressive decoding increase deployment costs and complicate their use in latency-critical applications. In this work, we propose a hybrid approach that combines language models of different sizes to increase the efficiency of autoregressive decoding while maintaining high performance. Our method utilizes a pretrained frozen LLM that encodes all prompt tokens once in parallel, and uses the resulting representations to condition and guide a small language model (SLM), which then generates the response more efficiently. We investigate the combination of encoder-decoder LLMs with both encoder-decoder and decoder-only SLMs from different model families and only require fine-tuning of the SLM. Experiments with various benchmarks show substantial speedups of up to $4\times$, with minor performance penalties of $1-2\%$ for translation and summarization tasks compared to the LLM.
Vision-Language Models (VLMs) have emerged as promising tools for open-ended image understanding tasks, including open vocabulary segmentation. Yet, direct application of such VLMs to segmentation is non-trivial, since VLMs are trained with image-text pairs and naturally lack pixel-level granularity. Recent works have made advancements in bridging this gap, often by leveraging the shared image-text space in which the image and a provided text prompt are represented. In this paper, we challenge the capabilities of VLMs further and tackle open-vocabulary segmentation without the need for any textual input. To this end, we propose a novel Self-Guided Semantic Segmentation (Self-Seg) framework. Self-Seg is capable of automatically detecting relevant class names from clustered BLIP embeddings and using these for accurate semantic segmentation. In addition, we propose an LLM-based Open-Vocabulary Evaluator (LOVE) to effectively assess predicted open-vocabulary class names. We achieve state-of-the-art results on Pascal VOC, ADE20K and CityScapes for open-vocabulary segmentation without given class names, as well as competitive performance with methods where class names are given. All code and data will be released.
Foundation Models (FMs) have revolutionized machine learning with their adaptability and high performance across tasks; yet, their integration into Federated Learning (FL) is challenging due to substantial communication overhead from their extensive parameterization. Current communication-efficient FL strategies, such as gradient compression, reduce bitrates to around $1$ bit-per-parameter (bpp). However, these approaches fail to harness the characteristics of FMs, with their large number of parameters still posing a challenge to communication efficiency, even at these bitrate regimes. In this work, we present DeltaMask, a novel method that efficiently fine-tunes FMs in FL at an ultra-low bitrate, well below 1 bpp. DeltaMask employs stochastic masking to detect highly effective subnetworks within FMs and leverage stochasticity and sparsity in client masks to compress updates into a compact grayscale image using probabilistic filters, deviating from traditional weight training approaches. Our comprehensive evaluations across various datasets and architectures demonstrate DeltaMask efficiently achieves bitrates as low as 0.09 bpp, enhancing communication efficiency while maintaining FMs performance, as measured on 8 datasets and 5 pre-trained models of various network architectures.
The dominant paradigm for learning video-text representations -- noise contrastive learning -- increases the similarity of the representations of pairs of samples that are known to be related, such as text and video from the same sample, and pushes away the representations of all other pairs. We posit that this last behaviour is too strict, enforcing dissimilar representations even for samples that are semantically-related -- for example, visually similar videos or ones that share the same depicted action. In this paper, we propose a novel method that alleviates this by leveraging a generative model to naturally push these related samples together: each sample's caption must be reconstructed as a weighted combination of other support samples' visual representations. This simple idea ensures that representations are not overly-specialized to individual samples, are reusable across the dataset, and results in representations that explicitly encode semantics shared between samples, unlike noise contrastive learning. Our proposed method outperforms others by a large margin on MSR-VTT, VATEX and ActivityNet, for video-to-text and text-to-video retrieval.