We present a new method for text-driven motion transfer - synthesizing a video that complies with an input text prompt describing the target objects and scene while maintaining an input video's motion and scene layout. Prior methods are confined to transferring motion across two subjects within the same or closely related object categories and are applicable for limited domains (e.g., humans). In this work, we consider a significantly more challenging setting in which the target and source objects differ drastically in shape and fine-grained motion characteristics (e.g., translating a jumping dog into a dolphin). To this end, we leverage a pre-trained and fixed text-to-video diffusion model, which provides us with generative and motion priors. The pillar of our method is a new space-time feature loss derived directly from the model. This loss guides the generation process to preserve the overall motion of the input video while complying with the target object in terms of shape and fine-grained motion traits.
In an environment with certain locations of higher priority, it is required to patrol these locations as frequently as possible due to their importance. However, the Non-Priority locations are often neglected during the task. It is necessary to balance the patrols on both kinds of sites to avoid breaches in security. We present a distributed online algorithm that assigns the routes to agents that ensures a finite time visit to the Non-Priority locations along with Priority Patrolling. The proposed algorithm generates offline patrol routes (Rabbit Walks) with three segments (Hops) to explore non-priority locations. The generated number of offline walks depends exponentially on a parameter introduced in the proposed algorithm, thereby facilitating the scalable implementation based on the onboard resources available on each patrolling robot. A systematic performance evaluation through simulations and experimental results validates the proportionately balanced visits and suggests the proposed algorithm's versatile applicability in the implementation of deterministic and non-deterministic scenarios.
Many humanoid and multi-legged robots are controlled in positions rather than in torques, preventing direct control of contact forces, and hampering their ability to create multiple contacts to enhance their balance, such as placing a hand on a wall or a handrail. This paper introduces the SEIKO (Sequential Equilibrium Inverse Kinematic Optimization) pipeline, drawing inspiration from flexibility models used in serial elastic actuators to indirectly control contact forces on traditional position-controlled robots. SEIKO formulates whole-body retargeting from Cartesian commands and admittance control using two quadratic programs solved in real time. We validated our pipeline with experiments on the real, full-scale humanoid robot Talos in various multicontact scenarios, including pushing tasks, far-reaching tasks, stair climbing, and stepping on sloped surfaces. This work opens the possibility of stable, contact-rich behaviors while getting around many of the challenges of torque-controlled robots. Code and videos are available at https://hucebot.github.io/seiko_controller_website/ .
This paper presents a proposed method of autonomous control for docking tasks of a single-seat personal mobility vehicle. We proposed a non-linear model predictive control (NMPC) based visual servoing to achieves the desired autonomous docking task. The proposed method is implemented on a four-wheel electric wheelchair platform, with two independent rear driving wheels and two front castor wheels. The NMPC-based visual servoing technique leverages the information extracted from a visual sensor as a real-time feedback for the NMPC to control the motion of the vehicle achieving the desired autonomous docking task. To evaluate the performance of the proposed controller method, a number of experiments both in simulation and in the actual setting. The controller performance is then evaluated based on the controller design requirement. The simulation results on autonomous docking experiments show that the proposed controller has been successfully achieve the desired controller design requirement to generate realtime trajectory for the vehicle performing autonomous docking tasks in several different scenarios.
Vertical Federated learning (VFL) is a class of FL where each client shares the same sample space but only holds a subset of the features. While VFL tackles key privacy challenges of distributed learning, it often assumes perfect hardware and communication capabilities. This assumption hinders the broad deployment of VFL, particularly on edge devices, which are heterogeneous in their in-situ capabilities and will connect/disconnect from the network over time. To address this gap, we define Internet Learning (IL) including its data splitting and network context and which puts good performance under extreme dynamic condition of clients as the primary goal. We propose VFL as a naive baseline and develop several extensions to handle the IL paradigm of learning. Furthermore, we implement new methods, propose metrics, and extensively analyze results based on simulating a sensor network. The results show that the developed methods are more robust to changes in the network than VFL baseline.
This paper introduces an extension to the Orienteering Problem (OP), called Clustered Orienteering Problem with Subgroups (COPS). In this variant, nodes are arranged into subgroups, and the subgroups are organized into clusters. A reward is associated with each subgroup and is gained only if all of its nodes are visited; however, at most one subgroup can be visited per cluster. The objective is to maximize the total collected reward while attaining a travel budget. We show that our new formulation has the ability to model and solve two previous well-known variants, the Clustered Orienteering Problem (COP) and the Set Orienteering Problem (SOP), in addition to other scenarios introduced here. An Integer Linear Programming (ILP) formulation and a Tabu Search-based heuristic are proposed to solve the problem. Experimental results indicate that the ILP method can yield optimal solutions at the cost of time, whereas the metaheuristic produces comparable solutions within a more reasonable computational cost.
We propose SutraNets, a novel method for neural probabilistic forecasting of long-sequence time series. SutraNets use an autoregressive generative model to factorize the likelihood of long sequences into products of conditional probabilities. When generating long sequences, most autoregressive approaches suffer from harmful error accumulation, as well as challenges in modeling long-distance dependencies. SutraNets treat long, univariate prediction as multivariate prediction over lower-frequency sub-series. Autoregression proceeds across time and across sub-series in order to ensure coherent multivariate (and, hence, high-frequency univariate) outputs. Since sub-series can be generated using fewer steps, SutraNets effectively reduce error accumulation and signal path distances. We find SutraNets to significantly improve forecasting accuracy over competitive alternatives on six real-world datasets, including when we vary the number of sub-series and scale up the depth and width of the underlying sequence models.
We present a novel differentiable point-based rendering framework for material and lighting decomposition from multi-view images, enabling editing, ray-tracing, and real-time relighting of the 3D point cloud. Specifically, a 3D scene is represented as a set of relightable 3D Gaussian points, where each point is additionally associated with a normal direction, BRDF parameters, and incident lights from different directions. To achieve robust lighting estimation, we further divide incident lights of each point into global and local components, as well as view-dependent visibilities. The 3D scene is optimized through the 3D Gaussian Splatting technique while BRDF and lighting are decomposed by physically-based differentiable rendering. Moreover, we introduce an innovative point-based ray-tracing approach based on the bounding volume hierarchy for efficient visibility baking, enabling real-time rendering and relighting of 3D Gaussian points with accurate shadow effects. Extensive experiments demonstrate improved BRDF estimation and novel view rendering results compared to state-of-the-art material estimation approaches. Our framework showcases the potential to revolutionize the mesh-based graphics pipeline with a relightable, traceable, and editable rendering pipeline solely based on point cloud. Project page:https://nju-3dv.github.io/projects/Relightable3DGaussian/.
Cognitive load assessment is crucial for understanding human performance in various domains. This study investigates the impact of different task conditions and time constraints on cognitive load using multiple measures, including subjective evaluations, performance metrics, and physiological eye-tracking data. Fifteen participants completed a series of primary and secondary tasks with different time limits. The NASA-TLX questionnaire, reaction time, inverse efficiency score, and eye-related features (blink, saccade, and fixation frequency) were utilized to assess cognitive load. The study results show significant differences in the level of cognitive load required for different tasks and when under time constraints. The study also found that there was a positive correlation (r = 0.331, p = 0.014) between how often participants blinked their eyes and the level of cognitive load required but a negative correlation (r = -0.290, p = 0.032) between how often participants made quick eye movements (saccades) and the level of cognitive load required. Additionally, the analysis revealed a significant negative correlation (r = -0.347, p = 0.009) and (r = -0.370, p = 0.005) between fixation and saccade frequencies under time constraints.
The quality of a face crop in an image is decided by many factors such as camera resolution, distance, and illumination condition. This makes the discrimination of face images with different qualities a challenging problem in realistic applications. However, most existing approaches are designed specifically for high-quality (HQ) or low-quality (LQ) images, and the performances would degrade for the mixed-quality images. Besides, many methods ask for pre-trained feature extractors or other auxiliary structures to support the training and the evaluation. In this paper, we point out that the key to better understand both the HQ and the LQ images simultaneously is to apply different learning methods according to their qualities. We propose a novel quality-guided joint training approach for mixed-quality face recognition, which could simultaneously learn the images of different qualities with a single encoder. Based on quality partition, classification-based method is employed for HQ data learning. Meanwhile, for the LQ images which lack identity information, we learn them with self-supervised image-image contrastive learning. To effectively catch up the model update and improve the discriminability of contrastive learning in our joint training scenario, we further propose a proxy-updated real-time queue to compose the contrastive pairs with features from the genuine encoder. Experiments on the low-quality datasets SCface and Tinyface, the mixed-quality dataset IJB-B, and five high-quality datasets demonstrate the effectiveness of our proposed approach in recognizing face images of different qualities.