School of Software, Tianjin University
Abstract:Many real-time applications of the Internet of Things (IoT) need to deal with correlated information generated by multiple sensors. The design of efficient status update strategies that minimize the Age of Correlated Information (AoCI) is a key factor. In this paper, we consider an IoT network consisting of sensors equipped with the energy harvesting (EH) capability. We optimize the average AoCI at the data fusion center (DFC) by appropriately managing the energy harvested by sensors, whose true battery states are unobservable during the decision-making process. Particularly, we first formulate the dynamic status update procedure as a partially observable Markov decision process (POMDP), where the environmental dynamics are unknown to the DFC. In order to address the challenges arising from the causality of energy usage, unknown environmental dynamics, unobservability of sensors'true battery states, and large-scale discrete action space, we devise a deep reinforcement learning (DRL)-based dynamic status update algorithm. The algorithm leverages the advantages of the soft actor-critic and long short-term memory techniques. Meanwhile, it incorporates our proposed action decomposition and mapping mechanism. Extensive simulations are conducted to validate the effectiveness of our proposed algorithm by comparing it with available DRL algorithms for POMDPs.




Abstract:Learning with noisy labels (LNL) has been extensively studied, with existing approaches typically following a framework that alternates between clean sample selection and semi-supervised learning (SSL). However, this approach has a limitation: the clean set selected by the Deep Neural Network (DNN) classifier, trained through self-training, inevitably contains noisy samples. This mixture of clean and noisy samples leads to misguidance in DNN training during SSL, resulting in impaired generalization performance due to confirmation bias caused by error accumulation in sample selection. To address this issue, we propose a method called Collaborative Sample Selection (CSS), which leverages the large-scale pre-trained model CLIP. CSS aims to remove the mixed noisy samples from the identified clean set. We achieve this by training a 2-Dimensional Gaussian Mixture Model (2D-GMM) that combines the probabilities from CLIP with the predictions from the DNN classifier. To further enhance the adaptation of CLIP to LNL, we introduce a co-training mechanism with a contrastive loss in semi-supervised learning. This allows us to jointly train the prompt of CLIP and the DNN classifier, resulting in improved feature representation, boosted classification performance of DNNs, and reciprocal benefits to our Collaborative Sample Selection. By incorporating auxiliary information from CLIP and utilizing prompt fine-tuning, we effectively eliminate noisy samples from the clean set and mitigate confirmation bias during training. Experimental results on multiple benchmark datasets demonstrate the effectiveness of our proposed method in comparison with the state-of-the-art approaches.




Abstract:We report Zero123++, an image-conditioned diffusion model for generating 3D-consistent multi-view images from a single input view. To take full advantage of pretrained 2D generative priors, we develop various conditioning and training schemes to minimize the effort of finetuning from off-the-shelf image diffusion models such as Stable Diffusion. Zero123++ excels in producing high-quality, consistent multi-view images from a single image, overcoming common issues like texture degradation and geometric misalignment. Furthermore, we showcase the feasibility of training a ControlNet on Zero123++ for enhanced control over the generation process. The code is available at https://github.com/SUDO-AI-3D/zero123plus.




Abstract:Perception is necessary for autonomous navigation in an unknown area crowded with obstacles. It's challenging for a robot to navigate safely without any sensors that can sense the environment, resulting in a $\textit{blind}$ robot, and becomes more difficult when comes to a group of robots. However, it could be costly to equip all robots with expensive perception or SLAM systems. In this paper, we propose a novel system named $\textbf{ColAG}$, to solve the problem of autonomous navigation for a group of $\textit{blind}$ UGVs by introducing cooperation with one UAV, which is the only robot that has full perception capabilities in the group. The UAV uses SLAM for its odometry and mapping while sharing this information with UGVs via limited relative pose estimation. The UGVs plan their trajectories in the received map and predict possible failures caused by the uncertainty of its wheel odometry and unknown risky areas. The UAV dynamically schedules waypoints to prevent UGVs from collisions, formulated as a Vehicle Routing Problem with Time Windows to optimize the UAV's trajectories and minimize time when UGVs have to wait to guarantee safety. We validate our system through extensive simulation with up to 7 UGVs and real-world experiments with 3 UGVs.




Abstract:We propose InternLM-XComposer, a vision-language large model that enables advanced image-text comprehension and composition. The innovative nature of our model is highlighted by three appealing properties: 1) Interleaved Text-Image Composition: InternLM-XComposer can effortlessly generate coherent and contextual articles that seamlessly integrate images, providing a more engaging and immersive reading experience. Simply provide a title, and our system will generate the corresponding manuscript. It can intelligently identify the areas in the text where images would enhance the content and automatically insert the most appropriate visual candidates. 2) Comprehension with Rich Multilingual Knowledge: The text-image comprehension is empowered by training on extensive multi-modal multilingual concepts with carefully crafted strategies, resulting in a deep understanding of visual content. 3) State-of-the-art Performance: Our model consistently achieves state-of-the-art results across various mainstream benchmarks for vision-language foundational models, including MME Benchmark, MMBench, MMBench-CN, Seed-Bench, and CCBench (Chinese Cultural Benchmark). Collectively, InternLM-XComposer seamlessly blends advanced text-image comprehension and composition, revolutionizing vision-language interaction and offering new insights and opportunities. The InternLM-XComposer model series with 7B parameters are publicly available at https://github.com/InternLM/InternLM-XComposer.




Abstract:Autonomous navigation of ground robots on uneven terrain is being considered in more and more tasks. However, uneven terrain will bring two problems to motion planning: how to assess the traversability of the terrain and how to cope with the dynamics model of the robot associated with the terrain. The trajectories generated by existing methods are often too conservative or cannot be tracked well by the controller since the second problem is not well solved. In this paper, we propose terrain pose mapping to describe the impact of terrain on the robot. With this mapping, we can obtain the SE(3) state of the robot on uneven terrain for a given state in SE(2). Then, based on it, we present a trajectory optimization framework for car-like robots on uneven terrain that can consider both of the above problems. The trajectories generated by our method conform to the dynamics model of the system without being overly conservative and yet able to be tracked well by the controller. We perform simulations and real-world experiments to validate the efficiency and trajectory quality of our algorithm.




Abstract:Recent advancement in personalized image generation have unveiled the intriguing capability of pre-trained text-to-image models on learning identity information from a collection of portrait images. However, existing solutions can be vulnerable in producing truthful details, and usually suffer from several defects such as (i) The generated face exhibit its own unique characteristics, \ie facial shape and facial feature positioning may not resemble key characteristics of the input, and (ii) The synthesized face may contain warped, blurred or corrupted regions. In this paper, we present FaceChain, a personalized portrait generation framework that combines a series of customized image-generation model and a rich set of face-related perceptual understanding models (\eg, face detection, deep face embedding extraction, and facial attribute recognition), to tackle aforementioned challenges and to generate truthful personalized portraits, with only a handful of portrait images as input. Concretely, we inject several SOTA face models into the generation procedure, achieving a more efficient label-tagging, data-processing, and model post-processing compared to previous solutions, such as DreamBooth ~\cite{ruiz2023dreambooth} , InstantBooth ~\cite{shi2023instantbooth} , or other LoRA-only approaches ~\cite{hu2021lora} . Through the development of FaceChain, we have identified several potential directions to accelerate development of Face/Human-Centric AIGC research and application. We have designed FaceChain as a framework comprised of pluggable components that can be easily adjusted to accommodate different styles and personalized needs. We hope it can grow to serve the burgeoning needs from the communities. FaceChain is open-sourced under Apache-2.0 license at \url{https://github.com/modelscope/facechain}.
Abstract:The rise in popularity of ChatGPT and GPT-4 has significantly accelerated the development of large models, leading to the creation of numerous impressive large language models(LLMs) and multimodal large language models (MLLMs). These cutting-edge models owe their remarkable performance to high-quality data. However, the details of the training data used in leading paradigms are often kept confidential. This lack of transparency, coupled with the scarcity of open-source data, impedes further developments within the community. As a response, this paper presents "Wan Juan", a large-scale multimodal dataset composed of both Chinese and English data, collected from a wide range of web sources. The dataset incorporates text, image-text, and video modalities, with a total volume exceeding 2TB. It was utilized in the training of InternLM, a model that demonstrated significant advantages in multi-dimensional evaluations when compared to models of a similar scale. All data can be accessed at https://opendatalab.org.cn/WanJuan1.0.




Abstract:Single image 3D reconstruction is an important but challenging task that requires extensive knowledge of our natural world. Many existing methods solve this problem by optimizing a neural radiance field under the guidance of 2D diffusion models but suffer from lengthy optimization time, 3D inconsistency results, and poor geometry. In this work, we propose a novel method that takes a single image of any object as input and generates a full 360-degree 3D textured mesh in a single feed-forward pass. Given a single image, we first use a view-conditioned 2D diffusion model, Zero123, to generate multi-view images for the input view, and then aim to lift them up to 3D space. Since traditional reconstruction methods struggle with inconsistent multi-view predictions, we build our 3D reconstruction module upon an SDF-based generalizable neural surface reconstruction method and propose several critical training strategies to enable the reconstruction of 360-degree meshes. Without costly optimizations, our method reconstructs 3D shapes in significantly less time than existing methods. Moreover, our method favors better geometry, generates more 3D consistent results, and adheres more closely to the input image. We evaluate our approach on both synthetic data and in-the-wild images and demonstrate its superiority in terms of both mesh quality and runtime. In addition, our approach can seamlessly support the text-to-3D task by integrating with off-the-shelf text-to-image diffusion models.
Abstract:For letting mobile robots travel flexibly through complicated environments, increasing attention has been paid to the whole-body collision evaluation. Most existing works either opt for the conservative corridor-based methods that impose strict requirements on the corridor generation, or ESDF-based methods that suffer from high computational overhead. It is still a great challenge to achieve fast and accurate whole-body collision evaluation. In this paper, we propose a Robo-centric ESDF (RC-ESDF) that is pre-built in the robot body frame and is capable of seamlessly applied to any-shape mobile robots, even for those with non-convex shapes. RC-ESDF enjoys lazy collision evaluation, which retains only the minimum information sufficient for whole-body safety constraint and significantly speeds up trajectory optimization. Based on the analytical gradients provided by RC-ESDF, we optimize the position and rotation of robot jointly, with whole-body safety, smoothness, and dynamical feasibility taken into account. Extensive simulation and real-world experiments verified the reliability and generalizability of our method.