Abstract:Recent extensive research on image customization (e.g., identity, subject, style, and background) has demonstrated strong customization capabilities in large-scale generative models. However, most approaches are designed for specific tasks, which restricts their ability to combine different types of conditions. Developing a unified framework for image customization remains an open challenge. In this paper, we present DreamO, an image customization framework designed to support a wide range of tasks while facilitating seamless integration of multiple conditions. Specifically, DreamO utilizes a diffusion transformer (DiT) framework to uniformly process inputs of different types. During training, we construct a large-scale training dataset that includes various customization tasks, and we introduce a feature routing constraint to facilitate the precise querying of relevant information from reference images. Additionally, we design a placeholder strategy that associates specific placeholders with conditions at particular positions, enabling control over the placement of conditions in the generated results. Moreover, we employ a progressive training strategy consisting of three stages: an initial stage focused on simple tasks with limited data to establish baseline consistency, a full-scale training stage to comprehensively enhance the customization capabilities, and a final quality alignment stage to correct quality biases introduced by low-quality data. Extensive experiments demonstrate that the proposed DreamO can effectively perform various image customization tasks with high quality and flexibly integrate different types of control conditions.
Abstract:Communication is fundamental for multi-robot collaboration, with accurate radio mapping playing a crucial role in predicting signal strength between robots. However, modeling radio signal propagation in large and occluded environments is challenging due to complex interactions between signals and obstacles. Existing methods face two key limitations: they struggle to predict signal strength for transmitter-receiver pairs not present in the training set, while also requiring extensive manual data collection for modeling, making them impractical for large, obstacle-rich scenarios. To overcome these limitations, we propose FERMI, a flexible radio mapping framework. FERMI combines physics-based modeling of direct signal paths with a neural network to capture environmental interactions with radio signals. This hybrid model learns radio signal propagation more efficiently, requiring only sparse training data. Additionally, FERMI introduces a scalable planning method for autonomous data collection using a multi-robot team. By increasing parallelism in data collection and minimizing robot travel costs between regions, overall data collection efficiency is significantly improved. Experiments in both simulation and real-world scenarios demonstrate that FERMI enables accurate signal prediction and generalizes well to unseen positions in complex environments. It also supports fully autonomous data collection and scales to different team sizes, offering a flexible solution for creating radio maps. Our code is open-sourced at https://github.com/ymLuo1214/Flexible-Radio-Mapping.
Abstract:The advent of Large Language Models (LLMs) has profoundly transformed the paradigms of information retrieval and problem-solving, enabling students to acquire information more efficiently to support learning. However, there is currently a lack of standardized evaluation frameworks that guide learners in effectively leveraging LLMs. This paper proposes an LLM-driven Bloom's Educational Taxonomy that aims to recognize and evaluate students' information literacy (IL) with LLMs, and to formalize and guide students' practice-based activities of using LLMs to solve complex problems. The framework divides the IL corresponding to the cognitive abilities required to use LLMs into two distinct stages: Exploration & Action and Creation & Metacognition. It further subdivides these into seven phases: Perceiving, Searching, Reasoning, Interacting, Evaluating, Organizing, and Curating. A case presentation demonstrates the framework's applicability and feasibility, supporting its role in fostering IL among students with varying levels of prior knowledge. This framework fills the existing gap in the analysis of LLM usage frameworks and provides theoretical support for guiding learners to improve IL.
Abstract:Unmanned Aerial Vehicles (UAVs) have gained significant popularity in scene reconstruction. This paper presents SOAR, a LiDAR-Visual heterogeneous multi-UAV system specifically designed for fast autonomous reconstruction of complex environments. Our system comprises a LiDAR-equipped explorer with a large field-of-view (FoV), alongside photographers equipped with cameras. To ensure rapid acquisition of the scene's surface geometry, we employ a surface frontier-based exploration strategy for the explorer. As the surface is progressively explored, we identify the uncovered areas and generate viewpoints incrementally. These viewpoints are then assigned to photographers by solving a Consistent Multiple Depot Multiple Traveling Salesman Problem (Consistent-MDMTSP), which optimizes scanning efficiency while ensuring task consistency. Finally, photographers utilize the assigned viewpoints to determine optimal coverage paths for acquiring images. We present extensive benchmarks in a realistic simulator, which validate the performance of SOAR compared with classical and state-of-the-art methods. For more details, please see our project page at https://sysu-star.github.io/SOAR.
Abstract:In the information era, how learners find, evaluate, and effectively use information has become a challenging issue, especially with the added complexity of large language models (LLMs) that have further confused learners in their information retrieval and search activities. This study attempts to unpack this complexity by combining exploratory search strategies with the theories of exploratory learning to form a new theoretical model of exploratory learning from the perspective of students' learning. Our work adapts Kolb's learning model by incorporating high-frequency exploration and feedback loops, aiming to promote deep cognitive and higher-order cognitive skill development in students. Additionally, this paper discusses and suggests how advanced LLMs integrated into information retrieval and information theory can support students in their exploratory searches, contributing theoretically to promoting student-computer interaction and supporting their learning journeys in the new era with LLMs.
Abstract:The mental health assessment of middle school students has always been one of the focuses in the field of education. This paper introduces a new ensemble learning network based on BERT, which enhances model performance by integrating multiple classifiers. We trained a range of BERT-based learners, which were combined using the majority voting method. We collected social network text data of middle school students from China's Weibo and applied the method to the task of classifying emotional tendencies in middle school students' social network texts. Experimental results suggest that the ensemble learning network performs better than the base model, and that the ensemble learning model consisting of three single-layer BERT models performs nearly the same as a three-layer BERT model but requires 11.58% more training time. Therefore, in terms of balancing prediction effect and efficiency, the deeper BERT network should be preferred for training. However, for interpretability, network ensembles can provide acceptable solutions.
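The majority voting step mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `majority_vote`, the string labels, and the tie-breaking rule (the label seen first among the tied ones wins) are all assumptions made for illustration, and the per-classifier predictions are toy values rather than BERT outputs.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists into one list of voted labels.

    `predictions` is a list with one entry per classifier; each entry is a
    list of labels, one per sample. (Hypothetical interface, for illustration.)
    """
    if not predictions:
        raise ValueError("need at least one classifier")
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("all classifiers must label the same samples")

    voted = []
    # zip(*predictions) groups the labels that every classifier gave to one sample.
    for sample_labels in zip(*predictions):
        # most_common(1) returns the most frequent label; on a tie, the label
        # encountered first is returned (an assumed tie-breaking rule).
        voted.append(Counter(sample_labels).most_common(1)[0][0])
    return voted


# Toy predictions from three hypothetical base learners on three texts.
clf_a = ["pos", "neg", "neg"]
clf_b = ["pos", "pos", "neg"]
clf_c = ["neg", "pos", "neg"]
print(majority_vote([clf_a, clf_b, clf_c]))  # ['pos', 'pos', 'neg']
```

With an odd number of base learners and two classes, ties cannot occur, which is one reason three-member ensembles are a convenient choice for binary sentiment labels.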
Abstract:Large language models (LLMs) hold great promise in summarizing medical evidence. Most recent studies focus on the application of proprietary LLMs. Using proprietary LLMs introduces multiple risk factors, including a lack of transparency and vendor dependency. While open-source LLMs allow better transparency and customization, their performance falls short compared to proprietary ones. In this study, we investigated to what extent fine-tuning open-source LLMs can further improve their performance in summarizing medical evidence. Utilizing a benchmark dataset, MedReview, consisting of 8,161 pairs of systematic reviews and summaries, we fine-tuned three widely used open-source LLMs, namely PRIMERA, LongT5, and Llama-2. Overall, the fine-tuned LLMs obtained an increase of 9.89 in ROUGE-L (95% confidence interval: 8.94-10.81), 13.21 in METEOR score (95% confidence interval: 12.05-14.37), and 15.82 in CHRF score (95% confidence interval: 13.89-16.44). The performance of fine-tuned LongT5 is close to that of GPT-3.5 in zero-shot settings. Furthermore, smaller fine-tuned models sometimes even demonstrated superior performance compared to larger zero-shot models. These improvement trends were also observed in both human and GPT4-simulated evaluations. Our results can be applied to guide model selection for tasks demanding particular domain knowledge, such as medical evidence summarization.
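For readers unfamiliar with ROUGE-L, one of the metrics reported in the abstract above, the following is a minimal sketch of the sentence-level score based on the longest common subsequence (LCS). It is not the evaluation code used in the study: whitespace tokenization, lowercasing, and the balanced F1 form are assumptions, and the function names `lcs_length` and `rouge_l_f1` are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 between two strings (assumes whitespace tokens, lowercased)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    recall = lcs / len(ref)       # share of reference tokens covered by the LCS
    precision = lcs / len(cand)   # share of candidate tokens that lie on the LCS
    return 2 * precision * recall / (precision + recall)


# Toy example: the LCS is "the drug reduced mortality" (4 tokens).
score = rouge_l_f1("the drug reduced mortality in adults",
                   "the drug reduced adult mortality")
print(round(score, 4))  # 0.7273
```

The score here lies in [0, 1]; reported values such as those in the abstract are commonly scaled to 0-100, though the exact configuration used by the authors is not stated in the text.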
Abstract:In recent years, implicit online dense mapping methods have achieved high-quality reconstruction results, showcasing great potential in robotics, AR/VR, and digital twin applications. However, existing methods struggle with slow texture modeling, which limits their real-time performance. To address this limitation, we propose a NeRF-based dense mapping method that enables faster and higher-quality reconstruction. To improve texture modeling, we introduce quasi-heterogeneous feature grids, which inherit the fast querying ability of uniform feature grids while adapting to varying levels of texture complexity. In addition, we present a gradient-aided coverage-maximizing strategy for keyframe selection that enables the selected keyframes to exhibit a closer focus on rich-textured regions and a broader scope for weak-textured areas. Experimental results demonstrate that our method surpasses existing NeRF-based approaches in texture fidelity, geometry accuracy, and time consumption. The code for our method will be available at: https://github.com/SYSU-STAR/H3-Mapping.
Abstract:This paper tackles the challenge of autonomous target search using unmanned aerial vehicles (UAVs) in complex unknown environments. To fill the gap in systematic approaches for this task, we introduce Star-Searcher, an aerial system featuring specialized sensor suites, mapping, and planning modules to optimize searching. Path planning challenges due to increased inspection requirements are addressed through a hierarchical planner with a visibility-based viewpoint clustering method. This simplifies planning by breaking it into global and local sub-problems, ensuring efficient global and local path coverage in real-time. Furthermore, our global path planning employs a history-aware mechanism to reduce motion inconsistency from frequent map changes, significantly enhancing search efficiency. We conduct comparisons with state-of-the-art methods in both simulation and the real world, demonstrating shorter flight paths, reduced time, and higher target search completeness. Our approach will be open-sourced for community benefit at https://github.com/SYSU-STAR/STAR-Searcher.
Abstract:According to the Stimulus Organism Response (SOR) theory, all human behavioral reactions are stimulated by context, where people will process the received stimulus and produce an appropriate reaction. This implies that in a specific context for a given input stimulus, a person can react differently according to their internal state and other contextual factors. Analogously, in dyadic interactions, humans communicate using verbal and nonverbal cues, where a broad spectrum of listeners' non-verbal reactions might be appropriate for responding to a specific speaker behaviour. There already exists a body of work that investigated the problem of automatically generating an appropriate reaction for a given input. However, none attempted to automatically generate multiple appropriate reactions in the context of dyadic interactions and evaluate the appropriateness of those reactions using objective measures. This paper starts by defining the facial Multiple Appropriate Reaction Generation (fMARG) task for the first time in the literature and proposes a new set of objective evaluation metrics to evaluate the appropriateness of the generated reactions. The paper subsequently introduces a framework to predict, generate, and evaluate multiple appropriate facial reactions.