Jack




Abstract:We empirically study the scaling properties of various Diffusion Transformers (DiTs) for text-to-image generation by performing extensive and rigorous ablations, including training scaled DiTs ranging from 0.3B upto 8B parameters on datasets up to 600M images. We find that U-ViT, a pure self-attention based DiT model provides a simpler design and scales more effectively in comparison with cross-attention based DiT variants, which allows straightforward expansion for extra conditions and other modalities. We identify a 2.3B U-ViT model can get better performance than SDXL UNet and other DiT variants in controlled setting. On the data scaling side, we investigate how increasing dataset size and enhanced long caption improve the text-image alignment performance and the learning efficiency.




Abstract:The remarkable success of Large Language Models (LLMs) has extended to the multimodal domain, achieving outstanding performance in image understanding and generation. Recent efforts to develop unified Multimodal Large Language Models (MLLMs) that integrate these capabilities have shown promising results. However, existing approaches often involve complex designs in model architecture or training pipeline, increasing the difficulty of model training and scaling. In this paper, we propose SynerGen-VL, a simple yet powerful encoder-free MLLM capable of both image understanding and generation. To address challenges identified in existing encoder-free unified MLLMs, we introduce the token folding mechanism and the vision-expert-based progressive alignment pretraining strategy, which effectively support high-resolution image understanding while reducing training complexity. After being trained on large-scale mixed image-text data with a unified next-token prediction objective, SynerGen-VL achieves or surpasses the performance of existing encoder-free unified MLLMs with comparable or smaller parameter sizes, and narrows the gap with task-specific state-of-the-art models, highlighting a promising path toward future unified MLLMs. Our code and models shall be released.




Abstract:Tractor-trailer wheeled robots need to perform comprehensive perception tasks to enhance their operations in areas such as logistics parks and long-haul transportation. The perception of these robots face three major challenges: the relative pose change between the tractor and trailer, the asynchronous vibrations between the tractor and trailer, and the significant camera parallax caused by the large size. In this paper, we propose a novel Unified Vertex Motion Video Stabilization and Stitching framework designed for unknown environments. To establish the relationship between stabilization and stitching, the proposed Unified Vertex Motion framework comprises the Stitching Motion Field, which addresses relative positional change, and the Stabilization Motion Field, which tackles asynchronous vibrations. Then, recognizing the heterogeneity of optimization functions required for stabilization and stitching, a weighted cost function approach is proposed to address the problem of camera parallax. Furthermore, this framework has been successfully implemented in real tractor-trailer wheeled robots. The proposed Unified Vertex Motion Video Stabilization and Stitching method has been thoroughly tested in various challenging scenarios, demonstrating its accuracy and practicality in real-world robot tasks.




Abstract:Predicting roll call votes through modeling political actors has emerged as a focus in quantitative political science and computer science. Widely used embedding-based methods generate vectors for legislators from diverse data sets to predict legislative behaviors. However, these methods often contend with challenges such as the need for manually predefined features, reliance on extensive training data, and a lack of interpretability. Achieving more interpretable predictions under flexible conditions remains an unresolved issue. This paper introduces the Political Actor Agent (PAA), a novel agent-based framework that utilizes Large Language Models to overcome these limitations. By employing role-playing architectures and simulating legislative system, PAA provides a scalable and interpretable paradigm for predicting roll-call votes. Our approach not only enhances the accuracy of predictions but also offers multi-view, human-understandable decision reasoning, providing new insights into political actor behaviors. We conducted comprehensive experiments using voting records from the 117-118th U.S. House of Representatives, validating the superior performance and interpretability of PAA. This study not only demonstrates PAA's effectiveness but also its potential in political science research.




Abstract:With the advancement of computer vision, the recently emerged 3D Gaussian Splatting (3DGS) has increasingly become a popular scene reconstruction algorithm due to its outstanding performance. Distributed 3DGS can efficiently utilize edge devices to directly train on the collected images, thereby offloading computational demands and enhancing efficiency. However, traditional distributed frameworks often overlook computational and communication challenges in real-world environments, hindering large-scale deployment and potentially posing privacy risks. In this paper, we propose Radiant, a hierarchical 3DGS algorithm designed for large-scale scene reconstruction that considers system heterogeneity, enhancing the model performance and training efficiency. Via extensive empirical study, we find that it is crucial to partition the regions for each edge appropriately and allocate varying camera positions to each device for image collection and training. The core of Radiant is partitioning regions based on heterogeneous environment information and allocating workloads to each device accordingly. Furthermore, we provide a 3DGS model aggregation algorithm that enhances the quality and ensures the continuity of models' boundaries. Finally, we develop a testbed, and experiments demonstrate that Radiant improved reconstruction quality by up to 25.7\% and reduced up to 79.6\% end-to-end latency.




Abstract:Recent advancements in text-to-video (T2V) generative models have shown impressive capabilities. However, these models are still inadequate in aligning synthesized videos with human preferences (e.g., accurately reflecting text descriptions), which is particularly difficult to address, as human preferences are inherently subjective and challenging to formalize as objective functions. Therefore, this paper proposes LiFT, a novel fine-tuning method leveraging human feedback for T2V model alignment. Specifically, we first construct a Human Rating Annotation dataset, LiFT-HRA, consisting of approximately 10k human annotations, each including a score and its corresponding rationale. Based on this, we train a reward model LiFT-Critic to learn reward function effectively, which serves as a proxy for human judgment, measuring the alignment between given videos and human expectations. Lastly, we leverage the learned reward function to align the T2V model by maximizing the reward-weighted likelihood. As a case study, we apply our pipeline to CogVideoX-2B, showing that the fine-tuned model outperforms the CogVideoX-5B across all 16 metrics, highlighting the potential of human feedback in improving the alignment and quality of synthesized videos.




Abstract:In the field of autonomous driving or robotics, simultaneous localization and mapping (SLAM) and multi-object tracking (MOT) are two fundamental problems and are generally applied separately. Solutions to SLAM and MOT usually rely on certain assumptions, such as the static environment assumption for SLAM and the accurate ego-vehicle pose assumption for MOT. But in complex dynamic environments, it is difficult or even impossible to meet these assumptions. Therefore, the SLAMMOT, i.e., simultaneous localization, mapping, and moving object tracking, integrated system of SLAM and object tracking, has emerged for autonomous vehicles in dynamic environments. However, many conventional SLAMMOT solutions directly perform data association on the predictions and detections for object tracking, but ignore their quality. In practice, inaccurate predictions caused by continuous multi-frame missed detections in temporary occlusion scenarios, may degrade the performance of tracking, thereby affecting SLAMMOT. To address this challenge, this paper presents a LiDAR SLAMMOT based on confidence-guided data association (Conf SLAMMOT) method, which tightly couples the LiDAR SLAM and the confidence-guided data association based multi-object tracking into a graph optimization backend for estimating the state of the ego-vehicle and objects simultaneously. The confidence of prediction and detection are applied in the factor graph-based multi-object tracking for its data association, which not only avoids the performance degradation caused by incorrect initial assignments in some filter-based methods but also handles issues such as continuous missed detection in tracking while also improving the overall performance of SLAMMOT. Various comparative experiments demonstrate the superior advantages of Conf SLAMMOT, especially in scenes with some missed detections.




Abstract:Despite the significant progress made by all-in-one models in universal image restoration, existing methods suffer from a generalization bottleneck in real-world scenarios, as they are mostly trained on small-scale synthetic datasets with limited degradations. Therefore, large-scale high-quality real-world training data is urgently needed to facilitate the emergence of foundational models for image restoration. To advance this field, we spare no effort in contributing a million-scale dataset with two notable advantages over existing training data: real-world samples with larger-scale, and degradation types with higher diversity. By adjusting internal camera settings and external imaging conditions, we can capture aligned image pairs using our well-designed data acquisition system over multiple rounds and our data alignment criterion. Moreover, we propose a robust model, FoundIR, to better address a broader range of restoration tasks in real-world scenarios, taking a further step toward foundation models. Specifically, we first utilize a diffusion-based generalist model to remove degradations by learning the degradation-agnostic common representations from diverse inputs, where incremental learning strategy is adopted to better guide model training. To refine the model's restoration capability in complex scenarios, we introduce degradation-aware specialist models for achieving final high-quality results. Extensive experiments show the value of our dataset and the effectiveness of our method.
Abstract:Safety concerns of Multimodal large language models (MLLMs) have gradually become an important problem in various applications. Surprisingly, previous works indicate a counter-intuitive phenomenon that using textual unlearning to align MLLMs achieves comparable safety performances with MLLMs trained with image-text pairs. To explain such a counter-intuitive phenomenon, we discover a visual safety information leakage (VSIL) problem in existing multimodal safety benchmarks, i.e., the potentially risky and sensitive content in the image has been revealed in the textual query. In this way, MLLMs can easily refuse these sensitive text-image queries according to textual queries. However, image-text pairs without VSIL are common in real-world scenarios and are overlooked by existing multimodal safety benchmarks. To this end, we construct multimodal visual leakless safety benchmark (VLSBench) preventing visual safety leakage from image to textual query with 2.4k image-text pairs. Experimental results indicate that VLSBench poses a significant challenge to both open-source and close-source MLLMs, including LLaVA, Qwen2-VL, Llama3.2-Vision, and GPT-4o. This study demonstrates that textual alignment is enough for multimodal safety scenarios with VSIL, while multimodal alignment is a more promising solution for multimodal safety scenarios without VSIL. Please see our code and data at: http://hxhcreate.github.io/VLSBench
Abstract:Simultaneous Localization and Mapping (SLAM) and Multi-Object Tracking (MOT) are pivotal tasks in the realm of autonomous driving, attracting considerable research attention. While SLAM endeavors to generate real-time maps and determine the vehicle's pose in unfamiliar settings, MOT focuses on the real-time identification and tracking of multiple dynamic objects. Despite their importance, the prevalent approach treats SLAM and MOT as independent modules within an autonomous vehicle system, leading to inherent limitations. Classical SLAM methodologies often rely on a static environment assumption, suitable for indoor rather than dynamic outdoor scenarios. Conversely, conventional MOT techniques typically rely on the vehicle's known state, constraining the accuracy of object state estimations based on this prior. To address these challenges, previous efforts introduced the unified SLAMMOT paradigm, yet primarily focused on simplistic motion patterns. In our team's previous work IMM-SLAMMOT\cite{IMM-SLAMMOT}, we present a novel methodology incorporating consideration of multiple motion models into SLAMMOT i.e. tightly coupled SLAM and MOT, demonstrating its efficacy in LiDAR-based systems. This paper studies feasibility and advantages of instantiating this methodology as visual SLAMMOT, bridging the gap between LiDAR and vision-based sensing mechanisms. Specifically, we propose a solution of visual SLAMMOT considering multiple motion models and validate the inherent advantages of IMM-SLAMMOT in the visual domain.