Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Hui Li

Jiangnan University, Wuxi, China

SPDFusion: An Infrared and Visible Image Fusion Network Based on a Non-Euclidean Representation of Riemannian Manifolds

Nov 16, 2024

Huan Kang, Hui Li, Tianyang Xu, Rui Wang, Xiao-Jun Wu, Josef Kittler

Figure 1 for SPDFusion: An Infrared and Visible Image Fusion Network Based on a Non-Euclidean Representation of Riemannian Manifolds

Figure 2 for SPDFusion: An Infrared and Visible Image Fusion Network Based on a Non-Euclidean Representation of Riemannian Manifolds

Figure 3 for SPDFusion: An Infrared and Visible Image Fusion Network Based on a Non-Euclidean Representation of Riemannian Manifolds

Figure 4 for SPDFusion: An Infrared and Visible Image Fusion Network Based on a Non-Euclidean Representation of Riemannian Manifolds

Abstract:Euclidean representation learning methods have achieved commendable results in image fusion tasks, which can be attributed to their clear advantages in handling with linear space. However, data collected from a realistic scene usually have a non-Euclidean structure, where Euclidean metric might be limited in representing the true data relationships, degrading fusion performance. To address this issue, a novel SPD (symmetric positive definite) manifold learning framework is proposed for multi-modal image fusion, named SPDFusion, which extends the image fusion approach from the Euclidean space to the SPD manifolds. Specifically, we encode images according to the Riemannian geometry to exploit their intrinsic statistical correlations, thereby aligning with human visual perception. Actually, the SPD matrix underpins our network learning, with a cross-modal fusion strategy employed to harness modality-specific dependencies and augment complementary information. Subsequently, an attention module is designed to process the learned weight matrix, facilitating the weighting of spatial global correlation semantics via SPD matrix multiplication. Based on this, we design an end-to-end fusion network based on cross-modal manifold learning. Extensive experiments on public datasets demonstrate that our framework exhibits superior performance compared to the current state-of-the-art methods.

* 14 pages, 12 figures

Via

Access Paper or Ask Questions

Hallo2: Long-Duration and High-Resolution Audio-Driven Portrait Image Animation

Oct 10, 2024

Jiahao Cui, Hui Li, Yao Yao, Hao Zhu, Hanlin Shang, Kaihui Cheng, Hang Zhou, Siyu Zhu, Jingdong Wang

Figure 1 for Hallo2: Long-Duration and High-Resolution Audio-Driven Portrait Image Animation

Figure 2 for Hallo2: Long-Duration and High-Resolution Audio-Driven Portrait Image Animation

Figure 3 for Hallo2: Long-Duration and High-Resolution Audio-Driven Portrait Image Animation

Figure 4 for Hallo2: Long-Duration and High-Resolution Audio-Driven Portrait Image Animation

Abstract:Recent advances in latent diffusion-based generative models for portrait image animation, such as Hallo, have achieved impressive results in short-duration video synthesis. In this paper, we present updates to Hallo, introducing several design enhancements to extend its capabilities. First, we extend the method to produce long-duration videos. To address substantial challenges such as appearance drift and temporal artifacts, we investigate augmentation strategies within the image space of conditional motion frames. Specifically, we introduce a patch-drop technique augmented with Gaussian noise to enhance visual consistency and temporal coherence over long duration. Second, we achieve 4K resolution portrait video generation. To accomplish this, we implement vector quantization of latent codes and apply temporal alignment techniques to maintain coherence across the temporal dimension. By integrating a high-quality decoder, we realize visual synthesis at 4K resolution. Third, we incorporate adjustable semantic textual labels for portrait expressions as conditional inputs. This extends beyond traditional audio cues to improve controllability and increase the diversity of the generated content. To the best of our knowledge, Hallo2, proposed in this paper, is the first method to achieve 4K resolution and generate hour-long, audio-driven portrait image animations enhanced with textual prompts. We have conducted extensive experiments to evaluate our method on publicly available datasets, including HDTF, CelebV, and our introduced "Wild" dataset. The experimental results demonstrate that our approach achieves state-of-the-art performance in long-duration portrait video animation, successfully generating rich and controllable content at 4K resolution for duration extending up to tens of minutes. Project page https://fudan-generative-vision.github.io/hallo2

Via

Access Paper or Ask Questions

Hierarchical Hybrid Learning for Long-Horizon Contact-Rich Robotic Assembly

Sep 24, 2024

Jiankai Sun, Aidan Curtis, Yang You, Yan Xu, Michael Koehle, Leonidas Guibas, Sachin Chitta, Mac Schwager, Hui Li

Figure 1 for Hierarchical Hybrid Learning for Long-Horizon Contact-Rich Robotic Assembly

Figure 2 for Hierarchical Hybrid Learning for Long-Horizon Contact-Rich Robotic Assembly

Figure 3 for Hierarchical Hybrid Learning for Long-Horizon Contact-Rich Robotic Assembly

Figure 4 for Hierarchical Hybrid Learning for Long-Horizon Contact-Rich Robotic Assembly

Abstract:Generalizable long-horizon robotic assembly requires reasoning at multiple levels of abstraction. End-to-end imitation learning (IL) has been proven a promising approach, but it requires a large amount of demonstration data for training and often fails to meet the high-precision requirement of assembly tasks. Reinforcement Learning (RL) approaches have succeeded in high-precision assembly tasks, but suffer from sample inefficiency and hence, are less competent at long-horizon tasks. To address these challenges, we propose a hierarchical modular approach, named ARCH (Adaptive Robotic Composition Hierarchy), which enables long-horizon high-precision assembly in contact-rich settings. ARCH employs a hierarchical planning framework, including a low-level primitive library of continuously parameterized skills and a high-level policy. The low-level primitive library includes essential skills for assembly tasks, such as grasping and inserting. These primitives consist of both RL and model-based controllers. The high-level policy, learned via imitation learning from a handful of demonstrations, selects the appropriate primitive skills and instantiates them with continuous input parameters. We extensively evaluate our approach on a real robot manipulation platform. We show that while trained on a single task, ARCH generalizes well to unseen tasks and outperforms baseline methods in terms of success rate and data efficiency. Videos can be found at https://long-horizon-assembly.github.io.

Via

Access Paper or Ask Questions

Agents in Software Engineering: Survey, Landscape, and Vision

Sep 13, 2024

Yanxian Huang, Wanjun Zhong, Ensheng Shi, Min Yang, Jiachi Chen, Hui Li, Yuchi Ma, Qianxiang Wang, Zibin Zheng, Yanlin Wang

Figure 1 for Agents in Software Engineering: Survey, Landscape, and Vision

Figure 2 for Agents in Software Engineering: Survey, Landscape, and Vision

Figure 3 for Agents in Software Engineering: Survey, Landscape, and Vision

Figure 4 for Agents in Software Engineering: Survey, Landscape, and Vision

Abstract:In recent years, Large Language Models (LLMs) have achieved remarkable success and have been widely used in various downstream tasks, especially in the tasks of the software engineering (SE) field. We find that many studies combining LLMs with SE have employed the concept of agents either explicitly or implicitly. However, there is a lack of an in-depth survey to sort out the development context of existing works, analyze how existing works combine the LLM-based agent technologies to optimize various tasks, and clarify the framework of LLM-based agents in SE. In this paper, we conduct the first survey of the studies on combining LLM-based agents with SE and present a framework of LLM-based agents in SE which includes three key modules: perception, memory, and action. We also summarize the current challenges in combining the two fields and propose future opportunities in response to existing challenges. We maintain a GitHub repository of the related papers at: https://github.com/DeepSoftwareAnalytics/Awesome-Agent4SE.

* 12 pages, 4 figures

Via

Access Paper or Ask Questions

Seed-Music: A Unified Framework for High Quality and Controlled Music Generation

Sep 13, 2024

Ye Bai, Haonan Chen, Jitong Chen, Zhuo Chen, Yi Deng, Xiaohong Dong, Lamtharn Hantrakul, Weituo Hao, Qingqing Huang, Zhongyi Huang(+28 more)

Figure 1 for Seed-Music: A Unified Framework for High Quality and Controlled Music Generation

Figure 2 for Seed-Music: A Unified Framework for High Quality and Controlled Music Generation

Figure 3 for Seed-Music: A Unified Framework for High Quality and Controlled Music Generation

Figure 4 for Seed-Music: A Unified Framework for High Quality and Controlled Music Generation

Abstract:We introduce Seed-Music, a suite of music generation systems capable of producing high-quality music with fine-grained style control. Our unified framework leverages both auto-regressive language modeling and diffusion approaches to support two key music creation workflows: \textit{controlled music generation} and \textit{post-production editing}. For controlled music generation, our system enables vocal music generation with performance controls from multi-modal inputs, including style descriptions, audio references, musical scores, and voice prompts. For post-production editing, it offers interactive tools for editing lyrics and vocal melodies directly in the generated audio. We encourage readers to listen to demo audio examples at https://team.doubao.com/seed-music .

* Seed-Music technical report, 20 pages, 5 figures

Via

Access Paper or Ask Questions

In-Context Imitation Learning via Next-Token Prediction

Aug 28, 2024

Letian Fu, Huang Huang, Gaurav Datta, Lawrence Yunliang Chen, William Chung-Ho Panitch, Fangchen Liu, Hui Li, Ken Goldberg

Figure 1 for In-Context Imitation Learning via Next-Token Prediction

Figure 2 for In-Context Imitation Learning via Next-Token Prediction

Figure 3 for In-Context Imitation Learning via Next-Token Prediction

Figure 4 for In-Context Imitation Learning via Next-Token Prediction

Abstract:We explore how to enhance next-token prediction models to perform in-context imitation learning on a real robot, where the robot executes new tasks by interpreting contextual information provided during the input phase, without updating its underlying policy parameters. We propose In-Context Robot Transformer (ICRT), a causal transformer that performs autoregressive prediction on sensorimotor trajectories without relying on any linguistic data or reward function. This formulation enables flexible and training-free execution of new tasks at test time, achieved by prompting the model with sensorimotor trajectories of the new task composing of image observations, actions and states tuples, collected through human teleoperation. Experiments with a Franka Emika robot demonstrate that the ICRT can adapt to new tasks specified by prompts, even in environment configurations that differ from both the prompt and the training data. In a multitask environment setup, ICRT significantly outperforms current state-of-the-art next-token prediction models in robotics on generalizing to unseen tasks. Code, checkpoints and data are available on https://icrt.dev/

Via

Access Paper or Ask Questions

NeuralOOD: Improving Out-of-Distribution Generalization Performance with Brain-machine Fusion Learning Framework

Aug 27, 2024

Shuangchen Zhao, Changde Du, Hui Li, Huiguang He

Figure 1 for NeuralOOD: Improving Out-of-Distribution Generalization Performance with Brain-machine Fusion Learning Framework

Figure 2 for NeuralOOD: Improving Out-of-Distribution Generalization Performance with Brain-machine Fusion Learning Framework

Figure 3 for NeuralOOD: Improving Out-of-Distribution Generalization Performance with Brain-machine Fusion Learning Framework

Figure 4 for NeuralOOD: Improving Out-of-Distribution Generalization Performance with Brain-machine Fusion Learning Framework

Abstract:Deep Neural Networks (DNNs) have demonstrated exceptional recognition capabilities in traditional computer vision (CV) tasks. However, existing CV models often suffer a significant decrease in accuracy when confronted with out-of-distribution (OOD) data. In contrast to these DNN models, human can maintain a consistently low error rate when facing OOD scenes, partly attributed to the rich prior cognitive knowledge stored in the human brain. Previous OOD generalization researches only focus on the single modal, overlooking the advantages of multimodal learning method. In this paper, we utilize the multimodal learning method to improve the OOD generalization and propose a novel Brain-machine Fusion Learning (BMFL) framework. We adopt the cross-attention mechanism to fuse the visual knowledge from CV model and prior cognitive knowledge from the human brain. Specially, we employ a pre-trained visual neural encoding model to predict the functional Magnetic Resonance Imaging (fMRI) from visual features which eliminates the need for the fMRI data collection and pre-processing, effectively reduces the workload associated with conventional BMFL methods. Furthermore, we construct a brain transformer to facilitate the extraction of knowledge inside the fMRI data. Moreover, we introduce the Pearson correlation coefficient maximization regularization method into the training process, which improves the fusion capability with better constrains. Our model outperforms the DINOv2 and baseline models on the ImageNet-1k validation dataset as well as six curated OOD datasets, showcasing its superior performance in diverse scenarios.

Via

Access Paper or Ask Questions

Deep Code Search with Naming-Agnostic Contrastive Multi-View Learning

Aug 18, 2024

Jiadong Feng, Wei Li, Zhao Wei, Yong Xu, Juhong Wang, Hui Li

Figure 1 for Deep Code Search with Naming-Agnostic Contrastive Multi-View Learning

Figure 2 for Deep Code Search with Naming-Agnostic Contrastive Multi-View Learning

Figure 3 for Deep Code Search with Naming-Agnostic Contrastive Multi-View Learning

Figure 4 for Deep Code Search with Naming-Agnostic Contrastive Multi-View Learning

Abstract:Software development is a repetitive task, as developers usually reuse or get inspiration from existing implementations. Code search, which refers to the retrieval of relevant code snippets from a codebase according to the developer's intent that has been expressed as a query, has become increasingly important in the software development process. Due to the success of deep learning in various applications, a great number of deep learning based code search approaches have sprung up and achieved promising results. However, developers may not follow the same naming conventions and the same variable may have different variable names in different implementations, bringing a challenge to deep learning based code search methods that rely on explicit variable correspondences to understand source code. To overcome this challenge, we propose a naming-agnostic code search method (NACS) based on contrastive multi-view code representation learning. NACS strips information bound to variable names from Abstract Syntax Tree (AST), the representation of the abstract syntactic structure of source code, and focuses on capturing intrinsic properties solely from AST structures. We use semantic-level and syntax-level augmentation techniques to prepare realistically rational data and adopt contrastive learning to design a graph-view modeling component in NACS to enhance the understanding of code snippets. We further model ASTs in a path view to strengthen the graph-view modeling component through multi-view learning. Extensive experiments show that NACS provides superior code search performance compared to baselines and NACS can be adapted to help existing code search methods overcome the impact of different naming conventions.

Via

Access Paper or Ask Questions

DePrompt: Desensitization and Evaluation of Personal Identifiable Information in Large Language Model Prompts

Aug 16, 2024

Xiongtao Sun, Gan Liu, Zhipeng He, Hui Li, Xiaoguang Li

Abstract:Prompt serves as a crucial link in interacting with large language models (LLMs), widely impacting the accuracy and interpretability of model outputs. However, acquiring accurate and high-quality responses necessitates precise prompts, which inevitably pose significant risks of personal identifiable information (PII) leakage. Therefore, this paper proposes DePrompt, a desensitization protection and effectiveness evaluation framework for prompt, enabling users to safely and transparently utilize LLMs. Specifically, by leveraging large model fine-tuning techniques as the underlying privacy protection method, we integrate contextual attributes to define privacy types, achieving high-precision PII entity identification. Additionally, through the analysis of key features in prompt desensitization scenarios, we devise adversarial generative desensitization methods that retain important semantic content while disrupting the link between identifiers and privacy attributes. Furthermore, we present utility evaluation metrics for prompt to better gauge and balance privacy and usability. Our framework is adaptable to prompts and can be extended to text usability-dependent scenarios. Through comparison with benchmarks and other model methods, experimental evaluations demonstrate that our desensitized prompt exhibit superior privacy protection utility and model inference results.

Via

Access Paper or Ask Questions

System States Forecasting of Microservices with Dynamic Spatio-Temporal Data

Aug 15, 2024

Yifei Xu, Jingguo Ge, Haina Tang, Shuai Ding, Tong Li, Hui Li

Figure 1 for System States Forecasting of Microservices with Dynamic Spatio-Temporal Data

Figure 2 for System States Forecasting of Microservices with Dynamic Spatio-Temporal Data

Figure 3 for System States Forecasting of Microservices with Dynamic Spatio-Temporal Data

Figure 4 for System States Forecasting of Microservices with Dynamic Spatio-Temporal Data

Abstract:In the AIOps (Artificial Intelligence for IT Operations) era, accurately forecasting system states is crucial. In microservices systems, this task encounters the challenge of dynamic and complex spatio-temporal relationships among microservice instances, primarily due to dynamic deployments, diverse call paths, and cascading effects among instances. Current time-series forecasting methods, which focus mainly on intrinsic patterns, are insufficient in environments where spatial relationships are critical. Similarly, spatio-temporal graph approaches often neglect the nature of temporal trend, concentrating mostly on message passing between nodes. Moreover, current research in microservices domain frequently underestimates the importance of network metrics and topological structures in capturing the evolving dynamics of systems. This paper introduces STMformer, a model tailored for forecasting system states in microservices environments, capable of handling multi-node and multivariate time series. Our method leverages dynamic network connection data and topological information to assist in modeling the intricate spatio-temporal relationships within the system. Additionally, we integrate the PatchCrossAttention module to compute the impact of cascading effects globally. We have developed a dataset based on a microservices system and conducted comprehensive experiments with STMformer against leading methods. In both short-term and long-term forecasting tasks, our model consistently achieved a 8.6% reduction in MAE(Mean Absolute Error) and a 2.2% reduction in MSE (Mean Squared Error). The source code is available at https://github.com/xuyifeiiie/STMformer.

Via

Access Paper or Ask Questions