Siyuan Li

Enhancing Image Generation Fidelity via Progressive Prompts

Jan 13, 2025

Zhen Xiong, Yuqi Li, Chuanguang Yang, Tiao Tan, Zhihong Zhu, Siyuan Li, Yue Ma

Abstract: The diffusion transformer (DiT) architecture has attracted significant attention in image generation, achieving better fidelity, performance, and diversity. However, most existing DiT-based image generation methods focus on global-aware synthesis, and regional prompt control has been less explored. In this paper, we propose a coarse-to-fine generation pipeline for regional prompt-following generation. Specifically, we first utilize the powerful large language model (LLM) to generate both high-level descriptions of the image (such as content, topic, and objects) and low-level descriptions (such as details and style). Then, we explore the influence of cross-attention layers at different depths. We find that deeper layers are always responsible for high-level content control, while shallow layers handle low-level content control. Various prompts are injected into the proposed regional cross-attention control for coarse-to-fine generation. By using the proposed pipeline, we enhance the controllability of DiT-based image generation. Extensive quantitative and qualitative results show that our pipeline can improve the performance of the generated images.

* Accepted by ICASSP 2025, GitHub: https://github.com/ZhenXiong-dl/ICASSP2025-RCAC
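
The depth-based routing described in the abstract can be illustrated with a minimal sketch. It is not taken from the paper's repository: the `RegionPrompts` type, the `prompts_for_layer` helper, the normalized box format, and the fixed `split` fraction between "shallow" and "deep" layers are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # hypothetical (x0, y0, x1, y1), normalized


@dataclass
class RegionPrompts:
    """Hypothetical container: one image region with two LLM-written descriptions."""
    region: Box
    high_level: str  # content, topic, objects
    low_level: str   # details, style


def prompts_for_layer(layer_idx: int, num_layers: int,
                      prompts: List[RegionPrompts],
                      split: float = 0.5) -> List[Tuple[Box, str]]:
    """Pick the text to inject into the cross-attention of one layer.

    Following the abstract's finding, layers at or beyond the split depth
    receive high-level prompts and shallower layers receive low-level ones.
    The split fraction itself is an assumption, not a value from the paper.
    """
    if not 0 <= layer_idx < num_layers:
        raise ValueError("layer_idx out of range")
    is_deep = layer_idx >= int(num_layers * split)
    return [(p.region, p.high_level if is_deep else p.low_level) for p in prompts]
```

With four layers and the default split, layers 0 and 1 would receive the low-level text for each region and layers 2 and 3 the high-level text.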

PhyloGen: Language Model-Enhanced Phylogenetic Inference via Graph Structure Generation

Dec 25, 2024

ChenRui Duan, Zelin Zang, Siyuan Li, Yongjie Xu, Stan Z. Li

Abstract: Phylogenetic trees elucidate evolutionary relationships among species, but phylogenetic inference remains challenging due to the complexity of combining continuous (branch lengths) and discrete parameters (tree topology). Traditional Markov Chain Monte Carlo methods face slow convergence and computational burdens. Existing Variational Inference methods, which require pre-generated topologies and typically treat tree structures and branch lengths independently, may overlook critical sequence features, limiting their accuracy and flexibility. We propose PhyloGen, a novel method leveraging a pre-trained genomic language model to generate and optimize phylogenetic trees without dependence on evolutionary models or aligned sequence constraints. PhyloGen views phylogenetic inference as a conditionally constrained tree structure generation problem, jointly optimizing tree topology and branch lengths through three core modules: (i) Feature Extraction, (ii) PhyloTree Construction, and (iii) PhyloTree Structure Modeling. Meanwhile, we introduce a Scoring Function to guide the model towards a more stable gradient descent. We demonstrate the effectiveness and robustness of PhyloGen on eight real-world benchmark datasets. Visualization results confirm PhyloGen provides deeper insights into phylogenetic relationships.

Event USKT : U-State Space Model in Knowledge Transfer for Event Cameras

Nov 22, 2024

Yuhui Lin, Jiahao Zhang, Siyuan Li, Jimin Xiao, Ding Xu, Wenjun Wu, Jiaxuan Lu

Abstract: Event cameras, as an emerging imaging technology, offer distinct advantages over traditional RGB cameras, including reduced energy consumption and higher frame rates. However, the limited quantity of available event data presents a significant challenge, hindering their broader development. To alleviate this issue, we introduce a tailored U-shaped State Space Model Knowledge Transfer (USKT) framework for Event-to-RGB knowledge transfer. This framework generates inputs compatible with RGB frames, enabling event data to effectively reuse pre-trained RGB models and achieve competitive performance with minimal parameter tuning. Within the USKT architecture, we also propose a bidirectional reverse state space model. Unlike conventional bidirectional scanning mechanisms, the proposed Bidirectional Reverse State Space Model (BiR-SSM) leverages a shared weight strategy, which facilitates efficient modeling while conserving computational resources. In terms of effectiveness, integrating USKT with ResNet50 as the backbone improves model performance by 0.95%, 3.57%, and 2.9% on DVS128 Gesture, N-Caltech101, and CIFAR-10-DVS datasets, respectively, underscoring USKT's adaptability and effectiveness. The code will be made available upon acceptance.
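
The idea of scanning a sequence in both directions with one shared set of weights can be sketched with a toy scalar state space recurrence. This is not the BiR-SSM from the paper, whose details the abstract does not give; the recurrence `h_t = a*h_{t-1} + b*x_t`, `y_t = c*h_t`, the function names, and the choice to sum the two directions are assumptions for illustration only.

```python
from typing import List


def scan(xs: List[float], a: float, b: float, c: float) -> List[float]:
    """Run the linear recurrence h_t = a*h_{t-1} + b*x_t and emit y_t = c*h_t."""
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys


def shared_bidirectional_scan(xs: List[float], a: float, b: float, c: float) -> List[float]:
    """Scan forward and backward with the SAME parameters (a, b, c).

    A conventional bidirectional layer would keep a second parameter set for
    the reverse direction; reusing one set is what saves parameters here.
    """
    forward = scan(xs, a, b, c)
    # Reverse the input, scan, then reverse the output back to input order.
    backward = scan(xs[::-1], a, b, c)[::-1]
    return [f + r for f, r in zip(forward, backward)]


print(shared_bidirectional_scan([1.0, 0.0, 0.0], a=0.5, b=1.0, c=1.0))
# prints [2.0, 0.5, 0.25]
```

In this toy run the forward pass spreads the impulse at position 0 to later positions, while the backward pass only sees it at position 0, so the two contributions add there.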

SkillTree: Explainable Skill-Based Deep Reinforcement Learning for Long-Horizon Control Tasks

Nov 19, 2024

Yongyan Wen, Siyuan Li, Rongchang Zuo, Lei Yuan, Hangyu Mao, Peng Liu

Abstract: Deep reinforcement learning (DRL) has achieved remarkable success in various research domains. However, its reliance on neural networks results in a lack of transparency, which limits its practical applications. To achieve explainability, decision trees have emerged as a popular and promising alternative to neural networks. Nonetheless, due to their limited expressiveness, traditional decision trees struggle with high-dimensional long-horizon continuous control tasks. In this paper, we propose SkillTree, a novel framework that reduces complex continuous action spaces into discrete skill spaces. Our hierarchical approach integrates a differentiable decision tree within the high-level policy to generate skill embeddings, which subsequently guide the low-level policy in executing skills. By making skill decisions explainable, we achieve skill-level explainability, enhancing the understanding of the decision-making process in complex tasks. Experimental results demonstrate that our method achieves performance comparable to skill-based neural networks in complex robotic arm control domains. Furthermore, SkillTree offers explanations at the skill level, thereby increasing the transparency of the decision-making process.
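
A differentiable (soft) decision tree of the general kind mentioned above can be sketched as follows. This is a generic textbook-style construction, not SkillTree's implementation: the sigmoid routing, the level-order node layout, and the helper names `leaf_probabilities` and `select_skill` are assumptions for illustration, and mapping each leaf to one discrete skill index is likewise hypothetical.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def leaf_probabilities(x: Sequence[float],
                       weights: List[Sequence[float]],
                       biases: List[float],
                       depth: int) -> List[float]:
    """Soft-route an observation x through a full binary tree.

    Inner nodes are stored in level order. Each inner node sends probability
    mass right with sigmoid(w . x + b) and left with the complement, so the
    2**depth leaf probabilities sum to 1 and stay differentiable in w and b.
    """
    n_inner = 2 ** depth - 1
    if len(weights) != n_inner or len(biases) != n_inner:
        raise ValueError("need one weight vector and bias per inner node")
    probs = [1.0]  # probability of reaching each node on the current level
    node = 0
    for _ in range(depth):
        next_probs = []
        for p in probs:
            score = sum(w * xi for w, xi in zip(weights[node], x)) + biases[node]
            go_right = sigmoid(score)
            next_probs.append(p * (1.0 - go_right))  # left child
            next_probs.append(p * go_right)          # right child
            node += 1
        probs = next_probs
    return probs


def select_skill(leaf_probs: List[float]) -> int:
    """Pick the most probable leaf; each leaf stands for one discrete skill."""
    return max(range(len(leaf_probs)), key=lambda i: leaf_probs[i])
```

Because every routing decision is a single linear test on the observation, the path to the selected leaf can be read off node by node, which is the kind of inspection a tree-based high-level policy is meant to allow.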

Safe Planner: Empowering Safety Awareness in Large Pre-Trained Models for Robot Task Planning

Nov 11, 2024

Siyuan Li, Zhe Ma, Feifan Liu, Jiani Lu, Qinqin Xiao, Kewu Sun, Lingfei Cui, Xirui Yang, Peng Liu, Xun Wang

Abstract: Robot task planning is an important problem for autonomous robots in long-horizon challenging tasks. As large pre-trained models have demonstrated superior planning ability, recent research investigates utilizing large models to achieve autonomous planning for robots in diverse tasks. However, since the large models are pre-trained with Internet data and lack the knowledge of real task scenes, large models as planners may make unsafe decisions that hurt the robots and the surrounding environments. To solve this challenge, we propose a novel Safe Planner framework, which empowers safety awareness in large pre-trained models to accomplish safe and executable planning. In this framework, we develop a safety prediction module to guide the high-level large model planner, and this safety module trained in a simulator can be effectively transferred to real-world tasks. The proposed Safe Planner framework is evaluated on both simulated environments and real robots. The experimental results demonstrate that Safe Planner not only achieves state-of-the-art task success rates, but also substantially improves safety during task execution. Videos of the experiments are available at https://sites.google.com/view/safeplanner .

* 9 pages, 6 figures

MeToken: Uniform Micro-environment Token Boosts Post-Translational Modification Prediction

Nov 04, 2024

Cheng Tan, Zhenxiao Cao, Zhangyang Gao, Lirong Wu, Siyuan Li, Yufei Huang, Jun Xia, Bozhen Hu, Stan Z. Li

Abstract: Post-translational modifications (PTMs) profoundly expand the complexity and functionality of the proteome, regulating protein attributes and interactions that are crucial for biological processes. Accurately predicting PTM sites and their specific types is therefore essential for elucidating protein function and understanding disease mechanisms. Existing computational approaches predominantly focus on protein sequences to predict PTM sites, driven by the recognition of sequence-dependent motifs. However, these approaches often overlook protein structural contexts. In this work, we first compile a large-scale sequence-structure PTM dataset, which serves as the foundation for fair comparison. We introduce the MeToken model, which tokenizes the micro-environment of each amino acid, integrating both sequence and structural information into unified discrete tokens. This model not only captures the typical sequence motifs associated with PTMs but also leverages the spatial arrangements dictated by protein tertiary structures, thus providing a holistic view of the factors influencing PTM sites. Designed to address the long-tail distribution of PTM types, MeToken employs uniform sub-codebooks that ensure even the rarest PTMs are adequately represented and distinguished. We validate the effectiveness and generalizability of MeToken across multiple datasets, demonstrating its superior performance in accurately identifying PTM types. The results underscore the importance of incorporating structural data and highlight MeToken's potential in facilitating accurate and comprehensive PTM predictions, which could significantly impact proteomics research. The code and datasets are available at https://github.com/A4Bio/MeToken.

* 26 pages, 20 figures, 10 tables

TopoFR: A Closer Look at Topology Alignment on Face Recognition

Oct 14, 2024

Jun Dan, Yang Liu, Jiankang Deng, Haoyu Xie, Siyuan Li, Baigui Sun, Shan Luo

Abstract: The field of face recognition (FR) has undergone significant advancements with the rise of deep learning. Recently, the success of unsupervised learning and graph neural networks has demonstrated the effectiveness of data structure information. Considering that the FR task can leverage large-scale training data, which intrinsically contains significant structure information, we aim to investigate how to encode such critical structure information into the latent space. As our observations reveal, directly aligning the structure information between the input and latent spaces inevitably suffers from an overfitting problem, leading to a structure collapse phenomenon in the latent space. To address this problem, we propose TopoFR, a novel FR model that leverages a topological structure alignment strategy called PTSA and a hard sample mining strategy named SDE. Concretely, PTSA uses persistent homology to align the topological structures of the input and latent spaces, effectively preserving the structure information and improving the generalization performance of the FR model. To mitigate the impact of hard samples on the latent space structure, SDE accurately identifies hard samples by automatically computing a structure damage score (SDS) for each sample, and directs the model to prioritize optimizing these samples. Experimental results on popular face benchmarks demonstrate the superiority of our TopoFR over the state-of-the-art methods. Code and models are available at: https://github.com/modelscope/facechain/tree/main/face_module/TopoFR.

* Accepted by NeurIPS 2024

Towards Homogeneous Lexical Tone Decoding from Heterogeneous Intracranial Recordings

Oct 13, 2024

Di Wu, Siyuan Li, Chen Feng, Lu Cao, Yue Zhang, Jie Yang, Mohamad Sawan

Abstract: Recent advancements in brain-computer interfaces (BCIs) have enabled the decoding of lexical tones from intracranial recordings, offering the potential to restore the communication abilities of speech-impaired tonal language speakers. However, data heterogeneity induced by both physiological and instrumental factors poses a significant challenge for unified invasive brain tone decoding. Traditional subject-specific models, which operate under a heterogeneous decoding paradigm, fail to capture generalized neural representations and cannot effectively leverage data across subjects. To address these limitations, we introduce Homogeneity-Heterogeneity Disentangled Learning for neural Representations (H2DiLR), a novel framework that disentangles and learns both the homogeneity and heterogeneity from intracranial recordings across multiple subjects. To evaluate H2DiLR, we collected stereoelectroencephalography (sEEG) data from multiple participants reading Mandarin materials comprising 407 syllables, representing nearly all Mandarin characters. Extensive experiments demonstrate that H2DiLR, as a unified decoding paradigm, significantly outperforms the conventional heterogeneous decoding approach. Furthermore, we empirically confirm that H2DiLR effectively captures both homogeneity and heterogeneity during neural representation learning.

* Preprint V1 with 10 pages main text

Unveiling the Backbone-Optimizer Coupling Bias in Visual Representation Learning

Oct 08, 2024

Siyuan Li, Juanxi Tian, Zedong Wang, Luyuan Zhang, Zicheng Liu, Weiyang Jin, Yang Liu, Baigui Sun, Stan Z. Li

Abstract: This paper delves into the interplay between vision backbones and optimizers, unveiling an inter-dependent phenomenon termed backbone-optimizer coupling bias (BOCB). We observe that canonical CNNs, such as VGG and ResNet, exhibit a marked co-dependency with SGD families, while recent architectures like ViTs and ConvNeXt share a tight coupling with the adaptive learning rate ones. We further show that BOCB can be introduced by both optimizers and certain backbone designs and may significantly impact the pre-training and downstream fine-tuning of vision models. Through in-depth empirical analysis, we summarize takeaways on recommended optimizers and insights into robust vision backbone architectures. We hope this work can inspire the community to question long-held assumptions on backbones and optimizers, stimulate further explorations, and thereby contribute to more robust vision systems. The source code and models are publicly available at https://bocb-ai.github.io/.

* Preprint V1. Online project at https://bocb-ai.github.io/

Samba: Synchronized Set-of-Sequences Modeling for Multiple Object Tracking

Oct 02, 2024

Mattia Segu, Luigi Piccinelli, Siyuan Li, Yung-Hsu Yang, Bernt Schiele, Luc Van Gool

Abstract: Multiple object tracking in complex scenarios - such as coordinated dance performances, team sports, or dynamic animal groups - presents unique challenges. In these settings, objects frequently move in coordinated patterns, occlude each other, and exhibit long-term dependencies in their trajectories. However, how to model long-range dependencies within tracklets, interdependencies among tracklets, and the associated temporal occlusions remains a key open research question. To this end, we introduce Samba, a novel linear-time set-of-sequences model designed to jointly process multiple tracklets by synchronizing the multiple selective state-spaces used to model each tracklet. Samba autoregressively predicts the future track query for each sequence while maintaining synchronized long-term memory representations across tracklets. By integrating Samba into a tracking-by-propagation framework, we propose SambaMOTR, the first tracker effectively addressing the aforementioned issues, including long-range dependencies, tracklet interdependencies, and temporal occlusions. Additionally, we introduce an effective technique for dealing with uncertain observations (MaskObs) and an efficient training recipe to scale SambaMOTR to longer sequences. By modeling long-range dependencies and interactions among tracked objects, SambaMOTR implicitly learns to track objects accurately through occlusions without any hand-crafted heuristics. Our approach significantly surpasses prior state-of-the-art on the DanceTrack, BFT, and SportsMOT datasets.
