Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jie Li

University of Science and Technology of China, AnyWit Robotics Co., Ltd

HOIGen-1M: A Large-scale Dataset for Human-Object Interaction Video Generation

Mar 31, 2025

Kun Liu, Qi Liu, Xinchen Liu, Jie Li, Yongdong Zhang, Jiebo Luo, Xiaodong He, Wu Liu

Abstract:Text-to-video (T2V) generation has made tremendous progress in generating complicated scenes based on texts. However, human-object interaction (HOI) often cannot be precisely generated by current T2V models due to the lack of large-scale videos with accurate captions for HOI. To address this issue, we introduce HOIGen-1M, the first largescale dataset for HOI Generation, consisting of over one million high-quality videos collected from diverse sources. In particular, to guarantee the high quality of videos, we first design an efficient framework to automatically curate HOI videos using the powerful multimodal large language models (MLLMs), and then the videos are further cleaned by human annotators. Moreover, to obtain accurate textual captions for HOI videos, we design a novel video description method based on a Mixture-of-Multimodal-Experts (MoME) strategy that not only generates expressive captions but also eliminates the hallucination by individual MLLM. Furthermore, due to the lack of an evaluation framework for generated HOI videos, we propose two new metrics to assess the quality of generated videos in a coarse-to-fine manner. Extensive experiments reveal that current T2V models struggle to generate high-quality HOI videos and confirm that our HOIGen-1M dataset is instrumental for improving HOI video generation. Project webpage is available at https://liuqi-creat.github.io/HOIGen.github.io.

* CVPR 2025

Via

Access Paper or Ask Questions

Sparse Meets Dense: Unified Generative Recommendations with Cascaded Sparse-Dense Representations

Mar 04, 2025

Yuhao Yang, Zhi Ji, Zhaopeng Li, Yi Li, Zhonglin Mo, Yue Ding, Kai Chen, Zijian Zhang, Jie Li, Shuanglong Li(+1 more)

Figure 1 for Sparse Meets Dense: Unified Generative Recommendations with Cascaded Sparse-Dense Representations

Figure 2 for Sparse Meets Dense: Unified Generative Recommendations with Cascaded Sparse-Dense Representations

Figure 3 for Sparse Meets Dense: Unified Generative Recommendations with Cascaded Sparse-Dense Representations

Figure 4 for Sparse Meets Dense: Unified Generative Recommendations with Cascaded Sparse-Dense Representations

Abstract:Generative models have recently gained attention in recommendation systems by directly predicting item identifiers from user interaction sequences. However, existing methods suffer from significant information loss due to the separation of stages such as quantization and sequence modeling, hindering their ability to achieve the modeling precision and accuracy of sequential dense retrieval techniques. Integrating generative and dense retrieval methods remains a critical challenge. To address this, we introduce the Cascaded Organized Bi-Represented generAtive retrieval (COBRA) framework, which innovatively integrates sparse semantic IDs and dense vectors through a cascading process. Our method alternates between generating these representations by first generating sparse IDs, which serve as conditions to aid in the generation of dense vectors. End-to-end training enables dynamic refinement of dense representations, capturing both semantic insights and collaborative signals from user-item interactions. During inference, COBRA employs a coarse-to-fine strategy, starting with sparse ID generation and refining them into dense vectors via the generative model. We further propose BeamFusion, an innovative approach combining beam search with nearest neighbor scores to enhance inference flexibility and recommendation diversity. Extensive experiments on public datasets and offline tests validate our method's robustness. Online A/B tests on a real-world advertising platform with over 200 million daily users demonstrate substantial improvements in key metrics, highlighting COBRA's practical advantages.

Via

Access Paper or Ask Questions

A Single-Frame and Multi-Frame Cascaded Image Super-Resolution Method

Dec 13, 2024

Jing Sun, Qiangqiang Yuan, Huanfeng Shen, Jie Li, Liangpei Zhang

Figure 1 for A Single-Frame and Multi-Frame Cascaded Image Super-Resolution Method

Figure 2 for A Single-Frame and Multi-Frame Cascaded Image Super-Resolution Method

Figure 3 for A Single-Frame and Multi-Frame Cascaded Image Super-Resolution Method

Figure 4 for A Single-Frame and Multi-Frame Cascaded Image Super-Resolution Method

Abstract:The objective of image super-resolution is to reconstruct a high-resolution (HR) image with the prior knowledge from one or several low-resolution (LR) images. However, in the real world, due to the limited complementary information, the performance of both single-frame and multi-frame super-resolution reconstruction degrades rapidly as the magnification increases. In this paper, we propose a novel two-step image super resolution method concatenating multi-frame super-resolution (MFSR) with single-frame super-resolution (SFSR), to progressively upsample images to the desired resolution. The proposed method consisting of an L0-norm constrained reconstruction scheme and an enhanced residual back-projection network, integrating the flexibility of the variational modelbased method and the feature learning capacity of the deep learning-based method. To verify the effectiveness of the proposed algorithm, extensive experiments with both simulated and real world sequences were implemented. The experimental results show that the proposed method yields superior performance in both objective and perceptual quality measurements. The average PSNRs of the cascade model in set5 and set14 are 33.413 dB and 29.658 dB respectively, which are 0.76 dB and 0.621 dB more than the baseline method. In addition, the experiment indicates that this cascade model can be robustly applied to different SFSR and MFSR methods.

* Sensors, vol. 24, no. 17, p. 5566, Aug. 2024
* 20 pages, 13 figures

Via

Access Paper or Ask Questions

Inference-to-complete: A High-performance and Programmable Data-plane Co-processor for Neural-network-driven Traffic Analysis

Nov 01, 2024

Dong Wen, Zhongpei Liu, Tong Yang, Tao Li, Tianyun Li, Chenglong Li, Jie Li, Zhigang Sun

Figure 1 for Inference-to-complete: A High-performance and Programmable Data-plane Co-processor for Neural-network-driven Traffic Analysis

Figure 2 for Inference-to-complete: A High-performance and Programmable Data-plane Co-processor for Neural-network-driven Traffic Analysis

Figure 3 for Inference-to-complete: A High-performance and Programmable Data-plane Co-processor for Neural-network-driven Traffic Analysis

Figure 4 for Inference-to-complete: A High-performance and Programmable Data-plane Co-processor for Neural-network-driven Traffic Analysis

Abstract:Neural-networks-driven intelligent data-plane (NN-driven IDP) is becoming an emerging topic for excellent accuracy and high performance. Meanwhile we argue that NN-driven IDP should satisfy three design goals: the flexibility to support various NNs models, the low-latency-high-throughput inference performance, and the data-plane-unawareness harming no performance and functionality. Unfortunately, existing work either over-modify NNs for IDP, or insert inline pipelined accelerators into the data-plane, failing to meet the flexibility and unawareness goals. In this paper, we propose Kaleidoscope, a flexible and high-performance co-processor located at the bypass of the data-plane. To address the challenge of meeting three design goals, three key techniques are presented. The programmable run-to-completion accelerators are developed for flexible inference. To further improve performance, we design a scalable inference engine which completes low-latency and low-cost inference for the mouse flows, and perform complex NNs with high-accuracy for the elephant flows. Finally, raw-bytes-based NNs are introduced, which help to achieve unawareness. We prototype Kaleidoscope on both FPGA and ASIC library. In evaluation on six NNs models, Kaleidoscope reaches 256-352 ns inference latency and 100 Gbps throughput with negligible influence on the data-plane. The on-board tested NNs perform state-of-the-art accuracy among other NN-driven IDP, exhibiting the the significant impact of flexibility on enhancing traffic analysis accuracy.

* Under review

Via

Access Paper or Ask Questions

Performance-Driven QUBO for Recommender Systems on Quantum Annealers

Oct 20, 2024

Jiayang Niu, Jie Li, Ke Deng, Mark Sanderson, Yongli Ren

Figure 1 for Performance-Driven QUBO for Recommender Systems on Quantum Annealers

Figure 2 for Performance-Driven QUBO for Recommender Systems on Quantum Annealers

Figure 3 for Performance-Driven QUBO for Recommender Systems on Quantum Annealers

Figure 4 for Performance-Driven QUBO for Recommender Systems on Quantum Annealers

Abstract:We propose Counterfactual Analysis Quadratic Unconstrained Binary Optimization (CAQUBO) to solve QUBO problems for feature selection in recommender systems. CAQUBO leverages counterfactual analysis to measure the impact of individual features and feature combinations on model performance and employs the measurements to construct the coefficient matrix for a quantum annealer to select the optimal feature combinations for recommender systems, thereby improving their final recommendation performance. By establishing explicit connections between features and the recommendation performance, the proposed approach demonstrates superior performance compared to the state-of-the-art quantum annealing methods. Extensive experiments indicate that integrating quantum computing with counterfactual analysis holds great promise for addressing these challenges.

Via

Access Paper or Ask Questions

MemFusionMap: Working Memory Fusion for Online Vectorized HD Map Construction

Sep 26, 2024

Jingyu Song, Xudong Chen, Liupei Lu, Jie Li, Katherine A. Skinner

Figure 1 for MemFusionMap: Working Memory Fusion for Online Vectorized HD Map Construction

Figure 2 for MemFusionMap: Working Memory Fusion for Online Vectorized HD Map Construction

Figure 3 for MemFusionMap: Working Memory Fusion for Online Vectorized HD Map Construction

Figure 4 for MemFusionMap: Working Memory Fusion for Online Vectorized HD Map Construction

Abstract:High-definition (HD) maps provide environmental information for autonomous driving systems and are essential for safe planning. While existing methods with single-frame input achieve impressive performance for online vectorized HD map construction, they still struggle with complex scenarios and occlusions. We propose MemFusionMap, a novel temporal fusion model with enhanced temporal reasoning capabilities for online HD map construction. Specifically, we contribute a working memory fusion module that improves the model's memory capacity to reason across history frames. We also design a novel temporal overlap heatmap to explicitly inform the model about the temporal overlap information and vehicle trajectory in the Bird's Eye View space. By integrating these two designs, MemFusionMap significantly outperforms existing methods while also maintaining a versatile design for scalability. We conduct extensive evaluation on open-source benchmarks and demonstrate a maximum improvement of 5.4% in mAP over state-of-the-art methods. The code for MemFusionMap will be made open-source upon publication of this paper.

Via

Access Paper or Ask Questions

Speaker Contrastive Learning for Source Speaker Tracing

Sep 16, 2024

Qing Wang, Hongmei Guo, Jian Kang, Mengjie Du, Jie Li, Xiao-Lei Zhang, Lei Xie

Figure 1 for Speaker Contrastive Learning for Source Speaker Tracing

Figure 2 for Speaker Contrastive Learning for Source Speaker Tracing

Figure 3 for Speaker Contrastive Learning for Source Speaker Tracing

Figure 4 for Speaker Contrastive Learning for Source Speaker Tracing

Abstract:As a form of biometric authentication technology, the security of speaker verification systems is of utmost importance. However, SV systems are inherently vulnerable to various types of attacks that can compromise their accuracy and reliability. One such attack is voice conversion, which modifies a persons speech to sound like another person by altering various vocal characteristics. This poses a significant threat to SV systems. To address this challenge, the Source Speaker Tracing Challenge in IEEE SLT2024 aims to identify the source speaker information in manipulated speech signals. Specifically, SSTC focuses on source speaker verification against voice conversion to determine whether two converted speech samples originate from the same source speaker. In this study, we propose a speaker contrastive learning-based approach for source speaker tracing to learn the latent source speaker information in converted speech. To learn a more source-speaker-related representation, we employ speaker contrastive loss during the training of the embedding extractor. This speaker contrastive loss helps identify the true source speaker embedding among several distractor speaker embeddings, enabling the embedding extractor to learn the potentially possessing source speaker information present in the converted speech. Experiments demonstrate that our proposed speaker contrastive learning system achieves the lowest EER of 16.788% on the challenge test set, securing first place in the challenge.

* 7 pages, 2 figures, accepted by SLT

Via

Access Paper or Ask Questions

DiffFAS: Face Anti-Spoofing via Generative Diffusion Models

Sep 13, 2024

Xinxu Ge, Xin Liu, Zitong Yu, Jingang Shi, Chun Qi, Jie Li, Heikki Kälviäinen

Figure 1 for DiffFAS: Face Anti-Spoofing via Generative Diffusion Models

Figure 2 for DiffFAS: Face Anti-Spoofing via Generative Diffusion Models

Figure 3 for DiffFAS: Face Anti-Spoofing via Generative Diffusion Models

Figure 4 for DiffFAS: Face Anti-Spoofing via Generative Diffusion Models

Abstract:Face anti-spoofing (FAS) plays a vital role in preventing face recognition (FR) systems from presentation attacks. Nowadays, FAS systems face the challenge of domain shift, impacting the generalization performance of existing FAS methods. In this paper, we rethink about the inherence of domain shift and deconstruct it into two factors: image style and image quality. Quality influences the purity of the presentation of spoof information, while style affects the manner in which spoof information is presented. Based on our analysis, we propose DiffFAS framework, which quantifies quality as prior information input into the network to counter image quality shift, and performs diffusion-based high-fidelity cross-domain and cross-attack types generation to counter image style shift. DiffFAS transforms easily collectible live faces into high-fidelity attack faces with precise labels while maintaining consistency between live and spoof face identities, which can also alleviate the scarcity of labeled data with novel type attacks faced by nowadays FAS system. We demonstrate the effectiveness of our framework on challenging cross-domain and cross-attack FAS datasets, achieving the state-of-the-art performance. Available at https://github.com/murphytju/DiffFAS.

* ECCV 24

Via

Access Paper or Ask Questions

SDformer: Efficient End-to-End Transformer for Depth Completion

Sep 12, 2024

Jian Qian, Miao Sun, Ashley Lee, Jie Li, Shenglong Zhuo, Patrick Yin Chiang

Figure 1 for SDformer: Efficient End-to-End Transformer for Depth Completion

Figure 2 for SDformer: Efficient End-to-End Transformer for Depth Completion

Figure 3 for SDformer: Efficient End-to-End Transformer for Depth Completion

Figure 4 for SDformer: Efficient End-to-End Transformer for Depth Completion

Abstract:Depth completion aims to predict dense depth maps with sparse depth measurements from a depth sensor. Currently, Convolutional Neural Network (CNN) based models are the most popular methods applied to depth completion tasks. However, despite the excellent high-end performance, they suffer from a limited representation area. To overcome the drawbacks of CNNs, a more effective and powerful method has been presented: the Transformer, which is an adaptive self-attention setting sequence-to-sequence model. While the standard Transformer quadratically increases the computational cost from the key-query dot-product of input resolution which improperly employs depth completion tasks. In this work, we propose a different window-based Transformer architecture for depth completion tasks named Sparse-to-Dense Transformer (SDformer). The network consists of an input module for the depth map and RGB image features extraction and concatenation, a U-shaped encoder-decoder Transformer for extracting deep features, and a refinement module. Specifically, we first concatenate the depth map features with the RGB image features through the input model. Then, instead of calculating self-attention with the whole feature maps, we apply different window sizes to extract the long-range depth dependencies. Finally, we refine the predicted features from the input module and the U-shaped encoder-decoder Transformer module to get the enriching depth features and employ a convolution layer to obtain the dense depth map. In practice, the SDformer obtains state-of-the-art results against the CNN-based depth completion models with lower computing loads and parameters on the NYU Depth V2 and KITTI DC datasets.

* Presented at the International Conference on Industrial Automation, Robotics and Control Engineering (IARCE) 2022

Via

Access Paper or Ask Questions

AgileIR: Memory-Efficient Group Shifted Windows Attention for Agile Image Restoration

Sep 10, 2024

Hongyi Cai, Mohammad Mahdinur Rahman, Mohammad Shahid Akhtar, Jie Li, Jingyu Wu, Zhili Fang

Figure 1 for AgileIR: Memory-Efficient Group Shifted Windows Attention for Agile Image Restoration

Figure 2 for AgileIR: Memory-Efficient Group Shifted Windows Attention for Agile Image Restoration

Figure 3 for AgileIR: Memory-Efficient Group Shifted Windows Attention for Agile Image Restoration

Figure 4 for AgileIR: Memory-Efficient Group Shifted Windows Attention for Agile Image Restoration

Abstract:Image Transformers show a magnificent success in Image Restoration tasks. Nevertheless, most of transformer-based models are strictly bounded by exorbitant memory occupancy. Our goal is to reduce the memory consumption of Swin Transformer and at the same time speed up the model during training process. Thus, we introduce AgileIR, group shifted attention mechanism along with window attention, which sparsely simplifies the model in architecture. We propose Group Shifted Window Attention (GSWA) to decompose Shift Window Multi-head Self Attention (SW-MSA) and Window Multi-head Self Attention (W-MSA) into groups across their attention heads, contributing to shrinking memory usage in back propagation. In addition to that, we keep shifted window masking and its shifted learnable biases during training, in order to induce the model interacting across windows within the channel. We also re-allocate projection parameters to accelerate attention matrix calculation, which we found a negligible decrease in performance. As a result of experiment, compared with our baseline SwinIR and other efficient quantization models, AgileIR keeps the performance still at 32.20 dB on Set5 evaluation dataset, exceeding other methods with tailor-made efficient methods and saves over 50% memory while a large batch size is employed.

Via

Access Paper or Ask Questions