Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Xin Jin

MotionBank: A Large-scale Video Motion Benchmark with Disentangled Rule-based Annotations

Oct 17, 2024

Liang Xu, Shaoyang Hua, Zili Lin, Yifan Liu, Feipeng Ma, Yichao Yan, Xin Jin, Xiaokang Yang, Wenjun Zeng

Figure 1 for MotionBank: A Large-scale Video Motion Benchmark with Disentangled Rule-based Annotations

Figure 2 for MotionBank: A Large-scale Video Motion Benchmark with Disentangled Rule-based Annotations

Figure 3 for MotionBank: A Large-scale Video Motion Benchmark with Disentangled Rule-based Annotations

Figure 4 for MotionBank: A Large-scale Video Motion Benchmark with Disentangled Rule-based Annotations

Abstract:In this paper, we tackle the problem of how to build and benchmark a large motion model (LMM). The ultimate goal of LMM is to serve as a foundation model for versatile motion-related tasks, e.g., human motion generation, with interpretability and generalizability. Though advanced, recent LMM-related works are still limited by small-scale motion data and costly text descriptions. Besides, previous motion benchmarks primarily focus on pure body movements, neglecting the ubiquitous motions in context, i.e., humans interacting with humans, objects, and scenes. To address these limitations, we consolidate large-scale video action datasets as knowledge banks to build MotionBank, which comprises 13 video action datasets, 1.24M motion sequences, and 132.9M frames of natural and diverse human motions. Different from laboratory-captured motions, in-the-wild human-centric videos contain abundant motions in context. To facilitate better motion text alignment, we also meticulously devise a motion caption generation algorithm to automatically produce rule-based, unbiased, and disentangled text descriptions via the kinematic characteristics for each motion. Extensive experiments show that our MotionBank is beneficial for general motion-related tasks of human motion generation, motion in-context generation, and motion understanding. Video motions together with the rule-based text annotations could serve as an efficient alternative for larger LMMs. Our dataset, codes, and benchmark will be publicly available at https://github.com/liangxuy/MotionBank.

Via

Access Paper or Ask Questions

Learning AND-OR Templates for Professional Photograph Parsing and Guidance

Oct 08, 2024

Xin Jin, Liaoruxing Zhang, Chenyu Fan, Wenbo Yuan

Figure 1 for Learning AND-OR Templates for Professional Photograph Parsing and Guidance

Figure 2 for Learning AND-OR Templates for Professional Photograph Parsing and Guidance

Figure 3 for Learning AND-OR Templates for Professional Photograph Parsing and Guidance

Figure 4 for Learning AND-OR Templates for Professional Photograph Parsing and Guidance

Abstract:Since the development of photography art, many so-called "templates" have been formed, namely visual styles summarized from a series of themed and stylized photography works. In this paper, we propose to analysize and and summarize these 'templates' in photography by learning composite templates of photography images. We present a framework for learning a hierarchical reconfigurable image template from photography images to learn and characterize the "templates" used in these photography images. Using this method, we measured the artistic quality of photography on the photos and conducted photography guidance. In addition, we also utilized the "templates" for guidance in several image generation tasks. Experimental results show that the learned templates can well describe the photography techniques and styles, whereas the proposed approach can assess the quality of photography images as human being does.

Via

Access Paper or Ask Questions

Open-World Reinforcement Learning over Long Short-Term Imagination

Oct 04, 2024

Jiajian Li, Qi Wang, Yunbo Wang, Xin Jin, Yang Li, Wenjun Zeng, Xiaokang Yang

Figure 1 for Open-World Reinforcement Learning over Long Short-Term Imagination

Figure 2 for Open-World Reinforcement Learning over Long Short-Term Imagination

Figure 3 for Open-World Reinforcement Learning over Long Short-Term Imagination

Figure 4 for Open-World Reinforcement Learning over Long Short-Term Imagination

Abstract:Training visual reinforcement learning agents in a high-dimensional open world presents significant challenges. While various model-based methods have improved sample efficiency by learning interactive world models, these agents tend to be "short-sighted", as they are typically trained on short snippets of imagined experiences. We argue that the primary obstacle in open-world decision-making is improving the efficiency of off-policy exploration across an extensive state space. In this paper, we present LS-Imagine, which extends the imagination horizon within a limited number of state transition steps, enabling the agent to explore behaviors that potentially lead to promising long-term feedback. The foundation of our approach is to build a long short-term world model. To achieve this, we simulate goal-conditioned jumpy state transitions and compute corresponding affordance maps by zooming in on specific areas within single images. This facilitates the integration of direct long-term values into behavior learning. Our method demonstrates significant improvements over state-of-the-art techniques in MineDojo.

Via

Access Paper or Ask Questions

Scene Graph Disentanglement and Composition for Generalizable Complex Image Generation

Oct 01, 2024

Yunnan Wang, Ziqiang Li, Zequn Zhang, Wenyao Zhang, Baao Xie, Xihui Liu, Wenjun Zeng, Xin Jin

Figure 1 for Scene Graph Disentanglement and Composition for Generalizable Complex Image Generation

Figure 2 for Scene Graph Disentanglement and Composition for Generalizable Complex Image Generation

Figure 3 for Scene Graph Disentanglement and Composition for Generalizable Complex Image Generation

Figure 4 for Scene Graph Disentanglement and Composition for Generalizable Complex Image Generation

Abstract:There has been exciting progress in generating images from natural language or layout conditions. However, these methods struggle to faithfully reproduce complex scenes due to the insufficient modeling of multiple objects and their relationships. To address this issue, we leverage the scene graph, a powerful structured representation, for complex image generation. Different from the previous works that directly use scene graphs for generation, we employ the generative capabilities of variational autoencoders and diffusion models in a generalizable manner, compositing diverse disentangled visual clues from scene graphs. Specifically, we first propose a Semantics-Layout Variational AutoEncoder (SL-VAE) to jointly derive (layouts, semantics) from the input scene graph, which allows a more diverse and reasonable generation in a one-to-many mapping. We then develop a Compositional Masked Attention (CMA) integrated with a diffusion model, incorporating (layouts, semantics) with fine-grained attributes as generation guidance. To further achieve graph manipulation while keeping the visual content consistent, we introduce a Multi-Layered Sampler (MLS) for an "isolated" image editing effect. Extensive experiments demonstrate that our method outperforms recent competitors based on text, layout, or scene graph, in terms of generation rationality and controllability.

* Accepted by NeurlPS 2024

Via

Access Paper or Ask Questions

EM-DARTS: Hierarchical Differentiable Architecture Search for Eye Movement Recognition

Sep 22, 2024

Huafeng Qin, Hongyu Zhu, Xin Jin, Xin Yu, Mounim A. El-Yacoubi, Xinbo Gao

Figure 1 for EM-DARTS: Hierarchical Differentiable Architecture Search for Eye Movement Recognition

Figure 2 for EM-DARTS: Hierarchical Differentiable Architecture Search for Eye Movement Recognition

Figure 3 for EM-DARTS: Hierarchical Differentiable Architecture Search for Eye Movement Recognition

Figure 4 for EM-DARTS: Hierarchical Differentiable Architecture Search for Eye Movement Recognition

Abstract:Eye movement biometrics has received increasing attention thanks to its high secure identification. Although deep learning (DL) models have been recently successfully applied for eye movement recognition, the DL architecture still is determined by human prior knowledge. Differentiable Neural Architecture Search (DARTS) automates the manual process of architecture design with high search efficiency. DARTS, however, usually stacks the same multiple learned cells to form a final neural network for evaluation, limiting therefore the diversity of the network. Incidentally, DARTS usually searches the architecture in a shallow network while evaluating it in a deeper one, which results in a large gap between the architecture depths in the search and evaluation scenarios. To address this issue, we propose EM-DARTS, a hierarchical differentiable architecture search algorithm to automatically design the DL architecture for eye movement recognition. First, we define a supernet and propose a global and local alternate Neural Architecture Search method to search the optimal architecture alternately with an differentiable neural architecture search. The local search strategy aims to find an optimal architecture for different cells while the global search strategy is responsible for optimizing the architecture of the target network. To further reduce redundancy, a transfer entropy is proposed to compute the information amount of each layer, so as to further simplify search network. Our experiments on three public databases demonstrate that the proposed EM-DARTS is capable of producing an optimal architecture that leads to state-of-the-art recognition performance.

* Submited to IEEE Transactions on Information Forensics and Security

Via

Access Paper or Ask Questions

Personalized Route Recommendation Based on User Habits for Vehicle Navigation

Sep 21, 2024

Yinuo Huang, Xin Jin, Miao Fan, Xunwei Yang, Fangliang Jiang

Figure 1 for Personalized Route Recommendation Based on User Habits for Vehicle Navigation

Figure 2 for Personalized Route Recommendation Based on User Habits for Vehicle Navigation

Figure 3 for Personalized Route Recommendation Based on User Habits for Vehicle Navigation

Figure 4 for Personalized Route Recommendation Based on User Habits for Vehicle Navigation

Abstract:Navigation route recommendation is one of the important functions of intelligent transportation. However, users frequently deviate from recommended routes for various reasons, with personalization being a key problem in the field of research. This paper introduces a personalized route recommendation method based on user historical navigation data. First, we formulate route sorting as a pointwise problem based on a large set of pertinent features. Second, we construct route features and user profiles to establish a comprehensive feature dataset. Furthermore, we propose a Deep-Cross-Recurrent (DCR) learning model aimed at learning route sorting scores and offering customized route recommendations. This approach effectively captures recommended navigation routes and user preferences by integrating DCN-v2 and LSTM. In offline evaluations, our method compared with the minimum ETA (estimated time of arrival), LightGBM, and DCN-v2 indicated 8.72%, 2.19%, and 0.9% reduction in the mean inconsistency rate respectively, demonstrating significant improvements in recommendation accuracy.

* Accepted by IDST 2024

Via

Access Paper or Ask Questions

Relax DARTS: Relaxing the Constraints of Differentiable Architecture Search for Eye Movement Recognition

Sep 18, 2024

Hongyu Zhu, Xin Jin, Hongchao Liao, Yan Xiang, Mounim A. El-Yacoubi, Huafeng Qin

Figure 1 for Relax DARTS: Relaxing the Constraints of Differentiable Architecture Search for Eye Movement Recognition

Figure 2 for Relax DARTS: Relaxing the Constraints of Differentiable Architecture Search for Eye Movement Recognition

Figure 3 for Relax DARTS: Relaxing the Constraints of Differentiable Architecture Search for Eye Movement Recognition

Figure 4 for Relax DARTS: Relaxing the Constraints of Differentiable Architecture Search for Eye Movement Recognition

Abstract:Eye movement biometrics is a secure and innovative identification method. Deep learning methods have shown good performance, but their network architecture relies on manual design and combined priori knowledge. To address these issues, we introduce automated network search (NAS) algorithms to the field of eye movement recognition and present Relax DARTS, which is an improvement of the Differentiable Architecture Search (DARTS) to realize more efficient network search and training. The key idea is to circumvent the issue of weight sharing by independently training the architecture parameters $\alpha$ to achieve a more precise target architecture. Moreover, the introduction of module input weights $\beta$ allows cells the flexibility to select inputs, to alleviate the overfitting phenomenon and improve the model performance. Results on four public databases demonstrate that the Relax DARTS achieves state-of-the-art recognition performance. Notably, Relax DARTS exhibits adaptability to other multi-feature temporal classification tasks.

* Accepted By CCBR 2024

Via

Access Paper or Ask Questions

Look One and More: Distilling Hybrid Order Relational Knowledge for Cross-Resolution Image Recognition

Sep 09, 2024

Shiming Ge, Kangkai Zhang, Haolin Liu, Yingying Hua, Shengwei Zhao, Xin Jin, Hao Wen

Figure 1 for Look One and More: Distilling Hybrid Order Relational Knowledge for Cross-Resolution Image Recognition

Figure 2 for Look One and More: Distilling Hybrid Order Relational Knowledge for Cross-Resolution Image Recognition

Figure 3 for Look One and More: Distilling Hybrid Order Relational Knowledge for Cross-Resolution Image Recognition

Figure 4 for Look One and More: Distilling Hybrid Order Relational Knowledge for Cross-Resolution Image Recognition

Abstract:In spite of great success in many image recognition tasks achieved by recent deep models, directly applying them to recognize low-resolution images may suffer from low accuracy due to the missing of informative details during resolution degradation. However, these images are still recognizable for subjects who are familiar with the corresponding high-resolution ones. Inspired by that, we propose a teacher-student learning approach to facilitate low-resolution image recognition via hybrid order relational knowledge distillation. The approach refers to three streams: the teacher stream is pretrained to recognize high-resolution images in high accuracy, the student stream is learned to identify low-resolution images by mimicking the teacher's behaviors, and the extra assistant stream is introduced as bridge to help knowledge transfer across the teacher to the student. To extract sufficient knowledge for reducing the loss in accuracy, the learning of student is supervised with multiple losses, which preserves the similarities in various order relational structures. In this way, the capability of recovering missing details of familiar low-resolution images can be effectively enhanced, leading to a better knowledge transfer. Extensive experiments on metric learning, low-resolution image classification and low-resolution face recognition tasks show the effectiveness of our approach, while taking reduced models.

* Accepted by AAAI 2020

Via

Access Paper or Ask Questions

A Survey on Mixup Augmentations and Beyond

Sep 08, 2024

Xin Jin, Hongyu Zhu, Siyuan Li, Zedong Wang, Zicheng Liu, Chang Yu, Huafeng Qin, Stan Z. Li

Figure 1 for A Survey on Mixup Augmentations and Beyond

Figure 2 for A Survey on Mixup Augmentations and Beyond

Figure 3 for A Survey on Mixup Augmentations and Beyond

Figure 4 for A Survey on Mixup Augmentations and Beyond

Abstract:As Deep Neural Networks have achieved thrilling breakthroughs in the past decade, data augmentations have garnered increasing attention as regularization techniques when massive labeled data are unavailable. Among existing augmentations, Mixup and relevant data-mixing methods that convexly combine selected samples and the corresponding labels are widely adopted because they yield high performances by generating data-dependent virtual data while easily migrating to various domains. This survey presents a comprehensive review of foundational mixup methods and their applications. We first elaborate on the training pipeline with mixup augmentations as a unified framework containing modules. A reformulated framework could contain various mixup methods and give intuitive operational procedures. Then, we systematically investigate the applications of mixup augmentations on vision downstream tasks, various data modalities, and some analysis \& theorems of mixup. Meanwhile, we conclude the current status and limitations of mixup research and point out further work for effective and efficient mixup augmentations. This survey can provide researchers with the current state of the art in mixup methods and provide some insights and guidance roles in the mixup arena. An online project with this survey is available at \url{https://github.com/Westlake-AI/Awesome-Mixup}.

* Preprint V1 with 27 pages main text. Online project at https://github.com/Westlake-AI/Awesome-Mixup

Via

Access Paper or Ask Questions

Anchor-Controlled Generative Adversarial Network for High-Fidelity Electromagnetic and Structurally Diverse Metasurface Design

Aug 29, 2024

Yunhui Zeng, Hongkun Cao, Xin Jin

Figure 1 for Anchor-Controlled Generative Adversarial Network for High-Fidelity Electromagnetic and Structurally Diverse Metasurface Design

Figure 2 for Anchor-Controlled Generative Adversarial Network for High-Fidelity Electromagnetic and Structurally Diverse Metasurface Design

Figure 3 for Anchor-Controlled Generative Adversarial Network for High-Fidelity Electromagnetic and Structurally Diverse Metasurface Design

Figure 4 for Anchor-Controlled Generative Adversarial Network for High-Fidelity Electromagnetic and Structurally Diverse Metasurface Design

Abstract:In optoelectronics, designing free-form metasurfaces presents significant challenges, particularly in achieving high electromagnetic response fidelity due to the complex relationship between physical structures and electromagnetic behaviors. A key difficulty arises from the one-to-many mapping dilemma, where multiple distinct physical structures can yield similar electromagnetic responses, complicating the design process. This paper introduces a novel generative framework, the Anchor-controlled Generative Adversarial Network (AcGAN), which prioritizes electromagnetic fidelity while effectively navigating the one-to-many challenge to create structurally diverse metasurfaces. Unlike existing methods that mainly replicate physical appearances, AcGAN excels in generating a variety of structures that, despite their differences in physical attributes, exhibit similar electromagnetic responses, thereby accommodating fabrication constraints and tolerances. We introduce the Spectral Overlap Coefficient (SOC) as a precise metric to measure the spectral fidelity between generated designs and their targets. Additionally, a cluster-guided controller refines input processing, ensuring multi-level spectral integration and enhancing electromagnetic fidelity. The integration of AnchorNet into our loss function facilitates a nuanced assessment of electromagnetic qualities, supported by a dynamic loss weighting strategy that optimizes spectral alignment. Collectively, these innovations represent a transformative stride in metasurface inverse design, advancing electromagnetic response-oriented engineering and overcoming the complexities of the one-to-many mapping dilemma.Empirical evidence underscores AcGAN's effectiveness in streamlining the design process, achieving superior electromagnetic precision, and fostering a broad spectrum of design possibilities.

Via

Access Paper or Ask Questions