Jun Yu

Lehigh University

Learning the Generalizable Manipulation Skills on Soft-body Tasks via Guided Self-attention Behavior Cloning Policy

Oct 08, 2024

Xuetao Li, Fang Gao, Jun Yu, Shaodong Li, Feng Shuang

Abstract: Embodied AI represents a paradigm in AI research where artificial agents are situated within and interact with physical or virtual environments. Despite recent progress in Embodied AI, it is still very challenging to learn generalizable manipulation skills that can handle large deformations and topological changes in soft-body objects such as clay, water, and soil. In this work, we propose an effective policy, the GP2E behavior cloning policy, which guides the agent to learn generalizable manipulation skills from soft-body tasks, including pouring, filling, hanging, excavating, pinching, and writing. Concretely, we build our policy from three insights: (1) extracting intricate semantic features from point cloud data and seamlessly integrating them into the robot's end-effector frame; (2) capturing long-distance interactions in long-horizon tasks through our guided self-attention module; (3) mitigating overfitting and facilitating convergence to higher accuracy via our two-stage fine-tuning strategy. Through extensive experiments, we demonstrate the effectiveness of our approach by achieving the 1st prize in the soft-body track of the ManiSkill2 Challenge at the CVPR 2023 4th Embodied AI workshop. Our findings highlight the potential of our method to improve the generalization abilities of Embodied AI models and pave the way for their practical applications in real-world scenarios.

Guided Self-attention: Find the Generalized Necessarily Distinct Vectors for Grain Size Grading

Oct 08, 2024

Fang Gao, Xuetao Li, Jiabao Wang, Shengheng Ma, Jun Yu

Abstract: With the development of steel materials, metallographic analysis has become increasingly important. Unfortunately, grain size analysis is a manual process that requires experts to evaluate metallographic photographs, which is unreliable and time-consuming. To resolve this problem, we propose a novel classification method based on deep learning, namely GSNets, a family of hybrid models that effectively introduce guided self-attention for classifying grain size. Concretely, we build our models from three insights: (1) our novel guided self-attention module helps the model find the generalized necessarily distinct vectors capable of retaining intricate relational connections and rich local feature information; (2) by improving the pixel-wise linear independence of the feature map, the model captures a highly condensed semantic representation; (3) our novel triple-stream merging module significantly improves the generalization capability and efficiency of the model. Experiments show that our GSNet yields a classification accuracy of 90.1%, surpassing the state-of-the-art Swin Transformer V2 by 1.9% on the steel grain size dataset, which comprises 3,599 images with 14 grain size levels. Furthermore, we intuitively believe our approach is applicable to broader applications such as object detection and semantic segmentation.

A General Framework for Producing Interpretable Semantic Text Embeddings

Oct 04, 2024

Yiqun Sun, Qiang Huang, Yixuan Tang, Anthony K. H. Tung, Jun Yu

Abstract: Semantic text embedding is essential to many tasks in Natural Language Processing (NLP). While black-box models are capable of generating high-quality embeddings, their lack of interpretability limits their use in tasks that demand transparency. Recent approaches have improved interpretability by leveraging domain-expert-crafted or LLM-generated questions, but these methods rely heavily on expert input or well-designed prompts, which restricts their generalizability and ability to generate discriminative questions across a wide range of tasks. To address these challenges, we introduce CQG-MBQA (Contrastive Question Generation - Multi-task Binary Question Answering), a general framework for producing interpretable semantic text embeddings across diverse tasks. Our framework systematically generates highly discriminative, low cognitive load yes/no questions through the CQG method and answers them efficiently with the MBQA model, resulting in interpretable embeddings in a cost-effective manner. We validate the effectiveness and interpretability of CQG-MBQA through extensive experiments and ablation studies, demonstrating that it delivers embedding quality comparable to many advanced black-box models while maintaining inherent interpretability. Additionally, CQG-MBQA outperforms other interpretable text embedding methods across various downstream tasks.

* 19 pages, 5 figures, and 9 tables
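
The idea of an embedding whose dimensions are answers to yes/no questions can be illustrated with a toy sketch. Everything below is hypothetical: the three questions, the keyword sets, and the keyword-overlap answerer are stand-ins chosen for illustration and are not the CQG question generator or the MBQA model described in the abstract.

```python
import math

# Hypothetical yes/no questions, each paired with illustrative trigger keywords.
QUESTIONS = [
    ("Is the text about sports?", {"match", "team", "goal"}),
    ("Is the text about finance?", {"stock", "bank", "market"}),
    ("Does the text mention weather?", {"rain", "sunny", "storm"}),
]

def embed(text):
    """Return a binary vector: dimension i is 1 if question i is answered 'yes'."""
    tokens = set(text.lower().split())
    # Keyword overlap is an assumed stand-in for a learned question-answering model.
    return [1 if tokens & keywords else 0 for _, keywords in QUESTIONS]

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def shared_reasons(u, v):
    """List the questions both texts answer 'yes' to, which explains their similarity."""
    return [QUESTIONS[i][0] for i, (a, b) in enumerate(zip(u, v)) if a and b]

a = embed("The team scored a late goal in the rain")
b = embed("Bank stock falls as storm hits market")
print(a, b, cosine(a, b), shared_reasons(a, b))
```

Because every dimension corresponds to a readable question, the similarity between two texts can be traced back to the specific questions on which they agree.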

DDNet: Deformable Convolution and Dense FPN for Surface Defect Detection in Recycled Books

Sep 08, 2024

Jun Yu, WenJian Wang

Abstract: Recycled and recirculated books, such as ancient texts and reused textbooks, hold significant value in the second-hand goods market, with their worth largely dependent on surface preservation. However, accurately assessing surface defects is challenging due to wide variations in defect shape and size and the often imprecise detection of defects. To address these issues, we propose DDNet, an innovative detection model designed to enhance defect localization and classification. DDNet introduces a surface defect feature extraction module based on a deformable convolution operator (DC) and a densely connected FPN module (DFPN). The DC module dynamically adjusts the convolution grid to better align with object contours, capturing subtle shape variations and improving boundary delineation and prediction accuracy. Meanwhile, DFPN leverages dense skip connections to enhance feature fusion, constructing a hierarchical structure that generates multi-resolution, high-fidelity feature maps, thus effectively detecting defects of various sizes. In addition to the model, we present a comprehensive dataset specifically curated for surface defect detection in recycled and recirculated books. This dataset encompasses a diverse range of defect types, shapes, and sizes, making it well suited for evaluating the robustness and effectiveness of defect detection models. Through extensive evaluations, DDNet achieves precise localization and classification of surface defects, recording a mAP of 46.7% on our proprietary dataset, an improvement of 14.2% over the baseline model, demonstrating its superior detection capabilities.

LMT-GP: Combined Latent Mean-Teacher and Gaussian Process for Semi-supervised Low-light Image Enhancement

Aug 29, 2024

Ye Yu, Fengxin Chen, Jun Yu, Zhen Kan

Abstract: While recent low-light image enhancement (LLIE) methods have made significant advancements, they still suffer from low visual quality and weak generalization ability when applied to complex scenarios. To address these issues, we propose a semi-supervised method based on a latent mean-teacher and a Gaussian process, named LMT-GP. We first design a latent mean-teacher framework that integrates both labeled and unlabeled data, as well as their latent vectors, into model training. Meanwhile, we use a mean-teacher-assisted Gaussian process learning strategy to establish a connection between the latent and pseudo-latent vectors obtained from the labeled and unlabeled data. To guide the learning process, we utilize an assisted Gaussian process regression (GPR) loss function. Furthermore, we design a pseudo-label adaptation module (PAM) to ensure the reliability of the network learning. To demonstrate our method's generalization ability and effectiveness, we apply it to multiple LLIE datasets and high-level vision tasks. Experimental results demonstrate that our method achieves high generalization performance and image quality. The code is available at https://github.com/HFUT-CV/LMT-GP.
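
For readers unfamiliar with the mean-teacher idea this abstract builds on, the sketch below shows the generic update in which teacher weights track an exponential moving average (EMA) of student weights. It is a minimal illustration of the general technique only: the dictionary of scalar weights, the function name `ema_update`, and the decay values are assumptions, and the latent-vector and Gaussian process components of LMT-GP are not modeled here.

```python
def ema_update(teacher, student, decay=0.99):
    """Move each teacher weight toward the matching student weight by one EMA step."""
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }

# Hypothetical scalar "weights"; a real model would hold tensors instead.
teacher = {"w": 0.0, "b": 1.0}
student = {"w": 1.0, "b": 1.0}

# Three updates with a deliberately small decay so the drift is easy to see.
for _ in range(3):
    teacher = ema_update(teacher, student, decay=0.5)

print(teacher)
```

In the generic scheme the student is trained by gradient descent, while the slowly moving teacher supplies more stable targets for unlabeled data.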

MVPbev: Multi-view Perspective Image Generation from BEV with Test-time Controllability and Generalizability

Jul 28, 2024

Buyu Liu, Kai Wang, Yansong Liu, Jun Bao, Tingting Han, Jun Yu

Abstract: This work addresses multi-view perspective RGB generation from text prompts given Bird's-Eye-View (BEV) semantics. Unlike prior methods that neglect layout consistency, lack the ability to handle detailed text prompts, or are incapable of generalizing to unseen viewpoints, MVPbev simultaneously generates cross-view consistent images of different perspective views with a two-stage design, allowing object-level control and novel view generation at test time. Specifically, MVPbev first projects the given BEV semantics to the perspective view with camera parameters, empowering the model to generalize to unseen viewpoints. Then we introduce a multi-view attention module in which special initialization and de-noising processes explicitly enforce local consistency among overlapping views w.r.t. cross-view homography. Last but not least, MVPbev further allows test-time instance-level controllability by refining a pre-trained text-to-image diffusion model. Our extensive experiments on NuScenes demonstrate that our method is capable of generating high-resolution photorealistic images from text descriptions with thousands of training samples, surpassing the state-of-the-art methods under various evaluation metrics. We further demonstrate the advances of our method in terms of generalizability and controllability with the help of novel evaluation metrics and comprehensive human analysis. Our code, data, and model can be found at https://github.com/kkaiwwana/MVPbev.

* Accepted by ACM MM24
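
The abstract says BEV semantics are projected to the perspective view using camera parameters but gives no formulas. As a rough illustration of what such a projection involves, the sketch below applies a standard pinhole camera model to a single 3D point already expressed in the camera frame. The function name, the axis convention, and the intrinsic values are all assumptions for illustration; the extrinsic transform from BEV coordinates to the camera frame is omitted, and this is not the MVPbev implementation.

```python
def project_point(point_cam, fx, fy, cx, cy):
    """Project a 3D camera-frame point (x right, y down, z forward) to pixel coordinates.

    Returns None when the point lies at or behind the camera plane.
    """
    x, y, z = point_cam
    if z <= 0:
        return None  # not visible in this view
    # Pinhole model: scale by focal length over depth, then shift by the principal point.
    return (fx * x / z + cx, fy * y / z + cy)

# Hypothetical intrinsics, and a ground point 10 m ahead, 1 m to the right,
# and 1.5 m below the camera center.
pixel = project_point((1.0, 1.5, 10.0), fx=1000.0, fy=1000.0, cx=800.0, cy=450.0)
print(pixel)
```

Applying this to every labeled BEV cell, after transforming it into each camera's frame, would yield a per-view semantic layout that an image generator could be conditioned on.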

Facial Identity Anonymization via Intrinsic and Extrinsic Attention Distraction

Jun 25, 2024

Zhenzhong Kuang, Xiaochen Yang, Yingjie Shen, Chao Hu, Jun Yu

Abstract: The unprecedented capture and application of face images raise increasing concerns about privacy disclosure, motivating anonymization. Most existing methods suffer from either excessive change of identity-independent information or insufficient identity protection. In this paper, we present a new face anonymization approach that distracts the intrinsic and extrinsic identity attention. On the one hand, we anonymize the identity information in the feature space by distracting the intrinsic identity attention. On the other, we anonymize the visual clues (i.e., appearance and geometry structure) by distracting the extrinsic identity attention. Our approach allows for flexible and intuitive manipulation of face appearance and geometry structure to produce diverse results, and it can also be used to instruct users to perform personalized anonymization. We conduct extensive experiments on multiple datasets and demonstrate that our approach outperforms state-of-the-art methods.

* IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024: 12406-12415

Learning to Discover Knowledge: A Weakly-Supervised Partial Domain Adaptation Approach

Jun 20, 2024

Mengcheng Lan, Min Meng, Jun Yu, Jigang Wu

Abstract: Domain adaptation has shown appealing performance by leveraging knowledge from a source domain with rich annotations. However, for a specific target task, it is cumbersome to collect related and high-quality source domains. In real-world scenarios, large-scale datasets corrupted with noisy labels are easy to collect, stimulating a great demand for automatic recognition in a generalized setting, i.e., weakly-supervised partial domain adaptation (WS-PDA), which transfers a classifier from a large source domain with noisy labels to a small unlabeled target domain. As such, the key issues of WS-PDA are: 1) how to sufficiently discover the knowledge from the noisy labeled source domain and the unlabeled target domain, and 2) how to successfully adapt the knowledge across domains. In this paper, we propose a simple yet effective domain adaptation approach, termed self-paced transfer classifier learning (SP-TCL), to address the above issues, which can be regarded as a well-performing baseline for several generalized domain adaptation tasks. The proposed model is built upon the self-paced learning scheme, seeking a preferable classifier for the target domain. Specifically, SP-TCL learns to discover faithful knowledge via a carefully designed prudent loss function and simultaneously adapts the learned knowledge to the target domain by iteratively excluding source examples from training in a self-paced fashion. Extensive evaluations on several benchmark datasets demonstrate that SP-TCL significantly outperforms state-of-the-art approaches on several generalized domain adaptation tasks.

* Accepted to TIP 2024. Code available: https://github.com/mc-lan/SP-TCL
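
The self-paced learning scheme mentioned in this abstract is commonly formulated with hard sample weights: an example takes part in training only while its loss is below an age threshold, and the threshold changes across iterations. The sketch below illustrates that generic selection rule under stated assumptions; the loss values and thresholds are invented, and the prudent loss function and exact schedule of SP-TCL are not reproduced.

```python
def self_paced_weights(losses, threshold):
    """Return 1 for examples whose loss is below the threshold, else 0 (excluded)."""
    return [1 if loss < threshold else 0 for loss in losses]

# Hypothetical per-example losses; the 2.5 entry plays the role of a noisy label.
losses = [0.1, 0.4, 2.5, 0.9]

# Two illustrative thresholds: a stricter one and a looser one.
for threshold in (0.5, 1.0):
    print(threshold, self_paced_weights(losses, threshold))
```

With either threshold the high-loss example is left out, which is how this kind of rule can keep suspected noisy-label samples from influencing the classifier.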

Solution for CVPR 2024 UG2+ Challenge Track on All Weather Semantic Segmentation

Jun 09, 2024

Jun Yu, Yunxiang Zhang, Fengzhao Sun, Leilei Wang, Renjie Lu

Abstract: In this report, we present our solution for semantic segmentation in adverse weather in the UG2+ Challenge at CVPR 2024. To achieve robust and accurate segmentation results across various weather conditions, we initialize the InternImage-H backbone with pre-trained weights from the large-scale joint dataset and enhance it with the state-of-the-art UPerNet segmentation method. Specifically, we utilize offline and online data augmentation approaches to extend the training set, which helps us further improve the performance of the segmenter. As a result, our proposed solution demonstrates advanced performance on the test set and achieves 3rd position in this challenge.

Large Language Model Assisted Adversarial Robustness Neural Architecture Search

Jun 08, 2024

Rui Zhong, Yang Cao, Jun Yu, Masaharu Munetomo

Abstract: Motivated by the potential of large language models (LLMs) as optimizers for solving combinatorial optimization problems, this paper proposes a novel LLM-assisted optimizer (LLMO) to address adversarial robustness neural architecture search (ARNAS), a specific application of combinatorial optimization. We design the prompt using the standard CRISPE framework (i.e., Capacity and Role, Insight, Statement, Personality, and Experiment). In this study, we employ Gemini, a powerful LLM developed by Google. We iteratively refine the prompt, and the responses from Gemini are adapted as solutions to ARNAS instances. Numerical experiments are conducted on NAS-Bench-201-based ARNAS tasks with CIFAR-10 and CIFAR-100 datasets. Six well-known meta-heuristic algorithms (MHAs), including genetic algorithm (GA), particle swarm optimization (PSO), differential evolution (DE), and its variants, serve as baselines. The experimental results confirm the competitiveness of the proposed LLMO and highlight the potential of LLMs as effective combinatorial optimizers. The source code of this research can be downloaded from https://github.com/RuiZhong961230/LLMO.

* Accepted by The 6th International Conference on Data-driven Optimization of Complex Systems (DOCS)
