Dacheng Tao

and Other Contributors

Learning to Learn Better for Video Object Segmentation

Dec 05, 2022

Meng Lan, Jing Zhang, Lefei Zhang, Dacheng Tao

Abstract:The recent joint learning framework (JOINT) integrates matching-based transductive reasoning and online inductive learning to achieve accurate and robust semi-supervised video object segmentation (SVOS). However, using the mask embedding as the label to guide the generation of target features in the two branches may result in inadequate target representation and degrade performance. Moreover, how to fuse the target features of the two branches in a principled way, rather than simply adding them together, so as to avoid the adverse effect of one dominant branch has not been investigated. In this paper, we propose a novel framework that emphasizes Learning to Learn Better (LLB) target features for SVOS, termed LLB, in which we design the discriminative label generation module (DLGM) and the adaptive fusion module to address these issues. Technically, the DLGM takes the background-filtered frame instead of the target mask as input and adopts a lightweight encoder to generate the target features, which serve as the label of the online few-shot learner and the value of the decoder in the transformer, guiding the two branches to learn more discriminative target representations. The adaptive fusion module maintains a learnable gate for each branch, which reweights the element-wise feature representation and allows an adaptive amount of target information from each branch to flow into the fused target feature, thus preventing one branch from dominating and making the target feature more robust to distractors. Extensive experiments on public benchmarks show that our proposed LLB method achieves state-of-the-art performance.
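
A minimal sketch of the gating idea in the abstract above, for illustration only. It assumes each branch has one gate parameter per feature element and that the gate is squashed with a sigmoid before weighting; the abstract does not state the exact gate form, so `gated_fusion`, `sigmoid`, and the toy vectors are hypothetical names and values, not the authors' implementation.

```python
import math


def sigmoid(x):
    """Logistic function mapping a raw gate parameter to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(feat_a, feat_b, gate_a, gate_b):
    """Fuse two branch features element-wise with per-branch gates.

    gate_a and gate_b are raw (pre-sigmoid) parameters; in a real model
    they would be learned. This is an assumed form, not the paper's.
    """
    if not (len(feat_a) == len(feat_b) == len(gate_a) == len(gate_b)):
        raise ValueError("all inputs must have the same length")
    return [
        sigmoid(ga) * fa + sigmoid(gb) * fb
        for fa, fb, ga, gb in zip(feat_a, feat_b, gate_a, gate_b)
    ]


# Toy features standing in for the two branch outputs.
transductive = [1.0, 2.0, 3.0]
inductive = [10.0, 20.0, 30.0]

# Zero gates give each branch a weight of 0.5.
fused = gated_fusion(transductive, inductive, [0.0] * 3, [0.0] * 3)
print(fused)
```

Driving one gate strongly negative suppresses that branch for the corresponding element, which is how a gate of this kind could keep a single branch from dominating the fused feature.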

Toward Efficient Language Model Pretraining and Downstream Adaptation via Self-Evolution: A Case Study on SuperGLUE

Dec 04, 2022

Qihuang Zhong, Liang Ding, Yibing Zhan, Yu Qiao, Yonggang Wen, Li Shen, Juhua Liu, Baosheng Yu, Bo Du, Yixin Chen(+4 more)

Abstract:This technical report briefly describes our JDExplore d-team's Vega v2 submission on the SuperGLUE leaderboard. SuperGLUE is more challenging than the widely used general language understanding evaluation (GLUE) benchmark, containing eight difficult language understanding tasks, including question answering, natural language inference, word sense disambiguation, coreference resolution, and reasoning. [Method] Instead of arbitrarily increasing the size of a pretrained language model (PLM), our aim is to 1) fully extract knowledge from the input pretraining data given a certain parameter budget, e.g., 6B, and 2) effectively transfer this knowledge to downstream tasks. To achieve goal 1), we propose self-evolution learning for PLMs to wisely predict the informative tokens that should be masked, and supervise the masked language modeling (MLM) process with rectified smooth labels. For goal 2), we leverage the prompt transfer technique to improve the low-resource tasks by transferring the knowledge from the foundation model and related downstream tasks to the target task. [Results] According to our submission record (Oct. 2022), with our optimized pretraining and fine-tuning strategies, our 6B Vega method achieved new state-of-the-art performance on 4/8 tasks, sitting atop the SuperGLUE leaderboard on Oct. 8, 2022, with an average score of 91.3.

* Technical report

3DDesigner: Towards Photorealistic 3D Object Generation and Editing with Text-guided Diffusion Models

Dec 02, 2022

Gang Li, Heliang Zheng, Chaoyue Wang, Chang Li, Changwen Zheng, Dacheng Tao

Abstract:Text-guided diffusion models have shown superior performance in image/video generation and editing. However, few explorations have been performed in 3D scenarios. In this paper, we discuss three fundamental and interesting problems on this topic. First, we equip text-guided diffusion models to achieve $\textbf{3D-consistent generation}$. Specifically, we integrate a NeRF-like neural field to generate low-resolution coarse results for a given camera view. Such results can provide 3D priors as condition information for the following diffusion process. During denoising diffusion, we further enhance the 3D consistency by modeling cross-view correspondences with a novel two-stream (corresponding to two different views) asynchronous diffusion process. Second, we study $\textbf{3D local editing}$ and propose a two-step solution that can generate 360$^{\circ}$ manipulated results by editing an object from a single view. In Step 1, we propose to perform 2D local editing by blending the predicted noises. In Step 2, we conduct a noise-to-text inversion process that maps 2D blended noises into the view-independent text embedding space. Once the corresponding text embedding is obtained, 360$^{\circ}$ images can be generated. Last but not least, we extend our model to perform $\textbf{one-shot novel view synthesis}$ by fine-tuning on a single image, showing for the first time the potential of leveraging text guidance for novel view synthesis. Extensive experiments and various applications show the prowess of our 3DDesigner. The project page is available at https://3ddesigner-diffusion.github.io/.
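
The noise blending in Step 1 of the abstract above can be pictured with a small sketch. It assumes the blend is a per-pixel convex combination controlled by an edit mask, which is a common formulation but is not spelled out in the abstract; `blend_noise` and the 2x2 toy arrays are hypothetical.

```python
def blend_noise(noise_edit, noise_keep, mask):
    """Blend two predicted noise maps per pixel.

    mask = 1 takes the noise predicted under the edit prompt,
    mask = 0 keeps the noise predicted for the original object.
    The convex-combination form is an assumption for illustration.
    """
    return [
        [m * e + (1.0 - m) * k for e, k, m in zip(row_e, row_k, row_m)]
        for row_e, row_k, row_m in zip(noise_edit, noise_keep, mask)
    ]


# Toy 2x2 noise predictions and an edit mask with one soft border pixel.
noise_edit = [[0.9, 0.9], [0.9, 0.9]]
noise_keep = [[0.1, 0.1], [0.1, 0.1]]
mask = [[1.0, 0.0], [0.0, 0.5]]

blended = blend_noise(noise_edit, noise_keep, mask)
print(blended)
```

Only the masked region follows the edit, so the rest of the view stays anchored to the original prediction.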

* 15 pages, 12 figures, conference

Improving Simultaneous Machine Translation with Monolingual Data

Dec 02, 2022

Hexuan Deng, Liang Ding, Xuebo Liu, Meishan Zhang, Dacheng Tao, Min Zhang

Abstract:Simultaneous machine translation (SiMT) is usually done via sequence-level knowledge distillation (Seq-KD) from a full-sentence neural machine translation (NMT) model. However, there is still a significant performance gap between NMT and SiMT. In this work, we propose to leverage monolingual data to improve SiMT, which trains a SiMT student on the combination of bilingual data and external monolingual data distilled by Seq-KD. Preliminary experiments on En-Zh and En-Ja news domain corpora demonstrate that monolingual data can significantly improve translation quality (e.g., +3.15 BLEU on En-Zh). Inspired by the behavior of human simultaneous interpreters, we propose a novel monolingual sampling strategy for SiMT, considering both chunk length and monotonicity. Experimental results show that our sampling strategy consistently outperforms the random sampling strategy (and other conventional typical NMT monolingual sampling strategies) by avoiding the key problem of SiMT -- hallucination, and has better scalability. We achieve +0.72 BLEU improvements on average against random sampling on En-Zh and En-Ja. Data and codes can be found at https://github.com/hexuandeng/Mono4SiMT.

* Accepted by AAAI 2023. Extended version includes supplementary material. 10 pages, 4 figures, 8 tables

AL-iGAN: An Active Learning Framework for Tunnel Geological Reconstruction Based on TBM Operational Data

Dec 02, 2022

Hao Wang, Lixue Liu, Xueguan Song, Chao Zhang, Dacheng Tao

Abstract:In tunnel boring machine (TBM) underground projects, an accurate description of the rock-soil types distributed in the tunnel can decrease the construction risk (e.g., surface settlement and landslide) and improve the efficiency of construction. In this paper, we propose an active learning framework, called AL-iGAN, for tunnel geological reconstruction based on TBM operational data. This framework contains two main parts: one is the use of active learning techniques to recommend new drilling locations for labeling the TBM operational data and thereby forming new training samples; the other is an incremental generative adversarial network for geological reconstruction (iGAN-GR), whose weights can be incrementally updated with the new samples to improve the reconstruction performance. Numerical experiments validate the effectiveness of the proposed framework.

1st Workshop on Maritime Computer Vision 2023: Challenge Results

Nov 28, 2022

Benjamin Kiefer, Matej Kristan, Janez Perš, Lojze Žust, Fabio Poiesi, Fabio Augusto de Alcantara Andrade, Alexandre Bernardino, Matthew Dawkins, Jenni Raitoharju, Yitong Quan(+63 more)

Abstract:The 1st Workshop on Maritime Computer Vision (MaCVi) 2023 focused on maritime computer vision for Unmanned Aerial Vehicles (UAV) and Unmanned Surface Vehicles (USV), and organized several subchallenges in this domain: (i) UAV-based Maritime Object Detection, (ii) UAV-based Maritime Object Tracking, (iii) USV-based Maritime Obstacle Segmentation and (iv) USV-based Maritime Obstacle Detection. The subchallenges were based on the SeaDronesSee and MODS benchmarks. This report summarizes the main findings of the individual subchallenges and introduces a new benchmark, called SeaDronesSee Object Detection v2, which extends the previous benchmark by including more classes and footage. We provide statistical and qualitative analyses, and assess trends in the best-performing methodologies of over 130 submissions. The methods are summarized in the appendix. The datasets, evaluation code and the leaderboard are publicly available at https://seadronessee.cs.uni-tuebingen.de/macvi.

* MaCVi 2023 was part of WACV 2023. This report (38 pages) discusses the competition as part of MaCVi

Unified Discrete Diffusion for Simultaneous Vision-Language Generation

Nov 27, 2022

Minghui Hu, Chuanxia Zheng, Heliang Zheng, Tat-Jen Cham, Chaoyue Wang, Zuopeng Yang, Dacheng Tao, Ponnuthurai N. Suganthan

Abstract:Recently developed discrete diffusion models perform extraordinarily well in the text-to-image task, showing significant promise for handling multi-modality signals. In this work, we harness these traits and present a unified multimodal generation model that can conduct both the "modality translation" and "multi-modality generation" tasks using a single model, performing text-based, image-based, and even vision-language simultaneous generation. Specifically, we unify the discrete diffusion process for multimodal signals by proposing a unified transition matrix. Moreover, we design a mutual attention module with a fused embedding layer and a unified objective function to emphasise the inter-modal linkages, which are vital for multi-modality generation. Extensive experiments indicate that our proposed method can perform comparably to the state-of-the-art solutions in various generation tasks.
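
To make the idea of a single transition matrix over several modalities concrete, here is a rough sketch. It assumes a mask-and-replace style of discrete diffusion in which image and text tokens share one absorbing [MASK] state and never transition across modalities; the abstract does not give the matrix, so this layout, the row-as-source convention, and the names `unified_transition_matrix`, `alpha`, and `gamma` are all assumptions rather than the paper's definition.

```python
def unified_transition_matrix(n_image, n_text, alpha, gamma):
    """Build a row-stochastic matrix over [image tokens | text tokens | MASK].

    q[i][j] is the one-step probability of moving from state i to state j.
    Within its own modality a token keeps its value with extra mass alpha,
    is resampled uniformly with the remaining mass, and is masked with
    probability gamma. Cross-modal entries stay zero (assumed design).
    """
    if n_image <= 0 or n_text <= 0:
        raise ValueError("both vocabularies must be non-empty")
    if alpha < 0.0 or gamma < 0.0 or alpha + gamma > 1.0:
        raise ValueError("need alpha >= 0, gamma >= 0, alpha + gamma <= 1")

    size = n_image + n_text + 1
    mask = size - 1
    q = [[0.0] * size for _ in range(size)]

    for start, stop in [(0, n_image), (n_image, n_image + n_text)]:
        beta = (1.0 - alpha - gamma) / (stop - start)  # uniform resampling mass
        for i in range(start, stop):
            for j in range(start, stop):
                q[i][j] = beta
            q[i][i] += alpha
            q[i][mask] = gamma

    q[mask][mask] = 1.0  # [MASK] is absorbing
    return q


q = unified_transition_matrix(n_image=3, n_text=2, alpha=0.8, gamma=0.1)
for row in q:
    print([round(v, 3) for v in row])
```

The block structure is what keeps an image token from ever turning into a text token while still letting both modalities share one corruption process.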

Responsible Active Learning via Human-in-the-loop Peer Study

Nov 24, 2022

Yu-Tong Cao, Jingya Wang, Baosheng Yu, Dacheng Tao

Abstract:Active learning has been proposed to reduce data annotation efforts by only manually labelling representative data samples for training. Meanwhile, recent active learning applications have benefited greatly from cloud computing services, which offer not only sufficient computational resources but also crowdsourcing frameworks that include many humans in the active learning loop. However, previous active learning methods that always require passing large-scale unlabelled data to the cloud may raise significant data privacy issues. To mitigate this risk, we propose a responsible active learning method, namely Peer Study Learning (PSL), to simultaneously preserve data privacy and improve model stability. Specifically, we first introduce a human-in-the-loop teacher-student architecture that isolates unlabelled data from the task learner (teacher) on the cloud side by maintaining an active learner (student) on the client side. During training, the task learner instructs the light-weight active learner, which then provides feedback on the active sampling criterion. To further enhance the active learner via large-scale unlabelled data, we introduce multiple peer students into the active learner, which is trained by a novel learning paradigm that includes the In-Class Peer Study on labelled data and the Out-of-Class Peer Study on unlabelled data. Lastly, we devise a discrepancy-based active sampling criterion, Peer Study Feedback, that exploits the variability of peer students to select the most informative data to improve model stability. Extensive experiments demonstrate the superiority of the proposed PSL over a wide range of active learning methods in both standard and sensitive protection settings.
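
As a rough illustration of a discrepancy-based criterion like the one described above, the sketch below scores each unlabelled sample by how much several peer models disagree and picks the top-scoring ones. Measuring disagreement as the mean per-class variance of predicted probabilities is an assumption made here for simplicity; the abstract does not define Peer Study Feedback, and `disagreement`, `select_most_informative`, and the toy pool are hypothetical.

```python
def disagreement(peer_probs):
    """Mean per-class variance of predicted probabilities across peers."""
    n_peers = len(peer_probs)
    n_classes = len(peer_probs[0])
    total = 0.0
    for c in range(n_classes):
        values = [p[c] for p in peer_probs]
        mean = sum(values) / n_peers
        total += sum((v - mean) ** 2 for v in values) / n_peers
    return total / n_classes


def select_most_informative(pool, budget):
    """Return the ids of the `budget` samples with the highest disagreement.

    pool maps a sample id to a list of per-peer probability vectors.
    """
    ranked = sorted(pool, key=lambda sid: disagreement(pool[sid]), reverse=True)
    return ranked[:budget]


# Toy pool: two peers, two classes.
pool = {
    "a": [[0.9, 0.1], [0.9, 0.1]],  # peers agree
    "b": [[0.9, 0.1], [0.1, 0.9]],  # peers disagree strongly
    "c": [[0.6, 0.4], [0.4, 0.6]],  # mild disagreement
}

picked = select_most_informative(pool, budget=2)
print(picked)
```

Because the scoring needs only the peers' outputs, a selection step of this kind could run on the client side without sending unlabelled data elsewhere.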

* 15 pages, 8 figures

Knowledge-Aware Federated Active Learning with Non-IID Data

Nov 24, 2022

Yu-Tong Cao, Jingya Wang, Ye Shi, Baosheng Yu, Dacheng Tao

Abstract:Federated learning enables multiple decentralized clients to learn collaboratively without sharing the local training data. However, the expensive annotation cost of acquiring data labels on local clients remains an obstacle to utilizing local data. In this paper, we propose a federated active learning paradigm to efficiently learn a global model with a limited annotation budget while protecting data privacy in a decentralized way. The main challenge faced by federated active learning is the mismatch between the active sampling goal of the global model on the server and that of the asynchronous local clients. This becomes even more significant when data is distributed non-IID across local clients. To address this challenge, we propose Knowledge-Aware Federated Active Learning (KAFAL), which consists of Knowledge-Specialized Active Sampling (KSAS) and Knowledge-Compensatory Federated Update (KCFU). KSAS is a novel active sampling method tailored for the federated active learning problem. It deals with the mismatch challenge by sampling actively based on the discrepancies between local and global models. KSAS intensifies specialized knowledge in local clients, ensuring that the sampled data is informative for both the local clients and the global model. KCFU, meanwhile, deals with the client heterogeneity caused by limited data and non-IID data distributions. It compensates for each client's weakness in under-represented classes with the assistance of the global model. Extensive experiments and analyses are conducted to show the superiority of KSAS over state-of-the-art active learning methods and the efficiency of KCFU under the federated active learning framework.
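
One simple way to picture sampling by the discrepancy between local and global models is to rank unlabelled samples by the divergence between the two models' predicted class distributions. The sketch below uses KL(local || global) for that purpose; this choice is an assumption for illustration, since the abstract does not give the KSAS scoring rule, and `kl_divergence`, `rank_by_discrepancy`, and the toy predictions are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions, with eps for stability."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def rank_by_discrepancy(local_preds, global_preds):
    """Return sample ids sorted by KL(local || global), largest first."""
    scores = {
        sid: kl_divergence(local_preds[sid], global_preds[sid])
        for sid in local_preds
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy predictions of a client model and the global model on three samples.
local_preds = {"x1": [0.5, 0.5], "x2": [0.99, 0.01], "x3": [0.7, 0.3]}
global_preds = {"x1": [0.5, 0.5], "x2": [0.2, 0.8], "x3": [0.6, 0.4]}

order = rank_by_discrepancy(local_preds, global_preds)
print(order)
```

A client would then request labels for the first few ids in the ranking, up to its annotation budget.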

* 14 pages, 12 figures

DeepSolo: Let Transformer Decoder with Explicit Points Solo for Text Spotting

Nov 23, 2022

Maoyuan Ye, Jing Zhang, Shanshan Zhao, Juhua Liu, Tongliang Liu, Bo Du, Dacheng Tao

Abstract:End-to-end text spotting aims to integrate scene text detection and recognition into a unified framework. Dealing with the relationship between the two sub-tasks plays a pivotal role in designing effective spotters. Although transformer-based methods eliminate the heuristic post-processing, they still suffer from the synergy issue between the sub-tasks and low training efficiency. In this paper, we present DeepSolo, a simple detection transformer baseline that lets a single Decoder with Explicit Points Solo for text detection and recognition simultaneously. Technically, for each text instance, we represent the character sequence as ordered points and model them with learnable explicit point queries. After passing through a single decoder, the point queries have encoded the requisite text semantics and locations and thus can be further decoded to the center line, boundary, script, and confidence of text via very simple prediction heads in parallel, solving the sub-tasks in text spotting in a unified framework. We also introduce a text-matching criterion to deliver more accurate supervisory signals, thus enabling more efficient training. Quantitative experiments on public benchmarks demonstrate that DeepSolo outperforms previous state-of-the-art methods and achieves better training efficiency. In addition, DeepSolo is compatible with line annotations, which require much less annotation cost than polygons. The code will be released.

* The code will be available at https://github.com/ViTAE-Transformer/DeepSolo
