Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Han Hu

University of Toronto

Federated Learning Robust to Byzantine Attacks: Achieving Zero Optimality Gap

Aug 21, 2023

Shiyuan Zuo, Rongfei Fan, Han Hu, Ning Zhang, Shimin Gong

Figure 1 for Federated Learning Robust to Byzantine Attacks: Achieving Zero Optimality Gap

Figure 2 for Federated Learning Robust to Byzantine Attacks: Achieving Zero Optimality Gap

Abstract:In this paper, we propose a robust aggregation method for federated learning (FL) that can effectively tackle malicious Byzantine attacks. At each user, model parameter is firstly updated by multiple steps, which is adjustable over iterations, and then pushed to the aggregation center directly. This decreases the number of interactions between the aggregation center and users, allows each user to set training parameter in a flexible way, and reduces computation burden compared with existing works that need to combine multiple historical model parameters. At the aggregation center, geometric median is leveraged to combine the received model parameters from each user. Rigorous proof shows that zero optimality gap is achieved by our proposed method with linear convergence, as long as the fraction of Byzantine attackers is below half. Numerical results verify the effectiveness of our proposed method.

Via

Access Paper or Ask Questions

SimDA: Simple Diffusion Adapter for Efficient Video Generation

Aug 18, 2023

Zhen Xing, Qi Dai, Han Hu, Zuxuan Wu, Yu-Gang Jiang

Figure 1 for SimDA: Simple Diffusion Adapter for Efficient Video Generation

Figure 2 for SimDA: Simple Diffusion Adapter for Efficient Video Generation

Figure 3 for SimDA: Simple Diffusion Adapter for Efficient Video Generation

Figure 4 for SimDA: Simple Diffusion Adapter for Efficient Video Generation

Abstract:The recent wave of AI-generated content has witnessed the great development and success of Text-to-Image (T2I) technologies. By contrast, Text-to-Video (T2V) still falls short of expectations though attracting increasing interests. Existing works either train from scratch or adapt large T2I model to videos, both of which are computation and resource expensive. In this work, we propose a Simple Diffusion Adapter (SimDA) that fine-tunes only 24M out of 1.1B parameters of a strong T2I model, adapting it to video generation in a parameter-efficient way. In particular, we turn the T2I model for T2V by designing light-weight spatial and temporal adapters for transfer learning. Besides, we change the original spatial attention to the proposed Latent-Shift Attention (LSA) for temporal consistency. With similar model architecture, we further train a video super-resolution model to generate high-definition (1024x1024) videos. In addition to T2V generation in the wild, SimDA could also be utilized in one-shot video editing with only 2 minutes tuning. Doing so, our method could minimize the training effort with extremely few tunable parameters for model adaptation.

Via

Access Paper or Ask Questions

Joint Power Control and Data Size Selection for Over-the-Air Computation Aided Federated Learning

Aug 17, 2023

Xuming An, Rongfei Fan, Shiyuan Zuo, Han Hu, Hai Jiang, Ning Zhang

Figure 1 for Joint Power Control and Data Size Selection for Over-the-Air Computation Aided Federated Learning

Figure 2 for Joint Power Control and Data Size Selection for Over-the-Air Computation Aided Federated Learning

Figure 3 for Joint Power Control and Data Size Selection for Over-the-Air Computation Aided Federated Learning

Figure 4 for Joint Power Control and Data Size Selection for Over-the-Air Computation Aided Federated Learning

Abstract:Federated learning (FL) has emerged as an appealing machine learning approach to deal with massive raw data generated at multiple mobile devices, {which needs to aggregate the training model parameter of every mobile device at one base station (BS) iteratively}. For parameter aggregating in FL, over-the-air computation is a spectrum-efficient solution, which allows all mobile devices to transmit their parameter-mapped signals concurrently to a BS. Due to heterogeneous channel fading and noise, there exists difference between the BS's received signal and its desired signal, measured as the mean-squared error (MSE). To minimize the MSE, we propose to jointly optimize the signal amplification factors at the BS and the mobile devices as well as the data size (the number of data samples involved in local training) at every mobile device. The formulated problem is challenging to solve due to its non-convexity. To find the optimal solution, with some simplification on cost function and variable replacement, which still preserves equivalence, we transform the changed problem to be a bi-level problem equivalently. For the lower-level problem, optimal solution is found by enumerating every candidate solution from the Karush-Kuhn-Tucker (KKT) condition. For the upper-level problem, the optimal solution is found by exploring its piecewise convexity. Numerical results show that our proposed method can greatly reduce the MSE and can help to improve the training performance of FL compared with benchmark methods.

Via

Access Paper or Ask Questions

Rethinking the Localization in Weakly Supervised Object Localization

Aug 11, 2023

Rui Xu, Yong Luo, Han Hu, Bo Du, Jialie Shen, Yonggang Wen

Figure 1 for Rethinking the Localization in Weakly Supervised Object Localization

Figure 2 for Rethinking the Localization in Weakly Supervised Object Localization

Figure 3 for Rethinking the Localization in Weakly Supervised Object Localization

Figure 4 for Rethinking the Localization in Weakly Supervised Object Localization

Abstract:Weakly supervised object localization (WSOL) is one of the most popular and challenging tasks in computer vision. This task is to localize the objects in the images given only the image-level supervision. Recently, dividing WSOL into two parts (class-agnostic object localization and object classification) has become the state-of-the-art pipeline for this task. However, existing solutions under this pipeline usually suffer from the following drawbacks: 1) they are not flexible since they can only localize one object for each image due to the adopted single-class regression (SCR) for localization; 2) the generated pseudo bounding boxes may be noisy, but the negative impact of such noise is not well addressed. To remedy these drawbacks, we first propose to replace SCR with a binary-class detector (BCD) for localizing multiple objects, where the detector is trained by discriminating the foreground and background. Then we design a weighted entropy (WE) loss using the unlabeled data to reduce the negative impact of noisy bounding boxes. Extensive experiments on the popular CUB-200-2011 and ImageNet-1K datasets demonstrate the effectiveness of our method.

* Accepted by ACM International Conference on Multimedia 2023

Via

Access Paper or Ask Questions

V-DETR: DETR with Vertex Relative Position Encoding for 3D Object Detection

Aug 08, 2023

Yichao Shen, Zigang Geng, Yuhui Yuan, Yutong Lin, Ze Liu, Chunyu Wang, Han Hu, Nanning Zheng, Baining Guo

Figure 1 for V-DETR: DETR with Vertex Relative Position Encoding for 3D Object Detection

Figure 2 for V-DETR: DETR with Vertex Relative Position Encoding for 3D Object Detection

Figure 3 for V-DETR: DETR with Vertex Relative Position Encoding for 3D Object Detection

Figure 4 for V-DETR: DETR with Vertex Relative Position Encoding for 3D Object Detection

Abstract:We introduce a highly performant 3D object detector for point clouds using the DETR framework. The prior attempts all end up with suboptimal results because they fail to learn accurate inductive biases from the limited scale of training data. In particular, the queries often attend to points that are far away from the target objects, violating the locality principle in object detection. To address the limitation, we introduce a novel 3D Vertex Relative Position Encoding (3DV-RPE) method which computes position encoding for each point based on its relative position to the 3D boxes predicted by the queries in each decoder layer, thus providing clear information to guide the model to focus on points near the objects, in accordance with the principle of locality. In addition, we systematically improve the pipeline from various aspects such as data normalization based on our understanding of the task. We show exceptional results on the challenging ScanNetV2 benchmark, achieving significant improvements over the previous 3DETR in $\rm{AP}_{25}$/$\rm{AP}_{50}$ from 65.0\%/47.0\% to 77.8\%/66.0\%, respectively. In addition, our method sets a new record on ScanNetV2 and SUN RGB-D datasets.Code will be released at http://github.com/yichaoshen-MS/V-DETR.

Via

Access Paper or Ask Questions

Cross-Silo Prototypical Calibration for Federated Learning with Non-IID Data

Aug 07, 2023

Zhuang Qi, Lei Meng, Zitan Chen, Han Hu, Hui Lin, Xiangxu Meng

Abstract:Federated Learning aims to learn a global model on the server side that generalizes to all clients in a privacy-preserving manner, by leveraging the local models from different clients. Existing solutions focus on either regularizing the objective functions among clients or improving the aggregation mechanism for the improved model generalization capability. However, their performance is typically limited by the dataset biases, such as the heterogeneous data distributions and the missing classes. To address this issue, this paper presents a cross-silo prototypical calibration method (FedCSPC), which takes additional prototype information from the clients to learn a unified feature space on the server side. Specifically, FedCSPC first employs the Data Prototypical Modeling (DPM) module to learn data patterns via clustering to aid calibration. Subsequently, the cross-silo prototypical calibration (CSPC) module develops an augmented contrastive learning method to improve the robustness of the calibration, which can effectively project cross-source features into a consistent space while maintaining clear decision boundaries. Moreover, the CSPC module's ease of implementation and plug-and-play characteristics make it even more remarkable. Experiments were conducted on four datasets in terms of performance comparison, ablation study, in-depth analysis and case study, and the results verified that FedCSPC is capable of learning the consistent features across different data sources of the same class under the guidance of calibrated model, which leads to better performance than the state-of-the-art methods. The source codes have been released at https://github.com/qizhuang-qz/FedCSPC.

Via

Access Paper or Ask Questions

DETR Doesn't Need Multi-Scale or Locality Design

Aug 03, 2023

Yutong Lin, Yuhui Yuan, Zheng Zhang, Chen Li, Nanning Zheng, Han Hu

Abstract:This paper presents an improved DETR detector that maintains a "plain" nature: using a single-scale feature map and global cross-attention calculations without specific locality constraints, in contrast to previous leading DETR-based detectors that reintroduce architectural inductive biases of multi-scale and locality into the decoder. We show that two simple technologies are surprisingly effective within a plain design to compensate for the lack of multi-scale feature maps and locality constraints. The first is a box-to-pixel relative position bias (BoxRPB) term added to the cross-attention formulation, which well guides each query to attend to the corresponding object region while also providing encoding flexibility. The second is masked image modeling (MIM)-based backbone pre-training which helps learn representation with fine-grained localization ability and proves crucial for remedying dependencies on the multi-scale feature maps. By incorporating these technologies and recent advancements in training and problem formation, the improved "plain" DETR showed exceptional improvements over the original DETR detector. By leveraging the Object365 dataset for pre-training, it achieved 63.9 mAP accuracy using a Swin-L backbone, which is highly competitive with state-of-the-art detectors which all heavily rely on multi-scale feature maps and region-based feature extraction. Code is available at https://github.com/impiga/Plain-DETR .

* To be published in ICCV2023

Via

Access Paper or Ask Questions

ImageBrush: Learning Visual In-Context Instructions for Exemplar-Based Image Manipulation

Aug 02, 2023

Yasheng Sun, Yifan Yang, Houwen Peng, Yifei Shen, Yuqing Yang, Han Hu, Lili Qiu, Hideki Koike

Abstract:While language-guided image manipulation has made remarkable progress, the challenge of how to instruct the manipulation process faithfully reflecting human intentions persists. An accurate and comprehensive description of a manipulation task using natural language is laborious and sometimes even impossible, primarily due to the inherent uncertainty and ambiguity present in linguistic expressions. Is it feasible to accomplish image manipulation without resorting to external cross-modal language information? If this possibility exists, the inherent modality gap would be effortlessly eliminated. In this paper, we propose a novel manipulation methodology, dubbed ImageBrush, that learns visual instructions for more accurate image editing. Our key idea is to employ a pair of transformation images as visual instructions, which not only precisely captures human intention but also facilitates accessibility in real-world scenarios. Capturing visual instructions is particularly challenging because it involves extracting the underlying intentions solely from visual demonstrations and then applying this operation to a new image. To address this challenge, we formulate visual instruction learning as a diffusion-based inpainting problem, where the contextual information is fully exploited through an iterative process of generation. A visual prompting encoder is carefully devised to enhance the model's capacity in uncovering human intent behind the visual instructions. Extensive experiments show that our method generates engaging manipulation results conforming to the transformations entailed in demonstrations. Moreover, our model exhibits robust generalization capabilities on various downstream tasks such as pose transfer, image translation and video inpainting.

Via

Access Paper or Ask Questions

LGViT: Dynamic Early Exiting for Accelerating Vision Transformer

Aug 01, 2023

Guanyu Xu, Jiawei Hao, Li Shen, Han Hu, Yong Luo, Hui Lin, Jialie Shen

Figure 1 for LGViT: Dynamic Early Exiting for Accelerating Vision Transformer

Figure 2 for LGViT: Dynamic Early Exiting for Accelerating Vision Transformer

Figure 3 for LGViT: Dynamic Early Exiting for Accelerating Vision Transformer

Figure 4 for LGViT: Dynamic Early Exiting for Accelerating Vision Transformer

Abstract:Recently, the efficient deployment and acceleration of powerful vision transformers (ViTs) on resource-limited edge devices for providing multimedia services have become attractive tasks. Although early exiting is a feasible solution for accelerating inference, most works focus on convolutional neural networks (CNNs) and transformer models in natural language processing (NLP).Moreover, the direct application of early exiting methods to ViTs may result in substantial performance degradation. To tackle this challenge, we systematically investigate the efficacy of early exiting in ViTs and point out that the insufficient feature representations in shallow internal classifiers and the limited ability to capture target semantic information in deep internal classifiers restrict the performance of these methods. We then propose an early exiting framework for general ViTs termed LGViT, which incorporates heterogeneous exiting heads, namely, local perception head and global aggregation head, to achieve an efficiency-accuracy trade-off. In particular, we develop a novel two-stage training scheme, including end-to-end training and self-distillation with the backbone frozen to generate early exiting ViTs, which facilitates the fusion of global and local information extracted by the two types of heads. We conduct extensive experiments using three popular ViT backbones on three vision datasets. Results demonstrate that our LGViT can achieve competitive performance with approximately 1.8 $\times$ speed-up.

* ACM MM 2023

Via

Access Paper or Ask Questions

FHA-Kitchens: A Novel Dataset for Fine-Grained Hand Action Recognition in Kitchen Scenes

Jun 19, 2023

Ting Zhe, Yongqian Li, Jing Zhang, Yong Luo, Han Hu, Bo Du, Yonggang Wen, Dacheng Tao

Figure 1 for FHA-Kitchens: A Novel Dataset for Fine-Grained Hand Action Recognition in Kitchen Scenes

Figure 2 for FHA-Kitchens: A Novel Dataset for Fine-Grained Hand Action Recognition in Kitchen Scenes

Figure 3 for FHA-Kitchens: A Novel Dataset for Fine-Grained Hand Action Recognition in Kitchen Scenes

Figure 4 for FHA-Kitchens: A Novel Dataset for Fine-Grained Hand Action Recognition in Kitchen Scenes

Abstract:A typical task in the field of video understanding is hand action recognition, which has a wide range of applications. Existing works either mainly focus on full-body actions, or the defined action categories are relatively coarse-grained. In this paper, we propose FHA-Kitchens, a novel dataset of fine-grained hand actions in kitchen scenes. In particular, we focus on human hand interaction regions and perform deep excavation to further refine hand action information and interaction regions. Our FHA-Kitchens dataset consists of 2,377 video clips and 30,047 images collected from 8 different types of dishes, and all hand interaction regions in each image are labeled with high-quality fine-grained action classes and bounding boxes. We represent the action information in each hand interaction region as a triplet, resulting in a total of 878 action triplets. Based on the constructed dataset, we benchmark representative action recognition and detection models on the following three tracks: (1) supervised learning for hand interaction region and object detection, (2) supervised learning for fine-grained hand action recognition, and (3) intra- and inter-class domain generalization for hand interaction region detection. The experimental results offer compelling empirical evidence that highlights the challenges inherent in fine-grained hand action recognition, while also shedding light on potential avenues for future research, particularly in relation to pre-training strategy, model design, and domain generalization. The dataset will be released at https://github.com/tingZ123/FHA-Kitchens.

Via

Access Paper or Ask Questions