Deng Cai

Neural Collapse Inspired Federated Learning with Non-iid Data

Mar 31, 2023

Chenxi Huang, Liang Xie, Yibo Yang, Wenxiao Wang, Binbin Lin, Deng Cai

Abstract: One of the challenges in federated learning is the non-independent and identically distributed (non-iid) characteristics between heterogeneous devices, which cause significant differences in local updates and affect the performance of the central server. Although many studies have been proposed to address this challenge, they only focus on local training and aggregation processes to smooth the changes and fail to achieve high performance with deep learning models. Inspired by the phenomenon of neural collapse, we force each client to be optimized toward an optimal global structure for classification. Specifically, we initialize it as a random simplex Equiangular Tight Frame (ETF) and fix it as the unit optimization target of all clients during the local updating. After guaranteeing all clients are learning to converge to the global optimum, we propose to add a global memory vector for each category to remedy the parameter fluctuation caused by the bias of the intra-class condition distribution among clients. Our experimental results show that our method can improve the performance with faster convergence speed on datasets of different sizes.

* 11 pages, 5 figures
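
The fixed classifier target named in the abstract, a random simplex ETF, can be illustrated with a short sketch. This is not the authors' code: it uses the standard construction from the neural collapse literature, and the function name `random_simplex_etf`, the helper `dot`, and the assumption that the feature dimension is at least the number of classes are all choices made for this example. Each of the K class prototypes has unit norm, and every pair of prototypes has cosine similarity -1/(K-1).

```python
import math
import random


def random_simplex_etf(num_classes, dim, seed=0):
    """Return num_classes unit vectors in R^dim with pairwise cosine -1/(K-1).

    Hypothetical helper for illustration; assumes 2 <= num_classes <= dim.
    """
    if num_classes < 2 or dim < num_classes:
        raise ValueError("this sketch assumes 2 <= num_classes <= dim")
    rng = random.Random(seed)
    basis = []
    # Modified Gram-Schmidt on random Gaussian vectors gives K orthonormal
    # directions, i.e. a random partial rotation.
    while len(basis) < num_classes:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for u in basis:
            proj = sum(a * b for a, b in zip(v, u))
            v = [a - proj * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-8:
            basis.append([a / norm for a in v])
    # Centering the orthonormal vectors and rescaling yields the simplex ETF:
    # m_i = sqrt(K / (K - 1)) * (u_i - mean(u)).
    mean = [sum(u[i] for u in basis) / num_classes for i in range(dim)]
    scale = math.sqrt(num_classes / (num_classes - 1))
    return [[scale * (u[i] - mean[i]) for i in range(dim)] for u in basis]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Four class prototypes in an 8-dimensional feature space.
prototypes = random_simplex_etf(num_classes=4, dim=8)
```

In a federated setting, every client would share the same `seed` so that all local models are trained against one identical, frozen set of prototypes; how the paper distributes or scales the ETF is not stated in the abstract.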

APPT : Asymmetric Parallel Point Transformer for 3D Point Cloud Understanding

Mar 31, 2023

Hengjia Li, Tu Zheng, Zhihao Chi, Zheng Yang, Wenxiao Wang, Boxi Wu, Binbin Lin, Deng Cai

Abstract: Transformer-based networks have achieved impressive performance in 3D point cloud understanding. However, most of them concentrate on aggregating local features but neglect to directly model global dependencies, which results in a limited effective receptive field. Besides, how to effectively incorporate local and global components also remains challenging. To tackle these problems, we propose Asymmetric Parallel Point Transformer (APPT). Specifically, we introduce Global Pivot Attention to extract global features and enlarge the effective receptive field. Moreover, we design the Asymmetric Parallel structure to effectively integrate local and global information. Combined with these designs, APPT is able to capture features globally throughout the entire network while focusing on local-detailed features. Extensive experiments show that our method outperforms prior methods and achieves state-of-the-art results on several benchmarks for 3D point cloud understanding, such as 3D semantic segmentation on S3DIS, 3D shape classification on ModelNet40, and 3D part segmentation on ShapeNet.

General Rotation Invariance Learning for Point Clouds via Weight-Feature Alignment

Feb 20, 2023

Liang Xie, Yibo Yang, Wenxiao Wang, Binbin Lin, Deng Cai, Xiaofei He

Abstract: Compared to 2D images, 3D point clouds are much more sensitive to rotations. We expect the point features describing certain patterns to remain invariant to the rotation transformation. There are many recent SOTA works dedicated to rotation-invariant learning for 3D point clouds. However, current rotation-invariant methods lack generalizability on the point clouds in the open scenes due to the reliance on the global distribution, i.e., the global scene and backgrounds. Considering that the output activation is a function of the pattern and its orientation, we need to eliminate the effect of the orientation. In this paper, inspired by the idea that the network weights can be considered a set of points distributed in the same 3D space as the input points, we propose Weight-Feature Alignment (WFA) to construct a local Invariant Reference Frame (IRF) via aligning the features with the principal axes of the network weights. Our WFA algorithm provides a general solution for the point clouds of all scenes. WFA ensures the model achieves the target that the response activity is a necessary and sufficient condition of the pattern matching degree. Practically, we perform experiments on the point clouds of both single objects and open large-range scenes. The results suggest that our method almost bridges the gap between rotation invariance learning and normal methods.

* 14 pages, 4 figures

LUT-NN: Towards Unified Neural Network Inference by Table Lookup

Feb 07, 2023

Xiaohu Tang, Yang Wang, Ting Cao, Li Lyna Zhang, Qi Chen, Deng Cai, Yunxin Liu, Mao Yang

Abstract: DNN inference requires huge system development effort and resource cost. This drives us to propose LUT-NN, the first trial towards empowering deep neural network (DNN) inference by table lookup, to eliminate the diverse computation kernels as well as save running cost. Based on the feature similarity of each layer, LUT-NN can learn the typical features, named centroids, of each layer from the training data, precompute them with model weights, and save the results in tables. For future input, the results of the closest centroids with the input features can be directly read from the table, as the approximation of layer output. We propose the novel centroid learning technique for DNN, which enables centroid learning through backpropagation, and adapts three levels of approximation to minimize the model loss. By this technique, LUT-NN achieves comparable accuracy (<5% difference) with original models on real, complex datasets, including CIFAR, ImageNet, and GLUE. LUT-NN simplifies the computing operators to only two: closest centroid search and table lookup. We implement them for Intel and ARM CPUs. The model size is reduced by up to 3.5x for CNN models and 7x for BERT. Latency-wise, the real speedup of LUT-NN is up to 7x for BERT and 2x for ResNet, much lower than theoretical results because of the current unfriendly hardware design for table lookup. We expect first-class table lookup support in the future to unleash the potential of LUT-NN.
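
The two operators named in the abstract, closest centroid search and table lookup, can be sketched for a single linear layer as follows. This is a minimal illustration in the style of product quantization, not the LUT-NN implementation: the centroids here are fixed by hand rather than learned through backpropagation, and the names `build_tables`, `closest_centroid`, and `lut_linear`, as well as the per-slot sub-vector layout, are assumptions of this example.

```python
def build_tables(codebooks, weight, sub_dim):
    """Precompute centroid-times-weight products.

    codebooks[s] is the list of centroids for input slot s (each of length
    sub_dim); weight is a list of input_dim rows, each of length out_dim.
    tables[s][k] is the output contribution of centroid k in slot s.
    """
    out_dim = len(weight[0])
    tables = []
    for s, codebook in enumerate(codebooks):
        rows = weight[s * sub_dim:(s + 1) * sub_dim]
        tables.append([
            [sum(c[j] * rows[j][o] for j in range(sub_dim)) for o in range(out_dim)]
            for c in codebook
        ])
    return tables


def closest_centroid(codebook, sub_vector):
    """Index of the centroid with the smallest squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(codebook[k], sub_vector)),
    )


def lut_linear(x, codebooks, tables, sub_dim):
    """Approximate x @ weight using only centroid search and table reads."""
    out = [0.0] * len(tables[0][0])
    for s, codebook in enumerate(codebooks):
        k = closest_centroid(codebook, x[s * sub_dim:(s + 1) * sub_dim])
        out = [o + t for o, t in zip(out, tables[s][k])]
    return out


# Toy layer: 4 inputs split into two slots of length 2, 2 outputs.
weight = [[1, 2], [3, 4], [5, 6], [7, 8]]
codebooks = [[[1, 0], [0, 1]], [[1, 0], [0, 1]]]
tables = build_tables(codebooks, weight, sub_dim=2)
approx = lut_linear([0.9, 0.1, 0.2, 0.8], codebooks, tables, sub_dim=2)
```

At inference time no multiplication with `weight` takes place; the quality of the approximation depends entirely on how close each input sub-vector is to its nearest centroid, which is why the abstract emphasizes learning the centroids.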

Self-supervised and Weakly Supervised Contrastive Learning for Frame-wise Action Representations

Dec 23, 2022

Minghao Chen, Renbo Tu, Chenxi Huang, Yuqi Lin, Boxi Wu, Deng Cai

Abstract: Previous work on action representation learning focused on global representations for short video clips. In contrast, many practical applications, such as video alignment, strongly demand learning the intensive representation of long videos. In this paper, we introduce a new framework of contrastive action representation learning (CARL) to learn frame-wise action representation in a self-supervised or weakly-supervised manner, especially for long videos. Specifically, we introduce a simple but effective video encoder that considers both spatial and temporal context by combining convolution and transformer. Inspired by the recent massive progress in self-supervised learning, we propose a new sequence contrast loss (SCL) applied to two related views obtained by expanding a series of spatio-temporal data in two versions. One is the self-supervised version that optimizes embedding space by minimizing KL-divergence between sequence similarity of two augmented views and prior Gaussian distribution of timestamp distance. The other is the weakly-supervised version that builds more sample pairs among videos using video-level labels by dynamic time warping (DTW). Experiments on FineGym, PennAction, and Pouring datasets show that our method outperforms previous state-of-the-art by a large margin for downstream fine-grained action classification and offers even faster inference. Surprisingly, although without training on paired videos like in previous works, our self-supervised version also shows outstanding performance in video alignment and fine-grained frame retrieval tasks.

* 13 pages, 8 figures. arXiv admin note: substantial text overlap with arXiv:2203.14957
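
Dynamic time warping, which the weakly-supervised variant relies on, is a standard alignment algorithm, and a generic version can be sketched as below. This shows only the textbook recurrence on scalar sequences; how the paper applies DTW to frame embeddings and video-level labels is not described in the abstract, and the names `dtw_cost` and `abs_diff` are chosen for this example.

```python
def dtw_cost(seq_a, seq_b, dist):
    """Minimum cumulative cost of a monotonic alignment between two sequences."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] is the best cost of aligning the first i items of seq_a
    # with the first j items of seq_b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = dist(seq_a[i - 1], seq_b[j - 1])
            # Extend the cheapest of: advance in a, advance in b, advance in both.
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def abs_diff(x, y):
    return abs(x - y)


# The second sequence repeats one step, so a perfect warp exists.
same_action = dtw_cost([1, 2, 3], [1, 2, 2, 3], abs_diff)
```

For frame-wise representations, `dist` would typically be a distance between embedding vectors rather than between scalars.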

OBMO: One Bounding Box Multiple Objects for Monocular 3D Object Detection

Dec 20, 2022

Chenxi Huang, Tong He, Haidong Ren, Wenxiao Wang, Binbin Lin, Deng Cai

Abstract: Compared to typical multi-sensor systems, monocular 3D object detection has attracted much attention due to its simple configuration. However, there is still a significant gap between LiDAR-based and monocular-based methods. In this paper, we find that the ill-posed nature of monocular imagery can lead to depth ambiguity. Specifically, objects with different depths can appear with the same bounding boxes and similar visual features in the 2D image. Unfortunately, the network cannot accurately distinguish different depths from such non-discriminative visual features, resulting in unstable depth training. To facilitate depth learning, we propose a simple yet effective plug-and-play module, One Bounding Box Multiple Objects (OBMO). Concretely, we add a set of suitable pseudo labels by shifting the 3D bounding box along the viewing frustum. To constrain the pseudo-3D labels to be reasonable, we carefully design two label scoring strategies to represent their quality. In contrast to the original hard depth labels, such soft pseudo labels with quality scores allow the network to learn a reasonable depth range, boosting training stability and thus improving final performance. Extensive experiments on KITTI and Waymo benchmarks show that our method improves state-of-the-art monocular 3D detectors by a significant margin (the improvements under the moderate setting on the KITTI validation set are $\mathbf{1.82\sim 10.91\%}$ mAP in BEV and $\mathbf{1.18\sim 9.36\%}$ mAP in 3D). Codes have been released at https://github.com/mrsempress/OBMO.

* 9 pages, 9 figures
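
The idea of shifting a 3D box along the viewing frustum can be illustrated with a pinhole camera model. The sketch below is an assumption-laden reading of the abstract, not the released OBMO code: it moves only the box center along the camera ray so that its image projection stays fixed, it ignores box dimensions and orientation, and it omits the two label scoring strategies. The function names and the intrinsics used in the example are hypothetical.

```python
def shift_along_ray(center, depth_offsets):
    """Generate pseudo centers at new depths on the same camera ray.

    center is (x, y, z) in camera coordinates with z > 0 as depth.
    """
    x, y, z = center
    shifted = []
    for dz in depth_offsets:
        z_new = z + dz
        if z_new <= 0:
            continue  # skip candidates behind the camera
        scale = z_new / z
        # Scaling x and y with depth keeps x/z and y/z unchanged.
        shifted.append((x * scale, y * scale, z_new))
    return shifted


def project(center, fx, fy, cx, cy):
    """Pinhole projection of a 3D point to pixel coordinates."""
    x, y, z = center
    return (fx * x / z + cx, fy * y / z + cy)


# Hypothetical intrinsics and a center 20 m in front of the camera.
original = (2.0, 1.0, 20.0)
pseudo_centers = shift_along_ray(original, depth_offsets=[-1.0, 0.0, 1.0])
pixels = [project(c, fx=700.0, fy=700.0, cx=600.0, cy=180.0) for c in pseudo_centers]
```

All entries of `pixels` coincide with the projection of `original`, which is the depth ambiguity the abstract describes: the image alone cannot tell these candidates apart.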

One-shot Implicit Animatable Avatars with Model-based Priors

Dec 05, 2022

Yangyi Huang, Hongwei Yi, Weiyang Liu, Haofan Wang, Boxi Wu, Wenxiao Wang, Binbin Lin, Debing Zhang, Deng Cai

Abstract: Existing neural rendering methods for creating human avatars typically either require dense input signals such as video or multi-view images, or leverage a learned prior from large-scale specific 3D human datasets such that reconstruction can be performed with sparse-view inputs. Most of these methods fail to achieve realistic reconstruction when only a single image is available. To enable the data-efficient creation of realistic animatable 3D humans, we propose ELICIT, a novel method for learning human-specific neural radiance fields from a single image. Inspired by the fact that humans can easily reconstruct the body geometry and infer the full-body clothing from a single image, we leverage two priors in ELICIT: 3D geometry prior and visual semantic prior. Specifically, ELICIT introduces the 3D body shape geometry prior from a skinned vertex-based template model (i.e., SMPL) and implements the visual clothing semantic prior with the CLIP-based pre-trained models. Both priors are used to jointly guide the optimization for creating plausible content in the invisible areas. In order to further improve visual details, we propose a segmentation-based sampling strategy that locally refines different parts of the avatar. Comprehensive evaluations on multiple popular benchmarks, including ZJU-MoCAP, Human3.6M, and DeepFashion, show that ELICIT has outperformed current state-of-the-art avatar creation methods when only a single image is available. Code will be public for research purposes at https://elicit3d.github.io.

* Project website: https://elicit3d.github.io

What would Harry say? Building Dialogue Agents for Characters in a Story

Nov 15, 2022

Nuo Chen, Yan Wang, Haiyun Jiang, Deng Cai, Ziyang Chen, Jia Li

Abstract: We present HPD: Harry Potter Dialogue Dataset to facilitate the study of building dialogue agents for characters in a story. It differs from existing dialogue datasets in two aspects: 1) HPD provides rich background information about the novel Harry Potter, including scene, character attributes, and character relations; 2) all of this background information changes as the story goes on. In other words, each dialogue session in HPD correlates to a different background, and the storyline determines how the background changes. We evaluate some baselines (e.g., GPT-2, BOB) on both automatic and human metrics to determine how well they can generate Harry Potter-like responses. Experimental results indicate that although the generated responses are fluent and relevant to the dialogue history, they still sound out of character for Harry, indicating that there is large headroom for future studies. Our dataset is available.

* 15 pages

Boosting Semi-Supervised 3D Object Detection with Semi-Sampling

Nov 15, 2022

Xiaopei Wu, Yang Zhao, Liang Peng, Hua Chen, Xiaoshui Huang, Binbin Lin, Haifeng Liu, Deng Cai, Wanli Ouyang

Abstract: Current 3D object detection methods heavily rely on an enormous amount of annotations. Semi-supervised learning can be used to alleviate this issue. Previous semi-supervised 3D object detection methods directly follow the practice of fully-supervised methods to augment labeled and unlabeled data, which is sub-optimal. In this paper, we design a data augmentation method for semi-supervised learning, which we call Semi-Sampling. Specifically, we use ground truth labels and pseudo labels to crop gt samples and pseudo samples on labeled frames and unlabeled frames, respectively. Then we can generate a gt sample database and a pseudo sample database. When training a teacher-student semi-supervised framework, we add randomly selected gt samples and pseudo samples to both labeled frames and unlabeled frames, providing strong data augmentation for them. Our semi-sampling can be regarded as an extension of gt-sampling to semi-supervised learning. Our method is simple but effective. We consistently improve state-of-the-art methods on ScanNet, SUN-RGBD, and KITTI benchmarks by large margins. For example, when training using only 10% labeled data on ScanNet, we achieve 3.1 mAP and 6.4 mAP improvement upon 3DIoUMatch in terms of mAP@0.25 and mAP@0.5. When training using only 1% labeled data on KITTI, we boost 3DIoUMatch by 3.5 mAP, 6.7 mAP and 14.1 mAP on car, pedestrian and cyclist classes. Codes will be made publicly available at https://github.com/LittlePey/Semi-Sampling.
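
The sampling step described in the abstract, drawing from a gt sample database and a pseudo sample database and adding the draws to a frame, can be sketched roughly as follows. This is a simplified illustration and not the released Semi-Sampling code: boxes are reduced to axis-aligned bird's-eye-view rectangles, the collision rule (skip any sample that overlaps an existing box) is an assumption borrowed from common gt-sampling practice, and the names `overlaps` and `semi_sample` are hypothetical.

```python
import random


def overlaps(a, b):
    """True if two axis-aligned boxes (x_min, y_min, x_max, y_max) intersect."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def semi_sample(frame_boxes, gt_db, pseudo_db, num_paste, seed=0):
    """Pick up to num_paste samples from both databases that fit into the frame."""
    rng = random.Random(seed)
    candidates = gt_db + pseudo_db  # new list, the databases stay untouched
    rng.shuffle(candidates)
    occupied = list(frame_boxes)
    pasted = []
    for box in candidates:
        if len(pasted) == num_paste:
            break
        # Only paste a sample where it does not collide with existing objects.
        if not any(overlaps(box, other) for other in occupied):
            occupied.append(box)
            pasted.append(box)
    return pasted


frame_boxes = [(0, 0, 2, 2)]
gt_db = [(1, 1, 3, 3), (5, 5, 6, 6)]
pseudo_db = [(8, 8, 9, 9)]
pasted = semi_sample(frame_boxes, gt_db, pseudo_db, num_paste=2)
```

The same routine would be applied to labeled and unlabeled frames alike, which is the point where the method departs from plain gt-sampling.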

$N$-gram Is Back: Residual Learning of Neural Text Generation with $n$-gram Language Model

Nov 03, 2022

Huayang Li, Deng Cai, Jin Xu, Taro Watanabe

Abstract: $N$-gram language models (LM) have been largely superseded by neural LMs as the latter exhibit better performance. However, we find that $n$-gram models can achieve satisfactory performance on a large proportion of testing cases, indicating they have already captured abundant knowledge of the language with relatively low computational cost. With this observation, we propose to learn a neural LM that fits the residual between an $n$-gram LM and the real-data distribution. The combination of $n$-gram and neural LMs not only allows the neural part to focus on the deeper understanding of language but also provides a flexible way to customize an LM by switching the underlying $n$-gram model without changing the neural model. Experimental results on three typical language tasks (i.e., language modeling, machine translation, and summarization) demonstrate that our approach attains additional performance gains over popular standalone neural models consistently. We also show that our approach allows for effective domain adaptation by simply switching to a domain-specific $n$-gram model, without any extra training. Our code is released at https://github.com/ghrua/NgramRes.

* Accepted to findings of EMNLP 2022
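
One plausible way to read "fitting the residual" is that the neural model outputs a correction which is added to the $n$-gram log-probabilities before normalization. The sketch below follows that reading as an assumption; the paper's actual parameterization may differ, and the function name `combine_with_ngram` and the smoothing constant `eps` are made up for this example.

```python
import math


def combine_with_ngram(ngram_probs, residual_logits, eps=1e-12):
    """Next-token distribution from n-gram probabilities plus a neural residual.

    Assumed form: softmax(log p_ngram + residual).
    """
    scores = [math.log(p + eps) + r for p, r in zip(ngram_probs, residual_logits)]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# With a zero residual the n-gram distribution is returned unchanged.
unchanged = combine_with_ngram([0.5, 0.3, 0.2], [0.0, 0.0, 0.0])
# A positive residual on the second token shifts probability mass toward it.
corrected = combine_with_ngram([0.5, 0.3, 0.2], [0.0, 2.0, 0.0])
```

Under this form, swapping in a domain-specific $n$-gram model only changes `ngram_probs`, while the residual network stays fixed, which matches the domain adaptation use case in the abstract.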
