Since the recent success of Vision Transformers (ViTs), explorations toward transformer-style architectures have triggered the resurgence of modern ConvNets. In this work, we explore the representation ability of DNNs through the lens of interaction complexities. We empirically show that interaction complexity is an overlooked but essential indicator for visual recognition. Accordingly, a new family of efficient ConvNets, named MogaNet, is presented to pursue informative context mining in pure ConvNet-based models, with preferable complexity-performance trade-offs. In MogaNet, interactions across multiple complexities are facilitated and contextualized by leveraging two specially designed aggregation blocks in both spatial and channel interaction spaces. Extensive studies are conducted on ImageNet classification, COCO object detection, and ADE20K semantic segmentation tasks. The results demonstrate that MogaNet sets a new state of the art over other popular methods in mainstream scenarios and at all model scales. Notably, the lightweight MogaNet-T achieves 80.0\% top-1 accuracy with only 1.44G FLOPs using a refined training setup on ImageNet-1K, surpassing ParC-Net-S by 1.4\% accuracy while saving 59\% (2.04G) FLOPs.
Sign Language (SL), as the mother tongue of the deaf community, is a special visual language that most hearing people cannot understand. In recent years, neural Sign Language Translation (SLT), as a possible way of bridging the communication gap between deaf and hearing people, has attracted widespread academic attention. We found that the current mainstream end-to-end neural SLT models, which try to learn language knowledge in a weakly supervised manner, cannot mine enough semantic information in low-resource settings. Therefore, we propose to introduce additional word-level semantic knowledge of sign language linguistics to improve current end-to-end neural SLT models. Concretely, we propose a novel neural SLT model with multi-modal feature fusion based on a dynamic graph, in which the cross-modal information, i.e., text and video, is first assembled into a dynamic graph according to their correlation, and then the graph is processed by a multi-modal graph encoder to generate multi-modal embeddings for use in the subsequent neural translation models. To the best of our knowledge, we are the first to introduce graph neural networks for fusing multi-modal information into neural sign language translation models. Moreover, we conducted experiments on the publicly available, popular SLT dataset RWTH-PHOENIX-Weather-2014T, and the quantitative experiments show that our method can improve the model.
Multi-agent reinforcement learning (MARL) is a prevalent learning paradigm for solving stochastic games. In previous studies, agents in a game are defined to be teammates or enemies beforehand, and the relations among the agents are fixed throughout the game. Those methods can hardly work in games where the competitive and collaborative relationships are not public and change dynamically, being decided by the \textit{identities} of the agents. How to learn a successful policy when the identities of agents are ambiguous remains an open problem. To address it, we develop a novel MARL framework, IDRL, which identifies the identities of the agents dynamically and then chooses the corresponding policy to perform the task. In the IDRL framework, a relation network is constructed to deduce the identities of the agents by sensing the kindness and hostility shown by other agents; a dangerous network is built to estimate the risk of the identification. We also propose an intrinsic reward to help train the relation network and the dangerous network to strike a trade-off between maximizing external reward and identifying agents accurately. After identifying the cooperation-competition pattern among the agents, IDRL applies one of the off-the-shelf MARL methods to learn the policy. Taking the poker game \textit{red-10} as the experimental environment, experiments show that IDRL achieves superior performance compared to other MARL methods. Significantly, the relation network identifies the identities of agents on par with top human players; the dangerous network reasonably avoids the risk of imperfect identification.
The ability to reuse previous policies is an important aspect of human intelligence. To achieve efficient policy reuse, a Deep Reinforcement Learning (DRL) agent needs to decide when to reuse and which source policies to reuse. Previous methods solve this problem by introducing extra components to the underlying algorithm, such as hierarchical high-level policies over source policies, or estimations of source policies' value functions on the target task. However, training these components induces either optimization non-stationarity or heavy sampling cost, significantly impairing the effectiveness of transfer. To tackle this problem, we propose a novel policy reuse algorithm called Critic-gUided Policy reuse (CUP), which avoids training any extra components and efficiently reuses source policies. CUP utilizes the critic, a common component in actor-critic methods, to evaluate and choose source policies. At each state, CUP chooses the source policy that has the largest one-step improvement over the current target policy, and forms a guidance policy. The guidance policy is theoretically guaranteed to be a monotonic improvement over the current target policy. Then the target policy is regularized to imitate the guidance policy to perform efficient policy search. Empirical results demonstrate that CUP achieves efficient transfer and significantly outperforms baseline algorithms.
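The per-state selection step described above can be illustrated with a minimal sketch for a discrete action space. It assumes the critic supplies a vector of action values Q(s, a) at the current state and that each candidate policy is given as a vector of action probabilities. Including the current target policy among the candidates is an assumption made here so that the selected policy is never worse than the target under the critic. The function names are hypothetical and are not taken from the CUP implementation.

```python
from typing import Dict, Sequence


def expected_q(action_probs: Sequence[float], q_values: Sequence[float]) -> float:
    """Expected critic value of a policy at one state: sum_a pi(a|s) * Q(s, a)."""
    return sum(p * q for p, q in zip(action_probs, q_values))


def choose_guidance_policy(
    candidates: Dict[str, Sequence[float]],
    q_values: Sequence[float],
) -> str:
    """Return the name of the candidate policy with the largest expected critic value.

    `candidates` maps a policy name to its action distribution at the current
    state. Assumption: the target policy is one of the candidates, so the
    chosen policy's expected value is at least that of the target policy.
    """
    return max(candidates, key=lambda name: expected_q(candidates[name], q_values))


if __name__ == "__main__":
    q = [1.0, 0.0, 2.0]  # critic estimates Q(s, a) for three actions
    policies = {
        "target": [0.5, 0.5, 0.0],
        "source_a": [0.0, 0.0, 1.0],
        "source_b": [1.0, 0.0, 0.0],
    }
    print(choose_guidance_policy(policies, q))
```

Repeating this choice at every state yields the guidance policy; the regularization that pulls the target policy toward it is not shown here.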
With the remarkable progress of deep neural networks in computer vision, data mixing augmentation techniques are widely studied to alleviate problems of degraded generalization when the amount of training data is limited. However, mixup strategies have not been well assembled in current vision toolboxes. In this paper, we propose \texttt{OpenMixup}, an open-source all-in-one toolbox for supervised, semi-, and self-supervised visual representation learning with mixup. It offers an integrated model design and training platform, comprising a rich set of prevailing network architectures and modules, a collection of data mixing augmentation methods, and practical model analysis tools. We also provide standard mixup image classification benchmarks on various datasets, which help practitioners make fair comparisons among state-of-the-art methods under the same settings. The source code and user documents are available at \url{https://github.com/Westlake-AI/openmixup}.
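For readers unfamiliar with data mixing augmentation, the following is a minimal sketch of the standard input-level mixup operation, in which two samples and their one-hot labels are combined with a weight drawn from a Beta(alpha, alpha) distribution. It uses plain Python lists and is not the \texttt{OpenMixup} API; the function names and the default `alpha` are assumptions chosen for illustration.

```python
import random
from typing import List, Optional, Sequence, Tuple


def sample_lambda(alpha: float = 1.0, rng: Optional[random.Random] = None) -> float:
    """Draw the mixing weight from Beta(alpha, alpha); alpha=1.0 is an illustrative default."""
    generator = rng if rng is not None else random
    return generator.betavariate(alpha, alpha)


def mixup_pair(
    x_a: Sequence[float],
    y_a: Sequence[float],
    x_b: Sequence[float],
    y_b: Sequence[float],
    lam: float,
) -> Tuple[List[float], List[float]]:
    """Convexly combine two samples and their one-hot labels with weight lam."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y


if __name__ == "__main__":
    lam = sample_lambda(alpha=1.0, rng=random.Random(0))
    mixed_x, mixed_y = mixup_pair([0.0, 2.0], [1.0, 0.0], [4.0, 2.0], [0.0, 1.0], lam)
    print(lam, mixed_x, mixed_y)
```

The mixed label remains a valid probability vector because the same weight is applied to inputs and labels.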
Graph edge perturbations are dedicated to damaging the prediction of graph neural networks by modifying the graph structure. Previous gray-box attackers employ gradients from the surrogate model to locate the vulnerable edges to perturb the graph structure. However, unreliability exists in gradients on graph structures, which is rarely studied by previous works. In this paper, we discuss and analyze the errors caused by the unreliability of the structural gradients. These errors arise from rough gradient usage due to the discreteness of the graph structure and from the unreliability in the meta-gradient on the graph structure. In order to address these problems, we propose a novel attack model with methods to reduce the errors inside the structural gradients. We propose edge discrete sampling to select the edge perturbations associated with hierarchical candidate selection to ensure computational efficiency. In addition, semantic invariance and momentum gradient ensemble are proposed to address the gradient fluctuation on semantic-augmented graphs and the instability of the surrogate model. Experiments are conducted in untargeted gray-box poisoning scenarios and demonstrate the improvement in the performance of our approach.
Current multi-category Multiple Object Tracking (MOT) metrics use class labels to group tracking results for per-class evaluation. Similarly, MOT methods typically only associate objects with the same class predictions. These two prevalent strategies in MOT implicitly assume that the classification performance is near-perfect. However, this is far from the case in recent large-scale MOT datasets, which contain large numbers of classes with many rare or semantically similar categories. Therefore, the resulting inaccurate classification leads to sub-optimal tracking and inadequate benchmarking of trackers. We address these issues by disentangling classification from tracking. We introduce a new metric, Track Every Thing Accuracy (TETA), breaking tracking measurement into three sub-factors: localization, association, and classification, allowing comprehensive benchmarking of tracking performance even under inaccurate classification. TETA also deals with the challenging incomplete annotation problem in large-scale tracking datasets. We further introduce a Track Every Thing tracker (TETer), that performs association using Class Exemplar Matching (CEM). Our experiments show that TETA evaluates trackers more comprehensively, and TETer achieves significant improvements on the challenging large-scale datasets BDD100K and TAO compared to the state-of-the-art.
Dimensional reduction~(DR) maps high-dimensional data into a lower-dimensional latent space by minimizing defined optimization objectives. DR methods usually fall into feature selection~(FS) and feature projection~(FP). FS focuses on selecting a critical subset of dimensions but risks destroying the data distribution (structure). On the other hand, FP combines all the input features into a lower-dimensional space, aiming to maintain the data structure, but lacks interpretability and sparsity. FS and FP are traditionally incompatible categories; thus, they have not been unified into an amicable framework. We propose that the ideal DR approach combines both FS and FP into a unified end-to-end manifold learning framework, simultaneously performing fundamental feature discovery while maintaining the intrinsic relationships between data samples in the latent space. In this work, we develop a unified framework, Unified Dimensional Reduction Neural-network~(UDRN), that integrates FS and FP in a compatible, end-to-end way. We improve the neural network structure by implementing FS and FP tasks separately using two stacked sub-networks. In addition, we design data augmentation for the DR process to improve the generalization ability of the method when dealing with extensive feature datasets, and we design loss functions that cooperate with the data augmentation. Extensive experimental results on four image and four biological datasets, including very high-dimensional data, demonstrate the advantages of UDRN over existing methods~(FS, FP, and FS\&FP pipeline), especially in downstream tasks such as classification and visualization.
Manifold learning~(ML) aims to find a low-dimensional embedding of high-dimensional data. Previous works focus on handcrafted or easy datasets with simple and ideal scenarios; however, we find they perform poorly on real-world datasets with under-sampled data. Generally, ML methods first model the data structure and then produce a low-dimensional embedding, where the poor local connectivity of under-sampled data in the former step and inappropriate optimization objectives in the latter step lead to \emph{structural distortion} and \emph{underconstrained embedding}. To solve this problem, we propose Deep Local-flatness Manifold Embedding (DLME), a novel ML framework to obtain reliable manifold embedding by reducing distortion. Our proposed DLME constructs semantic manifolds by data augmentation and overcomes \emph{structural distortion} problems with the help of its smooth framework. To overcome \emph{underconstrained embedding}, we design a specific loss for DLME and mathematically demonstrate that it leads to a more suitable embedding based on our proposed Local Flatness Assumption. Experiments on downstream classification, clustering, and visualization tasks with three types of datasets (toy, biological, and image) show that DLME outperforms SOTA ML \& contrastive learning (CL) methods.
Spatiotemporal predictive learning aims to generate future frames by learning from historical frames. In this paper, we investigate existing methods and present a general framework of spatiotemporal predictive learning, in which the spatial encoder and decoder capture intra-frame features and the middle temporal module catches inter-frame correlations. While the mainstream methods employ recurrent units to capture long-term temporal dependencies, they suffer from low computational efficiency due to their unparallelizable architectures. To parallelize the temporal module, we propose the Temporal Attention Unit (TAU), which decomposes the temporal attention into intra-frame statical attention and inter-frame dynamical attention. Moreover, while the mean squared error loss focuses on intra-frame errors, we introduce a novel differential divergence regularization to take inter-frame variations into account. Extensive experiments demonstrate that the proposed method enables the derived model to achieve competitive performance on various spatiotemporal prediction benchmarks.
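To make the idea of penalizing inter-frame variations concrete, the sketch below shows one plausible form of such a regularizer. The abstract does not give the formula, so everything here is an assumption: differences between consecutive frames are computed for both the predicted and the ground-truth sequence, each set of differences is turned into a probability distribution with a softmax, and the two distributions are compared with the Kullback-Leibler divergence. Flattening all differences into a single distribution is a simplification made for brevity, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def frame_differences(frames: Sequence[Sequence[float]]) -> List[float]:
    """Flattened differences between consecutive frames (frame t+1 minus frame t)."""
    diffs: List[float] = []
    for prev, nxt in zip(frames, frames[1:]):
        diffs.extend(b - a for a, b in zip(prev, nxt))
    return diffs


def softmax(values: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a flat list of values."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def differential_divergence(
    predicted: Sequence[Sequence[float]],
    target: Sequence[Sequence[float]],
    eps: float = 1e-12,
) -> float:
    """KL divergence between softmaxed frame differences of prediction and target."""
    p = softmax(frame_differences(predicted))
    q = softmax(frame_differences(target))
    # Terms with p_i == 0 contribute nothing; eps guards against division by zero.
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)


if __name__ == "__main__":
    truth = [[0.0, 0.0], [0.0, 1.0]]   # two frames of two pixels each
    guess = [[0.0, 0.0], [1.0, 0.0]]   # motion predicted at the wrong pixel
    print(differential_divergence(guess, truth))
```

In training, a term of this kind would be added to the mean squared error loss with a weighting coefficient, so that the prediction is penalized for both per-frame errors and mismatched frame-to-frame changes.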