This paper describes our system, which placed third in the Multilingual Track (subtask 11), fourth in the Code-Mixed Track (subtask 12), and seventh in the Chinese Track (subtask 9) of SemEval 2022 Task 11: MultiCoNER Multilingual Complex Named Entity Recognition. Our system's key contributions are as follows: 1) for multilingual NER tasks, we offer a unified framework with which one can easily execute single-language or multilingual NER tasks; 2) for the low-resource code-mixed NER task, one can easily enhance the dataset by applying several simple data augmentation methods; and 3) for the Chinese task, we propose a model that captures Chinese lexical semantics, lexical boundaries, and lexical graph structural information. Finally, our system achieves macro-F1 scores of 77.66, 84.35, and 74.00 on subtasks 11, 12, and 9, respectively, during the testing phase.
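For illustration, a minimal sketch of one label-preserving augmentation of the kind referred to above: entity-mention replacement over BIO-tagged NER data. The data format (parallel token/tag lists) and the helper names are assumptions for this sketch, not the authors' implementation.

```python
import random
from collections import defaultdict

def collect_mentions(sentences):
    """Gather entity mentions per type from (tokens, BIO tags) sentence pairs."""
    mentions = defaultdict(list)
    for tokens, tags in sentences:
        span, span_type = [], None
        for tok, tag in zip(tokens, tags):
            if tag.startswith("B-"):
                if span:
                    mentions[span_type].append(span)
                span, span_type = [tok], tag[2:]
            elif tag.startswith("I-") and span:
                span.append(tok)
            else:
                if span:
                    mentions[span_type].append(span)
                span, span_type = [], None
        if span:
            mentions[span_type].append(span)
    return mentions

def mention_replacement(tokens, tags, mentions, p=0.5):
    """Replace each entity mention with a random mention of the same type."""
    new_tokens, new_tags, i = [], [], 0
    while i < len(tokens):
        if tags[i].startswith("B-") and random.random() < p:
            etype = tags[i][2:]
            j = i + 1
            while j < len(tokens) and tags[j] == f"I-{etype}":
                j += 1
            repl = random.choice(mentions[etype])
            new_tokens.extend(repl)
            new_tags.extend([f"B-{etype}"] + [f"I-{etype}"] * (len(repl) - 1))
            i = j
        else:
            new_tokens.append(tokens[i])
            new_tags.append(tags[i])
            i += 1
    return new_tokens, new_tags
```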
The automatic analysis of subtle changes between longitudinal MR images is an important yet still challenging task in breast medical image processing. Since previously used methods rely on features with low discriminative power, in this paper we propose an effective automatic change detection framework composed of two phases. First, in the preprocessing phase, an intensity normalization method based on Hierarchical Histogram Matching (HHM) is proposed that is more robust to noise than previous methods. To eliminate undesirable changes and extract the regions containing significant changes, the proposed Extraction Region of Changes (EROC) method is applied, based on the intensity distribution and a Hill-Climbing algorithm. Second, in the detection phase, a region-growing-based approach is proposed to differentiate significant changes from unreal ones. By using the proposed Weighted Local Mutual Information (WLMI) method to extract high-level features and by exploiting the principle of local consistency of changes, the proposed approach achieves reasonable performance. The experimental results on both simulated and real longitudinal breast MR images confirm the effectiveness of the proposed framework. Moreover, in some cases this framework outperforms the human expert, detecting many lesion evolutions that the expert misses.
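As an illustration of the intensity-normalization principle that HHM builds on, here is a minimal sketch of plain (non-hierarchical) histogram matching with NumPy. This is the textbook form shown only to illustrate the preprocessing idea, not the authors' hierarchical method.

```python
import numpy as np

def histogram_match(source, reference):
    """Map the intensities of `source` so its histogram matches `reference`."""
    src_shape = source.shape
    src = source.ravel()
    ref = reference.ravel()
    # sorted unique source values, their positions, and empirical CDFs
    src_values, src_idx, src_counts = np.unique(
        src, return_inverse=True, return_counts=True)
    ref_values, ref_counts = np.unique(ref, return_counts=True)
    src_cdf = np.cumsum(src_counts) / src.size
    ref_cdf = np.cumsum(ref_counts) / ref.size
    # for each source quantile, look up the reference intensity at the same quantile
    matched_values = np.interp(src_cdf, ref_cdf, ref_values)
    return matched_values[src_idx].reshape(src_shape)
```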
In head-to-head racing, performing tightly constrained but highly rewarding maneuvers, such as overtaking, requires an accurate model of the interactive behavior of the opposing target vehicle (TV). However, such information is typically not made available in competitive scenarios; we therefore propose to construct a prediction and uncertainty model from data of the TV collected in previous races. In particular, a one-step Gaussian Process (GP) model is trained on closed-loop interaction data to learn the behavior of a TV driven by an unknown policy. Predictions of the nominal trajectory and associated uncertainty are rolled out via a sampling-based approach and are used in a model predictive control (MPC) policy for the ego vehicle in order to intelligently trade off safety and performance when attempting overtaking maneuvers against a TV. We demonstrate the GP-based predictor in closed loop with the MPC policy in simulated races and compare its performance against several predictors from the literature. In a Monte Carlo study, we observe that the GP-based predictor achieves similar win rates while maintaining safety in up to 3x more races. We finally demonstrate the prediction and control framework in real time in hardware experiments.
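A minimal sketch of a one-step GP predictor with a sampling-based rollout in the spirit described above, using scikit-learn. The per-dimension GPs, kernel choice, state layout, and horizon are illustrative assumptions, not the authors' exact setup.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def fit_one_step_gps(states):
    """Fit one GP per state dimension mapping state_t -> state_{t+1}.

    states: array of shape (T, d) with the TV's recorded states.
    """
    X, Y = states[:-1], states[1:]
    gps = []
    for j in range(Y.shape[1]):
        kernel = RBF(length_scale=np.ones(X.shape[1])) + WhiteKernel(1e-3)
        gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
        gps.append(gp.fit(X, Y[:, j]))
    return gps

def rollout_samples(gps, x0, horizon=20, n_samples=30, seed=0):
    """Propagate uncertainty by sampling one-step predictions over the horizon."""
    rng = np.random.default_rng(seed)
    d = len(gps)
    trajs = np.zeros((n_samples, horizon + 1, d))
    trajs[:, 0] = x0
    for k in range(horizon):
        for j, gp in enumerate(gps):
            mean, std = gp.predict(trajs[:, k], return_std=True)
            trajs[:, k + 1, j] = mean + rng.standard_normal(n_samples) * std
    return trajs  # nominal trajectory ~ trajs.mean(axis=0)
```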
Model training algorithms that observe a small portion of the training set in each computational step are ubiquitous in practical machine learning and include both stochastic and online optimization methods. In the vast majority of cases, such algorithms observe the training samples via the gradients of the cost functions the samples incur. Thus, these methods exploit only the \emph{slope} of the cost functions via their first-order approximations. To address limitations of gradient-based methods, such as sensitivity to step-size choice in the stochastic setting or inability to exploit small function variability in the online setting, several streams of research attempt to exploit more information about the cost functions than just their gradients via the well-known proximal framework of optimization. However, implementing such methods in practice poses a challenge, since each iteration step boils down to computing a proximal operator, which may not be easy. In this work we provide efficient algorithms and corresponding implementations of proximal operators in order to make experimentation with incremental proximal optimization algorithms accessible to a larger audience of researchers and practitioners, and in particular to promote additional theoretical research into these methods by closing the gap between their theoretical description in research papers and their use in practice. The corresponding code is published at https://github.com/alexshtf/inc_prox_pt.
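For readers unfamiliar with the proximal framework, here is a minimal sketch of an incremental proximal-point step using two standard closed-form proximal operators: soft-thresholding for the L1 norm and the prox of a scalar least-squares term. These are textbook operators shown for illustration, not the full set implemented in the repository above.

```python
import numpy as np

def prox_l1(v, lam):
    """prox_{lam * ||.||_1}(v), i.e. soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def prox_least_squares(v, a, b, eta):
    """Closed-form prox of f(x) = 0.5 * (a^T x - b)^2 with step size eta:

        argmin_x 0.5 * (a^T x - b)^2 + (1 / (2 * eta)) * ||x - v||^2
    """
    residual = a @ v - b
    return v - (eta * residual / (1.0 + eta * (a @ a))) * a

def incremental_proximal_point(samples, x0, eta=0.1, epochs=5):
    """Process one (a, b) sample per step via its exact proximal operator."""
    x = x0.copy()
    for _ in range(epochs):
        for a, b in samples:
            x = prox_least_squares(x, a, b, eta)
    return x
```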
Human brains integrate linguistic and perceptual information simultaneously to understand natural language, and possess the critical ability to render imaginations. Such abilities enable us to construct new abstract concepts or concrete objects, and are essential for drawing on practical knowledge to solve problems in low-resource scenarios. However, most existing methods for Natural Language Understanding (NLU) focus mainly on textual signals. They do not simulate the human visual imagination ability, which hinders models from inferring and learning efficiently from limited data samples. Therefore, we introduce an Imagination-Augmented Cross-modal Encoder (iACE) to solve natural language understanding tasks from a novel learning perspective -- imagination-augmented cross-modal understanding. iACE enables visual imagination with external knowledge transferred from powerful generative and pre-trained vision-and-language models. Extensive experiments on GLUE and SWAG show that iACE achieves consistent improvement over visually-supervised pre-trained models. More importantly, results in extreme and normal few-shot settings validate the effectiveness of iACE in low-resource natural language understanding circumstances.
In this work, a novel algorithm called SVM with Shape-adaptive Reconstruction and Smoothed Total Variation (SaR-SVM-STV) is introduced to classify hyperspectral images, making full use of both spatial and spectral information. The Shape-adaptive Reconstruction (SaR) step preprocesses each pixel based on the Pearson correlation between pixels in its shape-adaptive (SA) region. Support Vector Machines (SVMs) are trained to estimate the pixel-wise probability maps of each class. Then the Smoothed Total Variation (STV) model is applied to denoise the probability maps and generate the final classification map. Experiments show that SaR-SVM-STV outperforms the SVM-STV method when only a few training labels are available, demonstrating the significance of reconstructing hyperspectral images before classification.
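A minimal sketch of the pixel-wise SVM probability-map stage using scikit-learn. The hyperspectral cube layout and hyperparameters are illustrative assumptions; the SaR preprocessing and the STV smoothing stages are not reproduced here.

```python
import numpy as np
from sklearn.svm import SVC

def pixelwise_probability_maps(cube, train_mask, train_labels):
    """cube: (H, W, B) hyperspectral image; train_mask: (H, W) bool;
    train_labels: (H, W) int labels, valid where train_mask is True."""
    H, W, B = cube.shape
    X = cube.reshape(-1, B)
    svm = SVC(kernel="rbf", probability=True)
    svm.fit(X[train_mask.ravel()], train_labels[train_mask])
    probs = svm.predict_proba(X)       # (H*W, n_classes)
    return probs.reshape(H, W, -1)     # per-class probability maps
```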
Powerful priors allow us to perform inference with insufficient information. In this paper, we propose an autoregressive prior for 3D shapes to solve multimodal 3D tasks such as shape completion, reconstruction, and generation. We model the distribution over 3D shapes as a non-sequential autoregressive distribution over a discretized, low-dimensional, symbolic grid-like latent representation of 3D shapes. This enables us to represent distributions over 3D shapes conditioned on information from an arbitrary set of spatially anchored query locations and thus perform shape completion in such arbitrary settings (e.g., generating a complete chair given only a view of the back leg). We also show that the learned autoregressive prior can be leveraged for conditional tasks such as single-view reconstruction and language-based generation. This is achieved by learning task-specific naive conditionals which can be approximated by light-weight models trained on minimal paired data. We validate the effectiveness of the proposed method using both quantitative and qualitative evaluation and show that the proposed method outperforms the specialized state-of-the-art methods trained for individual tasks. The project page with code and video visualizations can be found at https://yccyenchicheng.github.io/AutoSDF/.
We propose a novel trimap-free video matting method based on the attention mechanism. Owing to the nature of the problem, most existing approaches use either multiple computationally expensive modules or complex algorithms to fully exploit temporal information. We design a temporal aggregation module to compute the temporal coherence between the current frame and its two previous frames.
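A minimal sketch of an attention-based temporal aggregation module of the kind described above, in PyTorch. The layer sizes, the use of scaled dot-product multi-head attention, and the residual fusion are illustrative assumptions rather than the authors' exact architecture.

```python
import torch
import torch.nn as nn

class TemporalAggregation(nn.Module):
    def __init__(self, channels, heads=4):
        # `channels` must be divisible by `heads`
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, feat_t, feat_t1, feat_t2):
        """Each input: (B, C, H, W) feature map; returns a fused (B, C, H, W) map."""
        B, C, H, W = feat_t.shape
        q = feat_t.flatten(2).transpose(1, 2)        # current frame: (B, HW, C)
        k1 = feat_t1.flatten(2).transpose(1, 2)      # previous frame t-1
        k2 = feat_t2.flatten(2).transpose(1, 2)      # previous frame t-2
        kv = torch.cat([k1, k2], dim=1)              # (B, 2HW, C)
        agg, _ = self.attn(self.norm(q), kv, kv)     # attend to the two past frames
        out = q + agg                                # residual fusion
        return out.transpose(1, 2).reshape(B, C, H, W)
```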
The intelligent robotics community usually organizes knowledge into symbolic and sub-symbolic levels. These two levels establish the set of symbols and the rules for manipulating knowledge based on them (the symbol system, or dictionary). Thus, the correspondences -- grounding or knowledge representation -- require specific software techniques for anchoring continuous and discrete state variables between these two levels. This paper presents the design and evaluation of an open-source tool called KANT (Knowledge mAnagemeNT) that lets the different components of the system architecture controlling the robot query, save, edit, and delete data in the Knowledge Base without having to worry about the type and the implementation of the underlying data source. Using KANT, components managing sub-symbolic information can smoothly interact with symbolic components. Moreover, the implementation mechanisms used in KANT, such as in-memory and non-SQL databases, improve the performance of knowledge management systems in the ROS middleware, as shown by the evaluations presented in this work.
Most methods for conditional video synthesis use a single modality as the condition. This comes with major limitations. For example, it is problematic for a model conditioned on an image to generate a specific motion trajectory desired by the user, since there is no means to provide motion information. Conversely, language information can describe the desired motion while not precisely defining the content of the video. This work presents a multimodal video generation framework that benefits from text and images provided jointly or separately. We leverage the recent progress in quantized representations for videos and apply a bidirectional transformer with multiple modalities as inputs to predict a discrete video representation. To improve video quality and consistency, we propose a new video token trained with self-learning and an improved mask-prediction algorithm for sampling video tokens. We introduce text augmentation to improve the robustness of the textual representation and the diversity of generated videos. Our framework can incorporate various visual modalities, such as segmentation masks, drawings, and partially occluded images. It can generate much longer sequences than those used for training. In addition, our model can extract visual information as suggested by the text prompt, e.g., "an object in image one is moving northeast", and generate corresponding videos. We run evaluations on three public datasets and a newly collected dataset labeled with facial attributes, achieving state-of-the-art generation results on all four.
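A minimal sketch of iterative mask-prediction sampling of discrete video tokens with a bidirectional transformer, in PyTorch. The model interface (per-position logits over the token vocabulary) and the cosine unmasking schedule are assumptions; this generic loop does not reproduce the improved algorithm described above.

```python
import math
import torch

@torch.no_grad()
def mask_predict_sample(model, seq_len, mask_id, steps=10, device="cpu"):
    """Start fully masked; progressively keep the most confident predictions."""
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long, device=device)
    positions = torch.arange(seq_len, device=device)
    for step in range(steps):
        logits = model(tokens)                       # (1, seq_len, vocab_size)
        probs = logits.softmax(-1)
        sampled = torch.multinomial(probs[0], 1).squeeze(-1)   # (seq_len,)
        conf = probs[0, positions, sampled]
        still_masked = tokens[0] == mask_id
        # already-decided tokens keep maximal confidence so they stay fixed
        conf = torch.where(still_masked, conf, torch.full_like(conf, float("inf")))
        # cosine schedule: fraction of positions left masked after this step
        frac = math.cos(math.pi / 2 * (step + 1) / steps)
        num_masked_next = int(frac * seq_len)
        order = torch.argsort(conf, descending=True)
        decided = order[: seq_len - num_masked_next]
        tokens = tokens.clone()
        tokens[0, decided] = torch.where(
            still_masked[decided], sampled[decided], tokens[0, decided]
        )
    return tokens
```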