Context is crucial for scene labeling, yet existing methods exploit only local context generated from a small area surrounding an image patch or pixel; long-range and global contextual information is ignored. To address this issue, we propose a novel approach to scene labeling based on multi-level contextual recurrent neural networks (ML-CRNNs). Specifically, we encode three kinds of contextual cues, i.e., local context, global context and image topic context, in structural recurrent neural networks (RNNs) to model long-range local and global dependencies in the image. In this way, our method is able to 'see' the image from both long-range local and holistic views, and to make a more reliable inference for image labeling. In addition, we integrate the proposed contextual RNNs into hierarchical convolutional neural networks (CNNs) and exploit dependence relationships at multiple levels to provide rich spatial and semantic information. Moreover, we adopt an attention model to effectively merge multiple levels and show that it outperforms average- or max-pooling fusion strategies. Extensive experiments demonstrate that the proposed approach achieves new state-of-the-art results on the CamVid, SiftFlow and Stanford-background datasets.
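The three fusion strategies compared in the abstract above (attention, average pooling, max pooling over levels) can be contrasted in a minimal sketch. It is illustrative only and is not the ML-CRNN implementation: the array shapes, the function name `fuse_levels`, and the random per-pixel attention logits (which in a real model would come from a learned attention branch) are all assumptions.

```python
import numpy as np

def softmax(x, axis=0):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_levels(score_maps, attention_logits):
    """Attention fusion of per-level class score maps.

    score_maps:       (L, H, W, C) class scores from L levels (assumed layout)
    attention_logits: (L, H, W) unnormalized per-pixel weight of each level
    returns:          (H, W, C) fused class scores
    """
    weights = softmax(attention_logits, axis=0)        # sums to 1 over levels
    return (weights[..., None] * score_maps).sum(axis=0)

rng = np.random.default_rng(0)
score_maps = rng.normal(size=(3, 4, 4, 5))   # 3 levels, 4x4 pixels, 5 classes
logits = rng.normal(size=(3, 4, 4))          # stand-in for learned attention

fused_attention = fuse_levels(score_maps, logits)
fused_average = score_maps.mean(axis=0)      # average-pooling baseline
fused_max = score_maps.max(axis=0)           # max-pooling baseline
```

With uniform logits the attention fusion reduces to average pooling, so averaging is a special case that the attention weights can depart from pixel by pixel.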
Pedestrian detection has arguably been addressed as a special topic beyond general object detection. Although recent deep learning object detectors such as Fast/Faster R-CNN [1, 2] have shown excellent performance for general object detection, they have had limited success in detecting pedestrians, and previous leading pedestrian detectors were in general hybrid methods combining hand-crafted and deep convolutional features. In this paper, we investigate issues involving Faster R-CNN for pedestrian detection. We discover that the Region Proposal Network (RPN) in Faster R-CNN indeed performs well as a stand-alone pedestrian detector, but surprisingly, the downstream classifier degrades the results. We argue that two reasons account for the unsatisfactory accuracy: (i) insufficient resolution of feature maps for handling small instances, and (ii) lack of any bootstrapping strategy for mining hard negative examples. Driven by these observations, we propose a very simple but effective baseline for pedestrian detection, using an RPN followed by boosted forests on shared, high-resolution convolutional feature maps. We comprehensively evaluate this method on several benchmarks (Caltech, INRIA, ETH, and KITTI), presenting competitive accuracy and good speed. Code will be made publicly available.
Knowledge representation is a key component in the success of all rule-based systems, including learning classifier systems (LCSs). This component determines how the problem space is partitioned, which in turn plays a prominent role in the generalization capacity of the system as a whole. Recently, the knowledge representation component has received a great deal of attention within data mining communities because of its impact on the efficiency and efficacy of rule-based systems. The current work attempts to give a comprehensive yet detailed view of the existing knowledge representation techniques in the LCS domain in general and in XCS in particular. To this end, knowledge representation techniques are grouped into categories based on the classification approach in which they are incorporated. For each category, the underlying rule representation schema and the format of the classifier condition that supports the corresponding representation are presented. Furthermore, a precise explanation of the way each technique partitions the problem space is provided, along with extensive experimental results. To give a detailed view of the functionality of each technique, a comparative analysis of existing techniques on some conventional problems is provided. We expect this survey to be of interest to LCS researchers and practitioners, since it provides a guideline for choosing a proper knowledge representation technique for a given problem and also opens up new streams of research on this topic.
Natural images are often affected by random noise, and image denoising has long been a central topic in computer vision. Many algorithms have been introduced to remove noise from natural images, such as Gaussian filtering, Wiener filtering and wavelet thresholding. However, many of these algorithms also remove or blur fine edges. Recently, many promising denoising algorithms have been introduced, such as Non-local Means, Fields of Experts, and BM3D. In this paper, we explore a Bayesian method of ensemble learning for image denoising. Ensemble methods seek to combine multiple different algorithms to retain the strengths of all methods and the weaknesses of none. The models combined in the Bayesian ensemble are Non-local Means and Fields of Experts, two very successful recent algorithms. Non-local Means presumes that the image contains an extensive amount of self-similarity. The Fields of Experts model extends the traditional Markov Random Field model by learning potential functions over extended pixel neighborhoods. The two models are implemented and image denoising is performed on natural images. The experimental results are compared with those of each single algorithm and are used to discuss ensemble learning and its approaches. Compared with the results of Non-local Means and Fields of Experts alone, ensemble learning showed an improvement of nearly 1 dB.
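The idea of combining two denoisers and measuring the gain in dB can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's Bayesian ensemble: the two estimates are synthetic stand-ins for Non-local Means and Fields of Experts outputs (a clean image plus independent Gaussian errors), the fusion rule is inverse-variance weighting, and the dB figure is assumed to refer to PSNR. The names `psnr` and `fuse_inverse_variance` are hypothetical helpers.

```python
import numpy as np

def psnr(reference, estimate, peak=255.0):
    # Peak signal-to-noise ratio in dB for images with the given peak value.
    mse = np.mean((reference - estimate) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def fuse_inverse_variance(estimates, error_variances):
    # Weight each estimate by the inverse of its (assumed known) error
    # variance; this is the posterior mean under independent Gaussian errors.
    weights = 1.0 / np.asarray(error_variances, dtype=float)
    weights /= weights.sum()
    return np.tensordot(weights, np.stack(estimates), axes=1)

rng = np.random.default_rng(0)
clean = np.tile(np.linspace(0.0, 255.0, 64), (64, 1))   # synthetic test image

# Stand-ins for two denoiser outputs with independent residual errors.
est_a = clean + rng.normal(0.0, 10.0, clean.shape)
est_b = clean + rng.normal(0.0, 12.0, clean.shape)

fused = fuse_inverse_variance([est_a, est_b], [10.0 ** 2, 12.0 ** 2])

for name, est in [("A", est_a), ("B", est_b), ("fused", fused)]:
    print(f"{name}: {psnr(clean, est):.2f} dB")
```

The gain in this sketch depends on the two residual errors being independent; real denoisers applied to the same noisy image produce correlated errors, so the achievable improvement is typically smaller.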
Proceedings of the Second International Workshop on Domain-Specific Languages and Models for Robotic Systems (DSLRob'11), held in conjunction with the 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2011), September 2011 in San Francisco, USA. The main topics of the workshop were Domain-Specific Languages (DSLs) and Model-driven Software Development (MDSD) for robotics. A domain-specific language (DSL) is a programming language dedicated to a particular problem domain that offers specific notations and abstractions that increase programmer productivity within that domain. Models offer a high-level way for domain users to specify the functionality of their system at the right level of abstraction. DSLs and models have historically been used for programming complex systems. Recently, however, they have garnered interest as a separate field of study. Robotic systems blend hardware and software in a holistic way that intrinsically raises many crosscutting concerns (concurrency, uncertainty, time constraints, ...), for which reason traditional general-purpose languages often lead to a poor fit between the language features and the implementation requirements. DSLs and models offer a powerful, systematic way to overcome this problem, enabling the programmer to quickly and precisely implement novel software solutions to complex problems.
This paper discusses some topics related to the latest trends in the field of evolutionary approaches to iris recognition. It presents the results of an exploratory experimental simulation whose goal was to analyze the possibility of establishing an Interchange Protocol for Digital Identities evolved in different geographic locations interconnected through and into an Intelligent Iris Verifier Distributed System (IIVDS) based on multi-enrollment. Finding a logically consistent model for the Interchange Protocol is the key factor in designing future large-scale iris biometric networks. Therefore, the logical model of such a protocol is also investigated here. All tests are performed on the Bath Iris Database and prove that an outstanding power of discrimination between intra- and inter-class comparisons can be achieved by an IIVDS, even when performing 52,759,182 inter-class and 10,991,943 intra-class comparisons. Still, the test results confirm that inconsistent enrollment can change the logic of recognition from a fuzzified 2-valent consistent logic of biometric certitudes to a fuzzified 3-valent inconsistent possibilistic logic of biometric beliefs justified through experimentally determined probabilities, or to a fuzzified 8-valent logic which is almost consistent as a biometric theory - this quality being counterbalanced by an absolutely reasonable loss in the user comfort level.
Automatic segmentation of medical images based on multiple modalities is an important topic for disease diagnosis. Although the convolutional neural network (CNN) has been proven to perform excellently in image segmentation tasks, it has difficulty capturing global information. The lack of global information seriously affects the accuracy of the segmentation results for the lesion area. In addition, there are differences in visual representation between multimodal data of the same patient. These differences affect the results of automatic segmentation methods. To solve these problems, we propose a segmentation method suitable for multimodal medical images that can capture global information, named TranSiam. TranSiam is a 2D dual-path network that extracts features of different modalities. In each path, we use convolution to extract detailed information in the low-level stage, and design an ICMT block to extract global information in the high-level stage. The ICMT block embeds convolution in the transformer, which allows it to extract global information while retaining spatial and detailed information. Furthermore, we design a novel fusion mechanism based on cross-attention and self-attention, called the TMM block, which can effectively fuse features between different modalities. On the BraTS 2019 and BraTS 2020 multimodal datasets, we achieve a significant improvement in accuracy over other popular methods.
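The cross-attention operation that the fusion mechanism above builds on can be sketched generically. This is plain scaled dot-product cross-attention, not the TMM block itself: the learned query, key and value projections are omitted (identity projections are an assumption), and the token counts, feature width and the function name `cross_attention` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_tokens, context_tokens):
    """Scaled dot-product cross-attention between two modalities.

    query_tokens:   (Na, d) features from modality A (used as queries)
    context_tokens: (Nb, d) features from modality B (used as keys and values)
    returns:        (Na, d) modality-B information gathered for each A token,
                    and the (Na, Nb) attention weights
    """
    d = query_tokens.shape[-1]
    scores = query_tokens @ context_tokens.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)       # each row sums to 1 over B tokens
    return weights @ context_tokens, weights

rng = np.random.default_rng(0)
feats_a = rng.normal(size=(6, 8))    # 6 tokens from one modality
feats_b = rng.normal(size=(10, 8))   # 10 tokens from another modality

attended, weights = cross_attention(feats_a, feats_b)
fused = feats_a + attended           # simple residual fusion (an assumption)
```

Self-attention is the special case in which the query and context tokens come from the same modality.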
Human action recognition has recently become one of the popular research topics in the computer vision community. Various 3D-CNN based methods have been presented to tackle both the spatial and temporal dimensions in the task of video action recognition, with competitive results. However, these methods suffer from some fundamental limitations such as a lack of robustness and generalization, e.g., how does the temporal ordering of video frames affect the recognition results? This work presents a novel end-to-end Transformer-based Directed Attention (DirecFormer) framework for robust action recognition. The method takes a simple but novel Transformer-based perspective on understanding the right order of actions in a sequence. The contributions of this work are three-fold. Firstly, we introduce the problem of ordered temporal learning to action recognition. Secondly, a new Directed Attention mechanism is introduced to understand and attend to human actions in the right order. Thirdly, we introduce conditional dependency in action sequence modeling that includes orders and classes. The proposed approach consistently achieves state-of-the-art (SOTA) results compared with recent action recognition methods on three standard large-scale benchmarks, i.e., Jester, Kinetics-400 and Something-Something-V2.
Existing knowledge-grounded dialogue systems typically use finetuned versions of a pretrained language model (LM) and large-scale knowledge bases. These models typically fail to generalize to topics outside the knowledge base, and require maintaining separate, potentially large checkpoints each time finetuning is needed. In this paper, we aim to address these limitations by leveraging the inherent knowledge stored in the pretrained LM as well as its powerful generation ability. We propose a multi-stage prompting approach to generate knowledgeable responses from a single pretrained LM. We first prompt the LM to generate knowledge based on the dialogue context. Then, we further prompt it to generate responses based on the dialogue context and the previously generated knowledge. Results show that our knowledge generator outperforms the state-of-the-art retrieval-based model by 5.8% when combining knowledge relevance and correctness. In addition, our multi-stage prompting outperforms the finetuning-based dialogue model in terms of response knowledgeability and engagement by up to 10% and 5%, respectively. Furthermore, we scale our model up to 530 billion parameters and show that larger LMs improve the generation correctness score by up to 10%, and response relevance, knowledgeability and engagement by up to 10%. Our code is available at: https://github.com/NVIDIA/Megatron-LM.
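The two prompting stages described above can be sketched as a small pipeline. This is a minimal illustration and not the Megatron-LM code: the prompt templates, the function names, and the `toy_lm` stand-in (a canned function used in place of a real pretrained LM) are all hypothetical.

```python
from typing import Callable, List, Tuple

def build_knowledge_prompt(context: List[str]) -> str:
    # Stage 1 prompt: ask for knowledge relevant to the dialogue so far.
    turns = "\n".join(context)
    return f"Dialogue:\n{turns}\nRelevant knowledge:"

def build_response_prompt(context: List[str], knowledge: str) -> str:
    # Stage 2 prompt: condition on both the dialogue and the generated knowledge.
    turns = "\n".join(context)
    return f"Dialogue:\n{turns}\nKnowledge: {knowledge}\nResponse:"

def multi_stage_respond(lm: Callable[[str], str],
                        context: List[str]) -> Tuple[str, str]:
    # The same LM is called twice; no finetuned checkpoint is involved.
    knowledge = lm(build_knowledge_prompt(context)).strip()
    response = lm(build_response_prompt(context, knowledge)).strip()
    return knowledge, response

def toy_lm(prompt: str) -> str:
    # Hypothetical stand-in for a pretrained LM, returning canned text.
    if prompt.endswith("Relevant knowledge:"):
        return " The Eiffel Tower is in Paris."
    return " It is in Paris, France."

context = ["User: Where is the Eiffel Tower?"]
knowledge, response = multi_stage_respond(toy_lm, context)
print(knowledge)
print(response)
```

Passing the LM in as a callable keeps the sketch independent of any particular model API; a real setup would also place few-shot examples in each prompt.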
Machine learning algorithms can produce biased outcomes or predictions, typically against minorities and under-represented sub-populations. Therefore, fairness is emerging as an important requirement for the large-scale application of machine learning based technologies. The most commonly used fairness notions (e.g. statistical parity, equalized odds, predictive parity, etc.) are observational and rely on mere correlation between variables. These notions fail to identify bias in the presence of statistical anomalies such as Simpson's or Berkson's paradoxes. Causality-based fairness notions (e.g. counterfactual fairness, no-proxy discrimination, etc.) are immune to such anomalies and hence more reliable for assessing fairness. The problem with causality-based fairness notions, however, is that they are defined in terms of quantities (e.g. causal, counterfactual, and path-specific effects) that are not always measurable. This is known as the identifiability problem and is the topic of a large body of work in the causal inference literature. This paper is a compilation of the major identifiability results that are of particular relevance for machine learning fairness. The results are illustrated using a large number of examples and causal graphs. The paper should be of particular interest to fairness researchers, practitioners, and policy makers who are considering the use of causality-based fairness notions, as it summarizes and illustrates the major identifiability results.
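The claim that observational notions can mislead under Simpson's paradox can be made concrete with a small worked example. The counts below are invented for illustration and do not come from the paper; the helper names are hypothetical. Statistical parity is measured as the difference in positive-outcome rates between group 1 and group 0.

```python
# Hypothetical counts: (accepted, total) per department and group.
counts = {
    "X": {0: (80, 100), 1: (18, 20)},
    "Y": {0: (4, 20), 1: (30, 100)},
}

def acceptance_rate(accepted, total):
    return accepted / total

def parity_difference_within(department):
    # Statistical parity difference inside one stratum: rate(1) - rate(0).
    g0 = acceptance_rate(*counts[department][0])
    g1 = acceptance_rate(*counts[department][1])
    return g1 - g0

def parity_difference_overall():
    # Statistical parity difference after pooling all departments.
    rates = {}
    for group in (0, 1):
        accepted = sum(counts[d][group][0] for d in counts)
        total = sum(counts[d][group][1] for d in counts)
        rates[group] = acceptance_rate(accepted, total)
    return rates[1] - rates[0]

for dept in counts:
    print(dept, round(parity_difference_within(dept), 3))
print("overall", round(parity_difference_overall(), 3))
```

Within each department group 1 is accepted more often, yet the pooled rates favor group 0, because the two groups apply to the departments in very different proportions. The observational measure alone cannot say which reading reflects discrimination; that depends on the causal role of the department variable.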