Monolingual word alignment is important for studying fine-grained editing operations (i.e., deletion, addition, and substitution) in text-to-text generation tasks such as paraphrase generation, text simplification, and neutralizing biased language. In this paper, we present a novel neural semi-Markov CRF alignment model, which unifies word and phrase alignments through variable-length spans. We also create a new benchmark with human annotations that cover four different text genres to evaluate monolingual word alignment models in more realistic settings. Experimental results show that our proposed model outperforms all previous approaches for monolingual word alignment as well as a competitive QA-based baseline, which was previously only applied to bilingual data. Our model demonstrates good generalizability to three out-of-domain datasets and shows great utility in two downstream applications: automatic text simplification and sentence pair classification tasks.
To fully exploit the advantages of massive multiple-input multiple-output (m-MIMO), accurate channel state information (CSI) is required at the transmitter. However, excessive CSI feedback for large antenna arrays is inefficient and thus undesirable in practical applications. By exploiting the inherent correlation characteristics of complex-valued channel responses in the angular-delay domain, we propose a novel neural network (NN) architecture, namely ENet, for CSI compression and feedback in m-MIMO. Although ENet processes the real and imaginary parts of the CSI values separately, its special structure allows a network trained only on the real part to be reused for the imaginary part. The proposed ENet shows enhanced performance with the network size reduced by nearly an order of magnitude compared to existing NN-based solutions. Experimental results verify the effectiveness of the proposed ENet.
While AI has benefited humans, it may also harm them if not developed appropriately. We conducted a literature review of current work on developing AI systems from an HCI perspective. Unlike other approaches, our focus is on the unique characteristics of AI technology and the differences between non-AI computing systems and AI systems. We further elaborate on the human-centered AI (HCAI) approach that we proposed in 2019. Our review and analysis highlight unique issues in developing AI systems that HCI professionals have not encountered in non-AI computing systems. To further enable the implementation of HCAI, we promote the research and application of human-AI interaction (HAII) as an interdisciplinary collaboration. There are many opportunities for HCI professionals to play a key role and make unique contributions in the main HAII areas that we identified. To support future HCI practice in the HAII area, we also offer enhanced HCI methods and strategic recommendations. In conclusion, we believe that promoting HAII research and application will further the implementation of HCAI, enabling HCI professionals to address the unique issues of AI systems and develop human-centered AI systems.
A super solution is a special type of generalized solution with a certain degree of robustness and stability. In this paper we consider the $(1,1)$-super solutions of the model RB. Using the first moment method, we establish a "threshold" such that as the constraint density crosses this value, the expected number of $(1,1)$-super solutions goes from $0$ to infinity.
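As a reminder of what the first moment method gives, the following is a minimal sketch. The symbols $X$ (the number of $(1,1)$-super solutions of a random instance), $r$ (the constraint density) and $r^{*}$ (the threshold) are notation assumed here for illustration and are not taken from the abstract.

$$
\mathbb{E}[X] \;=\; \sum_{\sigma} \Pr\bigl[\sigma \text{ is a } (1,1)\text{-super solution}\bigr],
\qquad
\Pr[X \ge 1] \;\le\; \mathbb{E}[X].
$$

The equality is linearity of expectation over all assignments $\sigma$, and the inequality is Markov's inequality for a non-negative integer-valued $X$. On the side of $r^{*}$ where $\mathbb{E}[X] \to 0$, a random instance therefore has no $(1,1)$-super solution with probability tending to one. On the side where $\mathbb{E}[X] \to \infty$, the first moment alone does not guarantee existence; that direction usually requires an additional argument such as a second moment bound.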
Mainstream crowd counting methods usually use a convolutional neural network (CNN) to regress a density map, which requires point-level annotations. However, annotating each person with a point is an expensive and laborious process. During the testing phase, point-level annotations are not used to evaluate counting accuracy, which means they are redundant. Hence, it is desirable to develop weakly-supervised counting methods that rely only on count-level annotations, a more economical way of labeling. Current weakly-supervised counting methods adopt a CNN to regress the total count of the crowd in an image-to-count paradigm. However, a limited receptive field for context modeling is an intrinsic limitation of these weakly-supervised CNN-based methods. These methods thus cannot achieve satisfactory performance, which limits their application in the real world. The Transformer is a popular sequence-to-sequence prediction model in NLP, which has a global receptive field. In this paper, we propose TransCrowd, which reformulates the weakly-supervised crowd counting problem from a sequence-to-count perspective based on the Transformer. We observe that the proposed TransCrowd can effectively extract semantic crowd information by using the self-attention mechanism of the Transformer. To the best of our knowledge, this is the first work to adopt a pure Transformer for crowd counting research. Experiments on five benchmark datasets demonstrate that the proposed TransCrowd achieves superior performance compared with all the weakly-supervised CNN-based counting methods and gains highly competitive counting performance compared with some popular fully-supervised counting methods. Code is available at https://github.com/dk-liang/TransCrowd.
We propose temporally abstract soft actor-critic (TASAC), an off-policy RL algorithm that incorporates closed-loop temporal abstraction into the soft actor-critic (SAC) framework in a simple manner. TASAC adds a second-stage binary policy to choose between the previous action and the action output by an SAC actor. It has two benefits compared to traditional off-policy RL algorithms: persistent exploration and an unbiased multi-step Q operator for TD learning. We demonstrate its advantages over several strong baselines across 5 different categories of 14 continuous control tasks, in terms of both sample efficiency and final performance. Because of its simplicity and generality, TASAC can serve as a drop-in replacement for SAC when temporal abstraction is needed.
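The two-stage action selection described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `two_stage_action`, `rollout`, `actor`, and `repeat_prob_fn` are hypothetical, and the binary policy is reduced to a single repeat probability that is re-evaluated at every step (the closed-loop part).

```python
import random


def two_stage_action(prev_action, candidate_action, repeat_prob, rng=random):
    """Second-stage binary choice: repeat the previous action or take the new one.

    Returns (action, repeated). On the first step there is no previous
    action, so the candidate is always used.
    """
    if prev_action is not None and rng.random() < repeat_prob:
        return prev_action, True
    return candidate_action, False


def rollout(actor, repeat_prob_fn, observations, rng=random):
    """Apply the two-stage policy to a sequence of observations.

    `actor` stands in for an SAC actor (hypothetical callable: obs -> action).
    `repeat_prob_fn` stands in for the binary policy
    (hypothetical callable: (obs, prev_action, candidate) -> probability).
    """
    prev = None
    actions = []
    for obs in observations:
        candidate = actor(obs)                      # stage 1: propose an action
        beta = repeat_prob_fn(obs, prev, candidate)  # stage 2: repeat or switch
        prev, _ = two_stage_action(prev, candidate, beta, rng)
        actions.append(prev)
    return actions
```

Because the repeat decision is made again at each step from the current observation, an action can be held for a variable number of steps without committing to a fixed repeat length in advance.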
In this paper, we propose a novel map for dense crowd localization and crowd counting. Most crowd counting methods use convolutional neural networks (CNNs) to regress a density map and have achieved significant progress recently. However, these regression-based methods are often unable to provide a precise location for each person, for two main reasons: 1) the density map consists of a series of blurry Gaussian blobs, and 2) severe overlaps exist in the dense regions of the density map. To tackle this issue, we propose a novel Focal Inverse Distance Transform (FIDT) map for crowd localization and counting. Compared with density maps, FIDT maps accurately describe people's locations, without overlap between nearby heads in dense regions. We perform crowd localization and counting simultaneously by regressing the FIDT map. Extensive experiments demonstrate that the proposed method outperforms state-of-the-art localization-based methods in crowd localization tasks and achieves very competitive performance compared with regression-based methods in counting tasks. In addition, the proposed method is strongly robust to negative samples and extremely dense scenes, which further verifies the effectiveness of the FIDT map. The code and models are available at https://github.com/dk-liang/FIDTM.
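To illustrate why an inverse-distance style map keeps nearby heads separate, here is a small sketch that builds such a map from point annotations and recovers the points as local maxima. The exponent form and the default values of `alpha`, `beta`, and `c` are assumptions chosen for illustration and are not stated in the abstract; the function names are hypothetical, and the brute-force distance computation is only suitable for small images.

```python
import numpy as np


def distance_to_nearest_point(shape, points):
    """Euclidean distance from every pixel to its nearest annotated head."""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    pts = np.asarray(points, dtype=float)  # rows of (y, x)
    d = np.sqrt((ys[..., None] - pts[:, 0]) ** 2 + (xs[..., None] - pts[:, 1]) ** 2)
    return d.min(axis=-1)


def inverse_distance_map(shape, points, alpha=0.02, beta=0.75, c=1.0):
    """Map that equals 1 at each head and decays with distance.

    alpha, beta and c are illustrative assumptions, not values from the source.
    """
    p = distance_to_nearest_point(shape, points)
    return 1.0 / (p ** (alpha * p + beta) + c)


def local_maxima(m, threshold=0.5):
    """Pixels that are >= all 8 neighbours and strictly above the threshold."""
    h, w = m.shape
    padded = np.pad(m, 1, mode="constant", constant_values=-np.inf)
    neighbours = [
        padded[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
        for dy in (-1, 0, 1)
        for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    ]
    is_peak = np.all([m >= n for n in neighbours], axis=0) & (m > threshold)
    return [tuple(int(v) for v in idx) for idx in np.argwhere(is_peak)]


# Two heads only two pixels apart still produce two distinct peaks.
heads = [(2, 2), (2, 4), (7, 7)]
fidt_like = inverse_distance_map((10, 10), heads)
detected = local_maxima(fidt_like)
count = len(detected)
```

In this sketch the count is simply the number of detected peaks, so localization and counting come from the same map.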
Reinforcement learning has been shown to be highly successful at many challenging tasks. However, success heavily relies on well-shaped rewards. Intrinsically motivated RL attempts to remove this constraint by defining an intrinsic reward function. Motivated by the self-consciousness concept in psychology, we make a natural assumption that the agent knows what constitutes itself, and propose a new intrinsic objective that encourages the agent to have maximum control over the environment. We mathematically formalize this reward as the mutual information between the agent state and the surrounding state under the current agent policy. With this new intrinsic motivation, we are able to outperform previous methods, including completing the pick-and-place task for the first time without using any task reward. A video showing experimental results is available at https://youtu.be/AUCwc9RThpk.
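For intuition about the quantity being maximized, the sketch below computes mutual information for a small discrete joint distribution. This is only an illustration of the definition: with continuous agent and surrounding states a learned estimator would typically be needed, and the abstract does not say which one is used. The function name and the toy tables are hypothetical.

```python
import math


def mutual_information(joint):
    """I(X; Y) in nats for a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]        # marginal over rows (X)
    py = [sum(col) for col in zip(*joint)]  # marginal over columns (Y)
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:  # terms with zero probability contribute nothing
                mi += p * math.log(p / (px[i] * py[j]))
    return mi


# Agent state tells us nothing about the surroundings: no control.
independent = [[0.25, 0.25], [0.25, 0.25]]
# Agent state fully determines the surroundings: maximal control for two states.
coupled = [[0.5, 0.0], [0.0, 0.5]]

mi_independent = mutual_information(independent)
mi_coupled = mutual_information(coupled)
```

The independent table gives zero mutual information, while the fully coupled table reaches the maximum for binary variables, $\ln 2$ nats.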
In robot sensing scenarios, instead of passively using human-captured views, an agent should be able to actively choose informative viewpoints of a 3D object as discriminative evidence to boost recognition accuracy. This task is referred to as active object recognition. Recent works on this task rely on a massive number of training examples to train an optimal view selection policy. But in realistic robot sensing scenarios, large-scale training data may not exist, and whether an intelligent view selection policy can still be learned from few object samples remains unclear. In this paper, we study this new problem, which is extremely challenging but very meaningful in robot sensing -- Few-shot Active Object Recognition, i.e., learning view selection policies from few object samples, which has not been considered or addressed before. We solve the proposed problem by adopting the framework of meta learning and name our method "MetaView". Extensive experiments on both category-level and instance-level classification tasks demonstrate that the proposed method can efficiently resolve issues that are hard for state-of-the-art active object recognition methods to handle, and outperforms several baselines by large margins.
Recently, particle-based variational inference (ParVI) methods have gained interest because they directly minimize the Kullback-Leibler divergence and do not suffer from approximation errors from the evidence-based lower bound. However, many ParVI approaches do not allow arbitrary sampling from the posterior, and the few that do allow such sampling suffer from suboptimality. This work proposes a new method for learning to approximately sample from the posterior distribution. We construct a neural sampler that is trained with the functional gradient of the KL-divergence between the empirical sampling distribution and the target distribution, assuming the gradient resides within a reproducing kernel Hilbert space. Our generative ParVI (GPVI) approach maintains the asymptotic performance of ParVI methods while offering the flexibility of a generative sampler. Through carefully constructed experiments, we show that GPVI outperforms previous generative ParVI methods such as amortized SVGD, and is competitive with ParVI as well as gold-standard approaches like Hamiltonian Monte Carlo for fitting both exactly known and intractable target distributions.
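As background for the kernelized functional gradient mentioned above, the following sketch runs standard SVGD, a well-known ParVI method, on a one-dimensional Gaussian target. It is not the GPVI neural sampler from the abstract: particles are updated directly rather than through a generator network, and the target, bandwidth, step size, and function names are assumptions made for illustration.

```python
import numpy as np


def svgd_direction(x, grad_log_p, bandwidth=1.0):
    """Standard SVGD update direction for 1-D particles x of shape (n,).

    phi(x_i) = mean_j [ k(x_j, x_i) * grad_log_p(x_j) + d/dx_j k(x_j, x_i) ]
    with an RBF kernel k(a, b) = exp(-(a - b)^2 / (2 * bandwidth^2)).
    """
    diff = x[:, None] - x[None, :]                 # diff[j, i] = x_j - x_i
    k = np.exp(-diff ** 2 / (2 * bandwidth ** 2))  # k[j, i] = k(x_j, x_i)
    grad_k = -diff / bandwidth ** 2 * k            # derivative w.r.t. x_j (repulsion)
    return (k * grad_log_p(x)[:, None] + grad_k).mean(axis=0)


def run_svgd(n_particles=50, steps=500, step_size=0.1, target_mean=2.0):
    """Move particles toward a unit-variance Gaussian centred at target_mean."""
    grad_log_p = lambda x: -(x - target_mean)  # score of N(target_mean, 1)
    x = np.linspace(-1.0, 1.0, n_particles)    # deterministic initial particles
    for _ in range(steps):
        x = x + step_size * svgd_direction(x, grad_log_p)
    return x


particles = run_svgd()
```

The first term in the direction pulls particles toward high-density regions of the target, and the kernel-gradient term pushes them apart so that they do not collapse onto the mode. A generative variant would instead use such a direction as a training signal for the parameters of a sampler network.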