Rich user behavior data has been proven to be of great value for click-through rate prediction tasks, especially in industrial applications such as recommender systems and online advertising. Both industry and academy have paid much attention to this topic and propose different approaches to modeling with long sequential user behavior data. Among them, memory network based model MIMN proposed by Alibaba, achieves SOTA with the co-design of both learning algorithm and serving system. MIMN is the first industrial solution that can model sequential user behavior data with length scaling up to 1000. However, MIMN fails to precisely capture user interests given a specific candidate item when the length of user behavior sequence increases further, say, by 10 times or more. This challenge exists widely in previously proposed approaches. In this paper, we tackle this problem by designing a new modeling paradigm, which we name as Search-based Interest Model (SIM). SIM extracts user interests with two cascaded search units: (i) General Search Unit acts as a general search from the raw and arbitrary long sequential behavior data, with query information from candidate item, and gets a Sub user Behavior Sequence which is relevant to candidate item; (ii) Exact Search Unit models the precise relationship between candidate item and SBS. This cascaded search paradigm enables SIM with a better ability to model lifelong sequential behavior data in both scalability and accuracy. Apart from the learning algorithm, we also introduce our hands-on experience on how to implement SIM in large scale industrial systems. Since 2019, SIM has been deployed in the display advertising system in Alibaba, bringing 7.1\% CTR and 4.4\% RPM lift, which is significant to the business. Serving the main traffic in our real system now, SIM models user behavior data with maximum length reaching up to 54000, pushing SOTA to 54x.
Pedestrian detection has been heavily studied in the last decade due to its wide applications. Despite incremental progress, several distracting-factors in the aspect of geometry and appearance still remain. In this paper, we first analyze these impeding factors and their effect on the general region-based detection framework. We then present a novel model that is resistant to these factors by incorporating methods that are not solely restricted to pedestrian detection domain. Specifically, to address the geometry distraction, we design a novel coulomb loss as a regulator on bounding box regression, in which proposals are attracted by their target instance and repelled by the adjacent non-target instances. For appearance distraction, we propose an efficient semantic-driven strategy for selecting anchor locations, which can sample informative negative examples at training phase for classification refinement. Our detector can be trained in an end-to-end manner, and achieves consistently high performance on both the Caltech-USA and CityPersons benchmarks. Code will be publicly available upon publication.
Person search by natural language aims at retrieving a specific person in a large-scale image pool that matches the given textual descriptions. While most of the current methods treat the task as a holistic visual and textual feature matching one, we approach it from an attribute-aligning perspective that allows grounding specific attribute phrases to the corresponding visual regions. We achieve success as well as the performance boosting by a robust feature learning that the referred identity can be accurately bundled by multiple attribute visual cues. To be concrete, our Visual-Textual Attribute Alignment model (dubbed as ViTAA) learns to disentangle the feature space of a person into subspaces corresponding to attributes using a light auxiliary attribute segmentation computing branch. It then aligns these visual features with the textual attributes parsed from the sentences by using a novel contrastive learning loss. Upon that, we validate our ViTAA framework through extensive experiments on tasks of person search by natural language and by attribute-phrase queries, on which our system achieves state-of-the-art performances. Code will be publicly available upon publication.
As an important type of reinforcement learning algorithms, actor-critic (AC) and natural actor-critic (NAC) algorithms are often executed in two ways for finding optimal policies. In the first nested-loop design, actor's one update of policy is followed by an entire loop of critic's updates of the value function, and the finite-sample analysis of such AC and NAC algorithms have been recently well established. The second two time-scale design, in which actor and critic update simultaneously but with different learning rates, has much fewer tuning parameters than the nested-loop design and is hence substantially easier to implement. Although two time-scale AC and NAC have been shown to converge in the literature, the finite-sample convergence rate has not been established. In this paper, we provide the first such non-asymptotic convergence rate for two time-scale AC and NAC under Markovian sampling and with actor having general policy class approximation. We show that two time-scale AC requires the overall sample complexity at the order of $\mathcal{O}(\epsilon^{-2.5}\log^3(\epsilon^{-1}))$ to attain an $\epsilon$-accurate stationary point, and two time-scale NAC requires the overall sample complexity at the order of $\mathcal{O}(\epsilon^{-4}\log^2(\epsilon^{-1}))$ to attain an $\epsilon$-accurate global optimal point. We develop novel techniques for bounding the bias error of the actor due to dynamically changing Markovian sampling and for analyzing the convergence rate of the linear critic with dynamically changing base functions and transition kernel.
The actor-critic (AC) algorithm is a popular method to find an optimal policy in reinforcement learning. The finite-sample convergence rate for the AC and natural actor-critic (NAC) algorithms has been established recently, but under independent and identically distributed (i.i.d.) sampling and single-sample update at each iteration. In contrast, this paper characterizes the convergence rate and sample complexity of AC and NAC under Markovian sampling, with mini-batch data for each iteration, and with actor having general policy class approximation. We show that the overall sample complexity for a mini-batch AC to attain an $\epsilon$-accurate stationary point improves the best known sample complexity of AC by an order of $\mathcal{O}(\frac{1}{\epsilon}\log(\frac{1}{\epsilon}))$. We also show that the overall sample complexity for a mini-batch NAC to attain an $\epsilon$-accurate globally optimal point improves the known sample complexity of natural policy gradient (NPG) by $\mathcal{O}(\frac{1}{\epsilon}/\log(\frac{1}{\epsilon}))$. Our study develops several novel techniques for finite-sample analysis of RL algorithms including handling the bias error due to mini-batch Markovian sampling and exploiting the self variance reduction property to improve the convergence analysis of NAC.
Learning the change of statistical dependencies between random variables is an essential task for many real-life applications, mostly in the high dimensional low sample regime. In this paper, we propose a novel differential parameter estimator that, in comparison to current methods, simultaneously allows (a) the flexible integration of multiple sources of information (data samples, variable groupings, extra pairwise evidence, etc.), (b) being scalable to a large number of variables, and (c) achieving a sharp asymptotic convergence rate. Our experiments, on more than 100 simulated and two real-world datasets, validate the flexibility of our approach and highlight the benefits of integrating spatial and anatomic information for brain connectome change discovery and epigenetic network identification.
Since December 2019, the coronavirus disease 2019 (COVID-19) has spread rapidly across China. As at the date of writing this article, the disease has been globally reported in 100 countries, infected over 100,000 people and caused over 3,000 deaths. Avoiding person-to-person transmission is an effective approach to control and prevent the epidemic. However, many daily activities, such as logistics transporting goods in our daily life, inevitably involve person-to-person contact. To achieve contact-less goods transportation, using an autonomous logistic vehicle has become the preferred choice. This article presents Hercules, an autonomous logistic vehicle used for contact-less goods transportation during the outbreak of COVID-19. The vehicle is designed with autonomous navigation capability. We provide details on the hardware and software, as well as the algorithms to achieve autonomous navigation including perception, planning and control. This paper is accompanied by a demonstration video and a dataset, which are available here: https://sites.google.com/view/contact-less-transportation.
In this paper, we attempt to solve the domain adaptation problem for deep stereo matching networks. Instead of resorting to black-box structures or layers to find implicit connections across domains, we focus on investigating adaptation gaps for stereo matching. By visual inspections and extensive experiments, we conclude that low-level aligning is crucial for adaptive stereo matching, since main gaps across domains lie in the inconsistent input color and cost volume distributions. Correspondingly, we design a bottom-up domain adaptation method, in which two particular approaches are proposed, i.e. color transfer and cost regularization, that can be easily integrated into existing stereo matching models. The color transfer enables transferring a large amount of synthetic data to the same color spaces with target domains during training. The cost regularization can further constrain both the lower-layer features and cost volumes to domain-invariant distributions. Although our proposed strategies are simple and have no parameters to learn, they do improve the generalization ability of existing disparity networks by a large margin. We conduct experiments across multiple datasets, including Scene Flow, KITTI, Middlebury, ETH3D and DrivingStereo. Without whistles and bells, our synthetic-data pretrained models achieve state-of-the-art cross-domain performance compared to previous domain-invariant methods, even outperform state-of-the-art disparity networks fine-tuned with target domain ground-truths on multiple stereo matching benchmarks.
Monocular estimation of 3d human pose has attracted increased attention with the availability of large ground-truth motion capture datasets. However, the diversity of training data available is limited and it is not clear to what extent methods generalize outside the specific datasets they are trained on. In this work we carry out a systematic study of the diversity and biases present in specific datasets and its effect on cross-dataset generalization across a compendium of 5 pose datasets. We specifically focus on systematic differences in the distribution of camera viewpoints relative to a body-centered coordinate frame. Based on this observation, we propose an auxiliary task of predicting the camera viewpoint in addition to pose. We find that models trained to jointly predict viewpoint and pose systematically show significantly improved cross-dataset generalization.