Understanding the customers' high level shopping intent, such as their desire to go camping or hold a birthday party, is critically important for an E-commerce platform; it can help boost the quality of shopping experience by enabling provision of more relevant, explainable, and diversified recommendations. However, such high level shopping intent has been overlooked in the industry due to practical challenges. In this work, we introduce Amazon's new system that explicitly identifies and utilizes each customer's high level shopping intents for personalizing recommendations. We develop a novel technique that automatically identifies various high level goals being pursued by the Amazon customers, such as "go camping", and "preparing for a beach party". Our solution is in a scalable fashion (in 14 languages across 21 countries). Then a deep learning model maps each customer's online behavior, e.g. product search and individual item engagements, into a subset of high level shopping intents. Finally, a realtime ranker considers both the identified intents as well as the granular engagements to present personalized intent-aware recommendations. Extensive offline analysis ensures accuracy and relevance of the new recommendations and we further observe an 10% improvement in the business metrics. This system is currently serving online traffic at amazon.com, powering several production features, driving significant business impacts
Multimodal magnetic resonance imaging (MRI) can reveal different patterns of human tissue and is crucial for clinical diagnosis. However, limited by cost, noise and manual labeling, obtaining diverse and reliable multimodal MR images remains a challenge. For the same lesion, different MRI manifestations have great differences in background information, coarse positioning and fine structure. In order to obtain better generation and segmentation performance, a coordination-spatial attention generation adversarial network (CASP-GAN) based on the cycle-consistent generative adversarial network (CycleGAN) is proposed. The performance of the generator is optimized by introducing the Coordinate Attention (CA) module and the Spatial Attention (SA) module. The two modules can make full use of the captured location information, accurately locating the interested region, and enhancing the generator model network structure. The ability to extract the structure information and the detailed information of the original medical image can help generate the desired image with higher quality. There exist some problems in the original CycleGAN that the training time is long, the parameter amount is too large, and it is difficult to converge. In response to this problem, we introduce the Coordinate Attention (CA) module to replace the Res Block to reduce the number of parameters, and cooperate with the spatial information extraction network above to strengthen the information extraction ability. On the basis of CASP-GAN, an attentional generative cross-modality segmentation (AGCMS) method is further proposed. This method inputs the modalities generated by CASP-GAN and the real modalities into the segmentation network for brain tumor segmentation. Experimental results show that CASP-GAN outperforms CycleGAN and some state-of-the-art methods in PSNR, SSMI and RMSE in most tasks.
Integrated sensing and communication (ISAC) unifies sensing and communication, and improves the efficiency of the spectrum, energy, and hardware. In this work, we investigate the ISAC beamforming design to maximize the mutual information between the target response matrix of a point radar target and the echo signals, while ensuring the data rate requirements of communication users. In order to study the impact of the echo interference caused by communication users on sensing performance, we consider three types of echo interference problems caused by a single communication user, including no interference, a point interference, and an extended interference, as well as the problem of an extended interference caused by multiple communication users. To address these problems, we provide a closed-form solution with low complexiy, a semidefinite relaxation (SDR) method, a low-complexity algorithm based on the Majorization-Minimization (MM) method and the successive convex approximation (SCA) method, and an algorithm based on MM method and SCA method, respectively. Numerical results demonstrate that, compared to the ISAC beamforming schemes based on the Cram\'er-Rao bound (CRB) metric and the beampattern metric, the proposed maximizing mutual information metric can bring better beampattern and root mean square error (RMSE) of angle estimation. Apart from this, our proposed schemes based on the mutual information can suppress the echo interference from the communication users effectively.
Histopathology images are gigapixel-sized and include features and information at different resolutions. Collecting annotations in histopathology requires highly specialized pathologists, making it expensive and time-consuming. Self-training can alleviate annotation constraints by learning from both labeled and unlabeled data, reducing the amount of annotations required from pathologists. We study the design of teacher-student self-training systems for Non-alcoholic Steatohepatitis (NASH) using clinical histopathology datasets with limited annotations. We evaluate the models on in-distribution and out-of-distribution test data under clinical data shifts. We demonstrate that through self-training, the best student model statistically outperforms the teacher with a $3\%$ absolute difference on the macro F1 score. The best student model also approaches the performance of a fully supervised model trained with twice as many annotations.
Integrated sensing and communication (ISAC) unifies radar sensing and communication, and improves the efficiency of the spectrum, energy and hardware. In this paper, we investigate the ISAC beamforming design in a downlink system with a communication user and a point radar target to be perceived. The design criterion for radar sensing is the mutual information between the target response matrix and the echo signals, while the single user transmission rate is used as the performance metric of the communication. Two scenarios without and with interference from the communication user are investigated. A closed-form solution with low complexity and a solution based on the semidefinite relaxation (SDR) method are provided to solve these two problems, respectively. Numerical results demonstrate that, compared to the results obtained by SDR based on the CVX toolbox, the closed-form solution is tight in the feasible region. In addition, we show that the SDR method obtains an optimal solution.
Customized keyword spotting (KWS) has great potential to be deployed on edge devices to achieve hands-free user experience. However, in real applications, false alarm (FA) would be a serious problem for spotting dozens or even hundreds of keywords, which drastically affects user experience. To solve this problem, in this paper, we leverage the recent advances in transducer and transformer based acoustic models and propose a new multi-stage customized KWS framework named Cascaded Transducer-Transformer KWS (CaTT-KWS), which includes a transducer based keyword detector, a frame-level phone predictor based force alignment module and a transformer based decoder. Specifically, the streaming transducer module is used to spot keyword candidates in audio stream. Then force alignment is implemented using the phone posteriors predicted by the phone predictor to finish the first stage keyword verification and refine the time boundaries of keyword. Finally, the transformer decoder further verifies the triggered keyword. Our proposed CaTT-KWS framework reduces FA rate effectively without obviously hurting keyword recognition accuracy. Specifically, we can get impressively 0.13 FA per hour on a challenging dataset, with over 90% relative reduction on FA comparing to the transducer based detection model, while keyword recognition accuracy only drops less than 2%.
Industrial recommender systems usually hold data from multiple business scenarios and are expected to provide recommendation services for these scenarios simultaneously. In the retrieval step, the topK high-quality items selected from a large number of corpus usually need to be various for multiple scenarios. Take Alibaba display advertising system for example, not only because the behavior patterns of Taobao users are diverse, but also differentiated scenarios' bid prices assigned by advertisers vary significantly. Traditional methods either train models for each scenario separately, ignoring the cross-domain overlapping of user groups and items, or simply mix all samples and maintain a shared model which makes it difficult to capture significant diversities between scenarios. In this paper, we present Adaptive Domain Interest network that adaptively handles the commonalities and diversities across scenarios, making full use of multi-scenarios data during training. Then the proposed method is able to improve the performance of each business domain by giving various topK candidates for different scenarios during online inference. Specifically, our proposed ADI models the commonalities and diversities for different domains by shared networks and domain-specific networks, respectively. In addition, we apply the domain-specific batch normalization and design the domain interest adaptation layer for feature-level domain adaptation. A self training strategy is also incorporated to capture label-level connections across domains.ADI has been deployed in the display advertising system of Alibaba, and obtains 1.8% improvement on advertising revenue.
Visual-based target tracking is easily influenced by multiple factors, such as background clutter, targets fast-moving, illumination variation, object shape change, occlusion, etc. These factors influence the tracking accuracy of a target tracking task. To address this issue, an efficient real-time target tracking method based on a low-dimension adaptive feature fusion is proposed to allow us the simultaneous implementation of the high-accuracy and real-time target tracking. First, the adaptive fusion of a histogram of oriented gradient (HOG) feature and color feature is utilized to improve the tracking accuracy. Second, a convolution dimension reduction method applies to the fusion between the HOG feature and color feature to reduce the over-fitting caused by their high-dimension fusions. Third, an average correlation energy estimation method is used to extract the relative confidence adaptive coefficients to ensure tracking accuracy. We experimentally confirm the proposed method on an OTB100 data set. Compared with nine popular target tracking algorithms, the proposed algorithm gains the highest tracking accuracy and success tracking rate. Compared with the traditional Sum of Template and Pixel-wise LEarners (STAPLE) algorithm, the proposed algorithm can obtain a higher success rate and accuracy, improving by 0.023 and 0.019, respectively. The experimental results also demonstrate that the proposed algorithm can reach the real-time target tracking with 50 fps. The proposed method paves a more promising way for real-time target tracking tasks under a complex environment, such as appearance deformation, illumination change, motion blur, background, similarity, scale change, and occlusion.
Sequential recommendation aims to model dynamic user behavior from historical interactions. Self-attentive methods have proven effective at capturing short-term dynamics and long-term preferences. Despite their success, these approaches still struggle to model sparse data, on which they struggle to learn high-quality item representations. We propose to model user dynamics from shopping intents and interacted items simultaneously. The learned intents are coarse-grained and work as prior knowledge for item recommendation. To this end, we present a coarse-to-fine self-attention framework, namely CaFe, which explicitly learns coarse-grained and fine-grained sequential dynamics. Specifically, CaFe first learns intents from coarse-grained sequences which are dense and hence provide high-quality user intent representations. Then, CaFe fuses intent representations into item encoder outputs to obtain improved item representations. Finally, we infer recommended items based on representations of items and corresponding intents. Experiments on sparse datasets show that CaFe outperforms state-of-the-art self-attentive recommenders by 44.03% NDCG@5 on average.
This paper studies the item-to-item recommendation problem in recommender systems from a new perspective of metric learning via implicit feedback. We develop and investigate a personalizable deep metric model that captures both the internal contents of items and how they were interacted with by users. There are two key challenges in learning such model. First, there is no explicit similarity annotation, which deviates from the assumption of most metric learning methods. Second, these approaches ignore the fact that items are often represented by multiple sources of meta data and different users use different combinations of these sources to form their own notion of similarity. To address these challenges, we develop a new metric representation embedded as kernel parameters of a probabilistic model. This helps express the correlation between items that a user has interacted with, which can be used to predict user interaction with new items. Our approach hinges on the intuition that similar items induce similar interactions from the same user, thus fitting a metric-parameterized model to predict an implicit feedback signal could indirectly guide it towards finding the most suitable metric for each user. To this end, we also analyze how and when the proposed method is effective from a theoretical lens. Its empirical effectiveness is also demonstrated on several real-world datasets.