Understanding objects is a central building block of artificial intelligence, especially for embodied AI. Even though object recognition excels with deep learning, current machines still struggle to learn higher-level knowledge, e.g., what attributes an object has, and what can we do with an object. In this work, we propose a challenging Object Concept Learning (OCL) task to push the envelope of object understanding. It requires machines to reason out object affordances and simultaneously give the reason: what attributes make an object possesses these affordances. To support OCL, we build a densely annotated knowledge base including extensive labels for three levels of object concept (category, attribute, affordance), and the causal relations of three levels. By analyzing the causal structure of OCL, we present a baseline, Object Concept Reasoning Network (OCRN). It leverages causal intervention and concept instantiation to infer the three levels following their causal relations. In experiments, OCRN effectively infers the object knowledge while following the causalities well. Our data and code are available at https://mvig-rhos.com/ocl.
Cold start is an essential and persistent problem in recommender systems. State-of-the-art solutions rely on training hybrid models for both cold-start and existing users/items, based on the auxiliary information. Such a hybrid model would compromise the performance of existing users/items, which might make these solutions not applicable in real-worlds recommender systems where the experience of existing users/items must be guaranteed. Meanwhile, graph neural networks (GNNs) have been demonstrated to perform effectively warm (non-cold-start) recommendations. However, they have never been applied to handle the cold-start problem in a user-item bipartite graph. This is a challenging but rewarding task since cold-start users/items do not have links. Besides, it is nontrivial to design an appropriate GNN to conduct cold-start recommendations while maintaining the performance for existing users/items. To bridge the gap, we propose a tailored GNN-based framework (GPatch) that contains two separate but correlated components. First, an efficient GNN architecture -- GWarmer, is designed to model the warm users/items. Second, we construct correlated Patching Networks to simulate and patch GWarmer by conducting cold-start recommendations. Experiments on benchmark and large-scale commercial datasets demonstrate that GPatch is significantly superior in providing recommendations for both existing and cold-start users/items.
Recent studies on Click-Through Rate (CTR) prediction has reached new levels by modeling longer user behavior sequences. Among others, the two-stage methods stand out as the state-of-the-art (SOTA) solution for industrial applications. The two-stage methods first train a retrieval model to truncate the long behavior sequence beforehand and then use the truncated sequences to train a CTR model. However, the retrieval model and the CTR model are trained separately. So the retrieved subsequences in the CTR model is inaccurate, which degrades the final performance. In this paper, we propose an end-to-end paradigm to model long behavior sequences, which is able to achieve superior performance along with remarkable cost-efficiency compared to existing models. Our contribution is three-fold: First, we propose a hashing-based efficient target attention (TA) network named ETA-Net to enable end-to-end user behavior retrieval based on low-cost bit-wise operations. The proposed ETA-Net can reduce the complexity of standard TA by orders of magnitude for sequential data modeling. Second, we propose a general system architecture as one viable solution to deploy ETA-Net on industrial systems. Particularly, ETA-Net has been deployed on the recommender system of Taobao, and brought 1.8% lift on CTR and 3.1% lift on Gross Merchandise Value (GMV) compared to the SOTA two-stage methods. Third, we conduct extensive experiments on both offline datasets and online A/B test. The results verify that the proposed model outperforms existing CTR models considerably, in terms of both CTR prediction performance and online cost-efficiency. ETA-Net now serves the main traffic of Taobao, delivering services to hundreds of millions of users towards billions of items every day.
Model order estimation (MOE) is often a pre-requisite for Direction of Arrival (DoA) estimation. Due to limits imposed by array geometry, it is typically not possible to estimate spatial parameters for an arbitrary number of sources; an estimate of the signal model is usually required. MOE is the process of selecting the most likely signal model from several candidates. While classic methods fail at MOE in the presence of coherent multipath interference, data-driven supervised learning models can solve this problem. Instead of the classic MLP (Multiple Layer Perceptions) or CNN (Convolutional Neural Networks) architectures, we propose the application of Residual Convolutional Neural Networks (RCNN), with grouped symmetric kernel filters to deliver state-of-art estimation accuracy of up to 95.2\% in the presence of coherent multipath, and a weighted loss function to eliminate underestimation error of the model order. We show the benefit of the approach by demonstrating its impact on an overall signal processing flow that determines the number of total signals received by the array, the number of independent sources, and the association of each of the paths with those sources . Moreover, we show that the proposed estimator provides accurate performance over a variety of array types, can identify the overloaded scenario, and ultimately provides strong DoA estimation and signal association performance.
Long-tailed image recognition presents massive challenges to deep learning systems since the imbalance between majority (head) classes and minority (tail) classes severely skews the data-driven deep neural networks. Previous methods tackle with data imbalance from the viewpoints of data distribution, feature space, and model design, etc.In this work, instead of directly learning a recognition model, we suggest confronting the bottleneck of head-to-tail bias before classifier learning, from the previously omitted perspective of balancing label space. To alleviate the head-to-tail bias, we propose a concise paradigm by progressively adjusting label space and dividing the head classes and tail classes, dynamically constructing balance from imbalance to facilitate the classification. With flexible data filtering and label space mapping, we can easily embed our approach to most classification models, especially the decoupled training methods. Besides, we find the separability of head-tail classes varies among different features with different inductive biases. Hence, our proposed model also provides a feature evaluation method and paves the way for long-tailed feature learning. Extensive experiments show that our method can boost the performance of state-of-the-arts of different types on widely-used benchmarks. Code is available at https://github.com/silicx/DLSA.
Waterfall Recommender System (RS), a popular form of RS in mobile applications, is a stream of recommended items consisting of successive pages that can be browsed by scrolling. In waterfall RS, when a user finishes browsing a page, the edge (e.g., mobile phones) would send a request to the cloud server to get a new page of recommendations, known as the paging request mechanism. RSs typically put a large number of items into one page to reduce excessive resource consumption from numerous paging requests, which, however, would diminish the RSs' ability to timely renew the recommendations according to users' real-time interest and lead to a poor user experience. Intuitively, inserting additional requests inside pages to update the recommendations with a higher frequency can alleviate the problem. However, previous attempts, including only non-adaptive strategies (e.g., insert requests uniformly), would eventually lead to resource overconsumption. To this end, we envision a new learning task of edge intelligence named Intelligent Request Strategy Design (IRSD). It aims to improve the effectiveness of waterfall RSs by determining the appropriate occasions of request insertion based on users' real-time intention. Moreover, we propose a new paradigm of adaptive request insertion strategy named Uplift-based On-edge Smart Request Framework (AdaRequest). AdaRequest 1) captures the dynamic change of users' intentions by matching their real-time behaviors with their historical interests based on attention-based neural networks. 2) estimates the counterfactual uplift of user purchase brought by an inserted request based on causal inference. 3) determines the final request insertion strategy by maximizing the utility function under online resource constraints. We conduct extensive experiments on both offline dataset and online A/B test to verify the effectiveness of AdaRequest.
User historical behaviors are proved useful for Click Through Rate (CTR) prediction in online advertising system. In Meituan, one of the largest e-commerce platform in China, an item is typically displayed with its image and whether a user clicks the item or not is usually influenced by its image, which implies that user's image behaviors are helpful for understanding user's visual preference and improving the accuracy of CTR prediction. Existing user image behavior models typically use a two-stage architecture, which extracts visual embeddings of images through off-the-shelf Convolutional Neural Networks (CNNs) in the first stage, and then jointly trains a CTR model with those visual embeddings and non-visual features. We find that the two-stage architecture is sub-optimal for CTR prediction. Meanwhile, precisely labeled categories in online ad systems contain abundant visual prior information, which can enhance the modeling of user image behaviors. However, off-the-shelf CNNs without category prior may extract category unrelated features, limiting CNN's expression ability. To address the two issues, we propose a hybrid CNN based attention module, unifying user's image behaviors and category prior, for CTR prediction. Our approach achieves significant improvements in both online and offline experiments on a billion scale real serving dataset.
The recently proposed Graph Convolutional Networks (GCNs) have achieved significantly superior performance on various graph-related tasks, such as node classification and recommendation. However, currently researches on GCN models usually recursively aggregate the information from all the neighbors or randomly sampled neighbor subsets, without explicitly identifying whether the aggregated neighbors provide useful information during the graph convolution. In this paper, we theoretically analyze the affection of the neighbor quality over GCN models' performance and propose the Neighbor Enhanced Graph Convolutional Network (NEGCN) framework to boost the performance of existing GCN models. Our contribution is three-fold. First, we at the first time propose the concept of neighbor quality for both node classification and recommendation tasks in a general theoretical framework. Specifically, for node classification, we propose three propositions to theoretically analyze how the neighbor quality affects the node classification performance of GCN models. Second, based on the three proposed propositions, we introduce the graph refinement process including specially designed neighbor evaluation methods to increase the neighbor quality so as to boost both the node classification and recommendation tasks. Third, we conduct extensive node classification and recommendation experiments on several benchmark datasets. The experimental results verify that our proposed NEGCN framework can significantly enhance the performance for various typical GCN models on both node classification and recommendation tasks.
Human activity understanding is of widespread interest in artificial intelligence and spans diverse applications like health care and behavior analysis. Although there have been advances with deep learning, it remains challenging. The object recognition-like solutions usually try to map pixels to semantics directly, but activity patterns are much different from object patterns, thus hindering another success. In this work, we propose a novel paradigm to reformulate this task in two-stage: first mapping pixels to an intermediate space spanned by atomic activity primitives, then programming detected primitives with interpretable logic rules to infer semantics. To afford a representative primitive space, we build a knowledge base including 26+ M primitive labels and logic rules from human priors or automatic discovering. Our framework, Human Activity Knowledge Engine (HAKE), exhibits superior generalization ability and performance upon canonical methods on challenging benchmarks. Code and data are available at http://hake-mvig.cn/.
Attributes and objects can compose diverse compositions. To model the compositional nature of these concepts, it is a good choice to learn them as transformations, e.g., coupling and decoupling. However, complex transformations need to satisfy specific principles to guarantee rationality. Here, we first propose a previously ignored principle of attribute-object transformation: Symmetry. For example, coupling peeled-apple with attribute peeled should result in peeled-apple, and decoupling peeled from apple should still output apple. Incorporating the symmetry, we propose a transformation framework inspired by group theory, i.e., SymNet. It consists of two modules: Coupling Network and Decoupling Network. We adopt deep neural networks to implement SymNet and train it in an end-to-end paradigm with the group axioms and symmetry as objectives. Then, we propose a Relative Moving Distance (RMD) based method to utilize the attribute change instead of the attribute pattern itself to classify attributes. Besides the compositions of single-attribute and object, our RMD is also suitable for complex compositions of multiple attributes and objects when incorporating attribute correlations. SymNet can be utilized for attribute learning, compositional zero-shot learning and outperforms the state-of-the-art on four widely-used benchmarks. Code is at https://github.com/DirtyHarryLYL/SymNet.