Abstract: Evaluation plays a crucial role in the development of ranking algorithms for search and recommender systems. It enables online platforms to create user-friendly features that drive commercial success in a steady and effective manner. The online environment is particularly conducive to causal inference techniques such as randomized controlled experiments (known as A/B tests), which are often more challenging to implement in fields like medicine and public policy. However, businesses face unique challenges in running A/B tests effectively. In particular, achieving sufficient statistical power for conversion-based metrics can be time-consuming, especially for significant purchases like booking accommodations. While offline evaluations are quicker and more cost-effective, they often lack accuracy and are inadequate for selecting A/B test candidates. To address these challenges, we developed interleaving and counterfactual evaluation methods that support rapid online assessments for identifying the most promising A/B test candidates. Our approach not only increased the sensitivity of experiments by a factor of up to 100 (depending on the method and metric) compared to traditional A/B testing but also streamlined the experimental process. The practical insights gained from production usage can also benefit organizations with similar interests.
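To make the interleaving idea concrete, below is a minimal sketch of team-draft interleaving, one standard scheme for this kind of rapid online comparison. The function names, toy data, and click-credit rule are illustrative assumptions, not the production system described above.

    import random

    def team_draft_interleave(ranking_a, ranking_b):
        # Merge two rankings into the single list shown to the user,
        # recording which ranker ("team") contributed each position.
        interleaved, teams, used = [], [], set()

        def next_unused(ranking):
            for item in ranking:
                if item not in used:
                    return item
            return None

        while True:
            order = [("A", ranking_a), ("B", ranking_b)]
            random.shuffle(order)  # coin flip: who drafts first this round
            progressed = False
            for team, ranking in order:
                item = next_unused(ranking)
                if item is not None:
                    used.add(item)
                    interleaved.append(item)
                    teams.append(team)
                    progressed = True
            if not progressed:
                return interleaved, teams

    def score_clicks(teams, clicked_positions):
        # Credit each click to the team that supplied the clicked item;
        # the ranker with more credited clicks wins this impression.
        credit = {"A": 0, "B": 0}
        for pos in clicked_positions:
            credit[teams[pos]] += 1
        return credit

    # Example: one impression where the user clicks the top result.
    merged, teams = team_draft_interleave(["h1", "h2", "h3"], ["h2", "h4", "h1"])
    print(score_clicks(teams, clicked_positions=[0]))

Because every user sees a single merged ranking, each impression yields a paired head-to-head comparison rather than a between-groups one, which is the usual source of interleaving's large sensitivity advantage over traffic-split A/B testing.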

Abstract: Online experimentation platforms collect user feedback at low cost and large scale. Some systems even support real-time or near-real-time data processing and can update metrics and statistics continuously. Many commonly used metrics, such as clicks and page views, can be observed with little delay. However, many important signals can only be observed after several hours or days, with noise accumulating over the duration of the episode. When episodic outcomes follow a complex sequence of user-product interactions, it is difficult to understand which interactions lead to the final outcome. There is no obvious attribution logic to associate a positive or negative outcome with the actions and choices made at different times. Such attribution logic is critical to unlocking more targeted and efficient measurement at a finer granularity, and could eventually enable full reinforcement-learning capability. In this paper, we borrow the idea of Causal Surrogacy to model a long-term outcome using incrementally observed leading indicators, applying it as a value function to track progress towards the final outcome and to attribute credit incrementally to the various user-product interaction steps. Applying this approach to the guest booking metric at Airbnb resulted in significant variance reductions of 50% to 85%, while aligning well with the booking metric itself. Continuous attribution allows us to assign a utility score to each product page-view, and this score can be flexibly aggregated further to a variety of units of interest, such as searches and listings. We provide multiple real-world applications of attribution to illustrate its versatility.
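To illustrate the surrogate value-function idea, below is a minimal sketch that fits a surrogate model on incrementally observed leading indicators and credits each interaction step with the change in predicted outcome it produced. The indicator features, toy data, and logistic-regression surrogate are illustrative assumptions, not the model used in the paper.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Hypothetical leading indicators, observed cumulatively per episode:
    # [num_searches, num_listing_views, num_contacts]
    X_train = np.array([[1, 0, 0], [3, 2, 0], [5, 4, 1], [2, 1, 1], [8, 6, 2]])
    y_train = np.array([0, 0, 1, 1, 1])  # eventual booking outcome (toy labels)

    surrogate = LogisticRegression().fit(X_train, y_train)

    def value(state):
        # Predicted booking probability given the indicators observed so far.
        return surrogate.predict_proba(np.asarray(state).reshape(1, -1))[0, 1]

    def attribute(trajectory):
        # Credit each step with the change in predicted value it produced.
        # The credits telescope: they sum to value(final) - value(initial).
        return [value(curr) - value(prev)
                for prev, curr in zip(trajectory[:-1], trajectory[1:])]

    # One episode: cumulative indicators after each product page-view.
    episode = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [2, 2, 1]]
    print(attribute(episode))  # a utility score per page-view

Because the per-step credits telescope to the overall change in the surrogate's booking estimate, aggregating them over any unit of interest (a page-view, a search, a listing) remains consistent with the long-term metric while being observable immediately.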