In the rapidly evolving domain of Recommender Systems (RecSys), new algorithms frequently claim state-of-the-art performance based on evaluations over a limited set of arbitrarily selected datasets. However, this approach may fail to holistically reflect their effectiveness due to the significant impact of dataset characteristics on algorithm performance. Addressing this deficiency, this paper introduces a novel benchmarking methodology to facilitate a fair and robust comparison of RecSys algorithms, thereby advancing evaluation practices. By utilizing a diverse set of $30$ open datasets, including two introduced in this work, and evaluating $11$ collaborative filtering algorithms across $9$ metrics, we critically examine the influence of dataset characteristics on algorithm performance. We further investigate the feasibility of aggregating outcomes from multiple datasets into a unified ranking. Through rigorous experimental analysis, we validate the reliability of our methodology under the variability of datasets, offering a benchmarking strategy that balances quality and computational demands. This methodology enables a fair yet effective means of evaluating RecSys algorithms, providing valuable guidance for future research endeavors.
This survey aims at providing a comprehensive overview of the recent trends in the field of modeling and simulation (M&S) of interactions between users and recommender systems and applications of the M&S to the performance improvement of industrial recommender engines. We start with the motivation behind the development of frameworks implementing the simulations -- simulators -- and the usage of them for training and testing recommender systems of different types (including Reinforcement Learning ones). Furthermore, we provide a new consistent classification of existing simulators based on their functionality, approbation, and industrial effectiveness and moreover make a summary of the simulators found in the research literature. Besides other things, we discuss the building blocks of simulators: methods for synthetic data (user, item, user-item responses) generation, methods for what-if experimental analysis, methods and datasets used for simulation quality evaluation (including the methods that monitor and/or close possible simulation-to-reality gaps), and methods for summarization of experimental simulation results. Finally, this survey considers emerging topics and open problems in the field.