Lessons Learned From Multi-Armed Bandits In Real-Time Production
By Raphael Steinmann, Alessandro D'Armiento, Adam Kozak, Juan Perez-Sorrosal, Stefano Piazza, Roberto Marturano, Pavel Nesterenko
Abstract
We apply multi-armed bandits to personalize the gaming experience.
We share lessons learned from deploying and running multi-armed bandits in a real-time production environment.
In the context of mobile gaming, we illustrate experience personalization for millions of daily users, with latencies of 20–40 ms. Specifically, we focus on common challenges arising from operating in real-time production and from addressing business goals.
Problem statement
Playtika aims to provide a personalized experience to users in terms of gameplay (difficulty, daily missions) and content (special quests/events).
This brings numerous technical challenges in operating recommender systems as high-throughput, low-latency, fault-tolerant services, as well as business challenges including observability and interpretability of the recommendations, optimization of multiple objectives, and learning with delayed feedback and/or under time-limited events.
Approach
Our recommender system, named Smart Actions (Figure 1), operates as follows:
- A user generates a trigger message through a game activity event (e.g. login, level up), passed via Kafka topics to a Feature Assembler that fetches both information on the user (from a Feature Store and a User Profile Service) and a list of actions relevant to the trigger (from an action database).
- The assembled information is dispatched to a recommender system (multi-armed bandits) that decides which action seems most enjoyable for this specific user.
- A dispatcher component is responsible for applying business conditions (e.g. business or technical constraints), then relays the message to a journey system responsible for enabling this experience on the client (game).
- Action feedback and gameplay are instrumented, enabling us to derive a reward, which is required to update the bandit parameters.
In this diagram, additional predictive models can be used to enrich the Feature Set of a user (e.g. churn prediction).
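To make the flow concrete, here is a minimal, hypothetical sketch of the feed-forward path in Python; the message types, component names, and the stand-in bandit are illustrative assumptions, not the production code.

```python
from dataclasses import dataclass
import random

@dataclass
class Trigger:
    user_id: str
    event: str  # e.g. "login", "level_up"

@dataclass
class Recommendation:
    user_id: str
    action_id: str

class RandomBandit:
    """Stand-in for the multi-armed bandit component."""
    def select(self, candidate_actions, features):
        return random.choice(candidate_actions)

def assemble(trigger, feature_store, actions_by_event):
    """Feature Assembler: fetch user features and the actions relevant to this trigger."""
    features = feature_store.get(trigger.user_id, {})
    candidate_actions = actions_by_event.get(trigger.event, [])
    return features, candidate_actions

def recommend(trigger, feature_store, actions_by_event, bandit):
    """Feed-forward path: trigger -> feature assembly -> bandit decision -> dispatcher."""
    features, candidates = assemble(trigger, feature_store, actions_by_event)
    action_id = bandit.select(candidates, features)
    return Recommendation(user_id=trigger.user_id, action_id=action_id)

# Example: one login trigger routed through the pipeline.
recommendation = recommend(
    Trigger(user_id="u42", event="login"),
    feature_store={"u42": {"level": 17, "churn_score": 0.2}},
    actions_by_event={"login": ["daily_mission_easy", "daily_mission_hard", "special_quest"]},
    bandit=RandomBandit(),
)
```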
This infrastructure has two main parts:
- A streaming feed-forward infrastructure: a firm real-time system that serves inferences with a latency of 5–20 ms.
- A batch feedback infrastructure that allows updates every few hours.
Recurrent challenges for real-time recommendations
Infrastructure System
Monitoring
In any AI/ML product, two distinct types of monitoring are required: health checks of the infrastructure and insight into the recommender itself.
Whilst the former can be technically challenging, its design stems from best practices in architecture and software engineering.
The latter, however, presented significant business challenges to the studio.
Specifically, we needed to explain and understand why the recommender system made particular decisions, and for whom, amidst millions of daily recommendations.
Classical business dashboards quickly showed their limitations. Similarly, given the variety of use cases, no one-size-fits-all solution was comprehensive enough.
Instead, we rely both on real-time monitoring of critical KPIs (with Grafana) and on batch-scheduled analyses with automated reporting (Python scripts scheduled by Airflow, with automated email reports).
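As an illustration, a batch-reporting job of this kind could be wired up as an Airflow DAG along the following lines; the DAG id, schedule, recipients, and the analysis step are placeholder assumptions, not our production pipeline (assuming Airflow 2.x).

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.email import EmailOperator

def analyse_recommendations():
    # Placeholder for the batch analysis: e.g. aggregate which actions were
    # recommended to which segments and compare observed rewards per arm.
    ...

with DAG(
    dag_id="bandit_recommendation_report",       # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 */6 * * *",              # every few hours, matching the batch feedback cadence
    catchup=False,
) as dag:
    analyse = PythonOperator(
        task_id="analyse_recommendations",
        python_callable=analyse_recommendations,
    )
    report = EmailOperator(
        task_id="email_report",
        to="ml-team@example.com",                 # placeholder recipient
        subject="Bandit recommendation report",
        html_content="Summary of recommendations and per-arm rewards.",
    )
    analyse >> report                             # run the analysis, then send the report
```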
Deployment
In Figure 1, we see that there might be multiple deployed bandits for a use-case.
Together with the surrounding microservices and Kafka topics, this quickly adds up to 50+ microservices, which are cumbersome to deploy and manage as individual Kubernetes components.
Hence, we invested in deployment automation and pipeline composition, which allows deploying a full use case from a single configuration file.
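The sketch below illustrates the idea of expanding one use-case configuration into per-bandit deployment specs; the configuration fields, names, and expansion logic are hypothetical, simplified stand-ins for the actual tooling.

```python
# Hypothetical single-file use-case configuration (illustrative fields only).
use_case_config = {
    "use_case": "daily_mission_personalization",
    "trigger_topic": "game.events.login",
    "bandits": [
        {"name": "mission_difficulty", "type": "cmab_bernoulli", "actions": ["easy", "medium", "hard"]},
        {"name": "mission_theme", "type": "smab_bernoulli", "actions": ["quest", "tournament", "collection"]},
    ],
}

def expand_to_deployments(config: dict) -> list[dict]:
    """Expand one use-case config into one Kubernetes-style deployment spec per bandit."""
    deployments = []
    for bandit in config["bandits"]:
        deployments.append({
            "name": f"{config['use_case']}-{bandit['name']}",
            "image": "bandit-service:latest",   # placeholder image name
            "env": {
                "TRIGGER_TOPIC": config["trigger_topic"],
                "BANDIT_TYPE": bandit["type"],
                "ACTIONS": ",".join(bandit["actions"]),
            },
        })
    return deployments

specs = expand_to_deployments(use_case_config)
```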
Computational performance
Careful code optimization (algorithmic and parallelization) was required so that our bandit implementations could operate with a latency below 50 ms.
Whilst our initial implementation of the contextual bandit took about 20 seconds per inference, we reduced this to a few milliseconds.
Our pybandits Python library is publicly available on GitHub:
https://github.com/PlaytikaOSS/pybandits
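The following toy comparison illustrates the kind of optimization involved: replacing a per-arm Python loop with a single vectorized posterior draw. It is illustrative numpy code, not the pybandits implementation itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n_arms = 1000
# Synthetic Beta posterior parameters, one pair per arm.
successes = rng.integers(1, 100, size=n_arms).astype(float)
failures = rng.integers(1, 100, size=n_arms).astype(float)

def select_arm_loop() -> int:
    # Naive version: one posterior draw per arm inside a Python loop.
    samples = [rng.beta(successes[i], failures[i]) for i in range(n_arms)]
    return int(np.argmax(samples))

def select_arm_vectorized() -> int:
    # Vectorized version: all posterior draws in a single numpy call.
    samples = rng.beta(successes, failures)
    return int(np.argmax(samples))
```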
Recommender System Solutions
Content Novelties and cold-start
The cold-start problem refers to the difficulty of making decisions when there is little or no information about the action.
This is particularly challenging in gaming, where new experiences or features are introduced regularly.
To tackle this issue, we rely on Thompson Sampling [see references: 2], a Bayesian approach to multi-armed bandits that allows starting either with uninformative priors or with estimates derived from similar experiences and/or domain expert knowledge.
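A minimal Beta-Bernoulli Thompson Sampling sketch of the two cold-start options is shown below; the prior values are illustrative, not tuned production numbers.

```python
import numpy as np

rng = np.random.default_rng(42)

class ThompsonSamplingArm:
    def __init__(self, prior_alpha: float = 1.0, prior_beta: float = 1.0):
        self.alpha = prior_alpha  # pseudo-count of successes
        self.beta = prior_beta    # pseudo-count of failures

    def sample(self) -> float:
        # One posterior draw of the arm's reward probability.
        return rng.beta(self.alpha, self.beta)

    def update(self, reward: int) -> None:
        self.alpha += reward
        self.beta += 1 - reward

# Brand-new action with no history: flat Beta(1, 1) prior.
new_action = ThompsonSamplingArm()

# New action seeded from a similar past feature that converted ~30% of the time,
# encoded as a weak prior so real feedback can quickly override it.
seeded_action = ThompsonSamplingArm(prior_alpha=3.0, prior_beta=7.0)

def select(arms: dict) -> str:
    # Thompson Sampling: draw one posterior sample per arm, play the argmax.
    return max(arms, key=lambda name: arms[name].sample())

choice = select({"new_quest": new_action, "seeded_quest": seeded_action})
```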
Multi-objectives and long-term reward
Multi-objective optimization is a significant challenge in real-time production. It involves optimizing multiple conflicting objectives simultaneously.
We address these issues with an implementation of Pareto Thompson Sampling [see references: 8,9], extended with a weighting scheme to prioritize among objectives.
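A possible sketch of this approach, assuming Bernoulli rewards per objective, is shown below; the weighting rule over the Pareto front is a simplified illustration rather than the exact production algorithm.

```python
import numpy as np

rng = np.random.default_rng(7)
n_arms, n_objectives = 5, 2                     # e.g. objective 0 = session length, objective 1 = retention
alphas = np.ones((n_arms, n_objectives))        # Beta posterior parameters per arm and objective
betas = np.ones((n_arms, n_objectives))
objective_weights = np.array([0.3, 0.7])        # illustrative prioritization of the second objective

def pareto_front(samples: np.ndarray) -> list[int]:
    """Indices of arms whose sampled reward vector is not dominated by any other arm."""
    front = []
    for i in range(len(samples)):
        dominated = any(
            np.all(samples[j] >= samples[i]) and np.any(samples[j] > samples[i])
            for j in range(len(samples)) if j != i
        )
        if not dominated:
            front.append(i)
    return front

def select_arm() -> int:
    samples = rng.beta(alphas, betas)            # one posterior draw per arm and objective
    front = pareto_front(samples)                # keep only Pareto-optimal arms
    scores = samples[front] @ objective_weights  # break ties with the objective weights
    return front[int(np.argmax(scores))]
```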
Long-term objectives complicate this further, as the true reward of an action may not be immediately available [see references: 3]. Exploratory data analyses and offline policy evaluation [see references: 10] help identify short-term rewards that can serve as proxies for longer-term objectives.
For example, we found that login over the next few days was a strong proxy for retention over the next two weeks.
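As an illustration, a basic inverse-propensity-scoring (IPS) estimator can score a candidate policy on logged recommendations using such a short-term proxy reward; the logged-data format and policy interface below are assumptions for the sketch.

```python
import numpy as np

def ips_value(logged, candidate_policy) -> float:
    """Estimate a candidate policy's value from logged (context, action, reward, propensity) tuples."""
    estimates = []
    for context, action, reward, propensity in logged:
        # Weight each logged reward by how likely the candidate policy is to take
        # the same action, relative to the logging policy's propensity.
        weight = candidate_policy(context, action) / propensity
        estimates.append(weight * reward)
    return float(np.mean(estimates))

# Example: a uniform-random candidate policy over 3 actions, evaluated on two logged rounds,
# with "logged in within a few days" used as the proxy reward.
logged = [({"level": 5}, "quest", 1.0, 0.5), ({"level": 9}, "tournament", 0.0, 0.25)]
value = ips_value(logged, lambda context, action: 1.0 / 3)
```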
Short-term use-cases
In short-term use cases, the system needs to quickly adapt to changes (e.g. when introducing daily challenges) to identify the best action.
By design, only short-term rewards should be used (e.g. gameplay within the next hour), combined with frequent updates of bandit parameters.
We implemented Top-Two Thompson Sampling to solve best-arm identification with pure exploration [see references: 1].
By considering the two best actions instead of only the highest-ranked one, the algorithm explores more aggressively and identifies an optimal action sooner.
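A minimal sketch of Top-Two Thompson Sampling for Beta-Bernoulli arms follows; the exploration fraction and resampling cap are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)

def top_two_ts_select(alphas, betas, beta_fraction: float = 0.5, max_resamples: int = 100) -> int:
    samples = rng.beta(alphas, betas)              # one posterior draw per arm
    leader = int(np.argmax(samples))
    if rng.random() < beta_fraction:
        return leader                              # play the current best arm
    for _ in range(max_resamples):
        challenger = int(np.argmax(rng.beta(alphas, betas)))
        if challenger != leader:
            return challenger                      # play the strongest competitor instead
    return leader                                  # fallback when posteriors are very peaked

# Example posterior parameters for three candidate actions.
alphas = np.array([5.0, 8.0, 2.0])
betas = np.array([5.0, 4.0, 8.0])
arm = top_two_ts_select(alphas, betas)
```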
Business constraints and cost-control
Recommended actions can take different forms (quantitative, or ordinal with high cardinality), and they can influence a user's gameplay in different ways.
Balancing a fair distribution of quantities across users with a fulfilling gaming experience is often approached with ad-hoc business constraints (e.g. cooldowns, hard-set quantity limits).
We delegate this task to our recommender systems. Specifically, we implemented a subsidy factor [see references: 6] to define a set of feasible actions with high reward and optimal item quantities.
This allows tailoring the beneficial quantity to the level that yields the optimal reward (gameplay experience).
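The sketch below illustrates the subsidy-factor idea for Beta-Bernoulli arms: restrict the choice to actions whose sampled reward is within a tolerance of the best, then pick the cheapest (e.g. the smallest item quantity). The costs and the subsidy value are illustrative.

```python
import numpy as np

rng = np.random.default_rng(11)

def select_with_subsidy(alphas, betas, costs, subsidy_factor: float = 0.1) -> int:
    samples = rng.beta(alphas, betas)                        # Thompson draws per arm
    best = samples.max()
    # Feasible set: arms whose sampled reward is within the subsidy tolerance of the best.
    feasible = np.where(samples >= (1.0 - subsidy_factor) * best)[0]
    # Within the high-reward set, prefer the lowest-cost action (smallest quantity).
    return int(feasible[np.argmin(np.asarray(costs)[feasible])])

# Example: three quantities of the same reward item.
alphas = np.array([20.0, 25.0, 26.0])
betas = np.array([30.0, 25.0, 24.0])
costs = np.array([1, 5, 10])
action = select_with_subsidy(alphas, betas, costs)
```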
Future directions
Cannibalization and adversarial effects
Cannibalization refers to the scenario where the introduction of a new recommendation competes with existing recommendations.
Adversarial effects arise when external factors manipulate the reward structure, causing the bandit to make sub-optimal decisions.
Contextual bandits are well suited to incorporating information about the 'external' context (special game events, daily challenges, etc.), and several recent algorithms are tailored to adversarial contexts and rewards [see references: 4,5].
Cannibalization can be handled with multi-objective bandits that consider both the direct reward of the recommendation (e.g. within the game feature) and indirect rewards beyond the use case (e.g. across the whole game).
Users’ Fatigue and Feedback loops
User fatigue occurs when users become overwhelmed by repetitive recommendations, reducing engagement.
Feedback loops occur when past recommendations overly influence future ones, with the risk of creating a "filter bubble".
We are investigating multiple options: adding new experiences to the list of recommendations, maintaining balanced or forced exploration (e.g. with some greedy exploration), and using non-stationary bandits with sliding-window approaches [see references: 7].
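As one illustration, a sliding-window Thompson Sampling arm can rebuild its posterior from only the most recent observations, in the spirit of the approach cited above; the window size is an illustrative choice.

```python
from collections import deque

import numpy as np

rng = np.random.default_rng(5)

class SlidingWindowArm:
    def __init__(self, window_size: int = 1000):
        # Only the last `window_size` rewards contribute to the posterior,
        # so older behaviour is gradually forgotten.
        self.rewards = deque(maxlen=window_size)

    def update(self, reward: int) -> None:
        self.rewards.append(reward)

    def sample(self) -> float:
        successes = sum(self.rewards)
        failures = len(self.rewards) - successes
        return rng.beta(1 + successes, 1 + failures)  # Beta posterior over the window

# Example usage.
arm = SlidingWindowArm(window_size=500)
for r in (1, 0, 1, 1):
    arm.update(r)
draw = arm.sample()
```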
Acknowledgments
We would like to thank our colleagues for their contribution to the Playtika Recommendation Platform: Raphael Steinmann, Alessandro D’Armiento, Adam Kozak, Juan Perez-Sorrosal, Stefano Piazza, Roberto Marturano, Pavel Nesterenko.
References
1. Jean-Yves Audibert, Sébastien Bubeck, and Rémi Munos. 2010. Best Arm Identification in Multi-Armed Bandits. http://sbubeck.com/COLT10_ABM.pdf
2. Olivier Chapelle and Lihong Li. 2011. An Empirical Evaluation of Thompson Sampling. In Advances in Neural Information Processing Systems. https://papers.nips.cc/paper_files/paper/2011/hash/e53a0a2978c28872a4505bdb51db06dc-Abstract.html
3. Miroslav Dudík, John Langford, and Lihong Li. 2011. Doubly Robust Policy Evaluation and Learning. https://doi.org/10.48550/arXiv.1103.4601
4. Yuko Kuroki, Alberto Rumi, Taira Tsuchiya, Fabio Vitale, and Nicolò Cesa-Bianchi. 2024. Best-of-Both-Worlds Algorithms for Linear Contextual Bandits. In Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, 1216–1224. https://proceedings.mlr.press/v238/kuroki24a.html
5. Haolin Liu, Chen-Yu Wei, and Julian Zimmert. 2023. Bypassing the Simulator: Near-Optimal Adversarial Linear Contextual Bandits. Advances in Neural Information Processing Systems 36: 52086–52131.
6. Deeksha Sinha, Karthik Abinav Sankararaman, Abbas Kazerouni, and Vashist Avadhanula. 2021. Multi-armed Bandits with Cost Subsidy. https://doi.org/10.48550/arXiv.2011.01488
7. Francesco Trovò, Stefano Paladino, Marcello Restelli, and Nicola Gatti. 2020. Sliding-Window Thompson Sampling for Non-Stationary Settings. Journal of Artificial Intelligence Research 68: 311–364. https://doi.org/10.1613/jair.1.11407
8. Saba Yahyaa and Bernard Manderick. 2015. Thompson Sampling for Multi-Objective Multi-Armed Bandits Problem. Computational Intelligence.
9. PlaytikaOSS. 2024. pybandits. https://github.com/PlaytikaOSS/pybandits
10. Masahiro Kato et al. 2020. A Practical Guide of Off-Policy Evaluation for Bandit Problems. arXiv preprint arXiv:2010.12470.