The Journey of Building Playtika’s ML Platform — Part 1
By Oded Nadel
Introducing MLody, Playtika’s in-house ML platform, built to put our sophisticated models at your fingertips.
To begin our story, we must first discuss our goal and what we are trying to achieve.
Chapter 1: Introduction
Goal
The goal of our ML platform is to facilitate the machine learning model development lifecycle by providing an end-to-end ecosystem for the following processes:
- Data collection & understanding – extraction and exploration of the historical data relevant to our business question.
- Feature engineering and selection.
- Model training and creation through experimentation.
- Model lifecycle management using a centralized model registry.
- Model serving and deployment – via three deployment methods: REST, streaming endpoint (online), and batch (offline).
- ML pipeline orchestration and management.
- Model production monitoring – including a feedback loop of labels/actuals collection and processing.
Our Use Cases for the ML platform
The ML platform will be used by a variety of teams within the company, including data engineers, machine learning engineers, and data scientists.
Once we cover the “basics” and onboard our first cycle of users, we will pursue the “democratization of AI” within the company: the second and third cycles of users will take an active part in the model development process through “AutoML” capabilities and enhanced UI capabilities that “keep the drama under the hood.”
The different capabilities of our platform should allow for:
- Development of new machine learning models for additional new use cases.
- Improvement of existing machine learning models.
- Expansion of existing machine learning solutions to new studios in Playtika, enabling the onboarding of games Playtika acquires in the future.
- Automated deployment of machine learning models to production.
- Performance monitoring of the machine learning models in production.
What are our business aspirations?
Our ML platform has several ambitious goals:
- Supporting Traditional Models: We aim to support our classical models, producing predictions for multiclass classification and regression tasks.
- Unsupervised Learning: The platform should also support our unsupervised models to detect anomalies within our games.
- Generative AI Integration: We envision supporting our generative AI initiatives, which are capable of producing images or text based on user prompts.
- Democratization of AI: A core objective is to empower all Playtikans to easily generate AI solutions by supplying out-of-the-box solutions for optimizing various business KPIs.
- Open-Source Contribution: Ultimately, we aspire to release our ML platform back to the open-source community.
Where will our ML platform shine?
The ML platform promises to deliver significant benefits to Playtika, including:
- Enhanced efficiency and productivity of machine learning teams made up of both data scientists and ML engineers.
- Accelerated time to market for new machine learning solutions.
- Improved quality of machine learning models.
- Accelerated releases through technological compliance with Playtika’s Production Grade Standards.
- Reduced risk of machine learning models failing in production.
- Enhanced visibility into the performance of machine learning models in production.
- Increased maturity of our ML solutions deployed to production through visibility into resource utilization and alerts about data or ML failures.
Chapter 2: Our Starting Point
Now that we have a clear understanding of our goals, let’s delve into the “how” of building our ML platform.
To begin this chapter, we first need to understand the constraints we have to work within.
Our Constraints
Cloud or On-premise?
We are limited to on-premise infrastructures.
Yes, that is correct. Playtika is fully on-premise, managing multiple data centers to support our day-to-day activities and services. This means that managed cloud platforms, such as AWS’s SageMaker, Google’s Vertex AI, or Azure’s ML Studio, are not an option for us.
But this raises an additional “hidden” constraint: how do we handle missing hardware?
To better explain: CPU-based models currently dominate the market, but this is temporary, and the transition to GPU-based models is imminent. If you ask me, the transition has already happened in light of the hype surrounding ChatGPT, Gemini, MidJourney, and foundation models.
Data
Our models need data, a lot (!) of data. For that, we utilize our on-premise storage, but hey, it isn’t S3 (by AWS), which brings additional complexity.
We are using an abstraction layer, written in Java, that removes the need to know the actual storage bucket locations. This is a virtue, but also a setback, because our programming language is mostly Python and the two languages do not work well together natively.
For data computation, we rely extensively on Spark, like many other players in the industry.
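To make the Python-Java-Spark friction concrete, here is a minimal sketch of what reading training data through such an abstraction layer could look like, assuming the Java layer exposes a small REST facade. The endpoint, payload shape, and dataset name below are illustrative assumptions, not our actual API.

```python
# Hypothetical sketch: resolving a logical dataset name through the Java
# storage-abstraction service, then reading it with Spark.
# The endpoint and response shape are assumptions for illustration only.
import requests
from pyspark.sql import SparkSession

def resolve_physical_path(logical_name: str) -> str:
    # The Java abstraction layer is assumed to expose a small REST facade,
    # so Python callers never hard-code bucket locations.
    resp = requests.get(
        "http://storage-abstraction.internal/v1/datasets",
        params={"name": logical_name},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["physical_path"]

spark = SparkSession.builder.appName("training-data-extract").getOrCreate()
path = resolve_physical_path("purchases.daily_aggregates")
df = spark.read.parquet(path)
df.show(5)
```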
The data our models digest is “clean” but not well organized, so we rely on BI developers to explain the context and business meaning of each data point.
To relieve this pain, we are also creating our own “data catalog” with several automated processes.
Automation
Playtika’s existing CI/CD automation solution does not meet the needs of the ML development lifecycle; therefore, we need to collaborate with our CI/CD developers to create a new process and implement it across all of Playtika’s platforms.
Our existing quality assurance infrastructure allows for automated execution of tests within our CI/CD solution, covering both ML artifacts and datasets.
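To give a flavor of what automated checks over ML artifacts and datasets can look like in a CI/CD stage, here is a minimal pytest-style sketch. The file paths, expected columns, and artifact format are assumptions for illustration only, not our actual pipeline.

```python
# Hypothetical sketch of checks a CI/CD stage could run against ML artifacts
# and datasets. File names, columns, and formats are illustrative.
import pandas as pd
import joblib

EXPECTED_COLUMNS = {"user_id", "days_active", "total_purchases", "label"}

def test_dataset_schema():
    # Validate that the training sample has the expected columns and labels.
    df = pd.read_parquet("artifacts/train_sample.parquet")
    assert EXPECTED_COLUMNS.issubset(df.columns)
    assert df["label"].isna().sum() == 0

def test_model_artifact_predicts():
    # Validate that the serialized model loads and produces predictions.
    model = joblib.load("artifacts/model.joblib")
    df = pd.read_parquet("artifacts/train_sample.parquet")
    preds = model.predict(df.drop(columns=["label"]).head(10))
    assert len(preds) == 10
```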
Regulations
GDPR regulations govern our use of personal information for model training. While we can consume this data, we cannot expose it.
On top of that, we are subject to the recent privacy changes introduced by Apple (SKAdNetwork) and Google (Privacy Sandbox), which our developments must adopt.
Resources and Skills
Our development team consists of data engineers, ML engineers, and data scientists who will guide us in developing this platform effectively.
As for budget, this tool is intended for internal use, which limits the resources we can claim, but the current AI boom is on our side: the growing internal interest in AI should allow for gradual budget increases.
Build or Buy?
This is a classic question, but one that requires a well-considered answer.
Considering the constraints mentioned above, exploring cloud-managed platforms like those offered by AWS, Google, or Microsoft seems like a long shot. We would need to address several prerequisites before considering these options:
- Uploading and securing data in the cloud
- Adapting to new cloud storage solutions
- Shifting processing engines to the cloud
- Reorganizing our CI/CD processes
- Enabling interaction with other Playtika systems (still on-premise)
This is just a partial list, and it doesn’t even account for potential cloud costs associated with storage, computation, and managed ML platform services.
What about on-premise managed platforms?
On-premise managed platforms, while investigated, turned out to be less than ideal and presented their own limitations.
They often require significant adjustments to the managed platform’s stack, at the expense of existing Playtika infrastructure, and lengthy integration efforts.
So, does that mean that building the platform ourselves is the obvious solution? Not necessarily.
While we have skilled engineers, replicating products with thousands of development hours invested isn’t realistic. This forces us to explore the market very carefully, testing open-source solutions for compatibility with our existing technological stack, and preparing a detailed integration plan.
Ultimately, we need to select technologies that offer the highest benefit for our users.
This should remain our guiding principle throughout the process. Humans sometimes forget this, so it may be good if machines take over the world. 😉
Chapter 3: A stroll down memory lane
Our journey began 4.5 years ago, predating the current AI explosion across various industries. We weren’t fortune tellers, but we did sense the winds of change in AI a bit earlier 🙂
Recognizing the potential impact of AI on player experience and rapid business growth, we paused to re-evaluate our technology stack and its future needs.
A dedicated “Watchtower” task force was created to map out essential requirements, categorized by subject:
Data
We prioritized data organization and cleaning, emphasizing the removal of personal information. Robust data documentation, including data ownership, was also deemed crucial. Beyond these core principles, data governance, observability, and exploration accessibility were identified as key pillars, and a feature catalog was marked as a necessity.
We recognized the need to streamline feature exploration and computation processes, ensuring timely execution (on-demand if necessary).
Experimentation
We acknowledged the lengthy and resource-intensive nature of our existing, unmanaged experiments.
This inefficiency stemmed from our initial focus on speed, prioritizing rapid delivery of ML solutions.
ML Pipelines Authoring
We understood the limitations of “production notebooks” for prediction serving and needed to identify a more appropriate approach.
I would like to remind you that this was nearly 5 years ago, when ML Operations was in its infancy, and we were essentially navigating uncharted territory with many unanswered questions.
Model Management
We recognized that models were more than just “Python scripts or code,” as some stakeholders viewed them. They were the brains of our applications, necessitating a shift in mindset towards comprehensive model management across the entire lifecycle.
This involved version tracking, storage in a shared registry for reusability, regular backups, and a standardized process for deploying new models to production.
Furthermore, metadata tracking, a model retirement process, and robust online testing in our production environment, on real users, were identified as additional requirements.
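As one illustration of version tracking, a shared registry, and metadata logging, here is a minimal sketch using MLflow, a widely used open-source model registry. This shows the pattern, not necessarily the tooling behind MLody; the tracking URI and model name are made up.

```python
# Illustrative only: registering a trained model with parameters, metrics,
# and a version in MLflow's model registry. Our actual registry may differ.
import mlflow
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
model = RandomForestClassifier(random_state=42).fit(X, y)

mlflow.set_tracking_uri("http://mlflow.internal:5000")  # assumed internal server
with mlflow.start_run():
    mlflow.log_param("n_estimators", model.n_estimators)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        registered_model_name="churn_classifier",  # shared registry entry
    )
```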
Production Monitoring
Our existing infrastructure monitoring provided basic information about model and service uptime, latency, and potential outages. However, it lacked the capability to monitor model behavior, accuracy, or even detect training-serving skew. We didn’t even know what that meant back then 😉
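For readers who, like us back then, are wondering what training-serving skew even is: it occurs when the feature distributions (or processing logic) a model sees in production differ from what it saw during training. Below is a minimal sketch of one common check, the Population Stability Index (PSI); the 0.2 threshold is a rule of thumb, not a Playtika standard.

```python
# Minimal sketch of one way to flag training-serving skew: the Population
# Stability Index (PSI) between a feature's training and serving distributions.
import numpy as np

def psi(train: np.ndarray, serving: np.ndarray, bins: int = 10) -> float:
    # Bin both samples on the training distribution's bin edges.
    edges = np.histogram_bin_edges(train, bins=bins)
    train_pct = np.histogram(train, bins=edges)[0] / len(train)
    serve_pct = np.histogram(serving, bins=edges)[0] / len(serving)
    # Clip to a small epsilon to avoid division by zero / log(0).
    train_pct = np.clip(train_pct, 1e-6, None)
    serve_pct = np.clip(serve_pct, 1e-6, None)
    return float(np.sum((serve_pct - train_pct) * np.log(serve_pct / train_pct)))

train_feature = np.random.normal(0, 1, 10_000)
serving_feature = np.random.normal(0.3, 1.1, 10_000)  # simulated drift
if psi(train_feature, serving_feature) > 0.2:  # common rule-of-thumb threshold
    print("Feature drift detected: investigate training-serving skew")
```

A real monitoring setup would run this kind of check per feature on a schedule and wire it into alerting, which is exactly the visibility we were missing.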
So, what’s next?
Market research was the obvious first step.
However, with development ongoing, we couldn’t afford to halt our progress completely. Therefore, we decided to tackle the more mature aspects of our needs while dedicating time to explore the best solutions for the other areas of our ML platform.
Thus, we began building our feature store and data catalog.
We evaluated both vendor and open-source feature store solutions. While several options were very attractive, user needs and working methods ultimately guided our decision.
Here are the key questions that emerged:
- What tools are familiar to our users based on their previous experience?
- What should user interaction look like? Is an API sufficient, or do we need a dedicated user interface (UI)?
- How comprehensive should the UI be? Should all APIs have a corresponding UI, or just a selected few?
The technical nature of our user base—they’re primarily coders—sparked debate about the need for a UI. Why invest development resources if our users write code daily?
While a suitable open-source solution existed at the time, it had limitations. We conducted a Proof of Concept (POC) test with our users, but the results were disappointing.
Here’s the user feedback that stopped us: It just didn’t work.
Our ML engineers complained that the need to perform multiple dependent API calls to retrieve training data was cumbersome and time-consuming. A simple task became a burden.
Our data scientists had difficulty understanding existing features, their creators, and their purposes. The overall UI experience hindered collaboration, one of our key goals.
We carefully analyzed this feedback and went back to the drawing board.
Despite having highly talented developers, a proper user interface was necessary to accelerate our delivery process.
Furthermore, we realized that simply providing infrastructure for viewing and sharing feature sets wasn’t enough.
Timely or on-demand feature computation was crucial for implementing feature governance and monitoring techniques.
This led us to a critical realization: we weren’t just looking for a feature store, we needed a feature platform.
A platform encompassing feature creation, storage, visualization, sharing (through the feature sets in our feature catalog), computation (through the feature computation engine), and the entire feature engineering workflow.
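To make the distinction tangible, here is a hypothetical sketch of what authoring a feature set against such a platform could look like. The FeatureSet structure and field names are illustrative assumptions, not MLody’s actual API.

```python
# Hypothetical sketch of authoring a feature set for a feature platform.
# The structure and fields are illustrative, not MLody's actual API.
from dataclasses import dataclass, field

@dataclass
class FeatureSet:
    name: str
    owner: str
    entity: str                      # the key the features are joined on
    features: list = field(default_factory=list)
    compute_schedule: str = "daily"  # or "on_demand"

fs = FeatureSet(
    name="player_engagement_v1",
    owner="ml-platform-team",
    entity="player_id",
    features=["sessions_7d", "avg_session_length_7d", "days_since_last_purchase"],
    compute_schedule="on_demand",
)

# In a real platform this call would persist the definition to the feature
# catalog and trigger the computation engine; here we just print it.
print(f"Registering {fs.name} ({len(fs.features)} features) owned by {fs.owner}")
```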
Ultimately, we decided to design and develop our own feature platform because we couldn’t find a tool that met our needs without an extensive list of unmet feature requests from the vendor.
You’ve made it this far, so you already know that we’ve identified our gaps, acknowledged the learning curve ahead, and are seeking guidance.
Stick around for the next chapters – that’s where things get exciting (at least for me!)