The Journey of Building Playtika’s ML Platform — Part 3
By Oded Nadel
Chapter 5: Playtika’s ML Platform Offering
It’s finally time to delve into what our platform offers.
We’ll revisit the ML Platform modules discussed in the previous chapters and showcase our current capabilities.
The ML Platform Modules
Feature Platform
As I mentioned earlier, numerous options exist for addressing our needs. However, we recognize the importance of a good user interface for increasing user adoption.
Playtika’s internal Feature Platform, designed by our engineers, lets users view and search existing feature sets, so knowledge about which features are most useful for a given model objective can be shared across teams.
In practice, a user working on a problem like churn prediction can explore the feature catalog to identify features used in similar problems and begin the experimentation phase with these “known” features.
This is our offering for a Feature Catalog that fosters collaboration.
For Feature Computation, we leverage Spark to run our data computation tasks. Spark is a well-established and widely-used solution within Playtika’s data teams.
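To make this concrete, here is a minimal sketch of what a feature-computation job might look like in PySpark. The table paths, column names, and aggregation logic are illustrative only, not Playtika’s actual feature definitions.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("churn-feature-computation").getOrCreate()

# Hypothetical source table of raw gameplay events (illustrative schema and path).
events = spark.read.parquet("/data/raw/gameplay_events")

# Aggregate raw events into per-user features over the last 30 days.
features = (
    events
    .filter(F.col("event_date") >= F.date_sub(F.current_date(), 30))
    .groupBy("user_id")
    .agg(
        F.count("*").alias("sessions_30d"),
        F.sum("purchase_amount").alias("revenue_30d"),
        F.max("event_date").alias("last_active_date"),
    )
)

# Persist the feature set so it can later be registered in the feature catalog.
features.write.mode("overwrite").parquet("/data/features/churn_user_features")
```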
Experiment Platform
As I’ve previously mentioned, during the experimentation phase we noted that our users required several backend capabilities, such as parallel computation, tracking, reproducibility, and others. However, while these are important, their top priority is an intuitive user interface allowing them to compare experiments and select the best model for their needs.
Taking this into account, while commercially available tools offer excellent UIs, they often overlap with our existing infrastructure.
Therefore, we prioritized open-source solutions, such as MLflow + AimStack, which provide a strong combination of user interface, tracking, and visualization capabilities.
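As a minimal illustration of the tracking side, this is roughly what logging an experiment run looks like with MLflow’s Python API; the tracking URI, experiment name, dataset, parameters, and metrics are placeholders, and the AimStack side is omitted.

```python
import mlflow
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Point MLflow at a shared tracking server (URL is a placeholder).
mlflow.set_tracking_uri("http://mlflow.internal:5000")
mlflow.set_experiment("churn-prediction")

# Synthetic stand-in for a real training dataset.
X, y = make_classification(n_samples=5_000, n_features=20, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

with mlflow.start_run(run_name="rf-baseline"):
    model = RandomForestClassifier(n_estimators=200, max_depth=8)
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])

    # Log parameters, metrics, and the model artifact so runs can be compared in the UI.
    mlflow.log_params({"n_estimators": 200, "max_depth": 8})
    mlflow.log_metric("val_auc", auc)
    mlflow.sklearn.log_model(model, "model")
```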
We also offer our users a managed Jupyter Notebook solution that allows them to view and work on the same notebook, contribute to each other’s work, and brainstorm.
Having said that, these tools do lack some backend capabilities, such as orchestration of computation jobs, which we will need to develop within Playtika’s ecosystem using our existing stack.
I will say that we are actively seeking the right integration to minimize internal development efforts and leverage this combination effectively.
In addition, to address the growing needs of computer vision, image generation, and text generation capabilities, we are conducting another POC in the cloud. This will help us determine the necessary infrastructure additions to our stack to support these endeavors at Playtika’s required scale.
ML Pipeline Engine
Here’s a recap of our POC findings:
Airflow boasts maturity, extensive capabilities, and a strong community (evidenced by its GitHub stars). However, it caters more towards data engineers and data pipelines, not data scientists, and isn’t specifically designed for machine learning (ML).
While its maturity offers advantages when compared to Kubeflow Pipelines, it comes at the cost of being less tailored for ML practitioners.
Conversely, Kubeflow Pipelines is much more ML-oriented. Its Python base and focus on user freedom through trial and error make it ideal for the day-to-day work of data scientists.
This essentially creates a “pick your poison” scenario: both Airflow and Kubeflow Pipelines have strengths and weaknesses, which made our decision-making process very difficult. We wanted a solution that wouldn’t require us to compromise.
So, we started investigating ZenML’s solution for pipeline orchestration.
ZenML offers an abstraction layer for batch orchestrators, allowing us to leverage either of the pipeline engines mentioned above, while reserving the flexibility to switch in the future.
How does it work?
ZenML achieves this by providing a pipeline DSL (domain-specific language) that can generate either a Kubeflow pipeline or an Airflow DAG simply by modifying the underlying stack we are using (a minimal sketch appears at the end of this section). This offers several advantages:
For rapid development during the authoring phase, ZenML allows us to run pipelines locally. This enables short iteration cycles where individual steps can be quickly created and tested in isolation before integrating them into the complete pipeline.
ZenML’s abstraction layer also facilitates future support for streaming pipelines. Our current streaming applications utilize Kafka and Akka technologies, but lack a unified DSL for pipeline authoring.
We aim to use the same DSL, which can then be translated into pipelines for a stream pipeline engine developed internally at Playtika.
This creates a single, unified approach for building any ML pipeline, regardless of whether it’s a batch or streaming process.
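To illustrate what the pipeline DSL looks like in practice, here is a minimal ZenML-style sketch. The step contents and pipeline name are placeholders, and the exact decorator imports and run semantics depend on the ZenML version in use.

```python
from zenml import pipeline, step


@step
def load_data() -> dict:
    # Placeholder: in a real pipeline this would read a registered feature set.
    return {"features": [[0.1, 0.2], [0.3, 0.4]], "labels": [0, 1]}


@step
def train_model(data: dict) -> float:
    # Placeholder: train a model and return a validation score.
    return 0.87


@pipeline
def churn_training_pipeline():
    data = load_data()
    train_model(data)


if __name__ == "__main__":
    # The same pipeline definition runs locally, or on Airflow / Kubeflow,
    # depending on the active ZenML stack (e.g. `zenml stack set <stack_name>`).
    churn_training_pipeline()
```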
Model Manager
As mentioned earlier, a comprehensive ML Platform, encompassing the entire workflow from modeling to deployment, requires a critical piece of “glue” to connect these stages. That glue is the Model Registry.
We recognized that this glue is typically offered as part of experimentation platforms designed for cloud-native integrations with S3 storage. However, Playtika’s unique ecosystem is a bit different and necessitated customization regardless of the chosen platform.
Therefore, we opted to develop an in-house model registry tailored to our infrastructure. This includes dedicated integration with Pure Storage and Alluxio as an abstraction layer, ensuring a better fit for our needs.
Currently, our model registry is designed to store the model itself, its metadata, and its deployment requirements in a decoupled manner.
This not only enables backup and deployment of our internal ML models but also grants us the flexibility to leverage the open-source community. We can store gigabytes of ML models (from HuggingFace and others) and fine-tune our LoRA files per use case, making them readily available for deployment.
This allows us to seamlessly integrate this component with both our “classical” ML models and the frequent releases of “cutting-edge” models from the GenAI open-source community.
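Since the registry is internal, I can’t share its real API, but conceptually the decoupling looks something like the following hypothetical sketch: the model artifact, its metadata, and its deployment requirements are stored and versioned as independent fields. All names, paths, and fields here are illustrative, not the actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ModelEntry:
    """Hypothetical registry entry: artifact, metadata, and deployment needs are decoupled."""
    name: str
    version: str
    artifact_uri: str                                     # e.g. a path on the storage abstraction layer
    metadata: dict = field(default_factory=dict)          # metrics, lineage, owner
    deployment_spec: dict = field(default_factory=dict)   # resources, serving runtime, mode


class ModelRegistry:
    """Illustrative in-memory registry; a real system would persist to durable storage."""

    def __init__(self) -> None:
        self._entries: dict[tuple[str, str], ModelEntry] = {}

    def register(self, entry: ModelEntry) -> None:
        self._entries[(entry.name, entry.version)] = entry

    def get(self, name: str, version: str) -> ModelEntry:
        return self._entries[(name, version)]


# Example: registering a classical model; a fine-tuned LoRA adapter would be a separate entry.
registry = ModelRegistry()
registry.register(ModelEntry(
    name="churn-classifier",
    version="1.4.0",
    artifact_uri="alluxio://models/churn-classifier/1.4.0/model.pkl",
    metadata={"val_auc": 0.87, "framework": "sklearn"},
    deployment_spec={"cpu": "2", "memory": "4Gi", "mode": "batch"},
))
```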
Inference Platform
Let me recap the breakdown of the components within the Inference Platform. It consists of two key aspects: the serving framework and the inference computation stack.
The serving framework refers to the framework used to “wrap” our models for deployment. To jump to the bottom line, we chose BentoML as our serving framework, allowing our models to be deployed in various ways, including batch, stream, and as a REST endpoint.
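As a rough sketch of what “wrapping” a model with BentoML looks like (using BentoML’s 1.x service/runner API; the model name, tag, and input schema are placeholders, and newer BentoML releases expose a different service decorator):

```python
import bentoml
import numpy as np
from bentoml.io import NumpyNdarray

# Load a previously saved model from the BentoML model store and wrap it in a runner.
# (The model would have been saved earlier, e.g. bentoml.sklearn.save_model("churn_model", model).)
churn_runner = bentoml.sklearn.get("churn_model:latest").to_runner()

svc = bentoml.Service("churn_service", runners=[churn_runner])


@svc.api(input=NumpyNdarray(), output=NumpyNdarray())
def predict(features: np.ndarray) -> np.ndarray:
    # The same service definition can be exposed as a REST endpoint
    # (`bentoml serve service:svc`) or invoked from batch and stream jobs.
    return churn_runner.predict.run(features)
```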
As for the computation stack, we observed the common path taken by many companies in the early stages of their platforms: utilizing Spark to provide “distributed inference” capabilities for handling large datasets.
We also noticed that, in the newer versions of their platforms, many of these companies are transitioning away from Spark in favor of Ray.
We conducted a POC to assess Ray’s suitability within our ecosystem. However, we encountered a significant roadblock: integration between Ray and Alluxio, our storage management abstraction layer.
Due to time constraints and user demands, we couldn’t resolve the integration conflict mentioned above, so we opted to leverage Spark temporarily to provide distributed inference capabilities.
This allowed our users to compute large datasets efficiently while we addressed the integration issues with Ray.
Industry insights also suggest that the performance differences between Spark and Ray for distributed inference computation are less significant compared to performing distributed training, where Ray offers greater advantages.
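For illustration, a common way to get distributed inference out of Spark is to broadcast the trained model and score each batch of rows with a pandas UDF. This is a minimal sketch, assuming a scikit-learn style model and the placeholder paths and column names from the earlier feature example.

```python
import joblib
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("churn-distributed-inference").getOrCreate()

# Load the trained model on the driver and broadcast it to all executors.
model = joblib.load("/models/churn_model.pkl")  # placeholder path
bc_model = spark.sparkContext.broadcast(model)


@pandas_udf(DoubleType())
def score(sessions_30d: pd.Series, revenue_30d: pd.Series) -> pd.Series:
    # Each executor scores a whole batch of rows at once using the broadcast model.
    features = pd.DataFrame({"sessions_30d": sessions_30d, "revenue_30d": revenue_30d})
    return pd.Series(bc_model.value.predict_proba(features)[:, 1])


# Score a large feature table in parallel across the cluster.
features_df = spark.read.parquet("/data/features/churn_user_features")  # placeholder path
scored = features_df.withColumn("churn_score", score("sessions_30d", "revenue_30d"))
scored.write.mode("overwrite").parquet("/data/predictions/churn_scores")
```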
I will also mention that we established an integration with run:ai as part of our Inference Platform to utilize GPUs for inference.
However, this integration isn’t limited to inference tasks, as GPU management is a requirement across various areas, including experimentation, training jobs, and other modules such as the Pipeline Engine, Jupyter Notebooks, and our Experiment Platform.
Deployment Manager
To recap our requirements, we sought an automated CI/CD process based on GitOps principles. This process needed to encompass the entire ML solution, including the feature set, the model, the ML pipeline, and the model monitoring configuration.
We opted to leverage ArgoCD in combination with Kubernetes operators to achieve this goal.
Git serves as our single source of truth, so we implemented a “self-recovery” mechanism that compares the state of our production environment with the Git repository.
In case of any discrepancies, the Git state takes precedence, and we rectify the inconsistency in the production environment.
The CI/CD process incorporates a Testing Pyramid and initiates various tests as part of it. These tests act as a strict condition for both deployment and merging code changes into the Git repository.
Additionally, we designed automated deployment rollback behavior in case tests fail, minimizing the likelihood of production incidents.
ML Monitoring
I’ll split this section into two pieces: monitoring classical models and monitoring generative AI models.
For classical models, we partnered with Aporia for production monitoring. Aporia offers several features that cater to our needs, including:
- Calculation of various model metrics
- A user interface that allows for data exploration (“slice and dice”)
- Detection of data behavior issues
- Drift detection
However, for generative AI models that generate images and text, we currently lack a proper infrastructure component for generic monitoring. Our users require functionalities like:
- Labeling and annotation capabilities
- Viewing the generated output of each run
- Marking the best experiment for each use case
- Determining the right parameters for specific use cases
Data scientists and machine learning engineers are currently performing these tedious tasks manually using various tools.
This manual approach needs to be formalized in the near future.
Chapter 6: Playtika’s ML Platform Roadmap
Now that you understand our current offerings, let’s dive into what we plan to offer our users in the future.
Our Roadmap
Rather than copy-pasting my wishlist of features for development, I’ll highlight the key functionalities we aim to deliver to Playtika and its users:
- Experimentation Platform for Generative AI: This platform will offer robust comparison and annotation capabilities.
- Generative AI Model Management: We’ll establish a dedicated system for management of generative AI models used for image and text generation.
- Enhanced Kubernetes Resource Management: We’ll improve our Kubernetes resource management by integrating with industry-standard frameworks. This will enable deployment of multi-model servers and reduce loading times for large models (often weighing several gigabytes). Additionally, we will introduce a new database to store vector outputs and facilitate efficient searching capabilities.
- Production Monitoring for Diffusion and LLM Models: This system will also support labeling production data for retraining future models.
- Distributed Training: To efficiently train our models without resource limitations, we plan to adopt Ray, the current industry standard for distributed training.
• Integration with Ray: If Ray is chosen for this task, we might revisit our technology selection for distributed inference. Like other players in the market, we might migrate from Spark to Ray for both training and inference.
- Deployment Automation: We’ll develop functionality to trigger our CI/CD process for new ML pipelines programmatically, either through REST calls or Kafka messages.
- Streaming (Near Real-Time) Prediction Serving: While we already offer this, we’re reevaluating the technology stack used by our streaming applications, specifically Akka Streams. This also encompasses real-time feature computation capabilities that may utilize Aerospike, Flink, Nuclio, or Storey (by MLrun).
- Pre-defined Templates: We’ll provide a library of pre-defined templates for training, batch inference, stream inference, and model explainability.
- Lean “Quick Start” Process: The process will incorporate AutoML capabilities, making our platform accessible to all Playtika employees, regardless of their machine learning background.
- Automatic Model Retraining: We’ll implement functionality to automatically trigger retraining pipelines based on model degradation detection. This will enable continuous training and elevate our MLOps maturity to level 2.
• This also requires online experimentation capabilities to evaluate the newly-trained models on real users.
Our goal is not only to implement these features, but also to contribute our platform, or at least some of its modules, back to the open-source community for the benefit of the broader ML industry.
Acknowledgements
We would like to thank the following team members of the machine learning platform team at Playtika for their contributions to our ML Platform:
- Pavel Nesterenko,
- Raman Siamionau,
- Yauheni Malashchytski,
- Uladzislau Kiyeuski,
- Adam Balcerek,
- Nikolay Bylnov,
- Pilip Padabedau,
- Adrian Buczyniak,
- Ivan Mazaliuk,
- and Aliaksandr Hamayunau
Additionally, many other infrastructure and platform teams have contributed to Playtika’s ML platform, and we appreciate their support:
- the data infrastructure team,
- Playtika’s data department,
- IT teams,
- Amit Gelber,
- Mikhail Kotov,
- Guy Einav,
- Tsila Bahar,
- Anton Lukashov,
- and Eugene Govzich
Additional thanks to our sponsors in this endeavor.
Last but not least, we would like to thank the early adopters that gave us crucial feedback and helped us improve our systems.
Big kudos to our data science teams and our ML engineering groups.
If you’re interested in working with us on building state-of-the-art ML systems, solving other complex challenges, and creating the world’s best gaming company, we’d love to hear from you!
Visit https://www.playtika.com/careers/ to see our job openings.