The Journey of Building Playtika’s ML Platform — Part 2
By Oded Nadel
Chapter 4: Market Research and Technologies
Let’s start by revisiting the different ML Platform components we discussed, this time with their proper names.
The ML Platform Modules: Setting the Terminology Straight
To explain the various modules of our platform and their roles, we delved into the ML Operations domain.
We assessed our maturity against Google’s MLOps maturity levels, which helped us establish the correct terminology for both internal (AI department) and external (Playtika stakeholders) communication.
We looked at various ML platforms, both cloud and on-premise, to see what “the rest of the world” is doing.
We explored and evaluated a variety of solutions in light of Playtika’s internal AI demands, considering both established use cases (mostly classical and unsupervised models) and future aspirations (generative text and image models) that were in development at that time.
We consulted Gartner specialists to learn about the tools and practices of major AI players. These findings fueled further research by our ML Platform Engineering Team, who assessed functionality, compatibility with Playtika’s existing infrastructure, and adherence to our Production Grade Standard requirements. These were our results:
Feature Platform
Prerequisites: Non-Cloud, Non-Proprietary
Our market research began with this reference: https://www.featurestore.org/.
Key Findings:
There are many (!) solutions that address our needs. However, user adoption likely hinges on a robust user interface, which often carries a premium cost.
In a wider view:
Our market research, beginning in 2022, identified several systems offering both a feature catalog and feature computation capabilities.
This realization led us to define our ideal solution as a comprehensive feature platform encompassing both functionalities.
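To make that distinction concrete, here is a minimal sketch of what a feature definition looks like in Feast, one of the open-source options catalogued on featurestore.org. The entity, source, and feature names are hypothetical, and the exact API varies between Feast versions; treat it as an illustration, not a description of our implementation.

```python
from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64

# Hypothetical entity: features are keyed by player_id
player = Entity(name="player", join_keys=["player_id"])

# Hypothetical offline source holding pre-computed purchase aggregates
purchases_source = FileSource(
    path="data/player_purchase_aggregates.parquet",
    timestamp_field="event_timestamp",
)

# The catalog entry: feature names, types, freshness (ttl), and lineage (source)
player_purchase_stats = FeatureView(
    name="player_purchase_stats",
    entities=[player],
    ttl=timedelta(days=7),
    schema=[
        Field(name="purchases_7d", dtype=Int64),
        Field(name="avg_purchase_amount_7d", dtype=Float32),
    ],
    source=purchases_source,
)
```

A catalog alone only documents features; the computation side is what actually materializes values like these into offline and online stores, which is why we wanted both in a single platform.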
Comparison Table:
Bottom Line:
Since our goal is to open source our platform, we opted against purchasing an external system to address the identified gaps. Instead, we embarked on internal development of our own user interface.
Experiment Platform
Prerequisites: Non-Cloud, Non-Proprietary
Key Findings:
The market offers a wealth of commercial and open-source tools!
Many solutions overlap with our existing infrastructure or are incompatible with our on-premise setup.
In a wider view:
Our research began before the 2023 boom in GenAI tools like ChatGPT and MidJourney. Excluding image generation and large language models (LLMs), common experiment requirements include:
- User collaboration
- Parallelism
- Metric visualization
- Experiment tracking
- Comparison
- Reproducibility
While additional requirements exist, a detailed exploration would be time-consuming. Our research identified suitable solutions that fit Playtika’s ecosystem, detailed in the table below.
However, a critical point emerged: user adoption hinges on a strong user interface; without one, a tool simply won’t cut it for our users.
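As an illustration of the tracking, comparison, and reproducibility requirements listed above, here is a minimal sketch using MLflow (which this chapter later mentions was our tool of choice at the time). The tracking URI, experiment name, parameters, and metrics are purely illustrative.

```python
import mlflow

# Illustrative on-premise tracking server and experiment name
mlflow.set_tracking_uri("http://mlflow.internal:5000")
mlflow.set_experiment("churn-model-tuning")

with mlflow.start_run(run_name="xgb-baseline"):
    # Parameters and metrics logged here become comparable across runs in the UI
    mlflow.log_params({"max_depth": 6, "learning_rate": 0.1, "n_estimators": 300})

    # ... train and evaluate the model here ...

    mlflow.log_metric("validation_auc", 0.87)
    mlflow.log_artifact("feature_importance.png")  # any file, e.g. plots or configs
```

In our experience the API surface above was rarely the gap; the user interface on top of it is what drives day-to-day adoption.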
Comparison Table:
With the recent surge in computer vision, image generation, and text generation capabilities, we’ve initiated a new Proof of Concept (POC) in the cloud. This will help us determine the necessary infrastructure additions to support these endeavors at Playtika.
Don’t get me wrong, we are already generating images and textual content using open-source and in-house models. However, our current infrastructure lacks the capacity to accelerate these processes and integrate them seamlessly across Playtika.
ML Pipeline Engine
Prerequisites: Non-Cloud, Non-Proprietary
The market offers a variety of solutions like Ploomber, Dagster.io, Flyte, Argo, Kedro, Kale, Luigi, and others. However, at the outset of our research, Playtika was already using Airflow. Kubeflow Pipelines also emerged as a strong contender, recommended by several external consultants.
It’s worth noting that MLRun hadn’t been released yet, so native integration with MLflow, our tool of choice at the time, wasn’t possible.
Evaluating Airflow and Kubeflow Pipelines proved challenging due to a lack of comprehensive online comparisons. Therefore, we initiated a POC with Kubeflow Pipelines, focusing on how well each product supports our specific requirements.
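For context on what we were evaluating, here is a rough sketch of a pipeline definition with the Kubeflow Pipelines SDK. The component names, base images, and parameters are illustrative, and the decorator syntax differs between KFP SDK versions (the sketch follows the v2 style).

```python
from kfp import compiler, dsl

@dsl.component(base_image="python:3.10")
def prepare_features(input_path: str) -> str:
    # ... read raw data, compute features, write them out ...
    return input_path + "/features"

@dsl.component(base_image="python:3.10")
def train_model(features_path: str, learning_rate: float) -> str:
    # ... train a model on the prepared features and return its artifact URI ...
    return "models/churn/latest"

@dsl.pipeline(name="churn-training-pipeline")
def churn_training(input_path: str = "/data/raw", learning_rate: float = 0.1):
    features = prepare_features(input_path=input_path)
    train_model(features_path=features.output, learning_rate=learning_rate)

# Compile to a spec that the Kubeflow Pipelines backend can schedule on k8s
compiler.Compiler().compile(churn_training, "churn_training_pipeline.yaml")
```

The Airflow equivalent would be a DAG of operators; the POC was meant to show which of the two maps more naturally onto our ML-specific requirements.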
Comparison Table:
While running this POC, we also explored ZenML’s solution for pipeline orchestration, which I’ll discuss in more detail in the next chapter.
Model Manager
Prerequisites: Non-Cloud, Non-Proprietary
In a comprehensive ML platform encompassing model development, experimentation, and deployment, you quickly grasp that the “glue” connecting both ends is actually your model registry.
This central repository stores user models, their engineering dependencies, production environment metadata, and other information that may seem irrelevant at first.
Recognizing this central role, we evaluated model registry functionality as part of our experiment tracking tool research.
The high level of similarity across solutions led us to conclude that a separate model registry investigation wasn’t necessary.
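To make the registry’s “glue” role more tangible, here is a minimal sketch of registering and promoting a model with MLflow’s model registry, in line with the experiment tracking tools discussed above. The run ID, model name, and tags are illustrative.

```python
import mlflow
from mlflow.tracking import MlflowClient

run_id = "<run-id-from-a-tracked-experiment>"  # placeholder

# Register the model artifact produced by an experiment run
registered = mlflow.register_model(
    model_uri=f"runs:/{run_id}/model",
    name="churn-classifier",
)

client = MlflowClient()

# Attach engineering and production metadata to the new version
client.set_model_version_tag("churn-classifier", registered.version, "owner", "ai-team")
client.set_model_version_tag("churn-classifier", registered.version, "serving_image", "inference-base:1.4")

# Promote the version so the deployment side knows what to pick up
client.transition_model_version_stage("churn-classifier", registered.version, stage="Staging")
```

The exact registry API matters less than the fact that every solution we reviewed exposes roughly this surface, which is what made a separate investigation unnecessary.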
Inference Platform
Prerequisites: Non-Cloud, Non-Proprietary
This research combined considerations for model serving frameworks, inference computation solutions, and deployment methods.
What does this mean? Here’s a breakdown of the key questions we addressed:
- Prediction Delivery Mechanisms: How will users access predictions? REST calls, Kafka messages, batch database reads? Our answer: all of the above.
- Data Scale for Predictions: What volume of data requires predictions? Hundreds, thousands, millions, billions? Our answer: all of the above. Solutions vary in their capacity, ranging from 100,000 to 100 million or more data points.
- Industry Trends: What serving frameworks are popular?
Key Findings:
The various serving frameworks share significant similarities, with key differentiators being out-of-the-box deployment methods and support for different model frameworks.
To broaden and clarify the picture:
Some frameworks excel in batch predictions, while others cater to streaming applications. However, these distinctions were not a major focus.
Our research delved deeper into inference computation and data processing requirements. For large-scale batch predictions involving PBs of data, we were a bit biased and naturally considered leveraging our existing Spark infrastructure.
Moreover, we examined the platform choices and user experiences of other companies, such as Spotify and Uber, which initially used Spark but are now shifting to Ray for their computation.
OpenAI’s reliance on Ray for both training and serving further solidified our interest in this technology.
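To illustrate why Ray caught our eye for large-scale batch inference, here is a minimal sketch using Ray Data. The paths, column names, and the stand-in model are hypothetical, and parameter details vary across Ray versions.

```python
import pandas as pd
import ray

ray.init()  # connects to an existing cluster if configured, otherwise starts a local one

class DummyModel:
    """Stand-in for a real model that would be pulled from the model registry."""
    def predict(self, features: pd.DataFrame) -> pd.Series:
        return (features.sum(axis=1) > 0).astype(int)

model = DummyModel()

def predict_batch(batch: pd.DataFrame) -> pd.DataFrame:
    # Each batch is scored independently, so the work spreads across the cluster
    batch["prediction"] = model.predict(batch.drop(columns=["player_id"]))
    return batch

ds = ray.data.read_parquet("/data/features/")                       # hypothetical feature snapshot
predictions = ds.map_batches(predict_batch, batch_format="pandas")
predictions.write_parquet("/data/predictions/")                     # hypothetical output location
```

This only covers the large-batch delivery mechanism; the REST and Kafka paths are where the serving frameworks in the comparison below come in.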
Comparison Table:
Deployment Manager
Prerequisites: Non-Cloud, Non-Proprietary, Compatible with existing CI/CD Infrastructure
This section reveals the most stringent limitations encountered. Playtika’s well-defined CI/CD process leverages a mature CI/CD infrastructure tailored for traditional software development, not specifically suited for the needs of ML Operations (MLOps).
A key challenge was determining the appropriate service model: Infrastructure as a Service (IaaS), Platform as a Service (PaaS), or Software as a Service (SaaS).
Key Findings:
We understood that maintaining a single source of truth for Kubernetes (k8s) state management that aligns with MLOps standards means we cannot use IaaS solutions.
Furthermore, our GitOps approach, currently supported by ArgoCD, only deals with native k8s entities (pods and services), which are only a portion of the ML solution.
Consequently, we evaluated a combination of solutions to address these requirements.
Comparison Table:
ML Monitoring
Prerequisites:
Let’s be upfront: Playtika already had an internal “AI monitoring” solution in place. However, adapting it to accommodate user requirements for both standard and custom model metrics proved impractical. Implementing the extensive calculations and visualizations (including future segmentation capabilities) for Grafana dashboards would have required significant back-end and front-end development resources.
We recognized that while our existing Prometheus and Grafana setup excels at real-time data monitoring and anomaly detection, it wasn’t well-suited for our broader monitoring needs.
This prompted us to explore the market as detailed in the table below.
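As an example of the kind of model-level calculation our generic Prometheus/Grafana setup was not built to express, here is a minimal sketch of a Population Stability Index (PSI) check for prediction drift. It is an illustrative metric, not a description of our actual monitoring logic.

```python
import numpy as np

def population_stability_index(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Compare a reference (training-time) score distribution with a production one."""
    # Bin edges come from the reference distribution's quantiles
    edges = np.quantile(expected, np.linspace(0.0, 1.0, bins + 1))

    # Clip production values into the reference range so every point falls in a bin
    actual = np.clip(actual, edges[0], edges[-1])

    expected_frac = np.histogram(expected, edges)[0] / len(expected)
    actual_frac = np.histogram(actual, edges)[0] / len(actual)

    # Avoid division by zero and log of zero for empty bins
    expected_frac = np.clip(expected_frac, 1e-6, None)
    actual_frac = np.clip(actual_frac, 1e-6, None)

    return float(np.sum((actual_frac - expected_frac) * np.log(actual_frac / expected_frac)))

# Example: a shifted production distribution yields a noticeably higher PSI
reference = np.random.default_rng(0).normal(0.4, 0.10, 50_000)
production = np.random.default_rng(1).normal(0.5, 0.12, 50_000)
print(population_stability_index(reference, production))
```

Multiplied across models, metrics, and future segmentation dimensions, this is the kind of logic that justified a dedicated ML monitoring tool rather than more Grafana dashboards.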
Comparison Table:
With the recent surge in GenAI in 2023, we are also actively exploring effective monitoring strategies for AI-generated images and text, which are starting to gain user interest.
Supporting the upcoming expansion of both initiatives across Playtika requires a fully automated solution for monitoring model predictions (image and text outputs) in our games.
While this chapter has focused on core platform functionalities, there are additional modules tailored to Playtika’s specific ecosystem.
Intrigued by our journey? Stay tuned for the next part of our blog.