Challenges in Applying Generative AI to Mobile Gaming
By Gali Hod, Eyal Regev, Michael Kolomenkin
In this blog post, we delve into the integration of Generative AI into the production of mobile games. The potential of Generative AI to revolutionize the crafting of game assets is undeniable, and the buzz around this revolution is hard to overlook. However, the journey towards implementing this technology in practice comes with its own set of obstacles. Our objective is to shed light on these practical challenges in this article. As we progress through this series, we’ll also explore potential solutions to address these challenges.
Challenge 1 – Creating a usable product
Existing generative AI solutions may not align with our clients’ specific needs, and they often demand a level of understanding or expertise that our clients may not possess.
The initial challenge lies in crafting a product that can be used by real people. These users are not concerned with the details of our technology; their focus is on the product’s simplicity, intuitiveness, quality of the outcomes, and the time required to achieve those outcomes.
Therefore, it’s crucial for us to understand our clients. We dedicate time to comprehend our clients’ workflows and identify areas where they encounter difficulties. This understanding varies between a product tailored for an artist and one tailored for a game operations manager.
Challenge 2 – Product Complexity
Many individuals tend to perceive Generative AI as a “give it a prompt, get a result” approach. They anticipate a single model to easily turn any text into an accurate image. However, for real-world usage, a single model call often cannot produce a usable image.
Let’s see an example of how this works on an in-game promotional image. Crafting a mobile game pop-up is complex. It must convey a concise message with appealing elements including visuals and texts. Backgrounds, characters, and effects collaborate to evoke emotions and context. Despite notable advancements in AI-powered creative generation and the continual emergence of new capabilities, tasks as intricate as these cannot be accomplished through a single text-to-image model call.
Our approach to tackling this challenge revolves around a “divide and conquer” strategy. We deconstruct the complex process into a cascade of interconnected, smaller sub-tasks. Each layer is constructed to synergize with the next. This structure enables the user, the operation manager, to choose between a holistic integration or selectively implementing distinct components, all customized to fulfill their individual requirements.
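The cascade idea can be sketched as a chain of small, swappable stages. The stage names and the dictionary-based scene format below are illustrative assumptions, not our actual implementation:

```python
# Minimal sketch of the "divide and conquer" idea: a pop-up is assembled by
# a cascade of small stages rather than one text-to-image call. Each stage
# enriches a shared scene description and hands it to the next.
from typing import Callable, Dict, List

Scene = Dict[str, object]
Stage = Callable[[Scene], Scene]

def generate_background(scene: Scene) -> Scene:
    scene["background"] = f"bg for '{scene['brief']}'"
    return scene

def place_character(scene: Scene) -> Scene:
    scene["character"] = "hero, posed per brief"
    return scene

def add_text_overlay(scene: Scene) -> Scene:
    scene["text"] = str(scene["brief"]).upper()
    return scene

def run_pipeline(stages: List[Stage], brief: str) -> Scene:
    scene: Scene = {"brief": brief}
    for stage in stages:          # each layer feeds the next
        scene = stage(scene)
    return scene

# An operations manager can run the full cascade or only selected stages.
full = run_pipeline([generate_background, place_character, add_text_overlay],
                    "weekend sale")
partial = run_pipeline([generate_background, add_text_overlay], "weekend sale")
```

The benefit of this structure is that a stage can be replaced or skipped without touching the rest of the cascade.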
Challenge 3 – Balancing Speed with Cutting-Edge Research
Data scientists consistently grapple with the trade-off between delving deeper into research and delivering swift value. Data scientists working on Generative AI in mobile gaming face this trade-off multiplied tenfold.
On one hand, when engaging with customers in the gaming industry, developing an actual product means providing updates every few weeks. Games evolve rapidly, and data scientists cannot afford to be absent for extended periods of time. Goals and needs might shift during this time.
On the other hand, Generative AI is a continuously evolving field of study. Novel technologies emerge daily, and it takes time to assess and validate their efficacy. Moreover, we also need to explore areas where pre-existing solutions are not available.
Challenge 4 – Fine-tuning models
Text-to-image models, such as Stable Diffusion, are trained on extensive internet data and perform well with familiar concepts. But what about unfamiliar subjects? What if we want to generate our game characters in the game style?
In such cases, we need to fine-tune existing models. Several techniques exist for fine-tuning, such as DreamBooth, Textual Inversion (TI), and Low-Rank Adaptation (LoRA).
These techniques trade off quality against memory and training time. DreamBooth fine-tunes the entire model, which takes more time and demands more powerful hardware and greater storage capacity. Textual Inversion learns only a new token embedding for the concept, leaving the model weights untouched. LoRA modifies only a small set of low-rank adapter parameters, resulting in much smaller models compared to DreamBooth.
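A toy numerical sketch shows why LoRA checkpoints are so much smaller than full fine-tunes. Instead of updating a full weight matrix, LoRA learns two low-rank factors whose product is added to the frozen weight. The layer dimensions and rank below are illustrative, not taken from any real model:

```python
import numpy as np

# Toy illustration of the LoRA idea: rather than updating a full
# d_out x d_in weight matrix (as DreamBooth effectively does), LoRA learns
# two low-rank factors B (d_out x r) and A (r x d_in); the effective
# weight is W + B @ A. Dimensions and rank here are illustrative.
d_out, d_in, rank = 768, 768, 8

W = np.random.randn(d_out, d_in)          # frozen pretrained weight
A = np.random.randn(rank, d_in) * 0.01    # trainable factor
B = np.zeros((d_out, rank))               # trainable factor, zero-initialized

def lora_forward(x: np.ndarray) -> np.ndarray:
    # Only A and B are trained; W never changes.
    return x @ (W + B @ A).T

full_params = W.size                      # 589,824 parameters
lora_params = A.size + B.size             # 12,288 parameters
print(lora_params / full_params)          # ~0.021: the trainable delta is ~2% of W
```

Because B starts at zero, the adapted layer initially behaves exactly like the pretrained one, and training only moves it as far as the data requires.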
Choosing the approach depends on numerous parameters – the number of characters, client requirements, frequency of modifications, screen size, and others.
Challenge 5 – Data organization
To effectively utilize these models, the path is clear – fine-tune them with our proprietary data. Yet, this undertaking presents challenges due to the absence of a standardized approach to organizing creative assets. What exact difficulties do we encounter?
Unified Data: Ensuring data unity is paramount. For characters, it’s essential to have the most recent and accurate version, while outdated iterations must be excluded. Backgrounds and assets need to align with the current relevant style and exhibit consistent image resolution. Does the collection of amassed assets genuinely resonate with the essence of the game? This is a question often beyond the grasp of the data scientist alone.
Layered Complexity: Much of the data consists of layered Photoshop files. Unfortunately, these files frequently lack proper organization, impeding the automated identification of specific layers.
Organization Void: The absence of systematic organization poses difficulties. Data accessibility is hindered, making it arduous to locate information and effectively filter duplicates.
Preprocessing Challenges: Extracting characters from diverse, large images requires careful consideration. Since these models expect input images of fixed size, it is paramount to ensure that essential character elements are not inadvertently cropped during preprocessing.
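A simple version of the safe-cropping step can be sketched as follows. Given a character’s bounding box inside a large source image, we pick a square crop window that fully contains the box (plus a margin) and stays inside the image, ready to be resized to the model’s fixed input size. The bounding box is assumed to come from an upstream detection step, and the margin value is an illustrative choice:

```python
# Sketch of safe cropping around a character bounding box.
# Coordinates are (left, top, right, bottom).

def safe_square_crop(img_w, img_h, box, margin=0.1):
    left, top, right, bottom = box
    bw, bh = right - left, bottom - top
    side = int(max(bw, bh) * (1 + 2 * margin))   # square side with margin
    side = min(side, img_w, img_h)               # cannot exceed the image
    # Center the window on the character, then clamp to the image bounds.
    cx, cy = (left + right) // 2, (top + bottom) // 2
    x0 = min(max(cx - side // 2, 0), img_w - side)
    y0 = min(max(cy - side // 2, 0), img_h - side)
    return (x0, y0, x0 + side, y0 + side)

# A character near the right edge of a 2048x1024 image: the window is
# shifted left so the character stays whole.
crop = safe_square_crop(2048, 1024, (1800, 100, 2000, 500))
```

Clamping rather than naively centering is what prevents characters near image borders from being cut off.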
Navigating these obstacles is essential for effectively leveraging text-to-image generative models.
Challenge 6 – Evaluating Quality
How do we determine if the model that we’ve developed is good enough?
CLIP score and FID (Fréchet Inception Distance) are metrics commonly used to assess generative models. CLIP is an OpenAI deep learning model that jointly understands images and text; the CLIP score measures how well a generated image matches its corresponding textual description. FID measures the likeness between real and generated image distributions within a feature space.
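To make FID concrete, here is a simplified sketch of the formula: the Fréchet distance between two Gaussians fitted to the feature distributions, FID = ||mu1 − mu2||² + Tr(C1 + C2 − 2·(C1·C2)^(1/2)). For simplicity we assume diagonal covariances (diagonal matrices commute, so the matrix square root becomes element-wise); real FID uses full covariances of Inception-network features:

```python
import numpy as np

# Simplified FID assuming diagonal covariances. Real FID computes full
# covariance matrices over Inception features; the random vectors here
# stand in for those features purely for illustration.
def fid_diagonal(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    var1, var2 = feats_real.var(axis=0), feats_gen.var(axis=0)
    mean_term = np.sum((mu1 - mu2) ** 2)
    cov_term = np.sum(var1 + var2 - 2.0 * np.sqrt(var1 * var2))
    return float(mean_term + cov_term)

rng = np.random.default_rng(0)
same = fid_diagonal(rng.normal(size=(1000, 16)),
                    rng.normal(size=(1000, 16)))
shifted = fid_diagonal(rng.normal(size=(1000, 16)),
                       rng.normal(loc=3.0, size=(1000, 16)))
# Near-identical distributions score close to zero;
# a shifted distribution scores far higher.
```

Lower is better: a score near zero means the generated distribution is statistically close to the real one in feature space.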
Despite the significance of these metrics, evaluating the generative models of our project introduces a distinct challenge. Traditional metrics fall short in capturing our specific objectives. While we emphasize preserving the artistic style consistent with the game’s unique visual language and ensuring character accuracy, these metrics do not completely align with our criteria.
The evaluation of generative models is subjective, particularly in our case. To address this, the potential inclusion of human evaluation from artistic experts emerges as a viable solution. Exploring human-in-the-loop evaluation offers a promising direction to tackle this intricate challenge, tapping into expert insights and bridging the gap between conventional metrics and the subtleties of our evaluation requirements.
Challenge 7 – Balancing simplicity vs. controllability
Striking a balance between the simplicity of the solution and the quality of the results introduces an additional obstacle. In intricate scenarios like designing elaborate pop-ups, user-oriented controllability becomes crucial for achieving top-notch results. However, it’s essential to consider the end user’s experience and avoid introducing overly complex tools that might require excessive expertise.
So, our focus is on enhancing the solution without making it complicated for users. For instance, when characters appear small in the image (zoomed-out), their facial quality might diminish. We can address this using inpainting, but we aim to spare users from additional effort. Our system automatically detects small faces and replaces them if necessary. This way, users don’t have to worry about it.
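The automatic small-face check can be sketched as a simple area-ratio rule: faces whose bounding box falls below a fraction of the image area are flagged for an inpainting pass. In practice the boxes would come from a real face detector; here they are given directly, and the 1% threshold is an illustrative choice:

```python
# Sketch of the automatic small-face check that triggers inpainting.
# Face boxes would come from a face detector in a real system.
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)

def faces_needing_inpaint(face_boxes: List[Box], img_w: int, img_h: int,
                          min_area_ratio: float = 0.01) -> List[Box]:
    image_area = img_w * img_h
    small = []
    for (l, t, r, b) in face_boxes:
        if (r - l) * (b - t) / image_area < min_area_ratio:
            small.append((l, t, r, b))   # too small: queue for face inpainting
    return small

# A zoomed-out character's 40x40 face in a 1024x1024 image gets flagged;
# a large 300x300 face does not.
flagged = faces_needing_inpaint([(0, 0, 40, 40), (500, 500, 800, 800)],
                                1024, 1024)
```

Keeping this decision automatic is what spares the user from ever thinking about face quality.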
We also address the control of character positioning in generated images, especially since achieving precise poses is often essential for conveying the intended user message within our domain. Such precise results are rarely achievable through text prompts alone. Both ControlNet and T2I-Adapter are neural networks that let us condition pre-trained large diffusion models on a reference image. To attain a character posed as requested by the user, we identify the relevant instruction from the prompt and leverage a collection of reference images to guide the generation process using a ControlNet network.
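The prompt-to-reference step can be sketched as a keyword lookup against a pose library; the chosen reference image would then (after pose extraction) condition a ControlNet alongside the text prompt. The pose library, keywords, and file paths below are illustrative assumptions, not our actual asset catalog:

```python
# Sketch of selecting a pose reference image from the user's prompt.
# The selected image would condition a ControlNet during generation;
# the library contents here are placeholders.
POSE_LIBRARY = {
    "jumping": "refs/pose_jump.png",
    "sitting": "refs/pose_sit.png",
    "waving":  "refs/pose_wave.png",
}

def pick_pose_reference(prompt: str,
                        default: str = "refs/pose_stand.png") -> str:
    prompt_lower = prompt.lower()
    for keyword, ref_path in POSE_LIBRARY.items():
        if keyword in prompt_lower:
            return ref_path
    return default

ref = pick_pose_reference("Hero character jumping over a treasure chest")
```

A lookup like this keeps the interface simple: the user only writes a prompt, while the system quietly supplies the structural guidance the diffusion model needs.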
In the ever-evolving realm of mobile gaming, the convergence of generative AI and virtual worlds continues to shape a thrilling new frontier. We’ve embarked on a journey that revealed the symbiotic dance between creativity and technology, unveiling AI’s potential to enhance the artistic visions of operation managers. As we navigate the intricacies of data collection, fine-tuning, evaluation challenges, and controllability, we witness firsthand how AI becomes an indispensable collaborator in delivering unforgettable player experiences. In our next blog post, we will dive deeper into the technical details of our research.