Game to Video: Using Generative AI to Uprender Gameplay

Read on for higher res videos and a how-to

Using gameplay footage to control Stable Diffusion’s output

Generative AI is all the rage right now. One area I’m fascinated by is ControlNet’s ability to control the output of Stable Diffusion, an image generation model, using depth maps and segmentation maps. Examples from ControlNet’s GitHub repo:

Depth maps and segmentation maps can be generated by game engines in real time. I decided to whip up a mini game in Godot, render some frames and movies, and see what uprendering gameplay could look like.
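To give a sense of what a depth map is outside the engine: a minimal numpy sketch that linearizes a nonlinear depth buffer and converts it to an 8-bit grayscale map, brighter = closer. The OpenGL-style buffer layout and the near/far plane values are assumptions for illustration; Godot’s exact conventions vary by version and renderer.

```python
import numpy as np

def linearize_depth(d, near=0.05, far=100.0):
    """Convert an OpenGL-style nonlinear depth value in [0, 1] to
    linear eye-space distance (assumed projection for illustration)."""
    ndc = 2.0 * d - 1.0  # [0, 1] -> normalized device coords [-1, 1]
    return (2.0 * near * far) / (far + near - ndc * (far - near))

def depth_to_map(depth_buffer, near=0.05, far=100.0):
    """Turn a float depth buffer into an 8-bit grayscale depth map,
    brighter = closer (the convention depth-conditioned models use)."""
    z = linearize_depth(depth_buffer, near, far)
    inv = 1.0 / z  # inverse depth: near surfaces get large values
    inv = (inv - inv.min()) / (inv.max() - inv.min() + 1e-8)
    return np.round(inv * 255).astype(np.uint8)

# Tiny 2x2 "buffer": one near fragment, three far ones.
buf = np.array([[0.1, 0.9], [0.9, 0.9]], dtype=np.float32)
print(depth_to_map(buf))  # near pixel maps to 255, far pixels to 0
```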

Starting with Image Frames

Below, you’ll see simple first-person images of a game with untextured models, rerendered as depth maps and segmentation maps. Scroll to the bottom for higher-resolution videos.
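The segmentation maps are just flat per-class colors. ControlNet’s segmentation model was trained on ADE20K-style color-coded maps, so each game object class should be rendered as a specific flat color. A sketch of that conversion below, assuming a per-pixel class-ID buffer exported from the engine; the class-to-color table is illustrative, not the real ADE20K palette.

```python
import numpy as np

# Illustrative palette only: the real ADE20K palette assigns fixed
# colors per semantic class (wall, floor, tree, ...).
PALETTE = {
    0: (120, 120, 120),  # wall
    1: (80, 50, 50),     # floor
    2: (4, 200, 3),      # foliage
}

def ids_to_seg_map(class_ids):
    """Convert a 2D buffer of per-pixel class IDs into an RGB
    segmentation map suitable as conditioning input."""
    h, w = class_ids.shape
    seg = np.zeros((h, w, 3), dtype=np.uint8)
    for cid, color in PALETTE.items():
        seg[class_ids == cid] = color
    return seg

ids = np.array([[0, 0], [1, 2]])
print(ids_to_seg_map(ids)[1, 1])  # color assigned to class 2 (foliage)
```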


ControlNet takes in conditioning input, such as edge maps, depth maps, and segmentation maps, to guide the generated media. One can even combine multiple controls for better direction, like a depth map plus a segmentation map. Starting with single frames under Multi-ControlNet, the following images were generated:
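A minimal Multi-ControlNet sketch using Hugging Face diffusers. The checkpoint IDs are real public models, but the prompt, file names, and conditioning scales are placeholders made up for illustration. Model loading and inference need a GPU, so everything heavy stays inside the function and only runs under the `__main__` guard.

```python
def uprender_frame(depth_path, seg_path, prompt):
    """Generate one stylized frame from a depth map and a segmentation
    map rendered out of the game. Sketch only: paths, prompt, and
    scales are placeholders."""
    import torch
    from PIL import Image
    from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

    # One ControlNet per conditioning signal; passing a list is what
    # enables Multi-ControlNet in the pipeline.
    controlnets = [
        ControlNetModel.from_pretrained(
            "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16),
        ControlNetModel.from_pretrained(
            "lllyasviel/sd-controlnet-seg", torch_dtype=torch.float16),
    ]
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        controlnet=controlnets,
        torch_dtype=torch.float16,
    ).to("cuda")

    result = pipe(
        prompt,
        image=[Image.open(depth_path), Image.open(seg_path)],
        # Per-control weights: lean harder on depth than segmentation.
        controlnet_conditioning_scale=[1.0, 0.8],
    )
    return result.images[0]

if __name__ == "__main__":
    frame = uprender_frame(
        "depth_0001.png", "seg_0001.png",
        "overgrown stone ruins, first person view, detailed")
    frame.save("uprendered_0001.png")
```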

Flickering videos when sequencing images

Taking the most promising result and generating a sequence of frames for a movie produced a lot of flickering:
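The flicker comes from each frame being generated independently, so the style jumps even when the gameplay barely moves. One simple, illustrative (not standard) way to put a number on it is the mean absolute difference between consecutive frames:

```python
import numpy as np

def flicker_score(frames):
    """Mean absolute per-pixel difference between consecutive frames.
    Independently generated frames score high even when the underlying
    scene is nearly static. Illustrative metric, not a standard one."""
    frames = np.asarray(frames, dtype=np.float32)
    return float(np.mean(np.abs(np.diff(frames, axis=0))))

# Two synthetic "videos": a static one and a flickering one.
static  = [np.full((4, 4), 100.0)] * 3
flicker = [np.full((4, 4), 100.0),
           np.full((4, 4), 200.0),
           np.full((4, 4), 100.0)]
print(flicker_score(static), flicker_score(flicker))  # prints: 0.0 100.0
```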

Incorporating TemporalKit would be the natural next step. But we’re pressed for time over here. Instead, I turned to Runway ML’s video-to-video.

Upgrading to Videos with Runway ML Video to Video

Below you’ll see the same first-person video rendered in three different modes: untextured, depth, and segmentation.

Taking this input, we explored a few different themes and prompts:

Orthographic camera for isometric views: not so successful in video

It seems many image generation models assume a sky backdrop. After searching for a bit, I found a few promising isometric models and used them to generate the following output. Lackluster compared to first-person, in my opinion.

Thanks for reading this far!

Stay tuned for part 2, and drop a comment below!

Video Dump
