Style transfer is a computer vision technique that recomposes the content of one image in the style of another. It has been a popular research topic at the intersection of machine learning and art, and we’ve seen everything from one-click photo editing apps to AI artwork that’s sold for hundreds of thousands of dollars.
The underlying problem is to disentangle style and content, or better yet, to let the user choose their preferred content and style independently. This is a challenging task that has inspired a wide variety of algorithms and architectures. Many methods borrow the encoder-decoder architecture popularized by sequence-to-sequence models for neural machine translation (NMT): an encoder compresses the input image into a feature representation, and a decoder renders that representation back into an image. Trained to reconstruct their input, these models learn a bottleneck representation of content that the decoder can then re-render in a different style.
Encoder-decoder models of this kind must learn to both recognize the content of an image and generate an appropriate style, which is less straightforward than it sounds. Obtaining parallel training datasets (the same scene rendered in every desired style) is difficult, and model architectures need to be robust to variations in feature set size, image dimensions, and training data quality.
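One popular way to work around the lack of parallel data is adaptive instance normalization (AdaIN), which injects style at the bottleneck by aligning the channel-wise statistics of the content features to those of the style features. Here is a minimal sketch in PyTorch (the surrounding encoder and decoder are omitted, and the eps constant is an assumption for numerical stability):

```python
import torch

def adain(content_feat: torch.Tensor, style_feat: torch.Tensor,
          eps: float = 1e-5) -> torch.Tensor:
    """Adaptive instance normalization: re-normalize content features
    so their per-channel statistics match those of the style features."""
    # Per-channel mean and std over the spatial dimensions, shape (b, c, 1, 1).
    c_mean = content_feat.mean(dim=(2, 3), keepdim=True)
    c_std = content_feat.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style_feat.mean(dim=(2, 3), keepdim=True)
    s_std = style_feat.std(dim=(2, 3), keepdim=True) + eps
    # Whiten the content features, then re-color them with style statistics.
    return (content_feat - c_mean) / c_std * s_std + s_mean
```

A decoder trained to invert the encoder then turns the re-normalized features back into a stylized image, with no paired examples required.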
Deep learning architectures that excel at image recognition, such as deep convolutional neural networks (CNNs), make excellent candidates for tackling this task. In fact, most neural style transfer approaches reuse a classification network pretrained on ImageNet (typically VGG) as a fixed feature extractor, and define loss functions that measure differences between the features extracted at different layers of the network. The goal is to weight these terms so that the output simultaneously matches the content of one image and the style of the other.
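As a concrete sketch (assuming torchvision’s pretrained VGG-19; the layer index and the choice of content layer are illustrative, not prescriptive), the content term compares feature maps from a single deep layer:

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19

# Pretrained VGG-19 used as a fixed feature extractor.
cnn = vgg19(weights="IMAGENET1K_V1").features.eval()

def extract_features(img: torch.Tensor, layers=(21,)) -> dict:
    """Run img through the network, saving activations at the given
    layer indices. Index 21 is conv4_2 in torchvision's vgg19.features."""
    feats, x = {}, img
    for i, layer in enumerate(cnn):
        x = layer(x)
        if i in layers:
            feats[i] = x
    return feats

def content_loss(gen_feat: torch.Tensor, content_feat: torch.Tensor) -> torch.Tensor:
    # Mean squared error between the generated and content feature maps.
    return F.mse_loss(gen_feat, content_feat)
```

The total objective is then a weighted sum, alpha * content_loss + beta * style_loss, where the style term is built from Gram matrices as described next.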
To measure style, we cannot simply compare pixels between the style image and the output, since the same brush stroke can appear anywhere in the frame. Instead, style is summarized with a Gram matrix computed from the feature maps at a given layer: entry (i, j) is the inner product of feature maps i and j over all spatial positions, so it records which features tend to activate together, not where they activate. If two feature maps are strongly correlated in the style image, the style loss pushes them to be equally correlated in the output.
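Continuing the sketch above (the normalization convention varies between papers, so the division here is one common choice rather than the definitive one):

```python
import torch
import torch.nn.functional as F

def gram_matrix(feat: torch.Tensor) -> torch.Tensor:
    # feat: (batch, channels, height, width) feature maps from one layer.
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    # G[i, j] = inner product of feature maps i and j over all positions,
    # so it captures which features co-occur regardless of location.
    gram = torch.bmm(f, f.transpose(1, 2))
    return gram / (c * h * w)  # normalization choice varies by paper

def style_loss(gen_feat: torch.Tensor, style_feat: torch.Tensor) -> torch.Tensor:
    # Squared difference between Gram matrices of the generated and style
    # images; summing this over several layers gives the full style term.
    return F.mse_loss(gram_matrix(gen_feat), gram_matrix(style_feat))
```

Summing the style loss over several layers, from shallow to deep, captures style statistics at multiple scales.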
Some of these style features may be quite subtle, for example fractal patterns in nature, a painter’s small brush strokes, or the overall mood of a painting. Others are more prominent, for example the shape of an object or its position in the frame. Subtle textures tend to show up in the Gram matrices of early layers, while shape and position are captured by the deeper layers of an image classification network.
As real-time inference on modern GPUs becomes cheaper, we expect to see new applications of this technology that democratize artistic creation. For example, developers may well build video game features that let players apply their own personal styles to virtual worlds, broadening the appeal of gaming to a wider audience. At the 2019 Game Developers Conference, Google announced Stadia, its cloud gaming service, and demonstrated a Style Transfer ML feature that re-skins game worlds in the styles of famous paintings in real time.