Automatic Selfie Segmentation and Style Transfer

Inspired by Automatic Portrait Segmentation for Image Stylization and Fast Style Transfer.

Selfies are dominating photography, so why not experiment in that space? As I ramp up on machine learning and neural networks, I apply a technique called object segmentation to my face with mixed results but a promising future. This is all heavily inspired by the paper Automatic Portrait Segmentation for Image Stylization by Xiaoyong Shen, et al.


  1. Extract frames from video
  2. Generate matte using portrait segmentation on each frame
    • Uses pixel level classification between two categories: foreground and background
  3. Style transfer on original frame for cartoon effect
  4. Cut out foreground by using the matte as a mask on the styled frame
  5. Composite new video by placing foreground over original video
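Steps 4 and 5 amount to alpha compositing: the matte acts as a per-pixel alpha channel that blends the styled foreground over the original frame. A minimal NumPy sketch with tiny stand-in arrays in place of real frames:

```python
import numpy as np

def composite(styled, original, matte):
    """Blend the styled foreground over the original frame.

    matte is a per-pixel alpha in [0, 1]: 1 = foreground, 0 = background.
    """
    alpha = matte[..., np.newaxis]  # broadcast the matte over the RGB channels
    return alpha * styled + (1.0 - alpha) * original

# Tiny stand-in "frames": 2x2 RGB images and a hard matte.
styled = np.ones((2, 2, 3))                  # pretend this is the stylized frame
original = np.zeros((2, 2, 3))               # pretend this is the raw frame
matte = np.array([[1.0, 0.0], [0.0, 1.0]])   # foreground on the diagonal

out = composite(styled, original, matte)
# Diagonal pixels come from the styled frame, the rest from the original.
```

In practice the matte is soft (values between 0 and 1), which is what makes edges like hair blend smoothly instead of looking cut out with scissors.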

This road was longer than I thought it would be. The standard neural networks that people are introduced to, built from fully connected (FC) layers, don’t suffice for images because they can’t scale to that many pixels. There are simply too many nodes in the network. Instead, we use the increasingly popular Convolutional Neural Networks, which are particularly good at classifying images. But rather than merely classifying the image as a face, we want to classify each pixel as either foreground or background.
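To see why fully connected layers blow up, just count the weights: one FC layer from a modest 640x480 RGB input to only 1,000 hidden units already needs nearly a billion parameters, while a convolutional layer shares a few small kernels across the whole image. Back-of-the-envelope arithmetic:

```python
# Weights in one fully connected layer: every input value
# connects to every hidden unit.
inputs = 640 * 480 * 3        # 921,600 input values (pixels x channels)
hidden = 1_000
fc_weights = inputs * hidden
print(fc_weights)             # 921,600,000 weights -- for a single layer

# A convolutional layer shares one small kernel across the image:
# e.g. 64 filters of size 5x5 over 3 input channels.
conv_weights = 64 * 5 * 5 * 3
print(conv_weights)           # 4,800 weights
```

The layer sizes here are illustrative, not from the paper, but the gap of five orders of magnitude is the point.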

For this, we use a Fully Convolutional Network to perform pixelwise predictions: pixels in, pixels out.

Convolutional Neural Networks (CNNs)

Neural Networks for images. These are connected layers of kernels (or filters) that detect features in a collection of pixels, such as edges. Imagine a kernel as a 5x5 matrix of values used to detect image properties at a specific section of the image, or receptive field. For more information, read these excellent articles:
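A convolution slides the kernel over every receptive field and sums the element-wise products. A bare-bones version with a 3x3 vertical-edge kernel (real frameworks do this far more efficiently, but the arithmetic is the same):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid (no-padding) 2D convolution of a grayscale image."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Element-wise multiply the receptive field by the kernel and sum.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A vertical-edge kernel responds where intensity changes left to right.
edge_kernel = np.array([[-1, 0, 1],
                        [-1, 0, 1],
                        [-1, 0, 1]])

# A 4x4 image with a hard vertical edge down the middle.
image = np.array([[0, 0, 1, 1]] * 4, dtype=float)
print(conv2d(image, edge_kernel))  # strong response across the edge
```

A CNN learns the kernel values instead of hand-designing them, stacking many such layers so later ones detect higher-level features than edges.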

Fully Convolutional Networks

CNNs were predominantly used to classify images. Is it a dog or a cat? Then came the problem of object segmentation: extracting the pixels that make up the dog or cat. A Fully Convolutional Network (FCN) solves that problem.

Xiaoyong Shen, et al. fine-tuned the reference FCN implementation specifically for portraits, and what you see in this post is a reimplementation of it, called the Portrait FCN.
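The FCN's final layer emits a score per class at every pixel, and the matte is just an argmax over the class axis. A toy illustration with two classes (background = 0, foreground = 1) and made-up score maps:

```python
import numpy as np

# Per-pixel class scores for a 2x2 "image": shape (classes, height, width).
# Channel 0 holds background scores, channel 1 holds foreground scores.
scores = np.array([
    [[0.9, 0.2],
     [0.1, 0.8]],   # background scores
    [[0.1, 0.8],
     [0.9, 0.2]],   # foreground scores
])

# Pixels in, pixels out: pick the higher-scoring class at each pixel.
matte = np.argmax(scores, axis=0)
print(matte)
# [[0 1]
#  [1 0]]
```

Upsample that decision map back to the input resolution and you have a binary matte for the frame.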


The matting isn’t perfect.

Matte Imperfections

As you can see here, the blotch in the top right is obviously not part of the selfie, yet it was classified as foreground, while the black blotch at the bottom is part of the foreground, yet it was classified as background. This is because our Portrait FCN isn’t doing the best job it could, but there are better solutions out there already, such as Portrait FCN+, which uses a fixed portrait trimap to assist the model when generating the matte.
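The trimap idea can be sketched by splitting a soft matte into three regions: confident foreground, confident background, and an unknown band the model gets to refine. (The thresholds below are arbitrary illustrations, not what the paper uses; Portrait FCN+ actually derives its extra channels from aligned portrait statistics.)

```python
import numpy as np

def trimap(soft_matte, lo=0.05, hi=0.95):
    """Split a soft matte into background (0), unknown (128), foreground (255)."""
    tri = np.full(soft_matte.shape, 128, dtype=np.uint8)  # default: unknown
    tri[soft_matte <= lo] = 0     # confidently background
    tri[soft_matte >= hi] = 255   # confidently foreground
    return tri

soft = np.array([[0.01, 0.50],
                 [0.99, 0.90]])
print(trimap(soft))
# [[  0 128]
#  [255 128]]
```

Concentrating the model's effort on the unknown band is what cleans up stray blotches like the ones above.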

I plan to take another approach however. More on that in the next experiment.

Style Transfer

Once we have the foreground, we use Logan Engstrom’s style transfer to take the aesthetic from a painting, like the one shown below, and intelligently apply it to a photo, resulting in a cartoon-like effect. A style reminiscent of Roger Rabbit.


Fox Udnie

Here’s an example of style transfer on an entire video before matting:

Wrap Up

This gave me fantastic exposure to machine learning on media and the world of Convolutional Neural Networks. I plan to continue experimenting in this space, and even forked over for an external GPU (eGPU):

eGPU 2017 Comparison

Up until now, I’ve been using the amazing FloydHub, the Heroku for Deep Learning. Definitely check it out. Even if you have your own hardware, it’s great to have some NVIDIA K80s at your disposal to speed up experiments.
