The Making of self-contained

A single frame from self-contained II

The imagery in the self-contained series is generated using a type of machine learning, an artificial neural network. In the past decade, machine learning and neural networks have become synonymous with “artificial intelligence.” Machine learning relies on enormous amounts of data and computing power to learn patterns in data, rather than explicit instructions from a programmer. From what it’s learned from the data it was fed, a machine learning algorithm can then make “intelligent” judgements or predictions about data it’s never seen before.

There are a variety of machine learning and neural network models used for different tasks, but there is a specific class of neural networks often used to generate things like sound, video or images. These neural networks are called Generative Adversarial Networks, or GANs. The specific GAN used for this project is an open source algorithm called pix2pixHD, a higher-resolution variant of the original pix2pix algorithm.

pix2pix is used to synthesize images through a process of image translation: present an input image to a trained pix2pix model, and pix2pix will interpret that input and translate it into an output image. You “teach” pix2pix how to translate one image into another with data in the form of image pairs. Often, the more data you use, the more “accurate” the image translation will be. Of the examples below, for instance, the Labels to Street Scene (top left) was trained with a data set of around 3,000 image pairs. The Edges to Handbags instance, however, used a data set of approximately 137,000 pairs. After what can be more than a hundred hours of analyzing these image pairs over and over again, pix2pix will eventually learn how to translate the input into a somewhat accurate output.


Some example uses of pix2pix (source:


For self-contained, I was interested in seeing how an artificially intelligent algorithm would learn to interpret images of abstract body structures as fully-realized photographic versions of my body.


A sample input and a synthesized output image from my data set


Could an AI learn to make sense of these simple dot patterns as me? Could it learn to create photo-realistic images of me from such a sparse amount of input data? Maybe the question isn’t whether or not it “could,” but how it would do such a thing. How will this AI, using pix2pix, learn to make sense of a highly abstracted version of my body?

In order to train pix2pix to do this, I needed to create a data set. To get the dot-skeletons I needed for my input data, I used a Kinect V2 with skeleton tracking, and with some tweaks to the code, I was able to extract X+Y coordinate data for each of the fourteen dots. I wrote an openFrameworks program to turn that coordinate data into colored dots and export them as image files.
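As a rough illustration of the step described above, here is a minimal Python sketch of turning joint coordinates into an image of colored dots. It is not the openFrameworks program used for the project: the canvas size, dot radius, one-hue-per-joint color scheme, and the function names `joint_color`, `render_dots`, and `write_ppm` are all assumptions made for the example. Only the idea of drawing one dot per tracked joint comes from the text.

```python
# Hypothetical sketch: rasterize 2D joint coordinates as colored dots.
# Canvas size, dot radius, and the color scheme are assumptions for illustration.
import colorsys

WIDTH, HEIGHT = 256, 256   # assumed canvas size in pixels
RADIUS = 3                 # assumed dot radius in pixels


def joint_color(index, total):
    """Give each joint its own hue so the dots can be told apart."""
    r, g, b = colorsys.hsv_to_rgb(index / total, 1.0, 1.0)
    return (int(r * 255), int(g * 255), int(b * 255))


def render_dots(joints, width=WIDTH, height=HEIGHT, radius=RADIUS):
    """Return a height x width grid of RGB tuples with one filled dot per joint.

    `joints` is a list of (x, y) integer pixel coordinates.
    """
    image = [[(0, 0, 0)] * width for _ in range(height)]
    for i, (cx, cy) in enumerate(joints):
        color = joint_color(i, len(joints))
        # Only visit the bounding box of the dot, clipped to the canvas.
        for y in range(max(0, cy - radius), min(height, cy + radius + 1)):
            for x in range(max(0, cx - radius), min(width, cx + radius + 1)):
                if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                    image[y][x] = color
    return image


def write_ppm(image, path):
    """Save the grid as a binary PPM file, a format that needs no extra libraries."""
    height, width = len(image), len(image[0])
    with open(path, "wb") as f:
        f.write(f"P6 {width} {height} 255\n".encode("ascii"))
        for row in image:
            f.write(bytes(channel for pixel in row for channel in pixel))
```

Calling `write_ppm(render_dots(joints), "frame_0001.ppm")` once per tracked frame would produce one dot image per frame; the file name here is just a placeholder.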


preparing to record the first data set

A Sony Alpha A7S II rises just above a Kinect V2 so the two devices can record from nearly the same perspective. The camera is connected to a monitor that allowed me to watch myself to make sure my body stayed within the bounds of the black backdrop. This self-observation during the recording process heavily (and, admittedly, unintentionally) influenced my movements.


While recording my skeleton data, I recorded regular video of myself in order to capture the images of my body that correspond to the dot-pattern images. Recording video rather than taking a series of still portraits allowed me to easily generate tons of images: tens of thousands, which was plenty for pix2pix to learn how to construct my body from fourteen colored dots.
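The text does not say how the skeleton frames and video frames were matched up. One plausible approach, sketched below purely as an assumption, is to pair each skeleton frame with the video frame whose timestamp is nearest, and to drop any skeleton frame that has no video frame close enough. The function name `pair_by_timestamp` and the `max_gap` tolerance are hypothetical.

```python
# Hypothetical sketch: align two recordings by nearest timestamp.
# This is an assumed method, not a description of how the data set was built.
from bisect import bisect_left


def pair_by_timestamp(skeleton_times, video_times, max_gap=1 / 60):
    """Pair each skeleton frame with the nearest video frame.

    Both arguments are lists of timestamps in seconds, sorted ascending.
    Returns a list of (skeleton_index, video_index) tuples. Skeleton frames
    with no video frame within `max_gap` seconds are left out.
    """
    pairs = []
    for i, t in enumerate(skeleton_times):
        j = bisect_left(video_times, t)
        # The nearest video frame is either just before or just after t.
        candidates = [k for k in (j - 1, j) if 0 <= k < len(video_times)]
        if not candidates:
            continue
        best = min(candidates, key=lambda k: abs(video_times[k] - t))
        if abs(video_times[best] - t) <= max_gap:
            pairs.append((i, best))
    return pairs
```

Each resulting index pair would identify one dot image and one photographic frame that belong together as a training pair.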


excerpts from the data set

Though more outfits and sets of movements were recorded, seven made it into the final data set.



















I. Q: Can a machine interpret biological motion?

At this point in the machine learning field, a good place to start for both scientific inquiry and creative exploration is often some version of the question: can a machine do X? This leads to the more interesting question of how a machine might (or might not) do X.

So I asked the question: can a machine interpret biological motion the way a human does? The above GIFs are often used to demonstrate our ability to detect biological motion from a small amount of information. It only takes a split second for your brain to perceive human motion in various states of activity from just the movement of a few dots. In the same way we might imagine a fully-realized human from these animations, I wanted to see if a machine could be taught to imagine human forms based on this motion.

I decided to use a generative adversarial neural network called pix2pix. Basically, pix2pix allows you to teach a machine any sort of image-to-image relationship. For instance, if you show pix2pix 30,000 image pairs of a black and white photo and its corresponding color version, pix2pix will learn how to colorize new black and white photos it has never seen before. Alternatively, you could show pix2pix photos of cats or shoes or handbags with their corresponding line drawing versions, and pix2pix will learn how to turn your crude line drawing into a photorealistic representation. This project by Memo Akten is one of my favorite works using pix2pix. You could think of it as a way to engineer a highly acute bias in the way a machine sees the world; a visual filter of sorts.

Could pix2pix be used to make woke sunglasses?


II. Q: Can a machine interpret biological motion? A: Yes, but not with pix2pix

I settled on the image-to-image relationship to teach pix2pix: dots-to-bodies. But at this point, any questions relating to the concept of the perception of biological motion were no longer relevant, given that pix2pix could only process single images at a time. pix2pix, unlike some neural networks, has no concept of time. I had to revise my question:

If I showed a machine enough images of my body, each with its corresponding image of dots, would it be able to construct new bodies from new dot patterns?


III. Creating a dataset



Machine learning requires enormous datasets in order to adequately detect patterns in data. So I needed lots of images of dots and bodies.

The first dataset I created: 3,636 image pairs









self-contained utilizes the Pix2PixHD neural network to generate speculative physiologies. I trained it on image pairs of motion capture dots captured using a Kinect V2, and video frames of me in various outfits, adopting various personae through movement. The system has been trained to associate multiple personae with these simple dot patterns. Once the system is trained, feeding my original movements back into the system forces it to make decisions about which of my selves it must present, and how it should connect my limbs.


The system was trained with approximately 34,000 image pairs, so 68,000 images in total. Using openFrameworks, I have written custom code to manipulate the orientations of the dots that form my skeleton, allowing me to create entirely new forms of human movement only possible in the eye of the neural network.
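To make the idea of manipulating the skeleton's dots concrete, here is a minimal Python sketch of one such manipulation: rotating a limb's dots around a pivot joint. It is only an illustration, not the custom openFrameworks code mentioned above, and the function names `rotate_about` and `rotate_limb` as well as the joint indices in the usage note are hypothetical.

```python
# Hypothetical sketch: rotate part of a dot skeleton around a pivot joint.
# Names and joint indices are illustrative assumptions.
import math


def rotate_about(point, pivot, degrees):
    """Rotate a 2D point around a pivot by the given angle in degrees."""
    angle = math.radians(degrees)
    dx, dy = point[0] - pivot[0], point[1] - pivot[1]
    return (
        pivot[0] + dx * math.cos(angle) - dy * math.sin(angle),
        pivot[1] + dx * math.sin(angle) + dy * math.cos(angle),
    )


def rotate_limb(joints, pivot_index, limb_indices, degrees):
    """Return a copy of `joints` with the listed joints rotated around the pivot.

    `joints` is a list of (x, y) coordinates; the original list is not modified.
    """
    pivot = joints[pivot_index]
    moved = list(joints)
    for i in limb_indices:
        moved[i] = rotate_about(joints[i], pivot, degrees)
    return moved
```

For example, if index 4 were a shoulder and indices 5 and 6 the elbow and hand, `rotate_limb(joints, 4, [5, 6], 180)` would swing the whole arm to a position the body may never have held during recording, while leaving every other dot in place.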

After training, my original movements are fed back into the system
An excerpt from the “yellow” dataset
An excerpt from the “green” dataset
An excerpt from the “orange” dataset


2018-10-07: Initial results

Live-testing the trained model using the pix2pix example in the ofxMSATensorFlow addon 

Generating results by testing single-dot deviations from a model I know generates something coherent.

Trying novel input from the ground up. Absolutely demonic!

Based on a training set of 3,636 image pairs, the model isn’t all that great yet at creating new body forms with novel input (novel input being arrangements of white dots not used in the training data). Still, with a more extensive training set, the sparse white-dot input could do a pretty good job of generating coherent bodies.

Above, you’ll see I’m only manipulating one dot at a time, seeing how far a new dot can be from a previous dot before the imagery becomes complete spaghetti. Right now, the margin is pretty tight.
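The one-dot-at-a-time test described above can be sketched in Python as a sweep that moves a single dot progressively farther from its original position while every other dot stays put. This is an illustrative sketch only; the function name `single_dot_sweep` and its parameters are assumptions, and the actual testing was done live rather than with a script like this.

```python
# Hypothetical sketch: generate test inputs that differ from a known-good
# skeleton by one dot, at increasing distances.


def single_dot_sweep(joints, index, direction, steps, step_size):
    """Yield (offset, skeleton) pairs with one dot moved farther each step.

    `joints` is a list of (x, y) coordinates known to produce a coherent image.
    `direction` is a (dx, dy) unit step; the dot at `index` moves along it.
    The original list is never modified.
    """
    dx, dy = direction
    x, y = joints[index]
    for n in range(1, steps + 1):
        offset = n * step_size
        moved = list(joints)
        moved[index] = (x + dx * offset, y + dy * offset)
        yield offset, moved
```

Feeding each generated skeleton to the trained model in order, and noting the offset at which the output stops looking like a body, would give a rough measure of how tight the margin is for that dot.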

The training data looks like this:


Video of the complete data set. 6x speed. 3,636 512×256 images. 

Not exactly comprehensive of the range of my motion, but enough for the network to learn what arrangements of dots make up what human forms.
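The 512×256 size suggests that each training image holds a 256×256 dot image and a 256×256 photograph side by side, which is the combined-pair layout the pix2pix-tensorflow implementation reads. The Python sketch below shows that layout on plain pixel grids. Which half holds the dots is an assumption here, and `combine_side_by_side` is a hypothetical helper, not part of any library.

```python
# Hypothetical sketch: join an input image and a target image into one
# side-by-side training image. Images are lists of rows of RGB tuples.


def combine_side_by_side(left, right):
    """Return a new image with `left` and `right` placed next to each other.

    Both images must have the same height. The result is as wide as the two
    inputs together, so two 256x256 images become one 512x256 image.
    """
    if len(left) != len(right):
        raise ValueError("images must have the same height")
    return [left_row + right_row for left_row, right_row in zip(left, right)]
```

With this layout a single file carries both halves of a pair, so the input and its target cannot drift out of sync in the data set folder.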

2018-10-05: Live testing novel input?

Not yet. The training was a success, in that I was able to feed in my data, and the model it spat out seems decently trained!

But in order to test it, I need to figure out how to feed my model into the openFrameworks pix2pix example included in ofxMSATensorFlow.

Here are some links I’ve gone through to try to get things working in openFrameworks.

This page is important for setting things up:

This page is important for preparing my pre-trained models:

This paragraph, which I somehow missed while thoroughly skimming this page (can one thoroughly skim?), ended up being incredibly crucial in exporting the frozen graph model I need to feed into openFrameworks. Spent hours trying to figure this out. Turns out I was reading Christopher Hesse’s original pix2pix-tensorflow page, not Memo’s fork, which had this invaluable code snippet.

2018-10-01: Beginning training!

After a lengthy setup process on the computers, making sure nvidia-docker2 was correctly installed (thanks Kyle Werle), it’s time to try out the training.

I think this project depends on getting the right kind of results from my training data. Whatever that might look like, the first hurdle is, of course, getting the training up and running. I’m going to post a list of links here that I accessed/used at various points to get this going.

  1. The neural network I’m using, Pix2Pix-tensorflow (courtesy of Memo Akten, courtesy of [add original torch implementation author here]):
  2. Notes from Christopher Baker on getting this up and running on Linux (thanks Chris!):
  3. (haven’t tried this yet) High-resolution (1024×1024) adaptation of standard Pix2Pix:×1024-37b90c1ca7e8,


2018-09-03: The first steps!

So, the first part of this project is getting pix2pix to output what I need. Well,



Could I ask an AI to interpret biological motion from a sparse set of dots the way we humans are able to?