The Making of self-contained



I. Q: Can a machine interpret biological motion?

At this point in the field of machine learning, a good place to start for both scientific inquiry and creative exploration is often some version of the question: can a machine do X? This leads to the more interesting question of how a machine might (or might not) do X.

So I asked the question: can a machine interpret biological motion the way a human does? The above GIFs are often used to demonstrate our uncanny ability to detect biological motion from an incredibly small amount of information. It takes only a split second for your brain to see humans in various states of activity from just the movement of a few dots. Just as we might imagine a fully realized human from these animations, I wanted to see if a machine could be taught to imagine human forms based on this motion.

I decided to use a generative adversarial neural network called pix2pix. Basically, pix2pix allows you to teach a machine any sort of image-to-image relationship. For instance, if you show pix2pix 30,000 image pairs, each a black and white photo with its corresponding color version, it will learn how to colorize black and white photos it has never seen before. Alternatively, you could show pix2pix photos of cats or shoes or handbags with their corresponding line drawings, and it will learn how to turn your crude line drawing into a photorealistic representation. This project by Memo Akten is one of my favorite works using pix2pix. You could think of it as a way to engineer a highly acute bias in the way a machine sees the world; a visual filter of sorts.

Could pix2pix be used to make woke sunglasses?


II. Q: Can a machine interpret biological motion? A: Yes, but not with pix2pix

I settled on the image-to-image relationship to teach pix2pix: dots-to-bodies. But at this point, any questions relating to the concept of the perception of biological motion were no longer relevant, given that pix2pix could only process single images at a time. pix2pix, unlike some neural networks, has no concept of time. I had to revise my question:

If I showed a machine enough images of my body, each paired with its corresponding image of dots, would it be able to construct new bodies from new dot patterns?


III. Creating a dataset


Machine learning requires enormous datasets in order to adequately detect patterns in data. So I needed lots of images of dots and bodies.

The first dataset I created: 3,636 image pairs
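To make the idea of a dots-to-bodies image pair concrete, here is a minimal sketch of how one pair could be assembled. It assumes each half is 256×256 and that the two halves sit side by side in a single 512×256 image, with the dots on the left; the joint positions and the flat gray "body" frame are made-up stand-ins, and this is not the code used for the project.

```python
from typing import List, Tuple

Image = List[List[int]]  # rows of grayscale pixel values, 0-255

SIZE = 256  # assumed side length of each half of a pair


def render_dots(joints: List[Tuple[float, float]], radius: int = 3) -> Image:
    """Draw each joint (normalized x, y in [0, 1]) as a white square on black."""
    img = [[0] * SIZE for _ in range(SIZE)]
    for nx, ny in joints:
        cx, cy = int(nx * (SIZE - 1)), int(ny * (SIZE - 1))
        for y in range(max(0, cy - radius), min(SIZE, cy + radius + 1)):
            for x in range(max(0, cx - radius), min(SIZE, cx + radius + 1)):
                img[y][x] = 255
    return img


def make_pair(dots: Image, body: Image) -> Image:
    """Concatenate two images of equal height side by side: dots left, body right."""
    if len(dots) != len(body):
        raise ValueError("images must have the same height")
    return [d_row + b_row for d_row, b_row in zip(dots, body)]


joints = [(0.5, 0.2), (0.5, 0.5), (0.3, 0.8)]     # made-up joint positions
body_frame = [[128] * SIZE for _ in range(SIZE)]  # stand-in for a video frame
pair = make_pair(render_dots(joints), body_frame)
```

In practice the right half would be the matching video frame, and the left half would come from the motion capture data recorded at the same moment.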








self-contained uses the Pix2PixHD neural network to generate speculative physiologies. I trained it on image pairs: motion capture dots recorded with a Kinect V2, and video frames of me in various outfits, adopting various personae through movement. The system has been trained to associate multiple personae with these simple dot patterns. Once the system is trained, feeding my original movements back into it forces it to decide which of my selves to present and how to connect my limbs.


The system was trained with approximately 34,000 image pairs, so 68,000 images in total. Using openFrameworks, I have written custom code to manipulate the orientations of the dots that form my skeleton, allowing me to create entirely new forms of human movement only possible in the eye of the neural network.
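As an illustration of what manipulating the orientation of skeleton dots can mean, here is a minimal sketch that swings part of a limb around a pivot joint. The project's own code is written in openFrameworks (C++); this Python version, the joint names, and the coordinates are all hypothetical and only show the underlying 2D rotation.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def rotate_about(point: Point, pivot: Point, degrees: float) -> Point:
    """Rotate `point` around `pivot` by `degrees` (standard 2D rotation)."""
    angle = math.radians(degrees)
    dx, dy = point[0] - pivot[0], point[1] - pivot[1]
    return (
        pivot[0] + dx * math.cos(angle) - dy * math.sin(angle),
        pivot[1] + dx * math.sin(angle) + dy * math.cos(angle),
    )


def rotate_limb(skeleton: Dict[str, Point], pivot_name: str,
                moved: List[str], degrees: float) -> Dict[str, Point]:
    """Return a copy of the skeleton with the `moved` joints swung around one pivot."""
    pivot = skeleton[pivot_name]
    result = dict(skeleton)
    for name in moved:
        result[name] = rotate_about(skeleton[name], pivot, degrees)
    return result


# Hypothetical three-joint arm, in pixel coordinates (y grows downward).
arm = {"shoulder": (100.0, 100.0), "elbow": (150.0, 100.0), "hand": (200.0, 100.0)}
bent = rotate_limb(arm, "elbow", ["hand"], 90.0)
```

Because image coordinates put y pointing down, a positive angle here looks clockwise on screen. Rotating joints past what a real elbow or knee allows is one way to produce dot patterns the network never saw during training.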

After training, my original movements are fed back into the system
An excerpt from the “yellow” dataset
An excerpt from the “green” dataset
An excerpt from the “orange” dataset


2018-10-07: Initial results

Live-testing the trained model using the pix2pix example in the ofxMSATensorFlow addon 

Generating results by testing single-dot deviations from a model I know generates something coherent.

Trying novel input from the ground up. Absolutely demonic!

Based on a training set of 3,636 images, the model isn’t all that great yet at creating new body forms with novel input (novel input being arrangements of white dots not used in the training data). Still, with a more extensive training set, the sparse white-dot input could do a pretty good job of generating coherent bodies.

Above, you’ll see I’m only manipulating one dot at a time, seeing how far a new dot can be from a previous dot before the imagery becomes complete spaghetti. Right now, the margin is pretty tight.
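The one-dot-at-a-time test can be sketched roughly as follows: start from a pose known to produce a coherent body, then generate variants in which a single dot drifts progressively farther from its original position. The function name, the pose, and the step values are hypothetical, and the call that would feed each variant to the trained model is left out.

```python
import math
from typing import Iterator, List, Tuple

Point = Tuple[float, float]


def single_dot_deviations(pose: List[Point], index: int, direction: Point,
                          steps: int, step_size: float) -> Iterator[List[Point]]:
    """Yield copies of `pose` with dot `index` moved farther along `direction` each step."""
    length = math.hypot(direction[0], direction[1])
    if length == 0:
        raise ValueError("direction must be non-zero")
    ux, uy = direction[0] / length, direction[1] / length  # unit direction
    x0, y0 = pose[index]
    for k in range(1, steps + 1):
        variant = list(pose)  # all other dots stay where they were
        variant[index] = (x0 + ux * k * step_size, y0 + uy * k * step_size)
        yield variant


# Made-up three-dot pose, in pixel coordinates.
known_good = [(128.0, 40.0), (128.0, 120.0), (100.0, 200.0)]
variants = list(single_dot_deviations(known_good, index=2,
                                      direction=(1.0, 0.0), steps=5, step_size=4.0))
```

Rendering each variant and running it through the model, in order, would show roughly how far a dot can move before the output stops being coherent.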

The training data looks like this:


Video of the complete data set. 6x speed. 3,636 512×256 images. 

Not exactly comprehensive of the range of my motion, but enough for the network to learn what arrangements of dots make up what human forms.

2018-10-05: Live testing novel input?

Not yet. The training was a success, in that I was able to feed in my data, and the model it spat out seems decently trained!

But in order to test it, I need to figure out how to feed my model into the openFrameworks pix2pix example included in ofxMSATensorFlow.

Here are some links I’ve gone through to try to get things working in openFrameworks.

This page is important for setting things up:

This page is important for preparing my pre-trained models:

This paragraph, which I somehow missed while thoroughly skimming this page (can one thoroughly skim?), ended up being crucial to exporting the frozen graph model I need to feed into openFrameworks. I spent hours trying to figure this out. It turns out I was reading Christopher Hesse’s original pix2pix-tensorflow page, not Memo’s fork, which had this invaluable code snippet.

2018-10-01: Beginning training!

After a lengthy setup process on the computers, making sure nvidia-docker2 was correctly installed (thanks Kyle Werle), it’s time to try out the training.

I think this project depends on getting the right kind of results from my training data. Whatever that might look like, the first hurdle is, of course, getting the training up and running. I’m going to post a list of links here that I accessed/used at various points to get this going.

  1. The neural network I’m using, Pix2Pix-tensorflow (courtesy of Memo Akten, courtesy of [add original torch implementation author here]):
  2. Notes from Christopher Baker on getting this up and running on Linux (thanks Chris!):
  3. (haven’t tried this yet) High-resolution (1024×1024) adaptation of standard Pix2Pix:×1024-37b90c1ca7e8,


2018-09-03: The first steps!

So, the first part of this project is getting pix2pix to output what I need. Well,



Could I ask an AI to interpret biological motion from a sparse set of dots the way we humans are able to?