The Making of self-contained


A single frame from self-contained II


“Thinking with tools, and in this case, working with the fixed capital of advanced technologies, is a good thing. It is part of the genesis of our species. It is how we mediate the world and are mediated by it; we become what we are by making that which in turn makes us.”


-Benjamin Bratton, The Stack.



self-contained is an exploration of becoming; an examination of the process of self-formation through understanding the tools we use; how we make them and in turn how they make us. It is difficult to discuss the meaning of this work without also foregrounding the process of its development, which presents a dilemma, since the way the work has been shown and presented has excluded any sort of explanation of how it was made. But that’s part of it, I think.



Excerpts from self-contained II as it was installed for the 2019 SAIC MFA Show



I. Machine Learning (briefly)

The imagery in the self-contained series is created using a type of machine learning: an artificial neural network. In the past decade, machine learning and neural networks have become synonymous with “artificial intelligence.” Machine learning describes general learning algorithms that rely on enormous amounts of data and computing power to learn patterns in data, rather than explicit instructions from a programmer. From what it learns about the data it is fed, a machine learning algorithm can then make “intelligent” judgements or predictions about data it’s never seen before.


There are a variety of machine learning and neural network models used for different tasks, but there is a specific class of neural networks often used to generate things like sound, video or images. These neural networks are called Generative Adversarial Networks, or GANs. GANs are the architecture behind deepfakes and these photo-real faces of non-existent humans. In the case of these image-generating neural networks, we mere mortals cannot control what images these algorithms create. We can set the stage by providing these algorithms with data we curate, but the specific aesthetic choices made by these neural networks are beyond our direct control. In that sense, working with this technology can sometimes feel like collaboration as much as it can feel like just using another digital tool.


The specific GAN used for this project is an open source algorithm called pix2pixHD, a higher-resolution variant of the original pix2pix. pix2pix is used to synthesize images through a process of image translation: present an input image to a trained pix2pix model, and pix2pix will interpret that input and translate it into an output image. With pix2pix, you “teach” a computer how to see the world. This is done by feeding pix2pix tons of image pairs, so that it can learn to translate one into the other. Often, the more data you use, the more “accurate” the image translation will be. The Edges to Handbags example below used a data set of approximately 137,000 image pairs of handbag photos and their corresponding black and white line-drawn versions. My data set used about 30,000 image pairs. After what can be hundreds of hours of analyzing these image pairs over and over again, pix2pix will (hopefully) learn how to translate the input into a somewhat accurate output.



Some example uses of pix2pix (source:




II. Do they think like us?

For self-contained, I was interested in seeing how an artificially intelligent algorithm would learn to interpret images of abstract body structures as fully-realized photographic versions of my body.



A sample input and a synthesized output image from my data set



Could an AI learn to see these simple dot patterns as me? Could it learn to create photo-realistic images of me from such a sparse amount of input data? Maybe the question isn’t whether or not it “could,” but how it would do such a thing. How will this AI, using pix2pix, learn to make sense of a highly abstracted version of my body? Further still, how would it learn to create my body if I intentionally “confused” the system by putting different versions of my appearance in the data set?



Animations used to explain human perception of biological motion. Source: 


At this point in the machine learning field, I think a good place to start for both scientific inquiry and creative exploration is often some version of the question: can a machine do X? This leads to the more interesting question of how a machine might (or might not) do X. The kernel of inspiration for this work came when I remembered having seen the above animations in a cognitive science course I took on human perception. These “point-light displays” were used to demonstrate that it takes only a moment for us to recognize human figures in motion. We don’t need much information to fill in the blanks.


I wanted to know if a computer could learn to perceive biological motion in the same way a person could. Unfortunately, this would be impossible from the start, as pix2pix can’t deal with images through time, so it has no concept of motion. But even as I moved away from questions concerning biological motion, I was still interested in exploring this notion of going from low-detail abstractions to full-detail representations of bodies. I wanted to know how convincingly a machine learning algorithm, presented with just an image of some dots, could learn to “fill in” those dots as a human.


In order to train pix2pix to do this, I needed to create a data set.



III. Creating a data set

To get the dot-skeletons I needed for my input data, I used the Kinect V2’s built-in skeleton tracking, and with some tweaks to that code, I extracted the X+Y coordinate data for each of the fourteen dots that made up my input images. I wrote an openFrameworks program to render that coordinate data as colored dots and export them as image files.
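As a rough illustration of that rendering step, here is a minimal Python sketch (the original program was written in openFrameworks, so this is not the code used for the project). It assumes fourteen (x, y) pixel coordinates, a black background, and a fixed dot radius; the function names, the background color, and the PPM output format are all assumptions made for the example.

```python
def render_dots(points, colors, width=256, height=256, radius=4,
                background=(0, 0, 0)):
    """Return a height x width grid of RGB tuples with one filled dot per point."""
    image = [[background] * width for _ in range(height)]
    for (px, py), color in zip(points, colors):
        cx, cy = round(px), round(py)
        # Only visit the bounding box of the dot, clipped to the image.
        for y in range(max(0, cy - radius), min(height, cy + radius + 1)):
            for x in range(max(0, cx - radius), min(width, cx + radius + 1)):
                if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                    image[y][x] = color
    return image


def to_ppm_bytes(image):
    """Encode the pixel grid as a binary PPM (P6) file body."""
    height, width = len(image), len(image[0])
    header = f"P6\n{width} {height}\n255\n".encode("ascii")
    body = bytes(c for row in image for pixel in row for c in pixel)
    return header + body


# Hypothetical skeleton: fourteen made-up joint positions, all drawn in white.
joints = [(10 + 15 * i, 20 + 10 * i) for i in range(14)]
frame = render_dots(joints, [(255, 255, 255)] * 14)
ppm = to_ppm_bytes(frame)  # write these bytes to a .ppm file to view the frame
```

Passing a different color for each joint, instead of fourteen copies of white, gives the per-dot coloring used later in the project.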


preparing to record the first data set



A Sony Alpha A7S II rests just above a Kinect V2, allowing the two devices to record from nearly the same perspective. The camera was connected to a monitor that allowed me to watch myself to make sure my body stayed within the bounds of the frame. This self-observation during the recording influenced my choreography.


While recording my skeleton data, I recorded video of myself in order to capture the images of my body that correspond to the dot-pattern images. Recording video rather than taking a series of still portraits allowed me to easily generate tons of images: tens of thousands from just 10-15 minutes of video. For the initial tests, I wore an outfit with distinct colors for different parts of my body, so that if the test results presented nonsensical imagery, I might at least be able to tell what parts of my body were going where. For example, the system should only ever generate images where the only pink in the image is where my feet should be.
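The “tens of thousands” figure can be checked with quick arithmetic. The sketch below assumes the video was recorded at 30 frames per second and that every frame is kept; neither detail is stated for the recording, so treat both as assumptions.

```python
def frames_in_video(minutes, fps=30):
    """Number of frames in a clip, assuming a constant frame rate."""
    return int(minutes * 60 * fps)


low = frames_in_video(10)   # 18000 frames
high = frames_in_video(15)  # 27000 frames
print(f"{low:,} to {high:,} frames")
```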


The first dataset I created: 3,636 image pairs of 256×256 images



IV. The first training results

Since pix2pix only dealt with 256×256 images, the training period was relatively painless. On a computer in SAIC’s Art and Technology Studies department with dual Nvidia GTX 1080TIs, 3,636 image pairs took about 5 hours to run through 100 epochs. Below are some snippets of the first generated images, all of which ran in real time in openFrameworks using Memo Akten’s ofxMSATensorflow addon.
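To put those training numbers in perspective, the throughput works out to roughly twenty image pairs per second. This is a back-of-the-envelope sketch that treats the "about 5 hours" as exact and ignores any startup or checkpointing overhead; the function name is made up for the example.

```python
def pairs_per_second(num_pairs, epochs, hours):
    """Average number of image pairs processed per second of training."""
    total_passes = num_pairs * epochs  # each epoch visits every pair once
    return total_passes / (hours * 3600)


rate = pairs_per_second(3636, 100, 5)
print(f"{rate:.1f} image pairs per second")  # prints "20.2 image pairs per second"
```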


Test #1: modifying one dot at a time from a dot-frame in the data set.


Test #2: drawing a dot skeleton from scratch


Test #3: randomly interpolating between consecutive dot-skeletons.
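One plausible way to produce the in-between skeletons for Test #3 is plain linear interpolation with a random blend factor. The sketch below is an assumption about how such a test could work, not the code used for the project; the function names and the sample coordinates are hypothetical.

```python
import random


def lerp_skeleton(a, b, t):
    """Blend two skeletons joint by joint; t=0 gives a, t=1 gives b."""
    return [(ax + (bx - ax) * t, ay + (by - ay) * t)
            for (ax, ay), (bx, by) in zip(a, b)]


def random_inbetween(a, b, rng=random):
    """Pick a random point on the straight path between two consecutive skeletons."""
    return lerp_skeleton(a, b, rng.random())


# Two made-up consecutive frames with two joints each, for brevity.
frame_a = [(0, 0), (100, 50)]
frame_b = [(10, 20), (110, 60)]
halfway = lerp_skeleton(frame_a, frame_b, 0.5)  # [(5.0, 10.0), (105.0, 55.0)]
```

Because the result is a skeleton that never appeared in the data set, feeding it to the trained model tests how the model handles poses between the ones it was trained on.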



V. Evolution: self-contained II

After additional experimentation and feedback from advisors, it was time to expand the project to the next phase. One outfit/costume/persona in the same data set grew to seven; uniform white dots turned to unique colors for each dot, thus providing a more learnable skeletal structure. The low-resolution 256×256-pixel images of pix2pix increased in resolution to 1024×512 with pix2pixHD.


excerpts from the video recordings and corresponding dot data that were extracted as images to make up the pix2pixHD data set



Though more costumes and choreographies were recorded than what is seen here, seven made it into the final data set



Excerpts from the data set, with the generated output after training for 45 epochs (about 110 hours of training)



By incorporating multiple “costumes” into the data set, I wanted to see how the system would deal with multiplicity. What happens when a single dot skeleton happens to correspond to multiple variations of my body simultaneously?


Six consecutive frames. The dot images hardly change from frame to frame, but the generated outputs vary wildly.


As nuanced as these machine learning systems are in terms of what they’re capable of learning about the world, their “thinking” inevitably reduces to a binary structure. When a single dot pattern maps equally to the same pose in a blue, orange and yellow costume, it can’t learn any sort of rigid one-to-one relationship between the inputs and outputs. It’s as though I’m asking the system to accept this logic:


A = B

A = C

B ≠ C


By the transitive property, it should be the case that if A equals B, and A equals C, then B had better equal C. But that’s often not the case with my data. Since pix2pix is making a decision about how to translate a dot skeleton thirty times per second, the result is a perpetually flickering, writhing, unstable entity that is never a perfect translation. Forcing this system to deal with these contradictory data is a useful practice for aesthetic experimentation and meaning making, but it is worth noting that in the real world, where machine learning algorithms are analyzing our data, the consequences of their one-track thinking processing our complex personal data can be dire.


Once the system was trained, I used some custom code in openFrameworks to generate novel dot images to feed into the system. In the code, I created a particle system that would allow a smooth transition between rendering original dot-frames from the data set and a physics-based particle system. With this particle system, I could make the dots “explode” and “melt”, or appear/disappear one by one. Even though the data set only contained images with exactly fourteen dots each, the power in these artificial intelligence technologies lies in how they interpret entire types of data they’ve never seen before. Feeding the system a pattern of 14 dots it’s never seen before is simple enough for it to translate faithfully, but it’s another thing entirely to feed it images of 1 or 2 dots, or maybe 28 or 42 dots at once! What happens if I don’t even use dots, for that matter? There are many experiments I have yet to try.
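The transition between data-set frames and free-moving particles can be sketched as a blend between each dot’s “home” position and a simulated particle. This is a minimal Python illustration under assumed physics (constant gravity, simple Euler steps); the class, the parameter values, and the `blend` function are hypothetical and are not taken from the openFrameworks code.

```python
from dataclasses import dataclass


@dataclass
class Particle:
    x: float
    y: float
    vx: float = 0.0
    vy: float = 0.0

    def step(self, dt, gravity=200.0):
        """Advance one Euler step; positive y points down, as in image space."""
        self.vy += gravity * dt
        self.x += self.vx * dt
        self.y += self.vy * dt


def blend(home, particle, mix):
    """mix=0 pins the dot to its data-set position, mix=1 frees it entirely."""
    hx, hy = home
    return (hx + (particle.x - hx) * mix, hy + (particle.y - hy) * mix)


home = (10.0, 10.0)        # dot position taken from a data-set frame
p = Particle(*home)        # particle starts on top of the dot
p.step(0.5)                # let it "melt" downward for half a second
drawn = blend(home, p, 0.5)  # (10.0, 35.0): halfway between dot and particle
```

Giving each particle an outward initial velocity instead of zero would produce an “explode” effect under the same blending scheme.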


The “growing” process at the beginning of the loop. Even with just one dot, the system fills in a significant amount of photographic imagery.



At the other end of the spectrum (i.e., more dots than usual) the system performs quite well with 28 dots, which is two bodies’ worth. Since I never fed the system dot images with more than 14 dots, it never learned to render two (or more) bodies to the screen simultaneously. Because of this, multiple bodies in proximity to each other always join to form a whole.


When there is too much visual information in the input images (i.e., too many dots), the system has difficulty rendering bodies to the screen, and introduces these checkered artifacts.


One happy family.


VI. self-contained III?

Stay tuned.