Diffusing towards a cute kitten

The image sequence above shows AI generation by a so-called diffusion model, specifically runwayml/stable-diffusion-v1-5 (obtained from Hugging Face). The Google Colab notebook that generated it is here. The final image – usually the only one you see – is on the far right. On the far left is an image from near the start of the image-generation process. Image generation starts with just noise, along with the prompt to specify what you want the image to be: “a cute British short hair kitten, looking directly at you”. Then an iterative procedure is applied, in this case 48 times, to somehow subtract the noise and leave the kitten.

To be honest I don’t understand the maths underlying this* but it does seem to be, in part, based on the maths of random (stochastic) processes, similar to that used to model, for example, the diffusion of molecules in liquids and gases – stuff that I teach. Extremely roughly speaking, I think the diffusion AI image-generation model is trained on a huge set of images with labels, e.g. an image of a tabby kitten labelled “tabby kitten”, and so on.

Training means that for each image, noise is added in lots of little steps, i.e., a bit like the reverse of the sequence above, and at each stage a neural net is “trained” to remove this small amount of noise from an image with this label. This training is encoded in the values of an enormous array of parameters (weights), and these values are in part controlled by a vector that encodes the label (e.g. “tabby kitten”). This is done for millions of labelled images.
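To make the noise-adding half of training a bit more concrete, here is a minimal sketch in Python with NumPy. It is not taken from the notebook or from Stable Diffusion itself: the function names, the 8x8 random "image", and the linear noise schedule with 1000 steps are all assumptions for illustration (this schedule is the sort of thing used in one common formulation, usually called DDPM). In that formulation the neural net is shown the noisy image and trained to guess the noise that was added.

```python
import numpy as np

def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Fraction of the original signal left after each noising step.

    The linear schedule and its end points are assumptions for this sketch.
    """
    betas = np.linspace(beta_start, beta_end, num_steps)  # noise added per step
    return np.cumprod(1.0 - betas)

def add_noise(image, t, alpha_bars, rng):
    """Jump straight to step t by blending the clean image with Gaussian noise."""
    noise = rng.standard_normal(image.shape)
    noisy = np.sqrt(alpha_bars[t]) * image + np.sqrt(1.0 - alpha_bars[t]) * noise
    return noisy, noise  # a network would be trained to recover `noise` from `noisy`

rng = np.random.default_rng(0)
image = rng.uniform(-1.0, 1.0, size=(8, 8))  # stand-in for one training image
alpha_bars = make_alpha_bars()

for t in (0, 250, 999):
    noisy, noise = add_noise(image, t, alpha_bars, rng)
    corr = np.corrcoef(image.ravel(), noisy.ravel())[0, 1]
    print(f"step {t:4d}: signal fraction {alpha_bars[t]:.4f}, "
          f"correlation with original {corr:.2f}")
```

The printed correlation with the original image falls as the step number rises, which is the "lots of little steps" of added noise: early steps barely change the image, and by the last step essentially only noise is left.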

Then once the model with its millions of parameters is trained, you come along with a prompt, e.g. “a cute British short hair kitten, looking directly at you”. This is encoded in a vector, which in part controls an enormous set of weights. Then a random input image consisting only of noise is generated, and this huge neural net is called into action to remove a small amount of noise, so creating the ghost of an outline. It is then called over and over again, to generate a pattern controlled by your prompt.
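The repeated "remove a little noise" loop can also be sketched as a toy, with a big caveat: there is no trained neural net here. The function predict_noise below is a hypothetical stand-in, an oracle that already knows the target image, so the sketch shows only the shape of the loop, not how a real model finds the kitten. The update rule is a deterministic DDIM-style step, which is one common choice and may not be what the notebook used, and the schedule in alpha_bars is likewise an assumption. The 48 steps match the number mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)
target = rng.uniform(-1.0, 1.0, size=(8, 8))  # hypothetical "kitten" image

num_steps = 48
# Assumed schedule: signal fraction rises from almost nothing (pure noise) to 1.
alpha_bars = np.linspace(1e-4, 1.0, num_steps + 1)

def predict_noise(x, alpha_bar):
    """Stand-in for the trained network: an oracle that knows the target.

    A real model would estimate this from x, the step, and the prompt vector.
    """
    return (x - np.sqrt(alpha_bar) * target) / np.sqrt(1.0 - alpha_bar)

x = rng.standard_normal(target.shape)  # start from an image of pure noise
errors = []
for i in range(num_steps):
    a_now, a_next = alpha_bars[i], alpha_bars[i + 1]
    eps = predict_noise(x, a_now)
    # Estimate of the clean image implied by the predicted noise.
    x0_estimate = (x - np.sqrt(1.0 - a_now) * eps) / np.sqrt(a_now)
    # Re-blend with slightly less noise: one deterministic DDIM-style step.
    x = np.sqrt(a_next) * x0_estimate + np.sqrt(1.0 - a_next) * eps
    errors.append(float(np.abs(x - target).mean()))

print(f"mean error after step 1: {errors[0]:.3f}, after step {num_steps}: {errors[-1]:.3g}")
```

Because the oracle is perfect, the loop ends exactly on the target. With a real trained network the noise prediction is only an estimate guided by the prompt, which is why different random starting noise gives different kittens.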

Hopefully this pattern is an image you like. If not you can try a different prompt or a different model.

* One of the papers with the maths is here.
