Probing Compositional Understanding of ChatGPT with SVG

I ask ChatGPT to draw and explain programmatic SVG images

Evan Pu
Dec 28, 2022

Foundation models can generate realistic images from prompts, but do these models understand their own drawings? Generating SVG (Scalable Vector Graphics) gives us a unique opportunity to ask this question. SVG is programmatic, built from primitives such as circles, rectangles, and lines. Therefore, the model must schematically decompose the target object into meaningful parts, approximate each part using simple shapes, then arrange the parts together in a meaningful way.

Compared to generating a drawing of a bicycle in a pixel-based representation (e.g. with a diffusion model), generating SVG forces the model to first explain what it is drawing using code (i.e. symbolically).

tl;dr: I asked ChatGPT to draw 100+ objects in SVG, with explanations. Browse them all at this url. Raw data .tsv here. Just look at them! Aside from being fun, they tell us quite a lot about LLMs.

Procedure

For a prior distribution over objects, I used objects common in icon and emoji sets. I used variants of the following prompt to ask ChatGPT for SVGs.

“Using the SVG format, output a drawing of a _______. Put the output in the code block. Put it on a 200x200 canvas. Don’t forget xmlns="http://www.w3.org/2000/svg" after svg tag. Use different colors if the parts are overlapping so everything shows. Comment and explain why this is a _______”

I manually inspected the rendered SVGs to make sure they were valid and interesting.
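Part of that inspection can be automated before rendering. Here is a minimal Python sketch of a well-formedness check (the function name is my own; this is an illustration, not the exact workflow I used):

```python
import xml.etree.ElementTree as ET

def is_wellformed_svg(svg: str) -> bool:
    """Rough sanity check: the string parses as XML and the root element
    is <svg> in the SVG namespace. 'Interesting' still needs human eyes."""
    try:
        root = ET.fromstring(svg)
    except ET.ParseError:
        return False
    return root.tag == "{http://www.w3.org/2000/svg}svg"
```

Anything that fails this check will not render at all, so it is a cheap first filter before looking at the pictures.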

Analysis

For most drawings, you cannot tell what they are supposed to be. However, after reading the explanation, you can really appreciate the effort. For instance, let’s take a look at the drawing and explanation of a bicycle.

GPT is a convincing salesman!

We can conceptualize the SVG generation in these steps:

  1. Symbolically decompose the object into parts using language
  2. Approximate each part using an appropriate shape and color
  3. Put these parts together in a cohesive way
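Concretely, a hand-written SVG following these three steps might look like this (my own minimal sketch for illustration, not a GPT output):

```xml
<svg xmlns="http://www.w3.org/2000/svg" width="200" height="200">
  <!-- Step 1: decompose into wheels, frame, handlebar -->
  <!-- Step 2: approximate each part with a simple shape -->
  <circle cx="55" cy="140" r="35" fill="none" stroke="black"/>  <!-- rear wheel -->
  <circle cx="145" cy="140" r="35" fill="none" stroke="black"/> <!-- front wheel -->
  <!-- Step 3: arrange parts so the frame connects the wheel centers -->
  <polyline points="55,140 95,90 145,140 110,140 95,90" fill="none" stroke="red"/>
  <line x1="145" y1="140" x2="135" y2="75" stroke="blue"/>      <!-- handlebar stem -->
</svg>
```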

1. Symbolically decomposing the object using language

To draw an object as SVG, GPT first uses language to decompose it into parts. This is evident from the comments in the code:

<svg xmlns="http://www.w3.org/2000/svg" width="200" height="200">
<!-- Bicycle frame -->
...
<!-- Wheels -->
...
<!-- Pedals -->
...
</svg>

How good is this decomposition? We can ask ChatGPT whether it can recover the original object from these parts.

I tried the recovery task on a few other objects: clock (hands, face, body, pendulum), satellite (antenna, dish, body), and helicopter (body, tail, main rotor, tail rotor, cockpit). Tentatively, I’d say the decomposition is robust enough to recover the original concept.

2. Approximating each part

We can use the comments (i.e. rendering only the code immediately below <!-- Wheels -->) to see if GPT is drawing each part well.
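Slicing an SVG at its comments is easy to script. Here is a small Python sketch (the helper name is hypothetical; it assumes one comment per part, as in the bicycle example):

```python
import re

def isolate_part(svg: str, part: str) -> str:
    """Return an SVG containing only the shapes that follow the
    <!-- part --> comment, up to the next comment or the closing tag."""
    header = svg[:svg.index(">") + 1]  # the opening <svg ...> tag
    body = svg[svg.index(">") + 1:svg.rindex("</svg>")]
    # Split on comments; captured names interleave with the shape runs.
    sections = re.split(r"<!--\s*(.*?)\s*-->", body)
    for name, shapes in zip(sections[1::2], sections[2::2]):
        if name.lower() == part.lower():
            return f"{header}<!-- {name} -->{shapes}</svg>"
    raise ValueError(f"no part named {part!r}")
```

Rendering `isolate_part(bicycle_svg, "Wheels")` shows exactly what GPT drew for the wheels, in isolation.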

These are objects whose parts are well approximated

GPT understands that a bike pedal can be a circle and a line, that umbrella ribs form a cone shape, and that pretzel arms fan out in arcs.

Some parts are poorly approximated

GPT fails to draw hexagons on the soccer ball, thinks a No. 2 pencil is a gray box, and fails to generate the more complex curves needed for the trumpet and the octopus’s tentacles (GPT is overly confident in drawing splines, haha).

3. Putting the parts together

Like most neural networks, ChatGPT struggles to put the details together in a hierarchically cohesive way.

Conclusion

GPT, born from a large textual corpus, can generate SVG. It:

  1. knows, in language, how common objects can be symbolically decomposed into sub-parts;
  2. is reasonable at drawing the smaller parts, presumably having learned from existing SVG (or similar) program fragments online;
  3. still struggles, like most neural networks, to compose these parts together in a cohesive way.

Just as large models have become better at forming consistent abstractions (going from generating cohesive nonsense to generating three to seven fingers per hand), we should expect the ability to generate reasonably hierarchically consistent SVGs for common objects soon. If anything, the following drawing of a horse carriage should give you ample hope.

We are very close to having wheels on the ground!

Fun Remark: Is GPT from ancient EGYPT?

[The Egyptians] did not set out to sketch nature as it appeared to them from any fortuitous angle. They drew from memory, according to strict rules which ensured that everything that had to go into the picture would stand out in perfect clarity. Everything had to be represented from its most characteristic angle.
— The Story of Art by Gombrich E.H.

Symbolically decomposing an object into parts and then recombining the parts has an interesting consequence: the most representative renditions of the parts may be inconsistent with each other. Let’s take a look at the GPT-generated dumbbell SVG.

Note how the two weights are represented as circles, consistent with the statement “A dumbbell has two circular weights”. However, a top view of a dumbbell would not show the weights as circular. Here are a few more of these Egyptian drawings.
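To make the inconsistency concrete, here is a minimal SVG in the same spirit (my own illustration, not GPT’s actual output): each part is drawn from its most characteristic angle, so the weights are front-view circles while the bar is a side-view rectangle.

```xml
<svg xmlns="http://www.w3.org/2000/svg" width="200" height="200">
  <!-- Bar: seen from the side, a horizontal rectangle -->
  <rect x="60" y="92" width="80" height="16" fill="gray"/>
  <!-- Weights: seen head-on, circles -->
  <!-- the two views cannot both be right at once, yet each is "most representative" -->
  <circle cx="55" cy="100" r="35" fill="black"/>
  <circle cx="145" cy="100" r="35" fill="black"/>
</svg>
```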

These kinds of bugs (or features?) will be extremely difficult to reproduce using a diffusion model trained on naturalistic images. That’s cool!

The Egyptian Pond

One of my favorite ancient Egyptian drawings is that of a royal pond; here are the original and GPT’s reproduction side by side.

If we had to draw such a motif we might wonder from which angle to approach it. The shape and character of the trees could be seen clearly only from the sides, the form of the pond would be visible only if seen from above. The Egyptians had no compunction about this problem. They would simply draw the pond as if it were seen from above, and the trees from the side. The fishes and birds in the pond, on the other hand, would hardly look recognizable as seen from above, so they were drawn in profile — The Story of Art

Phew, that’s it! Thanks for reading this far, and I hope you enjoyed it. Check out this related blog post on the language of modifications.

Follow my Twitter for more stuff like this.

— evan 2022–12–27

Self-note: logistics. 6 hours to play with ChatGPT and curate the dataset in a Google spreadsheet. 2 hours to make a simple interactive website with Copilot’s help. 8 hours to write this blog (including figures).


Evan Pu

Research Scientist (Autodesk). PhD (MIT 2019). I work on Program Synthesis in the context of Human-Machine Communications