Probing Compositional Understanding of ChatGPT with SVG
Foundational models can generate realistic images from prompts, but do these models understand their own drawings? Generating SVG (Scalable Vector Graphics) gives us a unique opportunity to ask this question. SVG is programmatic, consisting of circles, rectangles, and lines. Therefore, the model must schematically decompose the target object into meaningful parts, approximating each part using simple shapes, then arrange the parts together in a meaningful way.
Compared to generating a drawing of a bicycle using a pixel-based representation (e.g. a diffusion model), generating SVG forces the model to first explain what it is drawing using code (i.e. symbolically).
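To make the contrast concrete, here is a minimal sketch of what "drawing with code" looks like: each part of the object is an explicit, named primitive with coordinates, not a patch of pixels. (The helper names below are mine, not from the dataset.)

```python
# A minimal sketch of "drawing with code": every part is an explicit
# SVG primitive. Helper names are hypothetical, for illustration only.

def circle(cx, cy, r, stroke="black", fill="none"):
    return f'<circle cx="{cx}" cy="{cy}" r="{r}" stroke="{stroke}" fill="{fill}"/>'

def line(x1, y1, x2, y2, stroke="black"):
    return f'<line x1="{x1}" y1="{y1}" x2="{x2}" y2="{y2}" stroke="{stroke}"/>'

def svg(*parts, size=200):
    body = "\n  ".join(parts)
    return (f'<svg xmlns="http://www.w3.org/2000/svg" '
            f'width="{size}" height="{size}">\n  {body}\n</svg>')

# Two wheels and a bar between the hubs: crude, but already *symbolic*.
doc = svg(
    circle(60, 140, 30),     # rear wheel
    circle(140, 140, 30),    # front wheel
    line(60, 140, 140, 140)  # bar connecting the hubs
)
print(doc)
```

Each shape carries its own meaning ("rear wheel"), which is exactly the kind of commitment a pixel-based generator never has to make.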
tl;dr: I asked ChatGPT to draw 100+ objects in SVG with explanations. Browse them all at this url. Raw data .tsv here. Just look at them! Aside from being fun, they tell us quite a lot about LLMs.
Procedure
For a prior distribution of objects, I used objects common in icons and emojis. I used variants of the following prompt to ask ChatGPT for SVGs.
“Using the SVG format, output a drawing of a _______. Put the output in the code block. Put it on a 200x200 canvas. Don’t forget xmlns="http://www.w3.org/2000/svg" after svg tag. Use different colors if the parts are overlapping so everything shows. Comment and explain why this is a _______”
I manually inspected the rendered SVGs to make sure they were valid and interesting.
Analysis
For most drawings, you cannot tell what they are supposed to be. After reading the explanation, however, you can really appreciate the effort. For instance, let’s take a look at the drawing and explanation of a bicycle.
We can conceptualize the SVG generation in these steps:
- Symbolically decomposing the object into parts using language
- Approximating each part with an appropriate shape and color
- Putting the parts together in a cohesive way
1. Symbolically decomposing the object using language
To draw an object as SVG, GPT first uses language to decompose it into parts. This is evident from the comments in the code:
<svg xmlns="http://www.w3.org/2000/svg" width="200" height="200">
<!-- Bicycle frame -->
...
<!-- Wheels -->
...
<!-- Pedals -->
...
</svg>
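These part labels can be pulled out of the comments mechanically. A small sketch, using Python's stdlib `re` on an abbreviated version of the SVG above:

```python
import re

# Abbreviated version of GPT's bicycle SVG, keeping only the comments.
svg_code = """
<svg xmlns="http://www.w3.org/2000/svg" width="200" height="200">
<!-- Bicycle frame -->
<!-- Wheels -->
<!-- Pedals -->
</svg>
"""

# Grab the text of each <!-- ... --> comment, trimming whitespace.
parts = re.findall(r"<!--\s*(.*?)\s*-->", svg_code)
print(parts)  # ['Bicycle frame', 'Wheels', 'Pedals']
```

The extracted part list is what gets fed back to ChatGPT in the recovery task below.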
How good is this decomposition? We can ask ChatGPT whether it can recover the original object from these parts.
I tried the recovery task on a few other objects: a clock (hands, face, body, pendulum), a satellite (antenna, dish, body), and a helicopter (body, tail, main rotor, tail rotor, cockpit). Tentatively, I’d say the decomposition is robust enough to recover the original concept.
2. Approximating each part
We can use the comments (i.e. render the code block immediately below <!-- Wheels -->) to see if GPT is drawing each part well.
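This per-part rendering trick can be sketched as follows: slice out the elements between one comment and the next, and wrap just that slice in its own SVG. (The helper name and the abbreviated SVG are mine.)

```python
def part_between(svg_code, comment):
    """Return a standalone SVG containing only the elements between
    the given comment and the next comment (or the closing tag)."""
    start = svg_code.index(comment) + len(comment)
    nxt = svg_code.find("<!--", start)
    end = nxt if nxt != -1 else svg_code.index("</svg>", start)
    body = svg_code[start:end]
    return ('<svg xmlns="http://www.w3.org/2000/svg" '
            'width="200" height="200">' + body + '</svg>')

svg_code = (
    '<svg xmlns="http://www.w3.org/2000/svg" width="200" height="200">\n'
    '<!-- Wheels -->\n'
    '<circle cx="60" cy="140" r="30"/>\n'
    '<!-- Pedals -->\n'
    '<circle cx="100" cy="140" r="8"/>\n'
    '</svg>'
)

# Render only the "Wheels" part, with everything else stripped away.
wheels_only = part_between(svg_code, "<!-- Wheels -->")
print(wheels_only)
```

Rendering each slice in isolation shows how well (or badly) GPT drew that one part.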
These are objects whose parts are well approximated
GPT understands that a bike pedal can be a circle and a line, that umbrella ribs form a cone shape, and that pretzel arms fan out in arcs.
Some parts are poorly approximated
GPT fails to draw hexagons on the soccer ball, thinks a No. 2 pencil is a gray box, and fails to generate the more complex curves needed for the trumpet and the tentacles of the octopus (GPT is overly confident in drawing splines, haha).
3. Putting the parts together
Like most neural networks, ChatGPT struggles with putting the details together in a hierarchically cohesive way.
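Cohesive composition is something SVG itself supports directly: wrap each part in a `<g>` element and position it with a transform, so the part is defined once in local coordinates and the arrangement lives entirely in the transforms. A sketch (helper names are mine; this is how composition could be expressed, not how GPT did it):

```python
def group(content, dx=0, dy=0):
    # <g transform="translate(...)"> places a part drawn in local
    # coordinates at a position relative to its parent.
    return f'<g transform="translate({dx},{dy})">{content}</g>'

# One wheel, defined once, centered on its own origin.
wheel = '<circle cx="0" cy="0" r="30" stroke="black" fill="none"/>'

# The bicycle reuses the same part twice; only the transforms differ.
bike = (
    '<svg xmlns="http://www.w3.org/2000/svg" width="200" height="200">'
    + group(wheel, 60, 140)
    + group(wheel, 140, 140)
    + '</svg>'
)
print(bike)
```

In the generations I collected, GPT instead tends to hard-code absolute coordinates per shape, which is one reason the parts drift out of alignment.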
Conclusion
GPT, born from a large textual corpus, can generate SVG. It:
- knows, in language, how common objects can be symbolically decomposed into sub-parts;
- is reasonable at drawing smaller parts, presumably learned from existing SVG (or similar) program fragments online;
- like most neural networks, still struggles to compose these parts together in a cohesive way.
Just as large models have become better at forming consistent abstractions (from generating coherent nonsense to generating 3 to 7 fingers per hand), we should expect the ability to generate reasonably hierarchically consistent SVGs for common objects soon. If anything, the following drawing of a horse-carriage should give you ample hope.
Fun Remark: Is GPT from ancient EGYPT?
[The Egyptians] did not set out to sketch nature as it appeared to them from any fortuitous angle. They drew from memory, according to strict rules which ensured that everything that had to go into the picture would stand out in perfect clarity. Everything had to be represented from its most characteristic angle.
— The Story of Art by Gombrich E.H.
Symbolically decomposing an object into parts and then recombining the parts has an interesting consequence: the most representative renditions of the parts may be inconsistent with each other. Let’s take a look at the GPT-generated dumbbell SVG.
Note how the two weights are represented as circles, consistent with the statement “A dumbbell has two circular weights”. However, a top view of a dumbbell would not show the weights as circles. Here are a few more of these Egyptian drawings.
These kinds of bugs (or features?) will be extremely difficult to reproduce using a diffusion model trained on naturalistic images. That’s cool!
The Egyptian Pond
One of my favorite ancient Egyptian drawings is that of a royal pond; here are the original and GPT’s reproduction side by side.
Phew, that’s it! Thanks for reading this far, and I hope you enjoyed it. Check out this related blog post on the language of modifications.
Follow my Twitter for more stuff like this.
— evan 2022–12–27
Self-note, logistics: 6 hours to play with ChatGPT and curate the dataset in a Google spreadsheet. 2 hours to make a simple interactive website with Copilot's help. 8 hours to write this blog post (including figures).