Using “underdrawings” for accurate text and numbers

(samcollins.blog)

181 points | by samcollins 3 days ago

52 comments

  • danpalmer 6 hours ago

    I'm glad that we're making progress towards a deeper understanding of what LLMs are inherently good at and what they're inherently bad at (not to say incapable of doing, but stuff that is less likely to work due to fundamental limitations).

    There's similarity here with, for example, defining the architecture of software, but letting an LLM write the functions. Or asking an LLM to write you the SQL query for your data analysis, rather than asking it to do your data analysis for you.

    What I'd really like to see is a more well defined taxonomy of work and studies on which bits work well with LLMs and which don't. I understand some of this intuitively, but am still building my intuition, and I see people tripping up on this all the time.

    • locknitpicker 3 hours ago

      > There's similarity here with, for example, defining the architecture of software, but letting an LLM write the functions.

      Not so long ago, this was exactly what early adopters of LLM coding assistants claimed was the right way to use them for coding tasks: prompt for an outline first, then prompt to implement each function. There were even a few posts on HN pointing to blog posts showing off this approach, with terminology inspired by animation work.

      • danpalmer 2 hours ago

        I'm not necessarily suggesting always getting down to literally the function level, although I think that gives you excellent quality control, but having a code-level understanding is clearly an important factor.

    • p-e-w 2 hours ago

      > due to fundamental limitations

      People keep throwing this phrase around in relation to LLMs, when not a single “fundamental limitation” has been rigorously demonstrated to exist, and many tasks that were claimed to be impossible for LLMs two years ago supposedly due to “fundamental limitations” (e.g. character counting or phonetics) are non-issues for them today even without tools.

      • coldtea 16 minutes ago

        >People keep throwing this phrase around in relation to LLMs, when not a single “fundamental limitation” has been rigorously demonstrated to exist

        Some limitations haven't been rigorously demonstrated to be fundamental, but they have been continuously present since the earliest LLMs, yes. Shouldn't the burden of proof be on those who say they can be overcome?

        And some limitations are fundamental, and have been rigorously demonstrated, e.g.:

        https://arxiv.org/abs/2401.11817

      • dijit 2 hours ago

        Character counting remains a huge issue without tools.

        Are you using only frontier models that are gated behind openai/anthropic/google APIs? Those use tools to help them out behind the scenes. It remains no less impressive, but I think we should be clear.

      • danpalmer an hour ago

        This is kind of my point, we need to get better at describing the limitations and study them. It seems extremely clear that there are limitations, and not just temporary ones, but structural limitations that existed at the beginning and continue to persist.

      • rimliu 2 hours ago

        of course, if you choose to ignore all the limitations they indeed have no limitations.

        • mkbosmans 2 hours ago

          Nobody says they have no limitations. The question is whether those limitations are fundamental, i.e. whether we can expect improvement, say, within a year.

          • coldtea 13 minutes ago

            As a general architecture, an LLM also has limitations that can't be improved unless we switch to another, fundamentally different AI design that isn't LLM-based.

            There are also limitations due to maths and/or physics that aren't fixable under any design. Outside science fiction, there is no technology whose limitations are all fixable.

            Here's one: https://arxiv.org/abs/2401.11817

          • danpalmer an hour ago

            When I talk about fundamental limitations, I mean limitations that can't be solved, even if they could be improved.

            We have improved hallucinations significantly, and yet it seems clear that they are inherent to the technology and so will always exist to some extent.

  • samcollins 3 days ago

    I found a simple technique to get reliable text and numbers in AI generated images.

    I’m surprised the image models aren’t already doing this, so I wanted to share since I’m finding it so useful.

    • samcollins 7 hours ago

      TLDR: use an SVG to lay out the image correctly first, then send a render of that SVG along with your text prompt so Gemini 3.0 Pro produces the final image with the correct numbers and text.
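
A minimal sketch of that first step (the layout values and helper name here are illustrative assumptions, not from the post): build an SVG whose text and numbers are exact before any generative model touches them, then rasterize it and send it alongside the prompt.

```python
# Sketch of the "underdrawing" step: construct an SVG where every label
# is placed by code, so the text and numbers are guaranteed correct.
# Coordinates, sizes, and labels below are purely illustrative.

def make_underdrawing(labels, width=400, height=300):
    """Place each (x, y, text) label exactly where we want it in an SVG."""
    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">']
    for x, y, text in labels:
        parts.append(f'<text x="{x}" y="{y}" font-size="24">{text}</text>')
    parts.append("</svg>")
    return "\n".join(parts)

svg = make_underdrawing([(20, 40, "Total: 1,234"), (20, 80, "Q3 2024")])
print(svg)
```

From here the SVG would be rasterized (e.g. with any SVG renderer) and passed as the input image to the model, with the original text prompt describing the desired style.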

  • elil17 44 minutes ago

    I wonder whether this could be used to fine-tune image models to provide better outputs. Something like this:

    1. Algorithmically generate an underdrawing (e.g. place numbers and shapes randomly in the underdrawing).

    2. Algorithmically generate a description of the underdrawing (e.g. for each shape, output text like "there is a square with the number three in the top left corner"). You might fuzz this by having an LLM rewrite the descriptions in a variety of ways.

    3. Generate a "ground truth" image using the underdrawing and an image+text-to-image model.

    4. Use the generated description and the generated "ground truth" image as training data for a text-to-image model.
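
The first two steps above could be sketched roughly like this (a hypothetical illustration: the shape type, canvas size, and description wording are my assumptions, not the commenter's):

```python
import random

# Steps 1-2 of the proposed pipeline: randomly place numbered squares in
# an SVG underdrawing, and emit a matching textual description that could
# later be fuzzed by an LLM. All layout parameters are illustrative.

def generate_pair(n_shapes, seed=0):
    rng = random.Random(seed)  # seeded for reproducible training pairs
    svg = ['<svg xmlns="http://www.w3.org/2000/svg" width="512" height="512">']
    desc = []
    for _ in range(n_shapes):
        x, y = rng.randrange(0, 460), rng.randrange(0, 460)
        num = rng.randrange(1, 10)
        svg.append(f'<rect x="{x}" y="{y}" width="50" height="50" fill="none" stroke="black"/>')
        svg.append(f'<text x="{x + 5}" y="{y + 20}">{num}</text>')
        desc.append(f"there is a square with the number {num} near ({x}, {y})")
    svg.append("</svg>")
    return "\n".join(svg), ". ".join(desc)

svg, description = generate_pair(3)
```

Steps 3-4 would then render the SVG, pass it through an image+text-to-image model to get the "ground truth", and train on the (description, image) pairs.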

    • hirako2000 23 minutes ago

      That would complicate the architecture of a model to solve a finite set of cases. It's an argument for specialised/fine-tuned models, though.

  • cheekyant 21 minutes ago

    Has anyone built a platform that has image-to-image pipelines and lets you use prompt-to-SVG generation from SOTA LLMs?

  • dllu an hour ago

    I was thinking about doing the opposite for the common task of "SVG of a pelican riding a bike". Obviously, directly spitting out the SVG is gonna be bad. But image gen can produce a really stunning photorealistic image easily. Probably a good way to get an LLM to produce a decent bike-pelican SVG is to generate an image first and then get the model to trace it into an SVG. After all, few human beings can generate SVG works of art by just typing out numbers into Notepad. At the core of it, we still rely on looking at it and thinking about it as an image.

  • smusamashah 4 hours ago

    This is just img2img, where the first image, with the correct structure, is generated by code.

    • vunderba 2 hours ago

      Yup, that’s exactly what this is. If you’ve been using generative models since the early Stable Diffusion days, it’s a pretty common (and useful!) technique: using a sketch (SVG, drawn, etc) as an ad-hoc "controlnet" to guide the generative model’s output.

      Example: In the past I'd use a similar approach to lay out architectural visualizations. If you wanted a couch, chair, or other furniture in a very specific location, you could use a tool like Poser to build a simple scene as an approximation of where you wanted the major "set pieces". From there, you could generate a depth map and feed that into the generative model, at the time SDXL, to guide where objects should be placed.

    • jasonjmcghee 4 hours ago

      Pretty much what the author said; just giving some context for the uninitiated.

    • philsnow 3 hours ago

      Right, but you can use a different (codegen) model to make that code.

  • sparuchuri 3 days ago

    This hack definitely falls in the “duh, why didn’t I think of that” category of tricks, but glad to now have it next time imagegen comes up short

    • manmal 4 hours ago

      Even the original Stable Diffusion app had image-to-image. It just didn’t work as well. I’m not sure why this is supposed to be novel.

      • ludwik 3 hours ago

        It’s obviously not a new model capability. But using this well-known, existing capability to solve this particular issue is only obvious after the fact.

        It’s a useful trick to have in one’s toolbox, and I’m grateful to the author for sharing it.

      • Finbel 3 hours ago

        It's not novel in the sense that nobody knew about img2img. It's novel in the sense that nobody thought of using img2img to solve this problem in this way.

  • xigoi 2 hours ago

    The standard objection: if the LLM is supposedly intelligent, why can’t it figure out on its own that this two-step process would achieve a better result?

    • jstanley an hour ago

      I think we all agree that humans are intelligent, and not every human has independently thought of this process.

      You're holding AI to a higher standard than humanity.

      • xigoi an hour ago

        Every decent human artist knows to draw a sketch before painting something.

        • jrapdx3 10 minutes ago

          Of course many, even most, painters do sketch what they intend to paint; that's likely the predominant technique.

          But it's not universally true, particularly among artists working in the last 100 years or so. Certainly Jackson Pollock (whether one regards his work as good or not) didn't sketch out how he was going to distribute paint onto canvas. Another example is Morris Louis (and other "stain painters"), who didn't sketch out how he applied paint to canvas.

          Your comment is largely correct; I'm just pointing out that more than a few "decent artists" didn't (or don't) work that way.

        • hirako2000 22 minutes ago

          Humans even have the creativity to come up with sketching.

          Models don't have intelligence, even less so creative thinking.

    • pyrolistical 36 minutes ago

      You don’t know what you don’t know

    • nine_k 2 hours ago

      Nobody asked it to!

      • xigoi an hour ago

        If it’s asked to generate an image, it should do everything in its power to make the image good.

        • andruby 37 minutes ago

          > it should do everything in its powers

          That's a scary thought.

          Hey Claude, why haven't you finished yet? ... Because the human I'm holding hostage hasn't finished the drawing yet.

    • cubefox an hour ago

      Part of the problem is that it isn't the LLM making the image directly itself, it's the LLM repeatedly prompting edits for an edit diffusion model. The Gemini reasoning summary shows part of this. The style of some of the images makes it also clear that it uses an Imagen 4 derived diffusion model underneath.

  • nine_k 2 hours ago

    It's normal to first create a plan, then allow agents to write code. But it seems to be surprising for many to first create a draft / outline of a picture, then go for a final render.

  • nottorp an hour ago

    LLMs are like a box of chocolates...

  • globular-toast an hour ago

    Wait, where did it get the "Sweet Path//Trail of treats" thing from in the SVG? It wasn't about sweets at that point. Something missing here, I think.

  • BobbyTables2 5 hours ago

    How is it that LLMs aren’t good at rendering the sequence of numbers but can reliably put the supplied pieces all in the right order?

    • mk_stjames 5 hours ago

      Because the image generation is powered by a diffusion model that is only guided by the transformer model, and the diffusion model still has a somewhat vague spatial representation, especially when it comes to coupling things like counting and complex positioning.

      But by using the LLM to generate the code an SVG graphic is made of, and then feeding a rasterized image of that SVG to the diffusion model, the raster takes the place of the raw noise input and guides the denoising process to put the numerical parts in the right spots.

      The LLM puts the SVG in the right order because what drives the SVG is just that - code - and the numerical order is easily defined there, even if it has to follow something like a spiral.

      Edit: LLMs may now also be using thinking modes, with feedback during generation, to help with complex positioning when drawing something like an SVG. I just asked Claude to generate one such spiral-number SVG and it did so iteratively via thinking, and the generated code is incredibly explicit about positions, so that must help. But the underlying two-step SVG-to-diffusion idea is the real key here.
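
To make the "explicit positions" point concrete, here is a sketch of a spiral-number SVG where every coordinate is computed deterministically (the spiral parameters are illustrative, not taken from any model's actual output):

```python
import math

# Why the SVG route works: the positions of the numbers are plain code,
# so the ordering is trivially correct. Here the numbers 1..12 are placed
# along an Archimedean spiral; nothing is left to a diffusion model's
# spatial intuition.

def spiral_number_svg(count=12, cx=150, cy=150):
    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" width="{2 * cx}" height="{2 * cy}">']
    for i in range(count):
        angle = i * math.pi / 4   # step around the spiral
        radius = 10 + 10 * i      # grow outward each step
        x = cx + radius * math.cos(angle)
        y = cy + radius * math.sin(angle)
        parts.append(f'<text x="{x:.1f}" y="{y:.1f}">{i + 1}</text>')
    parts.append("</svg>")
    return "\n".join(parts)

svg = spiral_number_svg()
```

A raster of this SVG would then serve as the guiding input to the diffusion model, as described above.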

  • choeger 3 hours ago

    Transformers are great translators. So, yeah, starting with structured output like SVG is probably the best way to start.

    It should be fairly trivial to fix any logic errors in the structured output, too.

  • SomaticPirate 2 hours ago

    inb4 this technique is subsumed into the next MoE model release

    LLMs are evolving so fast I wouldn’t be surprised if this technique was not needed in <6 months

    • krackers 2 hours ago

      I don't think the MoE part has anything to do with it, but the current gen of multimodal models can do thinking interleaved with autoregressive(?) image gen, so it's probably not long before they bake this into the RL process, the same way native thinking obviated the need for "think carefully step by step" prompts.

    • rimliu 2 hours ago

      LLMs are rather devolving at this point.

  • wg0 2 hours ago

    Has anyone had good luck with making consistent game art and assets?

  • foxes 40 minutes ago

    I feel sorry for the recipient.

  • Melamune 2 hours ago

    I wondered why I was losing all passion for creating. These tips and tricks are part of the answer.

  • tracerbulletx 6 hours ago

    I've been doing charts for slides like this for a while. I noticed HTML viz was super reliable, and I could style it with a diffusion model. It's very useful for data viz.

  • jeffrallen 3 hours ago

    I wish the opposite were true: that when I tell Gemini I want "a diagram of X" it immediately breaks out Python and matplotlib, instead of wasting my time with Nano Banana.

  • nullc 4 hours ago

    Inpainting/guiding from a sketch is how I've always used diffusion models. I thought everyone did that, or at least everyone who wasn't just trying to get some arbitrary filler material without much care of what the output looked like.

  • psychoslave an hour ago

    A few months ago I tried to get Mistral's Le Chat to output French poetry in alexandrines (12-syllable lines). Disastrous at first. Then, after adding to the specification that each line also had to be transcribed into IPA with its syllables counted, it went better.

    Still emotionally unrelatable, but it definitely produced something matching the specification where the constraints were explicit and systematically enforced through deterministic means. For now my takeaway is that LLMs can't seize the ineffable, and they are untrustworthy enough that they can only be employed under very clear and inescapable constraints, or they will go awry as surely as water is wet.
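
A crude sketch of the kind of deterministic check being described (real French scansion involves mute e, diaeresis, and liaison, so this vowel-group count is only a rough approximation, and the function names are my own):

```python
import re

# Approximate a line's syllable count by counting vowel groups.
# This is a deliberately naive heuristic, shown only to illustrate
# "constraints systematically enforced through deterministic means".

def rough_syllable_count(line):
    vowels = "aeiouyàâéèêëîïôùûœ"
    groups = re.findall(f"[{vowels}]+", line.lower())
    return len(groups)

def looks_like_alexandrine(line):
    # An alexandrine has 12 syllables; this check is only approximate.
    return rough_syllable_count(line) == 12
```

In the workflow above, the LLM produces the line plus its IPA transcription and syllable count, and a check like this one (or a proper scansion tool) rejects lines that miss the constraint.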

  • gwern 5 hours ago

    tldr: do a standard img2img workflow where you lay out a skeleton or low-res version, and then turn it into the final high-quality photorealistic version, instead of trying to zero-shot it purely from a text prompt.