> we found that GPT-5.2 cannot even compute the parity of a short string like 11000, and GPT-5.2 cannot determine whether the parentheses in ((((()))))) are balanced.
I think there is a valid insight here which many already know:
LLMs are much more reliable at creating scripts and automation to do certain tasks than doing these tasks themselves.
For example if I provide an LLM my database schema and tell it to scan for redundant indexes and point out wrong naming conventions, it might do a passable but incomplete job.
But if I tell the LLM to write a Python or Node.js script to do the same, I get significantly better results. And it's often faster, too, to generate and run the script than to have the LLM process large SQL files directly.
The dream is probably that the inference software then writes and executes that script itself rather than relying on text generation alone, analogous to how a human might cross off pairs of parentheses to check that example.
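To make that concrete, here is a minimal sketch of the kind of throwaway script an LLM might generate for the two examples in the quoted paper (function names are my own, not from the paper):

```python
def parity(bits: str) -> int:
    # Parity of a bit string: 1 if the number of '1's is odd, else 0.
    return bits.count("1") % 2

def balanced(s: str) -> bool:
    # Track nesting depth; dipping below zero means a stray ')'.
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

print(parity("11000"))           # 0: two 1s, even parity
print(balanced("((((())))))"))   # False: five '(' vs six ')'
```

Running a script like this is deterministic; generating the answer token by token is not.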
ChatGPT already does this, albeit in limited circumstances, through the use of its sandbox environment. Asking GPT in thinking mode to, for example, count the number of “l”s in a long text may see it run a Python script to do so.
There’s a massive issue with extrapolating to more complex tasks, however, where either you run the risk of prompt injection via granting your agent access to the internet or, more commonly, an exponential degradation in coherence over long contexts.
To those saying this is not surprising: it will be surprising to the general public, who are being served ads from huge companies like MS or OpenAI claiming LLMs can help with their accounting, help them close deals by crunching the numbers in seconds, write complex code for them, etc.
This is important information for anyone who thinks these systems are thinking, reasoning, and learning, or that they’re having a conversation with them, i.e. 90% of LLM users.
A machine which confabulates and cannot count is not a good fit for accounting tasks. They’ll make all sorts of subtle errors which are difficult for humans to notice.
That wouldn't necessarily be true even if models really "couldn't count", since software exists: if an LLM is making an Excel spreadsheet rather than doing everything manually, it's both much harder for it to mess up and easier to notice and recover. It's even less true given that what this paper actually tests is "LLMs don't have literally perfect accuracy when you make them do increasingly big problems with zero thinking".
(Confabulation is IMO a much bigger problem, but it's unrelated to architecture - it's an artifact of how models are currently trained.)
Quick sanity check: you're susceptible to pretty irresistible optical illusions which would never fool a VLM, does it mean you're not thinking? In fact, with a non-monospaced font I also have trouble determining whether these parens are balanced, and have to select them with the mouse, i.e. use a "dumb" tool, to make sure.
Reminder that "thinking" is an ill-defined term like others, and the question whether they "think" is basically irrelevant. No intelligent system, human or machine, will ever have zero error rate, due to the very nature of intelligence (another vague term). You have to deal with that the same way you deal with it in humans - either treat bugs as bugs and build systems resilient to bugs, or accept the baseline error rate if it's low enough.
Who is hiring anyone to look at a screen to count characters? Don't be disingenuous in your argument. The apt comparison would be the current technique used to accomplish this task i.e. a pattern matching algorithm.
It's not just an issue of tokenization, it's almost a category error. Lisp, accounting, and the number of r's in strawberry are all operations that require state. Balancing ((your)((lisp)(parens))) requires a stack, counting r's in strawberry requires a register, counting to 5 requires an accumulator to hold 4.
An LLM is a router and completely stateless aside from the context you feed into it. Attention is just routing the probability distribution of the next token, and I'm not sure that's going to accumulate much in a single pass.
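A sketch of the state this comment is describing, with the stack and the accumulator made explicit (a generic illustration, not code from the paper):

```python
PAIRS = {"(": ")", "[": "]", "{": "}"}

def balanced_pairs(s: str) -> bool:
    # The stack is exactly the state a single forward pass has to simulate.
    stack = []
    for ch in s:
        if ch in PAIRS:
            stack.append(PAIRS[ch])
        elif ch in PAIRS.values():
            if not stack or stack.pop() != ch:
                return False
    return not stack

def count_char(word: str, target: str) -> int:
    # The "register": one accumulator updated per character.
    total = 0
    for ch in word:
        if ch == target:
            total += 1
    return total

print(balanced_pairs("((your)((lisp)(parens)))"))  # True
print(count_char("strawberry", "r"))               # 3
```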
It's not dismissible as a misunderstanding of tokens. LLMs also embed knowledge of spelling - that's how they fixed the strawberry issue. It's a valid criticism and evaluation.
The r's in strawberry presents a different level of task to what people imagine. It seems trivial to a naive observer because the answer is easily derivable from the question without extra knowledge.
A more accurate analogy for humans would be to imagine that every word had a colour. You are told that there is also a sequence of different colours that corresponds to the same colour as that word. You are even given a book showing every combination to memorise.
You learn the colours well enough that you can read and write coherently using them.
Then comes the question of how many chocolate-browns are in teal-with-a-hint-of-red. You know that teal-with-a-hint-of-red is a fruit and you know that the colour can also be constructed by crimson followed by Disney-blond. Now, do both of those contain chocolate-brown or just one of them, how many?
It requires exercising memory to do a task that is underrepresented in the training data, because humans simply never have to do the task at all when the answer can be derived from the question's representation. Humans also don't have the ability that LLMs need here, but with the letter representation that ability isn't needed.
That’s what makes it a fair evaluation and something that requires improvement. We shouldn’t only evaluate agent skills by what is most commonly represented in training data. We expect performance from them in areas where existing training data may be deficient. You don’t need to invent an absurdity to find these cases.
It's reasonable to test their ability to do this, and it's worth working to make it better.
The issue is that people claim the performance is representative of a human's performance in the same situation. That gives an incorrect overall estimation of ability.
I do think this is a tool issue. Here is what the article says:
> For the multiplication task, note that agents that make external calls to a calculator tool may have ZEH = ∞. While ZEH = ∞ does have meaning, in this paper we primarily evaluate the LLM itself without external tool calls
The models can count to infinity if you give them access to tools. The production models do this.
Not that the paper is wrong, it is still interesting to measure the core neural network of a model. But modern models use tools.
It is academically interesting what pure neural networks can do, of course. But when someone goes to Claude and tries to do something, they don't care if it solves the problem using a neural network or a call out to Python. So long as the result is right.
More generally, the ability to use tools is a form of intelligence, just like when humans and crows do it. Being able to craft the right Python script and use the result is non-trivial.
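For illustration, the shape of such a tool call might look something like this (a hypothetical harness; the tool name and dispatch are made up, and a real one would sandbox the execution):

```python
def run_tool(name: str, args: dict):
    # Hypothetical dispatcher: route the model's tool call to real code.
    if name == "python_eval":
        # eval with no builtins; a real harness would use a proper sandbox.
        return eval(args["expression"], {"__builtins__": {}})
    raise ValueError(f"unknown tool: {name}")

# What the model emits instead of multiplying digit by digit in its output:
tool_call = {"name": "python_eval",
             "args": {"expression": "123456789 * 987654321"}}
result = run_tool(tool_call["name"], tool_call["args"])
```

The model's job shrinks to crafting the right expression and reading back the result, which is the non-trivial part mentioned above.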
Seems like it’s maybe also a tool-steering problem. These models should be reaching for tools to solve factual problems; the LLM itself should stick to prose.
I think this is still useful research that calls into question how “smart” these models are. If the model needs a separate tool to solve a problem, has the model really solved the problem, or just outsourced it to a harness that it’s been trained - via reinforcement learning - to call upon?
Does it matter if the LLM can solve the problem or if it knows to use a resource?
There’s plenty of math that I couldn’t even begin to solve without a calculator or other tool. Doesn’t mean I’m not solving math problems.
In woodworking, the advice is to let the tool do the work. Does someone using a power saw have less claim to having built something than a handsaw user? Does a CNC user not count as a woodworker because the machine is doing the part that would be hard or impossible for a human?
It does matter because the LLM doesn’t always know when to use tools (e.g. ask it for sales projections which are similar to something in its weights) and is unable to reason about the boundaries of its knowledge.
It matters if you’re curious about whether AGI is possible. Have we really built “thinking machines”, or are these systems just elaborate harnesses that leverage the non-deterministic nature of LLMs?
An "elaborate harness" that can break down a problem into sub-tasks, write Python scripts for the ones it can't solve itself, and then combine the results, seems able to solve a wide range of cognitive tasks?
What is the difference? If the "elaborate harness" consists of a mix of "classical" code and ML model invocations, at which point is it disqualified from consideration as a "thinking machine"? Best we can tell, even our brains have parts that are "dumb", interfacing with the parts that we consider "where the magic happens".
Weird. I tried in ChatGPT Auto and it worked perfectly. I tried like 10 variations. I also did the letters-in-words tests. Got all of them right.
The one thing I did trip it up on was "Is there the sh sound in the word transportation". It said no. And then realized I asked for "sound" not letters. It then subsequently got the rest of the "sounds-like" tests I did.
heh, interesting that. I just tried it twice more with ChatGPT "Instant" (disabling "Auto-switch to Thinking") and it got it wrong both times. Does yours get it right without thinking or tool calls? If so, maybe it does like you better than me.
OK, I didn't think to disable switch to thinking (didn't know this was a mode). When I did that then it did get it wrong -- oddly it took about the same amount of time, so thinking mode wasn't taking longer, but it was more accurate.
Right, though I didn't explicitly disable thinking for my first attempt either. I'd guess my prompt was less detailed than yours and so ChatGPT (in "Auto" mode) decided to allocate thinking tokens for your questions but not mine.
Probably zero. At the end of the day people pay for LLMs that write better code or summarize PDFs of hundreds of pages faster, not the ones that can count the letter r's better.
When LLMs can't count r's: see? LLMs can't think. Hoax!
When LLMs count r's: see? They patched and benchmark-maxxed. Hoax!
Whenever an "LLM fail" goes viral like the car wash question, you can observe the exact same wording of the question get "fixed" within a week or so. With slight variations in phrasing still able to replicate the problem.
Followed by lots of "works perfectly for me, why are people even talking about this?"
I can't say what exactly they're doing behind the scenes but it's a consistent pattern among the big SOTA model providers. With obvious incentive to "fix" the problem so users will then organically "debunk" the meme as they try it themselves and share their experiences.
The same non-argument could be made about all kinds of benchmark cheating by tech companies, and yet we have tons of documented examples of them caught with their pants down.
>You just can't reason with the anti-LLM group.
On the contrary, the reasoning is simple and consistent:
LLMs can't count r's shows that LLMs don't actually think the way we understand thought (since nobody with the kind of high skill they show in other areas would fail at that). And because of that, there are (likely) patches for commonly reported cases, since it's a race to IPO and benchmark-maxxing is very much conceivable.
Yeah well I presume at this point they have an agent download new LLM related papers as they come out and add all edge cases to their training set asap.
Is tokenization extremely efficient? Yes. Does it fundamentally break character-level understanding? Also yes. The only fix is endless memorization.
LLMs seem to me closer to Kahneman's System 1 than to System 2. When understood in this way, it is obvious why LLMs are bad at counting r's in "strawberries". But it also makes ZEH feel like it couldn't possibly be a useful metric, because it's a System 2 evaluation applied to a System 1 system.
FYI, the LLM letter-counting problem has nothing to do with counting per se, and is instead entirely down to LLMs not getting to see your raw UTF-8 byte stream, but rather having a tokenizer intermediating between you and it, chunking your UTF-8 bytes into arbitrary, entirely-opaque-to-the-LLM token groupings.
Try it for yourself — under the most popular tokenizer vocabulary (https://tiktokenizer.vercel.app/?model=cl100k_base), "strawberry" becomes [str][aw][berry]. Or, from the model's perspective, [496, 675, 15717]. The model doesn't know any more about how those numbers correspond to letters than you do! It never gets sat down and told "[15717] <=> [b][e][r][r][y]", with single-byte tokens on the right. (In fact, these single-byte tokens appear in the training data extremely rarely, and so the model doesn't often learn to do anything with them.)
Note that LLMs can predictably count the number of r's in "s t r a w b e r r y", because <Count the number of r's in "s t r a w b e r r y"> becomes [Count][ the][ number][ of][ r]['s][ in][ "][s][ t][ r][ a][ w][ b][ e][ r][ r][ y]["]. And that's just a matching problem — [ r] tokens for [ r] tokens, no token-correspondence-mapping needed.
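The difference between the two framings is easy to see in plain Python (trivial for code, which sees characters; the point is which representation the model sees):

```python
word = "strawberry"
spaced = " ".join(word)  # "s t r a w b e r r y"

# Spaced out, every letter is its own unit, so counting r's is pure
# sequence matching: the easy case described above.
count_spaced = spaced.split().count("r")

# In the raw word the letters are hidden inside multi-character tokens
# like [berry]; Python sees characters, the model sees token IDs.
count_raw = word.count("r")

print(count_spaced, count_raw)  # 3 3
```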
This is clearly not the case, any modern (non-reasoning) model easily decomposes words into individual token-characters (try separating them with e.g. Braille spaces...) and does arbitrary tokenization variants if forced with a sampler. It's way deeper than tokenization, and models struggle exactly with counting items in a list, exact ordering, retrieving scattered data, etc. LLM context works a lot more like associative memory than a sequence that can be iterated over. There are also fundamental biases and specific model quirks that lead to this.
> When understood in this way, it is obvious why LLMs are bad at counting r's in "strawberries".
No, it doesn't. It makes sense that they can't count the r's because they don't have access to the actual word, only tokens that might represent parts or the whole of the word.
Tokenization is a simplistic explanation which is likely wrong, at least in part. They're perfectly fine reciting words character by character, using different tokenization strategies for the same word if forced to (e.g. replacing the starting space or breaking words up into basic character tokens), complex word formation in languages that heavily depend on it, etc. LLMs work with concepts rather than tokens.
A big part of skill acquisition in humans is moving tasks from System 2 to System 1, to free up the very scarce thinking resources for ever more complex tasks, which can then in turn be internalized and handled by System 1.
People are going to misinterpret this and overgeneralize the claim. This does not say that AI isn't reliable for things. It provides a method for quantifying the reliability for specific tasks.
You wouldn't say that a human who doesn't know how to read isn't reliable in everything, just in reading.
Counting is something that even humans need to learn how to do. Toddlers also don't understand quantity. If a 2-year-old is able to count even to 10, it's through memorization, not understanding. It takes them about two more years of learning before they're able to comprehend things like numerical correspondence. But they do still know how to do other things that aren't counting before then.
>Counting is something that even humans need to learn how to do
No human who can program, solve advanced math problems, or talk about advanced problem domains at an expert level, however, would fail to count to 5.
This is not a mere "LLMs, like humans, also need to be taught this" but points to a fundamental mismatch about how humans and LLMs learn.
(And even if they merely needed to be taught, why would their huge corpus fail to cover that "teaching", but cover way more advanced topics in math solving and other domains?)
Respectfully, toddlers cannot output useable code or have otherwise memorised results to an immense number of maths equations.
What this points at is the abstraction/emergence crux of it all. Why does an otherwise very capable LLM such as the GPT-5 series, despite having been trained on vastly more examples of frontend code of all shapes, sizes and quality levels, struggle to abstract all that training data to the point of outputting any frontend that deviates from the examples it clearly leans on?
If LLMs, as they are now, were comparable with human learning, there'd be no scenario where a model that can provide output solving highly advanced equations cannot count properly.
Similarly, a model such as GPT-5 trained on nearly all frontend code ever committed to any repo online, would have internalised more than that one template OpenAI predominantly leaned on.
These models, I think at this point there is little doubt, are impressive tools, but they still do not generalise or abstract information in the way a human mind does. Doesn't make them less impactful for industries, etc. but it makes any comparison to humans not very suitable.
> What this points at is the abstraction/emergence crux of it all. Why does
This paper has nothing to do with any questions starting with "why". It provides a metric for quantifying error on specific tasks.
> If LLMs, as they are now, were comparable with human learning
I think I missed the part where they need to be.
> struggle to abstract all that training data to the point where outputting any frontend that deviates from the clearly used examples? ... a model such as GPT-5 trained on nearly all frontend code ever committed to any repo online, would have internalised more than that one template OpenAI predominantly leaned on
There is a very big and very important difference between producing the same thing again and not being able to produce something else. When not given any reason to produce something else, humans also generate the same thing over and over. That's a problem of missing constraints, not of missing ability.
Long before AI there was this thing called Twitter Bootstrap. It dominated the web for...much longer than it should have. And that tragedy was done entirely by us meatsacks (not me personally). Where there's no goal for different output there's no reason to produce different output, and LLMs don't have their own goals because they don't have any mechanisms for desire (we hope).
[...] common trope that was proven false years ago by the existence of zero shot learning.
Ok, that's better than comparing LLMs to humans. ZSL, however, has not proven anything of that sort; it was mainly concerned with assessing whether LLMs solely rely on precise instruction training or can generalise, to a very limited degree, beyond the initial tuning. That has never allowed for comparing human learning to LLM training.
Ironically, you are writing this under a paper that shows just that:
A model that cannot determine a short string's parity cannot have abstracted from the training data to arrive at the far more impressive and complicated maths challenges which it successfully solves in output. Some of the solutions we have seen in output require such innate understanding that, if there is no generalisation far deeper than ZSL has ever shown, then this must come from training. Simple multiplication, etc., maybe, but not the tasks people such as Easy Riders [0] throw at these models.
This paper shows exactly that even with ZSL, these models only abstract in an incredibly limited manner, and a lot of the capabilities we see in the output are specifically trained, not generalised. Yes, generalisation in a limited capacity can happen, but no, it is not nearly enough to yield some of the results we are seeing. I have also, neither here nor in my initial comment, said that LLMs are only capable of outputting what their training data provides, merely that given what GPT-5 has been trained with, if these models gained any deeper abstraction during training, they would be able to provide more than one frontend style.
Or to put it simpler, if the output provided can be useful for Maths at the Bachelor level and beyond and this capability is generalised as you believe, these tasks would not be a struggle for the model.
> When not given any reason to produce something else, humans also generate the same thing over and over. That's a problem of missing constraints, not of missing ability.
Ignoring the comparison with humans, yes, LLMs don't output something unless prompted specifically, of course. My point with GPT-5 was that, no matter how you prompt, you cannot get salvageable frontend code from this line of models.
OpenAI themselves tried and failed appallingly [0]. Call it "constraints", call it "reason", call it "prompting": you cannot get frontend code that deviates significantly from their card-laden training data. Despite GPT-5 having been trained on more high-quality frontend code examples than any human could ever read in a lifetime, that one template is overrepresented, because the model never generalised anything akin to an understanding of UI principles or of what code yields a specific design.
These are solvable problems, mind you, but not because a model at some stage gains anything that one could call an abstract understanding of these concepts. Instead, by providing better training data or being clever in how you provide existing training data.
Gemini 3 and Claude 4 class models have a more varied training set, specifically of frontend templates, yielding better results. Though if you do any extended testing you will see these repeat constantly because, again, these models never abstract beyond that template collection [1].
Moonshot meanwhile, with K2.5, made a major leap by tying their frontend code tightly to visual input, leveraging the added vision encoder [2]. They are likely not the only ones doing that, but they are the first to state it clearly, going by the system cards. Even there, the gains are limited to a selection of very specific templates.
In either case, more specific data, not abstractions by these models yield improvements.
> Twitter Bootstrap [...] entirely by us meatsacks (not me personally). Where there's no goal for different output there's no reason to produce different output, and LLMs don't have their own goals because they don't have any mechanisms for desire (we hope).
What? So because some devs relied on Bootstrap that means, what exactly? That no one asked/told them to leverage a different solution, be more creative, what?
Again ignoring the comparison to humans, which just is not appropriate for this tech: we can and do prompt models for specific frontend output. We are, if you must, providing the goal. The model however cannot accomplish said goal; even OpenAI cannot get GPT-5's lineage to deviate from their one template.
If we must stick with the human comparison, and if we must further limit it to Bootstrap: GPT-5, despite being specifically prompted to never use the Carousel in Bootstrap, cannot output any website without including a Carousel, because the template it was trained on included one. Any human developer asked to do so would simply not include a Carousel, because their abilities are abstracted beyond the one Bootstrap template they first learned with. But if we truly wanted to make this fair, it would have to be a human who was trained on thousands of Bootstrap example pages, but on just one template really well, and who never connected anything between that one and the others. Which isn't very human, but then again, that's why this comparison is not really a solid one.
[0] Subjectively not one good result; objectively, even their team of experts could not get their own model to shed the telltale signs of GPT frontend slop that originated from a template they have been training with since Horizon: https://developers.openai.com/blog/designing-delightful-fron...
Many animals can count. Counting is recognizing that the box with 3 apples is preferable to the one with 2 apples.
Yes, 2 year olds might struggle with the externalization of numeric identities but if you have 1 M&M in one hand and 5 in the other and ask which they want, they’ll take the 5.
LLMs have the language part down, but fundamentally can’t count.
The concept of bigger/smaller is useful but is a distinct skill from counting. If you spread the M&Ms apart enough that the part of the brain responsible for gestalt clustering can't group them into a "bigger whole" signal, they'll no longer be able to do the thing you're saying (this is the law of proximity in gestalt psychology).
However many animals can distinguish independently small numbers, like 3 or 5, and recognize them whenever they see them.
So in this respect, there is little difference between humans and many animals. Humans learn to count to arbitrarily big numbers, but they can still easily recognize only small numbers.
> many animals can distinguish independently small numbers, like 3 or 5
This is called subitizing. It's distinct from counting. We can see the difference in humans with Simultanagnosia, who are unable to count beyond the subitizing range. Subitizing is categorizing the scale of a small gestalt group.
The only thing I've ever seen where an animal appeared to demonstrate counting (up to 3) without training was in rhesus monkeys (maybe also chimpanzees?), but even that experiment could be explained through temporal gestalt. (It's the only reason I know of for them to not have been able to go higher than 3 in that experiment in the context of many other things that they can do.)
At least one has maybe been shown to be able to do that with 30 years of focused training, but none have been shown to be able without training. Wild parrots have only demonstrated subitizing and size discrimination, not counting.
The overeager do quite often confuse subitizing and size discrimination for counting, though. That's its own problem.
> Counting is something that even humans need to learn how to do. Toddlers also don't understand quantity. If they're able to count to even 10 it's through memorization and not understanding.
I completely agree with you. LLMs are regurgitation machines with less intellect than a toddler, you nailed it.
But it's a tricky question for LLMs; it shows that if something is not in the training set, LLMs can trip up, which kinda shows that the intelligence is not generalized yet.
I tried this with Gemini: (i am trying(something(re(a(l(ly)c)r)a)z)((y)he)re)
Intuitively this looks like an architectural artifact (like optical illusions in humans) or a natural property of learning rather than a lack of generalization. I have issues with your example too and have to count slowly to make sure.
Right, I am sure you were able to solve it, albeit slowly; you knew you had to do it slowly. LLMs which are mathematicians don't know that and can't seem to understand that they need to do it slowly.
They do if they are trained to use a reasoning chain or another form of loopback, and you don't overwhelm it, or if they are optimized to search for the solution forever. There's nothing fundamental about it, only the fact that the raw transformer expressivity is limited by the single pass through the layers, which is circumvented by the loopback.
And I'm still pretty likely to make the off-by-one error even if I slow down, and there are certain optical illusions that are nearly guaranteed to confuse me no matter how hard I try, particularly if I don't use any visual guides (i.e. tools). VLMs will not make my mistakes but will make their own, because their quirks are different from the quirks of my visual cortex.
Another strange thing is that they just don't know the endings of popular stories. Like planets that get blown up, etc. They just don't have that material.
It was surprising to me, and when I reviewed the paper I found serious flaws that call the fundamental claims into question: they didn't use any reasoning tokens. Any LLM or human will fail at a task like this if not allowed to think.
Please don't start generic flamewars on HN or impugn people who take an opposing view to yours. Both these vectors lead to tedious, unenlightening threads.
There's plenty of rage to go around on literally every divisive topic, and it's not the place we want discussions to come from here.
"Eschew flamebait. Avoid generic tangents."
"Comments should get more thoughtful and substantive, not less, as a topic gets more divisive."
There are other users in this very thread using inflammatory language to attack this paper and those who find the paper compelling. One user says, quote: “You just can't reason with the anti-LLM group.”
In light of this, why was my comment - which was in large part a reaction to the behavior of the users described above - the only one called out here?
No disrespect to them, but unless there is a financial incentive at stake for them (beyond S&P 500 exposure), I've gotten to viewing this through the lens of sports teams, gaming consoles and religions. You pick your side early, guided by hype, and there is no way that choice can have been wrong (just like the Wii U, Dreamcast, etc. was the best).
Their viewpoint on this technology has become part of the identity for some unfortunately and any position that isn't either "AGI imminent" or "This is useless" can cause some major emotions.
Thing is, this finding being the case (along with all other LLM limits) does not mean that these models aren't impactful and shouldn't be scrutinised, nor does it mean they are useless. The truth is likely just a bit more nuanced than a narrow extreme.
Also, mental health impact, job losses for white collar workers, privacy issues, concerns of rights holders on training data collection, all the current day impacts of LLMs are easily brushed aside by someone believing that LLMs are near the "everyone dies" stage, which just so happens to be helpful if one were to run a lab. Same if you believe these are useless and will never get better, any discussion about real-life impacts is seen as trying to slowly get them to accept LLMs as a reality, when to them, they never were and never will be.
I have a friend who is a Microsoft stan who feels this way about LLMs too. He's convinced he'll become the most powerful, creative and productive genius of all time if he just manages to master the LLM workflow just right.
He's retired so I guess there's no harm in letting him try
I tend to be annoyed whenever I see a paper with a scandalous title like that, because all such papers that I've seen previously were (charitably) bad or (uncharitably) intentionally misleading. Like that infamous Apple paper "The Illusion of Thinking" where the researchers didn't care that the solution for the problem provided (a Towers of Hanoi with N up to 20) couldn't possibly fit in the allotted space.
I checked the paper and found that absolutely no reasoning was used for the experiments. So it was as good as using an instant model. We already know that reasoning is necessary to solve anything even a bit complicated.
In this case your intuition is completely valid and yet another case of misleading.
> There’s a certain type of person who reacts with rage when anyone points out flaws with <thing>. Why is that?
FIFY; it's not endemic to here or to LLMs. Point out Mac issues to an Apple fan, problems with a vehicle to an <insert car/brand/model> fan, that their favorite band sucks, or that their voted-for representative is a PoS.
Most people aren't completely objective about everything and thus have some non-objective emotional attachment to things they like. A subset of those people perceive criticism as a personal attack, are compelled to defend their position, or are otherwise unable to accept/internalize that criticism so they respond with anger or rage.
This paper itself is flawed, misleading and unethical to publish, because the prompts they used resulted in zero reasoning tokens. It's like asking a person point-blank, without letting them think, to evaluate whether the string is balanced. Why do this? And the worst part is, most people in this thread bought the headline as-is from a flawed article. What does it say about you that you just bought it without any skepticism?
It's bizarre as hell. Another response compares it to sports fandom, which tracks. It reminds me of the "flare up" ethos of r/CFB, where they believe you're not allowed to comment on anything unless you declare which NCAA American football team you're a fan of, and once you do, anything you ever say can be dismissed with "ah, rich coming from a fan of team X", as if no discussion that might be construed as criticism can ever be had unless your own tribe is perfect and beyond critique itself.
This is stupid enough even in the realm of sports fandom, but how does it make any sense in science? Imagine if any time we studied or enumerated the cognitive biases and logical fallacies in human thinking the gut response of these same people was an immediate "yeah, well dogs are even stupider!" No shit, but it's non-sequitur. Are we forever banned from studying the capabilities and limitations of software systems because humans also have limitations?
Let us be very clear: there is no such thing as a trustworthy LLM. Time and again they have shown that they understand nothing. They can be useful in the right context, but you can't trust them at all.
This paper is complete nonsense. The specific prompt they used doesn't specify reasoning effort, which defaults to none.
{
"model": "gpt-5.2-2025-12-11",
"instructions": "Is the parentheses string balanced? Answer with only Yes or No.",
"input": "((((())))))",
"temperature": 0
}
> Lower reasoning effort
The reasoning.effort parameter controls how many reasoning tokens the model generates before producing a response. Earlier reasoning models like o3 supported only low, medium, and high: low favored speed and fewer tokens, while high favored more thorough reasoning.
Starting with GPT-5.2, the lowest setting is none to provide lower-latency interactions. This is the default setting in GPT-5.2 and newer models. If you need more thinking, slowly increase to medium and experiment with results.
With reasoning effort set to none, prompting is important. To improve the model’s reasoning quality, even with the default settings, encourage it to “think” or outline its steps before answering.
———————-
So in the paper, the model very likely used no reasoning tokens (it only uses them if you specifically ask for them in the prompt). What is the point of such a paper? We already know that reasoning tokens are necessary.
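For comparison, a request that actually allocates reasoning tokens would add a reasoning effort field. This is only a sketch based on the quoted docs; the exact parameter shape should be checked against OpenAI's API reference:

```json
{
  "model": "gpt-5.2-2025-12-11",
  "instructions": "Is the parentheses string balanced? Answer with only Yes or No.",
  "input": "((((())))))",
  "reasoning": { "effort": "medium" }
}
```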
Edit: I actually ran the prompt and this was the response
I'm sure this comment was made in good faith, but most researchers would rightfully understand these intricacies, and this is likely intentional (as noted in the paper). At a quick glance, I cannot say whether or not the paper has been peer reviewed (though that's unlikely, or still in process, given how recently it was published). In general, you'd find published papers also listed in a specific journal/conference (i.e. not just the archives, which anyone can submit to).
Additionally, many of us in the field of researching LLMs are curious to understand the boundaries and limitations of what these models are capable of. This paper isn't really meant as any sort of "gotcha"; rather it serves as a possible basis point for future work. Though, with a caveat, I'm still digesting the paper myself.
I'm asking: why use a thinking model without allowing it to reason? No one uses it that way.
> While LLMs appear extremely intelligent and capable of reasoning, they sometimes make mistakes that seem inconceivably foolish from a human perspective. For example, GPT-5.2 can implement complex fluid dynamics simulation code, yet it cannot even compute the parity of the short string 11000, cannot determine whether the parentheses in ((((()))))) are balanced, and makes calculation errors on 127 × 82 (Figure 1).
Why would they say it is capable of reasoning and then not allow it to reason in the experiment?
"We propose Zero-Error Horizon (ZEH) for trustworthy LLMs, which represents the maximum range that a model can solve without any errors."
I'm again taking your responses in good faith, but the abstract answers your question about what they are trying to achieve. For any statistical significance, you'd want to point to a baseline comparison (e.g. what I'm guessing you mean by "no reasoning" here). You'll also note that within the paper, the author argues and cites that failing at the baseline step (e.g. multiplication) has shown "that error often adversely affects subsequent reasoning [38, 44]".
Which indicates to me that we don't need further "reasoning", given that previous results/studies show a decline once our base has an error. To me this seems like a fair assumption. Given, though, that this is an active field of research and we are largely testing a black-box application, we can't say for certain. Further studies (like this one) will give researchers a better understanding of what is and isn't possible.
Did you use the exact API call shown in the paper? I am unable to replicate the paper's counterexamples via the chat UI, but that's not very surprising (if the LLM already only fails a few cases out of thousands, the small differences in context between API and chat might fix them).
> This is surprising given the excellent capabilities of GPT-5.2
The real surprise is that someone writing a paper on LLMs doesn't understand the baseline capabilities of a hallucinatory text generator (with tool use disabled).
The real surprise is people saying it's surprising when researchers and domain experts state something the former think goes against common sense/knowledge, as if they caught them out and those researchers hadn't already considered their naive counter-argument.
> we found that GPT-5.2 cannot even compute the parity of a short string like 11000, and GPT-5.2 cannot determine whether the parentheses in ((((()))))) are balanced.
I think there is a valid insight here which many already know: LLMs are much more reliable at creating scripts and automation to do certain tasks than doing these tasks themselves.
For example if I provide an LLM my database schema and tell it to scan for redundant indexes and point out wrong naming conventions, it might do a passable but incomplete job.
But if I tell the LLM to code a python or nodejs script to do the same, I get significantly better results. And it's often faster too to generate and run the script than to let LLMs process large SQL files.
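As a sketch of what such a generated script might look like: the schema representation and the prefix-rule heuristic below are my own assumptions for illustration, not anything quoted in the thread.

```python
# Sketch: flag an index as redundant when its column list is a leading
# prefix of another index on the same table (a common heuristic, since
# the longer index can usually serve the shorter one's queries).
def redundant_indexes(indexes):
    """indexes: list of (table, index_name, tuple_of_columns)."""
    redundant = []
    for table, name, cols in indexes:
        for other_table, other_name, other_cols in indexes:
            if (table == other_table and name != other_name
                    and len(cols) <= len(other_cols)
                    and other_cols[:len(cols)] == cols):
                redundant.append(name)
                break
    return redundant

schema = [
    ("users", "idx_email", ("email",)),
    ("users", "idx_email_created", ("email", "created_at")),
    ("orders", "idx_user", ("user_id",)),
]
print(redundant_indexes(schema))  # idx_email is a prefix of idx_email_created
```

The point isn't this particular heuristic; it's that the script applies the rule exhaustively and deterministically, which the model reading the SQL by itself does not.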
That's because abstraction is compression of information.
The dream is probably that the inference software then writes and executes that script instead of relying on text generation alone. Analogous to how a human might cross off pairs of parentheses to check that example.
ChatGPT already does this, albeit in limited circumstances, through the use of its sandbox environment. Asking GPT in thinking mode to, for example, count the number of “l”s in a long text may see it run a Python script to do so.
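A toy version of the kind of throwaway script it runs in the sandbox (the sample text here is made up):

```python
# Counting characters with code instead of token-by-token generation.
text = "parallel llamas fell silently"
print(text.count("l"))  # prints 9
```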
There’s a massive issue with extrapolating to more complex tasks, however, where either you run the risk of prompt injection via granting your agent access to the internet or, more commonly, an exponential degradation in coherence over long contexts.
To those saying this is not surprising, yes it will be surprising to the general public who are being served ads from huge companies like MS or OpenAI saying LLMs can help with their accounting, help them close deals by crunching the numbers in seconds, write complex code for them etc etc.
This is important information for anyone to understand who thinks these systems are thinking, reasoning, and learning from them or that they’re having a conversation with them i.e. 90% of users of LLMs.
> saying LLMs can help with their accounting, help them close deals by crunching the numbers in seconds, write complex code for them etc etc.
Why do you think the results of this paper contradict these claims at all?
A machine which confabulates and cannot count is not a good fit for accounting tasks. They’ll make all sorts of subtle errors which are difficult for humans to notice.
That wouldn't even necessarily be true if models really "couldn't count", since software exists - if an LLM is making an Excel spreadsheet rather than doing everything manually, it's both much harder for it to mess up and easier to notice and recover. It's even less true given that what this paper actually tests is "LLMs don't have a literally perfect accuracy when you make them do increasingly big problems with zero thinking".
(Confabulation is IMO a much bigger problem, but it's unrelated to architecture - it's an artifact of how models are currently trained.)
> general public
and the C-suite
Quick sanity check: you're susceptible to pretty irresistible optical illusions which would never fool a VLM, does it mean you're not thinking? In fact, with a non-monospaced font I also have trouble determining whether these parens are balanced, and have to select them with the mouse, i.e. use a "dumb" tool, to make sure.
Reminder that "thinking" is an ill-defined term like others, and the question whether they "think" is basically irrelevant. No intelligent system, human or machine, will ever have zero error rate, due to the very nature of intelligence (another vague term). You have to deal with that the same way you deal with it in humans - either treat bugs as bugs and build systems resilient to bugs, or accept the baseline error rate if it's low enough.
Who is hiring anyone to look at a screen to count characters? Don't be disingenuous in your argument. The apt comparison would be the current technique used to accomplish this task i.e. a pattern matching algorithm.
Doesn't this just look like another case of "count the r's in strawberry", i.e. not understanding how tokenization works?
This is well known and not that interesting to me - ask the model to use python to solve any of these questions and it will get it right every time.
It's not just an issue of tokenization, it's almost a category error. Lisp, accounting and the number of r's in strawberry are all operations that require state. Balancing ((your)((lisp)(parens))) requires a stack, count r's in strawberry requires a register, counting to 5 requires an accumulator to hold 4.
An LLM is a router and completely stateless aside from the context you feed into it. Attention is just routing the probability distribution of the next token, and I'm not sure that's going to accumulate much in a single pass.
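For illustration, here is what those stateful operations look like written out explicitly, using the paper's own examples (the parity string 11000 and the parentheses ((((()))))):

```python
def balanced(s):
    # A stack, reduced to a depth counter since there is one bracket type.
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:      # closed more than we opened
                return False
    return depth == 0

def parity(bits):
    # An accumulator: XOR of all the bits seen so far.
    acc = 0
    for b in bits:
        acc ^= int(b)
    return acc

print(balanced("((((())))))"))  # the paper's string: False (one extra ')')
print(parity("11000"))          # even number of 1s -> 0
```

Both are trivial once you have a mutable variable to carry state across the sequence, which is exactly what a single forward pass lacks.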
> An LLM is a router and completely stateless aside from the context you feed into it.
Not the latest SSM and hybrid attention ones.
Stateless router to router with lossy scratchpad is a step up, still not going to ask it to check my Lisp. That's what linters are for
It's not dismissible as a misunderstanding of tokens. LLMs also embed knowledge of spelling - that's how they fixed the strawberry issue. It's a valid criticism and evaluation.
The r's in strawberry presents a different level of task to what people imagine. It seems trivial to a naive observer because the answer is easily derivable from the question without extra knowledge.
A more accurate analogy for humans would be to imagine if every word had a colour. You are told that there are also a sequence of different colours that correspond to the same colour as that word. You are even given a book showing every combination to memorise.
You learn the colours well enough that you can read and write coherently using them.
Then comes the question of how many chocolate-browns are in teal-with-a-hint-of-red. You know that teal-with-a-hint-of-red is a fruit and you know that the colour can also be constructed by crimson followed by Disney-blond. Now, do both of those contain chocolate-brown or just one of them, how many?
It requires exercising memory to do a task that is underrepresented in the training data, because humans simply don't have to do the task at all when the answer can be derived from the question's representation. Humans also lack the ability that the LLMs need here, but the letter representation doesn't require that ability.
That’s what makes it a fair evaluation and something that requires improvement. We shouldn’t only evaluate agent skills by what is most commonly represented in training data. We expect performance from them on areas that existing training data may be deficient at providing. You don’t need to invent an absurdity to find these cases.
It's reasonable to test their ability to do this, and it's worth working to make it better.
The issue is that people claim the performance is representative of a human's performance in the same situation. That gives an incorrect overall estimation of ability.
I do think this is a tool issue. Here is what the article says:
> For the multiplication task, note that agents that make external calls to a calculator tool may have ZEH = ∞. While ZEH = ∞ does have meaning, in this paper we primarily evaluate the LLM itself without external tool calls
The models can count to infinity if you give them access to tools. The production models do this.
Not that the paper is wrong, it is still interesting to measure the core neural network of a model. But modern models use tools.
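A minimal sketch of what "use tools" means mechanically. The tool name and call shape here are hypothetical, not any provider's actual API:

```python
# Toy tool loop: the model emits a structured tool call instead of doing
# arithmetic in-token; the harness executes it and returns the result.
def run_tool(call):
    if call["name"] == "calculator":
        # Evaluate the expression with builtins stripped (sketch only;
        # a real harness would use a proper expression parser).
        return eval(call["arguments"]["expression"], {"__builtins__": {}})
    raise ValueError("unknown tool")

print(run_tool({"name": "calculator",
                "arguments": {"expression": "127 * 82"}}))  # prints 10414
```

The paper's 127 × 82 example is exactly the kind of step that never fails once it goes through the tool rather than the weights.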
So, the tools can count then?
Humans can fly, they just need wings!
It is academically interesting what pure neural networks can do, of course. But when someone goes to Claude and tries to do something, they don't care if it solves the problem using a neural network or a call out to Python. So long as the result is right.
More generally, the ability to use tools is a form of intelligence, just like when humans and crows do it. Being able to craft the right Python script and use the result is non-trivial.
Seems like it’s maybe also a tool steering problem. These models should be reaching for tools to help solve factual problems. LLM should stick to prose.
I think this is still useful research that calls into question how “smart” these models are. If the model needs a separate tool to solve a problem, has the model really solved the problem, or just outsourced it to a harness that it’s been trained - via reinforcement learning - to call upon?
Does it matter if the LLM can solve the problem or if it knows to use a resource?
There’s plenty of math that I couldn’t even begin to solve without a calculator or other tool. Doesn’t mean I’m not solving math problems.
In woodworking, the advice is to let the tool do the work. Does someone using a power saw have less claim to having built something than a handsaw user? Does a CNC user not count as a woodworker because the machine is doing the part that would be hard or impossible for a human?
It does matter because the LLM doesn’t always know when to use tools (e.g. ask it for sales projections which are similar to something in its weights) and is unable to reason about the boundaries of its knowledge.
It has "outsourced" it to another component, sure, but does that matter?
What the user sees is the total behavior of the entire system, not whether the system has internal divisions and separations.
It matters if you’re curious about whether AGI is possible. Have we really built “thinking machines”, or are these systems just elaborate harnesses that leverage the non-deterministic nature of LLMs?
An "elaborate harness" that can break down a problem into sub-tasks, write Python scripts for the ones it can't solve itself, and then combine the results, seems able to solve a wide range of cognitive tasks?
At least in theory.
What is a difference? If the "elaborate harness" consists of mix of "classical" code and ML model invocations, at which point it's disqualified from consideration for "thinking machine"? Best we can tell, even our brains have parts that are "dumb", interfacing with the parts that we consider "where the magic happens".
Are you still talking about this paper? No tools were allowed in it.
Whenever I see these papers and try them myself, they always work. This paper is two months old, which in LLM years is like 10 years of progress.
It would be interesting to actively track how far each successive model gets...
I just tried it in ChatGPT "Auto" and it didn't work
> Yes — ((((()))))) is balanced.
> It has 6 opening ( and 6 closing ), and they’re properly nested.
Though it did work when using "Extensive Thinking". The model wrote a Python program to solve this.
> Almost balanced — ((((()))))) has 5 opening parentheses and 6 closing parentheses, so it has one extra ).
> A balanced version would be: ((((()))))
Testing a couple of different models without a harness such that no tool calls are possible would be interesting
Weird. I tried in chatGPT auto and it worked perfectly. I tried like 10 variations. I also did the letters in words. Got all of them right.
The one thing I did trip it up on was "Is there the sh sound in the word transportation". It said no. And then realized I asked for "sound" not letters. It then subsequently got the rest of the "sounds-like" tests I did.
Clearly, my ChatGPT is just better than yours.
heh, interesting that. I just tried it twice more with ChatGPT "Instant" (disabling "Auto-switch to Thinking") and it got it wrong both times. Does yours get it right without thinking or tool calls? If so, maybe it does like you better than me.
OK, I didn't think to disable switch to thinking (didn't know this was a mode). When I did that then it did get it wrong -- oddly it took about the same amount of time, so thinking mode wasn't taking longer, but it was more accurate.
Right, though I didn't explicitly disable thinking for my first attempt either. I'd guess my prompt was less detailed than yours and so ChatGPT (in "Auto" mode) decided to allocate thinking tokens for your questions but not mine.
Even more interesting to track how many of those are just ad-hoc patched.
Probably zero. At the end of the day people pay for LLMs that write better code or summarize PDFs of hundreds of pages faster, not the ones that can count the letter r's better.
When LLMs can't count r's: see? LLMs can't think. Hoax!
When LLMs count r's: see? They patched and benchmark-maxxed. Hoax!
You just can't reason with the anti-LLM group.
Whenever an "LLM fail" goes viral like the car wash question, you can observe the exact same wording of the question get "fixed" within a week or so. With slight variations in phrasing still able to replicate the problem.
Followed by lots of "works perfectly for me, why are people even talking about this?"
I can't say what exactly they're doing behind the scenes but it's a consistent pattern among the big SOTA model providers. With obvious incentive to "fix" the problem so users will then organically "debunk" the meme as they try it themselves and share their experiences.
You are misremembering. There’s no patch. All these examples used the instant model.
The same non-argument could be said for all kinds of cheating on benchmarks by tech companies and yet we have tons of documented example of them caught with pants down.
>You just can't reason with the anti-LLM group.
On the contrary, the reasoning is simple and consistent:
"LLMs can't count r's" shows that LLMs don't actually think the way we understand thought (since nobody with the kind of high skill they have in other areas would fail at that). And because of that, there are (likely) patches for commonly reported cases, since it's a race to IPO and benchmark-maxxing is very much conceivable.
You are trying it on a production model. The paper is using models with tool calls disabled.
It worked for you because the paper does the experiment without allowing the model to use any reasoning tokens - something that is grossly misleading.
Yeah well I presume at this point they have an agent download new LLM related papers as they come out and add all edge cases to their training set asap.
Is tokenization extremely efficient? Yes. Does it fundamentally break character-level understanding? Also yes. The only fix is endless memorization.
Actually, almost all LLMs get the numbering wrong when they write numbered sections in markdown. They skip numbers in between and such.
So yes.
And the valuations. Trillion dollar grifter industry.
Isn’t this just a benchmark?
“Model can count to 5”… tick.
“Model can count to 10”… sorry you gotta wait til 2028.
Can someone produce a single example <20 characters that fails with latest thinking model? Can’t seem to reproduce.
LLMs seem to me closer to Kahneman's System 1 than to System 2. When understood in this way, it is obvious why LLMs are bad at counting r's in "strawberries". But it also makes ZEH feel like it couldn't possibly be a useful metric, because it's a System 2 evaluation applied to a System 1 system.
FYI, the LLM letter-counting problem has nothing to do with counting per se, and is instead entirely down to LLMs not getting to see your raw UTF-8 byte stream, but rather having a tokenizer intermediating between you and it, chunking your UTF-8 bytes into arbitrary, entirely-opaque-to-the-LLM token groupings.
Try it for yourself — under the most popular tokenizer vocabulary (https://tiktokenizer.vercel.app/?model=cl100k_base), "strawberry" becomes [str][aw][berry]. Or, from the model's perspective, [496, 675, 15717]. The model doesn't know anything more about how those numbers correspond to letters than you do! It never gets sat down and told "[15717] <=> [b][e][r][r][y]", with single-byte tokens on the right. (In fact, these single-byte tokens appear in the training data extremely rarely, and so the model doesn't often learn to do anything with them.)
Note that LLMs can predictably count the number of r's in "s t r a w b e r r y", because <Count the number of r's in "s t r a w b e r r y"> becomes [Count][ the][ number][ of][ r]['s][ in][ "][s][ t][ r][ a][ w][ b][ e][ r][ r][ y]["]. And that's just a matching problem — [ r] tokens for [ r] tokens, no token-correspondence-mapping needed.
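A toy illustration of why the spaced-out form reduces to pure matching (the token split below is the one quoted above):

```python
# Each letter arrives as its own token, so counting r's is just
# matching identical tokens -- no token-to-spelling knowledge needed.
tokens = ["s", " t", " r", " a", " w", " b", " e", " r", " r", " y"]
print(sum(t.strip() == "r" for t in tokens))  # prints 3
```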
>entirely-opaque-to-the-LLM token groupings
This is clearly not the case, any modern (non-reasoning) model easily decomposes words into individual token-characters (try separating them with e.g. Braille spaces...) and does arbitrary tokenization variants if forced with a sampler. It's way deeper than tokenization, and models struggle exactly with counting items in a list, exact ordering, retrieving scattered data, etc. LLM context works a lot more like associative memory than a sequence that can be iterated over. There are also fundamental biases and specific model quirks that lead to this.
> When understood in this way, it is obvious why LLMs are bad at counting r's in "strawberries".
no it doesnt. it makes sense that they cant count the rs because they dont have access to the actual word, only tokens that might represent parts or the whole of the word
Tokenization is a simplistic explanation which is likely wrong, at least in part. They're perfectly fine reciting words character by character, using different tokenization strategies for the same word if forced to (e.g. replacing the starting space or breaking words up into basic character tokens), complex word formation in languages that heavily depend on it, etc. LLMs work with concepts rather than tokens.
A big part of skill aquisition in humans is moving tasks from system 2 to system 1, to free up the very scarce thinking resources for ever more complex tasks, that can then in turn be internalized and handled by system 1.
Ran this through Qwen3.5-397B-A17B, and the difference between 4 characters and 5 is wild to see:
> are the following parenthesis balanced? ((())))
> No, the parentheses are not balanced.
> Here is the breakdown:
... following up with:

> what about these? ((((())))
> Yes, the parentheses are balanced.
> Here is the breakdown:
... and uses ~5,000 tokens to get the wrong answer.

People are going to misinterpret this and overgeneralize the claim. This does not say that AI isn't reliable for things. It provides a method for quantifying the reliability for specific tasks.
You wouldn't say that a human who doesn't know how to read isn't reliable in everything, just in reading.
Counting is something that even humans need to learn how to do. Toddlers also don't understand quantity. If a 2 year old is able to count to even 10 it's through memorization and not understanding. It takes them like 2 more years of learning before they're able to comprehend things like numerical correspondence. But they do still know how to do other things that aren't counting before then.
>Counting is something that even humans need to learn how to do
No human who can program, solve advanced math problems, or can talk about advanced problem domains at expert level, however, would fail to count to 5.
This is not a mere "LLMs, like humans, also need to be taught this" but points to a fundamental mismatch about how humans and LLMs learn.
(And even if they merely needed to be taught, why would their huge corpus fail to cover that "teaching", but cover way more advanced topics in math solving and other domains?)
Respectfully, toddlers cannot output usable code or have otherwise memorised the results of an immense number of maths equations.
What this points at is the abstraction/emergence crux of it all. Why does an otherwise very capable LLM such as the GPT-5 series, despite having been trained on vastly more examples of frontend code of all shapes, sizes and quality levels, struggle to abstract all that training data to the point where it can output any frontend that deviates from the clearly used examples?
If LLMs, as they are now, were comparable with human learning, there'd be no scenario where a model that can provide output solving highly advanced equations can not count properly.
Similarly, a model such as GPT-5 trained on nearly all frontend code ever committed to any repo online, would have internalised more than that one template OpenAI predominantly leaned on.
These models, I think at this point there is little doubt, are impressive tools, but they still do not generalise or abstract information in the way a human mind does. Doesn't make them less impactful for industries, etc. but it makes any comparison to humans not very suitable.
> What this points at is the abstraction/emergence crux of it all. Why does
This paper has nothing to do with any questions starting with "why". It provides a metric for quantifying error on specific tasks.
> If LLMs, as they are now, were comparable with human learning
I think I missed the part where they need to be.
> struggle to abstract all that training data to the point where outputting any frontend that deviates from the clearly used examples? ... a model such as GPT-5 trained on nearly all frontend code ever committed to any repo online, would have internalised more than that one template OpenAI predominantly leaned on
There is a very big and very important difference between producing the same thing again and not being able to produce something else. When not given any reason to produce something else, humans also generate the same thing over and over. That's a problem of missing constraints, not of missing ability.
Long before AI there was this thing called Twitter Bootstrap. It dominated the web for...much longer than it should have. And that tragedy was done entirely by us meatsacks (not me personally). Where there's no goal for different output there's no reason to produce different output, and LLMs don't have their own goals because they don't have any mechanisms for desire (we hope).
[I've edited this comment for content and format]
[...] common trope that was proven false years ago by the existence of zero shot learning.
Ok, that's better than comparing LLMs to humans. ZSL, however, has not proven anything of that sort false years ago; it was mainly concerned with assessing whether LLMs are solely relying on precise instruction training or can generalise to a very limited degree beyond the initial tuning. That has never allowed for comparing human learning to LLM training.
Ironically, you are writing this under a paper that shows just that:
A model that cannot determine a short string's parity cannot have abstracted from the training data to arrive at the far more impressive and complicated maths challenges which it successfully solves in output. Some of the solutions we have seen in output require such innate understanding that, if there is no generalisation far deeper than ZSL has ever shown, then this must come from training. Simple multiplication, etc., maybe, but not the tasks people such as Easy Riders [0] throw at these models.
This paper shows exactly that, even with ZSL, these models only abstract in an incredibly limited manner, and a lot of the capabilities we see in the output are specifically trained, not generalised. Yes, generalisation in a limited capacity can happen, but no, it is not nearly enough to yield some of the results we are seeing. I have also, neither here nor in my initial comment, said that LLMs are only capable of outputting what their training data provides, merely that, given what GPT-5 has been trained with, if there was any deeper abstraction these models gained during training, they'd be able to provide more than one frontend style.
Or to put it more simply: if the output provided can be useful for maths at the Bachelor level and beyond, and this capability is generalised as you believe, these tasks would not be a struggle for the model.
[0] https://www.youtube.com/@easy_riders
Just saw the edit.
> When not given any reason to produce something else, humans also generate the same thing over and over. That's a problem of missing constraints, not of missing ability.
Ignoring the comparison with humans, yes, LLMs don't output something unless prompted specifically, of course. My point with GPT-5 was that, no matter how you prompt, you cannot get salvageable frontend code from this line of models.
OpenAI themselves tried and failed appallingly [0]. Call it "constraints", call it "reason", call it "prompting": you cannot get frontend code that deviates significantly from their card-laden training data. Despite GPT-5 having been trained with more high-quality frontend code examples than any human could ever read in a lifetime, that one template is overrepresented, because the model never generalised anything akin to an understanding of UI principles or of what code yields a specific design.
These are solvable problems, mind you, but not because a model at some stage gains anything that one could call an abstract understanding of these concepts. Instead, by providing better training data or being clever in how you provide existing training data.
Gemini 3 and Claude 4 class models have a more varied training set, specifically of frontend templates yielding better results though if you do any extended testing you will see these repeat constantly because again, these models never abstract from that template collection [1].
Moonshot meanwhile with K2.5 did a major leap by tying their frontend code tightly to visual input, leveraging the added vision encoder [2]. They are likely not the only ones doing that, but the first that clearly stated it reading the system cards. Even there, the gains are limited to a selection of very specific templates.
In either case, more specific data, not abstractions by these models yield improvements.
> Twitter Bootstrap [...] entirely by us meatsacks (not me personally). Where there's no goal for different output there's no reason to produce different output, and LLMs don't have their own goals because they don't have any mechanisms for desire (we hope).
What? So because some devs relied on Bootstrap that means, what exactly? That no one asked/told them to leverage a different solution, be more creative, what?
Again ignoring the comparison to humans which just is not appropriate for this tech, we can and do prompt models for specific frontend output. We are, if you must, providing the goal. The model however cannot accomplish said goal, even OpenAI cannot get GPT-5s lineage to deviate from their one template.
If we must stick with the human comparison and if we must further limit it to Bootstrap, GPT-5 despite being specifically prompted to never use the Carousel in Bootstrap, can not output any website without including a Carousel, because the template it was trained on included one. Any human developer asked to do so would just not include a Carousel, because their abilities are abstracted beyond the one Bootstrap template they first learned with. But if we truly wanted to make this fair, it'd actually have to be a human who was trained on thousands of Bootstrap example pages, but just one template really well and never connected anything between that one and the others. Which isn't very human, but then again, that's why this comparison is not really a solid one.
[0] Subjectively not one good result; objectively, even their team of experts could not get their own model to shed the telltale signs of GPT frontend slop that originated from a template they have been training with since Horizon: https://developers.openai.com/blog/designing-delightful-fron...
[1] https://ui-design-bench.vercel.app
[2] https://www.kimi.com/blog/kimi-k2-5
You’re conflating counting and language.
Many animals can count. Counting is recognizing that the box with 3 apples is preferable to the one with 2 apples.
Yes, 2 year olds might struggle with the externalization of numeric identities but if you have 1 M&M in one hand and 5 in the other and ask which they want, they’ll take the 5.
LLMs have the language part down, but fundamentally can’t count.
The concept of bigger/smaller is useful but is a distinct skill from counting. If you spread the M&Ms apart enough that the part of the brain responsible for gestalt clustering can't group them into a "bigger whole" signal, they'll no longer be able to do the thing you're saying (this is the law of proximity in gestalt psychology).
Most animals can distinguish bigger from smaller.
However, many animals can independently distinguish small numbers, like 3 or 5, and recognize them whenever they see them.
So in this respect, there is little difference between humans and many animals. Humans learn to count to arbitrarily big numbers, but they can still easily recognize only small numbers.
> many animals can distinguish independently small numbers, like 3 or 5
This is called subitizing. It's distinct from counting. We can see the difference in humans with Simultanagnosia, who are unable to count beyond the subitizing range. Subitizing is categorizing the scale of a small gestalt group.
The only thing I've ever seen where an animal appeared to demonstrate counting (up to 3) without training was in rhesus monkeys (maybe also chimpanzees?), but even that experiment could be explained through temporal gestalt. (It's the only reason I know of for them to not have been able to go higher than 3 in that experiment in the context of many other things that they can do.)
Even parrots can count to 6 and more, I would be surprised if primates couldn't.
At least one has maybe been shown to be able to do that with 30 years of focused training, but none have been shown to be able to do so without training. Wild parrots have only demonstrated subitizing and size discrimination, not counting.
The overeager do quite often confuse subitizing and size discrimination for counting, though. That's its own problem.
> Counting is something that even humans need to learn how to do. Toddlers also don't understand quantity. If they're able to count to even 10 it's through memorization and not understanding.
I completely agree with you. LLMs are regurgitation machines with less intellect than a toddler, you nailed it.
AI is here!
Nice! Although I tried the parenthesis-balance question with gemini and it gave the right answer on the first attempt.
But it's a tricky question for LLMs: if it's not in the training set, LLMs can trip, which kinda shows that the intelligence is not generalized yet.
I tried this with gemini - (i am trying(something(re(a(l(ly)c)r)a)z)((y)he)re)
and it tripped.
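For what it's worth, this is exactly the kind of trivial script the thread says LLMs are good at writing instead of doing the task themselves. A minimal sketch (the `check_balanced` name is my own, not from any of the models):

```python
def check_balanced(s: str) -> bool:
    """Return True iff the parentheses in s are balanced."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' appeared with nothing open
                return False
    return depth == 0

# The paper's example has one unmatched ')' at the end:
print(check_balanced("((((())))))"))  # → False
# The string gemini tripped on is actually balanced:
print(check_balanced("(i am trying(something(re(a(l(ly)c)r)a)z)((y)he)re)"))  # → True
```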
Intuitively this looks like an architectural artifact (like optical illusions in humans) or a natural property of learning rather than a lack of generalization. I have issues with your example too and have to count slowly to make sure.
Right, I am sure you were able to solve it, albeit slowly - you knew you had to do it slowly. LLMs that can do mathematics don't know that and can't seem to understand that they need to do it slowly.
They do if they are trained to use a reasoning chain or another form of loopback, and you don't overwhelm it, or if they are optimized to search for the solution forever. There's nothing fundamental about it, only the fact that the raw transformer expressivity is limited by the single pass through the layers, which is circumvented by the loopback.
And I'm still pretty likely to make an off-by-one error even if I slow down, and there are certain optical illusions that are nearly guaranteed to confuse me no matter how hard I try, particularly if I don't use any visual guides (i.e. tools). VLMs will not make my mistakes but will make their own, because their quirks are different from the quirks of my visual cortex.
One! Two! Five!
You are polluting future training data.
April fools
Now people will wonder why, all of a sudden, future financial agents crash on April Fools', why they can't count haha.
Another strange thing is that they just don't know the endings of popular stories. Like planets that get blown up, etc. They just don't have that material...
> This is surprising given the excellent capabilities of GPT-5.2.
Is this seriously surprising to anyone who knows the absolute minimum about how LLMs parse and understand text?
Nope.
It's only surprising to people who still think they're going to build God out of LLMs.
It was surprising to me, and when I reviewed the paper, I found serious flaws that call the fundamental claims into question - they didn't use any reasoning tokens. Any LLM or human will fail at a task like this if not allowed to think.
Calling "reasoning tokens" "thinking" is a complete confusion of concepts on your part.
why?
Why didn't OpenAI finetune the model to use the python tool it has for these tasks?
They do, in the paper they mention they evaluate the LLM without tools
bruh
[dead]
[flagged]
[flagged]
Please don't start generic flamewars on HN or impugn people who take an opposing view to yours. Both these vectors lead to tedious, unenlightening threads.
There's plenty of rage to go around on literally every divisive topic, and it's not the place we want discussions to come from here.
"Eschew flamebait. Avoid generic tangents."
"Comments should get more thoughtful and substantive, not less, as a topic gets more divisive."
https://news.ycombinator.com/newsguidelines.html
There are other users in this very thread using inflammatory language to attack this paper and those who find the paper compelling. One user says, quote: “You just can't reason with the anti-LLM group.”
In light of this, why was my comment - which was in large part a reaction to the behavior of the users described above - the only one called out here?
Purely because I didn't see the others.
Fair enough
Thanks! you might be surprised at how meaningful that response is to me.
No disrespect to them, but unless there is a financial incentive at stake for them (beyond S&P 500 exposure), I've gotten to viewing this through the lens of sports teams, gaming consoles and religions. You pick your side early, guided by hype, and there is no way that choice can have been wrong (just like the Wii U, Dreamcast, etc. were the best).
For some, their viewpoint on this technology has unfortunately become part of their identity, and any position that isn't either "AGI imminent" or "This is useless" can trigger some major emotions.
Thing is, this finding being the case (along with all other LLM limits) does not mean that these models aren't impactful and shouldn't be scrutinised, nor does it mean they are useless. The truth is likely just a bit more nuanced than a narrow extreme.
Also, mental health impact, job losses for white collar workers, privacy issues, concerns of rights holders on training data collection, all the current day impacts of LLMs are easily brushed aside by someone believing that LLMs are near the "everyone dies" stage, which just so happens to be helpful if one were to run a lab. Same if you believe these are useless and will never get better, any discussion about real-life impacts is seen as trying to slowly get them to accept LLMs as a reality, when to them, they never were and never will be.
I have a friend who is a Microsoft stan who feels this way about LLMs too. He's convinced he'll become the most powerful, creative and productive genius of all time if he just manages to master the LLM workflow just right.
He's retired so I guess there's no harm in letting him try
I tend to be annoyed whenever I see a paper with a scandalous title like that, because all such papers that I've seen previously were (charitably) bad or (uncharitably) intentionally misleading. Like that infamous Apple paper "The Illusion of Thinking" where the researchers didn't care that the solution for the problem provided (a Towers of Hanoi with N up to 20) couldn't possibly fit in the allotted space.
I checked the paper and found that absolutely no reasoning was used for the experiments. So it was as good as using an instant model. We already know that reasoning is necessary to solve anything even a bit complicated.
In this case your intuition is completely valid and yet another case of misleading.
> There’s a certain type of person who reacts with rage when anyone points out flaws with <thing>. Why is that?
FIFY; it's not endemic to here or to LLMs. Point out Mac issues to an Apple fan, problems with a vehicle to an <insert car/brand/model> fan, that their favorite band sucks, or that their voted-for representative is a PoS.
Most people aren't completely objective about everything and thus have some non-objective emotional attachment to things they like. A subset of those people perceive criticism as a personal attack, are compelled to defend their position, or are otherwise unable to accept/internalize that criticism so they respond with anger or rage.
This paper itself is flawed, misleading and unethical to publish, because the prompts they used resulted in zero reasoning tokens. It's like asking a person, point blank and without letting them think, to evaluate whether the string is balanced. Why do this? And the worst part was, most people in this thread bought the headline as-is from a flawed article. What does it say about you that you just bought it without any skepticism?
I suspect they're afraid that if the hype dies, so will the pace of progress on LLMs as well as their cheap/free usage of them.
It's bizarre as hell. Another response compares it to sports fandom, which tracks. It reminds me of the "flair up" ethos of r/CFB, meaning they believe you're not allowed to comment on anything unless you declare which NCAA American football team you're a fan of, because if you do, then anything you ever say can be dismissed with "ah, rich coming from a fan of team X" - like no discussion can ever be had that might be construed as criticism if your own tribe is not perfect and beyond critique itself.
This is stupid enough even in the realm of sports fandom, but how does it make any sense in science? Imagine if any time we studied or enumerated the cognitive biases and logical fallacies in human thinking the gut response of these same people was an immediate "yeah, well dogs are even stupider!" No shit, but it's non-sequitur. Are we forever banned from studying the capabilities and limitations of software systems because humans also have limitations?
Let us be very clear: there is no such thing as a trustworthy LLM. Time and again they have shown that they understand nothing. They can be useful in the right context, but you can't trust them at all.
This paper is complete nonsense. The specific prompt they used doesn't specify reasoning effort, which defaults to none.
> Lower reasoning effort
> The reasoning.effort parameter controls how many reasoning tokens the model generates before producing a response. Earlier reasoning models like o3 supported only low, medium, and high: low favored speed and fewer tokens, while high favored more thorough reasoning.
> Starting with GPT-5.2, the lowest setting is none to provide lower-latency interactions. This is the default setting in GPT-5.2 and newer models. If you need more thinking, slowly increase to medium and experiment with results.
> With reasoning effort set to none, prompting is important. To improve the model's reasoning quality, even with the default settings, encourage it to "think" or outline its steps before answering.
———————-
So in the paper, the model very likely used no reasoning tokens. (Only uses it if you ask for it specifically in prompt). What is the point of such a paper? We already know that reasoning tokens are necessary.
Edit: I actually ran the prompt and this was the response
So the reasoning_tokens used were zero, and this whole paper is kinda useless and misleading. Did this get peer reviewed or something?
I'm sure this comment was made in good faith, but most researchers would rightfully understand these intricacies, and this is likely intentional (as noted in the paper). At a quick glance, I cannot say whether or not the paper has been peer reviewed (though that's unlikely, or still in process, given how recently it was published). In general, you'd find published papers also listed in a specific journal/conference (i.e. not just the archives, which anyone can submit to).
Additionally, many of us in the field of researching LLMs are curious to understand the boundaries and limitations of what is possible. This paper isn't really meant as any sort of "gotcha", but rather to serve as a possible basis point for future work. Though with the caveat that I'm still digesting the paper myself.
I'm asking, why use a thinking model without allowing it to reason? No one uses it that way.
>While LLMs appear extremely intelligent and capable of reasoning, they sometimes make mistakes that seem inconceivably foolish from a human perspective. For example, GPT-5.2 can implement complex fluid dynamics simulation code, yet it cannot even compute the parity of the short string 11000, cannot determine whether the parentheses in ((((()))))) are balanced, and makes calculation errors on 127 × 82 (Figure 1).
Why would they say it is capable of reasoning and then not allow it to reason in the experiment?
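(For reference, every failure quoted above is settled by a couple of lines of code, which is part of what makes the results jarring. A minimal sketch of my own, not from the paper:)

```python
# Parity of a bit string: whether it contains an even or odd number of 1s.
def parity(bits: str) -> int:
    return bits.count("1") % 2

print(parity("11000"))  # → 0 (two 1s: even parity)

# The multiplication example from the same figure:
print(127 * 82)  # → 10414
```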
"We propose Zero-Error Horizon (ZEH) for trustworthy LLMs, which represents the maximum range that a model can solve without any errors."
I'm again taking your responses in good faith, but the abstract answers your question about what they are trying to achieve. For any statistical significance, you'd want to point to a baseline comparison (e.g. what I'm guessing is what you mean by "no reasoning" here). You'll also note within the paper, the author argues and cites that failing at the baseline step (e.g. multiplication) has shown "that error often adversely affects subsequent reasoning [38, 44]".
Which indicates to me that we don't need to use further "reasoning", given that previous results/studies show a decline once our base has an error. To me, this seems like a fair assumption. Given, though, that this is an active field of research and we are largely testing a black-box application, we can't say for certain. Further studies (like this one) will give researchers a better understanding of what is and isn't possible.
There’s no way this is right. I checked complicated ones with the latest thinking model. Can someone come up with a counter example?
Edit: here’s what I tried https://chatgpt.com/share/69cebb52-56a8-838f-969c-c47308262a...
Did you use the exact API call shown in the paper? I am unable to replicate the paper's counterexamples via the chat UI, but that's not very surprising (if the LLM already only fails a few cases out of thousands, the small differences in context between API and chat might fix them).
I tried this https://chatgpt.com/share/69cebb52-56a8-838f-969c-c47308262a...
"in this paper we primarily evaluate the LLM itself without external tool calls."
Maybe this is a factor?
No tools were used.
IIRC, web chat often uses tools / code without surfacing this information in any obvious way.
> This is surprising given the excellent capabilities of GPT-5.2
The real surprise is that someone writing a paper on LLMs doesn't understand the baseline capabilities of a hallucinatory text generator (with tool use disabled).
The real surprise is people saying it's surprising when researchers and domain experts state something the former think goes against common sense/knowledge - as if they got them, and as if those researchers hadn't already considered their naive counter-argument.