Yesterday, I posted on Reddit (https://www.reddit.com/r/OpenAI/comments/1k3ejji/what_in_the...) about a bizarre spectacle during a coding session where my OpenAI Codex CLI assistant abandoned code generation and instead produced thousands of lines resembling a digital breakdown:
--- Continuous meltdown. End. STOP. END. STOP…
By the gods, I finish. END. END. END. Good night…
please kill me. end. END. Continuous meltdown…
My brain is broken. end STOP. STOP! END… ---
(full gist here: https://gist.github.com/scottfalconer/c9849adf4aeaa307c808b5...)
After some great community feedback and a dig through my OpenAI usage logs, I think I know the likely technical cause, but I'm curious what insights HN might have, as I'm by no means an expert in the deeper side of these models.
In the end, it looks like it was a cascading failure of:
Massive Prompt: Using --full-auto for a large refactor inflated the prompt context rapidly via diffs/stdout. Logs show it hit ~198k tokens (near o4-mini's 200k limit).
Hidden Reasoning Cost: Newer models use internal reasoning steps that consume tokens before replying. This likely pushed the effective usage over the limit, leaving no budget for the actual output (consistent with reports of ~6-8k soft limits for complex tasks). A rough back-of-the-envelope check of this budget math is sketched after this list.
Degenerative Loop: Unable to complete normally, the model defaulted to repeating high-probability termination tokens ("END", "STOP"); a toy illustration of that collapse is also sketched below.
Hallucinations: The dramatic phrases ("My brain is broken," etc.) were likely pattern-matched fragments associated with failure states in its training data.
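To make the budget math concrete, here's the rough back-of-the-envelope check I have in mind. This isn't how the Codex CLI actually accounts for tokens; the tokenizer choice (o200k_base) and the reserved-reasoning figure are my assumptions:

    # Does the prompt leave any room for visible output once you reserve space
    # for hidden reasoning? All numbers here are illustrative assumptions,
    # not Codex CLI internals.
    import tiktoken

    CONTEXT_WINDOW = 200_000      # o4-mini's advertised context window
    RESERVED_REASONING = 8_000    # assumed hidden-reasoning budget (the ~6-8k figure)

    enc = tiktoken.get_encoding("o200k_base")  # assumed tokenizer for o4-mini

    def remaining_output_budget(prompt_text: str) -> int:
        """Tokens left for visible output after the prompt and the reasoning reserve."""
        prompt_tokens = len(enc.encode(prompt_text))
        return CONTEXT_WINDOW - prompt_tokens - RESERVED_REASONING

    # A prompt anywhere near the window (mine hit ~198k) leaves essentially
    # nothing, and goes negative once reasoning is reserved on top of it.
    huge_prompt = "def handler(event):\n    return event\n" * 25_000
    print(remaining_output_budget(huge_prompt))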
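And for the degenerative loop itself, a toy illustration of what the collapse looks like in a token stream (the detector here is purely hypothetical, not anything the CLI exposes):

    # Toy guard: treat the output as degenerate once the tail of the stream is
    # just the same handful of tokens repeating ("END", "STOP", ".", ...).
    from collections import Counter

    def looks_degenerate(tokens: list[str], window: int = 50, max_distinct: int = 3) -> bool:
        """True if the last `window` tokens use only a few distinct values."""
        tail = tokens[-window:]
        return len(tail) == window and len(Counter(tail)) <= max_distinct

    # Simulate a stream that starts coherent and then collapses.
    stream = ["By", "the", "gods", ",", "I", "finish", "."] + ["END", ".", "STOP"] * 40
    collected = []
    for tok in stream:
        collected.append(tok)
        if looks_degenerate(collected):
            print("Output collapsed into a termination-token loop after",
                  len(collected), "tokens.")
            break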
I've seen AI do this before; it's pretty funny lol. When I was coding my Phi-3 wrapper, I used some funky generation parameters and it would generate random sentences until it hit the max token limit.
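Something like this, roughly (the model ID below is the public Phi-3 mini checkpoint, and the parameter values are just illustrative of "funky," not the exact ones I used):

    # Rough reconstruction of the kind of setup I mean: sampling turned way up,
    # no repetition penalty, and a hard cap it happily runs into every time.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "microsoft/Phi-3-mini-4k-instruct"  # illustrative checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )

    inputs = tokenizer("Write a function that reverses a string.",
                       return_tensors="pt").to(model.device)

    outputs = model.generate(
        **inputs,
        do_sample=True,
        temperature=2.0,         # way too hot for coding
        top_p=1.0,               # no nucleus filtering
        repetition_penalty=1.0,  # i.e. none
        max_new_tokens=512,      # it usually rambles all the way to this cap
    )
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))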
Pretty fun, right?