GPT-5.3-Codex

(openai.com)

782 points | by meetpateltech 4 hours ago

293 comments

  • Rperry2174 2 hours ago

    What's interesting to me is that GPT-5.3 and Opus 4.6 are diverging philosophically, and really in the same way that actual engineers and orgs have diverged philosophically.

    With Codex (5.3), the framing is an interactive collaborator: you steer it mid-execution, stay in the loop, course-correct as it works.

    With Opus 4.6, the emphasis is the opposite: a more autonomous, agentic, thoughtful system that plans deeply, runs longer, and asks less of the human.

    that feels like a reflection of a real split in how people think llm-based coding should work...

    some want tight human-in-the-loop control and others want to delegate whole chunks of work and review the result

    Interested to see if we eventually see models optimize for those two philosophies and 3rd, 4th, 5th philosophies that will emerge in the coming years.

    Maybe it will be less about benchmarks and more about different ideas of what working-with-ai means

    • karmasimida 2 hours ago ago

      > With Codex (5.3), the framing is an interactive collaborator: you steer it mid-execution, stay in the loop, course-correct as it works.

      > With Opus 4.6, the emphasis is the opposite: a more autonomous, agentic, thoughtful system that plans deeply, runs longer, and asks less of the human.

      Ain't the UX the exact opposite? Codex thinks much longer before it gives you back the answer.

      • WilcoKruijer 43 minutes ago ago

        Yes, you’re right for 4.5 and 5.2. Hence they’re focusing on improving the opposite thing and thus are actually converging.

      • xd1936 2 hours ago ago

        I've also had the exact opposite experience with tone. Claude Code wants to build with me, and Codex wants to go off on its own for a while before returning with opinions.

        • mrkstu an hour ago

          It's likely that both are steering towards the middle from their current relative extremes and converging to nearly the same place.

          • gervwyk an hour ago ago

            also my experience in using these two models. they are trying to recover from oversteer perhaps.

    • ghosty141 an hour ago ago

      I'm personally 100% convinced (assuming prices stay reasonable) that the Codex approach is here to stay.

      Having a human in the loop eliminates all the problems that LLMs have, and continuously reviewing smallish chunks of code works really well in my experience.

      It saves so much time having Codex do all the plumbing so you can focus on the actual "core" part of a feature.

      LLMs still (and I doubt that changes) can't think and generalize. If I tell Codex to implement 3 features, it won't stop and find a general solution that unifies them unless explicitly told to. This makes it kinda pointless for the "full autonomy" approach, since effectively code quality and abstractions completely go down the drain over time. That's fine if it's just prototyping or "throwaway" scripts, but for bigger codebases where longevity matters it's a dealbreaker.

      • _zoltan_ 30 minutes ago ago

        I'm personally 100% convinced of the opposite, that it's a waste of time to steer them. we know now that agentic loops can converge given the proper framing and self-reflectiveness tools.

        • sealeck 21 minutes ago ago

          Converge towards what though... I think the level of testing/verification you need to have an LLM output a non-trivial feature (e.g. Paxos/anything with concurrency, business logic that isn't just "fetch value from spreadsheet, add to another number and save to the database") is pretty high.

    • utilize1808 2 hours ago

      I think it's the opposite. Especially considering Codex started out as a web app that offers very little interactivity: you are supposed to drop a request and let it run autonomously in a containerized environment; you can then follow up on it via chat --- no interactive code editing.

      • Rperry2174 2 hours ago

        Fair, I agree that was true of early Codex, and it was my perception too... but today there are two announcements that came out, and that's what I'm referring to.

        Specifically, the GPT-5.3 post explicitly leans into "interactive collaborator" language and steering mid-execution.

        OpenAI post: "Much like a colleague, you can steer and interact with GPT-5.3-Codex while it’s working, without losing context."

        OpenAI post: "Instead of waiting for a final output, you can interact in real time—ask questions, discuss approaches, and steer toward the solution"

        Claude post: "Claude Opus 4.6 is designed for longer-running, agentic work — planning complex tasks more carefully and executing them with less back-and-forth from the user."

    • mcintyre1994 2 hours ago ago

      This kind of sounds like both of them stepping into the other’s turf, to simplify a bit.

      I haven’t used Codex but use Claude Code, and the way people (before today) described Codex to me was like how you’re describing Opus 4.6

      So it sounds like they’re converging toward “both these approaches are useful at different times” potentially? And neither want people who prefer one way of working to be locked to the other’s model.

    • rozumbrada 12 minutes ago

      I have read this exact comment, in what I would say are exactly the same words, several times on X, and I would bet money it's LLM-generated by someone who has not even tried both tools. This AI slop, even on a site like this with no direct monetisation from fake engagement, is making me sick...

    • giancarlostoro an hour ago ago

      > With Opus 4.6, the emphasis is the opposite: a more autonomous, agentic, thoughtful system that plans deeply, runs longer, and asks less of the human.

      This feels wrong, I can't comment on Codex, but Claude will prompt you and ask you before changing files, even when I run it in dangerous mode on Zed, I can still review all the diffs and undo them, or you know, tell it what to change. If you're worried about it making too many decisions, you can pre-prompt Claude Code (via .claude/instructions.md) and instruct it to always ask follow up questions regarding architectural decisions.

      Sometimes I go out of my way to tell Claude DO NOT ASK ME FOR FOLLOW UPS JUST DO THE THING.

      • Rperry2174 an hour ago ago

        yeah I'm mostly just talking about how they're framing it: "Claude Opus 4.6 is designed for longer-running, agentic work — planning complex tasks more carefully and executing them with less back-and-forth from the user"

        I guess it's also quite interesting that how they are framing these projects is the opposite of how people currently perceive them, and I guess that may be a conscious choice...

        • giancarlostoro an hour ago

          I get what you mean now. I like that, to be fair. Sometimes I want Claude to tell me some architectural options, so I ask it so I can think about what my options are; sometimes I rethink my problem if I like Claude's conclusion.

    • techbro_1a an hour ago ago

      > With Codex (5.3), the framing is an interactive collaborator: you steer it mid-execution, stay in the loop, course-correct as it works.

      This is true, but I find that Codex thinks more than Opus. That's why 5.2 Codex was more reliable than Opus 4.5

    • cchance 2 hours ago

      Just because you can inject steering doesn't mean they steered away from long running...

      There are hundreds of people who upload Codex 5.2 running for hours unattended and coming back with full commits

    • d--b 2 hours ago ago

      I am definitely using Opus as an interactive collaborator that I steer mid-execution, stay in the loop and course correct as it works.

      I mean Opus asks a lot if it should run things, and each time you can tell it to change. And if that's not enough you can always press esc to interrupt.

  • granzymes 4 hours ago ago

    I think Anthropic rushed out the release before 10am this morning to avoid having to put in comparisons to GPT-5.3-codex!

    The new Opus 4.6 scores 65.4 on Terminal-Bench 2.0, up from 64.7 for GPT-5.2-codex.

    GPT-5.3-codex scores 77.3.

    • the_duke 4 hours ago ago

      I do not trust the AI benchmarks much, they often do not line up with my experience.

      That said ... I do think Codex 5.2 was the best coding model for more complex tasks, albeit quite slow.

      So very much looking forward to trying out 5.3.

      • NitpickLawyer 3 hours ago ago

        Just some anecdata++ here but I found 5.2 to be really good at code review. So I can have something crunched by cheaper models, reviewed async by codex and then re-prompt with the findings from the review. It finds good things, doesn't flag nits (if prompted not to) and the overall flow is worth it for me. Speed loss doesn't impact this flow that much.

        • kilroy123 3 hours ago ago

          Personally, I have Claude do the coding. Then 5.2-high do the reviewing.

          • _zoltan_ 29 minutes ago ago

            I have Opus 4.5 do everything then review it with Gemini 3.

          • seunosewa 2 hours ago ago

            Then I pass the review back to Claude Opus to implement it.

            • VladVladikoff 2 hours ago

              Just curious: is this a manual process, or have you guys automated these steps?

              • ricketycricket an hour ago

                I have a `codex-review` skill with a shell script that uses the Codex CLI with a prompt. It tells Claude to use Codex as a review partner and to push back if it disagrees. They will sometimes go through 3 or 4 back-and-forth iterations before they find consensus. It's not perfect, but it does help because Claude will point out the things Codex found and give it credit.
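The back-and-forth described above might be sketched roughly as follows. This is not the commenter's script: `review_loop`, `stub_review`, and `stub_revise` are hypothetical names, and the stubs stand in for whatever would actually call the coding and reviewing models.

```python
from typing import Callable, List, Tuple


def review_loop(
    draft: str,
    review: Callable[[str], List[str]],
    revise: Callable[[str, List[str]], str],
    max_rounds: int = 4,
) -> Tuple[str, int]:
    """Alternate review and revision until the reviewer has no findings."""
    for round_no in range(1, max_rounds + 1):
        findings = review(draft)
        if not findings:
            # Consensus: the reviewer raised nothing this round.
            return draft, round_no
        draft = revise(draft, findings)
    # Round budget exhausted; return the latest draft as-is.
    return draft, max_rounds


# Hypothetical stand-ins for the two models (assumptions, not real APIs).
def stub_review(code: str) -> List[str]:
    return ["unresolved TODO"] if "TODO" in code else []


def stub_revise(code: str, findings: List[str]) -> str:
    # Addresses one finding per round by resolving the first TODO.
    return code.replace("TODO", "done", 1)


final, rounds = review_loop("TODO a; TODO b", stub_review, stub_revise)
print(final, rounds)
```

Capping the number of rounds keeps two models that never agree from looping forever, which matches the "3 or 4 iterations" described in the comment.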

              • _zoltan_ 29 minutes ago ago

                zen-mcp (now called pal-mcp I think) and then claude code can actually just pass things to gemini (or any other model)

        • StephenHerlihyy 2 hours ago

          I don’t use OpenAI too much, but I follow a similar workflow. Use Opus for design/architecture work. Move it to Sonnet for implementation and build out. Then finally over to Gemini for review, QC and standards check. There is an absolute gain in using different models. Each has its own style and way of solving the problem, just like a human team. It’s kind of awesome and crazy and a bit scary all at once.

          • readyforbrunch 20 minutes ago ago

            How do you orchestrate this workflow? Do you define different skills that all use different models, or something else?

      • aurareturn 3 hours ago ago

        5.2 Codex became my default coding model. It “feels” smarter than Opus 4.5.

        I use 5.2 Codex for the entire task, then ask Opus 4.5 at the end to double check the work. It's nice to have another frontier model's opinion and ask it to spot any potential issues.

        Looking forward to trying 5.3.

        • koakuma-chan 3 hours ago ago

          Opus 4.5 is more creative and better at making UIs

      • fooker 3 hours ago ago

        Yeah, these benchmarks are bogus.

        Every new model overfits to the latest overhyped benchmark.

        Someone should take this to a logical extreme and train a tiny model that scores better on a specific benchmark.

        • bunderbunder an hour ago ago

          All shared machine learning benchmarks are a little bit bogus, for a really “machine learning 101” reason: your test set only yields an unbiased performance metric if you agree to only use it once. But that just isn’t a realistic way to use a shared benchmark. Using them repeatedly is kind of the whole point.

          But even an imperfect yardstick is better than no yardstick at all. You’ve just got to remember to maintain a healthy level of skepticism is all.
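The test-set reuse point can be illustrated with a small simulation. This is only a sketch under simple assumptions: every "model" is a coin flip with a true accuracy of 50%, and the function names and sizes are made up for illustration.

```python
import random


def accuracy(preds, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def best_of_many(num_models=200, test_size=100, fresh_size=2000, seed=0):
    """Pick the best of many coin-flip 'models' on a reused test set."""
    rng = random.Random(seed)
    test_labels = [rng.randint(0, 1) for _ in range(test_size)]
    fresh_labels = [rng.randint(0, 1) for _ in range(fresh_size)]

    best_test, best_fresh = -1.0, 0.0
    for _ in range(num_models):
        # Every "model" guesses at random, so its true accuracy is 50%.
        test_preds = [rng.randint(0, 1) for _ in range(test_size)]
        fresh_preds = [rng.randint(0, 1) for _ in range(fresh_size)]
        test_score = accuracy(test_preds, test_labels)
        if test_score > best_test:
            # Selecting on the shared test set is the step that biases it.
            best_test = test_score
            best_fresh = accuracy(fresh_preds, fresh_labels)
    return best_test, best_fresh


best_test, best_fresh = best_of_many()
print(f"winner on reused test set: {best_test:.3f}")
print(f"same winner on fresh data: {best_fresh:.3f}")
```

The winner looks clearly better than chance on the reused test set, yet falls back to roughly 50% on data that played no part in choosing it.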

          • abustamam 35 minutes ago ago

            Is an imperfect yardstick better than no yardstick? It reminds me of documentation — the only thing worse than no documentation is wrong documentation.

        • mrandish 2 hours ago ago

          > Yeah, these benchmarks are bogus.

          It's not just overfitting to leading benchmarks; there are also too many degrees of freedom in how a model is tested (harness, etc.). Until there's standardized documentation enabling independent replication, it's all just benchmarketing.

          • fooker 2 hours ago ago

            For the current state of AI, the harness is unfortunately part of the secret sauce.

      • nerdsniper 2 hours ago ago

        Opus 4.5 still worked better for most of my work, which is generally "weird stuff". A lot of my programming involves concepts that are a bit brain-melting for LLMs, because multiple "99% of the time, assumption X is correct" are reversed for my project. I think Opus does better at not falling into those traps. Excited to try out 5.3

        • nubg 2 hours ago ago

          what do you do?

      • jahsome 3 hours ago ago

        Another day, another hn thread of "this model changes everything" followed immediately by a reply stating "actually I have the literal opposite experience and find competitor's model is the best" repeated until it's time to start the next day's thread.

        • StephenHerlihyy 2 hours ago ago

          What amazes me the most is the speed at which things are advancing. Go back a year or even a year before that and all these incremental improvements have compounded. Things that used to require real effort to consistently solve, either with RAGs, context/prompt engineering, have become… trivial. I totally agree with your point that each step along the way doesn’t necessarily change that much. But in the aggregate it’s sort of insane how fast everything is moving.

        • SatvikBeri 2 hours ago ago

          I use Claude Code every day, and I'm not certain I could tell the difference between Opus 4.5 and Opus 4.0 if you gave me a blind test

        • clhodapp 2 hours ago ago

          And of course the benchmarks are from the school of "It's better to have a bad metric than no metric", so there really isn't any way to falsify anyone's opinions...

        • malshe 3 hours ago ago

          This pretty accurately summarizes all the long discussions about AI models on HN.

        • cactusplant7374 2 hours ago ago

          Hourly occurrence on /r/codex. Model astrology is about the vibes.

    • __jl__ 4 hours ago ago

      Impressive jump for GPT-5.3-codex and crazy to see two top coding models come out on the same day...

      • granzymes 4 hours ago ago

        Insane! I think this has to be the shortest-lived SOTA for any model so far. Competition is amazing.

    • leumon 3 hours ago ago

      they tested it at xhigh reasoning though, which is probably double the cost of Anthropic's model.

      Cost to Run Artificial Analysis Intelligence Index:

      GPT-5.2 Codex (xhigh): $3244

      Claude Opus 4.5-reasoning: $1485

      (and probably similar values for the newer models?)
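As a quick check on the "probably double" estimate, the ratio of the two figures quoted above can be computed directly. The variable names are made up for illustration; the numbers are the ones in the comment.

```python
# Figures as quoted in the comment above (USD to run the index).
cost_gpt_52_codex_xhigh = 3244
cost_opus_45_reasoning = 1485

ratio = cost_gpt_52_codex_xhigh / cost_opus_45_reasoning
print(f"GPT-5.2 Codex (xhigh) cost about {ratio:.2f}x as much")
```

With these numbers the ratio comes out at about 2.18, slightly more than double.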

      • redox99 2 hours ago ago

        With $20 gpt plan you can use xhigh no problem. With $20 Claude plan you reach the 5h limit with a single feature.

        • mattkevan 2 hours ago ago

          Ha, Claude Code on a pro plan often can't complete a single message before hitting the 5h limit. Not hit it once so far on Codex.

          • naths88 an hour ago ago

            This, so frustrating. But CC is so much faster too.

      • Computer0 2 hours ago ago

        A provider's API costs seemingly do not reflect each respective SOTA provider's subscription usage allowances.

    • wilg 2 hours ago ago

      In my personal experience the GPT models have always been significantly better than the Claude models for agentic coding, I’m baffled why people think Claude has the edge on programming.

      • dudeinhawaii 2 hours ago ago

        I think for many/most programmers = 'speed + output' and webdev == "great coding".

        Not throwing shade anyone's way. I actually do prefer Claude for webdev (even if it does cringe things like generate custom CSS on every page) -- because I hate webdev and Claude designs are always better looking.

        But the meat of my code is backend and "hard" and for that Codex is always better, not even a competition. In that domain, I want accuracy and not speed.

        Solution, use both as needed!

        • falloutx an hour ago ago

          > I actually do prefer Claude for webdev

          Ah and let me guess all your frontends look like cookie cutter versions of this: https://openclaw.dog/

        • whynotminot 2 hours ago ago

          > Solution, use both as needed!

          This is the way. People are unfortunately starting to divide themselves into camps on this — it’s human nature we’re tribal - but we should try to avoid turning this into a Yankees Redsox.

          Both companies are producing incredible models and I’m glad they have strengths because if you use them both where appropriate it means you have more coverage for important work.

      • soulofmischief an hour ago ago

        GPT 5.2 codex plans well but fucks off a lot, goes in circles (more than opus 4.5) and really just lacks the breadth of integrated knowledge that makes opus feel so powerful.

        Opus is the first model I can trust to just do things, and do them right, at least small things. For larger/more complex things I have to keep either model on extremely short leashes. But the difference is enough that I canceled my GPT Pro sub so I could switch to Claude. Maybe 5.3 will change things, but I also cannot continue to ethically support Sam Altman's business.

    • jronak 2 hours ago ago

      Did you look at the ARC AGI 2? Codex might be overfit for terminal bench

      • tedsanders 2 hours ago ago

        ARC AGI 2 has a training set that model providers can choose to train on, so really wouldn't recommend using it as a general measure of coding ability.

        • mrandish an hour ago ago

          A key aspect of ARC AGI is to remain highly resistant to training on test problems which is essential for ARC AGI's purpose of evaluating fluid intelligence and adaptability in solving novel problems. They do release public test sets but hold back private sets. The whole idea is being a test where training on public test sets doesn't materially help.

          The only valid ARC AGI results are from tests done by the ARC AGI non-profit using an unreleased private set. I believe lab-conducted ARC AGI tests must be on public sets and taken on a 'scout's honor' basis that the lab self-administered the test correctly, didn't cheat or accidentally have public ARC AGI test data slip into their training data. IIRC, some time ago there was an issue when OpenAI published ARC AGI 1 test results on a new model's release which the ARC AGI non-profit was unable to replicate on a private set some weeks later (to be fair, I don't know if these issues were resolved).

          I have no expertise to verify how training-resistant ARC AGI is in practice but I've read a couple of their papers and was impressed by how deeply they're thinking through these challenges. They're clearly trying to be a unique test which evaluates aspects of 'human-like' intelligence other tests don't. It's also not a specific coding test and I don't know how directly ARC AGI scores map to coding ability.

        • janalsncm an hour ago ago

          More fundamentally, ARC is for abstract reasoning. Moving blocks around on a grid. While in theory there is some overlap with SWE tasks, what I really care about is competence on the specific task I will ask it to do. That requires a lot of domain knowledge.

          As an analogy, Terence Tao may be one of the smartest people alive now, but IQ alone isn’t enough to do a job with no domain-specific training.

    • nurettin 3 hours ago ago

      Opus was quite useless today. Created lots of globals, statics, forward declarations, hidden implementations in cpp files with no testable interface, erasing types, casting void pointers, I had to fix quite a lot and decouple the entangled mess.

      Hopefully performance will pick up after the rollout.

      • nickstinemates 2 hours ago ago

        Did you give it any architecture guidance? An architecture skill that it can load to make sure it lays out things according to your taste?

  • itay-maman 3 hours ago ago

    Something that caught my eye from the announcement:

    > GPT‑5.3‑Codex is our first model that was instrumental in creating itself. The Codex team used early versions to debug its own training

    I'm happy to see the Codex team moving to this kind of dogfooding. I think this was critical for Claude Code to achieve its momentum.

    • codethief an hour ago ago

      Sounds like the researchers behind https://ai-2027.com/ haven't been too far off so far.

    • aurareturn 3 hours ago

      More importantly, these are the early steps of a model improving itself.

      Do we still think we'll have soft take off?

      • mrandish 2 hours ago ago

        > Do we still think we'll have soft take off?

        There's still no evidence we'll have any take off. At least in the "Foom!" sense of LLMs independently improving themselves iteratively to substantial new levels being reliably sustained over many generations.

        To be clear, I think LLMs are valuable and will continue to significantly improve. But self-sustaining runaway positive feedback loops delivering exponential improvements resulting in leaps of tangible, real-world utility is a substantially different hypothesis. All the impressive and rapid achievements in LLMs to date can still be true while major elements required for Foom-ish exponential take-off are still missing.

        • rahulyc 33 minutes ago ago

          Yes, but also you'll never have any early evidence of the Foom until the Foom itself happens.

      • quinncom 3 hours ago ago

        Exponential growth may look like a very slow increase at first, but it's still exponential growth.

        • janalsncm an hour ago ago

          Sigmoids may look like exponential growth at first, until they saturate. Early growth alone cannot distinguish between them.
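A small numeric sketch of this point, with made-up parameters (start value 1, growth rate 1, and a logistic cap of 1000, all assumptions chosen for illustration):

```python
import math


def exponential(t, x0=1.0, r=1.0):
    """Unbounded exponential growth from x0 at rate r."""
    return x0 * math.exp(r * t)


def logistic(t, x0=1.0, r=1.0, cap=1000.0):
    """Logistic curve starting at x0, growth rate r, saturating at cap."""
    return cap / (1.0 + (cap / x0 - 1.0) * math.exp(-r * t))


for t in (0, 2, 4, 10):
    e, s = exponential(t), logistic(t)
    print(f"t={t:>2}  exponential={e:10.1f}  logistic={s:7.1f}  ratio={s / e:.3f}")
```

At t=2 the two curves agree to within 1%, while by t=10 the exponential has passed 22,000 and the logistic is still below its cap of 1000. Early data points alone would not tell them apart.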

        • gf000 28 minutes ago ago

          If it's exponential growth. It may just as well be some slow growth and continue to be so.

      • aaaalone 2 hours ago ago

        I'm only saying no to keep optimistic tbh

        It feels crazy to just say we might see a fundamental shift in 5 years.

        But the current addition to compute and research etc. def goes in this direction I think.

      • 8note 2 hours ago ago

        making the specifications is still hard, and checking how well results match against specifications is still hard.

        i dont think the model will figure that out on its own, because the human in the loop is the verification method for saying if its doing better or not, and more importantly, defining better

      • thrance 2 hours ago

        I think the limiting factor is capital, not code. And I doubt GPTX is any more competent at raising funds than the other, fleshy, snake oilers...

      • reducesuffering 3 hours ago

        This has already been going on for years. It's just that they were using GPT 4.5 to work on GPT 5. All this announcement means is that they're confident enough in early GPT 5.3 model output to further refine GPT 5.3 based on initial 5.3. But yes, takeoff will still happen because this recursive self-improvement works; it's just that we're already past the inception point.

  • xiphias2 3 hours ago

    "GPT‑5.3-Codex is the first model we classify as High capability for cybersecurity-related tasks under our Preparedness Framework, and the first we’ve directly trained to identify software vulnerabilities. While we don’t have definitive evidence it can automate cyber attacks end-to-end, we’re taking a precautionary approach and deploying our most comprehensive cybersecurity safety stack to date. Our mitigations include safety training, automated monitoring, trusted access for advanced capabilities, and enforcement pipelines including threat intelligence."

    While I love Codex and believe it's an amazing tool, I believe their preparedness framework is out of date. As it becomes more and more capable of vibe coding complex apps, it's becoming clear that the main security issues will come from having more and more security-critical software vibe coded.

    It's great to look at systems written by humans and how well Codex can be used against software written by humans, but it's getting more important to measure the opposite: how well humans (or their own software) are able to infiltrate complex systems written mostly by Codex, and get better on that scale.

    In simpler terms: Codex should write secure software by default.

    • mrkeen 3 hours ago ago

      Is "high-capability" a stronger or weaker claim than "team of phd-level experts"?

      https://www.nbcnews.com/tech/tech-news/openai-releases-chatg...

    • trcf23 3 hours ago

      That’s just classic OpenAI trying to make us believe they’re closing in on AGI… Like all the « so-called » research from them and Anthropic about safety alignment, and about their tech being so incredibly powerful that guardrails should be put on it.

    • ActionHank 2 hours ago ago

      I heard the other day that every time someone claps another vibe coded project embeds the api keys in the webpage.

      I wonder if this will continue to be the case.

    • da_grift_shift 2 hours ago ago

      >Our mitigations include safety training, automated monitoring, trusted access for advanced capabilities, and enforcement pipelines including threat intelligence.

      "We added some more ACLs and updated our regex"

  • minimaxir 4 hours ago ago

    I remember when AI labs coordinated so they didn't push major announcements on the same day to avoid cannibalizing each other. Now we have AI labs pushing major announcements within 30 minutes.

    • observationist 3 hours ago

      The labs have fully embraced the cutthroat competition; the arms race has fully shed the civilized facade of beneficent mutual cooperation.

      Dirty tricks and underhanded tactics will happen - I think Demis isn't savvy in this domain, but might end up stomping out the competition on pure performance.

      Elon, Sam, and Dario know how to fight ugly and do the nasty political boardroom crap. 26 is gonna be a very dramatic year, lots of cinematic potential for the eventual AI biopics.

      • manquer 3 hours ago ago

        >civilized facade of mutual cooperation

        >Dirty tricks and underhanded tactics

        As long as the tactics are legal (i.e. not corporate espionage, bribes etc.), no-holds-barred, full free market competition is the best thing for the market and the consumers.

        • ajam1507 an hour ago ago

          > As long as the tactics are legal (i.e. not corporate espionage, bribes etc.), no-holds-barred, full free market competition is the best thing for the market and the consumers.

          The implicit assumption here is that we have constructed our laws so skillfully that the only path to win a free market competition is by producing a better product, or that all efforts will be spent doing so. This is never the case. It should be self-evident from this that there is a more productive way for companies to compete and our laws are not sufficient to create the conditions.

        • thethimble 3 hours ago ago

          The consumers are getting huge wins.

          Model costs continue to collapse while capability improves.

          Competition is fantastic.

          • mrandish an hour ago ago

            > The consumers are getting huge wins.

            However, the investors currently subsidizing those wins to below cost may be getting huge losses.

        • dwaltrip 37 minutes ago ago

          Sure, it can be beneficial. But don't forget that externalities are a thing.

    • zozbot234 4 hours ago ago

      They're also coordinating around Chinese New Year to compete with new releases of the major open/local models.

    • tedsanders 3 hours ago ago

      This goes way back. When OpenAI launched GPT-4 in 2023, both Anthropic and Google lined up counter launches (Claude and Magic Wand) right before OpenAI's standard 10am launch time.

    • crorella 4 hours ago ago

      The thrill of competition

    • manquer 3 hours ago ago

      Wouldn't that be illegal ? i.e. cartel to collude like that ?

    • IhateAI 4 hours ago

      A sign of the inevitable implosion!

    • cedws 3 hours ago ago

      I wish they’d just stop pretending to care about safety, other than a few researchers at the top they care about safety only as long as they aren’t losing ground to the competition. Game theory guarantees the AI labs will do what it takes to ensure survival. Only regulation can enforce the limits, self policing won’t work when money is involved.

      • vovavili 2 hours ago ago

        The last thing I would want is for excessively neurotic bureaucrats to interfere with all the mind-blowing progress we've had in the last couple of years with LLM technology.

      • thethimble 3 hours ago ago

        As long as China continues to blitz forward, regulation is a direct path to losing.

        • cedws 2 hours ago ago

          Define "losing."

          Europe is prematurely regarded as having lost the AI race. And yet a large portion of Europe live higher quality lives compared to their American counterparts, live longer, and don't have to worry about an elected orange unleashing brutality on them.

          • thethimble 2 hours ago ago

            If the world is built on AI infrastructure (models, compute, etc.) that is controlled by the CCP then the west has effectively lost.

            This may lead to better life outcomes, but if the west doesn't control the whole stack then they have lost their sovereignty.

            This is already playing out today as Europe is dependent on the US for critical tech infrastructure (cloud, mail, messaging, social media, AI, etc). There are no homegrown European alternatives because Europe has failed to create an economic environment to assure its technical sovereignty.

          • fakedang an hour ago

            Europe has already lost the tech race - the cloud systems that their entire welfare states rely upon are all hosted on servers run by American private companies, which can turn them off with the flick of a switch if and when needed.

            When the welfare state, enabled by technology, falls apart, it won't take long for European society to fall apart. Except France maybe.

        • pixl97 2 hours ago ago

          You mean all paths are direct paths to losing.

  • SunshineTheCat an hour ago ago

    I've always been fascinated to see significantly more people talking about using Claude than I see people talking about Codex.

    I know that's anecdotal, but it just seems Claude is often the default.

    I'm sure there are key differences in how they handle coding tasks and maybe Claude is even a little better in some areas.

    However, the note I see the most from Claude users is running out of usage.

    Coding differences aside, this would be the biggest factor for me using one over the other. After several months on Codex's $20/mo. plan (and some pretty significant usage days), I have only come close to my usage limit once (never fully exceeded it).

    That (at least to me) seems to be a much bigger deal than coding nuances.

    • mrandish 4 minutes ago ago

      > the note I see the most from Claude users is running out of usage.

      I suspect that tells us less about model capability/efficiency and more about each company's current need to paint a specific picture for investors re: revenue, operating costs, capital requirements, cash on hand, growth rate, retention, margins etc. And those needs can change at any moment.

    • timpera 6 minutes ago ago

      In my experience, OpenAI gives you unreasonable amounts of compute for €20/month. I am subscribed to both and Claude's limits are so tiny compared to ChatGPT's that it often feels like a rip-off.

      Claude also doesn't let you use a worse model after you reach your usage limits, which is a bit hard to swallow when you're paying for the service.

    • superfrank an hour ago ago

      I only switched to using the terminal based agents in the last week. Prior to this I was pretty much only using it through Cursor and GH Copilot. The Anthropic models when used through GH Copilot were far superior to the codex ones and I didn't really get the hype of Codex. Using them through the CLI though, Codex is much better, IMO.

      My guess is that it's potentially that, and that momentum from developers who started using CC when it was far superior to Codex has allowed it to become so much more popular. Potentially, it might be that, as it's more autonomous, it's better for true vibe-coding and more popular with the Twitter/LinkedIn wantrepreneur crew, which means it gets a lot of publicity, which speeds up adoption.

    • AstroBen an hour ago ago

      I'm with you. Codex's plans seem to be a much more generous offering than Claude's

      I just.. can't tell a difference in quality between them.. so I go for the cheapest

    • fHr an hour ago ago

      Codex is great. I hit the usage limit once, during a full 5-hour, absolutely degen multi-agent session; alongside my normal workflow I've never hit it. Now there's even 2x usage, and with the plan mode switch back and forth it's absolutely great.

  • textlapse 15 minutes ago ago

    I would love to see a nutritional facts label on how many prompts / % of code / ratio of human involvement needed to use the models to develop their latest models for the various parts of their systems.

  • tosh 3 hours ago ago

    Terminal Bench 2.0

      | Name                | Score (%) |
      |---------------------|-----------|
      | OpenAI Codex 5.3    | 77.3      |
      | Anthropic Opus 4.6  | 65.4      |

    • greenfish6 3 hours ago ago

      yea but i feel like we are over the hill on benchmaxxing, many times a model has beaten anthropic on a specific bench, but the 'feel' is that it is still not as good at coding

      • falloutx 2 hours ago ago

        When Anthropic beats benchmarks it's somehow earned; when OpenAI does, it's gaming them and somehow doesn't feel as good at coding.

      • AstroBen 3 hours ago ago

        'feel' is no more accurate

        not saying there's a better way but both suck

        • crorella 3 hours ago ago

          The variety of tasks they can do and will be asked to do is too wide and dissimilar, so it will be very hard to have a transversal measurement. At most we will have area-specific consensus that model X or Y is better. It is like saying one person is the best coder at everything; that does not exist.

          • pixl97 2 hours ago ago

            Yea, we're going to need benchmarks that incorporate series of steps of development for a particular language and how good each model is at it.

            Like can the model take your plan and ask the right questions where there appear to be holes.

            How wide of architecture and system design around your language does it understand.

            How does it choose to use algorithms available in the language or common libraries.

            How often does it hallucinate features/libraries that aren't there.

            How does it perform as context gets larger.

            And that's for one particular language.

        • thethimble 3 hours ago ago

          Speak for yourself. I've been insanely productive with Codex 5.2.

          With the right scaffolding these models are able to perform serious work at high quality levels.

          • helloplanets 2 hours ago ago

            He wasn't saying that both of the models suck, but that the heuristics for measuring model capability suck

          • AstroBen 3 hours ago ago

            ..huh?

        • tavavex 2 hours ago ago

          The 'feel' of a single person is pretty meaningless, but when many users form a consensus over time after a model is released, it feels a lot more informative than a simple benchmark because it can shift over time as people individually discover the strong and weak points of what they're using and get better at it.

        • forrestthewoods 2 hours ago ago

          At the end of the day “feel” is what people rely on to pick which tool they use.

          Is “feel” unscientific and broken? Sure, maybe, why not.

          But at the end of the day I’m going to choose what I see with my own two eyes over a number in a table.

          Benchmarks are a sometimes-useful tool. But we are in prime Goodhart's Law territory.

          • AstroBen 2 hours ago ago

            yeah, to be honest it probably doesn't matter too much. I think the major models are very close in capabilities

            • forrestthewoods an hour ago ago

              I don’t think this is even remotely true in practice.

              I honestly have no idea what benchmarks are benchmarking. I don’t write JavaScript or do anything remotely webdev related.

              The idea that all models have very close performance across all domains is a moderately insane take.

              At any given moment the best model for my actual projects and my actual work varies.

              Quite honestly Opus 4.5 is proof that benchmarks are dumb. When Opus 4.5 released no one was particularly excited. It was better, with some slightly larger numbers, but whatever. It took about a month before everyone realized “holy shit this is a step function improvement in usefulness”. Benchmarks being +15% better on SWE bench didn’t mean a damn thing.

      • karmasimida 2 hours ago ago

        Your feeling is not my feeling; Codex is unambiguously the smarter model for me

    • xyst an hour ago ago

      Benchmarks are useless compared to real world performance.

      Real world performance for these models is a disappointment.

  • bgirard 2 hours ago ago

    > Using the develop web game skill and preselected, generic follow-up prompts like "fix the bug" or "improve the game", GPT‑5.3-Codex iterated on the games autonomously over millions of tokens.

    I wish they would share the full conversation, token counts and more. I'd like to have a better sense of how they normalize these comparisons across versions. Is this a 3-prompt 10m token game? A 30-prompt 100m token game? Are both models using similar prompts/token counts?

    I vibe coded a small factorio web clone [1] that got pretty far using the models from last summer. I'd love to compare against this.

    [1] https://factory-gpt.vercel.app/

    • veb 2 hours ago ago

      I just wanted to say that's a pretty cool demo! I hadn't realised people were using it for things like this.

      • bgirard 2 hours ago ago

        Thank you. There's a demo save to get the full feel of it quickly. There's also a 2D-ASCII and 3D render you can hotswap between. The 3D models are generated with Meshy. The entire game is 'AI slop'. I intentionally did no code reviews to see where that would get me. Some prompts were very specific but other prompts were just 'add a research of your choice'.

        This was built using old versions of Codex, Gemini and Claude. I'll probably work on it more soon to try the latest models.

  • nananana9 2 hours ago ago

    I've been listening to the insane 100x productivity gains you all are getting with AI and "this new crazy model is a real game changer" for a few years now, I think it's about time I asked:

    Can you guys point me to a single useful, majority LLM-written, preferably reliable, program that solves a non-trivial problem that hasn't been solved before a bunch of times in publicly available code?

    • pkoiralap 2 hours ago ago

      In the 1960s, when electronic calculators were first introduced, there was a widespread belief that accounting as a career was finished. Instead, the opposite became true. Accounting as a profession grew, becoming far more analytical/strategic than it had been previously.

      You are correct that these models primarily address problems that have already been solved. However, that has always been the case for the majority of technical challenges. Before LLMs, we would often spend days searching Stack Overflow to find and adapt the right solution.

      Another way to look at this is through the lens of problem decomposition as well. If a complex problem is a collection of sub-problems, receiving immediate solutions for those components accelerates the path to the final result.

      For example, I was recently struggling with a UI feature where I wanted cards to follow a fan-like arc. I couldn't quite get the implementation right until I gave it to Gemini. It didn't solve the entire problem for me, but it suggested an approach involving polar coordinates and sine/cosine values. I was able to take that foundational logic and turn it into the feature I wanted.
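
      A minimal sketch of that polar-coordinate idea, in Python for brevity. The function name `fan_layout`, the radius and the spread angle are assumptions made up for illustration, not the actual code from that project: each card sits on a circle around a pivot point and is rotated by the same angle that places it, so it follows the arc.

```python
import math


def fan_layout(n_cards, radius=300.0, spread_deg=60.0):
    """Return one (x, y, rotation_deg) tuple per card, fanned along an arc.

    The pivot is at the origin and angle 0 points straight up.
    All names and defaults here are illustrative assumptions.
    """
    if n_cards <= 0:
        return []
    if n_cards == 1:
        return [(0.0, radius, 0.0)]

    step = spread_deg / (n_cards - 1)  # angular gap between neighbours
    cards = []
    for i in range(n_cards):
        angle_deg = -spread_deg / 2 + i * step  # leftmost card first
        theta = math.radians(angle_deg)
        x = radius * math.sin(theta)  # horizontal offset from the pivot
        y = radius * math.cos(theta)  # distance "up" from the pivot
        cards.append((x, y, angle_deg))
    return cards


for x, y, rot in fan_layout(5):
    print(f"x={x:7.1f}  y={y:6.1f}  rotate={rot:5.1f} deg")
```

      In a real UI the `x`/`y` pair would feed a translate and `rotation_deg` a rotate transform; a larger radius flattens the arc.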

      Was it a 100x productivity gain? No. But it was easily a 2x gain, because it replaced hours of searching and waiting for a mental breakthrough with immediate direction.

      There was also a relevant thread on Hacker News recently regarding "vibe coding":

      https://news.ycombinator.com/item?id=45205232

      The developer created a unique game using scroll behavior as the primary input. While the technical aspects of scroll events are certainly "solved" problems, the creative application was novel.

      • suddenlybananas an hour ago ago

        The story you're describing doesn't seem much better than one could get from googling around and going on stackoverflow

        • strokirk an hour ago ago

          It doesn’t have to be, really. Even if it could replace 30% of documentation and SO scrounging, that’s pretty valuable. Especially since you can offload that and go take a coffee.

    • rohit89 an hour ago ago

      > that hasn't been solved before a bunch of times in publicly available code?

      And this matters because? Most devs are not working on novel never before seen problems.

      • kevstev 42 minutes ago ago

        Heh, I agree. There is a vast ocean of dev work that is just "upgrade criticalLib to v2.0" or adding support for a new field from the FE through to the BE.

        I can name a few times where I worked on something that you could consider groundbreaking (for some values of groundbreaking), and even that was usually more the combination of small pieces of work or existing ideas.

        As maybe a more poignant example- I used to do a lot of on-campus recruiting when I worked in HFT, and I think I disappointed a lot of people when I told them my day to day was pretty mundane and consisted of banging out Jiras, usually to support new exchanges, and/or securities we hadn't traded previously. 3% excitement, 97% unit tests and covering corner cases.

    • Def_Os an hour ago ago

      Yeah, I would LOVE to see attempts at significant video games that are then open-sourced for communities to work on. E.g. OpenGTA or OpenFIFA/OpenNHL.

    • llmslave 7 minutes ago ago

      baffled that people are still suspicious of ai coding models

    • revahage an hour ago ago

      Well, it took opus 4.5 five messages to solve a trivial git problem for me. It hallucinated nonexistent flags three times. Hallucinating nonexistent flags is certainly a novel solution to my git ineptness.

      Not to be outdone, chatgpt 5.2 thinking high only needed about 8 iterations to get a mostly-working ffmpeg conversion script for bash. It took another 5 messages to translate it to run in windows, on powershell (models escaping newlines on windows properly will be pretty much AGI, as far as I’m concerned).

    • xandrius 2 hours ago ago

      Why even come to this site if you're so anti-innovation?

      Today with LLMs you can literally spend 5 minutes defining what you want to get, press send, go grab a coffee and come back to a working POC of something, in literally any programming language.

      This is literally stuff of wonders and magic that redefines how we interface with computers and code. And the only thing you can think of is to ask if it can do something completely novel (which is so hard to even quantify for humans that we don't have software patents mainly for that reason).

      And the same model can also answer you if you ask it about maths, making you an itinerary or a recipe for lasagnas. C'mon now.

      • legulere an hour ago ago

        I don't think that the user you are responding to is anti-innovation, but rather points out that the usefulness of AI is oversold.

        I'm using Copilot for Visual Studio at work. It is useful for me to speed some typing up using the auto-complete. On the other hand in agentic mode it fails to follow simple basic orders, and needs hand-holding to run. This might not be the most bleeding-edge setup, but the discrepancy between how it's sold and how much it actually helps for me is very real.

      • svantana an hour ago ago

        There are different kinds of innovation.

        I want AI that cures cancer and solves climate change. Instead we got AI that lets you plagiarize GPL code, does your homework for you, and roleplay your antisocial horny waifu fantasies.

    • beernet 2 hours ago ago

      Can you point me to a human written program an LLM cannot write? And no, just answering with a massively large codebase does not count because this issue is temporary.

      Some people just hate progress.

      • HAL3000 26 minutes ago ago

        > Can you point me to a human written program an LLM cannot write?

        Sure:

        "The resulting compiler has nearly reached the limits of Opus’s abilities. I tried (hard!) to fix several of the above limitations but wasn’t fully successful. New features and bugfixes frequently broke existing functionality.

        As one particularly challenging example, Opus was unable to implement a 16-bit x86 code generator needed to boot into 16-bit real mode. While the compiler can output correct 16-bit x86 via the 66/67 opcode prefixes, the resulting compiled output is over 60kb, far exceeding the 32k code limit enforced by Linux. Instead, Claude simply cheats here and calls out to GCC for this phase (This is only the case for x86. For ARM or RISC-V, Claude’s compiler can compile completely by itself.)"[1]

        1. https://www.anthropic.com/engineering/building-c-compiler

      • svantana an hour ago ago

        Pretty much any software that people pay for? If LLMs could clone an app, why would anyone still pay good money for the original?

      • falloutx 2 hours ago ago

        Even a normal website like landonorris.com. Try copying all those effects with AI.

        Another example: Red Dead Redemption 2

        Another one: Roller coaster tycoon

        Another one: ShaderToy

        • avaer 3 minutes ago ago

          I wish I could agree with you, but as a game dev, shader author, and occasional asm hacker, I still think AIs have demonstrated being perfectly capable of copying "those effects". It's been trained on them, of course.

          You're not gonna one-shot RD2, but neither will a human. You can one-shot particles and shader passes though.

        • satvikpendem an hour ago ago

          Why do you believe an LLM can't write these, just because they're 3D? If the assets are given (just as with a human game programmer, who has artists provide them the assets), then an LLM can write the code just the same.

          • falloutx an hour ago ago

            What? People can easily get assets, that's not even a problem in 2026. Roller coaster tycoon's assets were done by the programmer himself. If it's so easy, why haven't we seen actually complex pieces of software done in a couple of weeks by LLM users?

            Also try building any complex effects by prompting LLMs, you won't get very far; this is why all of the LLM coded websites look stupidly bland.

            • satvikpendem 15 minutes ago ago

              Not sure what you're confused about, I never said assets were hard to get, I just said that the LLM needs to be provided a folder of the assets for it to make use of them, it's not going to create them from scratch (at least not without great difficulty, because LLMs are capable of using and coding Three.js for example). I don't know the answer to your first question because I don't hang around in the 3D or game dev fields, I'm sure there are examples of vibe coded games however.

              As to your second question, it is about prompting them correctly, for example [0]. Now I don't know about you but some of those sites especially after using the frontend skill look pretty good to me. If those look bland to you then I'm not really sure what you're expecting, keeping in mind that the example you showed with the graphics are not regular sites but more design oriented, and even still nothing stops LLMs from producing such sites.

              [0] https://youtu.be/f2FnYRP5kC4

      • suddenlybananas an hour ago ago

        And some people clearly hate humans.

    • eviks 2 hours ago ago

      Great question, here is the link from the future:

  • trilogic 4 hours ago ago

    When two multi-billion giants advertise on the same day, it is not competition but rather a sign of struggle and survival. With all the power of the "best artificial intelligence" at your disposal, plus a lot of capital and all the brilliant minds, THIS IS WHAT YOU COULD COME UP WITH?

    Interesting

    • sdf2erf 3 hours ago ago

      Yeah they are both fighting for survival. No surprise really.

      Need to keep the hype going if they are both IPO'ing later this year.

      • thethimble 3 hours ago ago

        The AI market is an infinite sum market.

        Consider the fact that 7 year old TPUs are still sitting at near 100% utilization today.

      • superze 3 hours ago ago

        How many IPOs can a company really do?

        • re-thc 3 hours ago ago

          As many as they want. They can "spin off" and then "merge" again.

    • rishabhaiover 3 hours ago ago

      What happened to you?

      • raincole 3 hours ago ago

        AI fried brains, unfortunately.

        • wasmainiac 3 hours ago ago

          I mean, he has a point it’s just not very eloquently written.

          • trilogic 3 hours ago ago

            I empathize with the situation, no elegance from them, no eloquence from me :)

    • lossolo 3 hours ago ago

      What's funny is that most of this "progress" is new datasets + post-training shaping the model's behavior (instruction + preference tuning). There is no moat besides that.

      • Davidzheng 3 hours ago ago

        "post-training shaping the models behavior": it seems from your wording that you find it not that dramatic. I rather find the fact that RL on novel environments provides steady improvements over the base model an incredibly bullish signal for future AI improvements. I also believe that the capability increases are transferring to other domains (or at least cover enough domains) that they represent a real rise in intelligence in the human sense (when measured in capabilities - not necessarily innate learning ability)

        • CuriouslyC 2 hours ago ago

          What evidence do you base your opinions on capability transfer on?

      • riku_iki 4 minutes ago ago

        > is new datasets + post-training shaping the model's behavior (instruction + preference tuning). There is no moat besides that.

        sure, but acquiring/generating/creating/curating so much high quality data is still a significant moat.

      • WarmWash 3 hours ago ago

        >There is no moat besides that.

        Compute.

        Google didn't announce $185 billion in capex to do cataloguing and flash cards.

        • causalmodels 3 hours ago ago

          Google didn't buy 30% of Anthropic to starve them of compute

          • WarmWash 3 hours ago ago

            Probably why it's selling them TPUs.

  • morleytj 3 hours ago ago

    The behind the scenes on deciding when to release these models has got to be pretty insanely stressful if they're coming out within 30 minutes-ish of each other.

    • meisel 3 hours ago ago

      I wonder if their "5.3" was continuously being updated, with regenerated benchmarks with each improvement, and they just stayed ready to release it when claude released

      • morleytj 2 hours ago ago

        This seems plausible. It would be shocking if these companies didn't have an automated testing suite which is recomputing these benchmarks on a regular basis, and uploading to a dashboard for supervision.

        Given that they already pre-approved various language and marketing materials beforehand there's no real reason they couldn't just leave it lined up with a function call to go live once the key players make the call.

    • Havoc 3 hours ago ago

      It’s also functionally not likely without some sort of insider knowledge or coordination

      • morleytj 3 hours ago ago

        Could be, could also be situations where things are lined up to launch in the near future and then a mad dash happens upon receiving outside news of another launch happening.

        I suppose coincidences happen too but that just seems too unlikely to believe honestly. Some sort of knowledge leakage does seem like the most likely reason.

  • gallerdude 2 hours ago ago

    Both Opus 4.6 and GPT-5.3 one shot a Gameboy emulator for me. Guess I need a better benchmark.

    • paxys an hour ago ago

      As coding agents get "good enough" the next differentiator will be which one can complete a task in fewer tokens.

      • tgtweak an hour ago ago

        Or quicker, or more comprehensively for the same price.

      • nlh an hour ago ago

        Or the same number of tokens in less time. Kinda feels like the CPU / modem wars of the 90s all over again - I remember those differences you felt going from a 386 -> 486 or from a 2400 -> 9600 baud modem.

        We're in the 2400 baud era for coding agents and I for one look forward to the 56k era around the corner ;)

  • vatsachak 31 minutes ago ago

    AI designed websites are so easy to spot that I need to actively design my UI so that it doesn't look AI

  • RivieraKid 42 minutes ago ago

    Do software engineers here feel threatened by this? I certainly am. I'm surprised that this topic is almost entirely missing in these threads.

    • llmslave 6 minutes ago ago

      there's a lot of denial, and people that haven't taken a serious look at the ai models

    • worldsavior 39 minutes ago ago

      No. AI does not work well enough, you still need a person to look at it and CODE. It probably never will, until AGI, which in my opinion will probably never come either.

    • vatsachak 35 minutes ago ago

      AI is mostly garbage at creating useful abstractions. I'd feel threatened if I was a competitive programmer or IMO kid

    • OsrsNeedsf2P 36 minutes ago ago

      I would feel threatened if I didn't invest in learning how to best use AI

    • ReptileMan 40 minutes ago ago

      Jevons paradox hints that the situation is not as bleak as it sounds.

  • ffitch 3 hours ago ago

    > our team was blown away by how much Codex was able to accelerate its own development

    they forgot to add “Can’t wait to see what you do with it”

  • karmasimida 2 hours ago ago

    For those who care:

    GPT-5.3-Codex dominates terminal coding with a roughly 12% lead (Terminal-Bench 2.0), while Opus 4.6 retains the edge in general computer use by 8% (OSWorld).

    Does anyone know the difference between OSWorld and OSWorld Verified?
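
    On the Terminal-Bench figure: a quick check, using the 77.3 and 65.4 scores quoted elsewhere in this thread, suggests the "12%" is a gap in percentage points rather than a relative lead. The variable names are just for illustration.

```python
# Terminal-Bench 2.0 scores as quoted in this thread (percent).
codex_score = 77.3
opus_score = 65.4

# Absolute gap, in percentage points.
point_gap = codex_score - opus_score

# Relative lead: how much higher the Codex score is than the Opus score.
relative_lead = point_gap / opus_score * 100

print(f"gap: {point_gap:.1f} points")         # gap: 11.9 points
print(f"relative lead: {relative_lead:.1f}%")  # relative lead: 18.2%
```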

    • nopinsight an hour ago ago

      From Claude 4.6 Thinking:

      OSWorld is the full 369-task benchmark. OSWorld Verified is a ~200-task subset where humans have confirmed the eval scripts reliably score success/failure — the full set has some noisy grading where correct actions can still get marked wrong.

      Scores on Verified tend to run higher, so they're not directly comparable.

  • kingstnap 4 hours ago ago

    > GPT‑5.3-Codex was co-designed for, trained with, and served on NVIDIA GB200 NVL72 systems. We are grateful to NVIDIA for their partnership.

    This is hilarious lol

    • uh_uh 4 hours ago ago

      How so?

      • Philpax 3 hours ago ago
      • kingstnap 3 hours ago ago

        It's kind of a suck up that more or less confirms the beef stories that were floating around this past week.

        In case you missed it. For example:

        Nvidia's $100 billion OpenAI deal has seemingly vanished - Ars Technica

        https://arstechnica.com/information-technology/2026/02/five-...

        Specifically this paragraph is what I find hilarious.

        > According to the report, the issue became apparent in OpenAI’s Codex, an AI code-generation tool. OpenAI staff reportedly attributed some of Codex’s performance limitations to Nvidia’s GPU-based hardware.

        • dajonker 2 hours ago ago

          There was never a $100 billion deal. Only a letter of intent which doesn't mean anything contractually.

        • esafak 3 hours ago ago

          > OpenAI staff reportedly attributed some of Codex’s performance limitations to Nvidia’s GPU-based hardware.

          They should design their own hardware, then. Somehow the other companies seem to be able to produce fast-enough models.

  • koolala an hour ago ago

    I want to recompile a Rust project to be f32 instead of f64.

    Am I better off buying 1 month of Codex, Claude, or Antigravity?

    I want to have the agent continuously recompile and fix compile errors in a loop until all the bugs from switching to f32 are gone.
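
    A rough sketch of what that loop could look like, written in Python to keep it short. `apply_fix` is a hypothetical hook standing in for whichever agent gets picked, not a real Codex or Claude API; `cargo check --message-format=short` is the only real command used here. A clean compile only clears type errors, so precision regressions from f32 would still need `cargo test` or similar.

```python
import subprocess


def cargo_check(project_dir):
    """Run `cargo check` and return (succeeded, compiler diagnostics)."""
    proc = subprocess.run(
        ["cargo", "check", "--message-format=short"],
        cwd=project_dir,
        capture_output=True,
        text=True,
    )
    # Cargo writes its diagnostics to stderr.
    return proc.returncode == 0, proc.stderr


def fix_until_green(check, apply_fix, max_rounds=20):
    """Alternate check and fix until the check passes.

    `check` returns (ok, errors); `apply_fix(errors)` is the hypothetical
    agent hook. Returns the number of fix rounds that were needed.
    """
    for rounds_used in range(max_rounds + 1):
        ok, errors = check()
        if ok:
            return rounds_used
        if rounds_used < max_rounds:
            apply_fix(errors)
    raise RuntimeError(f"still failing after {max_rounds} fix rounds")


# Hypothetical wiring (`ask_agent_to_fix` is not a real function):
# fix_until_green(lambda: cargo_check("."), ask_agent_to_fix)
```

    The round cap is there so a stuck agent cannot burn through a usage limit unattended.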

    • vatsachak 33 minutes ago ago

      Literally just find and replace

    • EmilStenstrom an hour ago ago

      Doesn't matter which one. All of them can do things like this now, given a good enough feedback loop. Which your problem has.

    • argsnd an hour ago ago

      All of them can do it but Codex has the least frustrating usage limits.

  • dllrr 2 hours ago ago

    Using opus 4.6 in claude code right now. It's taking about 5x longer to think things through, if not more.

    • andyferris 37 minutes ago ago

      The notes explicitly call out you may want to dial the effort setting back to medium to reduce latency/tokens (high being default, apparently there is a max setting too).

  • prng2021 3 hours ago ago

    Did they post the knowledge cutoff date somewhere

  • fishpham 4 hours ago ago
  • tyfon 3 hours ago ago

    I'm having a hard time parsing the openai website.

    Anyone know if it is possible to use this model with opencode with the plus subscription?

  • ponyous 3 hours ago ago

    I think models are smart enough for most of the stuff, these little incremental changes barely matter now. What I want is the model that is fast.

  • jpau an hour ago ago

    Interesting that this was released without a prior GPT-5.3 release. I wonder if that means we won't see a GPT-5.3?

  • jdthedisciple 3 hours ago ago

    Gotta love how the game demo's page title is "threejs" – I guess the point was to demo its vibe-coding abilities anyway, but yea..

  • __mharrison__ 3 hours ago ago

    I never really used Codex (found it too slow), just 5.2, which I'm finding to be an excellent model for my work. This looks like another step up.

    This week, I'm all local though, playing with opencode and running qwen3 coder next on my little spark machine. With the way these local models are progressing, I might move all my llm work locally.

    • andix 3 hours ago ago

      I think codex got much faster for smaller tasks in the last few months. Especially if you turn thinking down to medium.

    • raffkede 2 hours ago ago

      I think the slow feeling is a UI thing in codex

  • Robin_f 3 hours ago ago

    Anthropic mostly had an advantage in speed. It feels like with a 25% increase in speed with Codex 5.3, they are now losing that advantage as well.

    • smith7018 2 hours ago ago

      I just asked Opus 4.6 to debug a bug in my current changes and it went for 20 minutes before I interrupted it. Take that as you will.

      • bgirard 2 hours ago ago

        Doesn't feel like a useful data point without more context. For some hard bugs I'd be thrilled to wait 30 minutes for a fix, for a trivial CSS fix not so much. I've spent weeks+ of my career fixing single bugs. Context is everything.

        • smith7018 2 hours ago ago

          Sure, but I've never experienced a 20 minute wait with CC before. It was an architectural question but it would have taken a couple minutes with a definitive answer on 4.5.

  • modeless 3 hours ago ago

    It's so difficult to compare these models because they're not running the same set of evals. I think literally the only eval variant that was reported for both Opus 4.6 and GPT-5.3-Codex is Terminal-Bench 2.0, with Opus 4.6 at 65.4% and GPT-5.3-Codex at 77.3%. None of the other evals were identical, so the numbers for them are not comparable.

    • alexhans 3 hours ago ago

      Isn't the best eval the one you build yourself, for your own use cases and value production?

      I encourage people to try. You can even timebox it and come up with some simple things that might look initially insufficient but that discomfort is actually a sign that there's something there. Very similar to moving from not having unit/integration tests for design or regression and starting to have them.
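
      As a starting point, a personal eval can be as small as a list of prompts with pass/fail checks. The sketch below is only one possible shape: `EvalCase`, `run_evals`, and the stand-in `fake_model` are invented names for illustration, and a real setup would replace `fake_model` with a call to whichever model is being compared.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    name: str
    prompt: str
    check: Callable[[str], bool]  # True if the model's answer is acceptable


def run_evals(model, cases):
    """Run every case through `model` (a prompt -> text callable)."""
    results = {}
    for case in cases:
        try:
            results[case.name] = bool(case.check(model(case.prompt)))
        except Exception:
            # A crash in the model or the check counts as a failure.
            results[case.name] = False
    return results


def pass_rate(results):
    return sum(results.values()) / len(results) if results else 0.0


def fake_model(prompt):
    """Stand-in for a real model call; purely illustrative."""
    if "SQL" in prompt:
        return "SELECT id FROM users WHERE active = 1"
    return "I don't know"


CASES = [
    EvalCase("sql_select", "Write SQL to list active user ids",
             lambda out: out.upper().startswith("SELECT")),
    EvalCase("regex_email", "Write a regex that matches an email address",
             lambda out: "@" in out),
]

results = run_evals(fake_model, CASES)
print(results, f"pass rate: {pass_rate(results):.0%}")
```

      Running the same cases against each new release gives a number that reflects your own work rather than a vendor's chosen benchmark.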

    • rsanek 3 hours ago ago

      I usually wait to see what ArtificialAnalysis says for a direct comparison.

    • input_sh 3 hours ago ago

      It's better on a benchmark I've never heard of!? That is groundbreaking, I'm switching immediately!

      • modeless 3 hours ago ago

        I also wasn't that familiar with it, but the Opus 4.6 announcement leaned pretty heavily on the TerminalBench 2.0 score to quantify how much of an improvement it was for coding, so it looks pretty bad for Anthropic that OpenAI beat them on that specific benchmark so soundly.

        Looking at the Opus model card I see that they also have by far the highest score for a single model on ARC-AGI-2. I wonder why they didn't advertise that.

        • input_sh 3 hours ago ago

          No way! Must be a coinkydink, no way OpenAI knew ahead of time that Anthropic was gonna put a focus on that specific useless benchmark as opposed to all the other useless benchmarks!?

          I'm firing 10 people now instead of 5!

  • GenerWork 3 hours ago ago

    I find it very, very interesting how they demoed visuals in the form of the “soft SaaS” website and mentioned how it can do user research. Codex has usually lagged behind Claude and Gemini when it comes to UX, so I’m curious to see if 5.3 will take the lead in real world use. Perhaps it’ll be available in Figma Make now?

    • brokencode 3 hours ago ago

      I’m hoping they add better IDE integration to track active file and selection. That’s the biggest annoyance I have in working with Codex.

  • virtualzx 38 minutes ago ago

    It's so fun that the two releases used almost completely non-overlapping benchmarks!

  • gwd 3 hours ago ago

    gpt-5.3-codex isn't available on the API yet. From TFA:

    > We are working to safely enable API access soon.

  • dawidg81 3 hours ago ago

    May AI not write the code for me.

    May I at least understand what it has "written". AI help is good, but don't replace real programmers completely. I've had enough of copy-pasting code I don't understand. What if one day AI will fall down and there will be no real programmers to write the software. AI for help is good, but I don't want AI to write whole files into my project. Then something may break and I won't know what's broken. I've experienced it many times already. Told the AI to write something for me. The code was not working at all. It was compiling normally but the program was bugged. Or when I was making some bigger project with ChatGPT only, it was mostly working, but after a longer time, when I was prompting more and more things, everything got broken.

    • katspaugh 3 hours ago ago

      Honest question: have you tried evolving your code architecture when adding features instead of just "prompting more and more things"?

      • dawidg81 an hour ago ago

        I've tried that too but it was almost the same; ChatGPT kept forgetting many things about the code and project structure. In summary, AI can get problematic for me and I run into trouble with it. This is one of the reasons why I still prefer a traditional text editor like Vim for writing code over "software on steroids" like VS Code and things like that...

    • pixl97 2 hours ago ago

      > What if one day AI will fall down and there will be no real programmers to write the software.

      What if you want to write something very complex now that most people don't understand? You keep offering more money until someone takes the time to learn it and accomplish it, or you give up.

      I mean, there are still people that hammer out horseshoes over a hot fire. You can get anything you're willing to pay money for.

    • nubg an hour ago ago

      Sorry but companies will not hire you but instead a person who learned how to code with AI. Get with the times or lose.

      • dawidg81 an hour ago ago

        I'm afraid of all of the modern world, especially in technology. I guess if I now "came back" to all the modern and new things (the commercialized world, AI, corporations, etc.) my head would explode. I mean, I can't imagine living in such a world. I am not sure everything would be alright with me in all of this. This is just too much...

      • cheeze an hour ago ago

        It's that Austin Powers clip of the guy slowly getting smooshed by the steam roller.

  • imasliev 3 hours ago ago

    GPT-5.2-Codex was great value for the price; I hope 5.3 will not ruin the race with Claude.

  • foft 3 hours ago ago

    Having used Codex a fair bit, I find it really struggles with … almost anything. However, using the equivalent ChatGPT model is fantastic. I guess it’s a matter of focus and being provided with a smaller set of code to tackle.

  • kingstnap 4 hours ago ago

    That was fast!

    I really do wonder what's the chain here. Did Sam see the Opus announcement and DM someone a minute later?

    • Mond_ 4 hours ago ago

      OpenAI has a whole history of trying to scoop other providers. This was a whole thing for Google launches, where OpenAI regularly launched something just before Google to grab the media attention.

      • rsanek 3 hours ago ago

        Some recent examples:

        GPT-4o vs. Google I/O (May 2024): OpenAI scheduled its "Spring Update" exactly 24 hours before Google’s biggest event of the year, Google I/O. They launched GPT-4o voice mode.

        Sora vs. Gemini 1.5 Pro (Feb 2024): Just two hours after Google announced its breakthrough Gemini 1.5 Pro model, Sam Altman tweeted the reveal of Sora (text-to-video).

        ChatGPT Enterprise vs. Google Cloud Next (Aug 2023): As Google began its major conference focused on selling AI to businesses, OpenAI announced ChatGPT Enterprise.

    • NewsaHackO 2 hours ago ago

      I assume some sort of corporate espionage. This is high stakes after all

    • maxpert 3 hours ago ago

      "Tell me that you are hurt without telling me that you are hurt" applies to Sam right now.

  • rustyhancock 3 hours ago ago

    Anyone remember the dot-com era when you would see one provider claim the most miles of fibre and then later that week another would have the title?

  • ecshafer 3 hours ago ago

    Funny that this and Opus 4.6 released within minutes of each other. Each showing similar score improvements. Each claiming to be revolutionary.

  • davidmurdoch 3 hours ago ago

    I've been using 5.2 the way they're describing the new use case for 5.3 this whole time.

  • PieUser 2 hours ago ago

    How'd they both release at the same time? Insiders?

  • binsquare 3 hours ago ago

    At first try it solved a problem that 5.2 couldn't previously.

    Seems to be slower/thinks longer.

  • mrcwinn an hour ago ago

    According to Sam Altman, Anthropic is for "rich people." Judging by his $4 million man-baby Koenigsegg, he must be a huge Claude Code user!

  • bryanhogan 3 hours ago ago

    The most important question: Can it do Svelte now?

    • speedgoose 2 hours ago ago

      Today is the best day to rewrite everything in React. You may not enjoy React, but AI agents do. And they are the ones writing the code.

    • davidmurdoch 2 hours ago ago

      5.2 was already very good with svelte 5, at least when you have the svelte MCP server set up.

  • edem 4 hours ago ago

    So can I use this from Opencode? Because Anthropic started to enforce their TOS to kill the Opencode integration

    • avb an hour ago ago

      You can also use it via Opencode Zen, GitHub Copilot, or probably any number of other model providers that Opencode integrates with.

      Not sure why everyone stays focused on getting it from Anthropic or OpenAI directly when there are so many places to get access to these models and many others for the same or less money.

    • tfehring 3 hours ago ago

      OpenAI models in general, yes - `opencode auth login`, select OpenAI, then ChatGPT Pro/Plus. I just checked and 5.3-codex isn't available in opencode yet, but I assume it will be soon.

    • regularfry 3 hours ago ago

      I've tried Opus 4.5 in opencode via the GitHub Copilot API, mostly to see if it works at all. I don't think that broke any terms of service? But also I haven't checked how much more expensive I made it for myself over just calling them directly.

    • rs_rs_rs_rs_rs 3 hours ago ago

      You can use Anthropic models in Opencode: make an API key and you're good to go (you can even use the in-house Opencode router, Zen).

      What you can't do is pretend opencode is claude code to make use of that specific claude code subscription.

    • InsideOutSanta 3 hours ago ago

      Yes, OpenAI said they'd allow usage of their subscriptions in opencode.

  • kopollo 3 hours ago ago

    Where is the google?

    • hsaliak 2 hours ago ago

      gemini-3-flash-preview will be GA soon i hope. /s

  • drcongo an hour ago ago

    Does it insert adverts in your code?

  • simianwords 4 hours ago ago

    Any notes on pricing?

    • Tiberium 4 hours ago ago

      It's not in the API yet - "We are working to safely enable API access soon.", but I assume the rate-limits won't be worse than for 5.2 Codex.

      • nine_k 3 hours ago ago

        Ah, "It's ready, but not yet".

        • yunyu 3 hours ago ago

          You can just use it outside of the API?

  • bg24 3 hours ago ago

    I am on a max subscription for Claude, and hate the fact that OpenAI have not figured out that $20 => $200 is a big jump. Good luck to them. In terms of model, just last night, Codex 5.2 solved a problem for me which other models were going round and round. Almost same instructions. That said, I still plan to be on $100 Claude (overall value across many tasks, ability to create docs, co-work), and may bump up OpenAI subscription to the next tier should they decide to introduce one. Not going to $200 even with 5.3, unless my company pays for it.

    • satvikpendem an hour ago ago

      You should look into Kilo Pass by Kilo Code (https://kilo.ai/features/kilo-pass). It's basically a fixed subscription and your credits roll over each month, and you get free extra credits too which are used up first before paid credits. It's similar to paying for Cursor except the credits roll over which is why I'm contemplating moving to it, because I don't want to be locked into any one LLM provider the way Claude Code or Codex make you become.

    • aerhardt 2 hours ago ago

      I'm coding about 6-9h per day with Codex CLI on the $20 Plus sub, occasionally switching to extra-high reasoning and feeding it massive contexts, all tools enabled, sometimes 2-3 terminal sessions running in parallel and I've never hit limits... I operate on small-ish codebases but even so I try to work in the most local scope possible with AGENTS.md at the sub-directory levels.

      Are you really hitting limits, or are you turned off by the fact you think you will?

      • bg24 2 hours ago ago

        You are correct :-) I am turned off by the fact that I will hit the limit if I used more. But you gave me confidence. I guess $20 can go a long way. I think only once in the last 3 months I got rate limited in Codex.

    • wiether 2 hours ago ago

      I use Codex in OpenCode through the API and find the experience quite enjoyable.

      • bg24 2 hours ago ago

        Need to try OpenCode. Thanks.

    • andix 3 hours ago ago

      I guess the jump is on purpose. You can buy Codex credits and also use codex via the API (manual switching required).

  • petetnt an hour ago ago

    Whoa, I think GPT-5.2-Codex was a disappointment, but GPT-5.3-Codex is definitely the future!

  • maheshrijal 4 hours ago ago

    It seems Fast!

  • I_am_tiberius 3 hours ago ago

    I'd like to know whether, and how much, customer prompts are illegally used for training.

    • xlbuttplug2 3 hours ago ago

      "But we anonymize prompts before training!"

      Meanwhile the prompt: Crop this photo of my passport

    • renewiltord 3 hours ago ago

      Oh yeah that’s in the “These Are The Illegal Things We Did” section 7.4 in the Model Card.

  • roya51788 3 hours ago ago

    what are the benchmarks against opus 4.6?

  • hubraumhugo 3 hours ago ago

    Anybody else not seeing it available in Codex app or CLI yet (with Plus)?

    • haneul 3 hours ago ago

      My Codex CLI didn’t notice that a version bump was available, but I manually ran `pnpm add -g @openai/codex` and 5.3 was there afterwards.

  • heraldgeezer 3 hours ago ago

    Anthropic and GPT, 2 new models at once?

  • wahnfrieden 3 hours ago ago

    Pelican seems much worse than the Opus 4.6 one (though the bicycle is more accurate):

    https://gist.github.com/simonw/a6806ce41b4c721e240a4548ecdbe...

  • OutOfHere 3 hours ago ago

    It is absurd to release 5.3-Codex before first releasing 5.3.

    Also, there is no reason for OpenAI and Anthropic to be trying to one-up each other's releases on the same day. It is hell for the reader.

    • aurareturn 3 hours ago ago

      Because Claude Code is stealing the thunder so OpenAI is focusing on coding now.

      • whizzter 3 hours ago ago

        Yeah, Claude Code is what everyone is talking about these days, and since OpenAI has always been the spending driver, being 2nd or 3rd fiddle just isn't acceptable if they're gonna justify it.

      • stri8ted 2 hours ago ago

        That is where the money is.

        • ghosty141 an hour ago ago

          This. I think software development is the best usecase for AI yet. I use it almost daily at work and it's a huge help.

          Enterprise customers will happily pay even $100/mo subscriptions and it has a clear value proposition that can be decently verified.

    • apetresc 3 hours ago ago

      Why is it absurd?

    • tomashubelbauer 3 hours ago ago

      I agree, I was confused about where 5.3 non Codex was. 5.2-Codex disappointed me enough that I won't be giving 5.3 Codex a try, but I'm looking forward to trying 5.3 non Codex with Pi.

      • sunaookami 2 hours ago ago

        The GPT-5.x models in general are very disappointing; the only good chat model was GPT-5 in the first week, before they made "the personality warmer", and Codex in general was always kinda meh.

  • nubg 2 hours ago ago

    lmao so cringe that they delay releasing the model until anthropic does

  • raincole 4 hours ago ago

    Almost like Anthropic and OpenAI are trying to front run each other

  • shibeprime 4 hours ago ago

    I know we just got a reset and a 2× bump with the native app release, but shipping 5.3 with no reset feels mismatched. If I’d known this was coming, I wouldn’t have used up the quota on the previous model.

  • maxpert 3 hours ago ago

    Is it just me, or is Sam being the absolute sore loser he is and trying to steal Opus's thunder?

    • nickthegreek 3 hours ago ago

      Why is it loser? He very well could be a sore winner here.

      • koakuma-chan 3 hours ago ago

        OpenAI is still the only AI company that has structured outputs. Anthropic now supports JSON schema but you can't specify array length.

        • jiggawatts an hour ago ago

          Google Gemini definitely has structured output.

        • wahnfrieden 2 hours ago ago

          Can you elaborate what you mean - OAI structured outputs means JSON schema doesn't it? So are you just saying they both support JSON schema but Anthropic has a limitation?

          • koakuma-chan 2 hours ago ago

            OpenAI, in addition to JSON schema, supports "context-free grammar"[0], i.e. regex and lark. Anthropic also supports JSON schema since a few weeks ago, but they don't support specifying the length of JSON array, so you still have to worry about the model producing invalid output.

            [0]: https://platform.openai.com/docs/guides/function-calling#con...

            One thing that pisses me off is this widespread misunderstanding that you can just fall back to function calling (Anthropic's function calling accepts JSON schema for arguments), and that it's the same as structured outputs. It is not. They just dump the JSON schema into the context without doing the actual structured outputs. Vercel's AI SDK does that and it pisses me off because doing that only confuses the model and prefilling works much better.
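To make the "you still have to worry about the model producing invalid output" point concrete, here is a minimal sketch of the client-side check you end up writing when the provider cannot enforce array length for you. It assumes the reply is a JSON object with an array under some key; the function name, the `tags` key and the bounds are hypothetical, and no provider SDK is called.

```python
import json


def parse_bounded_array(raw, key, min_items, max_items):
    """Parse a model reply and enforce array-length bounds client-side.

    This is an after-the-fact check, not constrained decoding: it can only
    reject a bad reply, it cannot stop the model from producing one.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc

    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")

    items = data.get(key)
    if not isinstance(items, list):
        raise ValueError(f"expected {key!r} to be a JSON array")

    if not min_items <= len(items) <= max_items:
        raise ValueError(
            f"{key!r} has {len(items)} items, "
            f"expected between {min_items} and {max_items}"
        )
    return items


if __name__ == "__main__":
    # Hypothetical reply; in practice this is the raw text from the model.
    reply = '{"tags": ["a", "b", "c"]}'
    print(parse_bounded_array(reply, "tags", 3, 3))
```

On a `ValueError` the caller would typically retry the request, which is exactly the extra round trip that decoding-time constraints are meant to avoid.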

    • OutOfHere 3 hours ago ago

      They both are doing this to each other.

      BTW, loser is spelled with a single o.

    • wahnfrieden 3 hours ago ago

      You could also claim that Anthropic is trying to scoop OpenAI by launching minutes earlier, as OpenAI has done with Google in the past.

      For downvoters, you must be naive to think these companies are not surveilling each other through various means.