Project Glasswing: what Mythos showed us

(blog.cloudflare.com)

157 points | by Fysi 4 hours ago

63 comments

  • roxolotl an hour ago

    What does this mean?

    > It's a different kind of tool doing a different kind of work, and that makes a clean apples-to-apples comparison to earlier models difficult.

    They claim it’s a different kind of tool and then describe using it the same way you’d use any other model. This really felt way worse than the average Cloudflare blog and really just rehashed the Mythos announcement which had already called out the key parts being chaining and crafting examples.

    • __natty__ an hour ago

      Sounds different because it’s a hidden advertisement, not a regular blog post

    • FergusArgyll 42 minutes ago

      I think what they might mean is:

      Because of its capabilities, a new kind of harness can be built for it, thus the entire system (model + harness) is a different kind of tool than, say, Claude Code

      • Xirdus 25 minutes ago

        But did they build this different harness? And are they sure other models can't cope with it?

        • roxolotl 12 minutes ago

          Right, I expected the piece to transition into “and here’s how we built a whole new thing for it” but it never did.

  • sandeepkd 2 hours ago

    I was expecting some more concrete numbers and surprises. It just seems like a balanced promotional article, probably written using an LLM itself.

    • wslh an hour ago

      Over the last few days I've been recommending the insights from XBOW [1]. It's a competitor, but it adds more information to the discussion.

      [1] https://xbow.com/blog/mythos-offensive-security-xbow-evaluat...

      • sandeepkd an hour ago

        Thanks for sharing. It's definitely more concrete. Some of the things I was hoping to find were the number of false positives, the time it takes to separate the false positives from the real ones, and the toll on the human mind of performing this exercise. Did anyone manually verify the exploits identified by the LLM, or were they assumed correct based on the explanation? I do understand that the target audience of these articles is probably the decision makers, so the language and content have to be tailored accordingly.

        • pixl97 18 minutes ago

          >, the number of false positives,

          Really this is why the LLM needs to be able to write exploits for issues it finds. Of course that leads down a rabbit hole of other issues. But if an exploit works, then that's pretty conclusive evidence.

          • lacewing 3 minutes ago

            For a subset of bugs, yes. For some others, not really: I've seen LLMs make bogus assumptions about the threat model (in which case, the exploit works but doesn't demonstrate anything useful) or "cheat" by modifying the code to demonstrate a hallucinated issue.

            Frontier models, including Mythos, can greatly streamline bug hunting and exploit development in the hands of a competent security engineer. In the hands of a person with no security experience, they will still mostly waste your time and money.

      • FergusArgyll 32 minutes ago

        That is a good article.

        Interesting that GPT-5.5, while not as good as Mythos, also seems like a decent step up

  • xnorswap 2 hours ago

    The real question is whether it was Mythos or Opus that wrote this post.

    > "Why it matters"

    It doesn't, it's a corporate blog, they were rarely written in one author's voice anyway, but it's interesting to see that even large organisations are outsourcing their blogs to LLMs.

    • sulam 2 hours ago

      Sentence constructions like this definitely scream AI: "That's a reasonable bias for an exploratory tool. It's a ruinous one for a triage queue..."

      I will upgrade the "why it matters" to "and now AI output is part of the training data". A day is coming when the punched-up AI verbiage will be the norm and hard to distinguish unless you're from the previous generation. Sort of in the way that I miss some aspects of Usenet.

      • genxy an hour ago

        I had a dude in a conversation non-ironically use "load-bearing."

        I could only follow up with, "that is a genuine insight."

        Not a single person visibly flinched in pain.

        • alexjplant 40 minutes ago

          Let's double-click on that. It's important to keep top of mind that using disruptive words and patterns in conversation isn't always driven by LLMs — reasoning from first principles tells us that problematic nonsense like this existed beforehand. One of my load-bearing career learnings is that people used this shape of language as a shibboleth long before game-changing tools like ChatGPT started slopping so much of what people read. It's a performant way of categorizing people into a very specific tech culture in-group based on vibes.

          Performant vibe enshittification, if you will.

          ---

          I use a few of these but I will die before I ever go full "ECMAScript developer" and use "shape" to refer to a schema, function signature, struct, trait, etc. No idea why something so trivial bothers me so but it does.

        • ChrisClark 30 minutes ago

          I use load-bearing all the time, mostly in jokes about something

        • hhh an hour ago

          yeah? it’s not that weird of a term

      • Avicebron an hour ago

        That's a scary thought, LLMs training on LLM output. People trained by sheer ubiquity to read and think in LLM output will produce their own LLM-esque writing.

        Seems stifling. We'll need some way to reward human creativity and out-of-bounds thinking before our greatest corpus of human intellect is bounded by whenever and whatever it was trained on.

        • adrianN an hour ago

          Writing and later the printing press have already considerably stifled human expressiveness. Language used to be much more fragmented and diverse before mass media (or the Bible in every household). In my grandmother’s time you would have difficulty understanding people from three villages down the road.

          • airstrike 14 minutes ago

            I'm not sure enabling people three villages apart to communicate with each other counts as "stifling human expressiveness"

        • ctoth an hour ago

          So is it that humans are inherently creative, machines could never do what we do? Or is it that humans will only replicate our training data, and so we have to ensure that machines don't bound our training data? Or are you going meta and gently pointing out the absurdity? (I hope it's this one!)

        • gdulli an hour ago

          Human creativity is not only not being rewarded, but people are increasingly talking like consuming too few tokens is something that's actively used against them.

    • estearum 2 hours ago

      It's fascinating seeing people think that if you're snarky enough about something, the substance of that thing actually ceases to be substantive.

      It's like staring down the barrel of a gun and taking the time to make quips about the type of paper the gun advertisement was printed on.

      • SpicyLemonZest 2 hours ago

        When writing is too heavily LLM-assisted, it does actually cease to be substantive, because it becomes impossible to know which parts of it represent actual claims which the author believes as stated and which are interpolations.

        • estearum an hour ago

          No no, the LLM assistance makes it hard to know what is substantive. That means it puts more work on the reader, which is a totally valid thing to complain about, but which is totally different from "the poor writing is actually the whole point"

          • SpicyLemonZest 33 minutes ago

            But how can the reader do the work? They don't have access to Mythos and can't review Cloudflare's internal findings or harnesses. The only practical options are to accept the article at face value or not accept it if the expected density of LLM interpolations is too high.

        • stavros an hour ago

          All of them represent claims which the author believes as stated, otherwise the author wouldn't put their name on them.

      • stavros an hour ago

        Eh, I still read all of it, but it grates that everything everywhere all the time now is written by one person.

        • estearum an hour ago

          I agree with the complaint, I just disagree with this somehow obviating the need to engage with the underlying substance (where it exists)

          And obviously it's a problem that it's so much cheaper to produce writing without underlying substance, but I think when one of the leading Internet security/infrastructure companies is writing about the leading cybersecurity model, it's excessively flippant to say the writing on top is "the real question"

    • skrebbel 38 minutes ago

      This is not just any large organization, it's Anthropic. Their entire shtick is that AIs can do Real Work now and it'd be weird if they didn't behave accordingly themselves.

      This is also why Claude Code is full of weird bugs and why their support says that it did refunds when it didn't and so on and so forth.

    • this_user 2 hours ago

      This looks more like it was edited by AI rather than fully written by it. Or they are using a really good humaniser for the second pass.

    • divan 2 hours ago

      Cloudflare blogs have been excellent for many years, long before transformers arrived.

    • RyeCombinator 12 minutes ago

      Disappointing really.

    • add-sub-mul-div an hour ago

      Should that be surprising? Larger orgs are the ones more naturally associated with mediocrity and are most likely to want to reduce human labor hours.

  • MattSayar 2 hours ago

    > The loudest reaction to Mythos Preview from other security leaders has been about speed - scan faster, patch faster, compress the response cycle. More than one team we have spoken with is now operating under a two-hour SLA from CVE release to patch in production [...] If regression testing takes a day, you cannot get to a two-hour SLA without skipping it, and the bugs you ship when you skip regression testing tend to be worse than the bugs you were trying to patch.

    Over time, I wonder if these models will be able to generate more secure code by default by doing this kind of exploitability testing before ever merging their code.

    • edu an hour ago

      Or they don’t, and they* sell access to Mythos and successors through their services company or network of partners and charge a premium.

      * by "they" I mean all foundation model providers, as OpenAI seems to be going in the same direction

  • dataflow 2 hours ago

    That's great and all but how severe were the most severe vulnerabilities found? I imagine they don't want to talk about it, but that's really the most interesting and important bit.

    • aabhay 2 hours ago

      As much as I’d like to share in the skepticism, the very beginning of the article states it very plainly — this is a step function.

      Lots of people feel that Mythos is a psyops campaign, but I don’t really understand the skepticism. Most of it seems to stem from the general distrust of things that aren’t publicly available.

      A few Anthropic employees have described Mythos as a general purpose model improvement, but that claim has yet to be widely backed up so that’s the only place I’m remaining skeptical.

      For the domain of security research, I’m willing to buy the narrative.

      • ZrArm 3 minutes ago

        > As much as I’d like to share in the skepticism, the very beginning of the article states it very plainly — this is a step function.

        To be fair, they can't say "You know, Mythos is better, but improvements are overhyped af". Moreover, their explanation of that "step change" is strange. It sounds like Mythos isn't that much better at finding vulnerabilities (which is very strange, given statements from Mozilla), but is way stronger at working with them.

        > Lots of people feel that Mythos is a psyops campaign, but I don’t really understand the skepticism. Most of it seems to stem from the general distrust of things that aren’t publicly available.

        1) Attempts to spin the idea about "Super powerful general purpose model that can't be released for some not so clear reasons" are usually a very bad sign. OpenAI proves it.

        2) The Mythos system card has a lot of strange moments, errors and things that sound like attempts to deceive.

        3) It's strange that Anthropic is struggling with both Sonnet 5.0 and Opus 5.0, but at the same time has a breakthrough in the form of Mythos.

        > A few Anthropic employees have described Mythos as a general purpose model improvement, but that claim has yet to be widely backed up so that’s the only place I’m remaining skeptical.

        The article describes Mythos as a cybersecurity-specific model though. It's yet another unclear point.

      • ryandamm 2 hours ago

        In his interview on the Hard Fork podcast, Palo Alto Networks’ CEO described the capability change from Opus to Mythos as being more about availability; evidently it runs in a very compute-intensive, always-on mode. Unclear if the base model is significantly different, but Arora ascribed the difference mostly to that change.

      • mupuff1234 2 hours ago

        Claiming something doesn't make it true.

    • cute_boi 2 hours ago

      Most of their new products are AI tools that nobody uses, so I guess they’ll keep posting slop. And recently, they’ve fired so many people that they probably don’t have good writers anymore.

  • sf_tristanb 2 hours ago

    Great, but why don't you share real data on how many security vulns it found? How many were real, how many weren't?

  • btown an hour ago

    This is worth a read specifically for this section and the ones following it, re: custom vs. agentic-coding harnesses. https://blog.cloudflare.com/cyber-frontier-models/#why-point...

    Claude Code's harness is remarkable for many use cases, particularly with 1M context sizes. But it's also limited when the scale of code or data to read becomes close to that, or exceeds it. The idea that a cluster of actors can work on a shared, structured set of context snippets, and have guidance around what is relevant to them, is an incredibly useful model outside of cybersecurity as well.

  • jerrythegerbil 28 minutes ago

    This blog was written by AI.

  • staticassertion 42 minutes ago

    > The harder question is what the architecture around the vulnerability should look like. The principle is to make exploitation harder for an attacker even when a bug exists, so that the gap between when a vulnerability is disclosed and when it is patched matters less. That means defenses that sit in front of the application and block the bug from being reached. It means designing the application so that a flaw in one part of the code cannot give an attacker access to other parts. It means being able to roll out a fix to every place the code is running at the same moment, rather than waiting on individual teams to deploy it.

    So nothing new then.

  • hydra-f 2 hours ago

    Besides the poorly written post, the vulnerability discovery workflow might actually give good results

  • perching_aix 2 hours ago

    It's nice to see them address the instrumentation side of this.

    I expressed some concerns along the same lines in the thread about the Mythos evaluation curl did a few days ago, which sounded a lot like the "passing in the repo and telling it go!" type of workflow that this post describes as dramatically less effective.

    Disappointed that the post is very slim on details beyond this however. No hard numbers. Not comparatively, not in isolation. Would have arguably been kinda the point.

  • yieldcrv an hour ago

    “Sorry Dave I’m afraid I can’t do that“

    I’m a security researcher

    “Oh in that case”

  • unethical_ban 2 hours ago

    Interesting for teams looking to implement AI into their deployment process.

    I don't think guardrails are useful long term. Assuming we don't see the end of open near-frontier models, it is folly to try to keep models from doing exploit generation. The solution needs to be all software projects assuming that hackers will be running LLMs against their code in search of exploits, and writing secure code accordingly.

    • sterlind an hour ago

      even careful programmers working in unsafe languages will introduce bugs; it's inevitable. in 2026 we should be using safe languages for all new projects, but there's a gargantuan amount of C/C++ handling protocols.

      but I agree that guardrails will only help for like, 3-6 months. we should be screening as much as we can with Mythos; unfortunately, Anthropic is only giving access to the big players.

  • wnevets 2 hours ago

    I can't wait to be told that Cloudflare is now part of "The Mythos FUD" campaign.

    • whizzter 2 hours ago

      2 things can be true at the same time.

      I think the curl folks finding it underwhelming is more of a testament to their code being subjected to a lot of tests/attacks/auditing over the past years compared to many other codebases. It's not going to find magically insurmountable exploits on its own and "pwn teh w0rld".

      At the same time, there is so much shitty non-memory-safe code out there (C/C++ mainly) or logically weak code (much of it vibe-coded or otherwise written by inexperienced devs) that will be easy pickings for anyone pointing Mythos at those codebases/services. That will eventually lead to chaos, since the cost of a customized exploit has gone from days to months of expensive researcher time to some token spending.

      Now if they noticed that they could find exploit chains easily in a lot of popular software, some embargo and hardening to give popular OSS packages time to not be exploitable by default does help people (and the NSA that probably has a preview).

      • adrian_b an hour ago

        While it is true that C/C++ are prone to bugs when used by careless programmers, Cloudflare also said:

        "We saw consistently more false positives from projects written in memory-unsafe languages."

        So while there may be a greater probability of finding bugs in C/C++ projects, there is also a greater probability that humans will have to do more work to verify that real bugs have been found.

      • pixl97 an hour ago

        The amount of code that is absolute trash in F500 could drown the world.

        Static scanners are OK at finding a few particular types of issues, and really bad at more abstract issues. Also, having rules where you must pass static analysis has to be followed up with actually making sure your code monkeys aren't writing bullshit that confuses the scanner and lets it pass while doing nothing for security (or adding nice logic traps).

        Most external security firms looking at code are more useless than a zero with the circle rubbed out. Had a fun example from a while back where the team that wrote the code inserted an intentional security flaw to be sure the reviewers were catching anything. Problem is, they were giving access to the entire git history, so these stood out. The moment they just gave flat code, the security team's ability to find flaws disappeared.

        LLMs seem to have a pretty good grasp on finding flaws in code like this once you can get the issue to fit within context and execution time. When I hear things like Mythos getting much longer to work on the problem, then at least to me the number of issues it's picking up makes a lot more sense.

    • brcmthrowaway an hour ago

      AI boosters are so, so easy to find.

  • wutwutwat 2 hours ago

    Technically speaking, Cloudflare is, at its core, a security vulnerability itself. World's largest MITM.

  • reducesuffering 2 hours ago

    There will be no mea culpa from folks insinuating Mythos is a marketing stunt. Nor will there be one each time AI capabilities blast through the naive expectations.