We found a bug in the hyper HTTP library

(blog.cloudflare.com)

145 points | by Pop_- 4 days ago

57 comments

  • Twey 6 hours ago

    This would have been flagged by Clippy lints `let_underscore_untyped` or `let_underscore_must_use`, which sadly are not enabled by default.
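
    A minimal sketch of opting in to those two lints, assuming a hypothetical `try_flush` helper and `FlushState` type (neither is hyper's API). rustc accepts the attribute as written; the diagnostics themselves would only show up under `cargo clippy`.

    ```rust
    // Hypothetical stand-in for a flush call whose result must not be ignored.
    #[must_use]
    #[derive(Debug, PartialEq)]
    pub enum FlushState {
        Done,
        Pending,
    }

    // Assumption for illustration: a flush completes only if everything fits.
    pub fn try_flush(buffered: usize, capacity: usize) -> Result<FlushState, String> {
        if buffered <= capacity {
            Ok(FlushState::Done)
        } else {
            Ok(FlushState::Pending)
        }
    }

    // Opt in to both Clippy lints for this item only.
    #[warn(clippy::let_underscore_untyped, clippy::let_underscore_must_use)]
    pub fn finish(buffered: usize, capacity: usize) -> Result<bool, String> {
        // `let _ = try_flush(buffered, capacity)?;` compiles silently under plain
        // rustc; it is the pattern the lints above are meant to report.
        let state = try_flush(buffered, capacity)?;
        Ok(state == FlushState::Done)
    }
    ```

    At a crate root the same opt-in would be written as an inner attribute, `#![warn(...)]`.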

    • microgpt 5 hours ago

      Or just by not writing let _ =

      • Twey 4 hours ago

        All recurrent people problems are system problems.

        • microgpt 3 hours ago

          As seen by the fact that forcing the programmer to write let _ = to silence the warning did not fix the bug.

          You know what might've solved this though? Using threads instead of async

    • pwdisswordfishq 6 hours ago

      Ehh, easy fix

          #[allow(clippy::let_underscore_untyped,clippy::let_underscore_must_use)]
          let _ = self.poll_flush(cx)?;

      • Twey 5 hours ago

        I said ‘flagged’, not ‘fixed’ :)

        You can always write the wrong code if you want it enough. But hopefully a warning would have prompted someone to think harder about this flow.

        • pwdisswordfishq 5 hours ago

          But "let _ =" is already an explicit suppression of a must-use warning. Where does this arms race of "no, I really know what I am doing, compiler" versus "no, this really looks like a mistake, programmer" end?

          • Twey 4 hours ago

            That's an excellent question I don't have an answer for in general :)

            IMHO the goal is usually for the compiler not to make these decisions but to provide the tools for the APIs people build to make them. That's kind of passing the buck, though.

            I guess in this case the core problem is that the API for these I/O calls has no representation in the type system for what's happening to the buffer. Proxying it as ‘the programmer must think about this code path’ is a reasonable best-effort but, evidently, sometimes inadequate.

          • tialaramex 4 hours ago

            I do feel like Rust did enough to allow software engineers and their managers to make an explicit choice here.

      • PoignardAzur 4 hours ago

        And this is why you should warn on `clippy::allow_attributes_without_reason` in your projects.

      • lunar_mycroft 6 hours ago

        You can set the lints to `forbid` instead of `deny`, which means they can't be `allowed` like that.

      • nesarkvechnep 6 hours ago

        Yeah, but you must know about them and the possible bug first in order to allow them...

        • Twey 5 hours ago

          Hence ‘sadly’. IMNSHO both of these (or at least _untyped) should be enabled by default. Untyped `let _` is too big a footgun during refactorings.

        • Joker_vD 5 hours ago

          At which point you wouldn't have written this bug in the first place; or the warnings would trigger immediately, you'd change _ to an actual variable and then remove the warning pragmas because now you don't assign to _.

          • Twey 5 hours ago

            `Poll` is marked `#[must_use]` so if you were assigning to something other than `_` you'd get a warning that you're ignoring the `Pending` path. The Clippy lint is only for `_` which Rust considers a use by default.
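
            A small sketch of the `#[must_use]` behaviour on `std::task::Poll`, assuming a hypothetical `poll_flush_stub` in place of a real I/O call. It shows only the two cases that are easy to verify: a bare call statement draws rustc's `unused_must_use` warning, while `let _ =` does not.

            ```rust
            use std::task::Poll;

            // Hypothetical stub: reports whether the (imaginary) buffer drained.
            fn poll_flush_stub(ready: bool) -> Poll<()> {
                if ready {
                    Poll::Ready(())
                } else {
                    Poll::Pending
                }
            }

            pub fn ignore_result(ready: bool) {
                // Written as a bare statement, `poll_flush_stub(ready);` makes rustc
                // warn: unused `Poll` that must be used.
                // Binding to `_` counts as a use, so rustc stays silent here.
                let _ = poll_flush_stub(ready);
            }

            pub fn handle(ready: bool) -> &'static str {
                // Matching forces both variants, including `Pending`, to be considered.
                match poll_flush_stub(ready) {
                    Poll::Ready(()) => "flushed",
                    Poll::Pending => "still buffered",
                }
            }
            ```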

        • turboponyy 3 hours ago

          Not really. If I'm using a linter, I go and configure the strictest possible ruleset, and only disable rules when justified on a need-by-need basis. It's just a matter of discipline.

  • edelbitter 10 hours ago

    Cloudflare does not notice (until a customer complains) that they are sending broken responses at scale? I would have thought they would notice this from sampling and linting a few replies.. just in case they did something like Cloudbleed again.

    • ramon156 7 hours ago

      Can you get reasonable results without exposing sensitive info? I'm asking because I genuinely have no idea what it's like at their scale

  • worldsavior 7 hours ago

    > We spent six weeks chasing a nearly invisible bug — a race condition that occurred only under specific conditions — in the hyper library that impacted how the Images binding returned processed image data back to the client. In the end, it took four lines of code to fix it.

    That's a long time, must be frustrating.

    • gmueckl an hour ago

      It is a long time and it gets frustrating when there is significant time where there is flailing with no visible progress.

      I have had long bug hunts (~a month each) and witnessed ones that took much, much longer. But the longest one I witnessed was drawn out because reproduction was initially unreliable and could take weeks to months. Thankfully, reproduction was by letting a box sit in a corner while the people involved moved on to other tasks. This kept everybody sane.

  • microgpt 7 hours ago

    Would using Rust have prevented this?

    • testdelacc1 2 hours ago

      I get that it’s fun to dunk on Rust when a Rust bug surfaces. But is it a bit petty to bring this out when there’s any type of bug of any severity in any Rust software?

      In this case a small minority of requests were getting truncated responses.

      No one said Rust software is bug free. If someone thinks that they’ve been seriously misled.

    • geodel 4 hours ago

      Agree. This is a warning to people who thought Rust is optional at cloud scale.

    • re-thc 6 hours ago

      Isn't this already Rust?

      • pjmlp 6 hours ago

        That was obviously a joke question, pointing out that Rust isn't the solution for everything.

      • lelanthran 5 hours ago

        Woosh :-)

    • Ygg2 4 hours ago

      No. Anyone expecting that hasn't read the No Silver Bullet essay.

      • tialaramex 4 hours ago

        Actually I suspect that Rust is a Silver Bullet in that sense. That essay seems to be a case where people know of the essay but haven't read it. Normally in English a "Silver Bullet" is something much bigger, a panacea or cure all which entirely solves a problem but in his essay Brooks is talking about order-of-magnitude improvements, and that looks a lot like Rust.

        Brooks was expecting such "Silver Bullet" improvements as often as every few decades, we're arguably overdue significantly. He cites Ada as an example of where such an improvement might come from, well, Rust isn't Ada but a lot of the same ideas about correctness are present.

        Google reports order of magnitude changes from their Rust work for example.

        • HumanOstrich 4 hours ago

          Order of magnitude more complexity.

    • Cthulhu_ 6 hours ago

      The Hyper library in question is a Rust library.

      Did you read the article, or are you a "use rust" parrot / bot based on titles?

      • waysa 6 hours ago

        Sarcasm. (I guess)

        • frankharv 5 hours ago

          Obviously written by a C freak using a BSD

  • pseudony 6 hours ago

    So “fearless concurrency” still only happens when one just decides to not be afraid… :)

    • c0balt 6 hours ago

      This does not appear to be a concurrency bug though?

      • microgpt 5 hours ago

        Of course it's a concurrency bug. It races sending data to the kernel against the kernel sending data to the network. If the wrong one wins the bug occurs.

        • tetha an hour ago

          But it did not take 2 threads within the same application to interact in a bad way on data the system controlled to cause this problem.

          This reads more like an overly broad transition in a deterministic state machine. The fix was to split up a bad transition to shutdown.

          • microgpt 16 minutes ago

            Concurrency bugs don't have to be within a single process.

        • inexcf an hour ago

          Isn't that like saying there can never be a language with safe concurrency since the code could interact with C code that segfaults? I dunno this kinda reminds me of the 10/10 Rust CVE that turned out to be cmd.exe on Windows not sanitizing inputs and languages like Java just labeled it "won't fix".

          • microgpt 14 minutes ago

            You mean the one where Windows doesn't have argv the way Unix does, and instead just has a single string that is interpreted slightly differently by each executable? That is a language making false assertions about how the underlying platform works, causing an impedance mismatch that is impossible to fix.

      • pseudony 5 hours ago

        “ a race condition that occurred only under specific conditions — in the hyper library”

  • nopurpose 6 hours ago

    Nice writeup, but I don't understand how `curl` didn't trigger the bug for them (or any other hyper HTTP server out there), given the explanation in the article.

    `curl --http1.1` sends `Connection: Close`, so the sender (hyper) must attempt to shut down the connection after sending the whole body. Surely any network is slower than a memory copy into the kernel's socket buffers, so it must reliably trigger the condition "buffer flush can't be done in one go" and thus trigger early TCP shutdown.

  • 100ms 9 hours ago

    > The failure was caused by a timing-dependent race condition in hyper’s HTTP/1 connection handling. When the reader was slower and the socket buffer filled, poll_flush returned Poll::Pending, but the dispatch loop discarded that result. Hyper then treated the response as complete and shut down the socket while data remained buffered internally, causing the client to receive an EOF before the full body arrived.

    https://github.com/hyperium/hyper/issues/4022

    Saved you 3000 words
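
    The failure mode quoted above can be sketched as a toy model. This is not hyper's code: `ToyConn`, `socket_room` and both `finish_*` methods are hypothetical names, and the "socket" is just a `Vec` that accepts a fixed number of bytes per flush attempt.

    ```rust
    use std::task::Poll;

    /// Toy connection: `pending` holds bytes the socket has not accepted yet.
    pub struct ToyConn {
        pending: Vec<u8>,
        sent: Vec<u8>,
        socket_room: usize, // bytes the pretend kernel buffer takes per flush
        closed: bool,
    }

    impl ToyConn {
        pub fn new(body: &[u8], socket_room: usize) -> Self {
            ToyConn { pending: body.to_vec(), sent: Vec::new(), socket_room, closed: false }
        }

        /// Moves up to `socket_room` bytes out; `Pending` means data remains.
        fn poll_flush(&mut self) -> Poll<()> {
            let n = self.socket_room.min(self.pending.len());
            self.sent.extend(self.pending.drain(..n));
            if self.pending.is_empty() { Poll::Ready(()) } else { Poll::Pending }
        }

        fn shutdown(&mut self) {
            self.closed = true;
        }

        /// The shape of the bug: the flush result is discarded, then we close.
        pub fn finish_discarding(&mut self) {
            let _ = self.poll_flush();
            self.shutdown();
        }

        /// One possible repair: close only once the flush reports completion.
        pub fn finish_checked(&mut self) -> Poll<()> {
            match self.poll_flush() {
                Poll::Ready(()) => {
                    self.shutdown();
                    Poll::Ready(())
                }
                Poll::Pending => Poll::Pending, // caller is expected to retry later
            }
        }

        /// Bytes still buffered at the moment the connection was closed.
        pub fn bytes_lost(&self) -> usize {
            if self.closed { self.pending.len() } else { 0 }
        }

        pub fn delivered(&self) -> usize {
            self.sent.len()
        }
    }
    ```

    With a 10-byte body and room for 4 bytes per flush, `finish_discarding` closes with 6 bytes still buffered, while `finish_checked` has to be polled three times before it closes.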

  • Thaxll 6 hours ago

    So much for Rust forcing you to handle errors.

    • Matl 5 hours ago

      Go does force you too, but it also supports _ as a bypass - because sometimes you do know better. Just not in this case.

      Rust never promised it'll let programmers turn off their brain, that's what LLMs are for.

    • wongarsu 6 hours ago

      You could argue the bug happened exactly because hyper's poll_flush treats flushing some but not all data as a successful return, not an error case.
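
      To illustrate why "not flushed yet" is not an error case, here is a sketch assuming a flush function shaped like `Poll<io::Result<()>>`; the stub and its `outcome` parameter are hypothetical. Applying `?` to such a value returns early only on `Ready(Err(_))`, and what is left is a `Poll<()>` that can still be `Pending`.

      ```rust
      use std::io;
      use std::task::Poll;

      // Hypothetical stub: 0 = flushed, 1 = partial flush, anything else = I/O error.
      fn poll_flush_stub(outcome: u8) -> Poll<io::Result<()>> {
          match outcome {
              0 => Poll::Ready(Ok(())),
              1 => Poll::Pending,
              _ => Poll::Ready(Err(io::Error::new(io::ErrorKind::BrokenPipe, "peer went away"))),
          }
      }

      pub fn flush_and_describe(outcome: u8) -> io::Result<&'static str> {
          // `?` propagates the error variant only; `Pending` passes straight through.
          let state: Poll<()> = poll_flush_stub(outcome)?;
          Ok(match state {
              Poll::Ready(()) => "flushed",
              Poll::Pending => "not flushed yet",
          })
      }
      ```

      So a line of the form `let _ = poll_flush(cx)?;` does handle the error, and then drops the one value that says whether the data actually went out.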

    • jerf 4 hours ago

      There's a hidden equivocation there. "Handling" errors, as far as the language is concerned, means you do something with them, but explicitly discarding them is most definitely a "something".

      From a human perspective we can consider that not handling the error.

      But the language has no mechanism for "knowing" that discarding the error is wrong. Discarding errors is a fully valid mechanism that we must be able to do in a program because it is sometimes correct. There really isn't even a sensible way to define a way to "force" a user to "handle" errors. The language can only be designed to make it hard to forget to "handle" them somehow in the way the language sees, but it is always possible for the user to incorrectly handle them, of which discarding them when they shouldn't have is only one particularly cognitively-available option but is hardly the full scope of possibilities. Probably isn't even the most common mistake to make, I would imagine there are far more errors that are not handled "correctly" than ones that are spuriously discarded.

      Note I keep saying "language" rather than Rust. All a language can do is surface the issue, and Rust does that. It can't force good code. No language can.

    • atoav 6 hours ago

      You could say the exact same thing about safety belts and airbags in cars after someone has died in a crash.

      Why even bother with measures that prevent many problems if they won't prevent all of them, right?

      • chlorion 5 hours ago

        This is the argument I like too.

        It's the same argument anti-vaxers love to make. "Well you can still get covid after getting the shot", which is something I read and heard quite a lot. That doesn't make the thing useless.

        Humans are really dumb.

  • algoth1 7 hours ago

    I wonder if this bug was found via project glasswing

    • re-thc 6 hours ago

      > I wonder if this bug was found via project glasswing

      Did you read how they said it took weeks? Would run out of tokens at that rate...

  • xacky 3 hours ago

    Yet Cloudflare relies on bugs in browsers to "verify" you.