Java FFM zero-copy transport using io_uring

(mvp.express)

89 points | by mands 6 days ago

41 comments

  • nateb2022 5 hours ago

    This looks like most of it was vibecoded.

    Unnecessary comments like:

      clientChannel.configureBlocking(false); // Non-blocking client
    
    can be found throughout the source, and the project's landing page is a good example of typical SOTA models' outputs when asked for a frontend landing page.
    • Squarex 2 hours ago

      They openly talk about it here https://www.mvp.express/philosophy/ in the section "AI-Assisted Development".

      • koakuma-chan 2 hours ago

        AI openly talks about it.

        • Squarex 3 minutes ago

          Yeah, for sure, the docs scream AI too.

    • krisgenre 5 hours ago

      Okay, but is that a bad thing?

      • sgammon 5 hours ago

        If the author doesn't understand their own code, I probably won't

        • another_twist 4 hours ago

          Vibe coding doesn't mean the author doesn't understand their code. It's likely that they don't want carpal tunnel from typing out trivial code and hence offload that labor to a machine.

          • quietbritishjim 4 hours ago

            JNI for io_uring is not trivial code.

          • sgammon 4 hours ago

            "Vibe-coding" means the author deliberately does not understand their code. "AI-assisted engineering" is what you are thinking of.

      • noitpmeder 5 hours ago

        For your pet project? No. For something you're building for others to use? Almost certainly yes.

        • falcojr 4 hours ago

          You do realize that it's possible to ask AI to write code and then read the code yourself to ensure it's valid, right? I usually try to strip the pointless comments, but it's not the end of the world if people leave them in.

          • simlevesque 4 hours ago

            Yeah but you're leaving out a crucial part: the code is full of useless comments.

            That leaves 2 options:

            - they didn't read the code themselves to ensure it's valid

            - they did read the code themselves but left the useless comments

            No matter which happened, it shows they're a bad developer, and I don't want to run their code.

          • rustman123 4 hours ago

            The comments aren’t the problem.

          • sgammon 4 hours ago

            > I usually try to strip the pointless comments

            You could add your own instead, explaining how things work?

            > It's possible to ask AI to write code and then read the code yourself

            Sure, but then it would not be vibecoding.

            • cbsmith 4 hours ago

              >> It's possible to ask AI to write code and then read the code yourself

              > Sure, but then it would not be vibecoding.

              Wait, what?

              • sgammon 4 hours ago

                AI assisted coding/engineering becomes "vibe coding" when you decide to abdicate any understanding of what you are building, instead focusing only on the outcome

              • the_af 4 hours ago

                Vibe-coding as originally defined (by Karpathy?) implied not reading the code at all, just trying it and pasting back any error codes; repeat ad infinitum until it works or you give up.

                Now the term has evolved into "using AI in coding" (usually with a hint of non-rigor/casualness), but that's not what it originally meant.

    • rohanray 2 hours ago

      Apologies! It's a long read, and this was the one time I did not want to use AI to summarize, for a purpose :) So yes, a lot of the code has been written using AI. I have also been transparent about it on https://www.mvp.express/philosophy/ in the section "AI-Assisted Development".

      However, that does not mean that I as the author do not understand the code/concepts :) I also don't deny the fact that I might not have gone through the entire codebase till now.

      For some background: 1. I have been working in Capital Markets trading, basically FIX (https://www.fixtrading.org/what-is-fix/) systems, for a few years now and have been using QuickFIX/J at my job. 2. At the same time, I have been intrigued by Java FFM, especially after seeing a huge performance gain over idiomatic Java code for a (~500 MB market data) file-processing job which I had written a few months back at my regular work. 3. Fellow FIX developers from the JVM world would know that there are other Java FIX systems that achieve an "extra/huge performance boost" by using Java's sun.misc.Unsafe in several parts of the FIX system and OMS.

      Reflecting on the above 3 points, I had envisioned writing a modern Java FIX engine with 1. 0.0% usage of sun.misc.Unsafe in the entire codebase, and 2. close(-enough) performance to market-leading C/C++ FIX engines. This was around the beginning of this year, 2025. However, a month or two into this effort I realized there are 2 key essential ingredients which dictate the performance, latency, & throughput of the entire system: 1. serialization and 2. transport. By then, I had already written quite a few tests and benchmarks and was amazed by the performance boost from relying solely on FFM; no Unsafe, zero copy, and zero allocations come as byproduct benefits, and of course extremely low GC pauses comparatively.

      Since I had already started using FFM MemorySegment et al. to build the key infra parts of the system, I was of the opinion that restricting these to a FIX system alone would be a crime. Hence, MYRA & MVP.Express were incubated as an idea overnight: modern, safe, lightweight, modular, FFM-oriented, high-performance Java infra libs.

      Well, I have been posting only on Reddit's Java sub till now to get some initial feedback. However, today I noticed a sudden huge inflow of traffic, and that's how I realized it's coming from a post by mands on HN. Thanks mands! I had no intention of posting to HN (yet). No complaints :) I'm glad it made it here, and I appreciate all the feedback.

      A note on why the extensive (un)checked usage of AI to build this: I would like to go breadth-first rapidly, i.e. expanding the ecosystem to let others tinker with it. Work in the pipeline: 1. JIA Cache - a modern, JVM-based, off-heap & safe distributed cache built using the MYRA libs as infra. 2. MVP.Express - a lightweight Java-only RPC system focusing on performance, type safety, schema-driven design, high throughput & low latency, leveraging the MYRA libs & JIA Cache as building blocks. Side note: I am currently on vacation. Once back, I plan to start integrating XDP/eBPF as another backend for myra-transport.

      Agree or not, that's a hell of a lot of work! And that's the reason I am using AI extensively: to quickly build modern FFM-based solutions and validate their existence/purpose through performance and other metrics. Ideally, they should be really good candidates to perhaps displace similar incumbent systems which have a lot of legacy pre-Java-8 code, meaning that even if such existing systems were to be modernized, they would potentially have to be rewritten from scratch using modern Java paradigms. Well, that's what MYRA & MVP.Express are trying to do now as Stage 0, at a rapid pace, to find a market fit!

      Having said that, I am very cautious about the design and guard-rails, which is evident from the extensive test suite & benchmarks every MYRA lib has and will have. I'm trying to follow a close TDD loop here.

      Next stages: If the MYRA libs and related ecosystem seem to be a good fit for modern Java projects, then I and others (it's OSS for a purpose) can also contribute by manually reading (human verification) certain parts of the code that they are experts in. This way, we as a Java community can build modern, forward-looking libs & solutions to power the enterprises for the next decade or two. It may sound silly but I believe in this philosophy and I hope you will too!

      Let's look at it from another (realistic) perspective: I have been working on this for a few months (2 to 3, give or take) alongside my current 9-5, which has been possible only due to AI. TBH, if there were no AI, most probably I would not even have thought of starting this myriad task, since I know that practically I would never have been able to finish, or it would have taken an enormous timeline and I perhaps might have abandoned it half-way!

      Hope this clears some air and brings some honest clarity about the goals & philosophy of MYRA (& myself, seconded). Also, I am not an io_uring/XDP expert, and AI has been really helpful in bringing my vision into reality, although in parallel I am trying to grow my knowledge of the technical nitty-gritties of these tools/technologies. Solely due to AI, I was able to rapidly build something and hence prove that using io_uring has a substantial benefit, evident from benchmarks against Java Netty. That's what I meant earlier by rapidly expanding on the breadth of the ecosystem first and warranting every solution's purpose through benchmarks and other metrics; not to forget that NO Unsafe and NO JNI are golden nuggets as well.

      Last but not least, I am excited by the response here on HN and will stay close by going forward; I will be sharing updates here as well. I would also appreciate all kinds of concerns/feedback/suggestions.

      Thanks -RR

      • lossolo 24 minutes ago

        > However, that does not mean that I as the author do not understand the code/concepts :) I also don't deny the fact that I might not have gone through the entire codebase till now.

        You didn't go through the codebase, but you understand the code? What?

  • jeffreygoesto 6 days ago

    27us roundtrip is not really state of the art for zero copy IPC, about 1us would be. What is causing this overhead?

    • jstimpfle 7 hours ago

      Asking for those who, like me, haven't yet taken the time to find technical information on that webpage:

      What exactly does that roundtrip latency number measure (especially your 1us)? Does zero copy imply mapping pages between processes? Is there an async kernel component involved (like I would infer from "io_uring") or just two user space processes mapping pages?

      • foltik 3 hours ago

        27us and 1us are both an eternity and definitely not SOTA for IPC. The fastest possible way to do IPC is with a shared memory resident SPSC queue.

        The actual (one-way cross-core) latency on modern CPUs varies by quite a lot [0], but a good rule of thumb is 100ns + 0.1ns per byte.

        This measures the time for core A to write one or more cache lines to a shared memory region, and core B to read them. The latency is determined by the time it takes for the cache coherence protocol to transfer the cache lines between cores, which shows up as a number of L3 cache misses.

        Interestingly, at the hardware level, in-process vs inter-process is irrelevant. What matters is the physical location of the cores which are communicating. This repo has some great visualizations and latency numbers for many different CPUs, as well as a benchmark you can run yourself:

        [0] https://github.com/nviennot/core-to-core-latency
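
        The ping/pong measurement described above can be sketched in a single process, using plain AtomicLongs as stand-ins for counters that would, in a real inter-process setup, live in a shared memory mapping. This is an illustrative sketch, not the benchmark from the linked repo; class and method names are made up:

```java
import java.util.concurrent.atomic.AtomicLong;

// Two threads ping-pong a sequence number through shared counters and
// measure the average round trip. When the threads land on different
// cores, the time is dominated by the cache-coherence transfer of the
// two counters' cache lines between those cores.
public class SpscPingPong {
    static final AtomicLong ping = new AtomicLong(-1);
    static final AtomicLong pong = new AtomicLong(-1);

    static long avgRoundTripNanos(int iters) throws InterruptedException {
        Thread echo = new Thread(() -> {
            for (long i = 0; i < iters; i++) {
                while (ping.get() != i) Thread.onSpinWait(); // wait for publish
                pong.set(i);                                 // echo it back
            }
        });
        echo.start();
        long start = System.nanoTime();
        for (long i = 0; i < iters; i++) {
            ping.set(i);                                     // publish sequence i
            while (pong.get() != i) Thread.onSpinWait();     // wait for the echo
        }
        long elapsed = System.nanoTime() - start;
        echo.join();
        return elapsed / iters;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("avg round trip: " + avgRoundTripNanos(100_000) + " ns");
    }
}
```

        Without pinning, results vary with where the scheduler places the two threads; pinning them to specific cores (e.g. via taskset) reproduces the core-to-core numbers in the linked repo.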

        • jstimpfle an hour ago

          I was really asking what "IPC" means in this context. If you can just share a mapping, yes it's going to be quite fast. If you need to wait for approval to come back, it's going to take more time. If you can't share a memory segment, even more time.

          • foltik 8 minutes ago

            No idea what this vibe code is doing, but two processes on the same machine can always share a mapping, though maybe your PL of choice is incapable. There aren’t many libraries that make it easy either. If it’s not two processes on the same machine I wouldn’t really call it IPC.

            Of course a round trip will take more time, but it’s not meaningfully different from two one-way transfers. You can just multiply the numbers I said by two. Generally it’s better to organize a system as a pipeline if you can though, rather than ping ponging cache lines back and forth doing a bunch of RPC.

    • znpy 7 hours ago

      It may or may not be good, depending on a number of factors.

      I did read the original Linux zerocopy papers from Google, for example, and at the time (when using TCP) the juice was worth the squeeze when the payload was larger than 10 kilobytes (or 20? Don't remember right now and I'm on mobile).

      Also a common technique is batching, so you amortise the round-trip time (this used to be the cost of sendmmsg/recvmmsg) over, say, 10 payloads.

      So yeah that number alone can mean a lot or it can mean very little.

      In my experience people that are doing low latency stuff already built their own thing around msg_zerocopy, io_uring and stuff :)

      • hinkley 3 hours ago

        io_uring is a tool for maximizing throughput not minimizing latency. So the correct measure is transactions per millisecond not milliseconds per transaction.

        Little’s Law applies when the task monopolizes the time of the worker. When it is alternating between IO and compute, it can be off by a factor of two or more. And when it’s only considering IO, things get more muddled still.
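
        The throughput-vs-latency point can be put in rough numbers with Little's Law (L = λW, so λ = L/W): at a fixed round-trip latency W, sustained throughput scales with L, the number of requests kept in flight (e.g. SQEs queued before reaping CQEs). The figures below are purely illustrative, not measurements from this project:

```java
// Little's Law: L = lambda * W  =>  lambda = L / W.
// With a fixed round-trip latency, throughput grows with the number of
// requests kept in flight, which is the regime io_uring is built for.
public class LittlesLaw {
    static double throughputPerSec(int inFlight, double latencySeconds) {
        return inFlight / latencySeconds;
    }

    public static void main(String[] args) {
        double w = 27e-6; // assume a 27 us round trip per request
        System.out.printf("1 in flight:  %,.0f req/s%n", throughputPerSec(1, w));  // ~37,037
        System.out.printf("64 in flight: %,.0f req/s%n", throughputPerSec(64, w)); // ~2,370,370
    }
}
```

        As the comment above notes, this idealizes things: when each request alternates between IO and compute, effective concurrency (and hence throughput) can be off by a factor of two or more.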

    • rohanray 5 days ago

      It's not exactly local IPC. The roundtrip benchmark stat is for a TCP server-client ping/pong call using a 2 KB payload, though the TCP connection is over local loopback (127.0.0.1).

      Source: https://github.com/mvp-express/myra-transport/blob/main/benc...

    • blibble 7 hours ago

      indeed, you can get a packet from one box to another in 1-2us

      • steeve 4 hours ago

        with io_uring? How? I tried everything in the book

  • exabrial an hour ago

    Impressive. I'm sure the numbers will continue to improve as both the FFM and this project mature.

    Java Native databases or KVP stores would be good usage targets IMHO

  • rohanray 7 hours ago

    It's not exactly local IPC. The roundtrip benchmark stat is for a TCP server-client ping/pong call using a 2 KB payload, though the TCP connection is over local loopback (127.0.0.1).

    The payload is encoded using myra-codec FFM MemorySegment directly into a pre-registered buffer in io_uring SQE on the server. Similarly, on the client side CQE writes encoded payload directly into a client provided MemorySegment. The whole process saves a few SYSCALLs. Also, the above process is zero copy.
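
    I don't know myra-codec's actual API, but the "encode directly into a pre-registered buffer" shape described above can be sketched with plain java.lang.foreign (JDK 22+): fields are written straight into an off-heap MemorySegment, which the submission path could hand to io_uring as a registered buffer, with no intermediate heap buffer or copy on the Java side. All class and method names below are illustrative; the io_uring registration/submission itself is not shown:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Sketch: encode a fixed-layout ping message (sequence + timestamp)
// directly into a pre-allocated off-heap buffer.
public class FfmEncodeSketch {
    static int encodePing(MemorySegment buf, long seq, long tsNanos) {
        buf.set(ValueLayout.JAVA_LONG, 0, seq);      // bytes 0..7: sequence number
        buf.set(ValueLayout.JAVA_LONG, 8, tsNanos);  // bytes 8..15: timestamp
        return 16;                                   // total bytes written
    }

    static long decodeSeq(MemorySegment buf) {
        return buf.get(ValueLayout.JAVA_LONG, 0);
    }

    // Encode into an off-heap segment and read the sequence back out.
    static long roundTrip(long seq) {
        try (Arena arena = Arena.ofConfined()) {
            // Stand-in for a pre-registered 2 KB buffer; 8-byte aligned for JAVA_LONG.
            MemorySegment buf = arena.allocate(2048, 8);
            encodePing(buf, seq, System.nanoTime());
            return decodeSeq(buf);
        }
    }

    public static void main(String[] args) {
        System.out.println("round-tripped seq = " + roundTrip(42L)); // prints 42
    }
}
```

    The same MemorySegment can back both the encode side and the receive side, which is what makes the "decode straight out of the CQE buffer" claim plausible without any heap copies.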

    Source: https://github.com/mvp-express/myra-transport/blob/main/benc...

    P.S.: I had posted this as a reply to jeffrey but am not able to see it. Hence, I'm reposting it as a direct reply to the main post for visibility as well.

    Disclaimer: I am the author of https://mvp.express. I would love feedback and critical suggestions/advice.

    Thanks -RR

    • owl_might 3 hours ago

      Did you vibecode this entire thing? That's clearly the impression it gives. I haven't seen a single line of text or code in this entire organization that looks human.

      Do you have the skills to verify what the AI has generated, and are you confident that everything works as advertised?

    • refulgentis 5 hours ago

      Pretty much what NateB said* - but that might leave you at "what's wrong with that? that's how I could get it done"

      There's WAY too much content, way too many names, and stuff that feels subtly off. I'm 37, been on this site for 16 years. I'm assuming the target audience here is enterprise Java developers, which isn't my home, so I'm sure I'm missing some stuff that is idiomatic in that culture.

      But the vast, vast amount of things that are completely unfamiliar tells me something else is going on and it's not good.

      Like I bet this is f'ing cool, otherwise you wouldn't put in the effort to share it. But you're better off having something super brief** in a GitHub README than a pseudo-marketing site that's straining to fit a cool technical thing into the wrong template.

      * https://news.ycombinator.com/item?id=46255661

      ** what you wrote is great! "The payload is encoded using myra-codec FFM MemorySegment directly into a pre-registered buffer in io_uring SQE on the server. Similarly, on the client side CQE writes encoded payload directly into a client provided MemorySegment. The whole process saves a few SYSCALLs. Also, the above process is zero copy." -- then the site looks like it wants to sell N different products and confusing flowcharts, but really, you're just geeked out and did something cool and want to share the technical details. So it's designed for the wrong audience.

  • TheGuyWhoCodes 5 hours ago

    In my opinion, including Kryo in the benchmark is somewhat disingenuous, as it does not require a message schema definition while MyraCodec/SBE/FlatBuffers do.

    The only one that is schemaless and zero copy is Apache Fory, which is missing from the benchmark.

    • rohanray 2 hours ago

      I had added Kryo since it seems to be the fastest Java serialization library which does not use sun.misc.Unsafe.

      Thanks for sharing Apache Fory! Will try to add that to the benchmark as well.

  • DarkmSparks 3 hours ago

    Most of it seems to be 404ing now

    • rohanray 2 hours ago

      Oh! That shouldn't be the case :( Please let me know if you are still facing 404. I just checked and no alerts from my monitoring yet.

      Thanks for letting me know though!