RX – a new random-access JSON alternative

(github.com)

141 points | by creationix 2 days ago ago

104 comments

btown 2 days ago ago

This is really interesting. At first glance, I was tempted to say "why not just use sqlite with JSON fields as the transfer format?" But everything about that would be heavier-weight in every possible way - and if I'm reading things right, this handles nested data that might itself be massive. This is really elegant.
My one eyebrow raise is - is there no binary format specification? https://github.com/creationix/rx/blob/main/rx.ts#L1109 is pretty well commented, but you can't call it a JSON alternative without having some kind of equivalent to https://www.json.org/ in all its flowchart glory!

[-]
- creationix a day ago ago
  
  Thanks. I had this for older versions, but forgot to write it up again for the latest version.
  One old version that is meant to be more human readable/writable is jsonito
  https://github.com/creationix/jsonito
  I'll add similar diagrams and docs for the format itself here.
  
  [-]
  - creationix a day ago ago
    
    Initial format docs are now here:
    https://github.com/creationix/rx/blob/main/docs/rx-format.md
    Railroad diagrams will come later when I have more time.
    
    [-]
    - btown 8 hours ago ago
      
      Neat! In case you took me too literally: railroad diagrams are fun, but far from the only way to give spec level clarity, so don’t feel you need to overindex on my silly comment!
      I am curious why it’s parsed right to left. Is this so that you could add new data to a top-level JSONL-esque list, solely by rewriting the end of the data structure, and not needing to change the beginning (or worst-case shift every single byte of data, if you need a longer count)?
      It’s an interesting design tradeoff, because you can’t show a partial parse if you’re streaming the content naively beginning to end, which is a bit odd in a world where streams that begin to render token-by-token are all the rage.
      But if you have an ability to do range queries, it’s quite effective, and it does allow for those incremental updates!
      
      [-]
      - creationix 8 hours ago ago
        
        Tha main reason for the reverse encoding is it makes it easier on the writer. You simply do a depth-first traversal of the data graph and emit data on the way back up the stack. Zero buffering is needed since this naturally means you write contents before the length prefix.
        But it does open up a future direction I want to make with mutable datasets using append-only persistent data structures. The chain primitive is currently only used for strings, but it will be used to do the equivalent of `{...oldObj, ...newObj}` as a single chain `(pointerToOldObj, newObj)`.
        With chains and pointers, you can write new versions of a dataset and reuse all the existing values that are unchanged. This, combined with random-access reads and fixed-block caching makes for a fairly complete MVCC database.
      - creationix 7 hours ago ago
        
        And don't worry about railroad diagrams. I already intended to create them, I've just been extra busy this week with other things.
Levitating 2 days ago ago

JSON is human-readable, why even compare it with this. Is any serialization format now just a "JSON alternative"?

[-]
- jy14898 2 days ago ago
  
  Came to the same conclusion the moment I had to hunt to see the outputs https://github.com/creationix/rx/tree/main/samples
  
  [-]
  - ok123456 a day ago ago
    
    Reminds me of https://en.wikipedia.org/wiki/PHP_serialization_format
  - SV_BubbleTime a day ago ago
    I was instantly suspicious that a “new better format” for serialization didn’t open with the input/output. And this is why (fucking lol, gtfo):
```
    Q^mSat,3^b:d+s+E,4Fri,3^u:h+k+u,6Thu,3^P:j+
```
    If you are effectively going binary, do it. CBOR or Protobuf or any dozen other binary serializations that would be far more efficient.
    The author claims this is because of copy and pasting… cool, remind me what BASE64 is again?
    [-]
    - creationix a day ago ago
      
      It is also a format that can be read as-is without any preprocessing. In some cases base64 can do that, and this format does make heavy use of base64 varints.
      Sure, you can encode as JSON, then compress with gzip and then base64 encode. You'll probably end up with something smaller than rx and be extremely safe to copy-paste. But your consumers are going to consume orders of magnitude more CPU reading data from this document.
      RX is usable as-is, is compressed, and is copy-pasteable. It's the unique combination of properties that makes it interesting.
      
      [-]
      - SV_BubbleTime a day ago ago
        
        >It is also a format that can be read as-is without any preprocessing.
        >Q^mSat,3^b:d+s+E,4Fri,3^u:h+k+u,6Thu,3^P:j+
        My man… no. I have no doubt you could kind of figure out what that sample is hot off the heels of writing this, and likely not in six months. And to consider that anyone else would fill their brain with the rules to decipher that, Nah 2.0.
        
        [-]
        
        creationix 21 hours ago ago
        
        I meant computers can read it without any preprocessing. It's random access. You don't need to parse it, you don't need to decompress it. You just start at the end and follow pointers till you get to the desired value.
        Even a trivial doc like this is challenging for me to read as a human.
        
        [-]
        
        112233 17 hours ago ago
        
        But... what sort of storage device does not allow your computers to use all 256 byte values? Why is random access data stored on teletype?
        
        [-]
        
        creationix 7 hours ago ago
        
        > what sort of storage device does not allow your computers to use all 256 byte values
        - clipboards
        - logs
        - terminal output
        - alerts
        - yaml configs
        - JSON configs
        - hacker news comments
        - markdown documentation
        - etc...
        I assure you, this is not a solution looking for a problem. I started out with binary encodings first, but then realized it limits so many workflows.
        
        hombre_fatal 18 hours ago ago
        
        Ick, why are you talking to another person like this?
        > Nah.com, fam.
- creationix 2 days ago ago
  
  - this encodes to ASCII text (unless your strings contain unicode themselves) - that means you can copy-paste it (good luck doing that with compressed JSON or CBOR or SQLite - there is a scale where JSON isn't human readable anymore. I've seen files that are 100+MB of minified JSON all on a single very long line. No human is reading that without using some tooling.
  
  [-]
  - bawolff 2 days ago ago
    
    That kind of feels a bit worst of both worlds. None of the space savings/efficiency of binary but also no human readability.
    Being able to copy/paste a serialization format is not really a feature i think i would care about.
    
    [-]
    - creationix a day ago ago
      
      It's a gradient. I did design several binary formats first, but for my use cases, this is actually better. There is nuance to various use cases.
      > None of the space savings/efficiency of binary
      For string heavy datasets, it's nearly the same encoding size as binary. I get 18x smaller sizes compared to JSON for my production datasets. This was originally designed as a binary format years ago (https://github.com/creationix/nibs) and then later after several iterations, converted to text.
      > Being able to copy/paste a serialization format is not really a feature i think i would care about
      Imagine being paged at 3am because some cache in some remote server got poisoned with a bad value (unrelated to the format itself). You load the value in dashboard, but it's encoded as CBOR or some binary format and so you have to download it in a binary safe way, upload that binary file to some tooling or install a cbor reader to your CLI. But then you realize that you don't have exec access to the k8s pods for security reasons, but do have access to a web-based terminal. Again, to extract a binary value you would need to create a shell, hexdump the file and somehow copy-paste that huge hexdump from the web-based terminal to your local machine, un-hex dump it, and finally load it into some CBOR reader.
      A text format, however is as simple as copy-paste the value from the dashboard and paste into some online tool like https://rx.run/ to view the contents.
  - mpeg a day ago ago
    
    if one of the advantages is making it copy-pastable then I would suggest the REXC viewer should give you the option to copy the REXC output, currently I have no way of knowing this by looking at your github or demo viewer
    another thing, I put in a 400KB json and the REXC is 250KB, cool, but ideally the viewer should also tell me the compressed sizes, because that same json is 65kb after zstd, no idea how well your REXC will compress
    edit: I think I figured out you can right click "copy as REXC" on the top object in the viewer to get an output, and compressed it, same document as my json compressed to 110kb, so this is not great... 2x the size of json after compression.
    
    [-]
    - creationix a day ago ago
      
      Thanks for testing it out! Yes, the website could use some love to make everything more discoverable.
      The primary use case is not compression, it's just a nice side effect of the deduplication. This will never beat something like zstd, brotli, or even gzip.
      My production use cases are unique in that I can't afford the CPU to decompress to JSON and then parse to native objects. But with this format, I can use the text as-is with zero preprocessing and as a bonus my datasets are 18x smaller.
    - creationix a day ago ago
      
      > 2x the size of json after compression
      Right and that makes sense. There is more information in here. The entire thing is length prefixed and even indexed for O(1) array lookups and O(log2 N) object lookups.
      If you don't care about random access and you don't mind the overhead of decompression, don't use RX.
      
      [-]
      - mpeg a day ago ago
        
        I think this makes sense, when you explain it like that, it might be a matter of cleaning up the docs a bit so the "why" of RX is more clear (admittedly, a README is not always the best channel for this!)
        
        [-]
        
        creationix a day ago ago
        
        I've rewritten the framing in the README to first explain when you should use RX and when you should not. Most uses of JSON should probably stay JSON.
        Let me know what you think
        https://github.com/creationix/rx/blob/main/README.md#when-to...
  - rendaw 2 days ago ago
    
    Are there any examples? If it's ASCII I'd expect to see some of the actual data in the readme, not just API.
    Unless, to read that correctly, it only has a text encoding as long as you can guarantee you don't have any unicode?
    
    [-]
    - creationix a day ago ago
      
      > it only has a text encoding as long as you can guarantee you don't have any unicode?
      The format is technically a binary format in that length prefixes are counts of bytes. But in practice it is a textual format since you can almost always copy-paste RX values from logs to chat messages to web forms without breaking it.
      unciode doesn't break anything since strings are encoded as raw unicode with utf-8 byte length prefixes. It supports unicode perfectly.
      If your data only contains 7-bit ASCII strings, the entire encoding is ASCII. If your data contains unicode, RX won't escape it, so the final encoding will contain unicode as UTF-8.
    - creationix a day ago ago
      
      oh, sorry about that. I forgot to include the description of the format with examples.
      I did add some small examples to the repo.
      https://github.com/creationix/rx/blob/main/samples/quest-log...
      The older, slightly outdated, design spec is in the older rex repo (this format was spun out of the rex project when I realized it's actually a good standalone format)
      https://github.com/creationix/rex/blob/main/rexc-bytecode.md
      
      [-]
      - SV_BubbleTime a day ago ago
        
        'fdiscovered,aextreme,7danger,6+1A+16;6level_range,b:QThe Heap ,d'th
        Oof.
        
        [-]
        
        dontdoxxme a day ago ago
        
        Very similar to bittorrent’s bencode. That has the benefit that it has a canonical encoding which this doesn’t (because of the different compression options). I wouldn’t be put off by how it looks as text.
        
        [-]
        
        creationix 21 hours ago ago
        
        Very true. I had forgotten about bencode, I should read up on that again.
        It makes sense they need a canonical form because they want same values to have same content hashes.
  - kukkamario 2 days ago ago
    
    You don't want to copy-paste anything like that as text anyway. Just copy and paste files.
    No human is reading much data regardless of the format.
    What is the benefit over using for example BSON?
    
    [-]
    - creationix 7 hours ago ago
      
      > Just copy and paste files
      If all your workflows allow copying as binary files, more power to you! But there are a lot of workflows where that is not possible. This was inspired by years of hands-on operational incident handling in production systems. Every time we use a binary format, it's extra painful.
      This particular format would be slightly more compact as binary, but not enough to justify closing the door on all the use cases that would preclude.
      I'll probably add a binary variant for people who prefer that (or for people who want to be able to embed binary values in the data without base64 encoding it)
  - soco a day ago ago
    
    I have an idea, why don't we all go back using XML at this point, as any initial selling point / differentiator has been slowly eroded away?
- creationix a day ago ago
  
  Thanks for the feedback. I've improved the framing to make the purpose/value more clear. What do you think about "RX is a read-only embedded store for JSON-shaped data"?
  https://www.npmjs.com/package/@creationix/rx
- Gormo a day ago ago
  
  It's also quite odd to create a serialization format optimized for random access.
  
  [-]
  - creationix a day ago ago
    
    Serialized just means encoded as a stream of bytes so that it can be transferred between systems. There are absolutely cases where you want to be able to query a value directly like a database instead of parsing the entire thing to memory before you can read it. Think of this as no-sql sqlite.
    
    [-]
    - Gormo 19 hours ago ago
      
      > Serialized just means encoded as a stream of bytes so that it can be transferred between systems.
      Yes, serially. Which means no random-access across the transfer channel.
  - j16sdiz a day ago ago
    
    many serialization format are just a memory structure dump.
  - IshKebab a day ago ago
    
    Not at all. What makes you say that?
- dietr1ch 2 days ago ago
  
  cat file.whatever | whatever2json | jq ?
  (Or to avoid using cat to read, whatever2json file.whatever | jq)
  
  [-]
  - Gormo a day ago ago
    
    That's not really random access, though. You're effectively just searching through the entire dataset for every targeted read you're after.
    What might be interesting is to have a tool that processes full JSON data and creates a b-tree index on specified keys. Then you could run searches against the index that return byte offsets you can use for actual random access on the original JSON.
    OTOH, this is basically just recreating a database, just using raw JSON as its storage format.
    
    [-]
    - creationix a day ago ago
      
      > What might be interesting is to have a tool that processes full JSON data and creates a b-tree index on specified keys. Then you could run searches against the index that return byte offsets you can use for actual random access on the original JSON.
      I did build that once. But keeping track of the index is a pain. Sometimes I was able to generate the index on-demand and cache it in some ephemeral storage, but overall it didn't work out so well.
      This system with RX will work better because I get the indexes built-in to the data file and can always convert it back to JSON if needed.
    - dietr1ch a day ago ago
      
      Well, JSON had no random access to begin with, so maybe that's on needing JSON.
      Maybe a query over the random-access file then converted into JSON would work?
  - creationix 2 days ago ago
    
    Or in this case, just do `rx file.rx` It has jq like queries built in and supports inputs with either rx or json. Also if you prefer jq, you can do `rx file.rx | jq`
    
    [-]
    - dietr1ch a day ago ago
      
      wow, on that case then using `jq` is just a presentation preference at the very last step unless jq is more expressive (which might be the case given how long it has been around?).
      
      [-]
      - creationix a day ago ago
        
        right, the jq query language is much more complex and featureful than the simple selector syntax I added to the rx-cli. But more could be added later as needed or it could just stream JSON output. It would be pretty trivial to hook up a streaming JSON encoder to rx-cli which could then pipe to jq for low-latency lookups. The problem is jq would need to JSON parse all that data which will be expensive.
garrettjoecox 2 days ago ago

Very cool stuff!
This did catch my eye, however: https://github.com/creationix/rx?tab=readme-ov-file#proxy-be...
While this is a neat feature, this means it is not in fact a drop in replacement for JSON.parse, as you will be breaking any code that relies on the that result being a mutable object.

[-]
- creationix 2 days ago ago
  
  True, the particular use case where this really shines is large datasets where typical usage is to read a tiny part of it. Also there is no reason you couldn't write an rx parser that creates normal mutable objects. It could even be a hybrid one that is lazy parsed till you want to turn it mutable and then does a normal parse to normal objects after that point.
dtech 2 days ago ago

It's not quite clear to me why you'd use this over something more established such as protobuf, thrift, flatbuffers, cap n proto etc.

[-]
- maxmcd 2 days ago ago
  
  Those care about quickly sending compact messages over the network, but most of them do not create a sparse in-memory representation that you can read on the fly. Especially in javascript.
  This lib keeps the compact representation at runtime and lets you read it without putting all the entities on the heap.
  Cool!
  
  [-]
  - creationix a day ago ago
    
    Exactly. Low heap allocations when reading values is one of the main driving factors in this design!
  - IshKebab a day ago ago
    
    Amazon Ion has some support for this - items are length-prefixed so you can skip over them easily.
    It falls down if you have e.g. an array of 1 million small items, because you still need to skip over 999999 items to get to the last one. It looks like RX adds some support for indexes to improve that.
    I was in this situation where we needed to sparsely read huge JSON files. In the end we just switched to SQLite which handles all that perfectly. I'd probably still use it over RX, even though there's a somewhat awkward impedance mismatch between SQL and structs.
    
    [-]
    - creationix a day ago ago
      
      I did seriously consider SQLite, but my existing datasets don't map easily to relational database tables. This is essentially no-sql for sqlite.
- konart 2 days ago ago
  
  What if you are reading from a service which already have an established API?
  It's not like you can just tell them to move to protobuf.
  
  [-]
  - SV_BubbleTime a day ago ago
    
    What about CBOR that can retain JSON compatibility?
    If you are working with an end you don’t control, this “newer better” format isn’t in your cards either.
    
    [-]
    - creationix a day ago ago
      
      How does CBOR retain JSON compatibility more than RX?
      RX can represent any value JSON can represent. It doesn't even lose key order like some random-access formats do.
      In fact, RX is closer to JSON than CBOR.
      Take decimals as an example:
      JSON numbers are arbitrary precision numbers written in decimal. This means it can technically represent any decimal number to full precision.
      CBOR stores numbers as binary floats which are appriximations of decimal numbers. This is why they needed to add Decimal Fractions (Tag 4)
      RX already stores as decimal base and decimal power of 10. So out of the box, it matches JSON
barishnamazov 2 days ago ago

You shouldn't be using JSON for things that'd have performance implications.

[-]
- creationix 2 days ago ago
  
  As with most things in engineering, it depends. There are real logistical costs to using binary formats. This format is almost compact as a binary format while still retaining all the nice qualities of being an ASCII friendly encoding (you can embed it anywhere strings are allowed, including copy-paste workflows)
  Think of it as a hybrid between JSON, SQLite, and generic compression. This format really excels for use cases where large read-only build artifacts are queried by worker nodes like an embedded database.
  
  [-]
  - Asmod4n 2 days ago ago
    
    The cost of using a textual format is that floats become so slow to parse, that it’s a factor of over 14 times slower than parsing a normal integer. Even with the fastest simd algos we have right now.
    
    [-]
    - HelloNurse 2 days ago ago
      
      So it depends. Float parsing performance is only a problem if you parse many floats, and lazy access might reduce work significantly (or add overhead: it depends).
      
      [-]
      - creationix a day ago ago
        
        Exactly. My for use cases, this format is amazing. I have very few floats, but lots and lots of objects, arrays and strings with moderate levels of duplication and substring duplication. My data is produced in a build and then read in thousands or millions of tiny queries that lookup up a single value deep inside the structure.
        rx works very well as a kind of embedded database like sqlite, but completely unstructured like JSON.
        Also I'm working on an extension that makes it mutable using append-only persistent data structures with a fixed-block caching level that is actually a pretty good database.
    - creationix a day ago ago
      
      if you data is lots and lots of arrays of floats, this is likely not the format for you. Use float arrays.
      Also note it stores decimal in a very compact encoding (two varints for base and power of 10)
      That said, while this is a text format, it is also technically binary safe and could be extended with a new type tag to contain binary data if desired.
    - meehai 2 days ago ago
      
      and with little data (i.e. <10Mb), this matters much less than accessibility and easy understanding of the data using a simple text editor or jq in the terminal + some filters.
      
      [-]
      - xxs 2 days ago ago
        
        what do you mean by little data, most communication protocols are not one off
      - creationix 21 hours ago ago
        
        Also good luck parsing 10 MiB of JSON in a loop that can't tolerate blocking the CPU for more than 10ms.
        What's expensive is very relative to the use case.
- hrmtst93837 2 days ago ago
  
  That rule sounds clean until the DB dump, API trace, or langauge boundary lands in your lap. Binary formats are fine for tight inner loops, but once the data leaks into logs, tooling, support, or a second codebase, the bytes you saved tend to come back as time lost decoding some bespoke mess.
  
  [-]
  - creationix a day ago ago
    
    Yep. I did try binary formats first. I tried existing ones like CBOR, I tried making my own like Nibs. The text encoding is an operational concern, not a technical one.
    This is the same reason I've been advocating for JSONL at work. It's not ideal technically, but it's a good balance of technically good enough while being also human friendly when things go wrong.
    - https://vercel.com/blog/how-we-made-global-routing-faster-wi... - https://vercel.com/blog/scaling-redirects-to-infinity-on-ver...
    RX is one step towards less human friendly, but more machine friendly. I try to keep things balanced in my designs.
- squirrellous 2 days ago ago
  
  I agree in principle. However JSON tooling has also got so good that other formats, when not optimized and held correctly, can be worse than JSON. For example IME stock protocol buffers can be worse than a well optimized JSON library (as much as it pains me to say this).
  
  [-]
  - tabwidth 2 days ago ago
    
    Yeah the raw parse speed comparison is almost a red herring at this point. The real cost with JSON is when you have a 200MB manifest or build artifact and you need exactly two fields out of it. You're still loading the whole thing into memory, building the full object graph, and GC gets to clean all of it up after. That's the part where something like RX with selective access actually matters. Parse speed benchmarks don't capture that at all.
    
    [-]
    - magicalhippo a day ago ago
      
      > The real cost with JSON is when you have a 200MB manifest or build artifact and you need exactly two fields out of it.
      There are SAX-like JSON libraries out there, and several of them work with a preallocated buffer or similar streaming interface, so you could stream the file and pick out the two fields as they come along.
      
      [-]
      - IshKebab a day ago ago
        
        You still have to parse half the entire file on average. Much slower than formats that support skipping to the relevant information directly.
        
        [-]
        
        creationix a day ago ago
        
        yep, this is exactly the kind of use case that caused me to design this format.
    - xxs 2 days ago ago
      
      as parser: keep only indexes to the original file (input), dont copy strings or parse numbers at all (unless the strings fit in the index width, e.g. 32bit)
      That would make parsing faster and there will be very little in terms on tree (json can't really contain full blow graphs) but it's rather complicated, and it will require hashing to allow navigation, though.
      
      [-]
      - creationix a day ago ago
        
        yep. I built custom JSON parsers as a first solution. The problem is you can't get away from scanning at least half the document bytes on average.
        With RX and other truly random-access formats you could even optimize to the point of not even fetching the whole document. You could grab chunks from a remote server using HTTP range requests and cache locally in fixed-width blocks.
        With JSON you must start at the front and read byte-by-byte till you find all the data you're looking for. Smart parsers can help a lot to reduce heap allocations, but you can't skip the state machine scan.
- Spivak 2 days ago ago
  
  Can you imagine if a service as chatty and performance sensitive as Discord used JSON for their entire API surface?
dietr1ch a day ago ago

A tiny note on the speed comparison: The 23,000x faster single-key lookup seems a bit misleading to me.
Once you get the computational complexity advantage, then you can make it as much times faster as you want. In these cases small instances matter to judge constants, and to the average (mean?) user, mean instance sizes.
I'm not sure how to sell the advantage succinctly though. Maybe just focus on "real-world" scenarios, but there's no footnote with details on the comparison

[-]
- creationix a day ago ago
  
  That benchmark is a fair comparison for a real-world production workload and use case. Sadly I can't share the details. But suffice it to say that the dataset is a huge object with tens of thousands of paths as keys and moderately large objects as values (averaging around 3KB of JSON each) all with slightly different shapes. The use is reading just a few entries by path an then looking up some properties within those entries.
  The benchmark (or is supposed to) measures end-to-end parse + lookup.
  JSON: 92 MB RX: 5.1 MB
  Request-path lookup: ~47,000x faster
  Time to decode a manifest and look up one URL path:
  JSON: 69 ms REXC: 0.003 ms
  Heap allocations: 2.6 million vs. 1
  JSON: 2,598,384 REXC: 1 (the returned string)
50lo 2 days ago ago

The biggest challenge for formats like this is usually tooling. JSON won largely because: every language supports it, every tool understands it.
Even a technically superior format struggles without that ecosystem.

[-]
- latexr 2 days ago ago
  
  And that in turn affects tool adoption. I have dabbled in Lua for interacting with other software such as mpv, but never got much into the weeds with it because it lacks native JSON support, and I need to interact with JSON all the time.
  
  [-]
  - creationix a day ago ago
    
    yeah, LuaJIT is one of the use cases I had in mind working on this. JSON is pretty fast in modern JS engines, but in Lua land, JSON kinda sucks and doesn't really match the language without using virtual tables.
    JSON has `null` values with string keyds, but lua doesn't have `null`. It has `nil`, but you can't have a key with a nil value. Setting nil deletes the key
    Lua tables are unordered. But JS and JSON are often ordered and order often matters.
    RX, however matches Lua/LuaJIT extremely well and should out-perform the JS Proxy based decoder using metatables. Since it's using metatables anyway do to the lazy parsing, it's trivial to do things like preserve order when calling `pairs` and `ipairs` and even including keys with associated null values.
    You can round trip safely in Lua, which is not easy with most JSON implementations.
jbverschoor 2 days ago ago

So this is two things? A BSON-like encoding + something similar to implementing random access / tree walker using streaming JSON?
Docs are super unclear.
_flux 2 days ago ago

It doesn't seem the actual serialization format is specified? Other than in the code that is.
Is it versioned? Or does it need to be..
killbot5000 a day ago ago

The documentation reference a “decode” function, and it’s imported to the example code, but it’s never called. I’m not sure what the API is after reading the examples.
creationix 2 days ago ago

A new random-access JSON alternative from the creator of nvm.sh, luvit.io, and js-git.
pshirshov 2 days ago ago

Looks similar to https://github.com/7mind/sick

[-]
- creationix a day ago ago
  
  You're right. Some important differences:
  sick is binary, rx is textual (this matters for tooling)
  sick has size limits (65534 max keys for example. I have real-world rx datasets reaching this size already) rx uses arbitrary precision variable-length b64 integers. There are no size limits anywhere inherit in the format, just in implementations.
  sick does not preserve object key order rx preserves object key order, but still implements O(log2 N) lookups for object keys.
  etc.
bsimpson a day ago ago

It feels petty to show up with a naming not, but the name is unfortunately/confusingly similar to the already well-known RxJS.
Why is it called RX?

[-]
- creationix a day ago ago
  
  I'm happy to hear suggestions. This format was actually the internal .rexc bytecode for Rex (routing expressions), but when I realized it was actually a pretty good standalone format, I renamed it `.rx` for short. I am aware of RxJS, but I think that `rx-format` is different enough and `.rx` file extensions are unique enough, it's not too confusing.
WatchDog 2 days ago ago

Cool project.
The viewer is cool, took me a while to find the link to it though, maybe add a link in the readme next to the screenshot.
TKAB a day ago ago

could this be useful for embedding info in server generated web pages that are then picked up by a JavaScript. e.g. a tom-select country picker that gets its data from an embedded RX structure?

[-]
- creationix a day ago ago
  
  yes, this would work very well for any case where you have embedded databases of unstructured data that you want to query in a website or edge server
Spivak 2 days ago ago

I love these projects, I hope one of them someday emerges as the winner because (as it motivates all these libraries' authors) there's so much low hanging fruit and free wins changing the line format for JSON but keeping the "Good Parts" like the dead simple generic typing.
XML has EXI (Efficient XML Interchange) for precisely the reason of getting wins over the wire but keeping the nice human readable format at the ends.

[-]
- snthpy 2 days ago ago
  
  TIL.
  EXI looks useful. Now I just wish there was a renderer in the pugjs format as I find that terse format much pure readable than verbose XML. I also find indentation based syntax easier to visually parse hierarchical structure.
transfire 2 days ago ago

I am a little confused. Is this still JSON? Is it “binary“ JSON?

[-]
- SV_BubbleTime a day ago ago
  
  It’s neither!
  Sample output:
  'fdiscovered,aextreme,7danger,6+1A+16;6level_range,b:QThe Heap ,d'th
  Human unreadable, ascii output. Line up and get yours today!
  
  [-]
  - creationix a day ago ago
    
    it's not really possible to stay human readable and get the compression levels and random access properties I was going for. But it is as human tooling friendly as possible given the constraints.
    
    [-]
    - SV_BubbleTime a day ago ago
      
      >it's not really possible
      I find it obvious that your first attempt failed. Try again, you have not even remotely failed enough if you are making the argument that this is kinda readable. Yes, ascii words are easy to pick out, you didn’t do that, you did the part that makes it all harder.
benatkin 2 days ago ago

Interesting. I've heard about cursors in reference to a Rust library that was mentioned as being similar to protobuf and cap'n proto.
Does this duplicate the name of keys? Say if you have a thousand plain objects in an array, each with a "version" key, would the string "version" be duplicated a thousand times?
Another project a lot of people aren't aware of even though they've benefitted from it indirectly is the binary format for OpenStreetMap. It allows reading the data without loading a lot of it into memory, and is a lot faster than using sqlite would be.
Edit: the rust library I remember may have been https://rkyv.org/

[-]
- creationix a day ago ago
  
  > Does this duplicate the name of keys?
  Yes, the format allows for objects to be stored with a pointer to a shared schema (either an array of keys or another object that has the desired keys)
  The current implementation is pretty close to ideal when deciding to use this encoding.
gritzko 2 days ago ago

I recently created my own low-overhead binary JSON cause I did not like Mongo's BSON (too hacky, not mergeable). It took me half a day maybe, including the spec, thanks Claude. First, implemented the critical feature I actually need, then made all the other decisions in the least-surprising way.
At this point, probably, we have to think how to classify all the "JSON alternatives" cause it gets difficult to remember them all.
Is RX a subset, a superset or bijective to JSON?
https://github.com/gritzko/librdx/tree/master/json

[-]
- creationix a day ago ago
  
  The current format version is the exact same feature set as JSON. I even encode numbers as arbitrary precision decimals (which JSON also does). This is quite different from CBOR which stores floats in binary as powers of 2.
  I could technically add binary to the format, but then it would lose the nice copy-paste property. But with the byte-aware length prefixes, it would just work otherwise.
- SV_BubbleTime a day ago ago
  
  You went from BSON to your own and skipped CBOR and Protobuf? … I wonder if you would have made different decisions without Claude vibing you in a direction?
NoSalt a day ago ago

Why do we need an "alternative" when JSON, itself, is so fantastic?

[-]
- creationix a day ago ago
  
  the project framing needs some help perhaps. JSON is really good at a lot of use cases that this will never replace. But there are cases where JSON is currently used where this is much better. In particular large unstructured datasets where you only need to read a tiny subset of the data in a single request.
  Maybe a better framing would be no-sql sqlite?