As a counterpoint, I also tried writing something with Claude last weekend: a Google Docs clone[1]. I spent $170 on Anthropic API credits and got something that did mostly what I asked for but was basically useless. It seems that for simple interfaces with an exact specification, like the recent compiler and web browser examples, it's possible to write bigger projects that "work" as a demo, though not in a way that makes them viable alternatives. For anything that requires taste and judgment, we've still got a long way to go. There are lots of great demos out there, but few if any real examples of vibe-coded (or whatever you want to call it) software standing alone as an alternative to projects people wrote.
[1] https://www.marble.onl/posts/this_cost_170.html
So we are now at the stage where AI coding agents don't work because you can't create a good Google Docs alternative from scratch, without dependencies, for $170 in one weekend.
I think the conclusion is that vibe coding is shit, and I agree with this, but having an AI assistant that can do specific tasks you can review is a good strategy.
I also have a story about vibe coding. I had the AI make me a markdown editor with some extra features. It worked fine, but the problem is I have no idea how it works: I don't know if I can add feature X easily or if it needs a rewrite from scratch, I have no ideas for improvements or new features, and if there is a problem I have no clue what causes it, since I never looked at the code.
So I concluded that (others can do whatever they want, they are free people with their own standards) I will only use vibe coding for throwaway personal shit that I will not make public, like the bash scripts and Python scripts I made to automate some stuff.
Does "vibe coding" now mean the agent needs to produce a working product in one shot? Or not even just working, but meeting unspecified requirements? I thought in the original phrasing by Karpathy it meant that you don't care about the code, yet you may still iterate because you might care about the product. I tried out the "Impeccable" skill for Claude Code and found it to be very useful if you care about visual aspects and good UI/UX. (I vibe coded a web app for tracking personal finances where I used this skill, see: https://github.com/AdrianVollmer/Solvency. Not one-shotted, and not perfect, but personally I'm impressed with what is possible in a few weekends with a $90 subscription to Claude.)
> The result is OK. It has all the features I asked for, and includes document sharing, collaborative editing in real time, support for fonts and line spacing, etc. etc. I could not have paid a developer $170 and got this. The problem, of course, is that, while abstractly impressive, this is completely useless
Well, what would you expect from a few hours of running in a loop with these constraints?
> This project exists to build a document editor from the ground up. Violating these constraints defeats the entire purpose.
> FORBIDDEN dependencies (do NOT install or use these):
> Rich text editor frameworks: ProseMirror, Slate, Quill, TipTap, Draft.js, CKEditor, TinyMCE, Lexical, or any similar library
> CRDT/OT libraries: Yjs, Automerge, ShareDB, OT.js, or any similar library
> Full CSS frameworks: Bootstrap, Tailwind, Material UI (small utility libs for specific needs are OK)
> ORMs: Prisma, TypeORM, Sequelize (use raw SQL or a thin query builder)
I can't help but wonder what you thought you would achieve, and why getting "mostly what you asked for" is still disappointing to you.
> there is no taste being applied.
There are 0 lines in AGENT_PROMPT.md about "taste". You have instructed something/someone on how to build more than on what to build.
Your goals are (from a quick skim):
- The goal of this project is ultimately to generate a working alternative to Google Docs with the same functionality.
- You are an autonomous software engineer building AltDocs, a from-scratch alternative to Google Docs.
I see a FEATURES.md file, but it's not clear if this is from you or was expanded by the model. It seems pretty slim.
All in all, I don't get the "disappointment". It seems, from your blog post, that the "model" did most of the things you asked for. The disappointment might come from what you asked for, more than from the "model" being bad... To paraphrase a line from a sitcom: "Damn, Andrew, I can't control the weather!" :)
> For anything that requires taste and judgment, we've still got a long way to go. There are lots of great demos out there, but few if any real examples of vibe-coded (or whatever you want to call it) software standing alone as an alternative to projects people wrote
Yeah, we still have exactly the same problem as before LLMs/agents, namely that we lack people with "Good Taste". I've outlined how I feel about this before (https://emsh.cat/good-taste/), but the TLDR is basically that while LLMs can help you move faster, they won't suddenly mean you'll make better choices; probably the reverse is true: you'll move faster and make worse choices.
Having good taste and knowing how things actually should function is 80% of the work of building good software, and so far all these tools that try to replace human choices lead to worse software. We need more tooling that puts the human and the LLM to work together, instead of just outsourcing from the human to the LLM.
This blog post is a sophisticated piece of content marketing for a company called JUXT and their proprietary tool, "Allium." While the technical achievement is plausible, the framing is heavily distorted to sell a product.
Here is the breakdown of the flaws and the "BS" in the narrative.
1. The "I Didn't Write Code" Lie The author claims, "I didn't write a line of implementation code." The Flaw: He wrote 3,000 lines of "Allium behavioural specification." The BS: Writing 3,000 lines of a formal specification language is coding. It’s just coding in a proprietary, high-level language instead of Kotlin.
The Ratio is Terrible: The post admits the output was ~5,500 lines of Kotlin. That means for every 1 line of spec, he got roughly 1.8 lines of code.
Why this matters: True "low-code" or "no-code" leverage is usually 1:10 or 1:100. If you have to write 3,000 lines of strict logic to get a 5,500-line program, you haven't saved much effort; you've just swapped languages.
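(To make the leverage arithmetic concrete, here is a trivial Kotlin sketch using the post's line counts; the 1:10 comparison figure is my own illustrative assumption, not something from the post.)

    fun main() {
        val specLines = 3_000.0   // Allium spec line count, per the post
        val kotlinLines = 5_500.0 // generated Kotlin line count, per the post
        // Actual leverage: roughly 1 line of spec per 1.8 lines of code.
        println("leverage = 1 : %.1f".format(kotlinLines / specLines))
        // At the 1:10 leverage typical of low-code claims, 5,500 lines of
        // output would have needed only ~550 lines of spec.
        println("spec needed at 1:10 leverage = ${(kotlinLines / 10).toInt()} lines")
    }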
2. The "Weekend Project" Myth The post frames this as a casual project done "between board games and time with my kids." The Flaw: This timeline ignores the massive "pre-computation" done by the human. The BS: To write 3,000 lines of coherent, bug-free specifications for a Byzantine Fault Tolerant (BFT) system, you need to have the entire architecture fully resolved in your head before you start typing. The author is an expert (CTO level) who likely spent weeks or years thinking about these problems. The "48 hours" only counts the typing time, not the engineering time.
3. The "Byzantine Fault Tolerance" (BFT) Bait-and-Switch The headline claims "Byzantine fault tolerance," which implies a system that continues to operate correctly even if nodes lie or act maliciously (extremely hard to build). The Flaw: A "Resolved Question" block in the text admits: "The system's goal is Byzantine fault detection, not classical BFT consensus." The BS: Real BFT (like PBFT or Raft with signatures) is mathematically rigorous and keeps the system running. "Fault Detection" just means "if the two copies don't match, stop." That is significantly easier to build. Calling it "BFT" in the intro is a massive overstatement of the system's resilience.
4. The "Maintenance Nightmare" (The Vendor Lock-in Trap) The post glosses over how this system is maintained. The Flaw: You now have 5,500 lines of Kotlin that no human wrote. The BS: This is the "Model Driven Architecture" (MDA) trap from the early 2000s.
Scenario: You find a bug in the Kotlin code.
Option A: You fix the Kotlin. Result: Your code is now out of sync with the Spec. You can never regenerate from Spec again without losing your fix.
Option B: You fix the Spec. Result: You hope the AI generates the exact Kotlin fix you need without breaking 10 other things.
The Reality: You are now 100% dependent on the "Allium" tool and Claude. If you stop paying for Allium, you have a pile of unmaintainable machine-generated code.
5. The Performance "Turning Point"

The dramatic story about 10,000 requests per second (RPS) has a hole in it. The Flaw: The "bottleneck" wasn't the code; it was a Docker proxy setting (gvproxy). The BS: This is a standard "gotcha" for anyone using Docker on Mac. Framing this as a triumph of AI debugging is a stretch; any senior engineer would check network topology when seeing high latency but low CPU usage. 10k RPS is also not "ambitious" for a modern distributed system; a single well-optimized Node.js or Go process can handle that easily.
Hello, could you please put your over-sensationalized, overly-long, AI-generated comments somewhere else? Thank you.
Kindly, the HN Community.
[flagged]
I find it interesting that you request me to discuss subject matter when your post is intellectually equivalent to: "I generated this sequence of numbers using my pseudo-random number generator"
I find it interesting that you continue with personal attacks while saying nothing of substance. What is wrong factually in the post? Why are you convinced I'm an AI? I also find it interesting that you've just been hostile throughout this entire interaction and still have nothing to add to the actual discussion. Stop the slop.
Alright, here I go.
I don't think you are AI. I merely lament the fact that you found it appropriate to post a clearly LLM-generated comment.

The problem with LLM-generated comments, for me, is not their content but their nature. I am not addressing the "actual discussion" because, in my personal opinion, there is no "discussion" to be had. The post constitutes an automated response akin to an answering machine (albeit a much more sophisticated one), and I generally do not find discussions with answering machines interesting at all.
That sounds like a you problem. Since you're not interested in discussing anything, or in finding this to be an actual discussion, I'll stop trying to engage with you in good faith, sorry.
Stop posting slop.
[flagged]
I don't think people should be rude to you, but the comment was AI-generated, right? Lots of people dislike that as it feels kind of wasteful and disrespectful of our time; it can literally take you less time to generate the comment than for us to read it, and the only information you added is whatever was in the (presumably much shorter) prompt. If you'd written it yourself, it may or may not be interesting and correct, but I'd at least know that someone cared enough to write it and all of it made sense from that person's perspective. Sometimes I am interested in an LLM's take on a topic, but not when browsing a forum for humans.
I'm sorry, but if you're claiming some text on a website is "disrespectful of our time", I don't know what to say to you.
I stand behind everything in my comment, and I have engaged in good faith with every single reply to it here (even though none of them talk about anything in the comment itself).
Go through my profile, see how I engage with people and tell me again I'm AI.
If you do not have anything to say about the subject matter of a comment and just have personal snide remarks, I do think it's a waste of your time, but do not blame me for it or tell me to leave the platform.
Typing this comment right now is a waste of time for me but I do not feel the need to grandstand over it as if there's a massive opportunity cost to it. I'm a human writing/interacting in a "forum for humans."
I didn't say or think that your account was AI-run, and I didn't tell you to leave the platform. I just tried to explain why your comment might have annoyed people and triggered negative responses (while agreeing that the rude ones were inappropriate).
Sure, cheers then. I don't care about negative responses if they're negative just because they think it's AI-generated, without having to say anything substantial on the actual comment or the article. I have demonstrated my willingness to engage in good faith but those comments have not.
If negative responses have no substance behind them, it makes no sense to care about them or take them seriously.

Also, the fact that you assume it takes more time to read that comment than it took me to write it is pretty weird (I still don't get what was so wrong about the comment that simply reading it is a waste of time for people).
> the fact that you assume it takes more time to read that comment than it took me to write it is pretty weird
I didn't do that either! I had no idea whether you just fired off a quick prompt and pasted the result without even reading it, or spent ages crafting and rereading and revising it, or (most likely) something in between those extremes. I said generated comments can take less time to create than to read, and that's one reason people push back against them. There's a risk that the forum just gets buried in comments that take near-zero effort to 'write' but create non-trivial time/effort/annoyance for those of us wading through them in search of actual human perspectives. And even the relatively good ones will be little different from what we could all obtain from an LLM if we wanted it.
FWIW, I didn't even get to the substance, because I instinctively bounce off LLM-written content posted in human contexts without explanation. You're obviously free not to care about that, and I wouldn't have replied and got into this meta discussion if not for the back-and-forth you were already involved in.
edit: but if you do care about getting through to people like me, even a short manually-written introduction can make me significantly more likely to read the rest of the content. To me, pure LLM output is a pretty strong signal of a bot/low-effort human account. But if someone acknowledges that they're pasting in an AI response and bothers to explain why they think it's interesting and worthwhile, I'll look at it with more of an open mind.
I stand by the comment in its entirety. If formatting is an issue that makes it unreadable for you (to not even get to the substance), I can't help you. I do not care about "getting through" to anyone, I'm a human interacting on a human forum and I responded to the content of the article which was mostly BS about creating AI slop (on top of being a content marketing piece trying to sell people shit using deceptive claims).
But I will defend myself when I'm told obtuse things without any substance backing them.
I'm obviously just annoying you, which really wasn't my goal, so I'll stop here. But I want to note that if you think this all comes down to "formatting", you're still not hearing what I'm trying to say.
> Your code is now out of sync with the Spec
Is there even a sync to be had? The same prompt to the same LLM at different times will yield different artifacts, even if you were to save and re-use the seed.
Is there a link to the specification and the resulting generated code? I skimmed through the article and the author's GitHub profile, but couldn't find anything related.
Seems like a serious oversight if this is your selling point.
Their docs say:
> Allium has no compiler and no runtime. It is purely descriptive, defined entirely by its documentation
From what I understand, that means there is no formal grammar and no parser; it's just vibe-evaluated by LLMs.
Sorry, I meant the spec and code of the toy project described in the post.
I don't necessarily expect their secret sauce to be openly available. But given the grandiose claims made there, I do expect something to back it up other than a corpospeak "trust me bro".
This post appears to essentially be an ad for Allium, so I’m gonna focus on that.
I don't fully understand how Allium solves the inherent issues with markdown, i.e. free-form language specs. It appears to have some loose syntactical rules, but outside of those it relies on comments, which are, again, free-form. Moreover, the "Resolved" portions remind me of how LLM agents like Claude Code just go and edit existing doc files, making a mess of repetitive and overlapping thoughts.
> Allium has no compiler and no runtime. It is purely descriptive, defined entirely by its documentation
That seems ill advised at best.
Over a weekend, between board games and time with my kids, Claude and I built a distributed system with Byzantine fault tolerance, strong consistency and crash recovery under arbitrary failures. I described the behaviour I wanted in Allium, worked through the bugs conversationally and didn't write a line of implementation code.
I don't see why you need to bring your kids into this, and as a parent any suggestion of being distracted by tech during time with the kids raises my suspicions.
We have a strict no laptops and no phones rule when the kids are around (unless we're specifically doing something with them using those tools - looking at the weather forecast, or looking up some information).
"I can prompt AI while playing with the kids" is not a future I want.
They said "between", not "during"; I think the point was that they didn't spend a full weekend with Claude.
edit: anyone care to explain why this is a bad comment, rather than just downvoting? The GP comment says "Over a weekend, between board games and time with my kids,", and the parent comment lectures them based on an obvious strawman: "I can prompt AI while playing with the kids"
I'll bite.
"between time with the kids" is another person's "during time with the kids".
For me, "between time with the kids" means my kids are engaged in another activity that does not require my input until they are done with it. Whatever I am doing during this time also is typically very interruptible, so I am ready to help the kids along to their next "thing" (the joys of being a parent!). On a typical weekend (my oldest is 7), I'll get maybe 2 hours of this time during the time my kids are awake.
That all makes sense, but I don't see how it supports your uncharitable read of the original comment. Both because they could very easily have meant something a bit different (we have no idea how old the kids are or whether they were even around all weekend; maybe there were times when they were at friends' houses or similar), and because vibe coding could be the interruptible activity they do while the kids are busy. Maybe you feel strongly that it's not a legitimate use of that kind of semi-downtime, but we don't know if that's what they were talking about in the first place.
"Hours spent on the project" would be a much more useful metric, with no confounding variables. As it is, the mere mention of interleaving time with your kids and time engrossed in tech hits a nerve at time when IMO too many parents are doing this already, and lends unwarranted validity to the idea.