A steam locomotive from 1993 broke my yarn test

(blog.cloudflare.com)

162 points | by jgrahamc a day ago

74 comments

  • bouke a day ago

    So the real problem is that Jest just executes whatever `sl` resolves to. The fix they intend to release doesn't address that; it only tries to recognise the train steaming through. How is this acceptable behaviour from a test runner? It looks like a disaster waiting to happen. What if I have `alias sl=rm -rf /`, as one typically wants to have such a command close at hand?

    • tlb a day ago

      Exec doesn't know about shell aliases. Only what's in the $PATH.

      I liked the shell in MPW (Mac Programmer's Workshop, pre-NeXT) where common commands had both long names and short ones. You'd type the short ones at the prompt, but use the long, unambiguous ones in scripts.
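A non-shell spawn resolves a bare command name by walking `$PATH` left to right and running the first executable it finds; aliases and shell functions never enter into it. A minimal sketch of that lookup, assuming POSIX-style `:`-separated entries. The `isExecutable` callback and the directory layout in the example are hypothetical, injected so the search order can be shown without a real filesystem:

```typescript
// POSIX-style $PATH resolution: the first directory that contains an
// executable with the requested name wins. Shell aliases are never consulted.
// `isExecutable` is injected (hypothetical) so no real filesystem is needed.
function resolveOnPath(
  name: string,
  pathEnv: string,
  isExecutable: (candidate: string) => boolean,
): string | null {
  // A name containing a slash is used as-is, without searching $PATH.
  if (name.includes("/")) {
    return isExecutable(name) ? name : null;
  }
  for (const entry of pathEnv.split(":")) {
    // An empty $PATH entry traditionally means the current directory.
    const dir = entry === "" ? "." : entry.replace(/\/+$/, "");
    const candidate = `${dir}/${name}`;
    if (isExecutable(candidate)) {
      return candidate;
    }
  }
  return null;
}

// Hypothetical machine with both programs called `sl` installed.
const installed = new Set(["/usr/games/sl", "/home/me/.local/bin/sl"]);
const exists = (p: string) => installed.has(p);

console.log(resolveOnPath("sl", "/usr/local/bin:/usr/games:/home/me/.local/bin", exists)); // /usr/games/sl
console.log(resolveOnPath("sl", "/home/me/.local/bin:/usr/games", exists)); // /home/me/.local/bin/sl
```

Reordering the same two directories flips which `sl` runs, which is why a check that only asks for `sl` cannot know which program it got.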

      • Kwpolska a day ago

        PowerShell has long commands and short aliases, but the aliases can still shadow executables, e.g. the `sc` alias for `Set-Content` shadows `sc.exe` for configuring services. And you only notice when you see no output and weird text files in the current working directory.

      • szszrk a day ago

        The networking crowd probably thinks it's obvious, because of things like the Cisco CLI, or even MikroTik. Or the "ip" CLI as well, I guess.

        I never bothered to check where that pattern originated.

        • hnlmorg 19 hours ago

          I've taken entire web farms offline due to an unexpected expansion of a command on a Cisco load balancer.

          The command in question was:

              administer-all-port-shutdown 
          
          (Or something to that effect; it's been many years now.)

          And so I went to log in via serial port (like I said, many years ago, so this device didn't have SSH) and didn't get the prompt I was expecting. So I typed the user name again:

              admin
          
          And shortly afterwards all of our alarms started going off.

          The worst part of the story is that this happened twice before I realised what I’d done!

          I still maintain that the full command is a stupid name if it means a phrase as common as “admin” can turn your load balancer off. But I also learned a few valuable lessons about being more careful when running commands on Cisco gear.

      • skykooler 19 hours ago

        Theoretically you could do this in Linux by calling /usr/bin/sl or whatever - but since various distros put binaries in different places, that would probably cause more problems than it could solve.

    • Tractor8626 a day ago

      No. This is not the real problem. There is nothing you can do if your 'bash', 'ls', 'cat', 'grep', etc. do something they're not supposed to do.

      Proper error handling would be helpful though.

    • Etheryte a day ago

      The fact that Jest blindly calls whatever binary is installed as `sl` is downright reckless and that's an understatement. If they need the check, a simple way to avoid the problem would be to install it as a dependency, call `require.resolve()` [0] and Bob's your uncle. If they don't want the bundle size, write a heuristic, surely Meta can afford it. Blindly stuffing strings into exec and hoping it works out is not fine.

      [0] https://nodejs.org/api/modules.html#requireresolverequest-op...
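A minimal sketch of the `require.resolve()` suggestion, assuming the tool were installed as an npm dependency (hypothetical here; nothing in the thread says the Sapling CLI ships that way). `createRequire` from `node:module` builds a resolver anchored at a directory, and a failed lookup becomes `null` instead of a fallback to whatever happens to be on `$PATH`:

```typescript
import { createRequire } from "node:module";
import * as path from "node:path";

// Resolve a specifier the way Node's CommonJS loader would, starting from
// `fromDir`. Returns null instead of throwing when nothing is installed, so
// the caller never silently falls back to an arbitrary binary on $PATH.
function resolveDependency(specifier: string, fromDir: string = process.cwd()): string | null {
  // createRequire wants a file path; that file does not need to exist.
  const localRequire = createRequire(path.join(fromDir, "noop.js"));
  try {
    return localRequire.resolve(specifier);
  } catch {
    return null;
  }
}

console.log(resolveDependency("fs")); // built-in modules resolve to their own name
console.log(resolveDependency("some-package-that-is-not-installed")); // null
```

A caller could then hand the resolved file to `execFile` explicitly, so the program that runs is the one the project declared.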

      • Joker_vD 20 hours ago

        "That's just, like, your opinion, man". There is another school of thought that postulates that an app should use whatever tools that exist in the ambient environment that the user has provided the app with, instead of pulling and using random 4th-party dependencies from who knows where. If I symlinked e.g. "find", or "python3", or "sh", or "sl" to my weird interceptor/preprocessor/trapper script, that most likely means that I do want the apps to use it, damn it, not their own homebrewed versions.

        > a simple way to avoid the problem would be to install it as a dependency

        I once saw a Makefile that had "apt remove -y [libraries and tools that somehow confuse this Makefile] ; apt install -y [some other random crap]" as a pre-install step, I kid you not. Thankfully, I didn't run it with "sudo make" (as the README suggested), but holy shit, the presumptuousness of some people.

        The better way would have been to have "Sapling CLI" explicitly declared as a dependency, and checked for, somehow. But as the whole history of dev experience shows, that's too much to ask of people, and dev containers are, sadly, the sanest and most robust way to go.

        • Etheryte 20 hours ago

          I think where our opinions differ is what boundaries this logic should cross. When I'm in Bash-land, I'm happy that my Bash-isms use the rest of what's available in the Bash env. When I'm in Node, likewise, as this is an expected and desirable outcome. Where this doesn't sit right with me is when a Node-land script crosses this boundary and starts mucking around with things from a different domain.

          In general, I would want everything to work by the principle of least surprise, so Node stuff interacts with Node dependencies, Python does Python things, Bash does Bash env, etc. If I need one to interact with the other, I want to be explicit about it, not have some spooky action at a distance.

          • Joker_vD 18 hours ago

            Completely understandable, it's just... it's just not in the cards. A large part of the UNIX ecosystem has not, historically, been kind to this view. Remember autotools/autoconf, makefiles with DESTDIR, and all that similar jazz? People genuinely proposed that stuff as the solution for the management of ambient dependencies. And it takes just one slip-up of "shelling out" (hopefully it's actually "forking off", not literally shelling out) for all kinds of funny business to reappear, and don't even start on the /lib and .so management.

    • blueflow a day ago

      What else should the test runner do?

      • pavel_lishin a day ago

        There must be a better way to tell if a repo is a Sapling repo than by running some arbitrary binary, right?

        • Symbiote a day ago

          For Git one could look for .git/config. There must be something equivalent.
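Detecting a checkout by looking for a marker directory avoids executing anything at all. A minimal sketch that walks up from a starting directory; `.git` is Git's marker, while `.sl` as Sapling's marker is an assumption to verify against Sapling's documentation. The `exists` parameter defaults to `fs.existsSync` but can be swapped for a fake, and POSIX paths are assumed:

```typescript
import * as path from "node:path";
import { existsSync } from "node:fs";

type Vcs = "git" | "sapling";

// ".git" marks a Git checkout; ".sl" for Sapling is an ASSUMPTION to verify.
const MARKERS: ReadonlyArray<[string, Vcs]> = [
  [".sl", "sapling"],
  [".git", "git"],
];

// Walk from `startDir` up to the filesystem root looking for a marker entry.
// No external binary is executed, so whatever `sl` is on $PATH never runs.
function detectVcs(
  startDir: string,
  exists: (p: string) => boolean = existsSync,
): { vcs: Vcs; root: string } | null {
  let dir = path.resolve(startDir);
  while (true) {
    for (const [marker, vcs] of MARKERS) {
      if (exists(path.join(dir, marker))) {
        return { vcs, root: dir };
      }
    }
    const parent = path.dirname(dir);
    if (parent === dir) return null; // reached the root without a match
    dir = parent;
  }
}

console.log(detectVcs(process.cwd()));
```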

      • pasc1878 a day ago

        Use the full path of sl rather than relying on $PATH, the same way cron and macOS GUI apps do, I assume for this exact reason.

        • stonegray a day ago

          Is the full path guaranteed? For example homebrew, snap, and apt might put it all in different places. $PATH is a useful tool.

          • pasc1878 21 hours ago

            But not in this case where you have two executables with the same name.

            You have to know where the tool was installed or else be certain no other sl is on your path.

        • Joker_vD 19 hours ago

          How would knowing the full path help you anyway? It's either in "/usr/bin/sl", or "/usr/local/bin", or "~/.local/bin", now what?

          By the way, believe it or not, POSIX compliance requires existence of only two directories (/dev and /tmp) and three files (/dev/console, /dev/null, and /dev/tty) on the system; everything else is completely optional, including existence of /bin, /etc, and /usr.

          • pasc1878 3 hours ago

            Because you know what you installed and so which sl to use.

            • Joker_vD 3 hours ago

              But the sl is not invoked by you. It is invoked by some npm module (a 5-times-removed dependency from any side) which hopes that either there is "sl" in the $PATH and it is the Sapling CLI, or there is no "sl" in the $PATH. This module can't use absolute paths because it does not know how the end user's system looks.

        • skipants 21 hours ago

          What if the full path is just `/usr/bin/sl`?

          • pasc1878 21 hours ago

            Then you get the sl there, which could be correct.

        • charcircuit a day ago

          Finding the full path of sl requires looking at $PATH

          • pasc1878 21 hours ago

            Not in this case, because then you'd find the wrong sl; you need to know where the correct sl was installed.

  • GTP a day ago

    Just from the title, I suspected that Steam Locomotive had something to do with it, so I quickly glanced through the article up to the point where the locomotive shows up. I sometimes toy with the idea of making a version called Slow Locomotive, where the train slows down every time you press Ctrl-C.

    • dullcrisp a day ago

      If you press ^Z does it stop entirely?

      And do these sorts of ideas ever get you into trouble?

      • GTP an hour ago

        > If you press ^Z does it stop entirely?

        Great idea, if I ever end up doing this I will steal it :D

        > And do these sorts of ideas ever get you into trouble?

        Not so far, but an idea is never a problem in itself. The problem can be the context. I don't see any issue in publishing a project like this on GitHub, while I see how I could get in trouble if I install it on a corporate server.

      • throwanem a day ago

        I once reimplemented in Perl Nethack's logic for phase-of-moon and Friday 13th computation and notification, and added the resulting cute little script to the root .profile on our consulting firm's main web hosting boxes.

        I didn't get fired when my boss found it by surprise a couple months (and lunar cycles) later, but I did learn a valuable lesson about how one may wisely limit one's exercise of whimsy.

        Google took a few years more to achieve the same discovery, as I recall, but presumably this has to do with pedagogical methods involving not as many ex-sergeants.

  • fifticon a day ago

    As a systems programmer employed for 30+ years, when I read a story like this, I get angry at the highly piled, brittle system, not at the guy who has sl installed. I am aware there exists a third option of not getting angry in the first place, but I hate opaque, non-robust crap. This smells like everything I hate about front-end tooling: ignorance and arrogance in perfect balance.

    • ericmcer a day ago

      What would you have done differently? They were dependent on SL (which is a facebook source control system written in C) but the user had overwritten the expected path with a shell script. That is not something most engineers would build around... "what if the user is overwriting the path to dependencies with nonsense shell scripts?".

      It doesn't feel like something that is entirely the Jest maintainers' fault. I am not sure why Jest needs a source control system, but there are probably decent reasons.

      Like if I overwrite `ls` to a shell script that deletes everything on my desktop and then I execute code you wrote that relies on `ls` are you to blame because you didn't validate its behavior before calling it?

      • MD87 21 hours ago

        The difference is that `ls` is specified in POSIX and everyone has roughly the same expectations of what it does.

        Nothing specifies what a binary called `sl` does. The user didn't "overwrite" anything. They just had an `sl` binary that was not the `sl` binary Jest expects. Arguably they had the more commonly known binary with that name.

      • mmlb 21 hours ago

        Use the lessons learned from those before us in less heterogeneous days, aka inspect the binaries you're going to call out to for fitness. Things like "check if grep is gnu or bsd" or "check if sl is sapling or steamlocomotive".

        I've done that a bit to deal with macOS's crippled bash, for example.
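A minimal sketch of that kind of fitness check with Node's `spawnSync`, under two stated assumptions: that Sapling's `sl --version` output mentions "Sapling", and that an unrelated `sl` may not understand the flag and may keep running, which is why the probe sets `timeout` instead of trusting the exit status alone. `probeSapling` and `looksLikeSapling` are hypothetical names:

```typescript
import { spawnSync } from "node:child_process";

// ASSUMPTION: Sapling's `sl --version` output contains the word "Sapling".
function looksLikeSapling(stdout: string): boolean {
  return /\bsapling\b/i.test(stdout);
}

type Probe = { ok: true; version: string } | { ok: false; reason: string };

// Run `<command> --version` with a hard timeout and check what answered.
function probeSapling(command = "sl", timeoutMs = 2000): Probe {
  const result = spawnSync(command, ["--version"], {
    encoding: "utf8",
    timeout: timeoutMs, // an unrelated `sl` might never exit on its own
    stdio: ["ignore", "pipe", "pipe"],
  });
  if (result.error) {
    // e.g. ENOENT (nothing on $PATH) or ETIMEDOUT (killed after timeoutMs)
    return { ok: false, reason: `${command} --version failed: ${result.error.message}` };
  }
  if (result.status !== 0) {
    return { ok: false, reason: `${command} --version exited with status ${result.status}` };
  }
  const firstLine = (result.stdout.split("\n")[0] ?? "").trim();
  if (!looksLikeSapling(result.stdout)) {
    return { ok: false, reason: `${command} does not look like Sapling: ${JSON.stringify(firstLine)}` };
  }
  return { ok: true, version: firstLine };
}

const probe = probeSapling();
console.log(probe.ok ? `Sapling found: ${probe.version}` : `Skipping Sapling integration: ${probe.reason}`);
```

The caller gets a reason string it can surface, rather than a hang or a silent failure.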

      • ploxiln 15 hours ago

        jest (or whatever was trying to auto-detect a "sapling" repo) should take explicit configuration to enable "sapling" or "mercurial" or whatever integration. And not try to run "sl" 16+ times in various modules/threads trying to auto-detect it.

        "automagic" things trying to be easy and helpful is really a significant source of my stress fixing software these days.
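A minimal sketch of the explicit opt-in approach. The `vcs` option and its three values are hypothetical, not Jest's actual configuration; the point is that the default is no integration at all, and an unrecognized value fails loudly instead of triggering detection:

```typescript
const VCS_SETTINGS = ["none", "git", "sapling"] as const;
type VcsSetting = (typeof VCS_SETTINGS)[number];

// `raw` would typically come from a parsed JSON config file, so it is untyped.
function chooseVcs(raw: { vcs?: unknown }): VcsSetting {
  const value = raw.vcs ?? "none"; // default: no VCS integration, nothing probed
  if (typeof value === "string" && (VCS_SETTINGS as readonly string[]).includes(value)) {
    return value as VcsSetting;
  }
  throw new Error(`Unknown "vcs" setting ${JSON.stringify(value)}; expected one of ${VCS_SETTINGS.join(", ")}`);
}

console.log(chooseVcs({})); // none
console.log(chooseVcs({ vcs: "sapling" })); // sapling
```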

      • sixothree 19 hours ago

        I hate to say it but choosing to name something sl in the first place is about as arrogant as you can get. I just can’t understand the world in which sl was an acceptable name to use much less an acceptable executable to have a dependency on.

    • Tractor8626 a day ago

      Totally happens in C code too. Maybe even more often.

      Just today I had Proxmox not working because of an invalid localhost line in /etc/hosts. Or I had a problem logging in to KDE because /etc/shadow was owned by root.

      In both cases there were only incomprehensible error messages. Luckily, the solutions were googleable.

  • salmonellaeater a day ago

    A useful error message would have made this a 1-minute investigation. The "fix" of trying to detect this specific program is much too narrow. The right fix is to change Yarn to print a message about what it was trying to do (check for a Sapling repo) and what happened instead. This is also likely a systemic problem, so a good engineer would go through the whole program and fix other places that need it.
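A minimal sketch of the kind of message suggested above: say what was being attempted, which command ran, and what happened instead. `formatProbeFailure` is a hypothetical helper, and the `sl root` arguments are illustrative rather than the exact command Jest or Yarn runs:

```typescript
// Hypothetical helper: describe what was attempted, why, and what happened,
// so the failure points straight at the unexpected program.
interface ProbeFailure {
  command: string;
  args: string[];
  purpose: string;
  detail: string;
}

function formatProbeFailure(f: ProbeFailure): string {
  const commandLine = [f.command, ...f.args].join(" ");
  return [
    `While ${f.purpose}, running \`${commandLine}\` failed: ${f.detail}.`,
    `If \`${f.command}\` on your PATH is not the tool this check expects, that is the likely cause.`,
  ].join("\n");
}

const message = formatProbeFailure({
  command: "sl",
  args: ["root"],
  purpose: "checking whether this is a Sapling repository",
  detail: "process exited with status 1",
});
console.log(message);
```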

  • burnte a day ago

    I discovered sl in 1999 and forgot about it. I rediscovered it 5 years later when I mistyped ls as sl on my personal server and hit enter. A steam locomotive drove across my screen, I remembered installing it 5 years earlier, and I laughed my butt off. I wound up pranking myself, and it took 5 years to pay off!

  • pjc50 a day ago

    Plus points for using strace. It's one of those debugging tools everyone knows about for emergencies that can't be solved at a higher level, and a great convenience of using Linux. The Windows ETW system is much harder to use, and I'm not sure if it's even possible at all under OSX security.

    • throwway120385 a day ago

      I have solved an incredible number of problems just by looking at strace output very carefully. Strace combined with Wireshark or Tcpdump are incredible as a toolset for capturing what a program is doing and for capturing what the effect is either on the USB or the NIC.

    • frizlab a day ago

      macOS has dtrace which is actually nicer to use. Cannot be used on all processes when SIP is on though.

      • pjc50 a day ago

        Last time I tried SIP prevented me from using it on my own processes, but I may have been holding it wrong.

    • dontlaugh a day ago

      macOS’s Solaris-inspired dtrace is actually nicer, especially the UI.

      • pjc50 a day ago

        Is there a guide for how to use this, including the UI, with SIP on?

        • jntun 20 hours ago

          Instruments is implemented under-the-hood with dtrace, that could be what they are referring to.

          • dontlaugh 20 hours ago

            Yes. Most things run well with Instruments attached. I’ve only used the dtrace cli a few times.

    • mrguyorama 18 hours ago

      The chrome folks built https://randomascii.wordpress.com/2015/04/14/uiforetw-window... to improve ETW usability.

      You usually don't need that full industrial level tracing though on Windows! Process Monitor is 95% of the solution for most people, and provides very similar functionality to strace, if a lot easier to read.

  • snovymgodym a day ago

    The real story here is that the author and his coworker wasted a bunch of time tracking down this bug because their dev environment was badly set up.

    > his system (MacOS) is not affected at all versus mine (Linux)

    > nvm use v20 didn't fix it

    If you are writing something like NodeJS, 99% of the time it will only ever be deployed server-side on Linux, most likely in a container.

    As such, your dev environment should include a dev dockerfile and all of your work should be done from that container. This also has the added benefit of marginally sandboxing the thousands of mystery-meat NPM packages that you will no doubt be downloading from the rest of your machine.

    There is zero reason to even mess with a "works on my machine" or a "try a different node version" situation on this kind of NodeJS project. Figure out your dependencies, codify them in your container definition, and move on. Oh, your tests work on MacOS? Great, it could not matter less because you're not deploying there.

    Honestly, kind of shocking that a company like Cloudflare wouldn't have more standard development practices in place.

    • bilekas a day ago

      >If you are writing something like NodeJS, 99% of the time it will only ever be deployed server-side on Linux, most likely in a container.

      I'm really curious where you're getting this impression from? I, for one, never run Docker containers on my dual-core Atom server with 4 GB of RAM, but I have a lot of Node services running.

      > There is zero reason to even mess with a "works on my machine" or a "try a different node version" situation on this kind of NodeJS project

      There are a lot of reasons to investigate these things, and in fact that's what I would expect from larger, more industry-involved companies. Knowing the finer nuances and details of these things can be important. What might seem benign can just as quickly become something really dangerous or important when working at a huge scale such as Cloudflare's.

      Edit: BTW, I do agree mistakes were made, and the hell that is NPM chain-of-delivery attacks is terrifying. Those are the points I would personally focus on more.

      • snovymgodym a day ago

        > I'm really curious where you're getting this impression from?

        Experience mainly, though perhaps I live in a bubble. My "99%" assertion was more pointed at the "server-side on Linux" part than the "most likely in a container" part.

        Really the point I wanted to make was that your development and test environment should be the same as, or as close as possible to, your production environment.

        If your app is going to be deployed on Red Hat Enterprise Linux (whether in a container, VM, or bare metal), then don't bother chasing down cryptic NPM errors that arise when you run it on Ubuntu, Mac, or Windows. Just run everything out of a RHEL Docker container that mimics your production environment, and spend your limited time doing the actual task at hand. It simply is not worth your time to rabbit-hole endlessly on NPM errors that happen on an environment you'll never deploy to.

        > There are a lot of reasons to investigate these things, ...

        Sure, I don't really disagree with that and generally it's good to have a solid understanding of your tools and what lies in the layers below the abstractions that you normally work with. The detective work in the post is solid.

        But the thing is that the author was supposed to be learning NodeJS in order to ramp up on a React project. But he got derailed (heh) by this side quest which delayed him being able to do the actual work he set out to do. Whether or not it was worth the time is subjective. But either way, it would not have happened in the first place with better dev environment practices.

        • bilekas 20 hours ago

          > Really the point I wanted to make was that your development and test environment should be the same as, or as close as possible to, your production environment.

          I’m really glad to hear that actually, I think you did make that point but it was a bit overlooked with the other points.

          About having better dev environments, I think you're also spot on, not just with infrastructure but also with support from other, maybe more experienced, developers who could identify these things early and share knowledge. For me at least, that's one of the main development requirements: if you're not learning, you should be teaching.

      • throwanem a day ago

        The last time I dealt with a non-dockerized Node deployment, at work or at home, was in 2013. That this was also the year of Docker's initial release is no coincidence at all.

        • bilekas a day ago

          I think for production it’s a good move, it just doesn’t feel like a sure assumption that the majority of node services are containerized.

          • throwanem 21 hours ago

            Well, the argument is more that the vast majority of Node services should be containerized, because the potentially large benefit of so doing outweighs the relatively small cost. I can't speak to anyone's assumptions, but I can say I'm inclined to support this argument because my professional experience for many years has been that containerization causes far fewer problems than it solves.

  • Kwpolska a day ago

    Naming your source control tool after a common mistyping of ls is such a Facebook move.

    • m4rtink a day ago

      Yeah! What are they going to do next, call a programming language "go" or something? Even Google would not be that stupid. Imagine Googling for that and getting only irrelevant stuff!

      • 12345hn6789 13 hours ago

        Go slice array differences golang

    • computerfriend a day ago

      Naming it after a commonly installed program that has been around since 1993 is also some hubris.

      • mrguyorama 18 hours ago

        The reality is that most devs writing code at Facebook were not alive in '93, and certainly weren't Linux admins at that time.

        Does Facebook even have any greybeards in the trenches?

  • wrs a day ago

    I had a similar problem where builds were timing out. When I looked at the build log, there was a calendar in it (?!). I eventually figured out a script was calling `date`, and something I had `go install`ed (I think) had a test binary called `date` that was an interactive calendar.

  • rossdavidh 19 hours ago

    I demonstrated that I am not a serious or good programmer by installing steam locomotive on my Linux laptop immediately after reading this.

  • normie3000 a day ago

    > git commit, which hooked into yarn test

    There's the real wtf. How are you meant to commit a failing test? Or any other kind of work in progress?

    • zdragnar a day ago

      You mark the failing test with "failing". The test runner knows that it might fail but doesn't fail the suite.

      I'm not a big fan of git commit hooks, but if you keep them lightweight (such as style linting or compiler warnings), they can give faster feedback than waiting for a CI runner to point out something that should have been obvious.

      Edit: replaced "Todo" with "failing" since we're talking about jest specifically: https://jestjs.io/docs/api#testfailingname-fn-timeout

    • computerfriend a day ago

          git commit -n

      • normie3000 3 hours ago

        Aha! I have a new alias!

  • jokoon a day ago

    I thought a real steam locomotive was passing next to a data center and crashed the server because of the vibrations of the train.

  • rrauenza 20 hours ago

    I'm trying to recall -- wasn't there someone who had a similar issue with a game? Maybe a (pun not intended) Steam game? They'd try to run their game and something else would launch? Or vice versa?

  • sureglymop 19 hours ago

    Relatable debugging, though after 2 tries I would have moved straight to strace/truss.

    Edit: okay I continued reading and that was actually the next step. :)

  • zitterbewegung 21 hours ago

    I know I'm saying this with 20/20 hindsight, but if you were troubleshooting this, why wouldn't you try it on someone else's machine to see whether it's an environment issue? They seemed to go into extensive analysis at that point. Also, I've seen Jenkins deployments with test runners that would run JS unit tests.

  • polygot 21 hours ago

    Would dev containers solve this issue?

    • tyzoid 17 hours ago

      Most likely, yes. Then it wouldn't have mattered that the `sl` package was installed.

  • WalterBright 21 hours ago

    Not about steam locomotives. Disappointed.