I plan to use frecency in my bookmarking app.
Although you don't have any problems with lag, it is possible to efficiently compute frecency in O(1) complexity
> But with an additional trick, no recomputation is necessary. The trick is to store in the database something with units of date...
Full details: https://wiki.mozilla.org/User:Jesse/NewFrecency#Efficient_co...
The Mozilla approach is a clever optimization for large-scale databases (like browser history), but it relies on a specific decay model that can, I believe, be collapsed into a single point-in-time value.
I chose a different path for sd for two reasons:
Mathematical Fidelity: Convolving the cd event "time series" with a power-law kernel, yielding S = Σ 1/(t − t_i)^p, provides a more nuanced ranking of "burst" activity vs. long-term habits than a single-point frecency score. Calculating this over a window (N = 1280) allows for a "steep" decay (the default is p ≈ 10) that handles context switching more naturally.
The incurred computational overhead is real but modest: in my tests I see about 9ms for stack recomputation with the default window (N = 1280), against a total real time for the cd action (including pattern matching and executing the cd) of about 22ms (on ksh; in bash it is about 30ms). Even when increasing the window to 8192, the stack recomputation (scoring/ranking) takes only about 20ms. So the basic algorithm is indeed O(N), but the bottleneck is the sub-shell overhead rather than the math; moving to O(1) would save little (5ms, I guess).
The "Attention Span" Logic: By using a fixed window of ≈1000 events for the "attention span" (short-term memory) and falling back to the full ≈10000-event log (long-term memory) only if no match is found, sd almost always finds a matching dir (and usually picks the right one).
So you are right, the O(N) computation does impose a cost, but it is modest and negligible for the task at hand (I am quite sensitive to "command lag", but 30 or even 60ms still feels "prompt" to me).
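As a minimal sketch of the windowed power-law scoring described above (Python for illustration only; sd itself is implemented in shell/AWK, and treating the event counter as the age unit is my assumption here):

```python
def powerlaw_score(visits, now, p=9.97, window=1280):
    """Score one directory: sum 1/(now - t_i)^p over visit
    indices t_i that fall within the last `window` events.
    Ages are counted in cd events, not wall-clock time."""
    score = 0.0
    for t_i in visits:
        age = now - t_i            # how many cd's ago the visit was
        if 0 < age <= window:      # events outside the window are ignored
            score += 1.0 / age ** p
    return score

# With p ~ 10, a small recent burst outweighs a large but old habit:
recent_burst = [995, 997, 999]          # 3 visits, just now
old_habit = list(range(0, 900, 3))      # 300 visits, long ago
now = 1000
assert powerlaw_score(recent_burst, now) > powerlaw_score(old_habit, now)
```

This is the whole O(N) pass the timings above refer to: one weighted sum per logged event, recomputed after each cd.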
Sorry, there's a lot here about the technical implementation details but much less I can understand about the problem being solved. What exactly do you mean by "manage"? What happens differently when you use this command, versus when you use built-in `cd`?
To "manage" in this context refers to maintaining a ranked database of your directory history so you can navigate without providing full or relative paths. The difference in practice:
With built-in `cd`:
Navigation is manual and location-dependent. To reach a deeply nested project, you must provide the exact path: `cd ~/src/work/clients/acme/frontend/src/components`
With sd:
Navigation is intent-based. Once `sd` has indexed a directory, you provide a fragment of the name: `sd comp`
How it works differently:
1. Passive Indexing: `sd` remembers all your cd actions (up to a configurable limit, typically several thousand) and, after each cd action, computes a weighted score for each path you visit.
2. Intent Resolution: When you run `sd <string>`, it doesn't check your current working directory. It queries the database for the most "frecent" (frequent + recent) path matching that string and moves you there.
3. Ambiguity Handling: If "comp" matches multiple paths (e.g., `frontend/components` and `backend/components`), the power-law model I mentioned calculates which one is more relevant to your current "attention span" to resolve the tie.
The problem being solved is the cognitive load of remembering and typing long, nested file paths. It replaces a "search and find" task with a "recall" task.
So the core functionality is very similar to tools like z/zoxide, but the ranking uses the full visit history, so the detailed behaviour is different.
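The three numbered steps above can be sketched as a ranked lookup (a hypothetical Python model; sd itself does this in AWK against its log file, and the names here are illustrative):

```python
import re

def resolve(pattern, scored_index):
    """Intent resolution: return the highest-scored path whose full
    pathname matches the regex `pattern`, or None if nothing matches.
    The current working directory plays no role."""
    matches = [(score, path) for path, score in scored_index.items()
               if re.search(pattern, path)]
    return max(matches)[1] if matches else None

# "comp" is ambiguous; the ranking resolves the tie in favor of the
# path with the higher "attention span" score:
index = {
    "/home/u/acme/frontend/src/components": 0.9,  # current focus
    "/home/u/acme/backend/components":      0.2,  # visited earlier
}
assert resolve("comp", index) == "/home/u/acme/frontend/src/components"
assert resolve("nomatch", index) is None
```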
> To reach a deeply nested project, you must provide the exact path: `cd ~/src/work/clients/acme/frontend/src/components`
In key presses that is 'cd ~TABsTABwTABTABaTABfTABsTABcENTER', if it is an often used project, I likely have it in CDPATH, so it's 'c acmTABfTABsTABc' or if I am actively working 'CTRL-Rfro CTRL-R CTRL-R'.
To me it sounds like a solution looking for a problem. Why should I give up predictability when it won't really save keypresses? Not that keypresses are that expensive anyway.
That workflow is perfectly valid -- if you have a stable set of 5-10 projects, CDPATH and manual Tab-completion are hard to beat.
I built sd for a different scenario: when your 'active' set of locations is large and/or shifts frequently. When managing dozens of repos where sub-directories share names (e.g., client-a/src vs client-b/src), CDPATH becomes noisy and Tab-completion requires more prefix typing.
Tools like zoxide (33k+ stars on GitHub) and z have proven this is a widespread need. sd is my take on providing that functionality in a shell-native way, using a qualitatively different ranking approach: it utilizes the full cd event history and a power-law "aging" model that I find more accurate for fast context switching.
In my experience, this saves significant keypresses by allowing for shorthand matching. For example, I can reach .../client-a/src just by typing 'sd ent-a'. The algorithm disambiguates based on the ranking; if the top match is ever wrong (it rarely is for me), you can just do ^P <CR> (recall and re-run) to jump to the 2nd highest match.
But it’s definitely a matter of taste -- manual curation vs. an automated 'attention span' model.
Thanks for the explanation.
Do you really have hundreds of projects you work at at the same time?
> For example, I can reach .../client-a/src just by typing 'sd ent-a'. The algorithm disambiguates based on the ranking; if the top match is ever wrong (it rarely is for me), you can just do ^P <CR> (recall and re-run) to jump to the 2nd highest match.
CTRL-R can also do this, or does this tool also do partial matches, e.g. 'tas' matches '.../client-a/src' ?
To clarify the matching: sd uses AWK-based regex matching against the full logged pathname ($0 ~ pattern).
It does not use "fuzzy" sub-sequence matching (pooling 't', 'a', and 's' from different path segments). For .../client-a/src, you wouldn't use 'sd tas', but rather 'sd t-a' or 'sd a/src'. 'sd "cl.*src"' would also be a valid jump.
Regarding CTRL-R:
* Filtering Noise: CTRL-R searches raw, chronological history. sd builds a scored directory stack from your history. You are querying a ranked index of destinations, not a linear tape of commands. This allows jumping to a directory visited 200 actions ago without scrolling through history with CTRL-R.
* Ranking Logic: CTRL-R is strictly chronological. sd uses a power-law "aging" model. If client-a is a daily driver and client-b was a one-off visit, sd correctly prioritizes client-a for a 'src' query, even if it wasn't the absolute last place visited.
* Spatial Cache: I don't work on 100 projects at once, but I do touch a "long tail" of ~100 distinct paths over several weeks/months (logs, build dirs, repos). sd acts as a cache for these nesting depths so I don't have to recall them.
Ultimately, you just provide a modestly specific pattern — 'sd t-a/s' or 'sd src' -- and the ranking handles the disambiguation.
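The matching rules above follow directly from regex semantics; in this check, Python's re.search stands in for AWK's $0 ~ pattern:

```python
import re

path = "/home/u/work/client-a/src"

# sd matches the pattern as a regex against the full logged pathname,
# not as a fuzzy sub-sequence of scattered characters:
assert not re.search("tas", path)   # no fuzzy pooling of t, a, s
assert re.search("t-a", path)       # contiguous substring of "client-a"
assert re.search("a/src", path)     # patterns may span path separators
assert re.search("cl.*src", path)   # full regex syntax also works
```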
I see. Sounds interesting, I guess it is just not for me.
How does it cope with the directory stack, what happens when I do a 'cd -' after using sd, where does it take me?
I mean fundamentally a filesystem is just a namespace and I can order it like the hell I want, it also supports different types of aliasing.
Sure, needs and tastes differ.
'cd -' behaves exactly as it always does. Functionality-wise, when you use the 'sd' wrapper (which shadows the builtin by default), it treats the standard cd behavior as a prioritized subset. If you provide a valid path -- 'cd path', 'cd ~-', 'cd -', or 'cd ..' -- it resolves that first and behaves as it always has. If you only used those commands, you would never even notice sd is there.
The power comes when you choose to use it: you are free to try 'cd src' and see if it moves you across the filesystem to the desired location based on your recent activity.
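The "prioritized subset" precedence can be sketched as a simple dispatch (a hypothetical Python model; the real sd is a shell function, and `sd_dispatch` is an invented name):

```python
import os
import re

def sd_dispatch(arg, ranked_paths, isdir=os.path.isdir):
    """Model of the wrapper's precedence: arguments that builtin cd
    would accept behave exactly as cd; only otherwise is the ranked
    index consulted. `ranked_paths` is ordered best score first."""
    if arg in ("-", "~-", "..") or isdir(arg):
        return ("builtin-cd", arg)      # standard cd semantics win
    for path in ranked_paths:           # fall through to the ranking
        if re.search(arg, path):
            return ("jump", path)
    return ("no-match", None)

# With no real directories involved, ranking kicks in only for
# non-path arguments:
no_dirs = lambda a: False
assert sd_dispatch("-", [], isdir=no_dirs) == ("builtin-cd", "-")
assert sd_dispatch("ent-a", ["/w/client-a/src", "/w/client-b/src"],
                   isdir=no_dirs) == ("jump", "/w/client-a/src")
```

A pure-cd user never reaches the second branch, which is why sd stays invisible until a non-path pattern is given.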
Regarding aliases: one can certainly manage navigation with symlinks and aliases. Historically, that was actually my incentive to write sd. I had dozens of symlinks and ~50 aliases like alias cdsrc='cd /path/to/deeply/buried/dir'. Maintaining that manual list became increasingly annoying. sd solved that by replacing manual curation with a sliding window weighted summation of my actual movements. In a sense, it essentially replaces manual aliases with pattern matching against the dynamically changing directory stack based on what I’m currently focused on.
I meant when I do:
Does it undo the last sd or the last cd? When I do a pushd, does sd change the odds, to the time I was in that directory? Does sd have a replacement for pushd?
To answer your specific questions on shell integration:
cd - behavior: It undoes the last directory change irrespective of whether you did it with builtin cd or via sd. If you do "sd project-a" and then "cd -", you go back to where you were before. The sd function is a "thin" wrapper that ultimately executes "command cd <pathname>", so it respects the shell's native $OLDPWD logic perfectly.
pushd and "odds": In its current implementation, sd only records a visit when you specifically use the sd command (or its cd alias) to move. If you use pushd or popd directly, sd isn't invoked and doesn't "see" that movement. This keeps manual stack operations entirely separate from the automated ranking. To be clear: the shell-managed pushd/popd stack is distinct from the dynamic, score-sorted directory stack maintained by sd. Personally, I no longer have any use for the former, so I would argue that sd not seeing moves initiated by pushd/popd is not an issue for the tool.
pushd replacement: sd doesn't replace pushd/popd/dirs. Think of pushd as a manual bookmarking system and sd as a temporal index. They coexist: pushd is for explicit, manual stack management (presuming you still want/need it in addition to sd), while sd is for "get me to this project's source" jumps using shorthand.
The design goal is to be a "well-behaved citizen" of the shell. It doesn't break standard invariants; it just adds a shorthand layer that requires zero manual maintenance.
> While it is conceptually similar to tools like z or zoxide, the underlying ranking model is different.
I mean, cool stuff. But does it really matter from usability perspective?
The difference in usability is most apparent during context switching and burst activity.
The Problem with Heuristics: Most tools use simple counters and multipliers. If you spend 10 minutes performing a repetitive task in a temporary directory (e.g., a "one-off" burst of cd actions into a build folder), a heuristic model can "overlearn" that path. This often clobbers your long-term habits, making your primary project directory harder to reach until you have manually "re-trained" the tool.
The Usability of the Power-Law Model: Because sd calculates the score by convolving a fixed window of history (the "Attention Span"), it distinguishes between a long-term habit and a recent fluke more effectively.
Stability: Your main work directories remain at the top of the results even if you briefly go "off-track" in other folders.
Decay Precision: If you look at the README of the project, the current default exponent of the power-law decay is p = 9.97. This simply means that if you include the last 1000 cd actions in the ranking computation, the 500th of those (in the middle of the window) gets a weight of 1/2^9.97 ≈ 1/1000. Effectively, only visits to the considered dir within the last ~500 actions contribute to the score (while the window width controls how long a cd remains within the "attention span"), and only the last 100 or so really influence it. This default (easily altered by the user) is thus much "steeper" than a linear or simple exponential decay. It prioritizes what you are doing right now without forgetting what you do consistently.
Zero Maintenance: You rarely have to "purge" or "delete" paths from the database because the math naturally suppresses "noise" while amplifying "signals."
In short: The math matters because it reduces the number of times the tool teleports you to the wrong place after a change in workflow.
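The arithmetic behind the p = 9.97 default is easy to verify (this only checks the numbers as quoted; the kernel's exact normalization is defined by sd itself):

```python
p = 9.97

# Weight of the event in the middle of a 1000-event window, relative
# to the newest one, under the quoted claim: 1/2^9.97 is about 1/1000.
assert abs(0.5 ** p - 1 / 1000) < 1e-4

# The steepness in raw ages: with a 1/age^p kernel, an event 500 cd's
# ago is weaker than one 100 cd's ago by a factor of 5^p (millions).
assert (100.0 ** -p) / (500.0 ** -p) > 1e6
```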
The difference in detailed behaviour boils down to this: sd logs the individual cd event history, so it knows something about _when_ each visit happened, not merely how _often_ a path was visited plus when the single most recent visit occurred (which, to the best of my understanding, is what z/zoxide store). So there is more information for the ranking to utilize, and I believe the resulting difference in behaviour is notable.
The rest is a matter of taste: sd allows you to alter the ranking transiently in a running shell if your current task briefly calls for emphasising or de-emphasising recency (alter the power exponent), or for a smaller or larger attention span (alter the window size). I rarely need to do that, but it happens.
Last but not least, sd also provides an interactive selection facility to pick any dir currently within the attention span (i.e. "on the dir stack"). If you have fzf installed, it uses that as the UI for the selection; otherwise it falls back to simple index-based selection in the terminal. Again, I don't use this too often, but sometimes one does want it (e.g. when you don't remember the path and need to read it to recall it).
Is this not just equivalent to clearing out zoxide's db periodically?
Not really, because zoxide's "aging" is a global multiplicative downscale of cumulative scores. It doesn't solve the problem of "Historical Rigidity."
In zoxide, every visit you have ever made contributes to a monolithic frequency score. Because the "aging" is a global multiplier (0.9), it preserves the relative proportions of your history. If "Project A" has 1000 visits from three months ago, its score can remain so high that it still outranks "Project B," which you started this week and visited 50 times. This is why zoxide provides a manual remove command -- users sometimes have to intervene when the "ghosts" of old projects won't stop winning matches. I have never felt the need to do something like that in sd (although the option is there, mostly as a historic artifact).
In sd, the ranking is based on the density of visits within a fixed window (the "Attention Span").
If you haven't visited "Project A" in your last ≈1000 moves, it has exited the window and is ignored for ranking purposes (as long as a pattern match is found elsewhere on the stack -- otherwise fallback logic kicks in and "Project A" is still discoverable). So it doesn't matter if you visited it 1000 times earlier in the year; it no longer occupies your "attention." Conversely, if you've visited "Project B" 50 times in the last two days, those visits occupy a high-weight portion of the power-law curve.
sd essentially uses a sliding window weighted summation approach. It prioritizes what you are doing now without being weighed down by the "debt" of what you were doing months ago. This provides "Zero Maintenance" because the math naturally suppresses old signals as they exit the window, whereas cumulative models might eventually require manual pruning to fix a sluggish ranking.
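A toy comparison of the two models (a deliberately simplified sketch: zoxide's real score also includes per-entry recency weighting, so this only illustrates the point that a global multiplier preserves relative proportions while a window does not):

```python
def cumulative_with_aging(counts, agings, factor=0.9):
    """Global multiplicative aging: every entry is scaled by the same
    factor, so old heavyweights decay but never re-rank."""
    return {d: n * factor ** agings for d, n in counts.items()}

def windowed_score(events, window=1000):
    """Count only visits inside the last `window` events (the
    power-law weighting is omitted; the cutoff alone shows the point)."""
    recent = events[-window:]
    return {d: recent.count(d) for d in set(recent)}

# Project A: 1000 visits months ago. Project B: 50 visits this week.
counts = {"A": 1000, "B": 50}
aged = cumulative_with_aging(counts, agings=5)  # several aging passes
assert aged["A"] > aged["B"]                    # A's "ghost" still wins

events = ["A"] * 1000 + ["other"] * 950 + ["B"] * 50
win = windowed_score(events)
assert "A" not in win and win["B"] == 50        # A has left the window
```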