1. The results of this tool are not good. It’s recommending outdated models like Qwen2.5 series and missing good new models.
2. This could have been a single web page that runs in your browser and lets you enter hardware specs, like all of the other tools like this. It is not a good idea to install and run unknown projects like this on your computer in this day and age.
3. The project is very obviously vibecoded, down to the README
4. Every comment from this account appears to be AI generated too.
I would recommend not installing and running this on your computer. There is no advantage over other tools and everything about the account and project looks like low effort AI generated content.
There is apparently a marketing.md file that was deleted 25min ago with the strategy to post on HN. https://github.com/Andyyyy64/whichllm/commit/2cefaea1cc5d2de...
I think your hunch is very much spot on. It doesn’t look trustworthy at all.
Man. Every time I see a Reddit or HN post starting with "I got tired of..." I already know that it's going to be self-promoted AI slop.
That’s amazing. Everyone has noticed all of the AI slop projects being posted to Show HN and r/localllama but this is the first time I’m seeing the AI text that is telling people to do this. I was naive enough to think these people had the idea to post the project on their own, but even the idea to post it is coming from the LLMs. Amazing.
EDIT: r/selfhosted is in there too. This explains why that subreddit is having such a problem with AI slop project spam.
/r/selfhosted in particular has gone from a fun community of computing enthusiasts and developers having fun with their software and data on their own terms, to a literal self-promotion wasteland of AI-generated slop.
(And I say this as someone who is not actually against the _intelligent_ use of LLMs for software development.)
And it seems to work! First page on HN
Hilariously the commit says "delete marketingmd ai slop".
Hosting a website costs money and headache. Pretty cheap but still, it's easier to just do a CLI.
Hosting to GitHub Pages is free and easy for a project that is already on GitHub.
The premise of "best" makes this a non-starter for me right away. The site says their definition of "best" is 1) fits in RAM 2) has a high benchmark score.
Best for what? All models have their strengths and weaknesses. Controlling for number of parameters, some are better at general knowledge, some are better at writing and planning, some have more creativity, some are better at writing code, some are better at debugging code, etc, et al, and so on.
The "best" model is not "whatever fits into VRAM." You can do lots of useful stuff with a small CPU-only model. Just a few days ago, there was a 29M model optimized for nothing but tool calling.
Last and probably most controversial, the idea that LLM benchmark scores have any actual real-world value whatsoever is a collective hallucination. They are for marketing and serve no other purpose. New LLMs are always specifically trained to score high on the benchmarks the developers want. Somehow, every new release of every new model _always_ shows it scoring slightly above the models it claims are its competitors on most tests. Since LLM output is non-deterministic, you often get wildly different responses to identical prompts, and it is trivial for the developers to cherry-pick results. Since they never show their work, we are expected to take them at their word.
Yes, you need to know if the model will fit into your RAM, and whether the speed will be acceptable. But the _only_ way to know whether a model is suitable for your specific task is to try it out for yourself and see if it does (most of the time) the thing you need it to do.
This is very helpful too: https://www.canirun.ai/
Love that it defaults to the GPU being "NVIDIA GeForce 8800 GTX", a GPU released in 2006 with ~700MB of VRAM...
The estimates seem far off as well. I took https://www.canirun.ai/model/gpt-oss-120b as an example with an RTX Pro 6000, and every single number is off. It also notably lacks an estimate for the most important quant for GPT-OSS, the MXFP4 variant.
The default for me was M1. I think it tries to guess using WebGPU.
> canirun.ai
I run a DGX Spark, and the results here are soooo incomplete for my platform that I can’t trust this site (for my use case).
Yes, I really like this site too, but it's a bit outdated.
"39d ago" in AI time is like 1 year outdated info.
Doesn’t have qwen3 coder next. What else is it missing?
Every browser gives me a different result, I guess I can't blame the site for that. But it should perhaps mention which browser would be the most accurate.
I also have a script to find the best LLM for your hardware. Here:
echo "Qwen3.6-27B"
works awesome, and it picked the best model for my machine, Qwen3.6-27B...
For next version can you add install as well ;)
Why can't you use a web page instead?
Came here to ask the same question. This could be a static HTML web page with a table.
Not perfect, but I find the artificialanalysis.ai "Intelligence vs. Output Tokens Used in Artificial Analysis Intelligence Index" chart[0] (scroll down to the titled chart) to be of great use. A proper evaluation needs to compare 3 things together: score, speed, and verbosity. This chart plots score vs verbosity.
[0] https://artificialanalysis.ai/?models=gpt-oss-120b%2Cgemma-4...
It looks nice. I've been searching for something like this recently, and was frustrated with rankings that lack latest models or don't clearly distinguish quantizations.
Showing quality loss per quantization is nice.
I'd prefer this as a website, since I'd handle running of the model with a dedicated inference server anyway.
It would be nice to see what's the maximum context length that can fit on top of the baseline.
I was surprised how much token generation speed tanks when using very long context. 30/s can drop down to 2/s. A single speed metric didn't prepare me for that.
I was also positively surprised that some models scale well with batch parallelism. I can get 4x speed improvement by running 8 requests in parallel. But this affects memory requirements, and doesn't apply to all models and inference engines. It would be nice to show that. Some sites fold it into "what's your workflow", but that's too opaque.
KV cache quantization also makes a difference for speed, VRAM usage and max usable context.
On Apple Silicon, MLX-compatible model builds make a difference, so I'd like the benchmarks to confirm that they're based on the fastest implementation.
Multi-token-prediction is another aspect that may substantially change speed.
Brew install is broken
It seems pretty rubbish, I have to say. It's recommending me loads of Qwen 2.5 models, which are really old, and I'm easily running Qwen3.5 and 3.6 models on this Mac at decent quants.
AI slop quality software for ya.
“I release software now, good luck everyone”
Cool idea – the local LLM space really needs a tool like this that actually understands hardware and real benchmarks instead of just “biggest model that fits.”
Only thing I would add is the ability to point to new/uncataloged benchmarks. If I have a favorite benchmark that best matches my use-case, the ability to point to it and have it fuzzy match model names or what have you would be a neat feature.
Interesting concept! A suggestion: `whichllm <USE_CASE>` would be more beneficial, e.g. `whichllm coding` or `whichllm text-to-video`.
What’s new regarding llmfit?
https://github.com/AlexsJones/llmfit
This has a web version[0] which I wish they'd host on a free site.
[0] https://github.com/AlexsJones/llmfit/tree/main/llmfit-web
Edit: I tried to deploy a snapshot of the llmfit-web files on Netlify but it seems to need/want to talk to a backend[1]
[1] https://llmfit.netlify.app/
Other than it (whichllm) being written in Python, nothing else.
I just use llmfit.
Fair question. llmfit answers "will this model fit in my memory?" — it's a fit/size calculator, and a good one. whichllm answers a different question: "of the models that fit, which is actually best?" It pulls candidates, then ranks them by merged real benchmarks (LiveBench / Artificial Analysis / Aider / Arena ELO / Open LLM Leaderboard) with a recency penalty, so a newer 27B beats an older 32B even though both fit — on a 24GB card it puts Qwen3.6-27B above Qwen3-32B on benchmarks, not size.
If "biggest that fits" is the answer you want, llmfit is the simpler tool and Python won't matter to you. If you want "which fitting model is worth running," that ranking layer is the whole reason whichllm exists. Different jobs — I'd genuinely send fit-only users to llmfit.
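For anyone wondering what "rank by benchmarks with a recency penalty" could look like in practice, here is a minimal sketch. It is not taken from whichllm's code: the field names, the penalty rate, and all the numbers are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    vram_gb: float      # estimated memory needed (hypothetical value)
    bench_score: float  # merged benchmark score, 0-100 scale (hypothetical)
    age_months: int     # months since release (hypothetical)


def rank(candidates, vram_budget_gb, penalty_per_month=0.5):
    """Keep the models that fit, then sort by score minus an age penalty."""
    fitting = [c for c in candidates if c.vram_gb <= vram_budget_gb]
    return sorted(
        fitting,
        key=lambda c: c.bench_score - penalty_per_month * c.age_months,
        reverse=True,
    )


models = [
    Candidate("older-32b", vram_gb=20.0, bench_score=62.0, age_months=12),
    Candidate("newer-27b", vram_gb=17.0, bench_score=61.0, age_months=1),
    Candidate("huge-120b", vram_gb=65.0, bench_score=75.0, age_months=6),
]
ranked = rank(models, vram_budget_gb=24.0)
print([c.name for c in ranked])
```

With these made-up numbers the newer 27B ends up ahead of the older 32B despite a slightly lower raw score, and the 120B is dropped because it does not fit in 24 GB. How the penalty is weighted is exactly the kind of thing people in this thread would disagree about.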
Your LLM should have bothered to notice that llmfit also has quality scores (and defaults to sorting by them). One might quibble about weighting of factors -- llmfit favors Qwen3.6-35B-A3B over Qwen3.6-27B, whereas I found the quality of the latter to be worth waiting for -- but it absolutely ranks models by quality.
I thought that Qwen3.6-35B-A3B was notably missing from whichllm output too.
Seriously going to use AI to write a reply that would take you 30 seconds?
AI response
Where are the metrics being sourced from exactly? Externally, or does this project have benchmarks running somewhere for its own purposes? The latter would likely give more apples-to-apples comparisons, in case some external sources are biased.
The plan command is clever. How do you handle the VRAM estimation for models with sliding window attention vs full context? Something like Mistral at 32k context uses way less KV cache than Llama at the same context length, but from the README it looks like the estimation is based on a fixed context size. Does it account for that?
Good catch that's a real gap. The KV estimate is GQA/MQA-aware (per-model head config) but currently assumes dense full-context attention; it does not model sliding-window / chunked attention, so for SWA models like Mistral or Gemma at long context it over-estimates KV. The error is conservative — it tells you a model needs more than it does, not less, so it won't push you into an OOM — but it's still wrong. I'll open a tracking issue with per-architecture window sizes; if you have a reference for the exact SWA configs you care about it'll speed the fix. This is the kind of report I posted for.
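To put rough numbers on the full-context vs sliding-window difference discussed above, here is a small sketch of the usual KV cache size formula. The config (32 layers, 8 KV heads, head dimension 128, 4096-token window) is illustrative only, and the sketch assumes every layer uses the same window, which is not true for models that mix local and global attention layers.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len,
                   bytes_per_elem=2, sliding_window=None):
    """Approximate KV cache size in bytes for a single sequence.

    The factor of 2 covers keys and values. With GQA/MQA, n_kv_heads is
    smaller than the number of query heads. If sliding_window is set, this
    sketch assumes every layer caches at most that many tokens.
    """
    if sliding_window is None:
        cached_tokens = context_len
    else:
        cached_tokens = min(context_len, sliding_window)
    return 2 * n_layers * n_kv_heads * head_dim * cached_tokens * bytes_per_elem


GIB = 1024 ** 3
cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128)  # illustrative config

full = kv_cache_bytes(**cfg, context_len=32768)
swa = kv_cache_bytes(**cfg, context_len=32768, sliding_window=4096)
print(f"full attention: {full / GIB:.2f} GiB, sliding window: {swa / GIB:.2f} GiB")
```

Under these assumptions a 32k context costs 4 GiB of fp16 KV cache with full attention and 0.5 GiB with a 4096-token window, which is the size of over-estimate being described.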
"Best LLM" doesn't really depend on hardware alone. It actually depends more on your needs - type of workload, context length needed etc.
I love this community, I started building a simple website for this exactly a couple of hours ago and you made an even more advanced version already. Hats off to you sir.
If I ever decide to actually publish the site, is it alright if I mention you somewhere, along the lines of "If you want a more accurate estimation, check out this project: <your repo>"? I think there is value in a simple website that estimates this information for you and gives you instructions and common flags for starting it yourself (plus a prompt you can optionally give to an LLM to set it up for you). But I'm going off a simple "choose an OS, GPU/VRAM, here's a list of options" flow and not actually scanning the hardware (which is a lot more accurate).
Read the rest of the comments - your project would still be valued.
Is there any free hosting for Python scripts? That would be much more convenient for casual use.
Tried it on a 4060, got Qwen3-14B Q3_K_M. Matches what I actually run. Brew install failed for me too though (macOS 14.5).
OP is a newish user, all of their responses here are copied straight from Claude, and this project has an LLM slop readme (count them: 48 em-dashes on the page!) and LLM code. Just not very interesting.
I got really good results when I asked Pi (agent) to install
https://github.com/AlexsJones/llmfit
and tell me which models are good for me. It organized them per use case, and the selection was solid.
How does it select the model? Using AI?
canirun.ai does similar, but doesn't require any installs. I've found it to be pretty accurate for my setup too.
For me (64GB Apple Silicon), it recommended a year-old Llama model as "best for coding". That's a terrible first impression.
Has anyone gotten the old gpt-oss models running? They scored very high on benchmarks but I constantly had strange problems with them.
So two questions there:
(1) is it actually possible to get good results with them (some people said they got good results, which implies that it might have been hard to get them running properly, but if you can, then they're actually good?). Which also implies the second question,
(2) are benchmarks a spook?
---
...Also, is OP Claude?
I had it running on my 128gb strix halo - it ran around 40 tokens per second I think but I found it to be obnoxiously lobotomized.
An uncensored qwen3.5/3.6 is more fun
This AI slop should not be here.
It'd be nice if it had iGPU support; it can't even detect it. Overall a great tool though. Happy this exists.
Cool, but it looks like it doesn’t actually test anything on your machine? It does hardware detection and then some lookups. Maybe I missed it but I really want a tool like this to actually run a model on my machine to get the speed numbers.
I’ve been using RapidMLX for this. The integrated speed tests matter because the quality of the backend is a moving target and the quantization / MLX format conversion also matter. It’s not enough to say “oh use this model family with X parameters” you have to add the architecture specific quantization too.
https://github.com/raullenchai/Rapid-MLX
You're right that it doesn't run anything — it's a pre-download / pre-purchase decision tool, so it estimates rather than measures by design (you can simulate a GPU you don't own with --gpu). That's a genuine limitation vs running the model: a measured t/s on your exact backend/quant will always beat my estimate. The estimate is bandwidth-bound, per-quant and per-backend, and deliberately conservative on VRAM (weights + GQA-aware KV + activation) so it errs toward "won't fit" rather than crashing you mid-run. Where I can get real measurements I fold them in — calibration data / PRs for specific hardware are very welcome; that's the path to numbers you can trust rather than just plausible ones. On-device measurers like RapidMLX are complementary, a different point in the workflow.
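For readers unfamiliar with what a "bandwidth-bound" estimate means, here is a rough sketch. It is not whichllm's actual formula: the function name, the efficiency factor, and the example numbers are all assumptions, and it only covers the simple case of a dense model at short context.

```python
def estimate_tokens_per_second(weights_gb, bandwidth_gb_s, efficiency=0.6):
    """Rough decode-speed estimate for a dense model.

    Generating one token streams roughly all of the weights from memory
    once, so speed is about bandwidth / weight size. `efficiency` is a
    made-up fudge factor for real-world overhead.
    """
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gb_s * efficiency / weights_gb


# Hypothetical example: 16 GB of quantized weights, 1000 GB/s of bandwidth.
tps = estimate_tokens_per_second(weights_gb=16.0, bandwidth_gb_s=1000.0)
print(f"about {tps:.1f} tokens/s")
```

This is why the quant matters so much for speed: halving the bytes per weight roughly doubles the estimate. It also shows why measured numbers can diverge, since long contexts, MoE routing, and backend quality are all outside this simple model.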
Find the best LLM for your local hardware.
lifts mask
It's qwen.
Accurate memory estimation is key here. It will crash if that isn't accurate, and it can't be generic for all local LLMs: each local LLM has different context estimates.
can you add in the other quants like IQ3_M?
also my personal simple rule of thumb for local ai sizing is:
max model size (GB) = ram (GB) / 1.65
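As a quick worked version of that rule of thumb, here is a tiny sketch. The 1.65 divisor is the commenter's personal heuristic; the function name and the list of RAM sizes are just for illustration.

```python
def max_model_size_gb(ram_gb, overhead_factor=1.65):
    """Rule of thumb from the comment above: model size = RAM / 1.65."""
    return ram_gb / overhead_factor


for ram in (16, 32, 64, 128):
    print(f"{ram} GB RAM -> about {max_model_size_gb(ram):.1f} GB of model")
```

The headroom left over by the divisor is presumably meant to cover KV cache, activations, and whatever else is running on the machine.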
This doesn't correctly detect the unified memory architecture for:
GPU 0: STRXLGEN — 8.0 GB (ROCm 6.19.8-200.fc43.x86_64) — BW: N/A CPU: AMD RYZEN AI MAX+ 395 w/ Radeon 8060S — 16 cores (AVX2, AVX-512)
The 8GB is the reserved memory, but it's not the total available memory to the GPU.
Linux sets the unified memory like this: https://www.jeffgeerling.com/blog/2025/increasing-vram-alloc...
Don't feel bad though, nvtop doesn't do it correctly either.
Cool idea, thanks for making this
Good job.