I just found out about this last week, but the good news is that a new PC with a better GPU will arrive in about two weeks, so I’ve decided to install a pair of local LLMs that are similar in competency to the free GPT-5 mini model I usually used in Copilot: qwen2.5-coder:14b for chat and deepseek-coder:3b for autocomplete. I’ll switch to the Claude API for the really tough stuff, which I was doing with Copilot anyway. The Continue plugin for VS Code handles all of this.
Continue on VS Code with Ollama running (start it with "ollama serve") is great. These are some of the offline models I'm using, but don't forget the qwen3.5 coder as well.
"ollama list
NAME ID SIZE MODIFIED
laguna-xs.2:latest ba9ecde43b0e 23 GB 12 hours ago
nemotron3:33b f6d8b7ff496c 27 GB 4 days ago
qwen3.6:latest 07d35212591f 23 GB 6 weeks ago
gemma4:e2b 7fbdbf8f5e45 7.2 GB 7 weeks ago
gemma4:e4b c6eb396dbd59 9.6 GB 7 weeks ago "
You can download them from Continue, or just run "ollama pull <name>" with a model name you pick by searching the models on the ollama.com site. These run mostly on CPU, as my 3080 cannot load models larger than 10 GB, but the CPU speed is amazing: it outputs faster than I can read!
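To see at a glance which of the listed models could sit entirely on the GPU, something like the sketch below might help. It parses the "ollama list" text quoted above and compares each SIZE against a VRAM budget. The function names (parse_listing, split_by_vram) are made up for illustration, and the on-disk size is only a rough lower bound, since context and other buffers also need memory.

```python
# Sample copied from the "ollama list" output quoted in the post.
LISTING = """\
NAME ID SIZE MODIFIED
laguna-xs.2:latest ba9ecde43b0e 23 GB 12 hours ago
nemotron3:33b f6d8b7ff496c 27 GB 4 days ago
qwen3.6:latest 07d35212591f 23 GB 6 weeks ago
gemma4:e2b 7fbdbf8f5e45 7.2 GB 7 weeks ago
gemma4:e4b c6eb396dbd59 9.6 GB 7 weeks ago
"""

# Decimal units, converted to GB.
UNIT_TO_GB = {"KB": 1e-6, "MB": 1e-3, "GB": 1.0, "TB": 1e3}


def parse_listing(text):
    """Return (name, size_gb) pairs from `ollama list`-style text."""
    models = []
    for line in text.strip().splitlines()[1:]:  # skip the header row
        parts = line.split()
        # Expected columns: NAME, ID, size number, size unit, then MODIFIED.
        if len(parts) < 4 or parts[3] not in UNIT_TO_GB:
            continue
        name, size, unit = parts[0], parts[2], parts[3]
        models.append((name, float(size) * UNIT_TO_GB[unit]))
    return models


def split_by_vram(models, vram_gb):
    """Split model names into those at or under the budget and those over it."""
    fits = [name for name, gb in models if gb <= vram_gb]
    spills = [name for name, gb in models if gb > vram_gb]
    return fits, spills


# 10 GB is the 3080 limit mentioned in the post.
fits, spills = split_by_vram(parse_listing(LISTING), vram_gb=10.0)
print("within 10 GB:", fits)
print("over 10 GB:", spills)
```

With the listing above, only the two gemma4 entries come in under 10 GB, which matches the remark that the larger ones end up running mostly on CPU.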
The new laptop has only 32 GB of DDR5 and an RTX 4070 with 8 GB of GDDR6, running Xubuntu, so Gemini recommended qwen2.5. I don’t think I want to run anything larger because, as you said, it’d run on CPU and system RAM. As it is, the 14B model will still spill over somewhat and not fit entirely into the GPU.
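For a rough sense of why a 14B model is tight on 8 GB, here is a back-of-the-envelope sketch. The bits-per-weight figure is an assumption (roughly what a 4-bit quantization might use), not something confirmed for the qwen2.5-coder:14b tag, and the helper name weights_gb is made up for illustration. It counts weights only, so the real footprint would be higher once context is added.

```python
def weights_gb(params_billion, bits_per_weight):
    """Approximate size of the model weights alone, in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


VRAM_GB = 8.0          # RTX 4070 laptop GPU from the post
PARAMS_BILLION = 14.0  # nominal "14B"
BITS_PER_WEIGHT = 4.5  # ASSUMED ~4-bit quantization; not confirmed for this tag

need = weights_gb(PARAMS_BILLION, BITS_PER_WEIGHT)
print(f"weights: ~{need:.2f} GB of {VRAM_GB:.0f} GB VRAM")
print(f"headroom before context/KV cache: {VRAM_GB - need:.2f} GB")
```

Under those assumed numbers the weights alone take nearly all of the 8 GB, leaving very little for the context window or anything else using the GPU, which is consistent with expecting some spill into system RAM.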