1 comment

  • tim-tday 18 minutes ago

    How specifically do you propose this? Have you actually tried running inference on a local machine? (You should.) I run local models on desktop machines, laptops, dedicated servers, and cloud servers.

    I’ve built my own LLM containers, I’ve built orchestration systems for fine-tuning and model management, I’ve tried quantized models, and I’ve tested a dozen or so models of different sizes.

    You can’t really get around the fact that inference on CPU is slow, and inference on GPU is gated by VRAM: you need about 1GB of VRAM per billion parameters, and quantizing below that reduces quality and increases operational toil. If you know of a consumer-level GPU with 80-128GB of VRAM that I can buy for less than $10k, do please let me know.
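    The sizing rule above is just back-of-envelope arithmetic over bytes per parameter. A rough sketch (weights only; real usage adds KV cache, activations, and framework overhead on top, so treat these as lower bounds):

    ```python
    # Rough VRAM needed just to hold model weights, per precision.
    # "~1GB per billion parameters" corresponds to 8-bit weights;
    # fp16 doubles it, 4-bit roughly halves it.
    BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

    def weight_vram_gb(params_billions: float, precision: str = "int8") -> float:
        """GB of VRAM to hold the weights alone (no KV cache / activations)."""
        return params_billions * BYTES_PER_PARAM[precision]

    for size in (7, 13, 70):
        print(f"{size}B params: "
              f"fp16={weight_vram_gb(size, 'fp16'):.0f}GB, "
              f"int8={weight_vram_gb(size, 'int8'):.0f}GB, "
              f"int4={weight_vram_gb(size, 'int4'):.0f}GB")
    ```

    A 70B model at fp16 wants ~140GB of VRAM, and ~70GB even at int8, which is why the 80-128GB ask comes up at all.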

    Short of a specific proposal, I’m going to classify this as not knowing enough about the problem for your suggestion to make sense.