• 2 Posts
  • 130 Comments
Joined 6 months ago
Cake day: March 22nd, 2024

  • The problem is that splitting models up over a network, even over LAN, is not super efficient. The entire set of weights has to be run through for every token, which is roughly every half word.

    And the other problem is that Petals just can’t keep up with the crazy dev pace of the LLM community. Honestly, they should dump it and fork or contribute to llama.cpp or exllama, as TBH no one wants to split up Llama 2 (or even Llama 3) 70B and be a generation or two behind, with a base instruct model instead of a finetune.

    Even the horde has very few hosts relative to users, even though hosting a small model on a 6GB GPU would get you lots of karma.

    The diffusion community is very different, as the output is one image and even the largest open models are much smaller. LoRA usage is also standardized there, while it is not in LLM land.
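To put rough numbers on the first point above, here is a back-of-envelope sketch of decoding with a model whose layers are split across hosts in sequence. Everything in it is an assumption for illustration: the function name, the 50 ms compute figure, and the hop latencies are made up, and it ignores batching and any tricks a real system might use.

```python
def split_decode_estimate(n_hosts, compute_ms_per_token, hop_latency_ms):
    """Rough decode speed when a model's layers are split across hosts in sequence.

    Every generated token has to pass through every host in order, so the
    per-token time is the total compute plus one network hop per host
    (the last hop returns the result to the first host).
    Returns (tokens per second, fraction of time each host is busy).
    """
    hops = n_hosts if n_hosts > 1 else 0
    per_token_ms = compute_ms_per_token + hops * hop_latency_ms
    tokens_per_s = 1000.0 / per_token_ms
    # Each host only computes its own slice, then waits for the others.
    utilization = (compute_ms_per_token / n_hosts) / per_token_ms
    return tokens_per_s, utilization


# Hypothetical numbers: 50 ms of total compute per token.
single_tps, single_util = split_decode_estimate(1, 50.0, 0.0)
lan_tps, lan_util = split_decode_estimate(4, 50.0, 1.0)    # assumed 1 ms LAN hop
wan_tps, wan_util = split_decode_estimate(4, 50.0, 40.0)   # assumed 40 ms internet hop

print(f"single host: {single_tps:.1f} tok/s, {single_util:.0%} busy")
print(f"4 hosts LAN: {lan_tps:.1f} tok/s, {lan_util:.0%} busy")
print(f"4 hosts WAN: {wan_tps:.1f} tok/s, {wan_util:.0%} busy")
```

Under these assumptions the split never makes a single stream faster. It only buys memory capacity, and each host sits idle most of the time, which gets much worse once the hops go over the internet.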









  • Jokes aside (and this whole AI search results thing is a joke) this seems like an artifact of sampling and tokenization.

    I wouldn’t be surprised if the Gemini tokens for XTX are “XT” and “X” or something like that, so it’s got quite a chance of mixing them up after it writes out XT. Add in sampling (literally randomizing the token outputs a little), and I’m surprised it gets any of it right.
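As a toy illustration of the sampling part, the sketch below draws the next token after "XT" from a softmax over made-up scores. The candidate tokens and logits are hypothetical; nothing here reflects Gemini's actual tokenizer or sampler settings.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next(candidates, logits, temperature, rng):
    """Pick one candidate token at random, weighted by its softmax probability."""
    probs = softmax(logits, temperature)
    return rng.choices(candidates, weights=probs, k=1)[0]


# Hypothetical continuation candidates after the model has written "XT".
candidates = ["X", " ", ","]
logits = [2.0, 1.0, 0.5]  # made-up scores, "X" is the favorite

rng = random.Random(0)
draws = [sample_next(candidates, logits, 1.0, rng) for _ in range(1000)]
share_x = draws.count("X") / len(draws)
print(f"'X' chosen in {share_x:.0%} of draws at temperature 1.0")
print(f"p('X') at temperature 0.1: {softmax(logits, 0.1)[0]:.4f}")
```

Even when the right token is the clear favorite, sampling at temperature 1.0 picks something else a good fraction of the time, so "XTX" can easily come out as "XT".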





  • but what am I realistically looking at being able to run locally that won’t go above like 60-75% usage so I can still eventually get a couple game servers, network storage, and Jellyfin working?

    Honestly, not much. Llama 8B, but very slowly, or maybe DeepSeek V2 chat, with prompt processing on the 270 via Vulkan but mostly running on the CPU. And I guess just limit it to 6 threads? I’d host it with kobold.cpp Vulkan, or maybe the llama.cpp server if there will be multiple users.

    You can try them to see if they feel OK, but LLMs are just not something that likes old hardware. An RTX 3060 (or a Mac, or a 12GB+ AMD GPU) is considered the bare minimum in the community, and a 3090 or 7900 XTX the standard.



  • On my G14, I just use the ROG utility to disable turbo and make some kernel tweaks. I’ve used ryzenadj before, but it’s been a while. And yes, I measured battery drain in the terminal (but again, it’s been a while).

    Also throttling often produces the opposite result in terms of extended battery life as it likely takes more time in the higher states to do the same amount of work whereas running at a faster clock speed, the work is completed faster and the CPU returns to a lower less energy using state quicker and resides there more of the time.

    “Race to sleep” is true to some extent, but after a certain point the extra voltage one needs for higher clocks dramatically outweighs the benefit of the CPU sleeping longer. Modern CPUs turbo to ridiculously inefficient frequencies by default before they thermally throttle themselves.
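The trade-off can be sketched with the usual approximation that dynamic power scales as C·V²·f. Every number below (effective capacitance, static and sleep power, the three frequency/voltage points) is a made-up assumption chosen to show the shape of the curve, not a measurement of any real CPU.

```python
def energy_joules(freq_hz, volts, cycles, window_s,
                  c_eff=1e-9, p_static_w=2.0, p_sleep_w=0.2):
    """Energy to run a fixed number of cycles, then sleep for the rest of the window.

    Dynamic power is approximated as c_eff * V^2 * f. While active, the chip
    also burns p_static_w; once the work is done it drops to p_sleep_w.
    All default constants are hypothetical.
    """
    active_s = cycles / freq_hz
    if active_s > window_s:
        raise ValueError("work does not fit in the time window")
    p_dynamic_w = c_eff * volts ** 2 * freq_hz
    active_energy = (p_dynamic_w + p_static_w) * active_s
    sleep_energy = p_sleep_w * (window_s - active_s)
    return active_energy + sleep_energy


work_cycles = 9e9   # fixed amount of work
window = 10.0       # seconds available

# Hypothetical operating points: higher clocks need disproportionately more voltage.
slow = energy_joules(1.0e9, 0.70, work_cycles, window)
mid = energy_joules(3.0e9, 0.95, work_cycles, window)
turbo = energy_joules(4.5e9, 1.30, work_cycles, window)

print(f"1.0 GHz @ 0.70 V: {slow:.2f} J")
print(f"3.0 GHz @ 0.95 V: {mid:.2f} J")
print(f"4.5 GHz @ 1.30 V: {turbo:.2f} J")
```

With these assumed numbers the middle point uses the least energy: racing to sleep beats crawling at the lowest clock, but the voltage needed for the turbo clock costs more than the extra sleep time saves.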