  • Just that bursts of inference for a small model on a phone or even a desktop are less power hungry than a huge model on A100/H100 servers. The local hardware is already powered on anyway, and (even with the efficiency advantage of batching) Nvidia runs its cloud GPUs at crazy inefficient voltages/power bands just to get more raw performance per chip and squeeze out more interactive gains, while phones and such run at extremely efficient voltages.
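
    As a back-of-envelope illustration of that energy argument, here is a tiny sketch; every wattage and throughput figure in it is a made-up placeholder chosen only to show the arithmetic, not a measurement:

    ```python
    # Toy energy-per-token arithmetic. Every number below is a hypothetical
    # placeholder picked only to illustrate the calculation, not a benchmark.

    def joules_per_token(power_watts: float, tokens_per_second: float) -> float:
        """Energy per generated token = power draw / decode throughput."""
        return power_watts / tokens_per_second

    # Hypothetical phone NPU running a small model at an efficient voltage.
    phone = joules_per_token(power_watts=6.0, tokens_per_second=12.0)

    # Hypothetical datacenter GPU pushed into an aggressive power band serving
    # a much larger model; batching helps, so assume the draw is split across
    # 16 concurrent users.
    server = joules_per_token(power_watts=500.0 / 16, tokens_per_second=30.0)

    print(f"phone:  {phone:.2f} J/token")
    print(f"server: {server:.2f} J/token")
    ```

    The only point of the arithmetic is that perf-per-watt, not raw throughput, is what decides this comparison.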

    There are also lots of tricks that can help “local” models, like speculative decoding or (theoretically) bitnet models, that aren’t as useful for cloud serving.
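
    To make the speculative decoding point concrete, here is a minimal toy sketch of the greedy variant. `draft_next` and `target_next` are made-up stand-ins for a small and a large model (not any real library API); in a real deployment the verification step is a single batched forward pass of the big model, which is where the latency win comes from.

    ```python
    # Toy sketch of greedy speculative decoding. draft_next / target_next are
    # hypothetical stand-ins for a cheap and an expensive model over a tiny
    # integer "vocabulary" -- the draft/verify logic is the real point.

    def target_next(prefix):
        # Pretend "big model": deterministic greedy next token.
        return (sum(prefix) * 3 + 1) % 10

    def draft_next(prefix):
        # Pretend "small model": usually agrees with the target,
        # but diverges whenever the prefix sum is divisible by 4.
        guess = (sum(prefix) * 3 + 1) % 10
        return (guess + 1) % 10 if sum(prefix) % 4 == 0 else guess

    def speculative_decode(prompt, num_tokens, k=4):
        """Draft k tokens with the cheap model, then let the expensive model
        verify them; keep the longest matching prefix plus one corrected or
        bonus token per round. Output matches plain greedy decoding with the
        big model alone."""
        out = list(prompt)
        rounds = 0
        while len(out) - len(prompt) < num_tokens:
            rounds += 1

            # 1) Draft k tokens cheaply.
            draft, ctx = [], list(out)
            for _ in range(k):
                t = draft_next(ctx)
                draft.append(t)
                ctx.append(t)

            # 2) Verify. In a real LLM this loop is one batched forward pass
            #    of the big model over all drafted positions.
            accepted, ctx = [], list(out)
            for t in draft:
                expected = target_next(ctx)
                if t == expected:
                    accepted.append(t)
                    ctx.append(t)
                else:
                    # First mismatch: keep the big model's token and stop.
                    accepted.append(expected)
                    break
            else:
                # Every draft token accepted: take a free bonus token too.
                accepted.append(target_next(ctx))

            out.extend(accepted)

        return out[len(prompt):len(prompt) + num_tokens], rounds

    tokens, rounds = speculative_decode([1, 2], num_tokens=12, k=4)
    print(f"{len(tokens)} tokens in {rounds} verification rounds")
    ```

    When the small model guesses well, several tokens land per verification round, which is the whole speedup.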

    Also… GPT-4 is very inefficient. Open 32B models nearly match it at a fraction of the power usage and cost, even when served from datacenters. OpenAI kind of sucks now, but the broader public hasn’t caught on yet.