to even get speeds matching chatgpt during peak hours, my (admittedly underpowered) laptop has to run ~7b models... quantized to like, 4 bits¹. AND gpt-3.5-turbo is clearly much bigger than 7b.
AND now gpt-4o is available to free users (albeit with a usage limit).
i highly doubt what they earn is enough to cover all of that. i remember chatgpt's free plan used to be a temporary thing. they probably decided the optimal way to grow is to be unsustainable, so they kept it free in order to stay relevant.
and it has a ripple effect too. the standards for free LLM services have been significantly raised. sorry to "back in my day", but back in MY day, before techbros discovered LLMs, all we had was AI Dungeon's gpt-2 based model (1.5b)².
i don't understand big tech.
-
¹ tbh the main limitation is my vram. the layers have to be split roughly 50/50 between ram+cpu and vram+gpu.
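for a rough sense of why: some back-of-envelope math on the weight footprint of a 4-bit 7b model. the numbers here are illustrative (real quantized formats like gguf carry per-layer overhead, and the kv cache and activations need memory on top), and the free-vram figure is a made-up example, not my actual hardware:

```python
# rough weight-memory math for a ~7b model at 4-bit quantization
# (illustrative only; real formats like gguf add per-layer overhead,
# and the kv cache + activations need extra memory on top of this)
params = 7_000_000_000
bytes_per_param = 0.5              # 4 bits per weight
weights_gib = params * bytes_per_param / 1024**3
print(f"weights alone: ~{weights_gib:.1f} GiB")

# with ~1.6 GiB of vram free for weights (hypothetical figure),
# only about half the layers fit on the gpu; the rest stay in system ram
free_vram_gib = 1.6
gpu_share = min(1.0, free_vram_gib / weights_gib)
print(f"gpu can hold ~{gpu_share:.0%} of the layers")
```

which lands around 3.3 GiB of weights, so a gpu with only a couple GiB to spare ends up holding about half the layers.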
-
² you could run gpt-neox or gpt-j based models through KoboldAI on google colab, but i'm referring to what companies have to offer.




