firstly you have the absurd primacy of the transformer - or really, of scaled dot-product multi-head self-attention, because the real magic happens in the feedforward layers, but fuck if we know what's going on there. you have super-specialized kernels for that attention mechanism. you have dedicated hardware for that little fucker. the transformer has sucked all the oxygen out of the room, and in the future i have no doubt this will be viewed as a baffling misstep - why on earth did we waste so much time and energy on this one architecture that clearly wasn't going to deliver what everyone is loudly promising? (prediction: openai's o1 model will be regarded in retrospect as the moment this bubble burst)
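for the uninitiated: the mechanism all those kernels and all that silicon exist to serve is small enough to sketch in a few lines. a minimal NumPy version of scaled dot-product attention, with made-up dimensions (nothing here is from any particular model):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # pairwise similarity logits
    scores -= scores.max(axis=-1, keepdims=True)  # subtract max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                            # weighted sum of values

# toy shapes: 4 positions, head dimension 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8)
```

multi-head attention is just this, run several times in parallel on projected slices and concatenated - which is the whole trick the hardware is built around.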
from which proceeds the argument that yann lecun keeps making (to the point where it's becoming a carthago delenda est bit; he's right, though) - an autoregressive objective with fixed compute at each timestep is fucked. writing requires introspection. word choices cause us to re-evaluate our previous statements. we see connections only as we articulate them. language is a surface form, and as a consequence LLMs are all surface.
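the "fixed compute" complaint is easy to make concrete. a toy stand-in for a decoder's forward pass, where every step costs exactly the same FLOPs whether the next token is trivial or would, in a human writer, trigger a rewrite of everything before it (all dimensions and weights here are made up for illustration):

```python
import numpy as np

VOCAB, D = 50, 16
rng = np.random.default_rng(0)
W_embed = rng.normal(size=(VOCAB, D))  # token embeddings
W_out = rng.normal(size=(D, VOCAB))    # output projection

def forward(context):
    """one fixed-cost step: pool the context, project to next-token logits.
    the same arithmetic runs regardless of how 'hard' the next word is."""
    h = W_embed[context].mean(axis=0)
    return h @ W_out

tokens = [0]
for _ in range(5):
    logits = forward(np.array(tokens))
    tokens.append(int(logits.argmax()))  # commit and move on - no backtracking
print(tokens)
```

note what's missing: there is no path by which step six can revise step two. the model emits and commits, one token at a time.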
crucially, creative writing in particular trades on a kind of 'deliberate confusion' that emerges from our nature as pattern-seeking creatures. it arises from direct experience of the world and can be mimicked only superficially. you can't, as wittgenstein (IIRC) says, retrieve from a piece of music the interiority that engendered its composition; nor can you retrieve from it what it makes people feel. you can't be creative if all you can do is follow a set of implicit rules. as vincent persichetti puts it (emphasis mine):
Any tone can succeed any other tone, any tone can sound simultaneously with any other tone or tones, and any group of tones can be followed by any other group of tones, just as any degree of tension or nuance can occur in any medium under any kind of stress or duration. Successful projection will depend upon the contextual and formal conditions that prevail, and upon the skill and the soul of the composer.
if you have no interiority, and all you're doing is sampling step by step from a multinomial distribution over the next word, you're not going to project anything. multimodality won't save you. expression emerges from embodied cognition.
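for concreteness, the step being criticized really is this small - softmax some logits into a categorical (multinomial) distribution, draw once, repeat. the logits below are random, not from any real model:

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=10)          # hypothetical vocab of 10 "words"
probs = np.exp(logits - logits.max())
probs /= probs.sum()                  # softmax -> multinomial parameters

next_token = rng.choice(len(probs), p=probs)  # one draw, then on to the next
print(int(next_token))
```

that draw is the entire "decision". whatever skill-and-soul projection persichetti is describing, it is not located in this loop.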
i believe these problems can be 'fixed', to an extent, for the current generation of LLMs (subordinating them as parts of larger systems, o1-style RL or Quiet-STaR, idk, maybe taking a cursory look at the past fifty years of linguistics and computational narratology, etc) - but i also don't believe that anyone in a position to do this gives enough of a shit, which is hilarious