I don't always say this, but Bertrand Meyer is right:
AI in its modern form, however, does not generate correct programs: it generates programs inferred from many earlier programs it has seen. These programs look correct but have no guarantee of correctness. (I am talking about "modern" AI to distinguish it from the earlier kind—largely considered to have failed—which tried to reproduce human logical thinking, for example through expert systems. Today's AI works by statistical inference.)
The phrase that Meyer doesn't use in his article but that's dying to get out is "confidently wrong".
Meyer uses the example of binary search, an algorithm that seems simple but is notoriously difficult to implement correctly, even for experts. Each time he interacted with ChatGPT, he pointed out a bug in the program the chatbot had previously generated. Each time, the bot acknowledged the bug and offered a new solution that avoided it but introduced fresh bugs of its own. To personify the bot's output: it was willing to admit it was wrong, but only about the mistakes Meyer had already pointed out. The onus was entirely on the human interacting with the bot to be skeptical about the results.
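To see why binary search earns its reputation, it helps to look at the places where implementations typically go wrong. The sketch below is mine, not one of Meyer's transcripts: a correct Python version annotated with the classic traps, any one of which yields code that looks right and passes casual testing.

```python
def binary_search(xs, target):
    """Return an index of target in the sorted list xs, or -1 if absent."""
    lo, hi = 0, len(xs) - 1        # trap 1: hi = len(xs) pairs wrongly with the loop test below
    while lo <= hi:                # trap 2: 'lo < hi' silently skips the last remaining candidate
        mid = lo + (hi - lo) // 2  # trap 3: in fixed-width languages, (lo + hi) / 2 can overflow
        if xs[mid] == target:
            return mid
        elif xs[mid] < target:
            lo = mid + 1           # trap 4: 'lo = mid' here loops forever on a two-element range
        else:
            hi = mid - 1           # trap 5: 'hi = mid' with this loop test also never terminates
    return -1
```

None of these mistakes is exotic; each variant has shipped in real, widely used code.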
Meyer, a computer scientist who specializes in programming languages and formal verification, is primed to be skeptical. I've worked on formal methods for verifying programming languages, and it's an exercise in both skepticism and self-doubt. Proving that 1 + 1 = 2 can make you question everything you think you know. Formal theorem-proving tools don't offer an alternate proof when yours is wrong; they just tell you "no" until the answer is "yes".
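To make the flavor concrete, here's what that looks like in a proof assistant such as Lean (my choice for illustration; nothing here comes from Meyer's article). Even the most obvious fact has to be stated and checked, and a wrong statement gets nothing back but a refusal.

```lean
-- Lean 4: even 1 + 1 = 2 must be stated as a theorem and accepted by the checker.
example : 1 + 1 = 2 := rfl

-- State something false and the checker simply rejects the proof; it offers
-- no hint about what you should have claimed instead.
-- example : 1 + 1 = 3 := rfl   -- does not type-check
```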
Proof assistants aren't for everyone, at least not in their current state. Think about a third way to get help writing programs: working with an experienced computer science tutor or teacher. A good tutor won't do your homework for you; they may not even know how to solve your problem. Instead, they'll ask you questions to help you debug your own thinking. Tell me how you arrived at this conclusion. What happens if the input is this? And so on.
Theorem provers and teachers both work by increasing doubt, not decreasing it. (Though a good teacher is a lot more pleasant to interact with.) A theorem prover forces you to ask yourself: why might this be wrong? What counterexamples did I ignore? You can't walk away from the experience without a keen awareness of the limits of your own cognition. A teacher asks questions until you internalize the teacher and can ask yourself your own questions.
If you're a working programmer, automated tools (whether they are statistical guess-bots like ChatGPT, or carefully crafted systems based on deep understanding of the meaning of a programming language, like Coq or Isabelle) can only take you so far; there are inherent theoretical limits as well as practical ones. Teaching and mentoring exist within software companies and projects, but a fourth option is pair programming or other forms of peer collaboration. While peers teach and learn from each other, they also share a common goal and (hopefully) offer each other tentative solutions. People make mistakes. A fan of LLMs might ask: "How is it any different to work with a fallible person than to use a computer program that sometimes gives wrong answers?"
I don't mind working with a person who's sloppy. They might still have ideas that wouldn't have occurred to me, or spot problems I wouldn't notice as quickly. Their mistakes don't lessen the value of those contributions.
I do mind working with a person who's confidently wrong. Confident wrongness has a corrosive, demoralizing effect on me, and even if such a person is right 95% of the time, the extent to which they destroy my current and future productivity negates anything they could contribute to my work.
As Edsger Dijkstra said, testing can reveal the presence of bugs, but never their absence. This is a reason why LLMs, when applied to the domain of programming, are machines for being confidently wrong. As a programmer, your ability to assure the correctness of your code through testing is limited by your imagination for generating tests. If you believe you've tested your code exhaustively but actually missed one or many edge cases, ChatGPT won't help you. When using a tool like this, the onus is entirely on you to remind yourself, again and again, that maybe you haven't thought of everything.
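To make that concrete with the running binary-search example (again mine, not Meyer's, reusing the binary_search sketch from earlier): a test suite can feel thorough while skipping exactly the inputs where the classic bugs live.

```python
# Tests that feel adequate but exercise only the comfortable middle of the input space.
def test_binary_search_obvious():
    assert binary_search([1, 3, 5, 7], 5) == 2     # present
    assert binary_search([1, 3, 5, 7], 8) == -1    # absent, larger than everything

# The cases people forget to imagine, and where broken implementations actually fail.
def test_binary_search_forgotten():
    assert binary_search([], 1) == -1              # empty input
    assert binary_search([4], 4) == 0              # single element, present
    assert binary_search([4], 5) == -1             # single element, absent
    assert binary_search([1, 3, 5, 7], 0) == -1    # smaller than everything
    assert binary_search([2, 2, 2], 2) in {0, 1, 2}  # duplicates: which index counts as correct?
```

Passing all of these still proves nothing in Dijkstra's sense; it only shows that nobody has imagined the failing input yet.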
Could we take an entirely different approach and automate teaching? Plenty of people have tried. So far, teachers still exist, because teaching is not just about subject knowledge or pattern-matching, but relationships and trust. It's often easy to give a correct answer, but delivering it in a way that will actually be heard requires you to understand who you're talking to and how they're feeling right now. This is the skill that differentiates more helpful teachers from less helpful ones, and why teaching is still a job that people can get paid to do. In principle, there's nothing wrong with trying to automate teaching, at least nothing more wrong about it than trying to automate any other job.
But what we don't need is a machine that automates "confidently wrong". We have enough of that in tech already. Skepticism is a two-way street: I am not gullible if I trust my confidently-wrong co-worker and let them lead me into unsoundness. I am obligated to be skeptical, but they are also obligated to be open with me about their own doubts and limitations. In general, that's how human relationships go. We expect other humans to be self-critical, though that expectation is modulated by class, age, race, gender, and a myriad of other social categories. We expect nothing of the sort from machines, even though a program is just an encoding of what one or many people were thinking and feeling (both consciously and unconsciously) when they wrote it.
If an LLM is just a fancy search engine, an exercise in information retrieval rather than knowledge production, then you might also ask: "What's the difference between Googling for code examples to copy and paste, and asking ChatGPT to find those examples for you?" If you do a Google search for "binary search program", the top result might also be wrong, after all. The difference is that it takes very little experience to learn that the code you Googled for is probably wrong. For all its faults, Google does not explicitly editorialize about search results. If you ever happen to talk to someone who does public relations for Google search, they will go out of their way to tell you that search results are automated and that Google makes no truth claims about them. So far, I do not see people applying the same level of skepticism to the output of LLMs. Perhaps this is because LLMs are a newer technology. But I think there's more to it: because an LLM generates a (syntactically) coherent natural-language response, not just a list of disparate results, you construct your own meaning out of what you see. When you research something by querying a search engine and find ten different answers (or, for that matter, find ten different books about it in a physical library), it is up to you to assemble them into a coherent story. Because you construct that story more quickly and less consciously when you feel like you're having a conversation with a single person, it's easier to attribute the story to the other (an authoritative other, no less) than to yourself.
So far I've done my best to emulate Meyer and focus on the differences between using LLMs to generate text in a programming language and using them to generate text in a natural language. Programming languages have tiny, controlled vocabularies, rigid syntactic and semantic rules, and meanings that can be described mathematically. It is hard to overstate the differences between programming languages and natural languages. At the same time, programming is often misconceived as solely an exercise in getting machines to do what you want, rather than an exercise in precise communication (albeit of a limited set of concepts) between humans. I think Meyer is right to emphasize the value of mathematical proof in assuring the correctness of programs. Proof is also a social process, and proofs are analogous to programs. Theorem-proving and proof-checking tools based on domain knowledge of mathematics are arguably more efficient and reliable than LLMs based on statistical models of natural language, but natural language is never out of the loop, because it is the medium for proof. If programming is a social process, what happens when increasingly autonomous(-seeming) programs enter the chat?
If ChatGPT is crafted to be polite and friendly, but not to communicate self-doubt, perhaps that's an expression of the invisible hand of the free market. It doesn't really matter whether this happened through automated feedback loops (users engaging more when they receive responses that create positive affect) or manual training. Cult leaders and authoritarian politicians succeed for reasons. One of those reasons is that many people crave sources of both unconditional love and unlimited power. Knowledge is power. Not to get too vulgar-Freudian here: what if people find chatbots compelling because they fill the void left by an inadequate or nonexistent parent (or, for that matter, a god whose existence is in doubt)? A confidently wrong machine, a paternalistic machine, is one that creates and reinforces dependency. Another model of parenting is to guide your children to the point where they no longer need you. What would a chatbot based on that model look like? I don't know, and I also don't know how you would get investors to buy into a product that contains the seeds of its own destruction.
