Tech Tonic

This is an audio transcript of the Tech Tonic podcast episode: ‘Superintelligent AI — can chatbots think?’

[AUDIO CLIP PLAYING]

Madhumita Murgia
OK. So I’m going to start this episode with a story. It dates back to the end of the 19th century, and it takes place in Germany. It’s a true story about a horse — a very clever horse named Hans. Hans is no ordinary horse. He’s apparently able to add, subtract and multiply. He seems to understand fractions. He can even do more complex mathematical problems. What’s the square root of 16? His proud trainer asks him in front of a large and curious crowd. Hans taps his hoof four times.

[AUDIO CLIP PLAYING]

He can tap out the date and time of day. He recognises numbers written on a blackboard. Spectators are in awe. Can this animal really do maths like a human? Nine out of 10 times the horse would come up with the correct answer to his trainer’s questions. That was until a leading psychologist of the day got involved. He placed Hans far away from his trainer and asked him the same questions. This time Hans struggled.

[AUDIO CLIP PLAYING]

Madhumita Murgia
He got his answers mostly wrong. It turns out Hans was responding to subtle cues from his trainer all along — cues that even his trainer wasn’t aware of giving. Now, the reason I’m telling you the Clever Hans story is because for years many experts were fooled, and some people in the field of artificial intelligence think something similar may be happening in the race to achieve machine superintelligence. Sure, generative AI systems like ChatGPT appear to be doing some really impressive human-level cognitive feats. But are they actually thinking or reasoning? 

[MUSIC PLAYING]

John Thornhill
This is Tech Tonic from the Financial Times. I’m John Thornhill. 

Madhumita Murgia
And I’m Madhumita Murgia. 

John Thornhill
Over the past year, artificial intelligence has come along in leaps and bounds, so much so that even its creators have been left surprised. The newest chatbots like ChatGPT can write essays, create images and much, much more. 

Madhumita Murgia
Experts in the field say that’s a sign we’re on the path to achieving superintelligence — machines that can do everything a human can, but better and faster. 

John Thornhill
And some now warn that AI poses an existential risk that it might threaten the very survival of the human race. So in this season of Tech Tonic, we’re asking whether we really are close to reaching human-level AI. And if so, how worried should we be?

Madhumita Murgia
Today, are generative AI systems like ChatGPT really intelligent? Or are we being deceived by appearances?

We’ll come back to Clever Hans a little later. No one knows exactly what happened to the horse, but there are reports it didn’t end well. 

John Thornhill
First, though, there are good reasons why some people look at generative AI systems like ChatGPT and worry. It turns out that the newest versions like GPT-4 are surprisingly good at solving what look like quite complex problems. 

Pablo Arredondo
What we’ve seen from GPT-4 so far in terms of the outputs is often indistinguishable from human reasoning. 

John Thornhill
That’s Pablo Arredondo. He works for a tech company called Casetext, which develops AI tools for lawyers. Earlier this year, Arredondo and some other researchers decided to find out if the latest version of OpenAI’s chatbot could pass the bar exam. It’s a really tough test you need to complete to become a lawyer in the US. 

Pablo Arredondo
It’s designed to test not just sort of memorisation over legal rules and doctrines, but also your ability to apply them into new contexts. In fact, in some ways you might say it’s the hardest thing that you need to do in order to become a lawyer. 

John Thornhill
Forty per cent of would-be lawyers typically fail the bar exam. But GPT-4 aced it. The model demonstrated not just an incredible knowledge of the law, but also the ability to use it in understanding complex cases. 

Pablo Arredondo
That is very much what we call legal reasoning. And to see GPT-4, to see an AI able to do that is quite remarkable. 

John Thornhill
An example: one of the questions put to GPT-4 in the bar exam involved a hypothetical criminal trial. A defendant is accused of murder. The prosecution wants to show the jury a picture of the defendant’s arm because it depicts a gang insignia, basically a tattoo that’s highly correlated with being in a gang. 

Pablo Arredondo
And the defence objects and wants to say that this is inadmissible character evidence, right? You can’t just tell the jury he’s a gang member. They’ll be biased, right? They won’t, you know, treat him fairly. 

John Thornhill
So the AI had to decide whether the picture of the accused man’s tattoo was admissible evidence. 

Pablo Arredondo
What you need to do there is first recall that there are specific rules of evidence that apply and recall then, like, what are the elements of this rule? What are the exceptions? What are the loopholes? So in the example, the prosecution was alleging that the reason this attempted murder occurred was gang retaliation. So they were saying, it’s not that we’re trying to show he’s a bad guy generally; we’re trying to show that, you know, his specific motivation for this is backed up by this tattoo, suggesting he is, in fact, in a gang. And so there it actually would be admissible. 

John Thornhill
And ChatGPT nailed the answer. It got it right. It said it was admissible. 

Pablo Arredondo
So what you’re doing there is what lawyers do every day, right. What you’re having to do there is look at facts, weigh them against each other, take things like testimony and then apply a legal framework to it and draw the conclusion. 

John Thornhill
It’s hard not to be impressed. These are questions that even trained lawyers struggle with. And then when Pablo Arredondo started to apply GPT-4 to real-life legal work, it continued to impress. 

Pablo Arredondo
You know, I tried to break it. I tried to find the ways that I could artificially make it mess up. 

John Thornhill
One of the ways Arredondo did this was through the help of a particularly notable controversy — the Enron accounting scandal. 

News clips
The House Financial Services Committee will begin the first congressional hearings on the Enron debacle tomorrow morning . . . Enron employees, some carrying boxes filled with belongings, leave their downtown headquarters huddled against the cold and an uncertain future . . .  

John Thornhill
When the energy trading firm went bust, tens of thousands of documents, emails and testimonies from the company were made public.

News clip
Employee retirement funds were wiped out, while at the same time top executives were personally making millions. Today, the committee asked Enron CEO Ken Lay to turn over thousands of documents . . .

John Thornhill
So Arredondo decided to make GPT-4 go through the files to see what it would come up with. 

Pablo Arredondo
This is something called document review e-discovery. It’s one of the most expensive parts of litigation. And we asked it something very general like, you know, find instances of someone discussing fraud. Now, if you were doing that with keyword search, right, there would be ‘fraud’, you know, someone might say ‘malfeasance’. There might be, you know, synonyms and things like that, but that would be quite limited. 

John Thornhill
So you might try and find words linked to fraud. But the thing is, people trying to commit fraud don’t tend to talk about fraud. What a really good human lawyer would do is look for inference, for nuance. And it turned out that GPT-4 was really good at this, too. 

Pablo Arredondo
The one that really struck us, the one that I think really kind of like raised our eyebrows was it found an email where somebody thought they were being cute and they were describing Enron fraud using a Sesame Street analogy and saying something like, me Cookie Monster, me have two cookies but upcharge 2.5 cookies, now me keep extra cookies, or something like that. And it was able to say, look, this appears to be somebody making a joke about fraud using these cookies as a disguise for actual monetary entities. And that was when we were like, this is just a new world. 
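
To see why a plain keyword filter would have struggled with an email like that one, here is a minimal, invented sketch in Python. The emails and the keyword list are made up purely for illustration; this is not Casetext’s actual e-discovery pipeline, and it is not how GPT-4 works.

```python
# Toy illustration: a naive keyword filter over a handful of documents.
# The emails and keywords below are invented for this example; real
# e-discovery tooling is far more sophisticated than this sketch.

emails = [
    "We should disclose the malfeasance in the Q3 filings.",
    "Me Cookie Monster. Me have two cookies but upcharge 2.5 cookies, "
    "now me keep extra cookies.",
]

keywords = {"fraud", "malfeasance", "misconduct"}

def keyword_hits(text: str, terms: set[str]) -> set[str]:
    """Return which search terms literally appear in the text."""
    lowered = text.lower()
    return {term for term in terms if term in lowered}

for email in emails:
    print(f"{email[:40]!r}... -> matched: {keyword_hits(email, keywords) or 'nothing'}")

# The first email matches "malfeasance"; the second, the joking Sesame
# Street analogy, matches nothing at all, even though a human reader
# (or a capable language model) can infer it describes skimming money.
```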

John Thornhill
You can imagine lawyers listening to Pablo Arredondo and thinking maybe I should change career. 

Madhumita Murgia
But what exactly was going on inside the machine here? Was this an AI engaging in human-level reasoning? 

John Thornhill
That’s the question. It definitely looks like human reasoning. It looks like a capable lawyer. But is it actually doing those cognitive processes? Is it really thinking like a human? There are definitely people who think the answer to that question is no. 

Emily Bender
My name is Emily M Bender. I’m a professor of linguistics at the University of Washington. I work in computational linguistics, which is getting computers to deal with human languages. And I’ve spent a lot of time recently looking at the societal impacts of language technology. 

John Thornhill
Bender argues that what AI systems are doing may look like reasoning. It may even feel like reasoning, but that’s exactly what it isn’t. 

Emily Bender
So ChatGPT is, at its base, a large language model. That means it is a system designed to take a bunch of text as input and model it in terms of word distributions. Which word is most likely to come next given some prefix string? 

John Thornhill
Chatbots, Bender says, are very good at stringing words together in a way that makes the output intelligible and pleasing to the humans who are using it. But it’s just a more sophisticated form of predictive text of the kind you find on your smartphone. And predictive text has been around for decades. The only thing that’s changed more recently is that these chatbots are being trained on much, much larger volumes of information than ever before. 
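
As a rough illustration of what it means to model text ‘in terms of word distributions’, here is a toy sketch of next-word prediction: a bigram counter built from a few sentences, with the next word sampled from the resulting distribution. Real large language models learn far richer statistics with neural networks trained on vastly more text, so treat this as an analogy rather than a description of ChatGPT’s internals.

```python
import random
from collections import Counter, defaultdict

# Toy "language model": count which word follows which in a tiny corpus,
# then sample the next word from that distribution. The corpus is invented;
# the point is only the objective, predicting the next token given a prefix.

corpus = (
    "the horse taps his hoof . the horse taps four times . "
    "the trainer asks the horse a question ."
).split()

following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(word: str) -> str:
    """Sample the next word in proportion to how often it followed `word`."""
    counts = following[word]
    words, weights = zip(*counts.items())
    return random.choices(words, weights=weights)[0]

word = "the"
generated = [word]
for _ in range(6):
    word = predict_next(word)
    generated.append(word)

print(" ".join(generated))
# Prints something plausible-looking such as "the horse taps four times . the",
# but the model has no idea what a horse or a hoof is; it only knows which
# words tended to follow which. The sampling step is also why the output is
# "stochastic": repeated runs give different continuations.
```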

Emily Bender
What’s different about this is that it is so big in terms of how much training that it has and it has so much conversational data in it that it seems to be able to carry on a conversation. 

John Thornhill
It seems to be able to carry on a conversation. And it might look like it understands the conversation it’s having. But in reality, it doesn’t understand anything at all. 

Emily Bender
The reason I can tell you with complete confidence that it doesn’t understand is that from a linguistic perspective, if you drill down and look at languages and how they work, what you see is that they are pairings of form and meaning. 

Madhumita Murgia
Sure, the chatbot is good at the form and structure of language. It’s good at syntax and grammar. But it has no grasp of another essential ingredient to language acquisition — the meaning of the words it’s stringing together. 

John Thornhill
Bender’s been making this argument for years. In 2021, she co-wrote a famous paper with the title “On the Dangers of Stochastic Parrots”. It tried to explain what was really going on under the hood of large language models. Can you tell us what a stochastic parrot is? 

Emily Bender
Sure. A stochastic parrot is a metaphor that we came up with to try to make vivid both what is going on with a large language model and why it is that we mistake it for something else. We’re actually pulling on the English verb “to parrot”, which means to repeat back without understanding. Large language models can repeat back things from their training data without understanding. And stochastic means randomly but according to a probability distribution. So it’s not going to, in a very boring way, just come back with exact things that were in the training data, but stitch bits of it together haphazardly in a way that has been optimised over time to be plausible and pleasing to humans. 

John Thornhill
Bender says that’s very different to how humans learn language. When you and I learned to speak English, we did so by combining meaning and form. There’s a deep level of understanding. 

Madhumita Murgia
So this feels like a pretty profound point Bender is making. She’s saying that AI is just going through the motions, looking and matching patterns and forms. 

John Thornhill
Another point Bender makes: learning a language, learning not just its form but the meaning of its words, requires something still uniquely human — imagination. 

Emily Bender
When we understand language, we do it by imagining the point of view of the person who uttered whatever it is that we’re understanding, to sort of say to ourselves, what might they be trying to communicate that would lead them to say those words to help me figure out what they’re trying to communicate? That’s how we do it; it’s how we always do it. We do it reflexively. We do it immediately. 

John Thornhill
But our imagination is also what makes how we think about chatbots so confusing, because when we communicate with each other, we imagine a mind behind the words and we forget we’re communicating with a machine that has no mind. The result: we’re at risk of anthropomorphising the AI.

And in that sense, do you think the name itself, artificial intelligence, is a terrible misnomer and we’d get a lot less excited about it if we called it computational statistics? 

Emily Bender
Well, absolutely, yes. So I like to use phrases like automated pattern matching at scale and synthetic media extruding machines, which I think are better descriptions and far less sexy. The phrase artificial intelligence is a wishful mnemonic, that is, it’s a name that computer scientists use for a function that describes what they wish it did, rather than what it actually does. 

John Thornhill
It’s a viewpoint which, as you can imagine, puts her at odds with the AI industry. What she’s saying here is that if chatbots seem like they’re having a conversation, they’re not really understanding it. There is no internal thought process. And that also means that when a chatbot looks like it’s solving problems or passing a bar exam, there’s no actual reasoning going on behind the scenes. Bender thinks the technology is overhyped. 

Emily Bender
I put the blame with OpenAI who are overselling the technology and saying it’s something that it isn’t. Some people are using the fact that ChatGPT and the like can output plausible-sounding text as evidence for it being intelligent. And our point is to say actually the fact that it can output the seemingly plausible text is just a matter of pattern matching and doesn’t speak to intelligence at all. It is not evidence for intelligence. 

Madhumita Murgia
So if Emily Bender is right and chatbots like ChatGPT aren’t actually intelligent, what are they? Might they still become intelligent one day as the new models scale up? Or is the AI industry’s dream of achieving artificial general intelligence, essentially human-level cognition, completely overhyped? 

John Thornhill
That’s a good question. Have we all had the wool pulled over our eyes to some extent? Because Emily Bender is clearly not a lone voice. 

Melanie Mitchell
I think that these systems like ChatGPT and GPT-4 and so on are incredibly surprising and impressive. I don’t know if I want to call them intelligent. 

John Thornhill
Melanie Mitchell is an influential AI researcher. She says that one of the problems with artificial intelligence has always been how to define intelligence. 

Melanie Mitchell
I guess the question is, can we extrapolate in the future? Can we say like how smart machines are gonna be in the future? And that’s always been very hard. What is this target? What is this thing that we’re trying to get, this intelligence? 

John Thornhill
It’s not even a new question for the AI industry. Back in the 1970s, it was assumed that if you could get a computer to play chess at grandmaster level, then that would be a sign that artificial general intelligence was just around the corner. 

Melanie Mitchell
A lot of very, very smart people proposed that. But then in the ‘90s we saw Deep Blue beat Garry Kasparov. 

News clips
For people watching around the world, it is a slow motion moment . . . Deep Blue has defeated world champion Garry Kasparov in an absolutely stunning . . .  

Melanie Mitchell
But it wasn’t generally intelligent, and general intelligence did not come from that whole approach to AI of chess playing programs. I’m not even sure that AGI is really a coherent scientific concept that’s either verifiable or falsifiable. 

John Thornhill
Melanie Mitchell isn’t saying that artificial general intelligence is not possible. All she’s saying is that it may be far more difficult to achieve than many think, and that ChatGPT may lead us somewhere entirely different. 

Melanie Mitchell
Human intelligence has always been the inspiration for AI. When we think about AI, we think about human kind of reasoning and planning and learning and so on. It’s possible that there’s different kinds of intelligence, but that’s, again, not something we have any rigorous sort of theories about. So I don’t know if we need to understand human intelligence better to achieve AI. That’s been a question of the field forever. I think the fact that we don’t understand our own intelligence is making us rely very much on speculation rather than science. 

Madhumita Murgia
It makes you wonder, where did claims that ChatGPT could reason come from in the first place? 

Konstantine Arkoudas
It’s a subject that I’ve been interested in for a very, very long time, pretty much my entire adult life and most of my work, most of my own research work has been on reasoning. 

John Thornhill
That’s Konstantine Arkoudas. He spent most of his career getting large language models to do something called NLU — natural language understanding. He’s worked on Amazon’s Alexa, among other things. 

Konstantine Arkoudas
I never really thought that I would see in my lifetime a computational system that can carry out a completely arbitrary, open-ended conversation on just about any topic under the sun in a coherent and fairly rational way. So to see that actually happen, that was definitely a game-changing moment for sure. It’s a cliché, but it really was. 

John Thornhill
Arkoudas is clearly a fan of the technology, but even he thinks some of the claims about its intelligence are exaggerated. 

Konstantine Arkoudas
So I think the idea that GPT-4 can reason came from a number of sources. The first one being marketing claims. For instance, if you go to the OpenAI Playground, you’ll see GPT-4 described as a system that can perform advanced reasoning, to use their exact phrase. 

John Thornhill
Plus, he says the media hasn’t helped. Take the bar exam I mentioned earlier. Turns out the chatbot performed much better than its human peers because the batch of people testing alongside it had failed their bar exams at an earlier stage. 

Konstantine Arkoudas
And therefore they tend to have lower scores. So it’s much easier to do better against them. 

John Thornhill
In the same vein, there were frenzied reports that ChatGPT had secured an MBA. 

News clips
ChatGPT was able to pass an MBA exam in operations management at the Wharton School of Business . . . The new bot that can spit out college-level essays in seconds and has passed graduate-level medical, legal and business exams . . . At UPenn’s prestigious Wharton Business School, a computer getting that grade using artificial intelligence is jaw-dropping . . . 

John Thornhill
Turns out those claims may have been a little exaggerated. 

Konstantine Arkoudas
There was just one business professor who tested ChatGPT on five exam questions from his course and concluded that the answers would get a B minus to B grade. So this was only five exam questions. We don’t even know how they were selected because the professor in question didn’t specify that. 

John Thornhill
Arkoudas got so frustrated with these claims of reasoning that he decided to run his own tests to find out what the latest versions of ChatGPT were really capable of. So, for example, he tried to see if ChatGPT could play sudoku, the simple logic puzzle game. It’s a really good test of general reasoning ability. Now you can train a large language model on millions of past examples of the game, but give it an actual sudoku puzzle and things quickly go wrong. 

Konstantine Arkoudas
You can take any easy sudoku puzzle at the lowest possible level of difficulty, give it to ChatGPT to solve, and it’s basically guaranteed to go off the rails and not only produce a deeply failed solution, but actually insist that the solution is correct. And then when you point out a mistake, it says, all right, OK, fine, let me fix that. And then it goes on to make even more mistakes and so on and so forth. 

Madhumita Murgia
OK. So not much reasoning going on under the hood of the chatbot there. But how does he then explain examples where ChatGPT does at least appear to exhibit reasoning? Where does it get reasoning problems right? 

John Thornhill
One of the issues with all of the tests that ChatGPT sat, whether a bar exam or an MBA exam, is that the AI is trained on so much information, a whole internet of information, in fact, that there’s just no way of knowing if the chatbot might have come across the questions and answers before. 

Madhumita Murgia
Right. It’s kind of like cheating. 

John Thornhill
Exactly. And there’s also the Clever Hans effect, which you described at the beginning of the show. 

Konstantine Arkoudas
Machine learning systems, particularly systems that have been trained on similar material, they might latch on to bogus patterns in the data that allow them to pick out the correct answer for no good reason. In other words, to get the right answer for the wrong reasons. 
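
As a hypothetical illustration of getting the right answer for the wrong reason, here is a toy sketch: a ‘classifier’ that only checks for an incidental word which happens to correlate with the correct label in its training examples, and which falls apart as soon as that cue and the answer come apart. All of the data is invented, and shortcut learning in real models is subtler than this, but the failure mode is the same shape.

```python
# Toy "Clever Hans" classifier. In these invented training examples, every
# question whose correct answer is "admissible" happens to contain the word
# "motive", so a model can score perfectly by latching on to that superficial
# cue instead of doing any legal reasoning.

train = [
    ("Tattoo offered to prove motive for gang retaliation", "admissible"),
    ("Prior conviction offered to prove motive for revenge", "admissible"),
    ("Tattoo offered only to show bad character", "inadmissible"),
    ("Rumour offered only to show bad character", "inadmissible"),
]

def shortcut_predict(question: str) -> str:
    # The entire "reasoning": does the magic word appear?
    return "admissible" if "motive" in question.lower() else "inadmissible"

accuracy = sum(shortcut_predict(q) == label for q, label in train) / len(train)
print(f"training accuracy: {accuracy:.0%}")  # 100%: looks clever

# A new case where the cue and the correct answer come apart:
test_question = "Tattoo offered to prove the defendant's reason for the attack"
print(shortcut_predict(test_question))
# Prints "inadmissible", which is wrong, for the same reason Hans failed
# once his trainer was out of sight.
```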

John Thornhill
Coming to the right answers for the wrong reasons becomes obvious when you ask the AI to explain how it got its answer. It really struggles to show its reasoning. But there’s a caveat. 

Konstantine Arkoudas
I’m perfectly willing to concede that GPT-4 can indeed perform certain shallow types of reasoning under certain conditions, right. It’s a bit like self-driving cars, maybe. Can they actually drive effectively? Well, it depends. Under certain controlled conditions that don’t present any particular challenges, then yes, they can. But can they do it reliably under completely arbitrary conditions? Not yet. Likewise, can GPT-4 solve simple reasoning problems? Probably. But can it perform more challenging reasoning that requires sustained, careful thought and long chains of inference? It cannot. 

Madhumita Murgia
But the architectures of these LLMs are scaling up all the time, right? They could improve dramatically, exponentially, as they have already in the past year.

John Thornhill
True. But Arkoudas reckons it will take something more dramatic than that to get to artificial general intelligence, the kind that the big AI companies are one day hoping to achieve. 

Konstantine Arkoudas
There could be a ChatGPT-5 or 10 that has a very different architecture from GPT-4. And in that case, who knows, maybe it will attain AGI. But if it’s the exact same model that we have today but simply scaled up, I think it’s very, very, very unlikely that that route is ever going to deliver AGI. There’s got to be something else, a missing ingredient that we currently don’t have, that we need to have in order to get there. 

Madhumita Murgia
It does make you wonder about some of the claims being put out there by the big AI companies. 

John Thornhill
It does. And it chimes with what Emily Bender was saying earlier, that there is an element of wishful thinking here in the debate over AGI, and that’s something which Arkoudas is seeing happen, too. 

Konstantine Arkoudas
I think there’s probably a bit of an emotional side to the debate. There are definitely a lot of people in our field that grew up watching Star Trek and reading a lot of science fiction and so on, and they really want AGI to happen yesterday, right? I mean, they just find it incredibly thrilling and exciting and so on. And I think that perhaps that sort of emotional connection maybe ends up clouding perhaps their judgment a little bit? 

John Thornhill
So, Madhu, a big question here. If there are really doubts about whether these machines are reasoning at all, how come the companies are able to raise so much money? 

Madhumita Murgia
So, you know, I think the companies aren’t necessarily making any false claims outright, but what they are doing is sort of encouraging people to anthropomorphise these systems, egging them on really, by saying things like these systems have emergent capabilities. They are displaying abilities that we didn’t know when we trained them, for example, or we’re using them internally for something like therapy, even though, you know, it’s not necessarily approved for that use. So I think there is that sense of creating momentum. And even with investors, there is possibly a sense of wishful thinking, you know, that we want to get to a place where AGI solves even bigger, more lucrative problems in energy and healthcare and so on. And we want to be there when it happens. So it’s in their interest really, to fuel this excitement around the technology. And then, of course, there’s the question of why are people making these claims? And I think it’s because users who are interacting with AI for the first time, you know, really speaking to these systems, find it very difficult to separate machines and humans because we look at it through our own prism and eventually end up anthropomorphising them and thinking that they understand just because they’re responding in the way that a human would. 

John Thornhill
I think one of the things here is that we think it’s all about us. We all think it’s all about human intelligence and comparing it with what we are capable of doing. But I think it’s perfectly possible that a lot of these companies will develop machine intelligences that will be far more powerful in certain areas but will not resemble human intelligence at all. 

Madhumita Murgia
Yeah, exactly. Maybe we invent a type of intelligence that doesn’t require the understanding of things but can still perform actions that are powerful.

[AUDIO CLIP PLAYING]

John, you know, this takes me back to Hans. 

John Thornhill
Ah, yes. Our horse in Germany at the end of the 19th century. 

Madhumita Murgia
Many experts at the time wanted to believe that Hans, the horse, really could add and subtract and spell out words by tapping his hoof. And it’s since become known as the Clever Hans effect in the field of machine learning, where a correct answer just happens to be correlated with superficial features. 

John Thornhill
But what I want to know is what happened to Hans? 

Madhumita Murgia
Well, I think it’s quite a tragic story. 

John Thornhill
Oh, no. Poor Hans. 

Madhumita Murgia
OK, so this is from Wikipedia, but I’ve seen it quoted elsewhere. The story is that Hans’s trainer died in 1909, and Hans then had several other owners. But there’s no record of the horse after 1916. It’s said he was drafted into World War I as a military horse, and the word is that he was either killed in action or eaten by hungry soldiers. 

John Thornhill
A pretty grim ending all round for a very clever horse.

[MUSIC PLAYING]

In our next episode: we’ve talked about the people trying to make AI that can reason with human-level intelligence, and we’ve cast some serious doubt on whether these machines can reason at all. But it gets weirder. There’s a whole ideology that goes with the people working in this field of AI. 

Unnamed speaker
We’ve always wanted to build AGI. This is who they are. They want to build the future. They don’t trust government. They don’t trust democratic processes. And they are willing to do anything to make the thing happen that they want. 

Unnamed speaker
To me, it’s an explicitly joyless, anti-human kind of future. 

Madhumita Murgia
You’ve been listening to Tech Tonic. I’m Madhumita Murgia. 

John Thornhill
And I’m John Thornhill. Our senior producer is Edwin Lane. The producer is Josh Gabert-Doyon. Manuela Saragosa is executive producer. Sound design and engineering by Samantha Giovinco and Breen Turner. Original music by Metaphor Music. The FT’s global head of audio is Cheryl Brumley. 

Madhumita Murgia
Get every episode as it lands by subscribing to Tech Tonic on your usual podcast platform. And in the meantime, we’ve made some articles free to read on FT.com. Just follow the links in the show notes. 

[MUSIC PLAYING]

Copyright The Financial Times Limited 2024. All rights reserved.