Unlike with Deep Blue, the IBM computer that in 1997 defeated the world chess champion Garry Kasparov, I saw no indication that the Jeopardy! victory constituted any remarkable innovation in artificial intelligence methods. IBM’s Watson computer is essentially search engine technology with some basic natural language processing (NLP) capability sprinkled on top. Most Jeopardy! clues contain definite, specific keywords associated with the correct response, such that you could probably Google those keywords and find the correct response somewhere in the first page of results. The game is already very amenable to what computers do well.
In fact, Stephen Wolfram shows that you can get a remarkable amount of the way to building a system like Watson just by putting Jeopardy! clues straight into Google:
Once you’ve got that, it only requires a little NLP to extract a list of candidate responses, some statistical training to weight those responses properly, and then a variety of purpose-built tricks to accommodate the various quirks of Jeopardy!-style categories and jokes. Watching Watson perform, it’s not too difficult to imagine the combination of algorithms used.
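To make that recipe concrete, here is a minimal Python sketch of the search, extract, and weight idea. It is not Watson's actual code. The `SNIPPETS` list is a hypothetical stand-in for search results, the capitalized-phrase regex is a crude substitute for real NLP, and the share-of-mentions score stands in for statistically trained weights.

```python
import re
from collections import Counter

# Hypothetical search-result snippets; a real system would fetch these.
SNIPPETS = [
    "The Elements of Style by Strunk and White is a classic writing manual.",
    "The New Yorker praised The Elements of Style for its brevity.",
    "Dorothy Parker reviewed books for The New Yorker.",
]

CLUE = ("The New Yorker's 1959 review of this said in its brevity & clarity "
        "it is unlike most such manuals, a book as well as a tool.")

# Crude "NLP": a run of capitalized words, optionally joined by of/the/and.
CAP_WORD = r"[A-Z][a-z]+"
PHRASE = re.compile(rf"{CAP_WORD}(?:\s+(?:(?:of|the|and)\s+)?{CAP_WORD})*")


def extract_candidates(snippet):
    """Return the distinct capitalized phrases in one snippet, in order."""
    return list(dict.fromkeys(PHRASE.findall(snippet)))


def rank_responses(clue, snippets):
    """Weight each candidate by its share of snippet mentions.

    Candidates that already appear in the clue are dropped, since a
    response that merely repeats the clue cannot be the answer.
    """
    clue_lower = clue.lower()
    mentions = Counter()
    for snippet in snippets:
        for candidate in extract_candidates(snippet):
            if candidate.lower() not in clue_lower:
                mentions[candidate] += 1
    total = sum(mentions.values())
    return [(c, n / total) for c, n in mentions.most_common()]


ranked = rank_responses(CLUE, SNIPPETS)
print(ranked[0])  # ('The Elements of Style', 0.5)
```

Everything here hinges on the right phrase showing up in the search results at all, which is exactly why keyword-rich clues are the easy ones.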
Compiling Watson’s Errors
On that large share of search-engine-amenable clues, Watson almost always did very well. More interesting are the various types of clues on which Watson performed very poorly. Perhaps the best example was the Final Jeopardy clue from the first game (which was broadcast on the second of three nights). The category was “U.S. Cities,” and the clue was “Its largest airport is named for a World War II hero; its second largest, for a World War II battle.” Both of the human players correctly responded Chicago, but Watson incorrectly responded Toronto, and the audience audibly gasped when it did.
Watson performed poorly on this Final Jeopardy because there were no words in either the clue or the category that are strongly and specifically associated with Chicago — that is, you wouldn’t expect “Chicago” to come up if you were to stick something like this clue into Google (unless you included pages talking about this week’s tournament). But there was an even more glaring error here: anyone who knows enough about Toronto to know about its airports will know that it is not a U.S. city.
There were a variety of other instances like this of “dumb” behavior on Watson’s part. The partial list that follows gives a flavor of the kinds of mistakes the machine made, and can help us understand their causes.
- With the category “Beatles People” and the clue “‘Bang bang’ his ‘silver hammer came down upon her head,’” Watson responded, “What is Maxwell’s silver hammer.” Surprisingly, Alex Trebek accepted this response as correct, even though the category and clue were clearly asking for the name of a person, not a thing.
- With the category “Olympic Oddities” and the clue “It was the anatomical oddity of U.S. gymnast George Eyser, who won a gold medal on the parallel bars in 1904,” Watson responded, “What is leg.” The correct response was, “What is he was missing a leg.”
- In the “Name the Decade” category, Watson at one point didn’t seem to know what the category was asking for. With the clue “Klaus Barbie is sentenced to life in prison & DNA is first used to convict a criminal,” none of its top three responses was a decade. (Correct response: “What is the 1980s?”)
- Also in the category “Name the Decade,” there was the clue, “The first modern crossword puzzle is published & Oreo cookies are introduced.” Ken responded, “What are the twenties.” Trebek said no, and then Watson rang in and responded, “What is 1920s.” (Trebek came back with, “No, Ken said that.”)
- With the category “Literary Character APB,” and the clue “His victims include Charity Burbage, Mad Eye Moody & Severus Snape; he’d be easier to catch if you’d just name him!” Watson didn’t ring in because its top option was Harry Potter, with only 37% confidence. Its second option was Voldemort, with 20% confidence.
- On one clue, Watson’s top option (which was correct) was “Steve Wynn.” Its second-ranked option was “Stephen A. Wynn” — the full name of the same person.
- With the clue “In 2002, Eminem signed this rapper to a 7-figure deal, obviously worth a lot more than his name implies,” Watson’s top option was the correct one — 50 Cent — but its confidence was too low to ring in.
- With the clue “The Schengen Agreement removes any controls at these between most EU neighbors,” Watson’s first choice was “passport” with 33% confidence. Its second choice was “Border” with 14%, which would have been correct. (Incidentally, it’s curious to note that one answer was capitalized and the other was not.)
- In the category “Computer Keys” with the clue “A loose-fitting dress hanging from the shoulders to below the waist,” Watson incorrectly responded “Chemise.” (Ken then incorrectly responded “A,” thinking of an A-line skirt. The correct response was a “shift.”)
- Also in “Computer Keys,” with the clue “Proverbially, it’s ‘where the heart is,’” Watson’s top option (though it did not ring in) was “Home Is Where the Heart Is.”
- With the clue “It was 103 degrees in July 2010 & Con Ed’s command center in this N.Y. borough showed 12,963 megawatts consumed at 1 time,” Watson’s first choice (though it did not have enough confidence to ring in) was “New York City.”
- In the category “Nonfiction,” with the clue “The New Yorker’s 1959 review of this said in its brevity & clarity it is ‘unlike most such manuals, a book as well as a tool.’” Watson incorrectly responded “Dorothy Parker.” The correct response was “The Elements of Style.”
- For the clue “One definition of this is entering a private place with the intent of listening secretly to private conversations,” Watson’s first choice was “eavesdropper,” with 79% confidence. Second was “eavesdropping,” with 49% confidence.
- For the clue “In May 2010 5 paintings worth $125 million by Braque, Matisse & 3 others left Paris’ museum of this art period,” Watson responded, “Picasso.”
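Several of the items above turn on whether Watson's top option cleared its confidence bar for ringing in. A minimal sketch of that decision follows, using confidence figures quoted in the list. The 50% threshold is purely an assumption for illustration, since nothing above says what bar Watson actually used, and `decide` is a hypothetical helper name.

```python
# Assumed for illustration only; the real threshold is not given in the text.
BUZZ_THRESHOLD = 0.50


def decide(candidates, threshold=BUZZ_THRESHOLD):
    """Return the top-confidence response if it clears the threshold, else None.

    `candidates` is a list of (response, confidence) pairs.
    """
    if not candidates:
        return None
    best, confidence = max(candidates, key=lambda pair: pair[1])
    return best if confidence >= threshold else None


# Confidence figures as reported in the list above.
voldemort = [("Harry Potter", 0.37), ("Voldemort", 0.20)]
schengen = [("passport", 0.33), ("Border", 0.14)]
eavesdrop = [("eavesdropper", 0.79), ("eavesdropping", 0.49)]

print(decide(voldemort))  # None
print(decide(schengen))   # None
print(decide(eavesdrop))  # eavesdropper
```

Note that a rule like this only looks at the numbers. It has no way to notice that “eavesdropper” names a person while the clue defines an activity.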
We can group these errors into a few broad, somewhat overlapping categories:
- Failure to understand what type of thing the clue was pointing to, e.g. “Maxwell’s silver hammer” instead of “Maxwell”; “leg” instead of “he was missing a leg”; “eavesdropper” instead of “eavesdropping.”
- Failure to understand what type of thing the category was pointing to, e.g. “Home Is Where the Heart Is” for “Computer Keys”; “Toronto” for “U.S. Cities.”
- Basic errors in worldly logic, e.g. repeating Ken’s wrong response; considering “Steve Wynn” and “Stephen A. Wynn” to be different responses.
- Inability to understand jokes or puns in clues, e.g. 50 Cent being “worth” “more than his name implies”; “he’d be easier to catch if you’d just name him!” about Voldemort.
- Inability to respond to clues lacking keywords specifically associated with the correct response, e.g. the Voldemort clue; “Dorothy Parker” instead of “The Elements of Style.”
- Inability to correctly respond to complicated clues that involve inference and combining facts in subsequent stages, rather than combining independent associated clues; e.g. the Chicago airport clue.
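The category failures in particular are the kind a simple type check would catch. The sketch below is a rough illustration under loud assumptions: the lookup tables `US_CITIES` and `COMPUTER_KEYS` are tiny and hypothetical, the candidate lists passed in are made-up inputs rather than Watson's real ranked output, and `filter_by_category` is my own helper name.

```python
# Hypothetical, deliberately tiny lookup tables.
US_CITIES = {"chicago", "new york city", "las vegas"}
COMPUTER_KEYS = {"shift", "home", "escape", "tab"}

# Maps a category title to the set of responses of the right type.
CATEGORY_TYPES = {
    "U.S. Cities": US_CITIES,
    "Computer Keys": COMPUTER_KEYS,
}


def filter_by_category(category, ranked_candidates):
    """Drop candidates whose type does not match the category.

    Candidates keep their original ranking order. Categories with no
    known type constraint are passed through unchanged.
    """
    allowed = CATEGORY_TYPES.get(category)
    if allowed is None:
        return list(ranked_candidates)
    return [c for c in ranked_candidates if c.lower() in allowed]


# Made-up candidate lists, ordered best-first.
print(filter_by_category("U.S. Cities", ["Toronto", "Chicago"]))
# ['Chicago']
print(filter_by_category("Computer Keys", ["Home Is Where the Heart Is", "Home"]))
# ['Home']
```

Of course, a hand-built table like this is just one more purpose-built trick. It patches two categories without giving the machine any idea what a category is.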
What these errors add up to is that Watson really cannot process natural language in a very sophisticated way; if it could, it would not suffer from the category errors that marked so many of its wrong responses. Nor does it have much ability to perform the inference required to integrate several discrete pieces of knowledge, as required for understanding puns, jokes, wordplay, and allusions. On clues involving these skills and lacking search-engine-friendly keywords, Watson stumbled. And when it stumbled, it often seemed not just ignorant, but completely thoughtless.
I expect you could create an unbeatable Jeopardy! champion by allowing a human player to look at Watson’s weighted list of possible responses, even without the weights being nearly as accurate as Watson has them. While Watson assigns percentage-based confidence levels, any moderately educated human will immediately be able to sort potential responses into three relatively discrete categories: “makes no sense,” “yes, that sounds right,” and “don’t know, but maybe.” Watson hasn’t come close to touching this.
The Significance of Watson’s Jeopardy! Win
In short, Watson is not anywhere close to possessing true understanding of its knowledge — neither conscious understanding of the sort humans experience, nor unconscious, rule-based syntactic and semantic understanding sufficient to imitate the conscious variety. (Stephen Wolfram’s post accessibly explains his effort to achieve the latter.) Watson does not bring us any closer, in other words, to building a Mr. Data, even if such a thing is possible. Nor does it put us much closer to an Enterprise ship’s computer, as many have suggested.
In the meantime, of course, there were some singularly human characteristics on display in the Jeopardy! tournament, and evident only in the human participants. Particularly notable was the affability, charm, and grace of Ken Jennings and Brad Rutter. But the best part was the touches of genuine, often self-deprecating humor by the two contestants as they tried their best against the computer. This culminated in Ken Jennings’s joke on his last Final Jeopardy response:
Nicely done, sir. The closing credits, which usually show the contestants chatting with Trebek onstage, instead showed Jennings and Rutter attempting to “high-five” Watson and show it other gestures of goodwill:
I’m not saying it couldn’t ever be done by a computer, but it seems like joking around will have to be just about the last thing A.I. will achieve. There’s a reason Mr. Data couldn’t crack jokes. Because, well, humor — it is a difficult concept. It is not logical. All the more reason, though, why I can’t wait for Saturday Night Live’s inevitable “Celebrity Jeopardy” segment where Watson joins in with Sean Connery to torment Alex Trebek.