Watson, Can You Hear Me? (The Significance of the “Jeopardy” AI win)

Evan Driscoll says:

February 18, 2011 at 6:08 am

I think a lot of this article is a bit dumb. Disclaimer: I'm not super familiar with AI stuff and particularly NLP.

ARTICLE: "Unlike Deep Blue, the IBM computer that in 1997 defeated the world chess champion Garry Kasparov, I saw no indication that the Jeopardy! victory constituted any remarkable innovation in artificial intelligence methods."

This statement I have a huge problem with… because I *am* familiar with chess in general and somewhat familiar with computer chess in particular. (Although I wasn't particularly good, I used to play competitively; you can still look me up in the USCF ratings database if you know my full name, and I've got a few dozen games. I've also considered writing my own chess AI, and investigated what it would take.) And even after reading the Wikipedia article on Deep Blue to review, the only thing I can come up with that may have been new to it is using a learning algorithm to improve the board evaluation function. (There were some other unique things about it like chess-specific hardware, but that's not really AI.) So I have no idea what "remarkable innovation" Deep Blue presented that you're talking about.

The way I look at Watson is this: if you gave me the computer from the USS Enterprise, wiped it of its software, then showed me how to program it and let me get the hang of it, then gave me a day, I'm virtually positive I could write a chess program that would beat Deep Blue. Heck, I might be able to do that on, say, one of IBM's Blue Gene/Ls, though not within a day. But if you gave me the Enterprise's computer and told me to write something that will beat Watson at Jeopardy, my response would be "okay, let me read NLP papers for a month" because I'd have little clue how to do it otherwise.

This isn't necessarily a good indication that Watson progressed from the state of the art… but to suggest that Deep Blue beats Watson in innovation seems dubious at best.

ARTICLE: "In fact, Stephen Wolfram shows that you can get a remarkable amount of the way to building a system like Watson just by putting Jeopardy! clues straight into Google"

Wolfram's statistics are not very relevant here: there's an enormous gap between showing you a page with the answer somewhere on it and actually producing a definitive answer. And even if you count Watson's harder task vs Google's easier task, Watson still cuts out about 1/3 of the missed answers. And tells you when it doesn't know, most of the time.
Evan Driscoll says:

February 18, 2011 at 6:09 am

ARTICLE: "Watson performed poorly on this Final Jeopardy because there were no words in either the clue or the category that are strongly and specifically associated with Chicago — that is, you wouldn’t expect “Chicago” to come up if you were to stick something like this clue into Google (unless you included pages talking about this week’s tournament). But there was an even more glaring error here: anyone who knows enough about Toronto to know about its airports will know that it is not a U.S. city."

This was partially explained by one of the IBM folks. The explanation is that if the category is "U.S. Cities", that does NOT mean that the answer (question) is actually a US City, just that the clue is about a US City in some manner. (The example given was that a clue like "This river is west of New York City" would have fit fine in that category (though too easy the way I presented it).) And so while Watson does pay attention to the category (throughout the game), it's not used as a "type" of the answer.

If the "US" fact is moved into the clue: "This US city's largest airport…", they claim Watson gets it right.

ARTICLE: "Also in the category “Name the Decade,” there was the clue, “The first modern crossword puzzle is published & Oreo cookies are introduced.” Ken responsed, “What are the twenties.” Trebek said no, and then Watson rang in and responded, “What is 1920s.” (Trebek came back with, “No, Ken said that.”)"

This doesn't speak at all to Watson's NLP power or lack thereof at all, since it had no access to Ken's answer at all. Your categorization of that error alongside the Stephen Wynn and Stephen A. Winn thing is inappropriate; it was not a case of Watson considering "the 1920s" and "the twenties" as distinct.

ARTICLE: "In short, Watson is not anywhere close to possessing true understanding of its knowledge — neither conscious understanding of the sort humans experience, nor unconscious, rule-based syntactic and semantic understanding sufficient to imitate the conscious variety. "

Now this is definitely true. That said, no OTHER system is either, so you can't use that point to argue "Watson doesn't advance the state-of-the art."

In short, I think this analysis puts forth a shaky supposition from the start and then does a poor job of arguing it. I'm not qualified to judge how novel what IBM did was, nor how Jeopardy-specific it is. (I suspect it's more Jeopardy-specific than IBM'd care to admit.) But nor am I convinced that this article does a good job.

One final comment:

ARTICLE: "In the meantime, of course, there were some singularly human characteristics on display in the Jeopardy! tournament, and evident only in the human participants. Particularly notable was the affability, charm, and grace of Ken Jennings and Brad Rutter. But the best part was the touches of genuine, often self-deprecating humor by the two contestants as they tried their best against the computer."

I also liked the time that Ken seemed to play off of Watson's wrong answer — when the clue was asking for a style of art, and Watson said "Picasso", Ken jumped right in and said "cubism".
Unknown says:

February 18, 2011 at 1:33 pm

There certainly has been an overload of attempts to cast Watson in a Kurzweilian light and it does not at all fit (I would blame Kurzweil more than Watson or the media covering it). I think significance is getting missed because of this. It isn't particularly fair to compare with search engines as they are quite sophisticated themselves and they don't exactly fit in the same sized cabinet as Watson does. It is important to note that Watson is not searching the Internet, but is using resources chosen ahead of time. This makes Watson more applicable to certain real world problems. It is precisely the task of extracting the important keywords from the clue and extracting the important part of the result that distinguishes Watson. I would not call this "a little NLP". Just as with Machine Vision it is the correct selection of data to throw away that distinguishes the successes from the failures. But most importantly the IBM team was able to measure their starting point with their first implementation (which probably was similar to the results you would get by scraping search engine results) and steadily improve the system.

This brings me to the point I want to make. What IBM has demonstrated is not they have made this thing called Watson that is going to solve our problems, but that they have a team of Computer Scientists that can work with an information domain and fine tune and refine a system to the point that it operates at expert level. The "unbeatable Jeopardy! champion" that you envision is exactly what IBM is looking to provide, a system that augments a person already knowledgeable in the domain.
Ari S says:

February 18, 2011 at 7:47 pm

Mr. Yates,

In response to your comment:

It isn't particularly fair to compare with search engines as they are quite sophisticated themselves and they don't exactly fit in the same sized cabinet as Watson does. It is important to note that Watson is not searching the Internet, but is using resources chosen ahead of time.

The comparison to search engines is entirely fair. No one disputes that search engines are sophisticated and are quite an accomplishment in and of themselves; the question is whether Watson is an innovation. Watson is clearly built upon what's effectively a search engine. It's true that it didn't have a connection to the Internet while playing the game, but that's not a significant technological accomplishment, because Watson effectively had a subset of the Internet cached in its memory — a subset, as you note, that was carefully chosen by humans. Nor is the size of the machine a significant accomplishment — the "cabinet" in which Watson fit was actually a large room hidden backstage, a pretty standard-size server room.

This brings me to the point I want to make. What IBM has demonstrated is not they have made this thing called Watson that is going to solve our problems, but that they have a team of Computer Scientists that can work with an information domain and fine tune and refine a system to the point that it operates at expert level.

This isn't new, either. Teams of computer scientists creating and refining computer systems to solve a specific problem is what computer scientists have been doing for as long as the profession has existed. (Which isn't to say that's all they've been doing.) And creating AI-based programs that are able to perform feats similar to human experts in very specific and complicated technical domains — that's been done since the 1980s.

I'm not at all denying that Watson is quite an accomplishment. It was indeed very impressive to watch. My point is that it's essentially applying existing AI methods to a new domain in which the power of those methods can really be expressed, and can become very publicly apparent. In other words, the biggest accomplishment here is having thought to put a computer on Jeopardy! in the first place.

But as far as I can tell, Watson did not exhibit a giant advance in natural language-processing abilities — which its creators emphasize is supposed to be its big technical feat. Watson can disambiguate word usage — as Ken Jennings wrote, "When it sees the word 'Blondie,' it's very good at figuring out whether Jeopardy! means the cookie, the comic strip, or the new-wave band" — and it has a few other similar capabilities; but those are a very small piece of the puzzle, and don't come anywhere close to truly understanding natural language. Again, I recommend Stephen Wolfram's post outlining the differences.
Ari S says:

February 18, 2011 at 9:35 pm

Mr. Driscoll,

Thanks for your two comments. Your objection to my post seems to be based on a broad-level assertion about the difficulty of designing something like Watson, and a variety of very specific objections to arguments I made. So I'll attempt to respond point-by-point to a few of your more substantial objections.

One note upfront: you've disclaimed that you're not very familiar with AI. While I don't want to argue from authority, I am actually quite familiar with AI — I've studied it academically and independently, and programmed it professionally — and my impression, for the reasons I've outlined, is that there is still an enormous gap in Watson's ability to understand natural language, and that Watson mostly uses existing AI and natural language processing (NLP) capabilities in a novel way. This is not to discount the impressiveness of those techniques, and I think Watson is the finest display of them we've seen so far. I think if you were familiar with pre-Watson AI and NLP techniques, your impression would be similar. While I'm not an AI researcher, there are at least two respected AI researchers who have offered very similar assessments to mine. One is Ben Goertzel. Another, as I mentioned, is Stephen Wolfram, whose post offers the useful distinction between linguistic matching and linguistic understanding. As Wolfram notes, Watson does not attempt to form a definite internal representation of whole sentences; what it does is use some recognition of parts of speech and attributes of categories to adjust statistical matching techniques — and this means that it is an advancement on, but not different in kind from, techniques that have been around for decades.

(continued in the next comment…)
Ari S says:

February 18, 2011 at 9:51 pm

(…continued from the previous comment)

Now to a few specific points:

…if the category is "U.S. Cities", that does NOT mean that the answer (question) is actually a US City, just that the clue is about a US City in some manner…. And so while Watson does pay attention to the category (throughout the game), it's not used as a "type" of the answer.

There are some Jeopardy categories in which you can't tell until you see the clue whether the category is a sort of hint, or whether it's actually telling you the type of the answer. But there are plenty of categories that restrict the type of response, and I think "U.S. Cities" for a Final Jeopardy clue is clearly one of them. It's doubtful, also, that Watson would or could have the ability to distinguish based on the context of the clue itself whether the clue is meant to determine the type of the response. Perhaps it's true that Watson would have gotten the question right if it had been reformulated as you describe, but it's still clear that Watson has severe shortcomings in its ability to understand natural language, and that most of its work happens by statistical association rather than by employing internal representations of predicates and whole sentences.

[Watson repeating Ken's wrong answer] doesn't speak at all to Watson's NLP power or lack thereof at all, since it had no access to Ken's answer at all.

True, but I just categorized this as "errors in worldly logic." It doesn't seem insignificant that Watson has no idea what's going on in the game aside from being fed the textual questions and the state of the question and score boards. If they had fed Watson the other players' answers, it would have been trivial for the programmers to add a simple "if (previous player answered this) then (eliminate this as a possible answer)" command. Even so, that would speak to the program's nature as a fine-tuned, rule-based statistical system rather than an instance of general intelligence.
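
To make the elimination rule above concrete, here is a minimal Python sketch. It is an illustration only, not a description of how Watson works: the function names, the alias table, and the confidence numbers are all invented for the example. It assumes the system keeps a ranked list of candidate answers and is told which responses the host has already rejected.

```python
import re

def normalize(answer, aliases=None):
    """Reduce a response to a comparable key (hypothetical helper).

    Strips the "What is/are ..." framing, punctuation, and articles, then
    applies an optional alias table, e.g. {"twenties": "1920s"}.
    """
    text = answer.lower().strip()
    text = re.sub(r"^(what|who)\s+(is|are|was|were)\s+", "", text)
    text = re.sub(r"[^a-z0-9\s]", "", text)
    words = [w for w in text.split() if w not in {"the", "a", "an"}]
    key = " ".join(words)
    return (aliases or {}).get(key, key)

def pick_response(candidates, rejected_answers, aliases=None):
    """Return the highest-confidence candidate not already ruled wrong.

    candidates: list of (answer, confidence) pairs.
    rejected_answers: responses from other players that the host rejected.
    Returns None if every candidate has been eliminated.
    """
    rejected = {normalize(a, aliases) for a in rejected_answers}
    remaining = [(ans, conf) for ans, conf in candidates
                 if normalize(ans, aliases) not in rejected]
    if not remaining:
        return None
    return max(remaining, key=lambda pair: pair[1])[0]

# Made-up confidences, purely for illustration.
candidates = [("1920s", 0.57), ("1910s", 0.30)]
aliases = {"twenties": "1920s"}
choice = pick_response(candidates, ["What are the twenties."], aliases)
```

Note that the rule is only "simple" once something has decided that "the twenties" and "1920s" are the same answer; in this sketch that work is pushed into a hand-written alias table.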

"In short, Watson is not anywhere close to possessing true understanding of its knowledge…"

Now this is definitely true. That said, no OTHER system is either, so you can't use that point to argue "Watson doesn't advance the state-of-the art."

That was a conclusion from, not the basis of, my argument that Watson doesn't advance the state-of-the-art (nor was I saying it entirely doesn't). And I just wanted to specifically emphasize that Watson doesn't come close to achieving artificial general intelligence or true understanding of natural language, as many have been implying.

But if you gave me the Enterprise's computer and told me to write something that will beat Watson at Jeopardy, my response would be "okay, let me read NLP papers for a month" because I'd have little clue how to do it otherwise.

I'm not sure I entirely understand your point here, but to the extent I do, it seems you've got it just backwards. The Enterprise computer is the idealized realization of a non-conscious machine with complete natural language processing capability. This allows people to communicate with the computer using natural language, both to control the ship, and to ask questions of a machine with access to basically all available knowledge. It's unclear whether it has much intelligence, so I don't know how much luck you'd have using the computer's own programming to make a chess champion (though it could probably tell you how). But the Enterprise computer is already the perfect, generalized Jeopardy-playing machine. You could take a look at its programming to make a Watson — or you could trivially create a Watson by writing the one-line command: "Send the following query to the Enterprise computer: 'Computer, to what question is this sentence the answer: [Jeopardy clue]?'"
Evan Driscoll says:

February 20, 2011 at 4:31 am

Watson's ability to understand natural language, and that Watson mostly uses existing AI and natural language processing (NLP) capabilities in a novel way. This is not to discount the impressiveness of those techniques, and I think Watson is the finest display of them we've seen so far.

That's fair enough. And I don't dispute some of the things you say later about things like "Watson doesn't have a true understanding of language". I probably counted your opinion of Watson as more negative than it actually is.

"there are plenty of categories that restrict the type of response, and I think "U.S. Cities" for a Final Jeopardy clue is clearly one of them"

So I looked through a ton of games here to get an idea of how often a category like that would be used with answers that aren't actually what the category says. And while they almost always are, it isn't a sure thing. There was one game with a "U.S. Cities" category in Double Jeopardy where the question for the $2000 clue was "what is Maine?" And other things like a "national historic sites" category having a question asking for "minuteman" missiles and the "Brown v Topeka Board of Education" Supreme Court case, a "National Parks" category where one question asked what mountain range Yosemite is in, and even an entire "National Parks of the World" category where each clue was just the name of the park and the question was the name of the country it was in. (Alex did provide an explanation for that, so it only sort of counts.)

I'm not sure I entirely understand your point here, but to the extent I do, it seems you've got it just backwards. The Enterprise computer is the idealized realization of a non-conscious machine with complete natural language processing capability.

Sorry, I wasn't very clear. I originally wrote this response for a reply to a friend who shared the article over RSS on Google Reader, and decided I might as well post it here too. I made some minor changes, and a key bit got lost in this part: what I originally wrote was "if you gave me the computer from the USS Enterprise, wiped it of its software, then showed me how to program it…".

So what I meant by "the computer from the USS Enterprise" was basically just "give me lots of cycles". (I'm sort of assuming the NLP part of that computer system is "software only".)

So phrased a different way, what I was trying to argue was that I think most undergrad CS graduates could, given a sufficiently fast computer, reasonably easily write an unbeatable chess AI. (Taken to the extreme, if the computer is infinitely fast, you don't even need to do anything remotely intelligent at all: just expand out the whole minimax tree. If there's a library with board representation, "get the available moves" function, etc. available, this would take what, a few minutes?)
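
For readers who haven't seen it, "expand out the whole minimax tree" looks roughly like the Python sketch below. Chess needs a board library and far more cycles than anyone has, so this uses a tiny stand-in game chosen for the example (a pile of stones, each player removes 1 to 3, whoever takes the last stone wins); `legal_moves`, `minimax`, and `best_move` are illustrative names, not from any chess library. The search itself has no evaluation function or pruning: it simply visits every line of play to the end.

```python
def legal_moves(stones):
    """Moves available in the toy game: take 1, 2, or 3 stones."""
    return [n for n in (1, 2, 3) if n <= stones]

def minimax(stones, maximizing):
    """Exhaustively search the game tree.

    Returns +1 if the maximizing player wins with perfect play from this
    position, and -1 if the minimizing player does.
    """
    if stones == 0:
        # The previous mover took the last stone, so the side to move lost.
        return -1 if maximizing else 1
    scores = [minimax(stones - m, not maximizing) for m in legal_moves(stones)]
    return max(scores) if maximizing else min(scores)

def best_move(stones):
    """Pick the move whose resulting position is best for the mover."""
    return max(legal_moves(stones), key=lambda m: minimax(stones - m, False))
```

Swapping in chess would mean replacing `legal_moves` and the terminal test with calls into a board library and adding a draw value of 0; the recursion stays the same, which is the sense in which raw speed substitutes for cleverness here.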

By contrast, it seems to the somewhat lay observer that even some "generic NLP researcher" (who hasn't put specific thought into solving Jeopardy) would still have a very difficult time beating Watson in a "somewhat-timed" contest, even given as much computing power as they want. (By "somewhat-timed" I mean it has an almost-guaranteed time period of a few seconds to think.) It'd become even more difficult if Watson's software was allowed to run on the same machine.

(I've got a few more comments I'll post in a moment; I suspect I'm near the character limit.)
Evan Driscoll says:

February 20, 2011 at 4:55 am

The other thing to keep in mind is that it's almost never the case that someone invents some brand new technique out of the blue. Even for "groundbreaking projects", if you go and look at the papers that preceded them (whether they be by the same or different groups), they're almost always most of the way there. This is why we can have even seemingly-complex, revolutionary things like calculus be independently invented by two people at the same time.

And so you say that Watson "is an advancement on, but not different in kind from, techniques that have been around for decades"… but this is true of essentially everything out there. I'd give $50 to someone who could demonstrate that Wolfram Alpha is fundamentally different from things that have been around for a couple decades. (Of course the catch there is that it's sort of my opinion of what counts as "fundamentally different.") Wolfram is not the first to try a less statistical approach to NLP, of course.

The big question in my mind is not "how big of an advance is Watson", because in the narrow space of Jeopardy, I think it's pretty clearly an enormous advance. The big question is "how generally applicable is the technology underlying Watson?"

If the evidence-weighing and scoring functions they have turn out to be general, then it seems to me all this "they're taking existing techniques and applying them in a different setting" argument, no matter how true, is basically meaningless. What I fear is that this won't be the case.
Dave Turner says:

February 22, 2011 at 3:02 pm

Methinks the author doth protest too much.

Let's bottom line this. This article takes every opportunity to point out weaknesses and failures in Watson from the perspective of what the author believes Watson should have been doing to be proven successful and does not consider the actual stated goal from IBM that was being tested.

Per IBM, "The goal is to build a computer that can be more effective in understanding and interacting in natural language, but not necessarily the same way humans do it."

The reason this is significant, especially compared to Deep Blue, is that a computer is inherently good at processing the 1s and 0s required to analyze a chess game and the finite legal moves available. Those moves are extremely finite when compared to the diversity of human language and, I might add, especially American English which is a conglomerate of many languages and dialects, not to mention slang and syntax.

But let's also take a look at the article's claims about Deep Blue. Deep Blue DID NOT win its first match against Kasparov; it simply won a single game and tied twice, thus losing by a score of 4 to 2. It was then re-designed, re-built and officially renamed Deeper Blue. In the subsequent match Deep(er) Blue beat Kasparov in what was a very close match scoring 3 1/2 to 2 1/2. Not exactly the decisive victory of Watson.

The point is clear, Watson is very significant because it attempts, and has demonstrated significant success, at doing what computers are NOT good at. Which is understanding human language. Was the success limited? Of course it was. Just as was Deep Blue's, which the Author gives as a valid example of a significant computer break-through. So again, why is Watson's achievement not significant?

Further, the author says search engines do a pretty good job of what Watson was designed to do. Well, if the goal was to find the right answer this argument would have some validity, but since the goal was to process language, that is, take information, research that information, then provide a reasonably valid answer, then this is far off the mark. A search engine sorts and prioritizes the research. At no point does it take the research and summarize the data into an attempt to provide an answer. Google points you in the right direction and we are required to evaluate the results and look for the answer we need.

No, the author got this backwards. Deep Blue and Deeper Blue were an affirmation of what we already knew computers did well and perhaps proof that programmers were getting better at taking advantage of that ability. Watson is the first significant step towards a computer that truly processes human language. It was far from perfect, but if you watch enough Jeopardy you'll see humans making similar errors. And this was IBM's first demonstration. I can't wait for Watson 2.0!
Ari S says:

February 22, 2011 at 5:46 pm

Methinks the commenter doth protest too much! Unfortunately, Zyradul, your comment seems based on a lack of knowledge about the history and methods of artificial intelligence.

First, to the pedantic question about whether Deep Blue or Deeper Blue beat Garry Kasparov: of course Deep Blue, like every piece of software, had many iterations to improve upon the failings of the initial version. But "Deeper Blue" was just a joke nickname IBM came up with for the second version that played Kasparov, while "Deep Blue" was the official name of every version of the program, including the one that beat Kasparov.

It's also not quite useful to compare the point spreads of a chess match versus a Jeopardy tournament, since the games and scoring systems are so different. The chess scores you cite are based on counting the winners of individual games; the closest comparison would be to Jeopardy's individual rounds, of which there were also six in the Watson tournament: two Jeopardy Rounds, two Double Jeopardy Rounds, and two Final Jeopardy Rounds. Watson had its fair share of difficulties: counting based on money won during each round, Watson won three, tied one (the first Jeopardy Round), and lost two (the second Jeopardy Round and the first Final Jeopardy Round). Nor was this quite, as you say, IBM's first demonstration of Watson — just the first to appear as an official match on Jeopardy!; in fact, Watson competed against many former human Jeopardy contestants to help IBM improve its many previous iterations of Watson.

The rest of this comment is basically addressed in my original post and subsequent comments. The basic argument seems to be that "Watson is the first significant step towards a computer that truly processes human language." Watson is nowhere near the first step towards being able to do some processing of natural language, there having been attempts and significant advances made for 50 years. (Just take a look at Google Translate for evidence of the results.) Nor does it seem to innovate any qualitatively new methods, but just improves on existing ones: Chapter 22 of Russell & Norvig, the standard textbook on AI, offers the basic blueprint for how to create a Watson.

As for a computer that actually understands natural language, equivalent to human capability: no computer comes close today, and neither does Watson, which was the main point my post was meant to illustrate. Artificial intelligence researchers — at least, those not directly involved in this project or with some other incentive for exaggerating its results — seem to agree that this is an improvement upon existing methods, and a novel show of what those methods can achieve, but nothing close to an architecture for general understanding of natural language.

Comments are closed.


Futurisms