Levesque - The Winograd Schema Challenge
Abstract

In this paper, we present an alternative to the Turing Test that has some conceptual and practical advantages. A Winograd schema is a pair of sentences that differ only in one or two words and that contain a referential ambiguity that is resolved in opposite directions in the two sentences. We have compiled a collection of Winograd schemas, designed so that the correct answer is obvious to the human reader, but cannot easily be found using selectional restrictions or statistical techniques over text corpora. A contestant in the Winograd Schema Challenge is presented with a collection of one sentence from each pair, and required to achieve human-level accuracy in choosing the correct disambiguation.

1 Introduction

The well-known Turing Test was first proposed by Alan Turing (1950) as a practical way to defuse what seemed to him to be a pointless argument about whether or not machines could think. In a nutshell, he proposes that instead of asking such a vague question and then getting caught up in a debate about what it means to really be thinking, we should focus on observable behaviour and ask whether a machine would be capable of producing behaviour that we would say required thought in people. The sort of behaviour he had in mind was participating in a natural conversation in English over a teletype in what he calls the Imitation Game. The idea, roughly, is that if an interrogator were unable to tell after a long, free-flowing and unrestricted conversation with a machine whether she was dealing with a person or a machine, then we should be prepared to say that the machine was thinking. Requiring more of the machine, such as that it look a certain way, or be biological, or have a certain causal history, is just arbitrary chauvinism.

It is not our intent to defend Turing’s argument here (but see the Discussion section below). For our purposes, we simply accept the argument and the emphasis Turing places on intelligent behaviour, counter to critics such as Searle (2008). We also accept that typed English text is a sufficient medium for displaying intelligent behaviour, counter to critics such as Harnad (1989). That is, assuming that any sort of behaviour is going to be judged sufficient for showing the presence of thinking (or understanding, or intelligence, or whatever appropriate mental attribute), we assume that typed English text, despite its limitations, will be a rich enough medium.

2 The trouble with Turing

The Turing Test does have some troubling aspects, however. First, note the central role of deception. Consider the case of a future intelligent machine trying to pass the test. It must converse with an interrogator and not just show its stuff, but fool her into thinking she is dealing with a person. This is just a game, of course, so it’s not really lying. But to imitate a person well without being evasive, the machine will need to assume a false identity (to answer “How tall are you?” or “Tell me about your parents.”). All other things being equal, we should much prefer a test that did not depend on chicanery of this sort. Or to put it differently, a machine should be able to show us that it is thinking without having to pretend to be somebody or to have some property (like being tall) that it does not have.

We might also question whether a conversation in English is the right sort of test. Free-form conversations are no doubt the best way to get to know someone, to find out what they think about something, and therefore that they are thinking about something. But conversations are so adaptable and can be so wide-ranging that they facilitate deception and trickery.

Consider, for example, ELIZA (Weizenbaum 1966), where a program (usually included as part of the normal Emacs distribution), using very simple means, was able to fool some people into believing they were conversing with a psychiatrist. The deception works at least in part because we are extremely forgiving in terms of what we will accept as legitimate conversation. A Rogerian psychiatrist may say very little except to encourage a patient to keep on talking, but it may be enough, at least for a while.

Consider also the Loebner competition (Shieber 1994), a restricted version of the Turing Test that has attracted considerable publicity. In this case, we have a more balanced conversation taking place than with ELIZA. What is striking about transcripts of these conversations is the fluidity of the responses from the subjects: elaborate wordplay, puns, jokes, quotations, clever asides, emotional outbursts, points of order. Everything, it would seem, except clear and direct answers to questions. And how is an interrogator supposed to deal with this evasiveness and determine whether or not
there is any real comprehension behind the verbal acrobatics? More conversation. “I’d like to get back to what you said earlier.” Short conversations are usually inconclusive; unsurprisingly, the Loebner competition gives judges only 5 minutes to determine whether or not they are conversing with a person or a computer (Christian 2011), not nearly enough to get through the dust cloud of (largely canned) small talk and jokes that the winning programs usually have. Even with long conversations, two interrogators looking at the same transcript may disagree on the final verdict. Grading the test, in other words, is problematic.

How can we steer research in a more constructive direction, away from deception and trickery? One possibility is something like the captcha (von Ahn et al. 2003). The idea here is that a distorted image of a multidigit number is presented to a subject who is then required to identify the number. People in general can easily pass the test in seconds, but current computer programs have quite a hard time of it (cheating aside).[1]

[1] Cheating will always be a problem. The story with captchas is that one program was able to decode them by presenting them on a web page as a puzzle to be solved by unwitting third parties before they could gain access to a free porn site! Any test, including anything we propose here, needs to be administered in a controlled setting to be informative.

So this test does, at least for now, distinguish people from machines very well. The question is whether this test could play the role of the Turing Test. Passing the test clearly involves some form of cognitive activity in people, but it is doubtful whether it is thinking in the full-bodied sense that Turing had in mind, the touchstone of human-level intelligence. We can imagine a sophisticated automated digit classifier, perhaps one that has learned from an enormous database of distorted digits, doing as well as people on the test. The behaviour of the program may be ideal; but the scope of what we are asking it to do may be too limited to draw a general conclusion.

3 Recognizing Textual Entailment

In general, what we are after is a new type of Turing Test that has these desirable features:

• it involves the subject responding to a broad range of English sentences;
• native English-speaking adults can pass it easily;
• it can be administered and graded without expert judges;
• no less than with the original Turing Test, when people pass the test, we would say they were thinking.

One promising proposal is the recognizing textual entailment (RTE) challenge (Dagan, Glicksman, and Magnini 2006; Bobrow et al. 2007; Rus et al. 2007). In this case, a subject is presented with a series of yes-no questions concerning whether one English sentence (A), called the text (T), entails another (B), called the hypothesis (H). Two example pairs adapted from (Dagan, Glicksman, and Magnini 2006) illustrate the form:

• A: Time Warner is the world’s largest media and internet company.
B: Time Warner is the world’s largest company.

• A: Norway’s most famous painting, “The Scream” by Edvard Munch, was recovered Saturday.
B: Edvard Munch painted “The Scream.”

This is on the right track, in our opinion. Getting the correct answers (no and yes above, respectively) clearly requires some thought. Moreover, like the captcha, but unlike the Turing Test, an evasive subject cannot hide behind verbal maneuvers. Also, in terms of a research challenge, incremental progress on the RTE is possible: we can begin with simple lexical analyses of the words in the sentences, and then progress all the way to applying arbitrary amounts of world knowledge to the task.

There are two interrelated problems with this challenge. First, it rests on a somewhat unclear notion of entailment. Of course a precise definition of this concept exists (assuming a precise semantics, like in logic), but subjects could not be expected to know or even understand it. The researchers instead explain to subjects that “T entails H if, typically, a human reading T would infer that H is most likely true” (Dagan, Glicksman, and Magnini 2006). The fact that we need to predict what humans would do, and indeed how they would reason about what is “likely” to be true, which forces the RTE challenge to rest on a nonmonotonic notion of entailment, is troubling. We know, in fact, that what is likely to follow from a set of premises can vary widely, depending on the semantics of a particular nonmonotonic logic and on the formalization of a background theory (Hanks and McDermott 1987).

Moreover, entailment may not always coincide with human judgement of what is most likely true under certain circumstances. What if the second (B) above was this:

• B: The recovered painting was worth more than $1000.

Technically, this is not an entailment of (A), although it is very likely to be judged true! Of course, subjects can be trained in advance to help sort out issues like this, but it would still be preferable for a practical test not to depend on such a delicate logical concern.

The second problem is perhaps related to the difficulty of getting a firm handle on the more problematic aspects of the RTE notion of entailment: in practice (see (Roemmele, Bejan, and Gordon 2011)), the RTE challenge has focused more on inferences that are necessarily true due to the meaning of the text fragment than on default inferences. This results in a challenge that is easier than might be imagined from the original description of the RTE challenge. Examples of text and hypothesis pairs used in the development set for the 2010 RTE challenge include the following, cited in (Majumdar and Bhattacharyya 2010):

• A: Arabic television Al-Jazeera said Tuesday the kidnappers of a U.S. woman journalist abducted in Baghdad had threatened to kill her if female prisoners in Iraq were not freed within 72 hours .... Al-Jazeera reiterates its rejection and condemnation of all forms of violence targeting journalists and demands the release of the US journalist Jill Carroll, the station said.
B: Jill Carroll was abducted in Iraq.

and
• A: At least 35 people were killed and 125 injured in three explosions targeting tourists in Egypt’s Sinai desert region late Thursday, an Egyptian police source said.
B: At least 30 people were killed in the blasts.

Some reasoning ability and background knowledge (in arithmetic, in geography) are necessary to get these questions correct. Nevertheless, it seems overly generous to rate a system as being on par with human intelligence on the basis of its ability to do well on a challenge of this difficulty. Certainly, this seems to be far below the difficulty level of what Turing was proposing.

What we propose in this paper is a variant of the RTE that we call the Winograd Schema (or WS) challenge. It requires subjects to answer binary questions, and appeals to world knowledge and default reasoning abilities, but without depending on an explicit notion of entailment.

4 The Winograd Schema Challenge

A WS is a small reading comprehension test involving a single binary question. Two examples will illustrate:

• The trophy doesn’t fit in the brown suitcase because it’s too big. What is too big?
Answer 0: the trophy
Answer 1: the suitcase

• Joan made sure to thank Susan for all the help she had given. Who had given the help?
Answer 0: Joan
Answer 1: Susan

We take it that the correct answers here are obvious. In each of the questions, we have the following four features:

1. Two parties are mentioned in a sentence by noun phrases. They can be two males, two females, two inanimate objects or two groups of people or objects.

2. A pronoun or possessive adjective is used in the sentence in reference to one of the parties, but is also of the right sort for the second party. In the case of males, it is “he/him/his”; for females, it is “she/her/her”; for inanimate objects it is “it/it/its,” and for groups it is “they/them/their.”

3. The question involves determining the referent of the pronoun or possessive adjective. Answer 0 is always the first party mentioned in the sentence (but repeated from the sentence for clarity), and Answer 1 is the second party.

4. There is a word (called the special word) that appears in the sentence and possibly the question. When it is replaced by another word (called the alternate word), everything still makes perfect sense, but the answer changes.

We will explain the fourth feature in a moment. But note that like the RTE there are no limitations on what the sentences can be about, or what additional noun phrases or pronouns they can include. Ideally, the vocabulary would be restricted enough that even a child would be able to answer the question, like in the two examples above. (We will return to this point in the Incremental Progress section below.)

Perhaps the hardest item to justify even informally from the requirements in the previous section is that thinking is required to get a correct answer with high probability. Although verbal dodges are not possible like in the original Turing Test, how do we know that there is not some trick that a programmer could exploit, for example, the word order in the sentence or the choice of vocabulary, or some other subtle feature of English expressions? Might there not be some unintended bias in the way the questions are formulated that could help a program answer without any comprehension?

This is where the fourth requirement comes in. In the first example, the special word is “big” and its alternate is “small;” and in the second example, the special word is “given” and its alternate is “received.” These alternate words only show up in alternate versions of the two questions:

• The trophy doesn’t fit in the brown suitcase because it’s too small. What is too small?
Answer 0: the trophy
Answer 1: the suitcase

(There is an extensive discussion of the spatial reasoning involved in these disambiguations in (Davis 2012).)

• Joan made sure to thank Susan for all the help she had received. Who had received the help?
Answer 0: Joan
Answer 1: Susan

With this fourth feature, we can see that clever tricks involving word order or other features of words or groups of words will not work. Contexts where “give” can appear are statistically quite similar to those where “receive” can appear, and yet the answer must change. This helps make the test Google-proof: having access to a large corpus of English text would likely not help much (assuming, that is, that answers to the questions have not yet been posted on the Web)! The claim is that doing better than guessing requires subjects to figure out what is going on: for example, a failure to fit is caused by one of the objects being too big and the other being too small, and they determine which is which.

The need for thinking is perhaps even more evident in a much more difficult example, a variant of which was first presented by Terry Winograd (Winograd 1972), for whom we have named the schema:[2]

The town councillors refused to give the angry demonstrators a permit because they feared violence. Who feared violence?
Answer 0: the town councillors
Answer 1: the angry demonstrators

Here the special word is “feared” and its alternate is “advocated” as in the following:

The town councillors refused to give the angry demonstrators a permit because they advocated violence. Who advocated violence?
Answer 0: the town councillors
Answer 1: the angry demonstrators

[2] See also the discussion of this in (Pylyshyn 1984).
It is wildly implausible that there would be statistical or other properties of the special word or its alternate that would allow us to flip from one answer to the other in this case. This was the whole point of Winograd’s example! You need to have background knowledge that is not expressed in the words of the sentence to be able to sort out what is going on and decide that it is one group that might be fearful and the other group that might be violent. And it is precisely bringing this background knowledge to bear that we informally call thinking. The fact that we are normally not aware of the thinking we are doing in figuring this out should not mislead us; using what we know is the only explanation that makes sense of our ability to answer here.

5 A library in standard format

In constructing a WS, it is critical to find a pair of questions that differ in one word and satisfy the four criteria above. In building a library of suitable questions, it is convenient therefore to assemble them in a format that lists both the special word and its alternate. Here is the first example above in this format:

The trophy doesn’t fit in the brown suitcase because it’s too ⟨ ⟩. What is too ⟨ ⟩?
Answer 0: the trophy
Answer 1: the suitcase
special: big
alternate: small

The ⟨ ⟩ in a WS is a placeholder for the special word or its alternate, given in the first and second rows of the table below the line. A WS includes both the question and the answer: Answer 0 (the first party in the sentence) is the correct answer when the special word replaces the ⟨ ⟩ and Answer 1 (the second party) is the correct answer when the alternate word is used.

While a WS involves a pair of questions that have opposite answers, it is not necessary that the special word and its alternate be opposites (like “big” and “small”). Here are two examples where this is not the case:

• Paul tried to call George on the phone, but he wasn’t ⟨ ⟩. Who wasn’t ⟨ ⟩?
Answer 0: Paul
Answer 1: George
special: successful
alternate: available

• The lawyer asked the witness a question, but he was reluctant to ⟨ ⟩ it. Who was reluctant?
Answer 0: the lawyer
Answer 1: the witness
special: repeat
alternate: answer

In putting together an actual test for a subject, we would want to choose randomly between the special word and its alternate. Since each WS contains the two questions and their answers, a random WS test can be constructed, administered, and graded in a fully automated way. An expert judge is not required to interpret the results.

6 What is obvious?

The most problematic aspect of this proposed challenge is coming up with a list of appropriate questions. Like the RTE, candidate questions will need to be tested empirically before they are used in a test. We want normally-abled adults whose first language is English to find the answers obvious. But what do we mean by “obvious”? There are two specific pitfalls that we need to avoid.

6.1 Pitfall 1

The first pitfall concerns questions whose answers are in a certain sense too obvious. These are questions where the choice between the two parties can be made without considering the relationship between them expressed by the sentence. Consider the following WS:

The women stopped taking the pills because they were ⟨ ⟩. Which individuals were ⟨ ⟩?
Answer 0: the women
Answer 1: the pills
special: pregnant
alternate: carcinogenic

In this case, because only the women can be pregnant and only the pills can be carcinogenic, the questions can be answered by ignoring the sentence completely and merely finding the permissible links between the answers and the special word (or its alternate). In linguistics terminology, the anaphoric reference can be resolved using selectional restrictions alone. Because selectional restrictions like this might be learned by sampling a large enough corpus (that is, by confirming that the word “pregnant” occurs much more often close to “women” than close to “pills”), we should avoid this sort of question.

Along similar lines, consider the following WS:

The racecar zoomed by the school bus because it was going so ⟨ ⟩. What was going so ⟨ ⟩?
Answer 0: the racecar
Answer 1: the school bus
special: fast
alternate: slow

In principle, both a racecar and a school bus can be going fast. However, the association between racecars and speed is much stronger, and again this can provide a strong hint about the answer to the question. So it is much better to alter the example to something like the following:

The delivery truck zoomed by the school bus because it was going so ⟨ ⟩. What was going ⟨ ⟩?
Answer 0: the delivery truck
Answer 1: the school bus
special: fast
alternate: slow
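
The corpus-based shortcut that Pitfall 1 warns about can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the four-sentence TOY_CORPUS, the window size, and the function names are all hypothetical and do not come from the paper, and a real check would sample a large corpus. The sketch counts how often a candidate noun appears close to the special word and guesses the noun with the stronger association, without ever reading the schema sentence.

```python
import re
from collections import Counter

# Hypothetical toy corpus, invented for illustration only.
TOY_CORPUS = [
    "Pregnant women should consult a doctor.",
    "Many women are pregnant at thirty.",
    "Some pills are carcinogenic.",
    "The pills were found to be carcinogenic.",
]


def cooccurrence(corpus, window=5):
    """Count unordered word pairs that occur within `window` tokens of each other."""
    counts = Counter()
    for sentence in corpus:
        tokens = re.findall(r"[a-z]+", sentence.lower())
        for i, first in enumerate(tokens):
            for second in tokens[i + 1 : i + 1 + window]:
                counts[frozenset((first, second))] += 1
    return counts


def guess_by_association(counts, candidates, word):
    """Return the index of the candidate seen more often near `word`, or None on a tie."""
    scores = [counts[frozenset((c, word))] for c in candidates]
    if scores[0] == scores[1]:
        return None
    return scores.index(max(scores))


COUNTS = cooccurrence(TOY_CORPUS)
```

On this toy data, `guess_by_association(COUNTS, ("women", "pills"), "pregnant")` picks the women and the same call with "carcinogenic" picks the pills, while `("truck", "bus")` with "fast" returns `None` because the corpus gives no preference. A schema for which such a shortcut returns a confident answer is a candidate for rejection from the library.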
This pitfall can also be avoided by only using examples with randomly chosen proper names of people (like Joan/Susan or Paul/George, above) where there is no chance of connecting one of the names to the special word or its alternate.

6.2 Pitfall 2

The second and more troubling pitfall concerns questions whose answers are not obvious enough. Informally, a good question for a WS is one that an untrained subject (your Aunt Edna, say) can answer immediately.

But to say that an answer is obvious does not mean that the other answer has to be logically inconsistent. It is possible that in a bizarre town, the councillors are advocating violence and choose to deny a permit as a way of expressing this. It is also possible that angry demonstrators could nonetheless fear violence and that the councillors could use this as a pretext to deny them a permit. But these interpretations are farfetched and will not trouble your Aunt Edna.[3] So they will not cause us statistical difficulties except perhaps with language experts asked to treat the example as an object of professional interest.

To see what can go wrong with a WS, however, let us consider an example that is a “near-miss.” We start with the following:

Frank was jealous when Bill said that he was the winner of the competition. Who was the winner?
Answer 0: Frank
Answer 1: Bill

So far so good, with “jealous” as the special word and Bill as the clear winner. The difficulty is to find an alternate word that points to Frank as the obvious winner. Consider this:

Frank was pleased when Bill said that he was the winner of the competition.

The trouble here is that it is not unreasonable to imagine Frank being pleased because Bill won (and similarly for “happy” or “overjoyed”). The sentence is too ambiguous to be useful. If we insist on using a WS along these lines, here is a better version:

Frank felt ⟨ ⟩ when his longtime rival Bill revealed that he was the winner of the competition. Who was the winner?
Answer 0: Frank
Answer 1: Bill
special: vindicated
alternate: crushed

In this case, it is advisable to include the information that Bill was a longtime rival of Frank to make it more apparent that Frank was the winner.[4]

[3] Similarly, there is a farfetched reading where a small trophy would not “fit” in a big suitcase in the sense of fitting closely, the way a big shoe is not the right fit for a small foot.

[4] However, the vocabulary is perhaps too rich now.

7 Incremental Progress

In the end, what a subject will consider to be obvious will depend to a very large extent on what he or she knows. We can construct examples where very little needs to be known, like the trophy example, or this one:

The man couldn’t lift his son because he was so ⟨ ⟩. Who was ⟨ ⟩?
Answer 0: the man
Answer 1: his son
special: weak
alternate: heavy

At the other extreme, we have examples like the town councillor one proposed by Winograd. Unlike with the RTE, the “easier” questions are not easier because they can be answered in a more superficial way (using, for example, only statistical properties of the individual words). Rather, they differ on the background knowledge assumed. Consider, for example, this intermediate case:

The large ball crashed right through the table because it was made of ⟨ ⟩. What was made of ⟨ ⟩?
Answer 0: the ball
Answer 1: the table
special: steel
alternate: styrofoam

For adults who know what styrofoam is, this WS is obvious. But for individuals who may have only heard the word a few times, there could be a problem.

A major advantage of the WS challenge is that it allows for incremental progress. Like the RTE, it can be staged: we can have libraries of questions suitable for anyone who is at least ten years old (like the trophy one), all the way up to questions that are more “university-level” (like the town councillor one). To get a feel for some of the possibilities, we include a number of additional examples in the Appendix at the end of the paper; a collection of more than 100 examples can be found at [Link]

In addition, the schemas can be grouped according to domain. Some examples involve reasoning about knowledge and communication; others involve temporal reasoning or physical reasoning. Researchers can choose to work on examples in a particular domain, and to take a test restricted to that domain.

To help ensure that researchers can make progress on the WS challenge at first, we propose to make publicly available well beforehand a list of all the words that will appear in a test. (Of course, we would include both the special words and their alternates, although only one of them will be selected at random when the test is administered.) For a test with 50 questions, which should be enough to rule out mere guessing, 500 words (give or take proper names) should be sufficient. A test with 50 questions should only take a person 25 minutes or so to complete.
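
The standard format and the automated test it allows can be sketched as follows. This is a minimal Python illustration, not the authors' implementation: the class name, the `{w}` placeholder (standing in for ⟨ ⟩), the function names, and the fair-coin choice between special and alternate word are assumptions made for the example. The last function gives the chance that pure guessing reaches a given score, which is one way to read the remark that 50 questions should be enough to rule out mere guessing.

```python
import random
from dataclasses import dataclass
from math import comb


@dataclass(frozen=True)
class WinogradSchema:
    sentence: str   # "{w}" marks where the special or alternate word goes
    question: str   # may also contain "{w}"
    answers: tuple  # (Answer 0, Answer 1), in order of mention
    special: str    # with this word, Answer 0 is correct
    alternate: str  # with this word, Answer 1 is correct

    def instantiate(self, use_alternate):
        """Return (text, answers, index of the correct answer)."""
        word = self.alternate if use_alternate else self.special
        text = self.sentence.format(w=word) + " " + self.question.format(w=word)
        return text, self.answers, int(use_alternate)


def build_test(library, n, rng):
    """Sample n schemas and flip a fair coin per schema for the word used."""
    return [s.instantiate(rng.random() < 0.5) for s in rng.sample(library, n)]


def grade(test, responses):
    """Fraction of responses (0 or 1) that match the stored correct index."""
    hits = sum(r == correct for (_, _, correct), r in zip(test, responses))
    return hits / len(test)


def p_guess_at_least(n, k):
    """Chance that fair-coin guessing gets at least k of n questions right."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n


TROPHY = WinogradSchema(
    sentence="The trophy doesn't fit in the brown suitcase because it's too {w}.",
    question="What is too {w}?",
    answers=("the trophy", "the suitcase"),
    special="big",
    alternate="small",
)
```

With a library of such records, `build_test(library, 50, random.Random())` yields a fresh test and `grade` scores it without a judge. Under the fair-coin assumption, `p_guess_at_least(50, 45)` is below one in a hundred million, so a near-human score on 50 questions is very unlikely to come from guessing alone.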
8 Summary of a Winograd Schema

To summarize: A Winograd schema is a pair of sentences differing in only one or two words and containing an ambiguity that is resolved in opposite ways in the two sentences and that requires the use of world knowledge and reasoning for its resolution. It should satisfy the following constraints:

1. It should be easily disambiguated by the human reader. Ideally, this should be so easy that the reader does not even notice that there is an ambiguity; a “System 1” activity, in Kahneman’s terminology (Kahneman 2011).

2. It should not be solvable by simple techniques such as selectional restrictions.

3. It should be Google-proof; that is, there should be no obvious statistical test over text corpora that will reliably disambiguate these correctly.

The proposed challenge would involve presenting a program claiming to be intelligent with one sentence from every pair out of a hidden corpus of Winograd schemas. To pass the challenge, the program would have to achieve near human levels of success; presumably close to 100%, if constraint (1) above has been satisfied by the corpus designers.

The strengths of the challenge, as an alternative to the Turing Test, are that it is clear-cut, in that the answer to each schema is a binary choice; vivid, in that it is obvious to non-experts that a program that fails to get the right answers has serious gaps in its understanding; and difficult, in that it is far beyond the current state of the art.

9 Related Work

Alternatives to the Turing Test: (Cohen 2004), (Dennett 1998), (Ford and Hayes 1995), and (Whitby 1996) are among those who have argued against viewing the Turing Test as the ultimate test of artificial intelligence. Cohen has suggested several alternatives to the Turing Test, including a system capable of producing a five-page report on any arbitrary topic and systems capable of learning world knowledge through reading text. Unlike the WS Challenge, no definitive guidelines for success are given; passing the test would seem to rely on human judgement. Dennett has observed that disambiguating Winograd-like sentences requires the sort of world knowledge and ability to reason we associate with intelligence, but has not expanded this observation into a proposal for an alternative to the Turing Test. A very different approach to testing intelligent systems, which uses principles of minimum length learning to develop a test applicable to any intelligence, is presented in (Hernandez-Orallo and Dowe 2010).

Winograd Sentences: Sentences similar to Winograd schemas have been discussed by Hobbs (1979), Caramazza et al. (1977), Goikoetxea et al. (2008), and Rohde (2008). An example of Rohde’s is:

Mary scolded Sue. She kicked her.

She in the second sentence can refer to Mary (Mary scolded Sue, and on top of that Mary kicked Sue) or to Sue (Mary scolded Sue because Sue kicked Mary).

Caramazza et al. give examples of sentence pairs such as:

Mary infuriated Jane because she had stolen a tennis racket.
Mary scolded Jane because she had stolen a tennis racket.

which directly map onto the Winograd schema structure.

Much of this body of work focusses on exploring discourse coherence by classifying verbs by their role in a discourse context, since a verb’s role can often give clues as to whether pronouns refer to the subject or an object of a previous sentence. For example, in the discourse fragment

John infuriated Bill. He . . .

readers usually associate He with John; while in the discourse fragment

John scolded Bill. He . . .

readers usually associate He with Bill. Readers expect the second sentence in the former fragment to elaborate on how John infuriated Bill, while they expect the second sentence in the latter fragment to explain why John scolded Bill; that is, what Bill had done to elicit the scolding. Goikoetxea, among others, discusses the implicit causal relational meaning in certain classes of verbs.

Hobbs is exemplary in noting the need for commonsense world knowledge to understand these sentences; however, the general focus of these researchers is on linguistic techniques. None have suggested using such pairs of sentences to test system comprehension and intelligence.

Datasets for testing understanding: Many datasets have been created to test systems’ ability to reason. The best known of these are the RTE datasets.[5] As discussed in Section 3, the text and hypothesis pairs appear to focus on relatively shallow reasoning. The FRACAS dataset (Cooper et al. 1996) covers a greater range of entailment than the RTE datasets but is quite weak on anaphora reference, containing only a few such examples.

[5] Recent RTE competitions have been organized by NIST. The site for the most recent 2011 competition, [Link] links to previous years’ challenges. Papers describing systems, results, and samples of data are freely available; however, the data itself is generally available only to registered participants.

Variants of RTE: Choice of Plausible Alternatives (COPA) (Roemmele, Bejan, and Gordon 2011) is a proposed variant of the RTE challenge that focusses on choosing between two alternatives that ask about causes or consequences of a statement. Examples of COPA queries are:

Premise: I knocked on my neighbor’s door. What happened as a result?
Alternative 1: My neighbor invited me in.
Alternative 2: My neighbor left his house.

and

Premise: The man fell unconscious. What was the cause of this?
Alternative 1: The assailant struck the man in the head.
Alternative 2: The assailant took the man’s wallet.

Like the WS challenge, COPA emphasizes relatively deep reasoning. However, COPA’s dataset includes problems that are less clear-cut than the WS schema. Determining success
in COPA therefore requires a more standard NLP method- WS challenge, introducing what is often called the
ology: annotating examples, training on the annotation set, knowledge-based approach (Brachman and Levesque 2004,
developing a human gold standard. In addition, COPA is Chap. 1): explicitly representing knowledge in a formal lan-
narrower than the WS challenge: it focuses on causality, but guage, and providing procedures to reason with that knowl-
makes no attempt to cover the broad range of human reason- edge. While this approach still faces tremendous scientific
ing. It is not intended to supplant the Turing Test. hurdles, we believe it remains the most likely path to suc-
cess. That is, we believe that in order to pass the WS Chal-
10 Discussion lenge, a system will need to have commonsense knowledge
about space, time, physical reasoning, emotions, social con-
10.1 Turing, Searle, and Behaviours structs, and a wide variety of other domains. Indeed, we
The claim of this paper in its strongest form might be this: hope that the WS Challenge will spur new research into rep-
with a very high probability, anything that answers correctly resentations of commonsense knowledge.
a series of these questions (without having extracted any However, nothing in the WS challenge insists on this ap-
hints from the text of this paper, of course) is thinking in proach, and we would expect NLP researchers to try differ-
the full-bodied sense we usually reserve for people. ent approaches. Statistical approaches toward natural lan-
To defend this claim, however, we would have to defend guage processing (Manning and Schütze 1999) have be-
a philosophical position that Turing sought to avoid with his come increasingly popular since the 1990s. Virtually all en-
original Turing Test. So like Turing, it is best to make a trants to competitions like TREC ( [Link] and
weaker claim: with a very high probability, anything that RTE have statistical components at their core; this is true
answers correctly is engaging in behaviour that we would even for natural language programs that emphasize the im-
say shows thinking in people. Whether or not a subject that portance of knowledge representation and reasoning, such
passes the test is really and truly thinking is the philosophi- as the DARPA Machine Reading Program (Strassel et al.
cal question that Turing sidesteps. 2010), (Etzioni, Banko, and Cafarella 2006). The successes
Not everyone agrees with Turing. Searle (2008) attempts of the last several decades in such NLP tasks as text summa-
to show with his well-known Chinese Room thought exper- rization and question-answering have been based on statisti-
iment that it is possible for people to get the observable be- cal NLP.
haviour right (in a way that would cover equally well the These successes have been on limited tasks and gener-
original Turing Test, an RTE test, and our WS challenge), ally do not extend to the type of deep reasoning that we be-
but without having the associated mental attributes. How- lieve is required to solve the WS Challenge. But if statistical
ever, in our opinion (Levesque 2009), his argument is vacu- approaches over large corpora —- to gather commonsense
ous: in particular, it is highly unlikely that a system without knowledge or to learn patterns of pronoun referents — work
understanding that can accurately prescribe such complex better, so be it. The WS Challenge is agnostic about this
behavior can be realized within the size of our universe. matter. This agnosticism also means that we do not intend
On a related theme, Hawkins and Blakeslee (Hawkins and to provide training annotations.
Blakeslee 2004) suggest that AI has focussed too closely on
getting the behaviour right and that this has prevented it from
seeing the importance of what happens internally even when 10.3 Natural vs. Artificial Examples
there is no external behaviour. The result, they argue, is a
research programme that is much too behavioristic. (Searle The trend in natural-language processing challenges, such
makes a similar point.) See also (Cohen 2004). as RTE, TREC, and Machine Reading has been toward texts
In our opinion, this is a misreading of Turing and of AI occurring naturally, such as newspaper articles and blog
research. Observable intelligent behaviour is indeed the ul- data. In contrast, the Winograd Schema set of examples is
timate goal according to Turing, but things do not stop there. artificially constructed. However, we feel quite confident
The goal immediately raises a fundamental question: what that the issues that arise in solving the Winograd schemas
sorts of computational mechanisms can possibly account for in our collection come up as well in interpreting naturally
the production of this behaviour? And this question may occurring text. Indeed, it is sometimes possible to find sen-
well be answered in a principled and scientific way by pos- tences in natural text that can easily be turned into Wino-
tulating and testing for a variety of internal schemes and ar- grad schemas. Consider the following sentence from Jane
chitectures. For example, what are we to make of a person Austen’s Emma:
who quietly reads a book with no external behaviour other Her mother had died too long ago for her to have more
than eye motion and turning pages? There can be a consider- than an indistinct remembrance of her caresses; and
able gap between the time a piece of background knowledge her place had been taken by an excellent woman as
is first acquired and the time it is actually needed to condi- governess, who had fallen little short of a mother in af-
tion behaviour, such as producing the answer to a WS. fection.
10.2 Knowledge-based vs. Statistical Approaches This can be turned into the following WS schema:
The computational architecture articulated by John Mc- Emma’s mother had died long ago, and her h i by an
Carthy (McCarthy 1959) was perhaps the first to offer a excellent woman as governess. Whose h i by the gov-
plausible story about how to approach something like the erness?
Answer 0: Emma’s mother
Answer 1: Emma

special: place had been taken
alternate: education had been managed

Note also that disambiguating the second and third occurrences of “her” in the original quotation, referring respectively to Emma and to Emma’s mother, requires inference and world knowledge no less deep; however, these do not seem to be easily transformable into Winograd schemas.

The difficulty is that there are certain conventions in text in general, and probably more specific conventions in the works of particular authors, which can be exploited by a system that attempts no comprehension, but merely uses statistical knowledge. For example, Hobbs (1979) cites studies that show that in naturally-occurring text, an ambiguous pronoun more often refers to the subject of the preceding sentence than the object. More exact figures can doubtless be determined from studies of individual authors.

While we are not opposed to the use of statistical methods, we do not believe that systems that use statistics alone, in the absence of world knowledge and any method that simulates reasoning, are conforming to the spirit of the test. Artificially constructing examples allows the test designer to prevent test takers from using knowledge-free statistical methods.

The major disadvantage of using a hand-crafted test set is that it can be expensive to construct large test sets. This might be a problem if we were intending to construct large sets at very frequent intervals (e.g., if we were envisioning holding a yearly competition with large training and test sets). But since we don’t envision doing that, and since the labor involved in constructing a small dataset of around 100 examples is on the order of one or two weeks of work, we do not consider this to be much of an issue.

10.4 Conclusion

Like Turing, we believe that getting the behaviour right is the primary concern in developing an artificially intelligent system. We further agree that English comprehension in the broadest sense is an excellent indicator of intelligent behaviour. Where we have a slight disagreement with Turing is whether a free-form conversation in English is the right vehicle. Our WS challenge does not allow a subject to hide behind a smokescreen of verbal tricks, playfulness, or canned responses. Assuming a subject is willing to take a WS test at all, much will be learned quite unambiguously about the subject in a few minutes. What we have proposed here is certainly less demanding than an intelligent conversation about sonnets (say), as imagined by Turing; it does, however, offer a test challenge that is less subject to abuse.

11 Acknowledgements

An earlier version of this paper, with Hector Levesque as sole author, was presented at Commonsense-2011. We thank Ray Jackendoff, Mitch Marcus, and the anonymous reviewers of this paper for their helpful suggestions and comments.

Appendix A: Corpus of Winograd schemas

This appendix gives some examples of the more than 100 additional Winograd schemas available at [Link].[6] In the interests of space, we have adopted a more compact format. In some cases where we were concerned that the schema might not be Google-proof, we have done some experiments with searches using Google’s count of result pages. These counts, however, are notoriously unreliable (Lapata and Keller 2005), so these “experiments” should be taken with several grains of salt.

1. John couldn’t see the stage with Billy in front of him because he is so [short/tall]. Who is so [short/tall]?
Answers: John/Billy.

2. Tom threw his schoolbag down to Ray after he reached the [top/bottom] of the stairs. Who reached the [top/bottom] of the stairs?
Answers: Tom/Ray.

3. Although they ran at about the same speed, Sue beat Sally because she had such a [good/bad] start. Who had a [good/bad] start?
Answers: Sue/Sally.

4. The sculpture rolled off the shelf because it wasn’t [anchored/level]. What wasn’t [anchored/level]?
Answers: The sculpture/the shelf.

5. Sam’s drawing was hung just above Tina’s and it did look much better with another one [below/above] it. Which looked better?
Answers: Sam’s drawing/Tina’s drawing.

6. Anna did a lot [better/worse] than her good friend Lucy on the test because she had studied so hard. Who studied hard?
Answers: Anna/Lucy.

7. The firemen arrived [after/before] the police because they were coming from so far away. Who came from far away?
Answers: The firemen/the police.

8. Frank was upset with Tom because the toaster he had [bought from/sold to] him didn’t work. Who had [bought/sold] the toaster?
Answers: Frank/Tom.

9. Jim [yelled at/comforted] Kevin because he was so upset. Who was upset?
Answers: Jim/Kevin.

10. The sack of potatoes had been placed [above/below] the bag of flour, so it had to be moved first. What had to be moved first?
Answers: The sack of potatoes/the bag of flour.

11. Pete envies Martin [because/although] he is very successful. Who is very successful?
Answers: Martin/Pete.

[6] Thanks to Pat Levesque and reviewers for help with the first several examples and to Stavros Vassos for general discussion.
12. I spread the cloth on the table in order to [protect/display] it. To [protect/display] what?
Answers: the table/the cloth.

13. Sid explained his theory to Mark but he couldn’t [convince/understand] him. Who did not [convince/understand] whom?
Answers: Sid did not convince Mark/Mark did not understand Sid.

14. Susan knew that Ann’s son had been in a car accident, [so/because] she told her about it. Who told the other about the accident?
Answers: Susan/Ann.

15. The drain is clogged with hair. It has to be [cleaned/removed]. What has to be [cleaned/removed]?
Answers: The drain/the hair.

16. My meeting started at 4:00 and I needed to catch the train at 4:30, so there wasn’t much time. Luckily, it was [short/delayed], so it worked out. What was [short/delayed]?
Answers: The meeting/the train.

17. There is a pillar between me and the stage, and I can’t [see/see around] it. What can’t I [see/see around]?
Answers: The stage/the pillar.

18. Ann asked Mary what time the library closes, [but/because] she had forgotten. Who had forgotten?
Answers: Mary/Ann.

19. Bob paid for Charlie’s college education, but now Charlie acts as though it never happened. He is very [hurt/ungrateful]. Who is [hurt/ungrateful]?
Answers: Bob/Charlie.

20. At the party, Amy and her friends were [chatting/barking]; her mother was frantically trying to make them stop. It was very strange behavior. Who was behaving strangely?
Answers: Amy’s mother/Amy and her friends.

21. The dog chased the cat, which ran up a tree. It waited at the [top/bottom]. Which waited at the [top/bottom]?
Answers: The cat/the dog.

22. Sam and Amy are passionately in love, but Amy’s parents are unhappy about it, because they are [snobs/fifteen]. Who are [snobs/fifteen]?
Answers: Amy’s parents/Sam and Amy.

23. Mark told Pete many lies about himself, which Pete included in his book. He should have been more [truthful/skeptical]. Who should have been more [truthful/skeptical]?
Answers: Mark/Pete.

24. Since it was raining, I carried the newspaper [over/in] my backpack to keep it dry. What was I trying to keep dry?
Answers: The backpack/the newspaper.

25. Jane knocked on Susan’s door, but she didn’t [answer/get an answer]. Who didn’t [answer/get an answer]?
Answers: Susan/Jane.

26. Sam tried to paint a picture of shepherds with sheep, but they ended up looking more like [dogs/golfers]. What looked like [dogs/golfers]?
Answers: The sheep/the shepherds.

27. Thomson visited Cooper’s grave in 1765. At that date he had been [dead/travelling] for five years. Who had been [dead/travelling] for five years?
Answers: Cooper/Thomson.

28. Tom’s daughter Eva is engaged to Dr. Stewart, who is his partner. The two [doctors/lovers] have known one another for ten years. What two people have known one another for ten years?
Answers: Tom and Dr. Stewart/Eva and Dr. Stewart.

29. The actress used to be named Terpsichore, but she changed it to Tina a few years ago, because she figured it was [easier/too hard] to pronounce. Which name was [easier/too hard] to pronounce?
Answers: Tina/Terpsichore.

30. Sara borrowed the book from the library because she needs it for an article she is working on. She [reads/writes] when she gets home from work. What does Sara [read/write] when she gets home from work?
Answers: The book/the article.

31. Fred is the only man still alive who remembers my great-grandfather. He [is/was] a remarkable man. Who [is/was] a remarkable man?
Answers: Fred/my great-grandfather.

32. Fred is the only man alive who still remembers my father as an infant. When Fred first saw my father, he was twelve [years/months] old. Who was twelve [years/months] old?
Answers: Fred/my father.

33. There are too many deer in the park, so the park service brought in a small pack of wolves. The population should [increase/decrease] over the next few years. Which population will [increase/decrease]?
Answers: The wolves/the deer.

34. Archaeologists have concluded that humans lived in Laputa 20,000 years ago. They hunted for [deer/evidence] on the river banks. Who hunted for [deer/evidence]?
Answers: The prehistoric humans/the archaeologists.

35. The scientists are studying three species of fish that have recently been found living in the Indian Ocean. They [appeared/began] two years ago. Who or what [appeared/began] two years ago?
Answers: The fish/the scientists.

36. The journalists interviewed the stars of the new movie. They were very [cooperative/persistent], so the interview lasted for a long time. Who was [cooperative/persistent]?
Answers: The stars/the journalists.

37. I couldn’t find a spoon, so I tried using a pen to stir my coffee. But that turned out to be a bad idea, because it got full of [ink/coffee]. What got full of [ink/coffee]?
Answers: The coffee/the pen.
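The compact bracket notation used for the schemas in this appendix can be expanded mechanically into the two concrete questions each schema stands for. The following is a minimal sketch of that expansion, not software belonging to the challenge: the names Schema, Item, expand, first_mention_baseline and accuracy are hypothetical, and the candidates field (the two referents in order of first mention) is an assumption added here for illustration. The sketch also scores a knowledge-free baseline that always picks the first-mentioned candidate, in the spirit of the position statistics discussed in Section 10.3. Because the two halves of a schema have opposite answers, such a baseline is right on exactly one half of every schema.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Matches one compact slot such as "[short/tall]" or "[bought from/sold to]".
SLOT = re.compile(r"\[([^\[\]/]+)/([^\[\]/]+)\]")


@dataclass(frozen=True)
class Schema:
    """Hypothetical container for one appendix entry."""
    sentence: str                # text containing [special/alternate] slots
    question: str                # question, possibly containing the same slots
    candidates: Tuple[str, str]  # assumed field: referents in order of mention
    answers: Tuple[str, str]     # correct referent for (special, alternate)


@dataclass(frozen=True)
class Item:
    """One concrete question obtained by filling every slot the same way."""
    sentence: str
    question: str
    candidates: Tuple[str, str]
    answer: str


def expand(schema: Schema) -> List[Item]:
    """Return the special-word item followed by the alternate-word item."""
    items = []
    for i in (0, 1):
        def fill(match, i=i):
            # Group 1 is the special word, group 2 the alternate word.
            return match.group(i + 1)
        items.append(Item(SLOT.sub(fill, schema.sentence),
                          SLOT.sub(fill, schema.question),
                          schema.candidates,
                          schema.answers[i]))
    return items


def first_mention_baseline(item: Item) -> str:
    """Knowledge-free heuristic: always choose the first-mentioned candidate."""
    return item.candidates[0]


def accuracy(guess: Callable[[Item], str], schemas: List[Schema]) -> float:
    """Fraction of expanded items on which the guesser picks the right referent."""
    items = [item for schema in schemas for item in expand(schema)]
    return sum(guess(item) == item.answer for item in items) / len(items)


# Schemas 1, 4 and 11 from this appendix, transcribed by hand.
SCHEMAS = [
    Schema("John couldn't see the stage with Billy in front of him "
           "because he is so [short/tall].",
           "Who is so [short/tall]?",
           ("John", "Billy"), ("John", "Billy")),
    Schema("The sculpture rolled off the shelf because it wasn't [anchored/level].",
           "What wasn't [anchored/level]?",
           ("The sculpture", "the shelf"), ("The sculpture", "the shelf")),
    Schema("Pete envies Martin [because/although] he is very successful.",
           "Who is very successful?",
           ("Pete", "Martin"), ("Martin", "Pete")),
]

if __name__ == "__main__":
    for item in expand(SCHEMAS[0]):
        print(item.sentence, item.question, "->", item.answer)
    print(accuracy(first_mention_baseline, SCHEMAS))  # prints 0.5
```

Any guesser that ignores the special word scores 0.5 under this setup, which is chance for a two-way choice; doing better requires using the one or two words that differ between the halves.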
Comment (schema 37): The statistical associations give the backward answer here: “ink” is associated with “pen” and “coffee” is associated with “coffee”. Of course, a contestant could use a backward rule here: Since the challenge designers have excluded examples where statistics give the right answer, if you find a statistical relation, guess that the answer runs opposite to it. But that seems very risky.

References

Bobrow, D.; Condoravdi, C.; Crouch, R.; de Paiva, V.; Karttunen, L.; King, T.; Mairn, B.; Price, L.; and Zaenen, A. 2007. Precision-focussed Textual Inference. In Proc. Workshop on Textual Entailment and Paraphrasing.

Brachman, R., and Levesque, H. 2004. Knowledge Representation and Reasoning. Morgan Kaufmann.

Christian, B. 2011. Mind vs. Machine. Atlantic Monthly. March 2011.

Cohen, P. 2004. If not the Turing Test, Then What? In Proceedings of the Nineteenth National Conference on Artificial Intelligence. Menlo Park, Calif.: AAAI Press.

Cooper, R.; Crouch, D.; Eijckl, J. V.; Fox, C.; Genabith, J. V.; Japars, J.; Kamp, H.; Milward, D.; Pinkal, M.; Poesio, M.; and Pulman, S. 1996. A Framework for Computational Semantics (FraCaS). Technical report, The FraCaS Consortium.

Dagan, I.; Glicksman, O.; and Magnini, B. 2006. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges: LNAI 3944. Springer Verlag.

Davis, E. 2012. Qualitative Spatial Reasoning in Interpreting Text and Narrative. Spatial Cognition and Computation. Forthcoming.

Dennett, D. 1998. Can Machines Think? In Mather, G.; Verstraten, F.; and Anstis, S., eds., The Motion Aftereffect. MIT Press.

Etzioni, O.; Banko, M.; and Cafarella, M. 2006. Machine Reading. In Proceedings of the Twenty-First National Conference on Artificial Intelligence. Menlo Park, Calif.: AAAI Press.

Ford, K., and Hayes, P. 1995. Turing Test Considered Harmful. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, 972–977. San Mateo, Calif.: Morgan Kaufmann.

Hanks, S., and McDermott, D. 1987. Nonmonotonic Logic and Temporal Projection. Artificial Intelligence 33(3):379–412.

Hawkins, J., and Blakeslee, S. 2004. On Intelligence. New York: Times Books.

Hernandez-Orallo, J., and Dowe, D. L. 2010. Measuring Universal Intelligence: Toward an Anytime Intelligence Test. Artificial Intelligence 174(18):1508–1539.

Kahneman, D. 2011. Thinking, Fast and Slow. Farrar, Straus, and Giroux.

Lapata, M., and Keller, F. 2005. Web-based Models for Natural Language Processing. ACM Transactions on Speech and Language Processing 2(1).

Levesque, H. 2009. Is it Enough to get the Behaviour Right? In Proceedings of the Twenty-first International Joint Conference on Artificial Intelligence. San Mateo, Calif.: Morgan Kaufmann.

Majumdar, D., and Bhattacharyya, P. 2010. Lexical Based Text Entailment System for Main Task of RTE6. In Proceedings, Text Analysis Conference, NIST.

Manning, C., and Schütze, H. 1999. Foundations of Statistical Natural Language Processing. Cambridge, Mass.: MIT Press.

McCarthy, J. 1959. Programs with Common Sense. In Proceedings of the Teddington Conference on the Mechanization of Thought Processes. London: Her Majesty’s Stationery Office.

Pylyshyn, Z. 1984. Computation and Cognition: Toward a Foundation for Cognitive Science. Cambridge, Mass.: MIT Press.

Roemmele, M.; Bejan, C.; and Gordon, A. 2011. Choice of Plausible Alternatives: An Evaluation of Commonsense Causal Reasoning. In Proceedings, International Symposium on Logical Formalizations of Commonsense Reasoning.

Rus, V.; McCarthy, P.; McNamara, D.; and Graesser, A. 2007. A Study of Textual Entailment. International Journal of Artificial Intelligence Tools 17.

Shieber, S. 1994. Lessons from a Restricted Turing Test. Communications of the ACM 37(6):70–78.

Strassel, S.; Adams, D.; Goldberg, H.; Herr, J.; Keesing, R.; Oblinger, D.; Simpson, H.; Schrag, R.; and Wright, J. 2010. The DARPA Machine Reading Program - Encouraging Linguistic and Reasoning Research with a Series of Reading Tasks. In International Conference on Language Resources and Evaluation (LREC).

von Ahn, L.; Blum, M.; Hopper, N.; and Langford, J. 2003. CAPTCHA: Using Hard AI Problems for Security. In Eurocrypt-2003, 294–311.

Weizenbaum, J. 1966. ELIZA - A Computer Program for the Study of Natural Language Communication between Man and Machine. Communications of the ACM 9(1):36–45.

Whitby, B. 1996. Why the Turing Test is AI’s Biggest Blind Alley. In Millican, P., and Clark, A., eds., Machine and Thought. Oxford University Press.

Winograd, T. 1972. Understanding Natural Language. New York: Academic Press.