Does the Turing Test demonstrate or not?

The Harvard community has made this article openly available. Please share how this access benefits you. Your story matters

Citation Stuart M. Shieber. Does the Turing Test demonstrate intelligence or not? In Proceedings of the Twenty-First National Conference on (AAAI-06), Boston, MA, 16-20 July 2006.

Published Version http://www.aaai.org/Library/AAAI/2006/aaai06-245.php

Citable link http://nrs.harvard.edu/urn-3:HUL.InstRepos:2252596

Terms of Use This article was downloaded from Harvard University’s DASH repository, and is made available under the terms and conditions applicable to Other Posted Material, as set forth at http:// nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of- use#LAA Stuart M. Shieber. 2006. Does the Turing Test Demonstrate Intelligence or Not?. In Proceedings of the Twenty-First National Conference on Artificial Intelligence (AAAI-06), Boston, MA, 16–20 July. l cnwegdi yppro hc hssmayi based is summary this which Copyright on appear). paper to my (Shieber, in acknowledged own ple Turing’s stimuli”. verbal verbal of of sequence sequence sensible a a a to an “produce as of to responses it adequacy ability characterizes the the of of (1981) test test Block a behavior. heart, verbal its agent’s at is, Test Turing The Test Turing the of Essence The cut. be can Test Turing op- the two of sufficiency the the between of knot views Gordian practi- posing a the from Thus, difference conceivable standpoint. under no cal proof make to statistical as is of slight weakening so one this Crucially, to criterion assumptions. the proof realizability of weak logical weakening of slight very one that a from argue allow we will as I long so intelligent. is conditions agent extra the passing these from that Test logically follow Turing cannot a Tur- it a Therefore, under Test. subject a ing by exhibited properties ex- beyond behavioral intelligence the some for necessary that logically showing are conditions on tra rely intelligence determining for have 2004). (Shieber, that 1950 Test since but debate Turing constant contradictory under the at- mutually been towards that two attitudes appear) these well-founded to reconcile (Shieber, to length audience at tempts presented philosophical have a I that for argument an audience gence attributing for condition sufficient a intelligence. as Test flawed Turing the hopelessly that is holds wisdom questioned, philosophical the continually current been and stopped; has (Turing, Test hasn’t the conversation article of the appropriateness seminal published, Turing’s was moment other 1950) the the On from in- “philosophical says. hand, (1985) an a Dennett like as intelligence, stopper” seems conversation for inspiration humans criterion behavior re- of verbal defining controvertible that because intelligence part a from artificial in indistinguishable arises as of centrality Its history served search. early has the Test throughout Turing The ec wwaa.r) l ihsreserved. rights All (www.aaai.org). gence h ruet gis h ufiinyo h uigTest Turing the of sufficiency the against arguments The intelli- artificial an for summarize I article, short this In ∗ rtflyakoldeb eeec h ag ubro peo- of number large the reference by acknowledge gratefully I

c 06 mrcnAscainfrAtfiilIntelli- Artificial for Association American 2006, osteTrn etDmntaeItliec rNot? or Intelligence Demonstrate Test Turing the Does Introduction can ervae yteTrn Test, Turing the by revealed be [email protected] tatM Shieber M. Stuart 3Ofr tet—245 — Street Oxford 33 abig,M 02138 MA Cambridge, avr University Harvard tns fteTrn eta rtro o nelgnepro- intelligence for this: criterion like appropri- a something the as ceeds Test underlies Turing that the indication of syllogism an ateness The is behavior intelligence. verbal of sensible produce to operational ability and clear responses. machine’s more the of make forced “sensibility” the constitutes to what and serve confederate merely human each. forced choice the with interactions of repeated verbal determining introduction of in at basis The chance the if on than and seeing which better is no person of which do a can goal judge entities, the a choices two with between machine, game a terms imitation in couched an is of test the of presentation original Turing’s view: (Newman this Test the of descriptions et u sw ilse vnti enepeaino Premise of reinterpretation insufficient. this is even see, 2 will we as ought a But premise as feat. the interpreted Thus, be run.) Turing to be reason, tests this multiple (For have would occasion. typewrit- responses” rare) on verbal (astronomically of monkeys sequence on Even sensible a chance. “produce might of ers result the be might premise. this of intelli- denial for a criterion on a based as oth- is Test gence Turing (and the his of and repudiation intelligence”, ers’) of conception Test “Turing Conclusion: 2: Premise 1: Premise a epne oasqec fvra tml,te tis it then stimuli, verbal of sequence intelligent. a to responses se- bal a to responses verbal stimuli. of verbal of sequence quence sensible a duces that idea the on founded is Test Turing the base, at Thus, time every machine” a pre- be will consideration. That must proper “It without machine. saying a are not them really and vent they man sometimes a that with and quite dealing judge times, to of . has . number . jury convincing. each a will reasonably that it suppose is and better pretence it, had We the to put if to questions pass pretend answering only to by has machine man, the a that be is test the of idea The o ntne asn uigTs nasnl occasion single a on Test Turing a passing instance, For the as one second the as such premise a to refers Block intelligent. is it then ∗ fa gn rdcsasnil euneo ver- of sequence sensible a produces agent an If pro- it then Test, Turing a passes agent an If hrfr,i naetpse uigTest, Turing a passes agent an if Therefore, eea capacity general tal. et 92 codwith accord 1952) , o noccasional an not , The Argument Against a Behaviorist Test is based on the fact that the Aunt Bertha machine is expo- The anti-behaviorist argument against the Turing Test as a nentially large, that is, its size is exponential in the length sufficient condition for intelligence was apparently first pro- of the conversation. Objection 8 leads to his “amended posed in sketch form by Shannon & McCarthy (1956, page neo-Turing-Test conception”: “Intelligence is the capacity vi): “A disadvantage of the Turing definition of thinking is to emit sensible sequences of responses to stimuli, so long that it is possible, in principle, to design a machine with a as this is accomplished in a way that averts exponential ex- complete set of arbitrarily chosen responses to all possible plosion of search.” (Emphasis in original.) It is not exactly input stimuli. . . . With a suitable dictionary such a machine clear what “exponential explosion of search” is intended to would surely satisfy Turing’s definition but does not reflect indicate in general. In the case of the Aunt Bertha machine, our usual intuitive concept of thinking.” exponentiality surfaces in the size of the machine, not the In “Psychologism and ”, Block (1981) time complexity of the search. Further, the aspect of the presents the argument in its most fully worked out form. Aunt Bertha machine that conflicts with our intuitions about Imagine (with Block) a hypothetical machine that stores intelligence is its reliance upon memorization. Removing a tree of interactions providing a sensible response for the possibility of exponential storage amounts to a prohibi- 1 each possible interrogator’s input in each possible conver- tion against memorization. Consequently, an appropriate sational context of up to, say, one hour long. (These re- rephrasing of Premise 2 is sponses might be modeled on those that Block’s fictional The compact conception of intelligence: If an agent has Aunt Bertha would have given.) Such a tree would unde- the capacity to produce a sensible sequence of verbal re- niably be large, but processing in it would be conceptually sponses to a sequence of verbal stimuli, whatever they straightforward. By hypothesis, such an “Aunt Bertha ma- may be, and without requiring storage exponential in the chine” would pass a Turing Test of up to one hour, because length of the sequence, then the agent is intelligent. its responses would be indistinguishable from that of Aunt Again, Block notes that this additional condition is psychol- Bertha, whose responses it recorded. Such a machine is ogistic in mentioning a nonbehavioral condition, viz., that clearly not intelligent—Block (1981) says it “has the intelli- the manner of the processing must avert combinatorial ex- gence of a toaster”, and Shannon and McCarthy would pre- plosion of storage. He claims that insofar as the condition is sumably agree—by the same token that the teletype that the psychologistic, a Turing Test cannot test for it. interrogator interacts with in conversation with the human To summarize, Block’s Aunt Bertha argument forces us confederate in a Turing Test is not intelligent; it is merely to pay up on two psychologistic promissory notes. For the the conduit for some other person’s intelligence, the human purely behavioral Turing Test to demonstrate intelligence, confederate. Similarly, the Aunt Bertha machine is merely it must suffice as a demonstration of the antecedent of the the conduit for the intelligence of Aunt Bertha. Yet just as compact conception of intelligence, that is, it must indicate surely, it can pass a Turing Test, and more, has the capacity a general capacity to produce a sensible sequence of verbal to pass arbitrary Turing Tests of up to an hour. The mere responses and it must demonstrate compactness of storage logical possibility of an Aunt Bertha machine is sufficient of the agent. to undermine premise 2, even under a reinterpretation as re- quiring a general capacity. The Interactive Proof Alternative It seems to me that Shannon, McCarthy, and Block are right in principle: Such a machine is conceptually possible; Turning to the capacity issue first, there is certainly no de- hence the Turing Test is not logically sufficient as a condi- ductive move that allows one to go from observation of the tion of intelligence. Let us suppose this view is correct and, passing of one or more Turing Tests to a conclusion of a gen- as Block argues, some further criterion is needed regarding eral capacity; the monkeys and typewriters argument shows the manner in which the machine works. Some further cri- that. This is the Humean problem of induction. But it does terion is needed, but how much of a criterion is that, and can not follow that there is no method of reasoning from the for- the Turing Test test for it? Although Block calls this fur- mer to the latter. I will argue that the powerful notion of an ther internal property “nonbehavioral”, I will argue that the interactive proof, taken from theoretical , mere behavior of passing a Turing Test can reveal the prop- is exactly such a reasoning method. Furthermore, as dis- erty. Borrowing an idea from theoretical computer science, cussed below, Turing Tests bear some of the tell-tale signs I argue that the Turing Test can be viewed as an interac- of interactive proofs that have been investigated in the com- tive proof not only of the fact of sensible verbal behavior, puter science literature. but of a capacity to generate sensible verbal behavior, and Interactive proofs are protocols designed to convince a to do so “in the right way”. Assuming some extraordinarily verifier conventionally denoted V that a prover P has certain knowledge or abilities, which we will think of as being en- weak conditions on physical realizability, any Turing-Test– 2 passing agent must possess a sufficient property to vitiate capsulated in an assertion s. In a classical (deductive) proof Block’s argument. In summary, Block’s arguments are not 1For this reason, adding this extra condition to the conception sufficient to negate the Turing Test as a criterion of intelli- of intelligence is not ad hoc. It amounts to saying, in a precise way, gence, at least under a very slight weakening of the notion that the agent must have the capacity to produce sensible responses of “criterion”. without having memorized them. Block pursues a number of potential objections to his ar- 2For convenience in reference, we will refer to V and P using gument, the most significant of which (his “Objection 8”) gendered pronouns “she” and “he” respectively. system, P would merely reveal a deductive proof of s, which inputs is selected where t > ts (the prover outperforms the V then verifies. This provides V with knowledge of s and sample threshold on the sample), yet tp < tl (the subspace is perhaps other knowledge implicit in the proof. Interactive smaller than the definitional threshold). Using the method proofs augment classical proof systems by adding notions of of Chernoff bounds (see, e.g., Chapter 5 of the text by Mot- randomization and interaction between prover and verifier. wani (1995)), it can be shown that the probability of a false 2 (The interaction implicit in classical proof systems — P’s −ck (tl −ts) positive is Pr[t > ts] < e where c = . Thus, it has presenting V with the proof — is essentially trivial.) Inter- 2(1−tl ) action is added by allowing V and P to engage in rounds of the behavior of an interactive proof: As the number of sam- k message-passing. Randomization is introduced in two ways: ples increases, the probability of a false positive decreases First, the verifier may make use of random bits in construct- exponentially. ing her messages. Second, she may be required to be sat- It is important to realize that the probabilities of error that isfied with a probabilistic notion of proof. When we state we are talking about can be literally astronomically small. that P proves s with an interactive proof, we mean (implic- For the bounds that we have been talking about, if we let k be, say, 300, the false positive probability is on the order of itly) that s has been proved but with a certain determinable 10 residual probability of error. That is, the verifier may need to 1 in 10 ; at that rate, if a population the size of all humanity were tested, one would expect to see no false positives. At be satisfied with some small and quantifiable chance that the 27 protocol indicates that s is true when in fact it is not, or vice k = 1000, the false positive rate of some 1 in 10 is truly versa. The residual error is the reason that moving to a no- astronomically small. tion of interactive proofs is a weakening relative to a view as In summary, a protocol in which we run k Turing Tests a deductive proof. The fact that the residual error can rapidly and receive sensible responses on an appropriate fraction be made vanishingly small through repeated protocols is the provides exponentially strong evidence that the agent sat- reason that the weakening is referred to as “very slight”. isfies the antecedent of the capacity conception, that is, has a general capacity to produce sensible responses to verbal The Turing Test as an Interactive Proof of Capacity stimuli, whatever they may be. One may quibble with the various bounds proposed. (Should tl be 50%? 80%? 99%?) I view the Turing Test as an interactive proof for the Varying them does not change the basic character of the ar- antecedent of the capacity conception of intelligence, that gument. It merely adjusts the number of samples needed is, it is a proof that P “has the capacity to produce a sensible before the knee in the exponential curve. sequence of verbal responses to a sequence of verbal stimuli, Thus, under the notion of proof provided by interactive whatever they may be”. Consider the space of all possible proofs, the Turing Test can provide a proof of a general ca- sequences of verbal stimuli. Let the fraction of this space pacity to produce a sensible sequence of verbal responses to for which P generates sensible responses be t . An agent p a sequence of verbal stimuli, whatever they may be. It can without a general capacity to produce sensible sequences therefore unmask the monkeys on typewriters. of responses would fail to do so on some nontrivial fraction of this space. Block notes that a 100% criterion is neither The Turing Test as an Interactive Proof of necessary nor appropriate. One wants to be able to “ask of a system that fails the test whether the failure really does Compactness indicate that the system lacks the disposition to pass the The interactive proof approach provides leverage for demon- test.” Indeed, people put under similar tests would at least strating compactness as well. When all we know is the occasionally perform in such a way that a judge might deem agent’s performance on a fixed sample of stimuli, the stor- the responses not sensible. So there is some percentage age requirements to generate these responses is linear in the tl, less than 100%, such that if an agent produced sensible length of the stimuli. But the size of any fixed fraction of the sequences of responses on that percentage of the space (that space of possible stimuli is exponential in their length. By is, tp > tl), we can attribute a general capacity, sufficient for being able to reason from the sample to the fraction of the the antecedent of the capacity conception. Let us say, for the space as a whole — as the interactive proof approach allows sake of argument, that tl = 1/2. Thus, if an agent produces — we can conclude that an agent using a memorization strat- sensible responses to 50% of the space of possible verbal egy (as the Aunt Bertha machine) would require exponential stimuli, we will consider it to have a general capacity to storage capacity to achieve this performance. Conversely, produce such responses. Importantly, we are not saying that any agent not possessing exponential storage capacity would the agent must merely produce sensible responses to 50% of fail the interactive proof. some subsample of possible stimuli that we confront it with, Nonetheless, how can a Turing Test reveal that the ma- but with 50% of all possible stimuli, in a counterfactual chine doesn’t have exponential storage capacity? The com- sense, whether we ever test it with these stimuli or not. pact conception would require that the agent pass Turing Suppose we sample k sequences of verbal stimuli uni- tests at least logarithmic in its storage capacity. Thus, with- formly from this space, and test some agent as to whether it out bounding its storage capacity, we can’t bound the length generates sensible sequences of responses to them. Suppose of the Test we would need. There is no purely logical ar- further that the agent does so on t of these stimuli, where gument against this possibility; the Aunt Bertha argument t is greater than a sample threshold ts > tl (say, ts = 3/4). shows this. Some further assumption must be made to pay Can we conclude that the agent has a general capacity as the compactness promissory note. I now turn to how weak defined above? A false positive occurs when a sample of k an assumption is required. Suppose we could bound the information capacity of the agent that can’t pass a singleton Turing Test can do so by universe. Then any physically realizable agent that could simultaneously participating as judge in another. pass Turing Tests whose length exceeded the logarithm of this amount would satisfy the compact conception. We Conclusion would be able to bound the length of the Turing Test re- I have argued that the Turing Test is appropriately viewed quired under the compact conception, at least for any agent not as a deductive or inductive proof but as an interactive that is no larger than the universe. (And of course, no agent proof of the intelligence of a subject-under-test. This view is larger than the universe.) We will call this length bound is evidenced both by the similarity in form between Turing the critical Turing Test length. One might worry that the crit- Tests and interactive proof protocols and by the sharing of ical Turing Test length might be centuries or millennia long. important properties between Turing Tests and interactive Without going into detail (which I have provided else- proofs. where (Shieber, to appear)), the information capacity of the In so doing, I provide a counterargument against Block’s universe can be estimated based on the holographic princi- demonstration that the Turing Test is not a sufficient crite- ple (regarding which see the survey by Bousso (2002) for rion of intelligence. The counterargument requires a (very a review) and estimates of the time since the Big Bang. To- slight) weakening of the conditions required of the Turing 120 gether, these lead to a reasonable upper bound of some 10 Test — weakening the notion of proof (from classical deduc- 200 bits. Rounding up by 80 orders of magnitude, call it 10 . tive proof to interactive proof with its exponentially small Descending now from these ethereal considerations to the residual error probability) and strengthening the notion of concrete goal of analyzing the Turing Test conceptions of possible agent (from one of logical possibility to one with a intelligence, under the compact conception, we would re- trivial realizability requirement). These weakenings are suf- quire an agent with this literally astronomical storage capac- ficiently mild that they can be seen as providing foundation ity to have a capacity to pass Turing Tests of on the order of for the view that the Turing Test is a sound sufficient condi- 200 log2 10 ≈ 670 bits. The entropy of English is about one tion for intelligence. Block is right, yet Dennett may be too. bit per character or five bits per word (Shannon, 1951), so we It merits pointing out that this view of the Turing Test is require a critical Turing Test length of around 670 characters consonant with (though by no means implicit in) Turing’s or 140 words. At a natural speaking rate of some 200 words view of the Test as presented in his writings. His view of per minute, a conversation of less than a minute would there- the Test as being statistical in nature and his pragmatic ori- fore unmask a Turing-Test subject whose performance, like entation toward its efficacy are of a piece with its status as that of the Aunt Bertha machine, is based on memorization. an interactive rather than classical proof. In essence, I have argued for the following recasting of the basic syllogism supporting the sufficiency of the Turing References Test: Block, N. 1981. Psychologism and behaviorism. Philosoph- Premise 1: If an agent passes k rounds of a Turing Test of at ical Review XC(1):5–43. least one minute in length, then (with probability of error Bousso, R. 2002. The holographic principle. Re- exponentially small in k) it has the capacity to produce views of Modern Physics 74:825–874. Available as hep- a sensible sequence of verbal responses to a sequence of th/0203101. verbal stimuli that is logarithmic in the storage capacity of the agent, whatever they may be. Dennett, D. 1985. Can machines think? In Shafto, M., ed., How We Know. San Francisco, CA: Harper & Row. Premise 2: If an agent has the capacity to produce a sensi- 121–145. ble sequence of verbal responses to a sequence of verbal stimuli that is logarithmic in the storage capacity of the Motwani, R. 1995. Randomized Algorithms. Cambridge, agent, whatever they may be, then it is intelligent. England: Cambridge University Press. Newman, M. H. A.; Turing, A. M.; Jefferson, S. G.; and Conclusion: Therefore, if an agent passes k rounds of a Braithwaite, R. B. 1952. Can automatic calculating ma- Turing Test of at least one minute in length, then (with chines be said to think? Radio interview, recorded 10 probability of error exponentially small in k) it is intelli- January 1952 and broadcast 14 and 23 January 1952. Tur- gent. ing Archives reference number B.6. As contributory evidence for this view of the Turing Test Shannon, C. E., and McCarthy, J., eds. 1956. Automata as an interactive proof, I note that the Turing Test shares Studies. Princeton, NJ: Princeton University Press. various other properties with interactive proofs, such as Shannon, C. E. 1951. Prediction and entropy of printed Nontransferability: Turing tests, like other interactive English. Bell Systems Technical Journal 30(1):50–64. proofs, provide proof only to the verifier, and not third parties. This property is familiar to those of us in the Shieber, S. M. 2004. The Turing Test: Verbal Behavior as natural-language-processing field as the “cooked demo”. the Hallmark of Intelligence. MIT Press. Lack of closure under composition: Turing Tests, like Shieber, S. M. To appear. The Turing test as interactive Nousˆ other interactive proofs, may lose their proof character- proof. . istics under composition. For example, Turing Tests are Turing, A. M. 1950. Computing machinery and intelligence. subject to a “man-in-the-middle attack”, by which an LIX(236):433–460.